Datasets

Access each dataset utilised in the text below:

Home Sales in Milwaukee, USA in 2012

The City of Milwaukee maintains a database of home sales that is available to the public, a selection of which is provided here. The dataset is an Excel file consisting of the 1,449 home sales reported in calendar year 2012. Each record, or row of the dataset, consists of the following column variables:

  1. number of stories;
  2. size of house, as measured by the number of finished square feet;
  3. number of bedrooms;
  4. number of bathrooms;
  5. size of lot (in square feet);
  6. date of sale;
  7. sales price, in dollars;
  8. political district (based on alderman district and represented by a number from 1 to 15);
  9. age of house;
  10. whether there is a full basement – 1 = yes; 0 = no;
  11. whether there is an attic – 1 = yes; 0 = no;
  12. whether there is a fireplace – 1 = yes; 0 = no;
  13. whether the house is air-conditioned – 1 = yes; 0 = no;
  14. whether the house has a garage – 1 = yes; 0 = no;
  15. x-coordinate of house location;
  16. y-coordinate of house location.
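
The coding rules in the list above (district numbers from 1 to 15, and 1/0 indicators for basement, attic, fireplace, air conditioning and garage) lend themselves to a quick sanity check once the file has been loaded. The following is a minimal sketch in Python. The field names (`district`, `basement`, and so on) are hypothetical, since the actual column headers in the Excel file are not listed here, and the sample record is made up for illustration.

```python
# Hypothetical field names; the real column headers in the Excel file may differ.
DUMMY_FIELDS = ("basement", "attic", "fireplace", "air_conditioned", "garage")


def check_record(record: dict) -> list:
    """Return a list of coding problems found in one home-sale record."""
    problems = []
    # Political district is coded as a number from 1 to 15.
    if record.get("district") not in range(1, 16):
        problems.append("district must be an integer from 1 to 15")
    # Each yes/no attribute is coded 1 = yes, 0 = no.
    for field in DUMMY_FIELDS:
        if record.get(field) not in (0, 1):
            problems.append(f"{field} must be 0 or 1")
    return problems


# A made-up record, for illustration only (not taken from the dataset).
example = {"district": 7, "basement": 1, "attic": 0,
           "fireplace": 0, "air_conditioned": 1, "garage": 1}
print(check_record(example))  # prints []
```

Returning a list of messages rather than raising on the first problem makes it easy to run the check over every row and summarise what, if anything, is miscoded.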

The complete dataset covers the period from 2002 to present (2018), and includes information on a small number of other variables as well (taxkey number, address, style, type of exterior, number of half and full baths, and neighborhood). It can be downloaded from https://data.milwaukee.gov/dataset/property-sales-data [accessed 27 June 2019]. Please note that the 2012 dataset differs somewhat from the dataset linked above and used in this book.

 

Singapore Census Data for 2010 

This dataset consists of various census variables for the 36 planning areas of Singapore. The variables, which constitute the columns of the dataset, are as follows:

  • name of planning area;
  • total population, male population, female population, population over 65;
  • Chinese population, Malay population, Indian population;
  • population 5 and over, percent 5 and over who speak English;
  • students 5 and over, students 5 and over with a long commute;
  • population 15 and over, population over 15 who are in the labor force;
  • unemployment rate;
  • illiteracy rate among those 15 and over;
  • population in various religious categories: not religious, Buddhist, Tao, Islam, Hindu, Sikh, Catholic, Other Christian;
  • percent of those over 15 with no schooling, percent with university degree;
  • population of working persons with monthly income under 1,000 Singapore dollars, population of working persons with monthly income over 8,000 Singapore dollars;
  • fraction of those in the labor force with a long commute;
  • total households;
  • percentage of householders who rent.
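
Several of the variables above come in pairs: a base population and a subgroup of it (for example, students 5 and over, and students 5 and over with a long commute). A share can be derived from such a pair by simple division. The sketch below is illustrative only; the function name `share` and the counts are hypothetical and are not taken from the dataset.

```python
def share(subgroup: float, base: float) -> float:
    """Return subgroup as a percentage of base."""
    if base <= 0:
        raise ValueError("base population must be positive")
    return 100.0 * subgroup / base


# Made-up counts for one planning area (not real census figures).
students = 20_000
students_long_commute = 3_000
print(share(students_long_commute, students))  # prints 15.0
```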

Many more variables are accessible through https://www.tablebuilder.singstat.gov.sg/publicfacing/mainMenu.action [accessed 27 June 2019].

 

Hypothetical Housing Prices

This file is an SPSS-formatted file consisting of 500 cases (rows) and 16 variables (columns). Each row represents a hypothetical home that has been sold, and the variables are a mixture of regional location, housing attributes, and census attributes for the subregion in which the house is located.

Definition of variables

  1. region: a number between one and six, representing the region in which the house is located;
  2. price: price the house sold for, in £;
  3. garage: a dummy variable that takes the value 1 if a garage is present and 0 if a garage is not present;
  4. bedrooms: number of bedrooms;
  5. bathrooms: number of bathrooms;
  6. datebuilt: year in which the house was built;
  7. floor area: floor area of the house, in square meters;
  8. detached: a dummy variable that takes the value 1 if the house is detached and 0 otherwise;
  9. fireplace: a dummy variable that takes the value 1 if a fireplace is present and 0 if a fireplace is not present;
  10. age<15: percentage of subregional population aged less than 15;
  11. age65+: percentage of subregional population aged 65 and over;
  12. nonwhite: percentage of subregional population that is non-white;
  13. unemploy: percentage of subregional population unemployed;
  14. ownocc: percentage of subregional housing that is owner-occupied;
  15. carsperhh: average number of cars per household in the subregion;
  16. manuf: percentage of subregional population employed in manufacturing.

 

1990 Census Data for Erie County, New York

A 235 × 5 data table was constructed by collecting (from the 1990 US Census) and deriving the following information for the 235 census tracts in Erie County, New York (variable labels are in parentheses):

  1. Median household income (medhsinc).
  2. Percentage of households headed by females (femaleh).
  3. Percentage of high-school graduates who have a professional degree (educ).
  4. Percentage of housing occupied by owner (tenure).
  5. Percentage of residents who moved into their present dwelling before 1959 (lres).

 

Monthly Rain Gauge Accumulations for Seattle

The data consist of monthly rain gauge accumulations (in inches) at fifteen separate locations in Seattle, Washington. Coverage is from November 2002 through May 2017, a period of 175 months. Figure 1.10 shows the rain gauge locations; the dataset contains information for a subset of the locations shown in the figure. The data are provided by Seattle Public Utilities and are part of the Seattle Open Data Program (see https://data.seattle.gov/City-Business/Observed-Monthly-Rain-Gauge-Accumulations-Oct-2002/rdtp-hzy3 [last accessed 27 June 2019]).
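
The 175-month figure can be checked with a short calculation that counts both the first and the last month of coverage. This is a minimal sketch; the helper name `months_inclusive` is hypothetical.

```python
from datetime import date


def months_inclusive(start: date, end: date) -> int:
    """Count calendar months from start to end, including both endpoints."""
    return (end.year - start.year) * 12 + (end.month - start.month) + 1


# November 2002 through May 2017; the day of the month is ignored.
n_months = months_inclusive(date(2002, 11, 1), date(2017, 5, 1))
print(n_months)  # prints 175
```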

 

PM2.5 Particulate Matter Data for Buffalo, New York (2018)

The data are from the US Environmental Protection Agency and consist of daily values of PM2.5 fine particulate matter for five locations in the Buffalo, New York region. Data are freely downloadable at https://www.epa.gov/outdoor-air-quality-data/download-daily-data [last accessed 27 June 2019]. Daily mean levels of PM2.5 are given in units of micrograms per cubic meter.

The file has 1,578 rows and covers the year 2018. The data file has eight columns: (2) Site ID, (3) instrument ID (different instruments at the same site are sometimes used for measurements), (4) daily mean level of PM2.5, (5) air quality index (the index ranges from good (0-50) to moderate (51-100) to more severe categories, with a maximum value of 500; the highest value in the dataset is 86), (6) a code (88101 or 88502) referring to different types of data collection, (7) latitude of site, and (8) longitude of site.
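
The air quality index bands mentioned above can be expressed as a simple lookup. The following is a minimal sketch that assumes only the two bands named in the text (good, 0-50; moderate, 51-100) and lumps everything from 101 to 500 under a single "more severe" label; the function name `aqi_category` is hypothetical.

```python
def aqi_category(aqi: int) -> str:
    """Map an air quality index value to the bands named in the text."""
    if not 0 <= aqi <= 500:
        raise ValueError("AQI must lie between 0 and 500")
    if aqi <= 50:
        return "good"
    if aqi <= 100:
        return "moderate"
    # The finer categories above 100 are not spelled out in the text.
    return "more severe"


# The highest index value reported for this dataset is 86.
print(aqi_category(86))  # prints moderate
```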