Answers to Exercise

10.1 Visualizing grouped data

The first interactive graph shows the relationship between global temperature change and CO2 emissions over time. It is a smoothed scatter plot covering three periods: 1880-1929, 1930-1979 and 1980-2010. We can see that the relationship between CO2 emissions and global temperature changes over time. In the first period, when emissions were relatively low, there is a slight negative association between the two variables, with quite a bit of unexplained variation around the regression line. In the second period, the relationship turned marginally positive, though again with a lot of unexplained variation. Lastly, in the most recent period the association has become visibly positive and sizeable, with much less unexplained variation. This coincided with a substantial increase in CO2 emissions.

The second graph shows trends over time in the incidence of crime across the boroughs of London. The thick black line is the average trend when all boroughs are taken into account. The graph shows that although the overall trend line is fairly flat, suggesting that the number of registered crimes is relatively stable, there are considerable differences between the boroughs, with some registering considerably more crime than others. Furthermore, in some boroughs we can detect a slight positive trend over time. The graph thus allows us to see important variation in crime levels, but it makes it almost impossible to assess what is going on in individual boroughs.

Neither graph can tell us why the observed patterns and trends occur or whether they are representative of the population. However important, visualization techniques are only the first step in data analysis.

10.2 Research hypotheses

The exercise asks students to look at two graphs introduced earlier in the chapter and formulate research hypotheses as well as null and alternative hypotheses.

1.   Global temperature change – CO2

A brief look at the scatter plot below suggests a strong positive relationship between CO2 emissions and global temperature change. With this in mind, our research hypothesis can be formulated as follows: increases in carbon dioxide emissions cause an accelerating rise in global temperature. The specific wording can vary, but two points should feature prominently. First, the assumption of causality between the two variables, which is one of the fundamental differences between a research hypothesis and a statistical hypothesis. Second, the notion of an accelerating rise, which is important because it reflects that we are looking at the temperature change rather than its absolute value.

The null hypothesis then will be as follows: there is no association between CO2 emissions and the global temperature change.

The alternative hypothesis: the true association between CO2 emissions and the global temperature change is not equal to 0. We can be more specific and hypothesize that the true association is above 0, which would imply a one-tailed rather than a two-tailed test. However, we do not go into such detail in the book.
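
The difference between the two-tailed and the one-tailed alternative can be illustrated with a minimal sketch. It is written in Python purely for illustration (the book's outputs come from R), it assumes the test statistic is approximately standard normal, and the value of z is hypothetical rather than taken from the temperature data.

```python
import math

def normal_sf(z):
    """Upper-tail probability P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def p_values(z):
    """Return (two_tailed, one_tailed_upper) p-values for a z statistic."""
    two_tailed = 2 * normal_sf(abs(z))  # alternative: association != 0
    one_tailed = normal_sf(z)           # alternative: association > 0
    return two_tailed, one_tailed

# Hypothetical test statistic, NOT computed from the book's data.
z = 1.8
two_tailed, one_tailed = p_values(z)
print(f"two-tailed p = {two_tailed:.3f}, one-tailed p = {one_tailed:.3f}")
```

For a positive test statistic the one-tailed p-value is half the two-tailed one, which is why the direction of the alternative hypothesis should be chosen before looking at the data.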

2.   London crime data

The violin plot below shows that crime incidence varies across the boroughs of London. We can thus say that our main research hypothesis will test the assumption that crime incidence is unequally distributed among local authorities.

The null hypothesis: the average number of registered crimes is equal for all local authorities.

The alternative hypothesis: the average number of crimes is not the same for all local authorities (i.e. at least one authority is different from the rest of the data).

It is worth noting that while statistical hypothesis testing in this case is about comparing the means, the research hypothesis is more complicated than that. Underlying the research hypothesis are causal mechanisms that explain the variation of crime incidence (poverty level within local authorities, police density, etc.). We should emphasize that such assumptions fall outside the remit of the null and alternative hypotheses.
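
A comparison of means of this kind is commonly carried out with a one-way analysis of variance. The sketch below assumes that approach (the book may use a different test), is written in Python for illustration only, and uses made-up crime counts for three hypothetical boroughs; none of the numbers come from the London data.

```python
from statistics import mean

def one_way_f(groups):
    """Return the one-way ANOVA F statistic and its degrees of freedom."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    group_means = [mean(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical monthly crime counts for three made-up boroughs.
boroughs = {
    "A": [210, 225, 198, 240],
    "B": [310, 295, 330, 305],
    "C": [205, 215, 220, 200],
}
f_stat, df1, df2 = one_way_f(list(boroughs.values()))
print(f"F({df1}, {df2}) = {f_stat:.2f}")
```

A large F statistic means the group means differ by more than the within-group variation would suggest, which counts as evidence against the null hypothesis of equal means. It says nothing about why the boroughs differ.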

10.3 Reading regression outputs

The idea behind this exercise is to identify the most important parameters in a regression output and to be able to describe them in plain language. The first output is a standard logistic regression summary produced in R. The key parts of the output to look at are:

  • Estimate: shows the raw effect of each independent variable on the dependent variable on the log-odds (logit) scale. Coefficients on this scale are difficult to interpret directly. All we can see is that the effect of pay appears to be negative; that is, the higher the relative pay (relative to the median salary), the less likely the person is to support Brexit. Conversely, age is positively related to the likelihood of supporting Brexit.
  • Standard error and z value: the z value is the estimate divided by its standard error, and the p-value is derived from it. We can skip both and look at the p-values.
  • Both independent variables are statistically significant at p<0.001. This means that if the null hypothesis were true (true effects = 0), the probability of observing relationships at least as strong as those in our data purely by chance would be close to 0. Thus, we can reject the null hypothesis and conclude that pay and age are both associated with the probability of supporting Brexit.
  • Part of the problem with the standard regression output is that it is virtually impossible to present the relationship between the variables in an easily digestible format. That’s why researchers exponentiate logit coefficients to produce odds ratios.
  • The second output contains odds ratios for each predictor in the model alongside 95 per cent confidence intervals.
  • The odds ratio for pay is below 1, which indicates a negative effect (the probability of supporting Brexit falls as pay increases). A one-unit increase in relative pay (say, from the median salary to twice the median salary) shifts the odds only slightly in favour of Remain.
  • Conversely, if age goes up by one year, the odds of supporting Brexit increase, but only slightly (they are multiplied by 1.02 on average, an increase of about 2 per cent).
  • The 95 per cent confidence intervals are very important. They tell us that the effect of age is estimated precisely: the lower and upper bounds differ only slightly, at the third decimal place, which our output does not show. The odds ratio for pay is likely to lie between 0.92 and 0.95, which is again a sign of a small but stable effect.
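
The step from logit coefficients to odds ratios can be sketched as follows. This is an illustration in Python rather than the R code behind the outputs. The age coefficient is set to log(1.02) to match the odds ratio quoted above, and the standard error is hypothetical. The interval is a simple Wald approximation, so it may differ slightly from the one R reports.

```python
import math

def odds_ratio_summary(estimate, std_error, z_crit=1.96):
    """Turn a logit coefficient and its standard error into an odds ratio
    with an approximate 95 per cent (Wald) confidence interval."""
    return {
        "z_value": estimate / std_error,
        "odds_ratio": math.exp(estimate),
        "ci_lower": math.exp(estimate - z_crit * std_error),
        "ci_upper": math.exp(estimate + z_crit * std_error),
    }

# Coefficient chosen so that exp(estimate) = 1.02, as in the text;
# the standard error is a hypothetical value, not from the book's output.
age = odds_ratio_summary(estimate=math.log(1.02), std_error=0.001)

# Percentage change in the odds for a one-year increase in age.
pct_change = (age["odds_ratio"] - 1) * 100

print(f"odds ratio = {age['odds_ratio']:.3f}")
print(f"95% CI = [{age['ci_lower']:.3f}, {age['ci_upper']:.3f}]")
print(f"change in odds per year = {pct_change:.1f}%")
```

Because the interval is built on the logit scale and then exponentiated, it is not exactly symmetric around the odds ratio. An interval that excludes 1 corresponds to a coefficient that differs significantly from 0.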