Answers to Exercises and Questions for Discussion
To what extent can treating codes allocated to ordered category measures as if they are numeric values be justified in the data analysis process?
Ordered category variables define the relationships between values in terms of categories that not only are exhaustive and mutually exclusive, but are also arranged in relationships of greater than or less than, although there is no metric that will indicate by how much. For data entry purposes, these categories will usually be coded such that the numbers allocated preserve the order of the categories with the highest code allocated to the highest or most positive category. If, for example, there are five categories, these will normally be coded 5 down to 1. If, and this is a big ‘if’, the categories can be considered as more or less equally spaced, then researchers will often treat the codes as numeric values and, accordingly, the variable as metric. Researchers may then calculate (or, rather, get SPSS to calculate, see Chapter 3) an arithmetic mean by totalling the values and dividing by the number of cases used in that calculation. This may be done separately for different groups so that means may be compared. Whether or not this can be justified really depends on the legitimacy of the assumptions made about the ‘distances’ between the categories. Likert categories from ‘strongly agree’ to ‘strongly disagree’ are generally accepted as legitimate for this purpose. The categories used for degrees of satisfaction or dissatisfaction are often more problematic and may be best treated as ordered category variables. Where the categories are to be a component of a summated rating scale, then the totalling is done across the items (using the Compute procedure on SPSS, see Box 2.4). The totals may then be averaged across cases. The legitimacy of this data analysis process tends to be generally accepted on the basis that treating ordered categories as if they are metric results in relatively little error. 
Researchers should – and sometimes do – check the reliability and validity of their summated rating scales by, for example, taking repeat measures, checking for internal consistency by getting SPSS to calculate Cronbach’s coefficient alpha (see Box 1.1) or reviewing the content and satisfying themselves that the items included in the scale adequately sample the domain of features that should be included.
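For readers who want to see the arithmetic outside SPSS, the following is a minimal Python sketch of a summated rating scale and Cronbach’s coefficient alpha. The responses are invented for illustration, the function names summated_scores and cronbach_alpha are hypothetical, and the alpha formula used is the standard variance-based one, which is assumed rather than taken from Box 1.1.

```python
from statistics import mean, variance

# Hypothetical responses: each row is one case, each column one Likert item,
# coded 5 (strongly agree) down to 1 (strongly disagree).
responses = [
    [5, 4, 5, 4],
    [3, 3, 2, 3],
    [4, 4, 4, 5],
    [2, 1, 2, 2],
    [5, 5, 4, 4],
]


def summated_scores(rows):
    """Total the item codes across the items for each case."""
    return [sum(row) for row in rows]


def cronbach_alpha(rows):
    """Alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = len(rows[0])
    # zip(*rows) turns the case-by-item rows into one tuple per item.
    item_variances = [variance(item) for item in zip(*rows)]
    total_variance = variance(summated_scores(rows))
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


totals = summated_scores(responses)
print("Total score per case:", totals)
print("Mean total across cases:", mean(totals))
print("Cronbach's alpha:", round(cronbach_alpha(responses), 3))
```

The mean of the totals is only meaningful to the extent that the equal-spacing assumption discussed above holds. The code performs the arithmetic but cannot check that assumption.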
When transforming variables, researchers make many decisions for which there are no ‘rules’ or even rough guidelines. What impact might these decisions have on the validity of the data?
There are many ways in which variables might be transformed before analysis begins, or even after it begins. Examples include regrouping values on a nominal or ordered category measure to create fewer categories, creating class intervals from metric measures, computing totals or other scores from combinations of several values of variables, treating groups of variables as a single multiple response question, upgrading or downgrading measures, handling missing values and ‘Don’t know’ responses, coding open-ended questions, and creating crisp or fuzzy set memberships from nominal, ordered category, ranked or metric measures.
Data transformation is an important part of the data analysis process. There are no ‘right’ or ‘wrong’ ways of engaging in data transformation and there are usually several different ways in which it can be done. Perhaps the best strategy is what is sometimes called ‘sensitivity analysis’ whereby transformations may be tried in different ways to see how sensitive the results are to such processes. There is still the difficult question, however, of the degree of sensitivity that is considered to undermine the validity of the data.
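As a rough illustration of what such a sensitivity analysis might look like, the Python sketch below regroups a five-point ordered category variable into two categories using two different cut-off points and compares the resulting difference between two groups. The data, the group labels and the function name share_high are all hypothetical.

```python
# Hypothetical data: (group, code) pairs, with codes running from
# 5 (strongly agree) down to 1 (strongly disagree).
cases = [
    ("A", 5), ("A", 4), ("A", 3), ("A", 3), ("A", 2),
    ("B", 3), ("B", 3), ("B", 3), ("B", 2), ("B", 1),
]


def share_high(rows, group, cutoff):
    """Proportion of a group whose code is at or above the cut-off."""
    codes = [code for g, code in rows if g == group]
    return sum(code >= cutoff for code in codes) / len(codes)


# Try two alternative regroupings and compare the group difference each gives.
for cutoff in (3, 4):
    difference = share_high(cases, "A", cutoff) - share_high(cases, "B", cutoff)
    print(f"Cut-off at {cutoff}: difference between groups = {difference:.2f}")
```

With these invented figures the difference between the groups doubles when the cut-off moves from 3 to 4, which is the kind of sensitivity a researcher would want to report. How large a change should count as a threat to validity remains a matter of judgement.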
What are the key circumstances in which missing values might be a severe problem for the data analyst?
In any survey, not all respondents will, for a variety of reasons, answer all the questions. The result is that some values will always be missing from some of the cells in the data matrix. Where this is a result of questionnaire design whereby not all the questions are relevant to all the respondents, then this is not so much a problem unless it leaves too few cases to analyse.
Where a question would be appropriate to a given respondent, but an answer is not recorded, then such ‘item non-response’ may be a more serious issue. Most researchers are inclined just to accept that there will be item non-response for some of the variables and will simply exclude the cases with missing values from the analysis. This is fine when the number of cases entered into the data matrix is large or at least sufficient for the kinds of analyses that are required. However, there is always the danger that this approach may reduce the number of cases used in a particular analysis to such an extent that meaningful analysis is not possible. Many techniques have been suggested in the literature for dealing with this situation, most of which involve filling the gaps caused by missing values by finding a replacement value.
Most of the techniques assume, however, that item non-response occurs at random, but it is quite possible that certain types or categories of people are not responding. Furthermore, when the amount of item non-response is small – less than about 5 per cent – then applying any of the methods is unlikely to make any significant difference to the interpretation of the data. Ideally, of course, researchers should, in reporting their findings, communicate the nature and amount of item non-response in the dataset and describe the procedures used to remedy or cope with it.
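Reporting the amount of item non-response can be sketched in a few lines of Python. The data matrix below is invented, the variable names q1 to q3 and the function names are hypothetical, and None is assumed to mark a missing value. The 5 per cent figure is the rough threshold mentioned above, not a fixed rule.

```python
# Hypothetical data matrix: one dictionary per case, None marks a missing value.
data = [
    {"q1": 4, "q2": 3, "q3": 5},
    {"q1": 2, "q2": None, "q3": 4},
    {"q1": 5, "q2": 4, "q3": None},
    {"q1": 3, "q2": None, "q3": 2},
]

THRESHOLD = 0.05  # the rough 5 per cent guide mentioned in the text


def missing_rate(rows, variable):
    """Proportion of cases with no recorded value for the variable."""
    return sum(row[variable] is None for row in rows) / len(rows)


def complete_cases(rows):
    """Cases that remain if any case with a missing value is excluded."""
    return [row for row in rows if all(v is not None for v in row.values())]


for name in data[0]:
    rate = missing_rate(data, name)
    note = "above threshold" if rate > THRESHOLD else "within threshold"
    print(f"{name}: {rate:.0%} missing ({note})")

print("Cases left after excluding incomplete ones:", len(complete_cases(data)))
```

The count of complete cases shows how quickly exclusion can shrink the number of cases available for analysis. Nothing in this sketch tests whether the non-response is random.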
Open IBM SPSS on whatever system you are using and enter the nine key variables for the first 12 cases for the alcohol marketing dataset that are illustrated in the next chapter in Figure 3.1. The procedures for doing so are explained in Box 2.1.
This is just an exercise in entering data into SPSS. It is best to begin by naming your variables and entering labels as appropriate. Select Variable View and follow the instructions in Box 2.1. Notice that there are several missing values for Initiation. This is because those who say they have never had a proper alcoholic drink will not have an age at which they first had such a drink. They should have been coded as 0 under Drinkstatus.
Figure 2.10 shows the total scores for the importance of well-known brands in choosing products. Try creating class intervals in various different ways using SPSS. The procedures for doing so are explained in Boxes 2.2 and 2.3.
First, you need to decide how many intervals you want. To create two intervals, for example, you can see from the Cumulative Percent column in Figure 2.10 that nearly half had total scores of up to 26 and the rest 27 or more. To create more intervals it is usually preferable to make them as equal in size as possible, for example 0–9, 10–19, 20–29, 30–39, 40–45. These are not exactly equal, but you could have intervals of 9 rather than 10.
Access the full alcohol marketing dataset (available at https://study.sagepub.com/kent). Total importance of brands has already been created under Totbrand. Select Transform then Recode into Different Variables. Scroll down to Total importance of brands and move across to the Input Variable box. Now follow the instructions in Boxes 2.2 and 2.3.
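The logic of the recode can also be sketched in Python, which may help in checking the interval boundaries before entering them in SPSS. The intervals are those suggested above. The scores, the interval codes 1 to 5 and the function names recode and split_at_26 are hypothetical and are not part of the dataset.

```python
from collections import Counter

# Class intervals suggested in the text: 0-9, 10-19, 20-29, 30-39, 40-45.
INTERVALS = [(0, 9), (10, 19), (20, 29), (30, 39), (40, 45)]


def recode(score):
    """Return the code (1 to 5) of the class interval containing the score."""
    for code, (low, high) in enumerate(INTERVALS, start=1):
        if low <= score <= high:
            return code
    raise ValueError(f"score {score} is outside the range 0 to 45")


def split_at_26(score):
    """Two-interval version: 1 for scores up to 26, 2 for 27 or more."""
    return 1 if score <= 26 else 2


scores = [7, 12, 26, 27, 33, 45]  # hypothetical total scores
codes = [recode(s) for s in scores]
print("Interval codes:", codes)
print("Cases per interval:", Counter(codes))
print("Two-way split:", [split_at_26(s) for s in scores])
```

Raising an error for out-of-range scores is a design choice made here so that data entry mistakes are not silently recoded.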
Go to the website www.surveyresearch.weebly.com. Here you will find lots of interesting information about social surveys created by John Hall, previously Senior Research Fellow at the UK Social Science Research Council (1970–6) and Principal Lecturer in Sociology and Unit Director at the Survey Research Unit, Polytechnic of North London (1976–92).
Download the Trinians dataset. To do this, select Survey Unit, Social Science Research Council, then Surveys by SSRC Survey Unit and then the ‘Trinians’ survey. Read the background to the survey, download the article in Folio and the questionnaire. Finally download and save the dataset from trinians.sav. Not all the questions in the questionnaire appear as variables and they are not all in the same order as in the questionnaire, but the question numbers are clearly marked. Check the values being used in the Values column. Under Measure, they are all indicated as Scale. This is the default if researchers do not change it. Go down the variables and change to Ordinal or Nominal as appropriate (left click on Scale and the other two options will appear).
The variables that should be changed to Nominal are:
The items in Q14
The variables that should be changed to Ordinal are:
The items in Q11
Approval of political protest methods
The rest are Scale (discrete or continuous metric in the terminology used in this text). Month of birth is not entirely equal interval, but can still be measured in terms of number of days. Q33 is a semantic differential and has been treated as metric.