An overview of data analysis packages

Statistical and survey analysis packages

Computer programs relevant to the analysis of non-experimental datasets fall into two main groups: statistical packages and survey analysis packages. There is a bewildering variety of both. The former handle numerical datasets by applying variable-based and, less frequently, case-based analyses. Wikipedia, for example, lists over 50 different commercial statistical packages. Many of these are general statistical packages, while some are more specialized, focusing for example on data mining, visual analytics, statistical modelling, econometric analysis or time series. In addition, there are over 30 open-source packages and a further 10 that are either public domain or freeware. Some of the packages, like IBM SPSS Statistics, use a graphical user interface (GUI) that allows users to interact with the program through graphical icons, pull-down menus, dialog boxes, and so on. Others, such as SAS, offered by the SAS Institute, have a command line interface (CLI) that requires commands to be typed in at the keyboard. These tend to be more flexible, but have a much steeper learning curve; however, some of them have an optional GUI front-end available. Others again, like Stata, genuinely integrate both a GUI and a CLI. R, developed at the University of Auckland, New Zealand, is a free, open-source programming language that is constantly being developed by its users. It compares well with SPSS, Stata and SAS, but requires the user to have some facility with programming, although there are many add-on GUIs available. Both Stata and R have downloadable or purchasable add-ons for conducting fuzzy set analysis, which is explained in Chapter 7.

Survey analysis packages give you the tools you need to create a survey, design a questionnaire, organize the results and prepare a detailed report. However, the statistical techniques that are available tend to be more limited than for the statistical packages. The more popular ones include Snap Surveys, The Survey System, KeyPoint, Fluid Surveys, StatPac, SurveyGold, SurveyMonkey, Survey Crafter Professional, SurveyGizmo and SurveyPro. These packages are reviewed at http://survey-software-review.toptenreviews.com/. Some, like SurveyMonkey, are designed specifically for online use, while others are geared more to paper questionnaires, but there are usually optional extras to facilitate online use.

An introduction to SPSS

In 1968 three young men from disparate professional backgrounds developed a software system based on the idea of using statistics to turn raw data into information essential to decision-making. These three innovators were pioneers in their field, visionaries who recognized that data, and how you analyze them, are the driving force behind sound decision-making. This revolutionary statistical software system was called SPSS, which stood for Statistical Package for the Social Sciences. The software is now one of the most widely used survey analysis computer programs. It has gone through many versions; the latest, version 22.0, is called IBM SPSS Statistics, or just SPSS for short. This text uses version 19.0, but if you have a later or an earlier version, the guidelines here should still work perfectly well. For more information and a free download of a demo version visit: www.spss.com/spss. These guidelines are by no means a full introduction to SPSS; they focus just on the essential procedures you may need for a basic analysis of a dataset. For more detail, refer to Andy Field (2013) Discovering Statistics Using IBM SPSS Statistics, 4th edn. London: Sage.

You can almost certainly obtain access to SPSS by logging on to your university or college network applications. When SPSS starts, you first need to tell it what you want to do – open an existing data source, type in new data, and so on. If you are entering data for the first time, check the Type in data radio button. You will then be taken to the Data Editor window. The Data Editor offers a data matrix whose rows represent cases (no row should contain data on more than one case) while the columns list the variables. The cells created by the intersection of rows and columns will contain the values of the variables for each case. No cell can contain more than one value.

Data entry

Before entering any data, it is advisable first to name the variables (if you do not, you will be supplied with exciting names like var00001 and var00002). These names must begin with a letter and must not end with a full stop/period. There must be no spaces and the names chosen should not be one of the key words that SPSS uses as special computing terms, for example and, not, eq, by, all.

To enter variable names, click on the Variable View tab at the bottom left of the Data Editor window. Each variable now occupies a row rather than a column as in the Data Editor window. Enter the name of your first variable in the top left box. As soon as you hit Enter or the down arrow or right arrow, the remaining boxes will be filled with default settings, except for Label. It is usually advisable to enter labels, since these will be printed out in your tables and graphs (SPSS will otherwise use the variable names as labels). Labels can be the actual wording of the questions asked or a further explanation of the variable name.

For categorical variables, it is advisable also to enter Value Labels (otherwise SPSS will use the numeric codes themselves as labels for each category). Click on the appropriate cell under Values and click again on the little blue box to the right of the cell. This will produce the Value Labels dialog box. Enter an appropriate code value (e.g. 1) and corresponding label (e.g. Yes) and click on Add. Repeat for each value. Note that, in SPSS, the allocated codes are called ‘values’, while the values in words are ‘labels’.
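
If you prefer typing commands to clicking through dialog boxes (most SPSS dialog boxes have a Paste button that writes the equivalent command syntax to a syntax window, which you can then run), labels can be assigned directly. A minimal sketch, assuming a hypothetical variable named smoke:

    * Attach a variable label and value labels to a variable.
    VARIABLE LABELS smoke 'Whether respondent has ever smoked'.
    VALUE LABELS smoke 1 'Yes' 2 'No'.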

The default under Decimals is usually two decimal places. If all the variables are integers, then it is worthwhile changing this to no decimal places. Simply click on the cell and use the little down arrow to reduce to 0. Under Measure, you can put in the correct type of measure – Nominal, Ordinal or Scale. Note that Nominal includes binary measures, Ordinal does not distinguish between ordered category and ranked measures, and Scale refers to what have been called metric measures in this text. The default setting is Scale. Changing Measure to Nominal or Ordinal as appropriate creates a useful icon against each listed variable, making them easy to spot; it makes a difference to some operations in SPSS and forces you to think about what kind of measure is attained by each variable.

To copy any variable information, such as value labels, to another variable, just use Edit|Copy and Edit|Paste. SPSS does not have an automatic timed backup facility, so you need to save your work regularly as you go along. Use the File|Save sequence as usual for Windows applications. The first time you save, you will be given the Save As dialog box; make sure this indicates the drive you want. File|Exit will get you out of SPSS and back to the Windows desktop. SPSS will ask you if you want to save before exiting if unsaved changes have been made. Always save any changes to your data; saving output is less important because it can quickly be recreated.

For more detail on SPSS in general and the data editor in particular, see Field (2013: Chapter 3).

Transforming data

To regroup categories on a categorical variable, you need to use the Recode procedure. From the menu bar, select Transform|Recode Into Different Variables. From the list of variables, select the variable you wish to regroup and transfer it to the Input Variable -> Output Variable box. Now click on Old and New Values. If, for example, you wish SPSS to add together the frequencies for categories that have been coded as 1 and 2, in the Old Value area on the left, click on the first Range radio button and enter 1 and 2 in the boxes either side of through. In the New Value area on the right, enter the code you wish the new combined category to take and click on Add. This instruction will now be entered into the Old -> New box. Repeat for any other categories you wish to combine. Click on Continue. Give the new output variable a name in the Name box, and click on Change then OK. The new variable will appear as the last column in the data matrix. To add value labels for the categories of the new variable, change to the Variable View and proceed as above.

To create class intervals you need the Recode procedure again. In Old and New Values enter the ranges you want (e.g. 0–9 and 10–19), giving these the new codes of 1 and 2 for a two-value solution.
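
In command syntax, both kinds of regrouping use the same RECODE command. A sketch along the following lines should work, assuming hypothetical variables brandimp (categorical) and score (metric):

    * Combine the categories coded 1 and 2 into a single new code 1.
    RECODE brandimp (1 thru 2=1) (3=2) INTO brandimp2.
    * Group a metric variable into class intervals.
    RECODE score (0 thru 9=1) (10 thru 19=2) INTO scoregrp.
    EXECUTE.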

To get SPSS to compute totals from two or more metric variables, select Transform|Compute Variable. You will obtain the Compute Variable dialog box. Notice that there are many functions that you could perform on the variables. If all you want is for SPSS to add together the numeric values of the variables, highlight the first variable and put it into the Numeric Expression box by clicking on the arrow. Now click on the + button and bring over the next variable, then click on + again, and so on until you have entered all the variables you wish to add together. Enter a variable name in the Target Variable box and click on OK. A new variable will appear in your data matrix, giving the total score for each case. You can now, if you wish, use Recode to group the totals into, say, high-, medium- and low-score categories.
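
The pasted syntax for this operation is a simple COMPUTE statement. A sketch, assuming three hypothetical item variables:

    * Total score across three items for each case.
    COMPUTE totscore = item1 + item2 + item3.
    EXECUTE.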

For a multiple response question, where respondents can select more than one category, each response needs to be treated as a separate variable in which each item is either ticked or not ticked. SPSS then needs to be told to treat these as a single multiple response question. Select Analyze|Multiple Response|Define sets. Bring the variables across to the Variables in Set box. If a code of 1 was entered for those who had ticked the item, enter 1 in the Counted Value box. Make sure the Dichotomies radio button is clicked under Variables Are Coded As. You will also need to give the new variable a name. Click on the Add button to add the name to the Multiple Response Sets box, then on Close. The new variable, however, does not appear in the data matrix. To access it, click on Analyze|Multiple Response and either Frequencies or Crosstabs depending on whether you want univariate or bivariate analysis.
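
In syntax, defining and tabulating a multiple response set can be done in one command. A sketch, assuming four hypothetical tick-box variables coded 1 when ticked:

    MULT RESPONSE GROUPS=$media 'Media used' (tv radio press web (1))
      /FREQUENCIES=$media.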

SPSS makes a distinction between two kinds of missing value: system missing values and user-defined missing values. The former result when the person entering the data has no value to enter for a particular variable (for whatever reason) for a particular case. In this situation the data analyst will just skip the cell and SPSS will enter a full stop in that cell to indicate that no value has been recorded. For most non-graphical outputs, SPSS will list in a separate Case Processing Summary the number of valid and the number of missing cases. In some tables, valid and missing cases are shown in the printed output table itself. Percentages are then calculated both for the total number of cases entered into the data matrix and for the total of non-missing cases for that variable – what SPSS calls the Valid Percent.

User-defined missing values are ones that have been entered into the data matrix, but the researcher decides to exclude them from the analysis. To create them for any particular variable, from the Variable View select the little blue box in the Missing column against the variable you want and obtain the Missing Values dialog box. This enables you either to pick out particular codes to be treated as missing values by clicking on the Discrete missing values radio button and entering up to three codes, or to select a range of missing values.
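
The syntax equivalent is the MISSING VALUES command. A sketch, assuming a hypothetical variable attitude on which the codes 8 and 9 are to be excluded:

    MISSING VALUES attitude (8, 9).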

Producing frequency tables, charts and histograms 

To obtain univariate frequency tables for categorical variables, you will need the Frequencies procedure. This is in the Analyze|Descriptive Statistics drop-down menu from the menu bar at the top. In the Frequencies dialog box, all variables are listed in the left box. To obtain a frequency count for any variable, simply transfer it to the Variable(s) box by highlighting it, then clicking on the direction button in the middle. To highlight a block of adjacent variables, or all the variables, click on the first variable, hold down the shift key, then click on the last variable in the block. To change the order of the categories presented in a table, from the Frequencies box select Format, then select the Descending values radio button, then click on Continue and OK.
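
The corresponding syntax is the FREQUENCIES command; the FORMAT subcommand reproduces the Descending values option. A sketch with hypothetical variable names:

    FREQUENCIES VARIABLES=smoke alcohol
      /FORMAT=DVALUE.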

Where there are no missing cases the Percent and Valid Percent are the same. You can edit the table to remove these columns if you wish. Just double-click on the table. This puts you into editing mode and gives you the Formatting Toolbar. Highlight the figures in the column (drag the pointer down the figures) and hit Delete. The column will disappear and the table will close up. You can do the same with the Cumulative Percent column. To get out of Edit mode just left click outside the table area. With the table highlighted (single click on the table – there will be a frame around it and a red arrow to the left) you can select Edit and Copy and then Paste it into any other application like Word or PowerPoint. Use the Paste Special procedures in PowerPoint or Word, otherwise these programs will try to edit the tables.

The Frequencies procedure will produce a separate table for each variable entered into the Variables box. To produce a multi-variable table, select Analyze|Tables|Custom Tables. Drag your first variable into the Rows box, then drag the next and subsequent variables to the foot of the lowest table shown. Alternatively, if the variables to be tabled are listed next to one another, highlight them all (by holding down the shift key) before dragging across. If the response categories for a number of variables are all the same and you want a table that sets out the responses as a matrix, create a multi-variable table as above, then under Category Position, select Row Labels in Columns.

To obtain bar charts and pie charts, click on Charts in the Frequencies dialog box. Simply click on Bar Chart and indicate whether you want the axis label to display frequencies or percentages, click on Continue and then OK. This will give you a basic default bar chart in addition to the frequencies table. To obtain other kinds of bar chart, like ‘stacked’ or ‘clustered’ bar charts or three-dimensional charts, you will need to select the Graphs drop-down menu. You can then choose between the Chart Builder, which gives you a kind of chart wizard, and the dialog boxes from previous versions of SPSS, now called Legacy Dialogs, which take you to the Bar Charts dialog box. If you choose Chart Builder, drag the kind of chart you want into the Chart preview box, then drag the variable you want across to the X-Axis. The Y-Axis will change to Count. Click on OK.
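
If you prefer syntax, the legacy charts can be produced with the GRAPH command. A sketch, again with a hypothetical variable:

    * Simple bar chart of counts.
    GRAPH /BAR(SIMPLE)=COUNT BY smoke.
    * The same chart with percentages on the axis.
    GRAPH /BAR(SIMPLE)=PCT BY smoke.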

Once you have obtained your chart you can edit it by double-clicking in the chart area. This will give you the Chart Editor. You can change the colours and a number of other chart features from the editor. Close it when you have finished. If you single-click on the chart area, you highlight it and it can be copied into other applications.

The Chart Builder can be used to produce both histograms and line graphs for metric variables. You can impose a normal curve on the histogram by clicking in the Display normal curve box in the Element Properties dialog box and clicking on Apply. The normal curve is explained in Chapter 4.
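
In legacy syntax, a histogram with a superimposed normal curve might look like this, assuming a hypothetical metric variable age:

    GRAPH /HISTOGRAM(NORMAL)=age.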

Data summaries

Data summaries for categorical variables can be obtained by using the Frequencies procedure to get percentages of each category and to obtain the modal category. However, note that for ordered category variables, SPSS will give you the median value, not the median category. It does this by treating the codes allocated to each category as metric values – which may be inappropriate.

There are two ways of obtaining univariate data summaries for metric variables in SPSS. One is to use the Statistics button in the Frequencies dialog box. Select Analyze|Descriptive Statistics|Frequencies. Put one or more metric variables in the Variable(s) box and then click on the Statistics button. Just put a tick in the box against each statistic you want by clicking with the left mouse button, then Continue and OK. The other procedure is found under Analyze|Descriptive Statistics|Descriptives and gives a quick summary of each variable that includes the minimum and maximum scores, the mean and the standard deviation. This is a more useful layout if there are many variables, since they are listed down the page, one per row, rather than across it.
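
Syntax sketches for both routes, with hypothetical variables (the NOTABLE keyword suppresses the frequency table itself):

    FREQUENCIES VARIABLES=age
      /FORMAT=NOTABLE
      /STATISTICS=MEAN MEDIAN MODE STDDEV.
    DESCRIPTIVES VARIABLES=age income
      /STATISTICS=MEAN STDDEV MIN MAX.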

Univariate statistical inference

SPSS provides confidence intervals for the estimation of metric variables under Analyze|Descriptive Statistics|Explore. This gives you the Explore dialog box. Put a metric variable in the Dependent List and click on OK. The output provides many different statistical summaries, but the confidence interval for the mean at the 95 per cent level is the default. If you click on the Statistics box you can change the level of confidence, for example to 99 per cent. All these statistics can be split by a number of factors, for example gender of respondent, in which case you get separate tables for each. Just put the variable in the Factor List box.
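
The syntax equivalent is the EXAMINE command. A sketch, assuming hypothetical variables income (metric) and gender (factor), with the confidence level changed to 99 per cent:

    EXAMINE VARIABLES=income BY gender
      /CINTERVAL 99.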

Unfortunately, one thing that SPSS does not do is calculate the standard error of the proportion, which means that it cannot give you the corresponding confidence intervals for categorical variables. It can, however, test for differences between achieved sample proportions and hypothesized values. Under Analyze|Nonparametric Tests|One Sample, SPSS uses either a one-sample binomial test or a one-sample chi-square test, as appropriate, to assess the p-value for the difference between the sample result and the hypothesized values, which SPSS assumes are equal proportions in each category. To change the hypothesized proportions, select Legacy Dialogs instead of One Sample, then choose Binomial if the variable is binary (and change the Test Proportion) or Chi-square if it is nominal with three or more categories (and change the Expected Values).
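
The legacy tests can be run in syntax with NPAR TESTS. A sketch with hypothetical variables:

    * Binary variable tested against a hypothesized proportion of 0.6.
    NPAR TESTS /BINOMIAL (0.6)=smoke.
    * Nominal variable with three or more categories, equal expected frequencies.
    NPAR TESTS /CHISQUARE=brand /EXPECTED=EQUAL.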

For metric variables, SPSS offers a one-sample t-test. This is found under Analyze|Compare Means|One-Sample T Test. Put the metric variable into the Test Variable(s) box. Enter the Test Value, which is the hypothesized value, and click on OK.
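
A syntax sketch, assuming a hypothetical metric variable totscore and a hypothesized value of 50:

    T-TEST /TESTVAL=50 /VARIABLES=totscore.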

Crosstabulations

Select Analyze|Descriptive Statistics|Crosstabs to obtain the Crosstabs dialog box. Enter your dependent variable (e.g. intention to drink alcohol in the next year) in the Rows box so it will appear at the side, and enter the independent variable (e.g. brand importance for alcohol) in the Columns box. If you put several variables in each box, then you will obtain a crosstabulation of each combination. To obtain column percentages, click on the Cells button in the Crosstabs dialog box to obtain Crosstabs: Cell Display. Click on Column in the Percentages check box, uncheck the Observed box under Counts, then on Continue, then on OK. If you wish to change the order in which the categories of the dependent variable are listed, click on the Format button in the Crosstabs dialog box and change Row Order to Descending. Click on Continue, then on OK.

To obtain summary measures of association for two categorical variables, click on the Statistics button in the Crosstabs dialog box and you will obtain the Crosstabs: Statistics dialog box. SPSS offers no fewer than nine coefficients of association that may be used to create bivariate summaries for categorical data. However, some of these are for situations where both variables are nominal, some for where both are ordered category, and one for a mixture of nominal and metric. Select any of the statistics you require, click on Continue and then OK.
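
A single CROSSTABS command can combine the cell, format and statistics choices described above. A sketch using the hypothetical variables intent and brandimp:

    CROSSTABS /TABLES=intent BY brandimp
      /FORMAT=DVALUE
      /CELLS=COLUMN
      /STATISTICS=CHISQ PHI GAMMA.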

Correlation and regression 

To correlate two metric variables, select Analyze|Correlate|Bivariate. Put your two variables into the Variables box. Under Correlation Coefficients you will find the Pearson box already ticked. Spearman’s rho can also be obtained from this dialog box. Click on OK.

To obtain a regression analysis, select Analyze|Regression|Linear. Notice that you now have to make a selection of dependent and independent variables. The dependent variable, which would go along the Y-axis of a scattergram, is the one the researcher is trying to predict; the independent variable would go along the X-axis. Put them into the Dependent and Independent(s) boxes respectively and click on OK.
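
Syntax sketches for both procedures, with hypothetical variables:

    * Pearson and Spearman correlations.
    CORRELATIONS /VARIABLES=age income.
    NONPAR CORR /VARIABLES=age income /PRINT=SPEARMAN.
    * Simple linear regression of income on age.
    REGRESSION /DEPENDENT income /METHOD=ENTER age.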

Testing bivariate relationships for statistical significance

To test the statistical significance of a bivariate hypothesis linking two categorical variables, SPSS provides the statistical significance of chi-square. Check the tick-box against Chi-square in the Crosstabs: Statistics dialog box (select Analyze|Descriptive Statistics|Crosstabs|Statistics).

To test the statistical significance of a bivariate hypothesis linking two metric variables, SPSS provides a p-value for Pearson’s r, which appears automatically in the output of the Bivariate Correlations procedure.

One-way analysis of variance is to be found under Analyze|Compare Means|One-Way ANOVA. Put one or more metric dependent variables into the Dependent List box and the single categorical variable under Factor. Click on OK.
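
A syntax sketch, assuming a hypothetical metric variable totscore and categorical variable brandimp:

    ONEWAY totscore BY brandimp.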

Multivariate relationships

For analysing the relationships between three or more categorical variables, it is possible to get SPSS to produce three-way, four-way, up to n-way tables using the Crosstabs procedure. The control variables are called layers in SPSS. The original bivariate table is broken down by each category of the layer variable. Select Analyze|Descriptive Statistics|Crosstabs. Put your original dependent variable into the Row(s) box and the original independent variable into the Column(s) box. Put your ‘control’ variable into the Layer 1 of 1 box. Click on Statistics and tick Phi and Cramer’s V in the Nominal box and/or Gamma in the Ordinal box. Click on Continue then OK. Further layers may be added by clicking on the Next button.
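
In syntax, a layer is simply a second BY in the CROSSTABS command. A sketch with hypothetical variables, gender being the control variable:

    CROSSTABS /TABLES=intent BY brandimp BY gender
      /CELLS=COLUMN
      /STATISTICS=PHI GAMMA.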

An alternative to n-way tables for categorical variables is log-linear analysis. Select Analyze|Loglinear|Model Selection to obtain the Model Selection Loglinear Analysis dialog box. Transfer the variables whose relationships are to be studied into the Factor(s) box. SPSS needs to be told what codes have been used to define the categorical variables, so, for each variable, highlight the variable, click on Define Range and enter the Minimum and Maximum values. Click on Continue. The default model-building procedure is backwards elimination, which is the one normally used. Click on OK. There are several components to the output, but focus on the backwards elimination statistics. For a detailed explanation of all the components of the output see Field (2013: Chapter 18).
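
The syntax equivalent is HILOGLINEAR, with each variable's range of codes given in parentheses. A sketch with hypothetical variables:

    HILOGLINEAR intent(1 3) brandimp(1 2) gender(1 2)
      /METHOD=BACKWARD.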

To analyse the relationships between three or more metric variables, multiple regression is the standard procedure. Select Analyze|Regression|Linear to produce the Linear Regression dialog box. Move your metric dependent variable across to the Dependent box. Put the metric predictor variables into the Independent(s) box and click on OK. If the dependent variable is binary, use logistic regression. Select Analyze|Regression|Binary Logistic. This will give you the main Logistic Regression dialog box. Transfer the dependent variable into the Dependent box. Transfer the metric independent variables into the Covariates box. There are various methods of logistic regression, including stepwise procedures, but the default is what SPSS calls Entry. This is the forced entry method: all the covariates are placed into the regression model in one block. Click on OK.
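
Syntax sketches for both kinds of regression, with hypothetical variables:

    * Multiple regression with two metric predictors.
    REGRESSION /DEPENDENT income /METHOD=ENTER age totscore.
    * Logistic regression for a binary dependent variable.
    LOGISTIC REGRESSION VARIABLES smoke
      /METHOD=ENTER age income.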

If the dependent variable is metric, but the independent variables are categorical, then use multi-way (factorial) analysis of variance. Select Analyze|General Linear Model|Univariate to obtain the Univariate dialog box. Put the metric variable into the Dependent Variable box. Transfer the categorical independent variables to the Fixed Factor(s) box. Click on OK.
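
A syntax sketch, assuming a hypothetical metric variable totscore and two categorical factors:

    UNIANOVA totscore BY gender brandimp.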

If no distinction is to be made between dependent and independent variables, but all are to remain of equal status, then, if the variables are metric, use factor analysis. Select Analyze|Dimension Reduction|Factor to produce the Factor Analysis dialog box. Transfer the variables to the Variables box. Click on Rotation and select Varimax, then Continue and OK.
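
A syntax sketch, assuming four hypothetical items:

    FACTOR /VARIABLES=item1 item2 item3 item4
      /ROTATION=VARIMAX.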

Case-based procedures on SPSS

SPSS does not handle configurational analysis. The only procedures it offers that may be considered case-based are cluster analysis and discriminant analysis. Cluster analysis is a way of grouping cases into clusters when there is no a priori information or hypotheses about which cluster any case belongs to. If the number of cases is over 100 or so, the analysis becomes very difficult to interpret, so a sub-set may need to be selected. To select, for example, the first 50 cases in a dataset, go to Data|Select Cases. In the Select Cases dialog box choose Based on time or case range and enter range 1–50. Any analyses will now be carried out on these 50 cases. To return to the original cases, go back to Select Cases and select the All cases radio button.
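
One way to achieve the same selection in syntax is to build a filter from the system variable $CASENUM (the dialog box pastes slightly different commands, but the effect is the same):

    * Select the first 50 cases.
    COMPUTE filter_$ = ($CASENUM LE 50).
    FILTER BY filter_$.
    EXECUTE.
    * Restore all cases afterwards.
    FILTER OFF.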

Cluster analysis is found under Analyze|Classify and offers a choice of Hierarchical Cluster, K-Means Cluster and TwoStep Cluster. Hierarchical clustering is probably the most flexible. It can handle both metric and categorical data, it has a variety of clustering methods available (these are algorithms defining how similarity between multi-member clusters is measured) and it can standardize metric variables if different units have been used. Click on Hierarchical Cluster. In the Hierarchical Cluster Analysis box click on Method. The default is Between-groups linkage, which at each step merges the two clusters with the smallest average distance between all pairs of cases across the two clusters. Other options include Within-groups linkage, Nearest neighbor, Furthest neighbor, Centroid clustering, Median clustering and Ward’s method. There is also the option of standardizing the variables in various ways if the clustering variables have used different metrics. The default distance measure for metric variables (SPSS refers to metric variables in this procedure as Interval) is Squared Euclidean distance. This is the measure normally used, but Pearson correlation may be an alternative. For categorical variables (SPSS calls them Counts), Chi-square is normally used, but Phi-square is an alternative. Note that for Binary variables it is still possible to use squared Euclidean distance and a range of other measures. To obtain a dendrogram, click on Plots and then tick the Dendrogram box. Orientation is normally Vertical, but Horizontal is an alternative.
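
A syntax sketch of a hierarchical cluster analysis using the default method and distance measure and requesting a dendrogram; the three clustering variables are hypothetical:

    CLUSTER age income attitude
      /METHOD BAVERAGE
      /MEASURE=SEUCLID
      /PLOT DENDROGRAM.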

K-Means Cluster requires that all the variables selected are metric and that the researcher specifies the number of clusters required. The algorithm uses Euclidean distance measures and results are sensitive to the order in which the cases are placed into the analysis, so it is advisable to obtain several different solutions with cases sorted in different random orders to verify the stability of a given solution. The procedure assumes that the researcher has selected the appropriate number of clusters and has included all relevant variables. Having decided on the number of clusters, it is now possible to use K-Means Cluster to analyse the cases and to ‘fine-tune’ the allocation of cases to those clusters. Under Options it is possible to check the box Cluster information for each case. The output shows the number of cases in each cluster.
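
A syntax sketch, assuming three hypothetical metric variables and a three-cluster solution, with cluster membership printed for each case:

    QUICK CLUSTER age income attitude
      /CRITERIA=CLUSTER(3)
      /PRINT=CLUSTER.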

In discriminant analysis, the groupings and each case’s membership of them are already known. The objective is to discover which linear combinations of selected metric variables best predict each case’s group membership. The procedure is found under Analyze|Classify|Discriminant. Transfer into the Grouping Variable box the key nominal variable that contains the groupings you wish to be able to predict (e.g. the three groupings of intention to drink alcohol in the next year). You will need to define the range of codes: click on the Define Range button and enter the lowest and highest code values. Into Independents, transfer the metric variables to be used to predict the grouping. SPSS offers two methods of discriminant analysis: direct, in which the independent variables are entered into the discriminant function together; and stepwise, in which statistical criteria are used to determine the order of entry. The default is direct. To obtain an actual prediction for each case, click on Save and then tick the Predicted group membership box. The predicted group membership will appear as a new column in the data matrix labelled Dis_1. Click on OK.
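
A syntax sketch of a direct discriminant analysis with predicted group membership saved, using the hypothetical grouping variable intent (coded 1 to 3) and three hypothetical metric predictors:

    DISCRIMINANT /GROUPS=intent(1 3)
      /VARIABLES=age income attitude
      /METHOD=DIRECT
      /SAVE=CLASS.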