Answers to Exercises and questions for Discussion

Select the first 50 cases from the SPSS alcohol marketing dataset and undertake a hierarchical cluster analysis on the brand importance items. Try crosstabulating cluster memberships against gender.

Cluster analysis is a range of descriptive, exploratory techniques usually for grouping cases rather than variables into different clusters such that members of any cluster are more similar to each other in some way than they are to members in other clusters. Each case is assigned to a cluster suggested by the data, not defined beforehand, based on specified properties as variables (rather than being assigned to a given configuration based on properties as set memberships, as in Chapter 7). Hierarchical clustering develops a tree-like structure, usually based on taking individual cases and combining them on the basis of some measure of similarity, such as the degree of correlation between the cases on a number of variables.

To select the first 50 cases in the alcohol marketing dataset, go to Data|Select Cases. In the Select Cases dialog box choose Based on time or case range and enter the range 1–50. Any analyses will now be carried out on these 50 cases only. Now select Analyze|Classify|Hierarchical Cluster and put the brand importance items into the Variable(s) box. To obtain a dendrogram, click on Plots and tick the Dendrogram box, then click on Continue and OK.

Eight of the first 50 cases have missing values and are excluded, leaving 42 in the analysis. In the Agglomeration Schedule there are n − 1 = 41 stages of agglomeration for the 42 cases. In the first stage, cases 45 and 49 have been combined; in stage 2, cases 18 and 31; and so on. The pairs combined earliest have the shortest ‘distance’ between them, and they reappear in the dendrogram with the distance indicated along the horizontal scale. The agglomeration Coefficient measures the increase in heterogeneity (reduction in within-cluster similarity) that occurs when two clusters are combined. There is no sudden jump in the coefficient, but the biggest jump occurs at stage 30 when cases 22 and 30 are combined. From the dendrogram, we can see that there are about six clusters at this stage (working from right to left), so a six-cluster solution may be appropriate.
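The logic of the agglomeration schedule can be sketched outside SPSS. The Python below is a minimal illustration only, not the SPSS procedure itself: it assumes average (between-groups) linkage on squared Euclidean distances, and the eight cases are invented ratings on two hypothetical items, not values from the alcohol marketing dataset. It builds the n − 1 stages and looks for the biggest jump in the coefficient.

```python
from itertools import combinations

def squared_euclidean(a, b):
    """Squared Euclidean distance between two cases."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def agglomeration_schedule(cases):
    """Combine the two closest clusters at each stage (average linkage).

    Returns one (members_a, members_b, coefficient) tuple per stage.
    """
    clusters = {i: [i] for i in range(len(cases))}
    schedule = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(sorted(clusters), 2):
            distances = [squared_euclidean(cases[i], cases[j])
                         for i in clusters[a] for j in clusters[b]]
            average = sum(distances) / len(distances)
            if best is None or average < best[0]:
                best = (average, a, b)
        coefficient, a, b = best
        schedule.append((clusters[a], clusters[b], coefficient))
        clusters[a] = clusters[a] + clusters.pop(b)
    return schedule

# Invented ratings on two hypothetical items (NOT the real dataset).
cases = [(1, 1), (1, 2), (2, 1),
         (5, 5), (5, 6), (6, 5),
         (9, 1), (9, 2)]

schedule = agglomeration_schedule(cases)
for stage, (left, right, coefficient) in enumerate(schedule, start=1):
    print(stage, left, right, round(coefficient, 2))

# jumps[k] is the rise in the coefficient from stage k + 1 to stage k + 2.
coefficients = [coefficient for _, _, coefficient in schedule]
jumps = [later - earlier for earlier, later in zip(coefficients, coefficients[1:])]
stage_of_jump = jumps.index(max(jumps)) + 2

# Number of clusters that exist just before the stage with the big jump.
clusters_before_jump = len(cases) - (stage_of_jump - 1)
print("Biggest jump at stage", stage_of_jump,
      "suggesting", clusters_before_jump, "clusters")
```

With these invented cases the schedule has seven stages, and stopping just before the stage with the largest jump leaves three clusters. In real data the jump is often much less clear-cut, as in the exercise above.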

Now return to the Hierarchical Cluster Analysis screen, select Save, then under Cluster Membership select Single solution and enter 6 for number of clusters. Click on Continue and OK. A new variable should appear on your data matrix giving cluster membership for each of the 42 cases in the analysis. SPSS will have given the variable a name like CLU6, which you can change if you wish. This variable can now be crosstabulated against gender if you wish to compare cluster memberships between males and females.
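The crosstabulation itself is simply a count of cases in each combination of cluster and gender. As a rough sketch, assuming the saved membership variable and gender have been exported as pairs (the eight records below are invented for illustration, and the names are hypothetical):

```python
from collections import Counter

# Invented (cluster membership, gender) pairs; NOT taken from the dataset.
records = [
    (1, "Male"), (1, "Female"), (2, "Female"), (2, "Female"),
    (3, "Male"), (1, "Male"), (3, "Female"), (2, "Male"),
]

counts = Counter(records)
clusters = sorted({cluster for cluster, _ in records})
genders = sorted({gender for _, gender in records})

# One row per cluster, one column per gender.
table = [[counts[(cluster, gender)] for gender in genders]
         for cluster in clusters]

print("Cluster", *genders)
for cluster, row in zip(clusters, table):
    print(cluster, *row)
```

Each row can then be compared across the gender columns, ideally as percentages, to see whether males and females are distributed differently over the clusters.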

Undertake a discriminant analysis, taking Drink status as the grouping variable and Total importance of brands, Number of channels seen and Total involvement as the independents.

In discriminant analysis the groupings and each case’s membership are already known. The objective of discriminant analysis is to find which specific linear relationships of selected metric variables best predict each case’s group membership. Discriminant analysis is useful for situations where researchers want to build a predictive model of group membership based on observed characteristics of each case. The discriminant functions are generated from a sample of cases for which group membership is known; these can then be applied to new cases with measurements for the predictor variables but unknown group membership.

Select Analyze|Classify|Discriminant. Transfer into the Grouping Variable box the key nominal variable that contains the groupings you wish to be able to predict, in this exercise Drink status. You will need to define the range of codes: click on the Define Range button and enter the lowest and highest code values, 0 and 1 in this exercise. Into Independents, transfer the metric variables to be used to predict the grouping, in this exercise Total importance of brands, Number of channels seen and Total involvement.

To obtain an actual prediction for each case, click on Save and then the box Predicted group membership. Click on Continue and OK. The predicted group membership will appear as a new column in the data matrix labelled Dis_1.

The SPSS output includes a case processing summary that shows that there are 518 valid cases (402 had at least one discriminating variable missing). Since there are only two groupings for drink status, there is only one discriminant function, which explains all the variance. Total involvement makes the biggest contribution, but the degree of discrimination between the two categories of drink status is fairly limited. Since the categorical dependent variable is binary, binary logistic regression might be a better choice, because its assumptions are somewhat less stringent.
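To see what a single discriminant function does in the two-group case, the following Python is a minimal sketch of Fisher's linear discriminant. It is not the SPSS routine: it uses only two hypothetical predictors rather than the three in the exercise, assumes equal prior probabilities for the two groups, and the eight cases are invented.

```python
def mean_vector(rows):
    """Mean of each predictor within a group."""
    return [sum(column) / len(rows) for column in zip(*rows)]

def pooled_covariance(groups):
    """Pooled within-group covariance matrix (2 x 2) for two predictors."""
    sums = [[0.0, 0.0], [0.0, 0.0]]
    for rows in groups:
        means = mean_vector(rows)
        for row in rows:
            dev = [row[0] - means[0], row[1] - means[1]]
            for i in range(2):
                for j in range(2):
                    sums[i][j] += dev[i] * dev[j]
    degrees_of_freedom = sum(len(rows) for rows in groups) - len(groups)
    return [[value / degrees_of_freedom for value in line] for line in sums]

def fit_discriminant(group0, group1):
    """Return the weights and cut-off score of the single discriminant function."""
    m0, m1 = mean_vector(group0), mean_vector(group1)
    (a, b), (c, d) = pooled_covariance([group0, group1])
    det = a * d - b * c
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    # Weights are the inverse pooled covariance times the mean difference.
    weights = [(d * diff[0] - b * diff[1]) / det,
               (-c * diff[0] + a * diff[1]) / det]
    # Cut-off lies midway between the group means (equal priors assumed).
    cutoff = sum(w * (x0 + x1) / 2 for w, x0, x1 in zip(weights, m0, m1))
    return weights, cutoff

def predict(weights, cutoff, case):
    """Predicted group code (0 or 1) for one case."""
    score = sum(w * x for w, x in zip(weights, case))
    return 1 if score > cutoff else 0

# Invented scores on two hypothetical predictors; NOT the real dataset.
group0 = [(2, 1), (3, 2), (4, 2), (3, 3)]   # coded 0 on the grouping variable
group1 = [(6, 5), (7, 6), (8, 5), (7, 7)]   # coded 1 on the grouping variable

weights, cutoff = fit_discriminant(group0, group1)
predicted = [predict(weights, cutoff, case) for case in group0 + group1]
print(predicted)
```

The list of predictions plays the role of the saved Dis_1 column: comparing it with the known group codes shows how well the function discriminates. In this invented example the groups are well separated, which is not the case for drink status in the exercise.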

In what circumstances would you employ discriminant analysis rather than cluster analysis?

Both cluster and discriminant analysis are largely exploratory techniques that focus on classification; they are not usually used to ‘test’ or evaluate models. They focus either on the generation of taxonomies or on working out in what way variables may be ‘responsible’ for a pre-specified taxonomy. Cluster analysis is used for a variety of purposes, including the identification of ‘natural’ clusters in the data, the construction of useful conceptual schemes or taxonomies for classifying entities, data reduction, generating hypotheses, or testing hypothesized groupings believed to be present. Although the variables used for clustering techniques may be metric or categorical, it is best to stick to metric variables. Binary or dummy variables, while quite appropriate for regression models, create a problem in cluster analysis, since no amount of recoding will transform nominal items into metric ones whose differences imply distances.

In discriminant analysis the groups and each case’s membership of them are known in advance. The objective of discriminant analysis is to find which specific linear relationships of selected metric variables best predict each case’s group membership.

Discriminant analysis is useful for situations where researchers want to build a predictive model of group membership based on observed metric characteristics of each case.

What assumptions are being made by cluster analysis and discriminant analysis?

Cluster analysis assumes that there are always clusters even when there are, in fact, no natural groupings in the data. The various techniques work by imposing a cluster structure on the data rather than allowing the structure to emerge from the analysis. Accordingly, researchers should always have a strong conceptual basis for why such groups should exist in the first place and the results of cluster analysis should always be seen as tentative rather than final. Cluster analysis is essentially an exploratory rather than a verificational method and it is descriptive rather than inferential. Cluster solutions, furthermore, are not generalizable because they are totally dependent on the variables used as the basis for similarity measures.

Cluster analysis, furthermore, assumes that there is limited multicollinearity and that there are no outliers, which can severely distort the results.

Discriminant analysis assumes that the dependent variable is binary or nominal and represents group differences of interest. Furthermore, it assumes that the metric independent variables are normally distributed with more or less equal variances and that, once again, multicollinearity is limited.

Why is one approach a dependence technique and the other an interdependence technique?

Cluster analysis is an interdependence technique in that no distinction is made between dependent and independent variables. It cannot, furthermore, be used for prediction, but only to profile the clusters that are generated. Discriminant analysis is a dependence technique in that its objective is precisely to make predictions of a dependent categorical variable from specific linear relationships of selected metric variables.

Analysing Quantitative Data

Variable-based and Case-based Approaches to Non-experimental Datasets

Student Resources
