SAGE Journal Articles

Click on the following links. Please note these will open in a new window.

Franke, T. M., Ho, T., & Christie, C. A. (2012). The chi-square test: Often used and more often misinterpreted. American Journal of Evaluation, 33(3), 448–458. doi:10.1177/1098214011426594.

The examination of cross-classified category data is common in evaluation and research, with Karl Pearson’s family of chi-square tests representing one of the most utilized statistical analyses for answering questions about the association or difference between categorical variables. Unfortunately, these tests are also among the more commonly misinterpreted statistical tests in the field. The problem is not that researchers and evaluators misapply the results of chi-square tests, but rather that they tend to overinterpret or incorrectly interpret the results, leading to statements that may have limited or no statistical support based on the analyses performed. This paper attempts to clarify any confusion about the uses and interpretations of the family of chi-square tests developed by Pearson, focusing primarily on the chi-square tests of independence and homogeneity of variance (identity of distributions). A brief survey of the recent evaluation literature is presented to illustrate the prevalence of the chi-square test and to offer examples of how these tests are misinterpreted. While the omnibus forms of all three tests in the Karl Pearson family of chi-square tests – independence, homogeneity, and goodness-of-fit – use essentially the same formula, each of these three tests is, in fact, distinct, with specific hypotheses, sampling approaches, interpretations, and options following rejection of the null hypothesis. Finally, a little-known option, the use and interpretation of post hoc comparisons based on Goodman’s (1963) procedure following rejection of the chi-square test of homogeneity, is described in detail.

Questions to Consider

1. Explain why these tests are among the more commonly misinterpreted statistical tests in the field.

Cognitive Domain: Knowledge

Difficulty Level: Medium

 

2. How and why do researchers tend to overinterpret or incorrectly interpret the results?

Cognitive Domain: Comprehension

Difficulty Level: Medium

 

3. How do these misinterpretations lead to statements that may have limited or no statistical support based on the analyses performed? Explain.

Cognitive Domain: Comprehension, Application

Difficulty Level: Medium–Hard

 

Tang, M., Pei, Y., Wong, W., & Li, J. (2012). Goodness-of-fit tests for correlated paired binary data. Statistical Methods in Medical Research, 21(4), 331–345. doi:10.1177/0962280210381176.

We review a few popular statistical models for correlated binary outcomes, present maximum likelihood estimates for the model parameters, and discuss model selection issues using a variety of goodness-of-fit test statistics. We apply computationally efficient bootstrap strategies to evaluate the performance of the goodness-of-fit statistics and observe that, in general, the power and the type I error rate of the goodness-of-fit statistics depend on the model under investigation. Our simulation results show that careful choice of goodness-of-fit statistics is an important issue, especially when the sample is small and the outcomes are highly correlated. Two biomedical applications are included.
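The bootstrap idea behind the abstract can be sketched in miniature. This is not the authors' procedure; it is a generic parametric bootstrap for a goodness-of-fit statistic, using hypothetical cell counts for paired binary outcomes and an independence model fitted by maximum likelihood (the `fit_independence` helper and the counts are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def pearson_gof(counts, probs, n):
    """Pearson X^2 over the four (0,0), (0,1), (1,0), (1,1) cells."""
    expected = n * probs
    return np.sum((counts - expected) ** 2 / expected)

def fit_independence(counts):
    """MLE cell probabilities assuming the two members of a pair are independent."""
    n = counts.sum()
    p1 = (counts[2] + counts[3]) / n  # P(first member = 1)
    p2 = (counts[1] + counts[3]) / n  # P(second member = 1)
    return np.array([(1 - p1) * (1 - p2), (1 - p1) * p2,
                     p1 * (1 - p2), p1 * p2])

def bootstrap_pvalue(counts, B=2000):
    """Parametric bootstrap p-value: resample under the fitted model and
    count how often the simulated statistic exceeds the observed one."""
    n = counts.sum()
    probs = fit_independence(counts)
    observed = pearson_gof(counts, probs, n)
    exceed = 0
    for _ in range(B):
        sim = rng.multinomial(n, probs)
        if pearson_gof(sim, fit_independence(sim), n) >= observed:
            exceed += 1
    return exceed / B

# Hypothetical counts for n=100 pairs, cells (0,0), (0,1), (1,0), (1,1);
# the strong diagonal suggests the pair members are correlated.
counts = np.array([40, 15, 10, 35])
print(f"bootstrap p-value: {bootstrap_pvalue(counts):.3f}")
```

Because the null distribution is simulated rather than taken from a chi-square reference table, this approach remains usable when small samples or high correlation make the asymptotic approximation unreliable, which is the regime the abstract flags.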

Questions to Consider

1. Briefly describe the chronological development of models for paired discrete data.

Cognitive Domain: Comprehension, Analysis

Difficulty Level: Medium–Hard

 

2. How does goodness-of-fit relate to small sample sizes? Explain.

Cognitive Domain: Comprehension

Difficulty Level: Hard

 

Algina, J., Keselman, H. J., & Penfield, R. D. (2006). Confidence interval coverage for Cohen’s effect size statistic. Educational and Psychological Measurement, 66(6), 945–960. doi:10.1177/0013164406288161.

Kelley compared three methods for setting a confidence interval (CI) around Cohen’s standardized mean difference statistic: the noncentral-t-based, percentile (PERC) bootstrap, and bias-corrected and accelerated (BCA) bootstrap methods under three conditions of nonnormality, eight cases of sample size, and six cases of population effect size (ES) magnitude. Kelley recommended the BCA bootstrap method. The authors expand on his investigation by including additional cases of nonnormality. Like Kelley, they find that under many conditions, the BCA bootstrap method works best; however, they also find that in some cases of nonnormality, the method does not control probability coverage. The authors also define a robust parameter for ES and a robust sample statistic, based on trimmed means and Winsorized variances, and cite evidence that coverage probability for this parameter is good over the range of nonnormal distributions investigated when the PERC bootstrap method is used to set CIs for the robust ES.
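To see what the percentile (PERC) bootstrap method referenced in the abstract does mechanically, here is a minimal sketch for a CI around Cohen’s d with simulated normal data. The samples, sample sizes, and number of bootstrap replicates are all illustrative assumptions, and this plain percentile interval omits the bias correction and acceleration adjustments that distinguish the BCA method:

```python
import numpy as np

rng = np.random.default_rng(42)

def cohens_d(x, y):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) +
                  (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

def percentile_ci(x, y, B=5000, alpha=0.05):
    """PERC bootstrap: resample each group with replacement, recompute d,
    and take the alpha/2 and 1 - alpha/2 quantiles of the d replicates."""
    ds = np.empty(B)
    for b in range(B):
        xb = rng.choice(x, size=len(x), replace=True)
        yb = rng.choice(y, size=len(y), replace=True)
        ds[b] = cohens_d(xb, yb)
    return np.quantile(ds, [alpha / 2, 1 - alpha / 2])

# Hypothetical samples: two groups whose population means differ by 0.5 SD.
x = rng.normal(0.5, 1.0, size=40)
y = rng.normal(0.0, 1.0, size=40)
lo, hi = percentile_ci(x, y)
print(f"d = {cohens_d(x, y):.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

The article's coverage question is whether intervals built this way actually contain the population ES 95% of the time under nonnormality, which is where the plain percentile method and the BCA refinement can diverge.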

Questions to Consider

1. Explain why the authors find that under many conditions, the BCA bootstrap method works best. What is the rationale?

Cognitive Domain: Comprehension, Analysis

Difficulty Level: Medium

 

2. The authors find that in the BCA bootstrap method, in some cases of nonnormality, the method does not control probability coverage. Why?

Cognitive Domain: Comprehension, Analysis

Difficulty Level: Medium–Hard

 

3. One important finding the author noted was that a noncentral-t (NCT)-based CI has inaccurate coverage probability when data are nonnormal. Explain this.

Cognitive Domain: Comprehension, Analysis

Difficulty Level: Hard

 

Gingerich, A. C., & Lineweaver, T. T. (2014). OMG! Texting in class = U fail :( Empirical evidence that text messaging during class disrupts comprehension. Teaching of Psychology, 41(1), 44–51.

In two experiments, we examined the effects of text messaging during lecture on comprehension of lecture material. Students (in Experiment 1) and randomly assigned participants (in Experiment 2) in a text message condition texted a prescribed conversation while listening to a brief lecture. Students and participants in the no-text condition refrained from texting during the same lecture. Postlecture quiz scores confirmed the hypothesis that texting during lecture would disrupt comprehension and retention of lecture material. In both experiments, the no-text group significantly outscored the text group on the quiz and felt more confident about their performance. The classroom demonstration described in Experiment 1 provides preliminary empirical evidence that texting during class disrupts comprehension in an actual classroom environment. Experiment 2 addressed the selection bias and demand characteristic issues present in Experiment 1 and replicated the main findings. Together, these two experiments clearly illustrate the detrimental effects of texting during class, which could discourage such behavior in students.

Questions to Consider

1. Gingerich and Lineweaver did not use a chi-square to test their hypothesis about text messaging in class but could have. How could they set up their data to compute a chi-square test?

Learning Objective: Categorizing participants

Cognitive Domain: Synthesis

Difficulty Level: Hard

 

2. If Gingerich and Lineweaver had calculated letter grades – A, B, C, D, F – and used those to conduct a chi-square test, what would be the degrees of freedom? (a) 2, (b) 3, (c) 4, (d) 5.

Learning Objective: Categorizing participants

Cognitive Domain: Analysis

Difficulty Level: Medium

 

3. In Experiment 1, Gingerich and Lineweaver had 67 student participants. According to the power analysis information in your textbook, that would have been adequate for detecting: (a) A large effect size. (b) A medium effect size. (c) A small effect size. (d) None of the above.

Learning Objective: Effect size

Cognitive Domain: Application

Difficulty Level: Medium

 

Verberne, F. M. F., Ham, J., & Midden, C. J. H. (2015). Trusting a virtual driver that looks, acts, and thinks like you. Human Factors, 57(5), 895–909.

Objective: We examined whether participants would trust an agent who was similar to them more than an agent who was dissimilar to them.

Background: Trust is an important psychological factor determining the acceptance of smart systems. Because smart systems tend to be treated like humans, and similarity has been shown to increase trust in humans, we expected that similarity would increase trust in a virtual agent.

Methods: In a driving simulator experiment, participants (N = 111) were presented with a virtual agent who was either similar to them or not. This agent functioned as their virtual driver in a driving simulator, and trust in this agent was measured. Furthermore, we measured how trust changed with experience.

Results: Prior to experiencing the agent, the similar agent was trusted more than the dissimilar agent. This effect was mediated by perceived similarity. After experiencing the agent, the similar agent was still trusted more than the dissimilar agent.

Conclusion: Just as similarity between humans increases trust in another human, similarity also increases trust in a virtual agent. When such an agent is presented as a virtual driver in a self-driving car, it could possibly enhance the trust people have in such a car.

Application: Displaying a virtual driver that is similar to the human driver might increase trust in a self-driving car.

Questions to Consider

1. Why did the authors use a one-way chi-square for their manipulation check regarding awareness of the research question?

Learning Objective: One-way chi-square

Cognitive Domain: Evaluation

Difficulty Level: Medium

 

2. The researchers found that between conditions there were differences in the frequency of participants aware of the: (a) similarity manipulation, (b) mimicry manipulation, (c) shared goals manipulation, (d) virtual driver condition.

Learning Objective: One-way χ2 with equal expected frequencies

Cognitive Domain: Comprehension

Difficulty Level: Easy

 

3. At one point, the researchers froze the simulation and asked participants whether they trusted Bob to complete the scenario successfully and safely. These data would be appropriate for a one-way chi-square: (a) if the researchers wanted to compare the number of yes–no responses, (b) if the researchers wanted to compare yes–no responses across conditions, (c) if they computed an average trust score, (d) if the expected frequency was known in advance.

Learning Objective: Research design

Cognitive Domain: Analysis

Difficulty Level: Medium