
Applied Psychological Measurement, 36(2), 122-146

© The Author(s) 2012. DOI: 10.1177/0146621612438725

Using the Graded Response Model to Control Spurious Interactions in Moderated Multiple Regression

Brendan J. Morse,1 George A. Johanson,2 and Rodger W. Griffeth2


Recent simulation research has demonstrated that using simple raw scores to operationalize a latent construct can result in inflated Type I error rates for the interaction term of a moderated statistical model when the interaction (or lack thereof) is proposed at the latent variable level. Rescaling the scores using an appropriate item response theory (IRT) model can mitigate this effect under similar conditions. However, this work has thus far been limited to dichotomous data. The purpose of this study was to extend this investigation to multicategory (polytomous) data using the graded response model (GRM). Consistent with previous studies, inflated Type I error rates were observed under some conditions when polytomous number-correct scores were used, and were mitigated when the data were rescaled with the GRM. These results support the proposition that IRT-derived scores are more robust to spurious interaction effects in moderated statistical models than simple raw scores under certain conditions.


Keywords: graded response model, item response theory, polytomous models, simulation

    Operationalizing a latent construct such as an attitude or ability is a common practice in psy-

    chological research. Stine (1989) described this process as the creation of a mathematical

    structure (scores) that represents the empirical structure (construct) of interest. Typically,

    researchers will use simple raw scores (e.g., either as a sum or a mean) from a scale or test

    as the mathematical structure for a latent construct. However, much debate regarding the

properties of such scores has ensued since S. S. Stevens's classic publication of the nominal,

    ordinal, interval, and ratio scales of measurement (Stevens, 1946). Although it is beyond the

    scope of this article to enter the scale of measurement foray, an often agreed-on position is

    that simple raw scores for latent constructs do not exceed an ordinal scale of measurement.

1Bridgewater State University, MA, USA
2Ohio University, Athens, USA

Corresponding author:
Brendan J. Morse, Department of Psychology, Bridgewater State University, 90 Burrill Avenue, 340 Hart Hall, Bridgewater, MA 02325, USA

This scale imbues such scores with limited mathematical properties and permissible transformations that are necessary for the appropriate application of parametric statistical

    models. Nonparametric, or distribution-free, statistics have been proposed as a solution for

    the scale of measurement problem. However, many researchers are reluctant to use nonpara-

    metric techniques because they are often associated with a loss of information pertaining to

    the nature of the variables (Gardner, 1975). McNemar (1969) articulated this point by say-

ing, "Consequently, in using a non-parametric method as a short-cut, we are throwing away dollars in order to save pennies" (p. 432).

    Assuming that simple raw scores are limited to the ordinal scale of measurement and

    researchers typically prefer parametric models to their nonparametric analogues, the empiri-

    cal question regarding the robustness of various parametric statistical models to scale viola-

    tions arises. Davison and Sharma (1988) and Maxwell and Delaney (1985) demonstrated

    through mathematical derivations that there is little cause for concern when comparing mean

    group differences in the independent samples t test when the assumptions of normality and

    homogeneity of variance are met. However, Davison and Sharma (1990) subsequently

    demonstrated that scaling-induced spurious interaction effects could occur with ordinal-level

    observed scores in multiple regression analyses. These findings suggest that scaling may

    become a problem when a multiplicative interaction term is introduced into a parametric sta-

    tistical model.
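The mechanism behind such scaling-induced interactions can be sketched numerically. The snippet below is an illustration with hypothetical values, not a reconstruction of Davison and Sharma's derivation: a latent outcome that is strictly additive in two predictors is observed through a monotone but nonlinear score transformation (here, a hard ceiling mimicking a test that is too easy), and the product term in the regression picks up a spurious effect.

```python
import numpy as np

# A full factorial grid of two latent predictors.
x1, x2 = np.meshgrid(np.linspace(-2, 2, 21), np.linspace(-2, 2, 21))
x1, x2 = x1.ravel(), x2.ravel()

# Latent outcome is strictly additive: no interaction exists.
latent = x1 + x2

# Observed raw score: a monotone transformation with a ceiling,
# mimicking an assessment that is too easy for high-standing examinees.
raw = np.minimum(latent, 1.5)

# Moderated multiple regression design matrix with a product term.
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
b_latent, *_ = np.linalg.lstsq(X, latent, rcond=None)
b_raw, *_ = np.linalg.lstsq(X, raw, rcond=None)

print(b_latent[3])  # interaction weight on the latent scale: essentially zero
print(b_raw[3])     # interaction weight on the ceilinged raw scale: clearly nonzero
```

On the latent scale the fitted interaction weight is zero to machine precision; on the ceiling-transformed scores it is not, even though no moderation exists at the latent level. That difference is a scaling artifact, the situation the studies reviewed below are designed to detect.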

    Scaling and Item Response Theory (IRT)

    An alternative solution to the scale of measurement issue for parametric statistics is to rescale

    the raw data itself into an interval-level metric, and a variety of methods for this rescaling have

    been proposed (see Embretson, 2006; Granberg-Rademacker, 2010; Harwell & Gatti, 2001). A

potential method for producing scores with near interval-level scaling properties is the application of IRT models to operationalize number-correct scores into estimated theta scores, the IRT-derived estimate of an individual's ability or latent construct standing. Conceptually, the

    attractiveness of this method rests with the invariance property in IRT scaling, and such scores

    may provide a more appropriate metric for use in parametric statistical analyses.1 Reise,

    Ainsworth, and Haviland (2005) stated that

    Trait-level estimates in IRT are superior to raw total scores because (a) they are optimal scalings of

    individual differences (i.e., no scaling can be more precise or reliable) and (b) latent-trait scales have

    relatively better (i.e., closer to interval) scaling properties. (p. 98, italics in original)

    In addition, Reise and Haviland (2005) gave an elegant treatment of this condition by demon-

    strating that the log-odds of endorsing an item and the theta scale form a linearly increasing rela-

    tionship. Specifically, the rate of change on the theta scale is preserved (for all levels of theta) in

    relation to the log-odds of item endorsement.
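This linearity can be verified directly: for a two-parameter logistic item, the log-odds of endorsement equal a(theta - b), so equal steps on the theta scale always produce equal steps in log-odds. A minimal sketch with hypothetical item parameters:

```python
import numpy as np

a, b = 1.3, 0.4                    # hypothetical discrimination and difficulty
theta = np.linspace(-3, 3, 13)     # equally spaced trait levels (step = 0.5)

p = 1.0 / (1.0 + np.exp(-a * (theta - b)))   # 2PL endorsement probability
log_odds = np.log(p / (1.0 - p))             # recovers a * (theta - b)

# Equal theta increments yield a constant log-odds increment of a * 0.5.
print(np.diff(log_odds))
```

Each successive difference equals a times the theta step, which is the linearly increasing relationship Reise and Haviland describe.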

    Empirical Evidence of IRT Scaling

    In a simulation testing the effect of scaling and test difficulty on interaction effects in factor-

    ial analysis of variance (ANOVA), Embretson (1996) demonstrated that Type I and Type II

    errors for the interaction term could be exacerbated when simple raw scores are used under

    nonoptimal psychometric conditions. Such errors occurred primarily due to the ordinal-level

    scaling limitations of simple raw scores, and the ceiling and floor effects imposed when an

assessment is either too easy or too difficult for a group of individuals, a condition known

as assessment inappropriateness (see Figure 1). Embretson fitted the one-parameter logistic (Rasch) model to the data and was able to mitigate the null hypothesis errors using the estimated theta scores rather than the simple raw scores.

Figure 1. A representation of the latent construct distribution and test information (reliability) distributions for appropriate assessments (top) and inappropriate assessments (bottom). [Panels, not reproduced here, plot the trait score and test information distributions against theta from -4 to 4.]

These results illuminated the usefulness of IRT scaling for dependent variables in factorial models, especially under suboptimal psychometric conditions. Embretson argued that researchers are often unaware when

    these conditions are present and can benefit from using appropriately fitted IRT models to

    generate scores that are more appropriate for use with parametric analyses.
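As a concrete sketch of the rescaling step described above, the fragment below scores fixed response patterns under the Rasch model with an expected a posteriori (EAP) estimate over a standard normal prior. The item difficulties and response patterns are hypothetical, and the quadrature approach is one common choice rather than the specific estimator Embretson used.

```python
import numpy as np

def rasch_prob(theta, b):
    # Rasch (one-parameter logistic) probability of a correct response.
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def eap_theta(responses, b, n_quad=61):
    # Expected a posteriori theta under a standard normal prior,
    # approximated on a fixed quadrature grid.
    grid = np.linspace(-4.0, 4.0, n_quad)
    prior = np.exp(-0.5 * grid ** 2)
    p = rasch_prob(grid[:, None], b[None, :])            # n_quad x n_items
    likelihood = np.prod(np.where(responses, p, 1.0 - p), axis=1)
    posterior = likelihood * prior
    return np.sum(grid * posterior) / np.sum(posterior)

b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])       # hypothetical item difficulties
print(eap_theta(np.array([0, 0, 0, 0, 0]), b))  # all incorrect: low theta
print(eap_theta(np.array([1, 1, 1, 0, 0]), b))  # partially correct: middling theta
print(eap_theta(np.array([1, 1, 1, 1, 1]), b))  # all correct: high theta
```

The estimated theta scores increase with the number-correct score but are expressed on the latent metric rather than the bounded raw-score metric, which is what makes them candidates for parametric analyses.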

    An important question that now arises is whether these characteristics extend to more com-

    plex IRT models such as the two- and three-parameter logistic models (dichotomous models

    with a discrimination and guessing parameter, respectively) and polytomous models. Although

the Rasch model demonstrates desirable measurement characteristics (i.e., true parameter invariance; Embretson & Reise, 2000; Fischer, 1995; Perline, Wright, & Wainer, 1979), it is sometimes too restrictive to use in practical contexts. However, the consensus is that non-Rasch models can achieve interval-level scaling properties

    (Embretson & Reise, 2000; Hambleton, Swaminathan, & Rogers, 1991; Harwell & Gatti, 2001;

    Reise et al., 2005). Investigations into the scaling properties of these more complex IRT models

    are thus necessary.
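For reference, the graded response model examined in the work that follows defines a cumulative (boundary) probability for each category threshold and obtains category probabilities by differencing adjacent boundaries. A minimal sketch for a single four-category item, with hypothetical parameter values:

```python
import numpy as np

def grm_category_probs(theta, a, b):
    # Samejima's graded response model for one polytomous item.
    # a: discrimination; b: increasing between-category thresholds
    # (len(b) = number of categories - 1).
    b = np.asarray(b, dtype=float)
    cum = 1.0 / (1.0 + np.exp(-a * (theta - b)))   # P(X >= k) for k = 1..m
    cum = np.concatenate(([1.0], cum, [0.0]))      # add P(X >= 0) and P(X >= m+1)
    return cum[:-1] - cum[1:]                      # P(X = k) for k = 0..m

# Hypothetical four-category item.
probs = grm_category_probs(theta=0.3, a=1.5, b=[-1.0, 0.0, 1.0])
print(probs, probs.sum())   # four category probabilities summing to 1
```

Low theta concentrates mass in the lowest category and high theta in the highest, so a respondent's category probabilities shift monotonically along the latent scale.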

In one extension of this sort, Kang and Waller (2005) simulated the effects of scaling (simple raw scores vs. estimated theta scores derived from a two-parameter logistic IRT model) and assessment appropriateness on the interaction term in a moderated multiple regression

    (MMR) analysis. Similar to the findings of Embretson (1996), Kang and Waller discovered

that using simple raw scores to operationalize a latent construct resulted in substantial inflations of the Type I error rate (greater than 50%, or p > .50) for the interaction term in MMR under conditions of assessment inappropriateness. However, the IRT-derived theta score estimates were found to mitigate the Type I error rate to acceptable levels (less than 10%, or p < .10) under the same conditions. This extension demonstrated that the estimated theta scores from a

    non-Rasch IRT model could be used to better fit the assumptions of parametric statistical

    models involving an interaction term. Finally, Harwell and Gatti (2001) investigated the

    congruence of estimated (theta) and actual construct scores using a popular polytomous IRT

    model, the graded response model (GRM; Samejima, 1969, 1996). The authors posited that

    if the estimated construct (theta) scores were sufficiently similar to the actual construct

    (theta) scores, which