
Applied Psychological Measurement 36(2) 122-146
© The Author(s) 2012
Reprints and permission: sagepub.com/journalsPermissions.nav
DOI: 10.1177/0146621612438725
http://apm.sagepub.com

Using the Graded Response Model to Control Spurious Interactions in Moderated Multiple Regression

Brendan J. Morse1, George A. Johanson2, and Rodger W. Griffeth2

Abstract

Recent simulation research has demonstrated that using simple raw scores to operationalize a latent construct can result in inflated Type I error rates for the interaction term of a moderated statistical model when the interaction (or lack thereof) is proposed at the latent variable level. Rescaling the scores using an appropriate item response theory (IRT) model can mitigate this effect under similar conditions. However, this work has thus far been limited to dichotomous data. The purpose of this study was to extend this investigation to multicategory (polytomous) data using the graded response model (GRM). Consistent with previous studies, inflated Type I error rates were observed under some conditions when polytomous number-correct scores were used, and were mitigated when the data were rescaled with the GRM. These results support the proposition that IRT-derived scores are more robust to spurious interaction effects in moderated statistical models than simple raw scores under certain conditions.

Keywords

graded response model, item response theory, polytomous models, simulation

Operationalizing a latent construct such as an attitude or ability is a common practice in psychological research. Stine (1989) described this process as the creation of a mathematical structure (scores) that represents the empirical structure (construct) of interest. Typically, researchers will use simple raw scores (e.g., either as a sum or a mean) from a scale or test as the mathematical structure for a latent construct. However, much debate regarding the properties of such scores has ensued since S. S. Stevens's classic publication of the nominal, ordinal, interval, and ratio scales of measurement (Stevens, 1946). Although it is beyond the scope of this article to enter the scale of measurement foray, an often agreed-on position is that simple raw scores for latent constructs do not exceed an ordinal scale of measurement. This scale imbues such scores with limited mathematical properties and permissible transformations that are necessary for the appropriate application of parametric statistical models. Nonparametric, or distribution-free, statistics have been proposed as a solution for the scale of measurement problem. However, many researchers are reluctant to use nonparametric techniques because they are often associated with a loss of information pertaining to the nature of the variables (Gardner, 1975). McNemar (1969) articulated this point by saying, "Consequently, in using a non-parametric method as a short-cut, we are throwing away dollars in order to save pennies" (p. 432).

1 Bridgewater State University, MA, USA
2 Ohio University, Athens, USA

Corresponding author:
Brendan J. Morse, Department of Psychology, Bridgewater State University, 90 Burrill Avenue, 340 Hart Hall, Bridgewater, MA 02325, USA
Email: bmorse@bridgew.edu

Assuming that simple raw scores are limited to the ordinal scale of measurement and researchers typically prefer parametric models to their nonparametric analogues, the empirical question regarding the robustness of various parametric statistical models to scale violations arises. Davison and Sharma (1988) and Maxwell and Delaney (1985) demonstrated through mathematical derivations that there is little cause for concern when comparing mean group differences in the independent samples t test when the assumptions of normality and homogeneity of variance are met. However, Davison and Sharma (1990) subsequently demonstrated that scaling-induced spurious interaction effects could occur with ordinal-level observed scores in multiple regression analyses. These findings suggest that scaling may become a problem when a multiplicative interaction term is introduced into a parametric statistical model.
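To make the model at issue concrete: a moderated multiple regression tests an interaction by adding a product term to the linear model and examining its t test. The following minimal Python sketch fits y = b0 + b1*x + b2*z + b3*(x*z) by ordinary least squares on simulated data with no true interaction; the function name, sample size, and coefficients are illustrative only, not the design of any study cited here.

```python
import numpy as np

def mmr_interaction_t(x, z, y):
    """Fit y = b0 + b1*x + b2*z + b3*(x*z) by OLS and return the
    coefficient vector and the t statistic for the interaction term b3."""
    n = len(y)
    X = np.column_stack([np.ones(n), x, z, x * z])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - X.shape[1])            # residual variance
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, beta[3] / se[3]

# Simulate a purely additive latent-level model (no true interaction)
rng = np.random.default_rng(1)
n = 1000
x = rng.standard_normal(n)
z = rng.standard_normal(n)
y = 0.5 * x + 0.5 * z + rng.standard_normal(n)

beta, t_int = mmr_interaction_t(x, z, y)
# On this interval-level latent metric, b3 should be near zero; the
# scaling problem discussed above arises when y is replaced by a coarse,
# floor- or ceiling-limited observed score.
```

The spurious-interaction concern is that applying a monotone but nonlinear (ordinal) scoring of y can push the estimated b3 away from zero even though the latent model is purely additive.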

Scaling and Item Response Theory (IRT)

An alternative solution to the scale of measurement issue for parametric statistics is to rescale the raw data itself into an interval-level metric, and a variety of methods for this rescaling have been proposed (see Embretson, 2006; Granberg-Rademacker, 2010; Harwell & Gatti, 2001). A potential method for producing scores with near interval-level scaling properties is the application of IRT models to transform number-correct scores into estimated theta scores, the IRT-derived estimate of an individual's ability or latent construct standing. Conceptually, the attractiveness of this method rests with the invariance property in IRT scaling, and such scores may provide a more appropriate metric for use in parametric statistical analyses.1 Reise, Ainsworth, and Haviland (2005) stated that

Trait-level estimates in IRT are superior to raw total scores because (a) they are optimal scalings of individual differences (i.e., no scaling can be more precise or reliable) and (b) latent-trait scales have relatively better (i.e., closer to interval) scaling properties. (p. 98, italics in original)

In addition, Reise and Haviland (2005) gave an elegant treatment of this condition by demonstrating that the log-odds of endorsing an item and the theta scale form a linearly increasing relationship. Specifically, the rate of change on the theta scale is preserved (for all levels of theta) in relation to the log-odds of item endorsement.
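This linearity can be sketched with the two-parameter logistic model (a_j and b_j denote the item discrimination and location parameters; the Rasch case sets a_j = 1):

```latex
P_j(\theta) = \frac{1}{1 + e^{-a_j(\theta - b_j)}}
\qquad \Longrightarrow \qquad
\ln \frac{P_j(\theta)}{1 - P_j(\theta)} = a_j(\theta - b_j)
```

The log-odds of endorsing item j is thus a straight line in theta with slope a_j, so equal distances anywhere on the theta scale correspond to equal changes in log-odds, which is the interval-like property being claimed.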

Empirical Evidence of IRT Scaling

In a simulation testing the effect of scaling and test difficulty on interaction effects in factorial analysis of variance (ANOVA), Embretson (1996) demonstrated that Type I and Type II errors for the interaction term could be exacerbated when simple raw scores are used under nonoptimal psychometric conditions. Such errors occurred primarily due to the ordinal-level scaling limitations of simple raw scores, and the ceiling and floor effects imposed when an assessment is either too easy or too difficult for a group of individuals, a condition known as assessment inappropriateness (see Figure 1). Embretson fitted the one-parameter logistic (Rasch) model to the data and was able to mitigate the null hypothesis errors using the estimated theta scores rather than the simple raw scores. These results illuminated the usefulness of IRT scaling for dependent variables in factorial models, especially under suboptimal psychometric conditions. Embretson argued that researchers are often unaware when these conditions are present and can benefit from using appropriately fitted IRT models to generate scores that are more appropriate for use with parametric analyses.

[Figure 1 appears here: two panels, "Assessment Appropriateness" (top) and "Assessment Inappropriateness" (bottom), each plotting the trait score and test information distributions against theta from -4 to 4.]

Figure 1. A representation of the latent construct distribution and test information (reliability) distributions for appropriate assessments (top) and inappropriate assessments (bottom).
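The rescaling step described above can be sketched as maximum-likelihood theta estimation under the Rasch model: given known item difficulties, solve for the theta at which the expected number correct equals the observed number correct. The Python sketch below (item difficulties and the response pattern are made up for illustration) does this by Newton-Raphson; it is a minimal sketch, not Embretson's estimation procedure.

```python
import numpy as np

def rasch_theta(responses, difficulties, iters=50):
    """ML estimate of theta under the Rasch model, given 0/1 item
    responses and known item difficulties, via Newton-Raphson.
    Note: the ML estimate is undefined for all-correct or
    all-incorrect response patterns."""
    theta = 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(theta - difficulties)))
        grad = np.sum(responses - p)      # score function
        hess = -np.sum(p * (1.0 - p))     # second derivative of log-likelihood
        theta -= grad / hess
    return theta

# Hypothetical 5-item test: difficulties and one response pattern
b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
u = np.array([1, 1, 1, 0, 0])
theta_hat = rasch_theta(u, b)
```

At convergence the expected number correct, sum of P_j(theta_hat), matches the observed number correct, which is the defining property of the Rasch ML estimate.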

An important question that now arises is whether these characteristics extend to more complex IRT models such as the two- and three-parameter logistic models (dichotomous models with a discrimination and a guessing parameter, respectively) and polytomous models. Although the Rasch model demonstrates desirable measurement characteristics (i.e., true parameter invariance; Embretson & Reise, 2000; Fischer, 1995; Perline, Wright, & Wainer, 1979), it is sometimes too restrictive to use in practical contexts. However, the consensus answer to whether non-Rasch models can achieve interval-level scaling properties is "yes" (Embretson & Reise, 2000; Hambleton, Swaminathan, & Rogers, 1991; Harwell & Gatti, 2001; Reise et al., 2005). Investigations into the scaling properties of these more complex IRT models are thus necessary.
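For the polytomous case examined in this study, the graded response model defines each category probability as the difference between adjacent cumulative (boundary) response curves. A minimal Python sketch of Samejima's GRM category probabilities follows; the parameter values are illustrative only.

```python
import numpy as np

def grm_category_probs(theta, a, b):
    """Category probabilities under Samejima's graded response model.
    a: item discrimination; b: increasing category boundary locations
    (len(b) = number of categories - 1)."""
    # Cumulative boundary curves P*(X >= k), padded with 1 and 0
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - np.asarray(b, dtype=float))))
    cum = np.concatenate(([1.0], p_star, [0.0]))
    # Adjacent differences of the cumulative curves give category probs
    return cum[:-1] - cum[1:]

# Hypothetical 5-category item at theta = 0
probs = grm_category_probs(theta=0.0, a=1.5, b=[-1.5, -0.5, 0.5, 1.5])
```

Because the boundary locations are increasing, the cumulative curves are ordered and every category probability is positive; the probabilities necessarily sum to 1 across categories.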

In one extension of this sort, Kang and Waller (2005) simulated the scaling properties of simple raw scores, estimated theta scores derived from a two-parameter logistic IRT model, and assessment appropriateness with the interaction term in a moderated multiple regression (MMR) analysis. Similar to the findings of Embretson (1996), Kang and Waller discovered that using simple raw scores to operationalize a latent construct resulted in substantial inflations of the Type I error rate (>50%, or p > .50) for the interaction term in MMR under conditions of assessment inappropriateness. However, the IRT-derived theta score estimates were found to mitigate the Type I error rate to acceptable levels (<10%, or p < .10) under the same conditions. This extension demonstrated that the estimated theta scores from a non-Rasch IRT model could be used to better fit the assumptions of parametric statistical models involving an interaction term. Finally, Harwell and Gatti (2001) investigated the congruence of estimated (theta) and actual construct scores using a popular polytomous IRT model, the graded response model (GRM; Samejima, 1969, 1996). The authors posited that if the estimated construct (theta) scores were sufficiently similar to the actual construct (theta) scores, which