
Examination of the Psychometric Properties of the Measurement Skills Assessment

Marina Vasilyeva
Larry H. Ludlow
Beth M. Casey
Caroline St. Onge

Boston College

This article introduces the Measurement Skills Assessment (MeSA), which was designed to evaluate the mastery of measurement in elementary school students. The primary objectives for the MeSA include covering a broad range of measurement concepts, distinguishing between major subtypes of measurement, and constructing a continuum of items varying in complexity. This article investigates the psychometric properties of scores from this new test. A series of principal components analyses helps distinguish between two major categories of measurement items—those tapping formula-based/analytical skills and those tapping spatial/conceptual skills. Rasch analysis provides evidence for the existence of a unidimensional structure of item complexity within each measurement category. The MeSA can be used by psychologists and educators to determine which aspects of measurement present the greatest difficulties for individuals as well as groups of children and to evaluate the effectiveness of different instructional approaches.

Keywords: measurement skills; elementary school students; test construction; principal components analysis; Rasch analysis

Educational and Psychological Measurement, Volume 69, Number 1, February 2009, 106-130. © 2009 Sage Publications. DOI: 10.1177/0013164408318774. http://epm.sagepub.com, hosted at http://online.sagepub.com.

Authors' Note: The research presented in this article was supported by a National Science Foundation award, no. HRD-0522491. We would like to thank the teachers and students who participated in the study. We are also grateful to Michele Tully, Allison Butler, and Ed Bowers, who helped us conduct the study. Please address correspondence to Marina Vasilyeva, Campion 239, Chestnut Hill, MA 02467; e-mail: [email protected].

This article introduces a new instrument designed to assess measurement skills in elementary school students. Measurement is a process involved in taking a continuous quantity, such as length, and breaking it into discrete countable segments (Wilson & Rowland, 1993). Thus, measurement allows us to use numbers to communicate and reason about continuous quantities. Measurement is one of the most widely used applications of mathematics and is considered to be one of the basic concepts for learning science (Clements & Bright, 2003; Lehrer, 2003). Although scientific measurement encompasses a wide range of phenomena, such as space, temperature, mass, and force, this article focuses on one particular domain, namely, measuring space. The purpose of the new testing instrument is to evaluate children's ability to combine numeric and spatial skills in measuring spatial dimensions, such as length, area, and volume.

The importance of measurement as a key mathematical and scientific tool has been established by psychologists as well as educators (Lehrer, Jaslow, & Curtis, 2003; Miller, 1989). The National Science Teachers Association (NSTA, 2003) considers measurement a foundational unifying concept, and the National Council of Teachers of Mathematics (NCTM, 2000) points out that measurement "bridges two main areas of school mathematics—geometry and number" (p. 103). Clearly, the ability to understand and apply measurement principles has major implications for children's future success in mathematics, science, and technology. Recognizing this fact, educators have established spatial measurement as a key area of mathematics instruction in the elementary grades (NCTM, 2000).

Yet the findings of large-scale national and international studies suggest substantial variability in the measurement skills of elementary school students (Lubienski, 2003; Mullis, Martin, Fierros, Goldberg, & Stemler, 2000). In particular, measurement items reveal the most consistent gender difference favoring males, starting at the fourth-grade level (Mullis et al., 2000; Mullis, Martin, Gonzalez, & Chrostowski, 2004). Furthermore, measurement is the mathematical strand showing the widest achievement gap among children from different socioeconomic backgrounds (Lubienski, 2003; Strutchens & Silver, 2000).

It should be noted, however, that many of these findings on gender- and socioeconomic status (SES)-related differences come from broad-range mathematical assessments in which measurement is only one of a number of math content areas dispersed throughout the test. This is true for national tests, such as the National Assessment of Educational Progress (NAEP) test of mathematics, as well as for large-scale international projects, such as the Trends in International Mathematics and Science Study (TIMSS). Although the existing instruments provide important evidence concerning the overall level of measurement skills, they may not contain enough items within specific categories of measurement to examine the sources of children's difficulties.

To obtain a deeper understanding of the extent and nature of individual differences in measurement skills, it is important to develop a systematic assessment tool. Such a tool should enable us to determine which aspects of measurement present the greatest difficulties for particular individuals as well as groups of students. Furthermore, it should allow us to examine possible interrelations between the various aspects of spatial measurement. Finally, it should provide a way to evaluate the effectiveness of different instructional approaches to teaching measurement.


In this article, we present a new instrument (the Measurement Skills Assessment [MeSA]) that was developed for use by psychologists and educators to address questions about individual differences in measurement. Measurement involves the intersection of numerical and spatial thought. Consequently, we were particularly interested in the distinction between two types of measurement items—those that primarily address skills at solving formula-based problems using numerical calculations and those that focus on problems requiring an understanding of spatial relations. There is evidence for individual differences in students' ability to solve math problems involving numerical and spatial skills (Kim, Plake, Wise, & Novak, 1990). Yet in relation to measurement, there is no assessment tool designed specifically to understand individual differences in how students solve these two types of problems across a range of measurement constructs.

Thus, in designing this test, we aimed to include items that tapped children's acquisition of formula-based/analytical knowledge as well as their spatial/conceptual understanding of measurement. Formula-based/analytical items can be solved on the basis of standard measurement facts and procedures. For example, calculating the area or perimeter of a rectangle when given information about side lengths requires using known formulas. In contrast, spatial/conceptual items cannot be solved simply by relying on known formulas. Rather, they require reasoning about the concepts underlying measurement procedures. An example of a spatial/conceptual item would be to figure out the length of a line segment that is presented along with a ruler but is not aligned with the beginning of the ruler. Solving this problem requires a shift from the standard measuring procedure; it cannot be solved by reading off the ruler the number corresponding to the end of the line.

Existing evidence indicates that elementary school students may have particular difficulty with conceptual measurement problems that require deviating from standard measurement procedures or reasoning about spatial relations (Barrett & Clements, 2003; Battista, 2003; Kamii, 1995). In fact, it has been suggested that children's difficulties with measurement arise because they learn measurement procedures that are limited to certain types of tasks and, lacking conceptual understanding, are unable to transfer their knowledge to other contexts (Shaw & Pucket-Cliatt, 1989). Both researchers and educators have pointed out the importance of testing not only how well students have learned standard procedures but also how deeply they understand the concepts underlying these procedures (Battista, 2003; Lehrer, 2003; Wilson & Rowland, 1993). The present test was designed to evaluate both students' knowledge of measurement formulas and their ability to reason conceptually according to measurement principles in situations in which the problem cannot be solved using formulas. Furthermore, we envisioned the construct of measurement as a continuum stretching from relatively simple to more complex skills, concepts, and operations.

The purpose of this article is to discuss the development of the measurement skills assessment and the psychometric properties of scores obtained by administering the MeSA to elementary school students. The development of this instrument included three phases. In the first phase, we piloted an extended set of items organized into four test versions. Following the analysis of results obtained in the first phase, the pool of items was reduced. In the second phase, we piloted a more focused set of items organized into two test versions. On the basis of the results of the second pilot, we constructed a final set of items that underwent extensive psychometric examination, resulting in the single test version reported herein.

    Method

    Sample

Rationale for the target age. Introduction of measurement concepts to students in the United States begins in the first grades of elementary school. Students gradually extend their range of measurement skills, starting by learning standard procedures for measuring length and proceeding to the measurement of perimeter, area, and volume. At the fourth-grade level, when students start to work consistently on a variety of measurement tasks, significant problems become apparent. Reports from TIMSS show that although the United States made some gains in mathematics and science at the eighth grade relative to other countries, few gains were made at the fourth grade (Mullis et al., 2004). Furthermore, at this age there is already strong evidence of gender and SES differences in measurement (Robitaille, 1989). For these reasons, we selected fourth grade as the target grade level for our study.

Sample characteristics. Our participants included fourth-grade students from public schools in the Greater Boston area. As indicated above, the development of the instrument involved three phases. In Phase I, we tested 201 students (122 girls, 79 boys). The participating schools were drawn from three communities representing distinct socioeconomic levels. Higher income schools (63 participants) served a community with a median family income of $92,993; middle-income schools (76 participants) served a community with a median family income of $61,843; and lower income schools (62 participants) served a community with a median family income of $34,377 (all economic indicators are based on the 2000 Census). The sample was racially and ethnically diverse, reflecting the composition of the three communities. In particular, the student population in the higher income community was, on average, 8% African American, 61% White, 9% Hispanic, and 18% Asian. In the middle-income community, the student population was, on average, 51% African American, 24% White, 8% Hispanic, and 15% Asian. In the low-income community, the student population was, on average, 39% African American, 13% White, 37% Hispanic, and 8% Asian. The remaining students in the three communities represented other ethnic and racial groups, falling primarily into the multiracial category (all demographic indicators are based on statistics provided by the Massachusetts Department of Education). In addition to community-level indicators, demographic information about individual schools can be provided on request to the first author. Generally, the ethnic distribution of students within participating schools was similar to the overall distribution in the community.

In Phase II, we tested 279 children (153 girls, 126 boys) recruited from the same three communities as the Phase I participants—83 students were from higher income schools, 100 from middle-income schools, and 96 from lower income schools. Following the first two phases of the study, we decided to eliminate the middle-income group from further testing. This was done because in Phase III, in which we finalized our testing instrument, our goal was to explore the nature of individual- and group-level differences among students. Thus, it was important to have a clear differentiation between the socioeconomic groups included in the study. A detailed analysis of the communities participating in Phases I and II revealed that our middle-income sample was more heterogeneous than the other two groups. Some schools recruited from the middle-income community looked more like our higher income schools, whereas others looked more like our lower income schools. The data from the middle-income schools were also too heterogeneous to differentiate income effects. Thus, in Phase III we tested 215 children (109 girls, 106 boys); 105 were from higher income schools and 110 were from lower income schools.

    Instrument

In Phase I, the test included four versions of 41 items each. The items were created to reflect the range of measurement skills and concepts represented in the elementary school curriculum (through fourth grade). We examined the NCTM Principles and Standards for School Mathematics (2000) to determine educational goals and academic standards with respect to measurement. Furthermore, we examined research literature that provided the basis for identifying key measurement skills and concepts as well as the challenges faced by elementary school students in mastering these concepts (e.g., Battista, 1999; Battista & Clements, 1998; Clements & Bright, 2003; Lehrer, 2003; Wilson & Rowland, 1993). On the basis of this analysis of academic standards and research literature, we designed a wide range of items reflecting measurement content for the fourth-grade level.

The items varied along several dimensions. First, they covered different content areas of spatial measurement, including linear measurement (length and perimeter), area, and volume. Second, within each content area, the items varied in whether they required the application of formula-based/analytical knowledge (see Figure 1 for examples) or spatial/conceptual understanding of measurement principles (see Figure 2 for examples). Finally, within each category described above, the items were designed to cover different difficulty levels. No item required a correct response to any previous item; that is, the items were intended to maintain local independence.

Each item was presented in a multiple-choice format. This choice of format was based partly on practical considerations—it simplified the scoring of responses for hundreds of children. Furthermore, the multiple-choice format allowed us to compare children's performance to chance level and to examine how likely children were to select particular foils indicative of specific misconceptions.

During the Phase I testing, we purposefully used a greater number of items than needed so that eliminating some of them would still leave a sufficient number of appropriate items covering a range of measurement concepts.

Figure 1. Examples of Formula-Based/Analytical Items. [Two multiple-choice items: a rectangle pictured with ruler markings in inches along its sides, asking "What is the PERIMETER of the rectangle in inches?" (choices: 15, 8, 16, and 10 inches), and a word problem, "Joshua drew a rectangle. One side of the rectangle was 2 inches long and the other side was 5 inches LONGER. What was the AREA of the rectangle?" (choices: 10, 7, 14, and 9 square inches).]

Figure 2. Examples of Spatial and Conceptual Items. [Two multiple-choice items: "Jada built a layer of blocks and then placed two towers on top. What is the VOLUME of the whole structure?" (choices: 11, 8, 10, and 12 cubes), and "John made four guesses about the HEIGHT of the door (in feet) to his bedroom. Which one do you think is the best guess?" (choices: 5, 7, 12, and 20 feet).]

On the basis of the analysis of data obtained during Phase I, we reduced the original set by almost half. There were several main criteria for eliminating items. First, we examined each item's ability to reveal individual differences. For example, items with no response variance were eliminated. Most of these items showed either a ceiling effect (too easy for study participants) or a floor effect (too difficult for study participants). Next, we examined items in terms of their content validity. Several items were judged redundant because they presented very similar questions with trivial variations in either their prompts or the numbers used. In each such case, we eliminated the less discriminating item of the redundant pair (based on their respective item-total correlations). Finally, in the process of testing, it became apparent that several items had unclear instructions, because many participants asked questions to clarify them. In revising the items, we either clarified the instructions or, when unable to do so, eliminated the item.

In Phase II, the test included two versions of 42 items each. Most of these items were identical to those used in the previous phase; several were modified to clarify instructions. The data from Phase II were examined using the same criteria as described above. This led to finalizing the assessment instrument—a single version created on the basis of the two waves of piloting. This version included 44 items plus one practice item that was used to introduce the procedure.

The 44 items used in Phase III of the study included problems that were hypothesized to require either formula-based/analytical knowledge or spatial/conceptual understanding of measurement; there was also a small subset of problems that appeared to involve both components. The Phase III analysis presented below allowed us to clarify the distinction between the two categories and to eliminate items that could not be clearly categorized as formula-based/analytical or spatial/conceptual.

    Results

    Validity

We sought to ensure the content validity of the MeSA through two main mechanisms. First, as indicated above, we designed the items on the basis of educational research and curricular materials on measurement for fourth-grade students. We also examined the goals and standards related to measurement as formulated in the NCTM guidelines. Those materials and guidelines allowed us to identify the measurement skills and concepts that fourth-grade students are expected to have mastered. Second, the initial pool of items was reviewed by a team of five experts from the fields of developmental psychology and math education as well as by practicing elementary school teachers. On the basis of this content analysis, we were confident that we had covered the full range of measurement concepts expected of students at the target grade level and were able to eliminate redundancy in the form of similar items.

Next, we used a series of principal components analyses to examine the structure of the obtained scores. In designing the test, our primary goal was to tap children's formula-based/analytical knowledge as well as their spatial/conceptual understanding of measurement. Consequently, we categorized the items into these two types depending on the hypothesized cognitive process involved in solving each item. Consistent with this goal, in conducting the principal components analyses, we were specifically interested in determining whether these two types of items were in fact distinct from one another. Thus, we extracted a two-component solution. An initial exploratory analysis indicated that the majority of the items indeed fell into two groups separated along the spatial/conceptual versus formula-based/analytical distinction. Component 1 (spatial/conceptual) primarily involved problems that required spatial reasoning, whereas Component 2 (formula-based/analytical) primarily involved problems that could be solved analytically by applying or numerically manipulating known formulas.

On the basis of this initial exploratory analysis, a number of items were eliminated from the test. Three items required knowledge of the volume formula, which had not been taught in some of the participating classrooms by the time we conducted the test. Seven additional items were eliminated because they were not strictly formula-based/analytical or spatial/conceptual problems but rather involved a combination of both analytical and spatial strategies; these were the items that we were not able to categorize into either measurement subtype. For example, one such item required both knowledge of the formula for the area of a rectangle and the ability to estimate area using mentally generated units.

Three checks on the structure of the correlation matrix were performed prior to the final analysis of the remaining 34 items: The determinant was nonzero, the Kaiser-Meyer-Olkin statistic was in the "satisfactory" range (.859), and the test for sphericity was significant. These results provided evidence of item covariance sufficient for the next analyses. It should also be noted that an oblique rotation was run as an initial check on the correlation between the components. It revealed that the correlation between Components 1 and 2 was quite low (−.33). This finding was taken as evidence of an essentially orthogonal relation between the components, and a Varimax rotation was then performed. The Varimax option was chosen because it maximizes the variance on the factors, not on the items or some combination of the two.

A principal components extraction with Varimax rotation of a two-component solution accounted for 30% of the variance. The eigenvalues for the first two components were 8.26 and 2.10. When rotated, they accounted for 15.2% and 14.8% of the variance, respectively. Although each component consisted of individually strong items, it was clear—and expected—that some items still cross-loaded. Hence, subsequent replications of these analyses could show minor shifts in component identification for a few items. Nevertheless, it was felt that further reductions in the item set would threaten the content validity of the test. The two-component solution is presented in Table 1.
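
For readers who wish to reproduce this style of analysis programmatically, the following Python sketch is our illustration of the procedure described above, not the authors' code (which relied on standard statistical software); the scores matrix and function names are ours. It extracts principal-component loadings from an item correlation matrix and applies a Varimax rotation:

    import numpy as np

    def pca_loadings(scores, n_components=2):
        # Principal components of the item correlation matrix; loadings are
        # eigenvectors scaled by the square roots of their eigenvalues.
        corr = np.corrcoef(scores, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(corr)
        order = np.argsort(eigvals)[::-1][:n_components]
        return eigvecs[:, order] * np.sqrt(eigvals[order])

    def varimax(loadings, gamma=1.0, max_iter=50, tol=1e-6):
        # Standard Varimax rotation of a p-by-k loading matrix.
        p, k = loadings.shape
        rotation = np.eye(k)
        criterion = 0.0
        for _ in range(max_iter):
            rotated = loadings @ rotation
            u, s, vt = np.linalg.svd(loadings.T @ (
                rotated ** 3
                - (gamma / p) * rotated @ np.diag(np.diag(rotated.T @ rotated))))
            rotation = u @ vt
            new_criterion = s.sum()
            if criterion != 0.0 and new_criterion / criterion < 1.0 + tol:
                break
            criterion = new_criterion
        return loadings @ rotation

    # scores: a students-by-items matrix of 0/1 responses
    # rotated = varimax(pca_loadings(scores, n_components=2))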

A three-component solution was also extracted because we hypothesized that the formula-based/analytical component might consist of two subsets of items characterized by the use of area formulas or perimeter formulas. This analysis did split the formula-based/analytical component of the two-component solution into two lesser components: one consisting of area formula items, the other consisting of perimeter formula items. Both the two- and three-component solutions were consistent with our theoretical expectations concerning the spatial/conceptual versus formula-based/analytical items. Although the three-component solution allowed for a more precise differentiation of the formula-based items, we elected to retain the two-component solution for parsimony in test reporting and interpretation.

Table 1
The Two-Factor Varimax Solution: Rotated Component Matrix

Item    Component 1    Component 2    h2
m10     .686           .301           .561
m3      .642           .354           .537
m17     .588           .334           .457
m42     .570           −.023          .325
m15     .567           .276           .398
m18     .536           .054           .290
m30     .527           .221           .327
m21     .469           .067           .224
m36     .459           .024           .211
m40     .451           .082           .210
m37     .442           .149           .218
m11     .429           .397           .342
m8      .422           .400           .338
m39     .417           .168           .203
m24     .404           .277           .240
m32     .379           .170           .173
m31     .354           .210           .169
m34     .345           .341           .235
m9      .339           .243           .174
m45     .274           −.145          .096
m33     .043           .798           .639
m35     .145           .697           .507
m43     .066           .672           .456
m44     .077           .581           .343
m16     .049           .534           .288
m28     .256           .529           .345
m7      .269           .470           .293
m29     −.011          .433           .188
m27     .261           .425           .249
m12     .220           .420           .225
m41     .300           .417           .264
m38     .336           .410           .281
m6      .150           .409           .190
m26     .305           .322           .197
Eigenvalue   8.26      2.10

Note: Extraction method: principal components analysis. Rotation method: Varimax with Kaiser normalization. Rotation converged in three iterations. The eigenvalue of the third, unretained component was 1.62. h2 = communality coefficient.
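
As a reading aid (our gloss, not part of the original table note): each communality h2 is the sum of the item's squared loadings on the retained components; for item m10, for example, $h^2 = .686^2 + .301^2 \approx .561$.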

In addition, a parallel analysis was performed as a check on the number of factors. It has been pointed out (Henson & Roberts, 2006) that parallel analysis provides the most accurate procedure for making factor retention decisions. Generally, in this analysis a random data set is generated with the same number of observations and variables as the original empirical data. When the ith eigenvalue for the empirical data is greater than the corresponding eigenvalue for the generated data, the component is taken as real; otherwise, the components are viewed as mostly random noise. Here, in the parallel analysis, 34 random variables with mean zero and unit variance were subjected to a principal components analysis. The first three eigenvalues for the empirical data were 8.26, 2.10, and 1.62. The first three eigenvalues for the generated data were 1.82, 1.71, and 1.69. These findings lend support to the decision to retain a two-component solution.
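
The logic of this check is easy to script. A minimal Python sketch (illustrative only; the analysis above used a single generated data set, whereas this version averages over several simulations, a common variant of Horn's procedure):

    import numpy as np

    def parallel_analysis(data, n_sims=100, seed=0):
        # Compare empirical PCA eigenvalues with the mean eigenvalues
        # of random standard-normal data of the same shape.
        rng = np.random.default_rng(seed)
        n_obs, n_vars = data.shape
        empirical = np.sort(np.linalg.eigvalsh(
            np.corrcoef(data, rowvar=False)))[::-1]
        random_mean = np.zeros(n_vars)
        for _ in range(n_sims):
            sim = rng.standard_normal((n_obs, n_vars))
            random_mean += np.sort(np.linalg.eigvalsh(
                np.corrcoef(sim, rowvar=False)))[::-1]
        random_mean /= n_sims
        # Retain components whose empirical eigenvalue exceeds chance
        return int(np.sum(empirical > random_mean))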

The items included in the MeSA were designed to cover a wide range of measurement skills and concepts. Although we hypothesized that these items can be divided broadly into spatial/conceptual and formula-based/analytical categories, there were other differentiating factors within each category as well. For example, the spatial/conceptual category covered a variety of distinct skills, such as being able to estimate the length of a two-dimensional (2D) object versus being able to estimate the volume of a three-dimensional (3D) object by taking into account invisible parts. Because of the practical constraints on testing (the whole test had to be completed within a single class period), each of the specific skills within the spatial/conceptual category was represented by a small number of items. If we had multiple items for each of these skills, they would likely define their own separate skill types. This may explain why there was substantial unaccounted-for variance in the two-component solution. However, it is important to point out that this solution was consistent with the broad categorical distinction originally hypothesized. Further research will ultimately support the decision to retain two components or point to an expansion of the score reporting for diagnostic purposes.

The ultimate purpose of the MeSA in general, and of the principal components analyses specifically, was to produce measurement scales that effectively reveal individual differences between students. The preceding analyses provided evidence of content and construct validity in considering the structure of the obtained scores in terms of two scales. We now address the reliability of the scores generated by the scales and the extent to which these scales captured individual differences for measuring status and growth.

    Reliability

First, we computed the interrater reliability for the coding of items as spatial/conceptual versus formula-based/analytical. Three independent coders were given the criteria for each of the two categories and were asked to categorize the items included in the Phase III testing. They were in agreement on 91% of the items. For the remaining items, the coders commented that those items combined elements of formula-based/analytical and spatial/conceptual knowledge; these items did not fit any component in our principal components analyses. It should be pointed out that the coders were in perfect agreement (100%) on all the items included in the final solution from the principal components analysis discussed above.

The internal consistency of the scores, assessed through Cronbach's alpha, was .90. It should be noted, however, that for a test conceptualized as consisting of different scales, it is more meaningful to assess internal consistency for each scale (Henson, 2001). Hence, we calculated Cronbach's alpha for each component separately. For the two-component solution, the alphas were .85 for the spatial/conceptual and .82 for the formula-based/analytical components. In the three-component solution, the alphas for the spatial/conceptual, formula-based area, and formula-based perimeter components were .85, .80, and .63, respectively. It is generally accepted that for research purposes, internal consistency estimates of .80 and above are sufficient to consider scores reliable; even in the early stages of instrument development, the desired magnitude of alpha is at least .70 (Nunnally & Bernstein, 1994). The fact that the Cronbach's alpha for the third component in the three-component solution was below these accepted standards further contributed to our decision to retain only two components.
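
For reference, Cronbach's alpha is computed from the item variances and the variance of the total score; a generic Python sketch (not the authors' code) for dichotomously scored items:

    import numpy as np

    def cronbach_alpha(scores):
        # scores: a students-by-items array of 0/1 responses
        scores = np.asarray(scores, dtype=float)
        k = scores.shape[1]
        item_variances = scores.var(axis=0, ddof=1)
        total_variance = scores.sum(axis=1).var(ddof=1)
        # alpha = k/(k-1) * (1 - sum of item variances / total-score variance)
        return (k / (k - 1.0)) * (1.0 - item_variances.sum() / total_variance)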

    Item Analysis

The present test was designed to cover a wide continuum of complexity so as to allow us to distinguish between students demonstrating different levels of mastery of measurement skills and concepts. In this section, we discuss the structure of the relative complexity of our measurement items. First, we present the rationale for the hypothesized structure of the items within each of the two main scales (formula-based/analytical and spatial/conceptual). Next, we present the results of a Rasch measurement model analysis that served as a confirmatory test of this hypothesized structure.

Formula-based items. For each scale, we defined several criteria that were critical in determining the theory-based development and subsequent inclusion of items. The criteria for the formula-based/analytical scale are listed in Table 2. As indicated in the table, we specified that items differed in whether (a) the problem required a straightforward use of a formula or some manipulation of a formula, (b) the problem involved difficult or simple computations, and (c) the problem was presented in a standard or nonstandard manner that could trick children into using incorrect procedures. The formula-based/analytical problems included in the present test fell into three complexity levels according to these criteria.

Table 2
Hypothesized Complexity Levels for the Formula-Based/Analytical Items
(Criteria: formula manipulation required; difficult computational level; nonstandard presentation)

Complexity Level     Criteria Applying                          List of Items in Each Level
Simplest             None of the criteria applies               6, 7, 16, 26, 27, 43
Moderately complex   Only one of the three criteria applies     29, 12, 41, 35, 38, 33
Complex              Two of the three criteria apply            44, 28

The simplest problems involved using a formula in a straightforward way, with all required numeric components provided and children simply having to "plug" the numbers into the formula. For example, in several items children were given information about the side lengths of a rectangle and were asked to calculate the area (item m6) or perimeter (items m26 and m27) of the rectangle. These problems were computationally simple (e.g., they required multiplying 5 by 2). Some of these problems did not involve a pictorial representation of the rectangle and others did; if there was a picture, it was presented in a standard way that students were familiar with.

The moderately complex problems were items that satisfied one of the criteria listed in Table 2. Some items in this category differed from the simplest items only in that they involved more complex calculations. For example, item m35 required multiplying 6 by 9, which is known to be a more complex task for elementary school students than single-digit multiplication involving smaller digits. Other items in this category differed from the simplest items in that they involved a nonstandard presentation that could lead the child to choose an incorrect formula. For example, in item m33 children were asked to find the area of the depicted rectangle—the picture showed a rectangle with all four sides marked with numbers. This presentation could bias children to add up all four numbers and thus calculate the perimeter rather than the area. We hypothesized that avoiding the "trap" created by the picture would pose a challenge for children because they are known to rely on superficial aspects of problem presentation in solving mathematical tasks.

Finally, the complex problems were ones in which some of the standard components of measurement formulas (i.e., side lengths) were not provided and had to be inferred through analytical reasoning and additional computation. For example, in item m44 children were asked to figure out the width of a rectangle from information about its length and perimeter. In this problem, rather than solving the familiar task of plugging side lengths into a formula to calculate the perimeter, children had to reason backwards, starting from the perimeter and arriving at the side length. We hypothesized that these problems would constitute the most complex tasks on the formula-based/analytical scale because of their analytical demands and computational load.
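
To make the backwards reasoning concrete (with hypothetical numbers, not the actual item values): from the perimeter formula $P = 2(l + w)$, the unknown width is $w = P/2 - l$; for instance, a perimeter of 16 inches and a length of 5 inches give $w = 8 - 5 = 3$ inches.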

Spatial/conceptual items. Like the formula-based/analytical items, the spatial/conceptual items were also designed to cover a wide continuum of complexity. However, the basis for the level of complexity differed between the two scales. Whereas the complexity of formula-based/analytical items depended largely on the demands for analytical reasoning and numeric computations, the complexity of the spatial/conceptual items depended largely on their spatial processing load, that is, the degree to which spatial reasoning was taxed. The criteria used for developing and including the spatial/conceptual items are listed in Table 3. As indicated in the table, we specified that items differed in whether (a) the problem required generating a mental image of the object to be measured and/or the unit of measure; (b) the problem required manipulating a mental image, for example, mentally moving the unit of measure relative to the object; and (c) the problem required a precise spatial comparison, in which the solution involved identifying the correct number of measurement units rather than just saying which measurement unit would be most appropriate. On the basis of these criteria, the spatial/conceptual problems included in the present test fell into three complexity levels (listed in Table 3).

Table 3
Hypothesized Complexity Levels for the Spatial/Conceptual Items
(Criteria: generation of mental image required; manipulation of mental image required; precise spatial comparison required)

Complexity Level     Criteria Applying                          List of Items in Each Level
Simplest             Only one of the three criteria applies     32, 21, 40, 37
Moderately complex   Two of the three criteria apply            42, 36, 15, 9, 11, 24, 8, 3, 39, 34, 10, 31
Complex              All three criteria apply                   18, 45, 17, 30

The simplest level involved items that satisfied only one of the criteria listed in Table 3. This included items that required establishing spatial relations between objects but did not require generating or manipulating spatial images. For example, item m32 involved considering the relation between the units of measure and the objects to be measured, but both units and objects were shown pictorially and no mental movement of units was required. This level also included items that involved generating images of units and/or objects to be measured but did not require a precise numerical answer. Although generating mental images can increase the complexity of a task, in this subset of problems the demand was mitigated by the fact that children did not have to produce a precise numeric response and could rely on a crude comparison of the size of the unit and the size of the object. For example, in item m37, children were asked which of four given objects (a coin, a paper clip, a car, and a distance between cities) is usually measured in feet. In this problem, the spatial images could be generated crudely and no mental movement was required to compare the sizes of objects.

The moderately complex level involved items that satisfied any two of the three criteria listed in Table 3. This included problems in which an object and a unit were shown but mental movement of the unit was required to measure the object. For example, in item m34 children were shown two objects; they were to measure the larger object using the smaller one as a unit. This required mentally moving the smaller object to juxtapose it with the larger one and determining how many iterations it took to completely cover the larger object. Another example from this level is item m36, in which children had to generate an image of an invisible part of an object but did not have to carry out any mental movements. To solve this problem, children had to count the number of cubic units that made up the depicted figure, which required taking into account the invisible part of the 3D object.

Finally, the complex items included problems in which neither the unit of measure nor the object was shown—both had to be mentally generated and moved (through iterations). For example, in item m30 children were asked to decide which of four answer choices (20, 12, 7, and 5 ft) represented the best guess about the height of a bedroom door. In this problem, children had to generate an image of a door and an image of a foot and then mentally iterate the image of the unit (a foot) to estimate how many feet were needed to span the height of the door.

It should be noted that the complexity levels defining the spatial/conceptual and formula-based/analytical scales were intended to be relatively continuous and hierarchical. That is, there should be a relatively smooth transition from one level to the next in item difficulty, and success on items at higher levels of complexity generally would require success on less complex items. This is not to say, however, that a correct response on any specific item required a correct response on any other item.

    Rasch Analyses

Rasch measurement models are used as confirmatory tests of the extent to which scales have been successfully developed according to explicit a priori measurement criteria. These criteria include the following: (a) items operationally define a continuum along a unidimensional variable, (b) items map a hierarchical ordering in their definition of the variable, and (c) items possess the property of local independence.


Rasch models are powerful psychometric tools because they influence the way test developers conceptualize and subsequently construct measurement instruments. When the data fit a Rasch model, the continuous scale is analogous to a linear ruler that is invariant in terms of the level of ease or difficulty of accomplishing the task for any individual appropriate for testing. With regard to measuring the two variables of spatial/conceptual and formula-based/analytic measurement skills in children, a Rasch model was used not because it would "fit" the data better than any other statistical model but because, if the data did fit the model, students could be located along the two continua according to their ability to accomplish simpler to more complex measurement tasks.

Starting from an a priori position of creating tasks (items) that define continuous and hierarchical levels for the respective scales, it was our intention to be able to literally locate and position a student on each scale and then be able to describe the types of items, or tasks, most closely associated with that student's score. The opportunity for diagnostic analysis and subsequent measurement of growth is greatly enhanced within this measurement framework.

The Rasch dichotomous model (Rasch, 1960/1980; Wright & Stone, 1979) was used for these data. The same analysis was performed on the two scales, that is, formula-based/analytic and spatial/conceptual, using the WINSTEPS software package (Wright & Linacre, 1998). The model may be represented as

$$p_{ni} = \frac{e^{\,b_n - d_i}}{1 + e^{\,b_n - d_i}},$$

where $p_{ni}$ is the probability of student $n$ responding correctly to item $i$, $d_i$ is the parameter (item "difficulty") for item $i$, and $b_n$ is the parameter (student "ability") for student $n$. These parameters define location estimates in a metric referred to as logits, which are often transformed into some other user-friendly distribution
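
Read concretely, the model says that success depends only on the gap between ability and difficulty. A small Python sketch of the probability function (our illustration; the estimation itself was done in WINSTEPS):

    import numpy as np

    def rasch_probability(ability, difficulty):
        # P(correct) under the dichotomous Rasch model, with both
        # parameters expressed in logits.
        return 1.0 / (1.0 + np.exp(-(np.asarray(ability) - np.asarray(difficulty))))

    # A student located at an item's difficulty succeeds with p = .50;
    # a student 1 logit above it succeeds with p of roughly .73.
    print(rasch_probability(0.0, 0.0), rasch_probability(1.0, 0.0))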

(Ludlow & Haley, 1995). These estimates simultaneously portray the structure of the specific measurement variable defined by the items and the location of each student along that variable. This representation of items and students through the construction of a "variable map" is one of the most attractive features of Rasch models because individual differences are readily evident and interpretable.

Figure 3 contains the variable map for the formula-based/analytic scale; Figure 4 contains the spatial/conceptual variable map. In each map, the items are ordered from easiest to accomplish (bottom of the map) to hardest to accomplish (top of the map). The students are ordered from lowest scoring (bottom of the map) to highest scoring (top of the map) in terms of their measurement skills. Below, we describe the distribution of items within each scale from easiest to most difficult.

Figure 3. Variable Map for the Formula-Based/Analytical Scale. [Pupil-item map on the logit scale; each '#' represents 2 pupils. Items run from m6 (easiest, at the bottom), through m26, m27, m29, m12, m41, m7, m16, m43, m35, and m38, up to m28, m33, and m44 (hardest, at the top).]

Figure 4. Variable Map for the Spatial/Conceptual Scale. [Pupil-item map on the logit scale; each '#' represents 2 pupils. Items run from m32 and m21 (easiest, at the bottom), through m40, m37, m42, m36, m15, m9, m11, m24, m8, m3, m39, m34, m10, m31, m17, m30, and m18, up to m45 (hardest, at the top).]

Formula-based scale. For the formula-based/analytical scale, it was easiest to accomplish item m6, followed by items m26 and m27. These items were hypothesized to be of the simplest level of complexity because they did not involve formula manipulation, difficult computations, or nonstandard presentation (e.g., in item m6, children had to calculate the areas of two separate rectangles, which could be done in a straightforward way by multiplying the side lengths provided). At the top of the map are the hardest items to accomplish. For example, items m44, m28, and m33 were all clustered at the top of the map. Items m44 and m28 were hypothesized to be very difficult because they involved formula manipulation and difficult computations (e.g., in item m28, rather than being provided with two side lengths to calculate the area of a rectangle, children were given the width and told that the rectangle was twice as long as it was wide).

Item m33 was originally considered to be of medium difficulty because its only complicating factor was the nonstandard presentation. As indicated above, the problem required children to calculate the area of a rectangle, but the picture showed the rectangle with all side lengths marked—a presentation typical of perimeter problems. The results indicate that a nonstandard presentation leading children to an incorrect strategy choice had a strong effect on their performance, moving this item into the most difficult category. Only 22% of children chose the correct response, and of those who made errors, 86% calculated the perimeter instead of the area.

In the middle of the distribution were items m35 down to m12 and m29. All items in this group were hypothesized to be of moderate complexity because they met one of the three criteria of difficulty listed in Table 2. For example, item m35 involved a straightforward formula use but required a calculation that is difficult at the fourth-grade level. Item m41 required formula manipulation, but it involved small numbers and thus did not require difficult computation (in contrast to items m28 and m44).

The student estimates for the formula-based items were widely spread across the range (a desirable finding), but there were some students with higher ability estimates than these items can precisely measure. Future research will address the construction of more complex tasks within each scale.

Spatial/conceptual scale. For the spatial/conceptual scale, it was easiest for students to accomplish items m32 and m21, followed by items m40 and m37. Note that all these items were hypothesized to be of the simplest level of complexity, as shown in Table 3. Indeed, solving these items did not require heavy spatial processing, either because both the unit of measure and the object were depicted or because only a crude image of a unit was required to solve the problem (e.g., in item m21, children were asked to choose the best unit of measure for the area of a notebook page, which could be solved by making a crude comparison of the size of a page to the different units).

At the top of the map are the hardest items to accomplish. For example, item m45 was harder to solve than item m18. In both items, 3D objects were depicted as 2D drawings; children were to estimate the volume of the 3D object in cubic units. Item m45 was substantially harder because the object to be measured was not divided in the picture into individual cube units, so the child had to mentally keep track of the number of unit iterations in three different dimensions. We expected this item to be very difficult because it required generating an image of the invisible part and mentally manipulating the unit to estimate the precise volume of the object. It should be noted, however, that item m18, which showed a prism divided into cube units, was still quite difficult for children. Whereas adults might find this item easy because it depicts a simple rectangular prism, for children who did not have much experience with the geometry of 3D objects, it was not obvious what was hidden behind the visible face. In fact, many of them (73% of students) widely overestimated the number of cubic units that comprised the prism.

Many of the items on this scale fell in the middle of the distribution, with items m17 and m30 at the top of this midrange and items m15 and m9 at the bottom. Both items m17 and m30 involved estimation of linear measurements when neither the unit nor the object was depicted. Consequently, on the basis of our theoretical scale, as shown in Table 3, we had placed these items in the "complex" category. They turned out to be slightly less difficult than we expected, which could be because they required linear estimation, as opposed to items m18 and m45, which required estimation in three dimensions. It should be noted, however, that items m17 and m30 were located close to item m18 and were in fact the most difficult items in the middle group. The remaining items in the middle of the distribution fit our definition of the moderately complex level, as shown in Table 3. These items met two of the three criteria of complexity. In other words, they involved generation and manipulation of mental images, or generation of mental images combined with precise spatial comparison.

The spatial/conceptual map shows that student estimates were spread across a wide range, even though some students had higher ability estimates than this set of items can precisely measure. It is encouraging, and consistent with our scale development expectations, that proceeding up the scale means addressing increasingly more complex tasks.

The spatial/conceptual item separation statistic was 5.9 units and the person separation was 2.1. For the formula-based/analytical scale, item and person separation were 4.2 and 1.8, respectively. These statistics take into account the standard deviation of the parameter estimates relative to the mean measurement error in those estimates (Wright & Masters, 1982). Given that these statistics are partly a function of the sample size and the number of items, respectively, the obtained values represent a satisfactory spread in the item and person estimates.
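
As a gloss on these statistics (the standard definition in this tradition, not a computation reported here): the separation index is $G = \sigma_{\mathrm{adj}} / \overline{SE}$, where $\sigma_{\mathrm{adj}} = \sqrt{SD_{\mathrm{obs}}^{2} - \overline{SE}^{2}}$ is the observed standard deviation of the estimates adjusted for the average error variance (Wright & Masters, 1982).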

Rasch goodness-of-fit analyses rely principally on standardized residuals—the difference between the observed response and the response expected under the model (Wright & Stone, 1979). A positive residual results when a higher than expected response occurs; negative residuals result from lower than expected responses. Although a variety of statistical and graphical procedures are available for analyzing residuals (Ludlow, 1985, 1986), a standard first approach is to consider summary statistics in the form of standardized, weighted mean squares. Although these statistics do not have exact degrees of freedom and critical values, a rich history of experience has developed about their general properties and utility (e.g., Smith, 1991).

We generally start with the standardized, weighted statistics because they are roughly analogous to t statistics and take into account the variance of the expected response (the INFIT ZSTD in the WINSTEPS software). A flexible criterion of +2 to +3 is often used initially. Because the item version of this statistic is easily inflated as the sample size grows beyond 500 or so people, we also check the unstandardized version (the INFIT mean square), for which a flexible criterion of 1.3 is often used to flag potential problems. These two criteria are generally sufficient to reveal consistent unexpected responses either made to an item (item fit) or made by a person (person fit).
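
A schematic Python sketch of these quantities under the standard definitions (WINSTEPS computes them internally; the function and variable names here are ours):

    import numpy as np

    def rasch_residual_fit(responses, abilities, difficulties):
        # responses: students-by-items 0/1 matrix; abilities and
        # difficulties are logit estimates from a Rasch calibration.
        b = np.asarray(abilities, dtype=float)[:, None]
        d = np.asarray(difficulties, dtype=float)[None, :]
        p = 1.0 / (1.0 + np.exp(-(b - d)))   # expected responses
        w = p * (1.0 - p)                    # model variance of each response
        z = (responses - p) / np.sqrt(w)     # standardized residuals
        # INFIT mean square per item: information-weighted average of z^2
        infit_ms = (w * z ** 2).sum(axis=0) / w.sum(axis=0)
        return z, infit_ms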

    A review of the fit statistics and standardized residuals on the two scales revealedonly one clearly problematic item. The spatial/conceptual scale analysis revealed that

    m45 (difficulty estimate= 2.09) generated numerous surprising correct responses

    from lower ability students (the item INFIT ZSTD and INFIT mean square were 4.3

    and 1.5, respectively). Although the presence of one misfitting item is certainly note-

    worthy, normal probability theory predicts the presence of at least one such item in

    each scale—simply because of chance variation. From a purely statistical perspec-

    tive, an argument may be made that this misfitting item is expected and hence of no

    consequence. Irrespective of that argument, however, it is possible to take the analy-

sis of the unexpected responses further.

Overall, m45 was the most difficult complex-level spatial/conceptual item. It was

    also the last item on the test. An inspection of the residuals on this item for all levels

    of student ability (not just the lowest scoring) revealed widespread numbers of unex-

    pected correct responses. It is reasonable, therefore, to conclude that many students,

    running out of time, simply guessed correctly on this item. If this item were located

    anywhere else in the test, its fit would presumably improve. There is no other evi-

    dence in these data that the item itself is flawed and must be discarded or revised.

    When the items within each scale were ordered by item sequence, item difficulty,

or item fit statistics, there were no discernible patterns. That is, the first items were not the easiest or best fitting, nor were the last items the hardest or poorest fitting.

    And neither the hardest items nor the easiest items were likely to be the best or the

    poorest fitting items, respectively. These results are desirable because the MeSA was

    not designed to start easy and end hard, nor is there evidence of any problematic

    ‘‘start-up’’ effects associated with the first items—which, in our experience, is often

    found when testing younger students.

Finally, a series of principal components analyses was performed on the stan-

    dardized residuals from the respective scales. These analyses served as checks on

the presence of multidimensional effects and violations of local indepen-

    dence. Theoretically, if the response patterns within a given scale can be attributed

    to a unidimensional construct and local independence holds, then correlations

    between item-level standardized residuals should be near zero. Hence, the analyses


    should generate eigenvalues all near one and the percentage of variance across the

    components should be uniform.
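The following sketch illustrates the check (a generic residual PCA on a persons-by-items matrix of standardized residuals; the function and simulated data are ours):

```python
import numpy as np

def residual_eigenvalues(z):
    """Eigenvalues of the item-by-item correlation matrix of standardized
    residuals (persons in rows, items in columns), sorted descending."""
    r = np.corrcoef(z, rowvar=False)  # residual correlations between items
    return np.sort(np.linalg.eigvalsh(r))[::-1]

# With residuals that respect local independence (simulated here as pure
# noise), all eigenvalues should sit near 1.
rng = np.random.default_rng(0)
print(residual_eigenvalues(rng.standard_normal((500, 20))))
```

A first eigenvalue well above the rest, or a cluster of strongly correlated residuals, would instead suggest a secondary dimension or locally dependent items.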

Within each scale, the first eigenvalue tended to be slightly greater than one and it accounted for somewhat more variance than the other components. The remain-

    ing eigenvalues ranged roughly from 1.2 to 0.8, and the variance distribution was

    roughly uniform across components. Various extractions and rotations of four-,

    three-, and two-component solutions yielded no meaningful combinations of items

    that could be labeled and interpreted as components. Individual components them-

    selves often consisted of a single high-loading item. Pairs of correlated item resi-

    duals did exist, but inspection of their respective item content did not reveal any

    obvious dependencies. According to these results, there was no compelling evi-

    dence of multidimensionality or of a violation of local independence.

    Discussion

    The purpose of this article was to present the development of a new test instru-

    ment designed to assess measurement skills in fourth-grade students and the psycho-

    metric properties of scores from this test. The instrument included items that covered

    a broad range of measurement skills and concepts. Our primary interest was in the

    ability of the test to distinguish between two subtypes of measurement—formula-based/analytical skills versus spatial/conceptual skills. The principal components

    analysis indeed showed evidence that the majority of items fit within these two sub-

    categories of measurement. Thus, the two-component solution was consistent with

    the hypothesized categorical distinction. Furthermore, the principal components ana-

    lysis allowed us to refine the construct validity of the test by identifying a few items

    that either loaded on both components or did not load on either; these items were

    subsequently eliminated from the final version of the test.

    The distinction between formula-based/analytical items and spatial/conceptual

items demonstrated in this article is important from both a theoretical perspective and an applied educational perspective. Measurement of space has been viewed

    as an area of integration of numerical and spatial thinking (Lehrer, 2003; Miller,

    1989; Wilson & Rowland, 1993). Our work allowed us to tease apart these two key

    components of measurement that have been posited theoretically. In prior research,

    investigators tended to focus on children’s problems within a particular type of mea-

    surement task, such as area or perimeter (e.g., Hiebert, 1984; Kamii, 1995; Nitabach

    & Lehrer, 1996; Outhred & Mitchelmore, 2000). However, different types of tasks

    vary in the degree to which they tap the numerical and spatial elements of measure-

ment. Having a test with two scales that capture different subtypes of measurement would enable researchers to obtain a more comprehensive picture of measurement

    skills and to determine which subtypes of measurement present particular difficulties

    for children.


    One potential application of the new instrument is in investigating the nature of 

    gender differences in measurement skills. The distinction between formula-based

and spatial/conceptual skills is potentially relevant to the issue of gender differences because one of the largest cognitive gender differences reported in the litera-

    ture is the male advantage on spatial skills (e.g., Halpern, 2004; Johnson & Meade,

    1987; Voyer, Voyer, & Bryden, 1995). Thus, boys’ advantage on measurement

    suggested on the basis of large-scale mathematical assessments may be particularly

    pronounced in the case of spatial/conceptual problems. In contrast, the subtype of 

    measurement problems that involve the use and manipulation of formulas is not

    likely to show a male advantage. These types of items depend in large part on the

    mastery of algorithms taught in school and on computational skills. There is evi-

dence indicating that girls may have an advantage on this type of task (Kim et al., 1990; McGraw, Lubienski, & Strutchens, 2006; Sabers, Cushing, & Sabers, 1987).

    Identifying the specific dimensions of measurement that present particular difficul-

    ties for girls versus boys may indicate strategies for improving measurement per-

    formance of these groups of students.

    In addition to identifying the two major scales capturing the spatial/conceptual

    and formula-based analytical aspects of measurement through principal compo-

    nents analysis, we tested the hierarchical structure of items within each scale. The

    Rasch analysis provided evidence for the existence of a unidimensional structure

within each of the scales. The Rasch results confirmed our hypotheses with respect to factors affecting complexity within the two subtypes of measurement items. In

    particular, it showed that the complexity of formula-based items increased with the

    analytical demands and the numeric difficulty of the problem, whereas the com-

    plexity of spatial/conceptual items increased with spatial processing load (i.e., the

    need to generate and manipulate spatial images). Having the test items organized

    in hierarchical map structures thus enables educators to identify the types of skills

that the child has mastered, to evaluate the child's current knowledge status, and to determine the next steps: where do we want the child to move next?

The analysis of items within the formula-based/analytical scale indicates that at the lower levels of the scale children demonstrate their ability to use formulas to figure out area or perimeter only under optimal conditions, such as low numeric complexity and problems accompanied by familiar pictorial

    representations of objects. Indeed, many children could solve area problems cor-

rectly when they required a relatively easy calculation (5 × 2), but when the problem required a more difficult calculation (9 × 6), lower-performing children were

    much more likely to produce an incorrect response by just adding the two numbers.

    This finding points to a fragile knowledge of measurement formulas—children

    appear to believe that if multiplication is hard to carry out, addition is the next best

    strategy. It also suggests that perhaps the initial instruction on measurement should

    involve low numerical complexity (so that calculation difficulties do not draw chil-

    dren’s attention away from the target measurement concepts), but children who


    have mastered formula use with simple numbers should be gradually challenged

    with increased numeric complexity.

Furthermore, the likelihood of a correct response was reduced by presenting children with pictures that were not typically associated with a particular question type. For

    example, when presented with an area problem where a picture included measure-

    ments for every side of a rectangle (which is commonly associated with calculating

    a perimeter), children tended to add up the numbers rather than multiplying them.

    These findings suggest that at the lower levels of performance, students can use for-

mulas in limited contexts but lack the conceptual understanding behind them.

    The new measurement test will be instrumental in revealing this kind of weakness

    and suggesting possible next steps for instruction. For example, when practicing

the use of formulas, students should be presented with problems in a variety of nonstandard contexts and formats so that they do not apply computations rotely,

    without thinking conceptually about the measurement question.

    Similar to the analysis of the formula-based/analytical scale, the item structure

    for the spatial/conceptual scale indicates what constitutes the lower levels of 

    knowledge and suggests the kinds of instruction that can help children improve

    their performance. We found that the students were quite successful in solving pro-

    blems that involved only generating a rough image of an object and no mental

    manipulation and tended to make more errors on problems with increasing spatial

processing load (i.e., image generation and detailed analysis and/or mental manipulation). For example, we found differences in performance as a function of spatial

    demands on items where children were asked to make judgments about the number

    of units needed to measure the object. When children were provided with pictures

    of the objects and units, this reduced the spatial processing load and made the task 

    easier. However, when the size of the unit was presented in numerical terms, and

    no pictures were provided, children had to generate and manipulate spatial images

    themselves and that made the task harder. This suggests that children who can only

    solve easier spatial/conceptual problems may benefit from activities in which they

are encouraged to generate and compare mental images, for example, by mentally moving a unit in juxtaposition with the object to be measured.

    In sum, the MeSA operationally defines two scales tapping two subtypes of 

    measurement skills: spatial/conceptual and formula-based/analytical. It can be used

    by researchers and educators to assess the child’s mastery of key measurement con-

    cepts presented at the elementary school level and to identify the next steps for

    children’s further acquisition of measurement knowledge at more advanced levels.

    The analysis reported in this article indicates sound psychometric properties of 

    scores from the two scales. However, it also points to certain limitations that need

    to be addressed in future research. In particular, the Rasch maps suggest that there

    are not enough items at the bottom and the top of the scales. Thus, the assessment

instrument would benefit from additional items that would better distinguish students at the lowest and highest levels of performance.


    References

Barrett, J. E., & Clements, D. H. (2003). Quantifying path length: Fourth-grade children's developing abstractions for linear measurement. Cognition and Instruction, 21, 475-520.

Battista, M. T. (1999). Fifth graders' enumeration of cubes in 3D arrays: Conceptual progress in an inquiry-based classroom. Journal for Research in Mathematics Education, 30, 417-448.

Battista, M. T. (2003). Understanding students' thinking about area and volume measurement. In D. H. Clements & G. Bright (Eds.), Learning and teaching measurement: 2003 yearbook (pp. 122-142). Reston, VA: National Council of Teachers of Mathematics.

Battista, M. T., & Clements, D. H. (1998). Finding the number of cubes in rectangular cube buildings. Teaching Children Mathematics, 4, 258-264.

Clements, D. H., & Bright, G. (Eds.). (2003). Learning and teaching measurement: 2003 yearbook. Reston, VA: National Council of Teachers of Mathematics.

Halpern, D. F. (2004). A cognitive-process taxonomy for sex differences in cognitive abilities. Current Directions in Psychological Science, 13, 135-139.

Henson, R. K. (2001). Understanding internal consistency reliability estimates: A conceptual primer on coefficient alpha. Measurement and Evaluation in Counseling and Development, 34, 177-189.

Henson, R. K., & Roberts, J. K. (2006). Use of exploratory factor analysis in published research: Common errors and some comment on improved practice. Educational and Psychological Measurement, 66, 393-416.

Hiebert, J. (1984). Why do some children have trouble learning measurement concepts? Arithmetic Teacher, 31, 19-24.

Johnson, E. S., & Meade, A. C. (1987). Developmental patterns of spatial ability: An early sex difference. Child Development, 58, 725-740.

Kamii, C. (1995, October). Why is the use of a ruler so hard? Paper presented at the annual meeting of the North American Chapter of the International Group for the Psychology of Mathematics Education, Columbus, OH.

Kim, H., Plake, B. S., Wise, S. L., & Novak, C. D. (1990). A longitudinal study of sex-related item bias in mathematics subtests of the California Achievement Test. Applied Measurement in Education, 3, 275-284.

Lehrer, R. (2003). Developing understanding of measurement. In J. Kilpatrick, W. G. Martin, & D. E. Schifter (Eds.), A research companion to principles and standards for school mathematics (pp. 179-192). Reston, VA: National Council of Teachers of Mathematics.

Lehrer, R., Jaslow, L., & Curtis, C. (2003). Developing understanding of measurement in the elementary grades. In D. H. Clements & G. Bright (Eds.), Learning and teaching measurement: 2003 yearbook (pp. 100-121). Reston, VA: National Council of Teachers of Mathematics.

Lubienski, S. T. (2003). A closer look at Black-White mathematics gaps: Intersections of race and SES in NAEP achievement and instructional practices data. Journal of Negro Education, 71, 269-287.

Ludlow, L. H. (1985). A strategy for the graphical representation of Rasch model residuals. Educational and Psychological Measurement, 45, 851-859.

Ludlow, L. H. (1986). Graphical analysis of item response theory residuals. Applied Psychological Measurement, 10, 217-229.

Ludlow, L. H., & Haley, S. M. (1995). Rasch model logits: Interpretation, use, and transformation. Educational and Psychological Measurement, 55, 967-975.

McGraw, R., Lubienski, S. T., & Strutchens, M. E. (2006). A closer look at gender in NAEP mathematics achievement and affect data: Intersections with achievement, race/ethnicity, and socioeconomic status. Journal for Research in Mathematics Education, 37, 129-150.

Miller, K. F. (1989). Measurement as a tool for thought: The role of measuring procedures in children's understanding of quantitative invariance. Developmental Psychology, 25, 589-600.

Mullis, I., Martin, M., Fierros, E., Goldberg, A., & Stemler, S. E. (2000). Gender differences in achievement: IEA's Third International Mathematics and Science Study. Chestnut Hill, MA: TIMSS International Study Center, Boston College.

Mullis, I., Martin, M., Gonzalez, E., & Chrostowski, S. (2004). TIMSS 2003 international mathematics report: Findings from IEA's Trends in International Mathematics and Science Study at the fourth and eighth grades. Chestnut Hill, MA: Boston College.

National Council of Teachers of Mathematics. (2000). Principles and standards for school mathematics. Reston, VA: Author.

National Science Teachers Association. (2003). Standards for science teacher preparation. Washington, DC: Author.

Nitabach, E., & Lehrer, R. (1996). Developing spatial sense through area measurement. Teaching Children Mathematics, 2, 473-476.

Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.

Outhred, L. N., & Mitchelmore, M. C. (2000). Young children's intuitive understanding of rectangular area measurement. Journal for Research in Mathematics Education, 31, 144-167.

Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago: University of Chicago Press. (Original work published 1960)

Robitaille, D. F. (1989). Students' achievements: Population A. In D. F. Robitaille & R. A. Gardner (Eds.), The IEA study of mathematics II: Contexts and outcomes of school mathematics (pp. 102-125). New York: Pergamon.

Sabers, D., Cushing, K., & Sabers, D. (1987). Sex differences in reading and mathematics achievement for middle school students. Journal of Early Adolescence, 7, 117-128.

Shaw, J., & Pucket-Cliatt, M. (1989). Developing measurement sense. In P. Trafton & A. Schule (Eds.), New directions for elementary school mathematics: 1989 yearbook (pp. 149-155). Reston, VA: National Council of Teachers of Mathematics.

Smith, R. (1991). Item and person analysis with the Rasch model. Maple Grove, MN: JAM Press. Available from http://home.att.net/~rsmith.arm/jam-press.htm

Strutchens, M. E., & Silver, E. A. (2000). NAEP findings regarding race/ethnicity: Students' performance, school experiences, and attitudes and beliefs. In E. A. Silver & P. A. Kenney (Eds.), Results from the seventh mathematics assessment of the National Assessment of Educational Progress (pp. 45-72). Reston, VA: National Council of Teachers of Mathematics.

Voyer, D., Voyer, S., & Bryden, M. P. (1995). Magnitude of sex differences in spatial abilities: A meta-analysis and consideration of critical variables. Psychological Bulletin, 117, 250-270.

Wilson, P. S., & Rowland, R. (1993). Teaching measurement. In R. J. Jensen (Ed.), Research ideas for the classroom: Early childhood mathematics (pp. 171-194). New York: Macmillan.

Wright, B. D., & Linacre, M. (1998). WINSTEPS. Chicago: MESA Press.

Wright, B. D., & Masters, G. N. (1982). Rating scale analysis: Rasch measurement. Chicago: MESA Press.

Wright, B. D., & Stone, M. (1979). Best test design. Chicago: MESA Press.