Research List

Conference Reports

Linking Two Assessment Systems Using Common-Item IRT Method and Equipercentile Linking Method

When states move from one assessment system to another, it is often necessary to establish a concordance between the two assessments for accountability purposes. The purpose of this study is to model two alternative approaches to transitioning performance standards, both of which can be executed using data from regularly scheduled operational administrations.

Kirkpatrick, Rob, Turhan, Ahmet, Lin, Jie A 04-01-2012
Creating Curriculum-Embedded, Performance-Based Assessments for Measuring 21st Century Skills in K-5 Students

This paper will share the author’s experiences working with a large and diverse school district to design curriculum-embedded, performance-based assessments (PBAs) that measure 21st century skills in K-5 students.

Lai, Emily R. 04-01-2012
Assessing 21st Century Skills: Integrating Research Findings

This paper synthesizes research evidence pertaining to several so-called 21st century skills: critical thinking, creativity, collaboration, metacognition, and motivation.

Lai, Emily R., Viering, Michaela 04-01-2012
The Case for Performance-Based Tasks without Equating

This paper proposes a model for performance-based assessments that assumes random selection of performance-based tasks (PBTs) from a large pool, and that assumes tasks are comparable without equating PBTs.

Way, Walter D., Murphy, Daniel, Powers, Sonya, Keng, Leslie 04-01-2012
Improving Text Complexity Measurement through the Reading Maturity Metric

The purposes of this paper are to describe how Word Maturity has been incorporated into Pearson’s text complexity measure, to present initial comparisons between this new measure of text complexity and traditional readability measures, and to address measurement issues in the development and use of text complexity measurements.

Landauer, Tom, Way, Walter D. 04-01-2012
A Comparison of Three Content Balancing Methods for Fixed and Variable Length Computerized Adaptive Tests

The purpose of this study is to compare the WPM method with the WDM method under various conditions, including simple and complicated content constraint structures and different CAT settings (such as item pool, item exposure control specification, and theta estimation options), for both fixed- and variable-length CATs.

Shin, Chingwei David, Chien, Yuehmei, Way, Walter Denny 04-01-2012
Connecting English Language Learning and Academic Performance: A Prediction Study

The purpose of this study was to investigate the use of English language proficiency and academic reading assessment scores to predict the future academic success of English learner (EL) students.

Kong, Jadie, Powers, Sonya, Starr, Laura, Williams, Natasha 04-01-2012
Population Invariance of Vertical Scaling Results

In this report, the population sensitivity of vertical scaling results was evaluated for a state reading assessment spanning grades 3–10 and a state mathematics test spanning grades 3–8.

Powers, Sonya, Turhan, Ahmet, Binici, Salih (Florida State University) 04-01-2012
Putting Ducks in a Row: Methods for Empirical Alignment of Performance Scoring

Using historical state data, this report evaluates nine different methods of aligning performance standards and discusses the effects of selecting different methods as well as the potential implications for interpretations of student progress and school success.

McClarty, Katie Larsen, Murphy, Daniel, Keng, Leslie, Turhan, Ahmet, Tong, Ye 04-01-2012
The Impact of Item Position Change on Item Parameters and Common Equating Results under the 3PL Model

This study investigates the impact of item position change (IPC) in the context of operational testing programs that employ the 3PL model, alternative equating procedures, and different item re-use policies.

Meyers, Jason L., Murphy, Stephen, Goodman, Joshua, Turhan, Ahmet 04-01-2012
Impact of Group Differences on Equating Accuracy and the Adequacy of Equating Assumptions

This study compared four curvilinear equating methods including frequency estimation, chained equipercentile, IRT true score, and IRT observed score equating.

Powers, Sonya 04-30-2011
Comparing Methods for Detecting Unstable Anchor Items with Net DIF and Global DIF Conceptions

This study compares different approaches for detecting misbehaving anchor items in IRT equating using Rasch and partial credit models.

Lau, C. Allen, Arce, Alvaro J. 04-11-2011
Expanding the Model of Item-Writing Expertise: Cognitive Processes and Requisite Knowledge Structures

In this paper, we expand the cognitive model of item writing to not only include cognitive processes but to also include requisite knowledge structures used by item writers.

Fulkerson, Dennis (Pearson), Nichols, Paul (Center for Assessment), Snow, Eric (SRI International) 04-07-2011
Does Size Matter? A Study on the Use of Netbooks in K-12 Assessments

In this paper, we analyze a study conducted during the spring 2010 administration of the Texas End-of-Course (EOC) assessments to evaluate the feasibility of using netbooks in the context of K-12 assessments.

Keng, Leslie, Kong, Xiaojing Jadie, Bleil, Bryan 04-01-2011
Comparison of Asymptotic and Bootstrap Item Fit Indices in Identifying Misfit to the Rasch Model

In this study, our results indicate that bootstrap critical values allow for greater statistical power in diagnosing item misfit caused by varying item slopes and lower asymptotes.

Wolfe, Edward W., McGill, Michael T. 04-01-2011
Statistical Properties of 3PL Robust Z: An Investigation with Real and Simulated Data Sets

The purpose of this paper was to inspect statistical properties of the robust z approach in the context of 3PL equating with the common item non-equivalent group design.

Arce, Alvaro J., Lau, C. Allen 04-01-2011
Investigating Common-Item Screening Procedures in Developing a Vertical Scale

Creating a vertical scale involves several decisions on assessment designs and statistical analyses to determine the most appropriate vertical scale.

Johnson, Marc, Yi, Qing 04-01-2011
Investigating Content and Construct Representation of a Common-item Design When Creating a Vertically Scaled Test

This study investigated how well the guideline of content and construct representation was maintained while evaluating two stability assessment criteria (Robust z and 0.3-logit difference).

Hardy, M. Assunta (BYU), Young, Michael J. (Pearson), Yi, Qing (Pearson), Sudweeks, Richard R. (BYU), Bahr, Damon L. (BYU) 04-01-2011
Application of Latent Trait Models to Identifying Substantively Interesting Raters

This study demonstrates how existing latent trait modeling procedures can identify groups of raters who may be of substantive interest to those studying the experiential, cognitive, and contextual aspects of ratings.

Wolfe, Edward W., McVay, Aaron 04-01-2011
Through-Course Common Core Assessments in the United States: Can Summative Assessment Be Formative?

In this paper, we present a design for enhancing the formative uses of summative through-course assessments.

Way, Walter D., Larsen McClarty, Katie, Murphy, Dan, Keng, Leslie, Fuhrken, Charles 04-01-2011
Impact of Non-representative Anchor Items on Scale Stability

This study attempts to fill this gap by simulating item response data over multiple administrations under the common-item nonequivalent groups design and examining the effects of non-representative anchor items on scale stability.

Wei, Hua 05-01-2010
The Modified Briefing Book Standard Setting Method: Using Validity Data as a Basis for Setting Cut Scores

This paper focuses on two aspects of the modified briefing book standard setting process developed to meet this need: 1) the validity research conducted to support the standard setting; and 2) the standard setting itself, through which the validity research and associated pertinent information were organized and presented to the panelists, and the resulting process through which these data were used to elicit cut score judgments.

Miles, Julie A., Beimers, Jennifer N., Way, Walter D. 05-01-2010
Improving the Post-Smoothing of Test Norms with Kernel Smoothing

The traditional methodology of post-smoothing to develop norms used on educational and clinical products is to hand-smooth the scale scores or their distributions. This approach is very subjective, difficult to replicate, and extremely labor intensive. In hand-smoothing, the scores or distributions are adjusted based on personal judgment. Different persons, or the same person at different times, will make significantly different judgments. By contrast, the kernel smoothing method is a nonparametric approach, which is more flexible, less subjective, and easier to replicate.

Lin, Anli, Yi, Qing, Young, Michael J. 05-01-2010
The Impact of Different Anchor Stability Methods on Equating Results and Student Performance

The key objective of this study is to demonstrate a methodological procedure or strategy for examining the different anchor stability procedures and the accompanying results and to evaluate the impact on the final RSSS tables and reported cut scores (i.e., performance levels). For our study we did not include the bivariate plots for the old and new parameter values.

Murphy, Stephen, Little, Ian, Fan, Meichu, Lin, Chow-Hong, Kirkpatrick, Rob 05-01-2010
Comparisons of Test Characteristic Curve Alignment Criteria of the Anchor Set and the Total Test: Maintaining Test Scale and Impacts on Student Performance

The current paper investigates a tenet of the traditional view on the psychometric characteristics of such anchor sets. Specifically, the traditional guideline states, without any specificity, that the test characteristic curves (TCCs) of the anchor set and the total test should closely overlap.

Karkee, Thakur B., PhD, Fatica, Kevin, Murphy, Stephen T., PhD 05-01-2010
A Multi-level Modeling Approach to Predicting Performance on a State ELA Assessment

The purpose of this study was to examine, on a state English language proficiency examination for grades K-12, (a) the performance of students in low SES environments vs. high SES environments as measured by school Title I participation, (b) the performance of males vs. females, (c) the effect of ethnicity (Hispanic vs. non-Hispanic students), and (d) any interaction effects.

Brown, Raymond S., Nguyen, T., Stephenson, A. 05-01-2010
What Item Writers Think When Writing Items: Towards a Theory of Item Writing Expertise

The study of expert item writers offers the possibility of “bottling” the knowledge and skills acquired by these experts over years of hard work. The descriptions of the identified conceptual knowledge and skills of expert item writers could be incorporated into item writing workshops in order to equip new item writers with the tools necessary to produce quality figural response items.

Fulkerson, Dennis, Nichols, Paul, Mittelholtz, David 05-01-2010
Investigating Approaches to Estimate an Individual's Strand/objective Score Profile Reliability: A Monte Carlo Study

The paper studies the performance of generalizability and classical test theory reliability approaches for estimating the reliability of an individual's strand/objective score profile.

Arce-Ferrer, Alvaro J. 05-01-2010
Distractor Rationale Taxonomy: Diagnostic Assessment of Reading with Ordered Multiple-Choice Items

The distractor rationale taxonomy (DRT) examined in this study is an understanding-level-driven distractor analysis system for multiple-choice items. The DRT purposely creates distractors at different comprehension levels to pinpoint sources of misunderstanding.

Lin, Jie, Lee Chu, Kwang, Meng, Ying 05-01-2010
The Effects of Autocorrelation on the Curve-of-Factors Growth Model

This simulation study examined the performance of the curve-of-factors model (COFM) when autocorrelation and growth processes were present in the first-level factor structure. In addition to the standard curve-of-factors growth model, two new models were examined: one COFM that included a first-order autoregressive autocorrelation parameter, and a second model that included first-order autoregressive and moving average autocorrelation parameters.

Murphy, Daniel J., Beretvas, S Natasha, Pituch, Keenan A 05-01-2010
Correlates of Mathematics Achievement in Developed and Developing Countries: An HLM Analysis of TIMSS 2003 Eighth-grade Mathematics Scores

The purpose of this study was to investigate correlates of math achievement in both developed and developing countries. Specifically, two developed countries and two developing countries that participated in the TIMSS 2003 eighth-grade math assessment were selected for this study. For each country, contextual factors at both the student and the teacher/school levels were used to construct models that yield country-specific findings related to students’ math performance.

Phan, Ha, Sentovich, Christina, Kromrey, Jeffrey, Dedrick, Robert, Ferron, John 05-01-2010
IRT Proficiency Estimators and Their Impact

In the current study, we further examined the statistical properties of the various IRT estimators, especially focusing on their practical impact on the reported scores. We also investigated a few practical scenarios, where the testing focus is on assessing college readiness, assessing students’ minimal competency, or providing estimates for students who have failed a previous exam (retesters).

Tong, Ye, Kolen, Michael J. 05-01-2010
The Hazards of Newness: A Portrait of Challenges Faced by New High School English Teachers

This paper reports findings of a survey study designed to examine how high school English teachers are assigned to teach particular grades and track levels, whether these teachers have their own classrooms, and how they and their students perceive one another.

Bieler, Deborah, Holmes, Stephen, Wolfe, Edward W. 05-01-2010
Constructed Response Scoring

An increasing number of large-scale assessments contain constructed response items such as essays for the advantages they offer over traditional multiple-choice measures. Writing assessments in particular often contain a mixture of multiple-choice and essay items. These mixed-format assessments pose many technical challenges for psychometricians. This study directly builds upon the Meyers et al. (2009) study by investigating how ability estimation, essay scoring approach, measurement model, and proportion of points allocated to multiple-choice items and the essay item on mixed-format assessments interact to recover ability and item parameter estimates under different degrees of multidimensionality.

Meyers, Jason L., Turhan, Ahmet, Fitzpatrick, Steven J. 05-01-2010
Derivation of a Profile Reliability Index for an Individual: A Multi-Factor Congeneric Approach with Guttman Error Type Structures

The paper discusses results and proposes research to substantiate current supporting evidence for the operational use of the profile reliability approach.

Arce-Ferrer, Alvaro J. 11-25-2009
Growth, Precision, and CAT: An Examination of Gain Score Conditional SEM

Monitoring the growth of student learning is a critically important component of modern education. Such growth is typically monitored using gain scores representing differences between two testing occasions, such as prior to and following a year of instruction.

Thompson, Tony D. 12-01-2008
Effects of Different Training and Scoring Approaches on Human Constructed Response Scoring

This paper summarizes and discusses research studies related to the human scoring of constructed response items that have been conducted recently at a large-scale testing company.

Nichols, Paul, Vickers, Daisy, Way, Walter D. 04-01-2008
Person-fit of English Language Learners (ELL) in K-12 High-Stakes Assessments

The No Child Left Behind Act holds states using federal funds accountable for student academic achievement.

Wan, Lei, Wu, Brad 04-01-2008
Maintaining Score Equivalence as Tests Transition Online: Issues, Approaches and Trends

The purpose of this paper is to summarize a number of studies that Pearson has conducted with K-12 state departments of education using a particular analysis method referred to as Matched Samples Comparability Analyses (MSCA).

Kong, Jadie, Lin, Chow-Hong, Way, Walter D. 03-28-2008
Field Testing and Equating Designs for State Educational Assessments

The educational accountability movement has spawned unprecedented numbers of new assessments. For example, the No Child Left Behind Act of 2002 (NCLB) required states to test students in grades 3 through 8 and at one grade in high school each year.

Kirkpatrick, Rob, Way, Walter D. 03-01-2008
An Investigation of the Changes in Item Parameter Estimates for Items Re-field Tested

Large-scale state testing programs typically rely upon a large bank of items to select from when building assessments.

Kong, Xiaojing Jadie, McClarty, Katie Larsen, Meyers, Jason L. 03-01-2008
Applying a User-Centered Design Approach to Data Management: Paper and Computer Testing

This paper discusses the application of a user-centered design (UCD) approach to a web-based application system that supports data management components of the high-stakes assessment lifecycle.

Wilson, Jeffrey R., PhD 03-01-2008
User-Centered Assessment Design

In this paper, we introduce user-centered assessment design (UCAD), an approach to test design intended to produce assessments that deliver to teachers the kind of complex information on student learning and knowledge that they can combine with sound pedagogical practice to improve student achievement.

Adams, Jeremy, Mittelholtz, David, Nichols, Paul, Van Duesen, Robert 03-01-2008
A Comparison of Pre-Equating and Post-Equating Using Large-Scale Assessment Data

Equating is a statistical process that is used to adjust scores on test forms so that scores on the forms can be used interchangeably (Kolen & Brennan, 2004), even though the test forms consist of different items.

Tong, Ye, Wu, Sz-Shyan, Xu, Ming 03-01-2008
A Tale of Two Modes: A Case Study in User-centered Design’s Role in Comparability and Construct Validity

One merit of user-centered assessment design (UCAD) as defined by Nichols et al. (2008) is its broadened view of test development.

Strain-Seymour, Ellen, PhD 03-01-2008
Score Reporting, Off-the-Shelf Assessments and NCLB: Truly an Unholy Trinity

One consequence resulting from NCLB, particularly as instructional time becomes more precious, is the desire to be more efficient in assessing learning.

Twing, Jon S., PhD 03-01-2008
Evidence of Test Score Use in Validity: Roles and Responsibilities

This paper has three goals.

Nichols, Paul D., Williams, Natasha 03-01-2008
Maintenance of Vertical Scales

Vertical scaling refers to the process of placing scores of tests that measure similar domains but at different educational levels onto a common scale, a vertical scale.

Kolen, Michael J., Tong, Ye 03-01-2008
Usability and Design Considerations for Computer-based Learning and Assessment

The overall success of computer-based products and systems is dependent to a significant extent on their usability and usefulness in the intended context.

Adams, Jeremy, Harms, Michael 03-01-2008
Score Comparability of Online and Paper Administrations of the Texas Assessment of Knowledge and Skills

The comparability studies presented in this paper illustrate how responsible and psychometrically defensible comparability analyses can be incorporated within the constraints of a high-stakes, operational testing program like TAKS.

Fitzpatrick, Steven, Laughlin Davis, Laurie, Way, Walter D. 04-01-2006
A Comparison of Item and Testlet Selection Procedures in Computerized Adaptive Testing

Testlet response theory (TRT) is a measurement model that can capture local dependency in testlet-based tests.

Chen, Tzu-An Ann, Dodd, Barbara G., Ho, Tsung-Han, Keng, Leslie
Response Probability Criterion and Subgroup Performance

In the standard setting literature, there has been much debate about the most appropriate response probability (RP) to use in an item mapping procedure such as the Bookmark Standard Setting Procedure.

Egan, Karla, Mueller, Canda D., Schneider, M. Christina
Exploring the Use of Item Bank Information to Improve IRT Item Parameter Estimation

On occasion, the sample of students available for calibrating a set of assessment items may not be optimal.

Ansley, Timothy, Hall, Erika
A Generalization of Stratified α that Allows for Correlated Measurement Errors between Subtests

This paper presents a generalization of Stratified α that allows for correlated measurement errors between some subtest scores that make up a composite score.

Keng, Leslie, Miller, G. Edward, O'Malley, Kimberly, Turhan, Ahmet