|Linking Two Assessment Systems Using Common-Item IRT Method and Equipercentile Linking Method||
When states move from one assessment system to another, it is often necessary to establish a concordance between the two assessments for accountability purposes. The purpose of this study is to model two alternative approaches to transitioning performance standards, both of which can be executed using data from regularly scheduled operational administrations.
|Kirkpatrick, Rob, Turhan, Ahmet, Lin, Jie A||04-01-2012|
|Creating Curriculum-Embedded, Performance-Based Assessments for Measuring 21st Century Skills in K-5 Students||
This paper will share the author’s experiences working with a large and diverse school district to design curriculum-embedded, performance-based assessments (PBAs) that measure 21st century skills in K-5 students.
|Lai, Emily R.||04-01-2012|
|Assessing 21st Century Skills: Integrating Research Findings||
This paper synthesizes research evidence pertaining to several so-called 21st century skills: critical thinking, creativity, collaboration, metacognition, and motivation.
|Lai, Emily R., Viering, Michaela||04-01-2012|
|The Case for Performance-Based Tasks without Equating||
This paper proposes a model for performance-based assessments that assumes random selection of performance-based tasks (PBTs) from a large pool, and that assumes tasks are comparable without equating PBTs.
|Way, Walter D., Murphy, Daniel, Powers, Sonya, Keng, Leslie||04-01-2012|
|Improving Text Complexity Measurement through the Reading Maturity Metric||
The purposes of this paper are to describe how Word Maturity has been incorporated into Pearson’s text complexity measure, to present initial comparisons between this new measure of text complexity and traditional readability measures, and to address measurement issues in the development and use of text complexity measurements.
|Landauer, Tom, Way, Walter D.||04-01-2012|
|A Comparison of Three Content Balancing Methods for Fixed and Variable Length Computerized Adaptive Tests||
The purpose of this study is to compare the WPM method to the WDM method under various conditions, including simple and complex content constraint structures and different CAT settings such as item pool composition, item exposure control specifications, and theta estimation options, for both fixed- and variable-length CATs.
|Shin, Chingwei David, Chien, Yuehmei, Way, Walter Denny||04-01-2012|
|Connecting English Language Learning and Academic Performance: A Prediction Study||
The purpose of this study was to investigate the use of English language proficiency and academic reading assessment scores to predict the future academic success of English learner (EL) students.
|Kong, Jadie, Powers, Sonya, Starr, Laura, Williams, Natasha||04-01-2012|
|Population Invariance of Vertical Scaling Results||
In this report, the population sensitivity of vertical scaling results was evaluated for a state reading assessment spanning grades 3–10 and a state mathematics test spanning grades 3–8.
|Powers, Sonya, Turhan, Ahmet, Binici, Salih (Florida State University)||04-01-2012|
|Putting Ducks in a Row: Methods for Empirical Alignment of Performance Scoring||
Using historical state data, this report evaluates nine different methods of aligning performance standards and discusses the effects of selecting different methods as well as the potential implications for interpretations of student progress and school success.
|McClarty, Katie Larsen, Murphy, Daniel, Keng, Leslie, Turhan, Ahmet, Tong, Ye||04-01-2012|
|The Impact of Item Position Change on Item Parameters and Common Equating Results under the 3PL Model||
This study investigates the impact of IPC in the context of operational testing programs that employ the 3PL model, alternative equating procedures, and different item re-use policies.
|Meyers, Jason L., Murphy, Stephen, Goodman, Joshua, Turhan, Ahmet||04-01-2012|
|Impact of Group Differences on Equating Accuracy and the Adequacy of Equating Assumptions||
This study compared four curvilinear equating methods including frequency estimation, chained equipercentile, IRT true score, and IRT observed score equating.
|Comparing Methods for Detecting Unstable Anchor Items with Net DIF and Global DIF Conceptions||
This study compares different approaches for detecting misbehaving anchor items in IRT equating using the Rasch and partial credit models.
|Lau, C. Allen, Arce, Alvaro J.||04-11-2011|
|Expanding the Model of Item-Writing Expertise: Cognitive Processes and Requisite Knowledge Structures||
In this paper, we expand the cognitive model of item writing to not only include cognitive processes but to also include requisite knowledge structures used by item writers.
|Fulkerson, Dennis (Pearson), Nichols, Paul (Center for Assessment), Snow, Eric (SRI International)||04-07-2011|
|Does Size Matter? A Study on the Use of Netbooks in K-12 Assessments||
In this paper, we analyze a study conducted during the spring 2010 administration of the Texas End-of-Course (EOC) assessments to evaluate the feasibility of using netbooks in the context of K-12 assessments.
|King, Leslie, Kong, Xiaojing Jadie, Bleil, Bryan||04-01-2011|
|Comparison of Asymptotic and Bootstrap Item Fit Indices in Identifying Misfit to the Rasch Model||
Our results indicate that bootstrap critical values provide greater statistical power than asymptotic critical values for diagnosing item misfit caused by varying item slopes and lower asymptotes.
|Wolfe, Edward W., McGill, Michael T.||04-01-2011|
|Statistical Properties of 3PL Robust Z: An Investigation with Real and Simulated Data Sets||
The purpose of this paper was to inspect statistical properties of the robust z approach in the context of 3PL equating with the common item non-equivalent group design.
|Arce, Alvaro J., Lau, C. Allen||04-01-2011|
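For readers unfamiliar with the technique named in the title, the robust z statistic screens common (anchor) items for parameter drift by standardizing old-versus-new difficulty differences with outlier-resistant estimates (median and interquartile range). The following is a minimal illustrative sketch, not the paper's implementation; the example difficulty values and the 1.645 critical value are assumptions for demonstration only:

```python
import statistics

def robust_z(old_b, new_b, critical=1.645):
    """Robust z screening for common-item parameter drift.

    Computes z_i = (d_i - median(d)) / (0.74 * IQR(d)) for the
    difficulty differences d_i = new_b_i - old_b_i, and flags items
    whose |z_i| exceeds the critical value. (A zero IQR would need
    special handling; it is nonzero in this example.)
    """
    d = [n - o for o, n in zip(old_b, new_b)]
    med = statistics.median(d)
    q1, _, q3 = statistics.quantiles(d, n=4)  # quartiles of the differences
    z = [(di - med) / (0.74 * (q3 - q1)) for di in d]
    flagged = [i for i, zi in enumerate(z) if abs(zi) > critical]
    return z, flagged

# Hypothetical old and new Rasch difficulty estimates for six anchor items;
# the third item drifts markedly while the others shift uniformly.
old = [0.2, -0.5, 1.1, 0.0, -1.2, 0.8]
new = [0.25, -0.45, 1.9, 0.05, -1.15, 0.85]
z, flagged = robust_z(old, new)
print(flagged)  # → [2]
```

Flagged items would typically be dropped from the anchor set before re-running the equating.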
|Investigating Common-Item Screening Procedures in Developing a Vertical Scale||
Creating a vertical scale involves several decisions on assessment designs and statistical analyses to determine the most appropriate vertical scale.
|Johnson, Marc, Yi, Qing||04-01-2011|
|Investigating Content and Construct Representation of a Common-item Design When Creating a Vertically Scaled Test||
This study investigated how well the guideline of content and construct representation was maintained while evaluating two stability assessment criteria (Robust z and 0.3-logit difference).
|Hardy, M. Assunta (BYU), Young, Michael J. (Pearson), Yi, Qing (Pearson), Sudweeks, Richard R. (BYU), Bahr, Damon L. (BYU)||04-01-2011|
|Application of Latent Trait Models to Identifying Substantively Interesting Raters||
This study demonstrates how existing latent trait modeling procedures can identify groups of raters who may be of substantive interest to those studying the experiential, cognitive, and contextual aspects of ratings.
|Wolfe, Edward W., McVay, Aaron||04-01-2011|
|Through-Course Common Core Assessments in the United States: Can Summative Assessment Be Formative?||
In this paper, we present a design for enhancing the formative uses of summative through-course assessments.
|Way, Walter D., Larsen McClarty, Katie, Murphy, Dan, Keng, Leslie, Fuhrken, Charles||04-01-2011|
|Impact of Non-representative Anchor Items on Scale Stability||
This study attempts to fill this gap by simulating item response data over multiple administrations under the common-item nonequivalent groups design and examining the effects of non-representative anchor items on scale stability.
|The Modified Briefing Book Standard Setting Method: Using Validity Data as a Basis for Setting Cut Scores||
This paper focuses on two aspects of the modified briefing book standard setting process developed to meet this need: 1) the validity research conducted to support the standard setting; and 2) the standard setting itself, in which the validity research and other pertinent information were organized and presented to the panelists, and the process through which these data were used to elicit cut score judgments.
|Miles, Julie A., Beimers, Jennifer N., Way, Walter D.||05-01-2010|
|Improving the Post-Smoothing of Test Norms with Kernel Smoothing||
The traditional methodology of post-smoothing to develop norms used on educational and clinical products is to hand-smooth the scale scores or their distributions. This approach is very subjective, difficult to replicate, and extremely labor intensive: in hand-smoothing, the scores or distributions are adjusted based on personal judgment, and different persons, or the same person at different times, will make significantly different judgments. By contrast, the kernel smoothing method is a nonparametric approach, which is more flexible, less subjective, and easier to replicate.
|Lin, Anli, Yi, Qing, Young, Michael J.||05-01-2010|
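To make the contrast with hand-smoothing concrete, a kernel smoother replaces each raw frequency with a bandwidth-weighted average of its neighbors, so the result is fully reproducible from the data and the bandwidth choice. This is a minimal illustrative sketch with a Gaussian kernel, not the method from the report; the score range, frequencies, and bandwidth are assumptions for demonstration:

```python
import math

def kernel_smooth(scores, freqs, bandwidth=2.0):
    """Gaussian kernel smoothing of a raw-score frequency distribution.

    Each smoothed value is a normalized, Gaussian-weighted average of
    all observed frequencies, with nearer score points weighted more
    heavily; the bandwidth controls the degree of smoothing.
    """
    smoothed = []
    for x in scores:
        weights = [math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in scores]
        total = sum(weights)
        smoothed.append(sum(w * f for w, f in zip(weights, freqs)) / total)
    return smoothed

scores = list(range(11))
freqs = [1, 3, 2, 8, 5, 12, 6, 9, 4, 3, 1]  # jagged raw frequencies
print(kernel_smooth(scores, freqs))
```

Unlike hand-smoothing, rerunning this with the same bandwidth always reproduces the same smoothed distribution.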
|The Impact of Different Anchor Stability Methods on Equating Results and Student Performance||
The key objective of this study is to demonstrate a methodological procedure for examining different anchor stability procedures, evaluating the accompanying results, and assessing the impact on the final RSSS tables and reported cut scores (i.e., performance levels). For our study we did not include the bivariate plots for the old and new parameter values.
|Murphy, Stephen, Little, Ian, Fan, Meichu, Lin, Chow-Hong, Kirkpatrick, Rob||05-01-2010|
|Comparisons of Test Characteristic Curve Alignment Criteria of the Anchor Set and the Total Test: Maintaining Test Scale and Impacts on Student Performance||
The current paper investigates a tenet of the traditional view on the psychometric characteristics of such anchor sets. Specifically, the traditional guideline, without any specificity, states that the test characteristic curve (TCC) of the anchor set and the total test should be closely overlapped.
|Karkee, Thakur B., Ph.D., Fatica, Kevin, Murphy, Stephen T., Ph.D.||05-01-2010|
|A Multi-level Modeling Approach to Predicting Performance on a State ELA Assessment||
The purpose of this study was to examine, on a State English Language Proficiency Examination for grades K-12, (a) the performance of students in low SES environments vs. high SES environments as measured by school Title I participation, (b) the performance of males vs. females, (c) the effect of ethnicity (Hispanic vs. non-Hispanic students), and (d) any interaction effects.
|Brown, Raymond S., Nguyen, T., Stephenson, A.||05-01-2010|
|What Item Writers Think When Writing Items: Towards a Theory of Item Writing Expertise||
The study of expert item writers offers the possibility of “bottling” the knowledge and skills acquired by these experts over years of hard work. The descriptions of the identified conceptual knowledge and skills of expert item writers could be incorporated into item writing workshops in order to equip new item writers with the tools necessary to produce quality figural response items.
|Fulkerson, Dennis, Nichols, Paul, Mittelholtz, David||05-01-2010|
|Investigating Approaches to Estimate an Individual's Strand/objective Score Profile Reliability: A Monte Carlo Study||
The paper studies performance of generalizability and classical test theory reliability approaches to estimate reliability of an individual's strand/objective score profile.
|Arce-Ferrer, Alvaro J.||05-01-2010|
|Distractor Rationale Taxonomy: Diagnostic Assessment of Reading with Ordered Multiple-Choice Items||
The distractor rationale taxonomy (DRT) examined in this study is an understanding-level-driven distractor analysis system for multiple-choice items. The DRT purposely creates distractors at different comprehension levels to pinpoint sources of misunderstanding.
|Lin, Jie, Lee Chu, Kwang, Meng, Ying||05-01-2010|
|The Effects of Autocorrelation on the Curve-of-Factors Growth Model||
This simulation study examined the performance of the curve-of-factors model (COFM) when autocorrelation and growth processes were present in the first-level factor structure. In addition to the standard curve-of-factors growth model, two new models were examined: one COFM that included a first-order autoregressive autocorrelation parameter, and a second model that included first-order autoregressive and moving average autocorrelation parameters.
|Murphy, Daniel J., Beretvas, S Natasha, Pituch, Keenan A||05-01-2010|
|Correlates of Mathematics Achievement in Developed and Developing Countries: An HLM Analysis of TIMSS 2003 Eighth-grade Mathematics Scores||
The purpose of this study was to investigate correlates of math achievement in both developed and developing countries. Specifically, two developed countries and two developing countries that participated in the TIMSS 2003 eighth-grade math assessment were selected for this study. For each country, contextual factors at both the student and the teacher/school levels were used to construct models that yield country-specific findings related to students’ math performance.
|Phan, Ha, Sentovich, Christina, Kromrey, Jeffrey, Dedrick, Robert, Ferron, John||05-01-2010|
|IRT Proficiency Estimators and Their Impact||
In the current study, we further examined the statistical properties of the various IRT estimators, especially focusing on their practical impact on the reported scores. We also investigated a few practical scenarios, where the testing focus is on assessing college readiness, assessing students’ minimal competency, or providing estimates for students who have failed a previous exam (retesters).
|Tong, Ye, Kolen, Michael J.||05-01-2010|
|The Hazards of Newness: A Portrait of Challenges Faced by New High School English Teachers||
This paper reports findings of a survey study designed to examine how high school English teachers are assigned to teach particular grades and track levels, whether these teachers have their own classrooms, and how they and their students perceive one another.
|Bieler, Deborah, Holmes, Stephen, Wolfe, Edward W.||05-01-2010|
|Conference Reports; Constructed Response Scoring||
An increasing number of large scale assessments contain constructed response items such as essays for the advantages they offer over traditional multiple-choice measures. Writing assessments in particular often contain a mixture of multiple-choice and essay items. These mixed-format assessments pose many technical challenges for psychometricians. This study directly builds upon the Meyers et al. (2009) study by investigating how ability estimation, essay scoring approach, measurement model, and proportion of points allocated to multiple choice items and the essay item on mixed-format assessments interact to recover ability and item parameter estimates under different degrees of multidimensionality.
|Meyers, Jason L., Turhan, Ahmet, Fitzpatrick, Steven J.||05-01-2010|
|Derivation of a Profile Reliability Index for an Individual: A Multi-Factor Congeneric Approach with Guttman Error Type Structures||
The paper discusses results and proposes research to substantiate current supporting evidence for the operational use of the profile reliability approach.
|Arce-Ferrer, Alvaro J.||11-25-2009|
|Growth, Precision, and CAT: An Examination of Gain Score Conditional SEM||
Monitoring the growth of student learning is a critically important component of modern education. Such growth is typically monitored using gain scores representing differences between two testing occasions, such as prior to and following a year of instruction.
|Thompson, Tony D.||12-01-2008|
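The conditional SEM of a gain score follows directly from the two occasions' conditional SEMs: if the measurement errors at the two occasions are independent, the error variances add. This is a one-line illustrative sketch of that standard result, not the paper's analysis; the example CSEM values are assumptions for demonstration:

```python
import math

def gain_score_csem(csem_time1, csem_time2):
    """Conditional SEM of a gain (difference) score.

    Assuming independent measurement errors at the two testing
    occasions, the error variances add, so the gain-score CSEM is
    the square root of the sum of the squared conditional SEMs.
    """
    return math.sqrt(csem_time1 ** 2 + csem_time2 ** 2)

print(gain_score_csem(3.0, 4.0))  # → 5.0
```

Note the implication: the gain score is always measured less precisely than either occasion's score alone.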
|Effects of Different Training and Scoring Approaches on Human Constructed Response Scoring||
This paper summarizes and discusses research studies related to the human scoring of constructed response items that have been conducted recently at a large scale testing company.
|Nichols, Paul, Vickers, Daisy, Way, Walter D.||04-01-2008|
|Person-fit of English Language Learners (ELL) in K-12 High-Stakes Assessments||
The No Child Left Behind Act holds states using federal funds accountable for student academic achievement.
|Wan, Lei, Wu, Brad||04-01-2008|
|Maintaining Score Equivalence as Tests Transition Online: Issues, Approaches and Trends||
The purpose of this paper is to summarize a number of studies that Pearson has conducted with K-12 state departments of education using a particular analysis method referred to as Matched Samples Comparability Analyses (MSCA).
|Kong, Jadie, Lin, Chow-Hong, Way, Walter D.||03-28-2008|
|Field Testing and Equating Designs for State Educational Assessments||
The educational accountability movement has spawned unprecedented numbers of new assessments. For example, the No Child Left Behind Act of 2001 (NCLB) required states to test students in grades 3 through 8 and at one grade in high school each year.
|Kirkpatrick, Rob, Way, Walter D.||03-01-2008|
|An Investigation of the Changes in Item Parameter Estimates for Items Re-field Tested||
Large-scale state testing programs typically rely upon a large bank of items to select from when building assessments.
|Kong, Xiaojing Jadie, McClarty, Katie Larsen, Meyers, Jason L.||03-01-2008|
|Applying a User-Centered Design Approach to Data Management: Paper and Computer Testing||
This paper discusses the application of a user-centered design (UCD) approach to a web-based application system that supports data management components of the high-stakes assessment lifecycle.
|Wilson, Jeffrey R., PhD||03-01-2008|
|User-Centered Assessment Design||
In this paper, we introduce user-centered assessment design (UCAD), an approach to test design intended to produce assessments that deliver to teachers the kind of complex information on student learning and knowledge that they can combine with sound pedagogical practice to improve student achievement.
|Adams, Jeremy, Mittelholtz, David, Nichols, Paul, Van Duesen, Robert||03-01-2008|
|A Comparison of Pre-Equating and Post-Equating Using Large-Scale Assessment Data||
Equating is a statistical process that is used to adjust scores on test forms so that scores on the forms can be used interchangeably (Kolen & Brennan, 2004), even though the test forms consist of different items.
|Tong, Ye, Wu, Sz-Shyan, Xu, Ming||03-01-2008|
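The equating definition cited above (Kolen & Brennan, 2004) can be made concrete with equipercentile equating, one common method: a score on form X is mapped to the form-Y score with the same percentile rank. The following is a minimal illustrative sketch under simplifying assumptions (midpoint percentile-rank convention, linear interpolation, no smoothing), not the procedure used in the paper; the example frequency distributions are invented for demonstration:

```python
def percentile_ranks(freqs):
    """Percentile rank at each score point (midpoint convention)."""
    total = sum(freqs)
    cum = 0
    ranks = []
    for f in freqs:
        ranks.append((cum + f / 2) / total)
        cum += f
    return ranks

def equipercentile_equate(x_score, x_freqs, y_freqs):
    """Map a form-X raw score to the form-Y score with the same
    percentile rank, interpolating linearly between form-Y score points."""
    p = percentile_ranks(x_freqs)[x_score]
    q = percentile_ranks(y_freqs)
    for i in range(len(q) - 1):
        if q[i] <= p <= q[i + 1]:
            frac = (p - q[i]) / (q[i + 1] - q[i]) if q[i + 1] > q[i] else 0.0
            return i + frac
    return float(len(q) - 1) if p > q[-1] else 0.0

x_freqs = [5, 10, 20, 30, 20, 10, 5]  # hypothetical form-X frequencies
y_freqs = [8, 12, 25, 25, 18, 8, 4]   # hypothetical form-Y frequencies
print(equipercentile_equate(3, x_freqs, y_freqs))
```

If the two forms had identical score distributions, the mapping would be the identity, which is a quick sanity check on the implementation.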
|A Tale of Two Modes: A Case Study in User-centered Design’s Role in Comparability and Construct Validity||
One merit of user-centered assessment design (UCAD), as defined by Nichols et al. (2008), is its broadened view of test development.
|Strain-Seymour, Ellen, PhD||03-01-2008|
|Score Reporting, Off-the-Shelf Assessments and NCLB: Truly an Unholy Trinity||
One consequence resulting from NCLB, particularly as instructional time becomes more precious, is the desire to be more efficient in assessing learning.
|Twing, Jon S., PhD||03-01-2008|
|Evidence of Test Score Use in Validity: Roles and Responsibilities||
This paper has three goals.
|Nichols, Paul D., Williams, Natasha||03-01-2008|
|Maintenance of Vertical Scales||
Vertical scaling refers to the process of placing scores of tests that measure similar domains but at different educational levels onto a common scale, a vertical scale.
|Kolen, Michael J., Tong, Ye||03-01-2008|
|Usability and Design Considerations for Computer-based Learning and Assessment||
The overall success of computer-based products and systems is dependent to a significant extent on their usability and usefulness in the intended context.
|Adams, Jeremy, Harms, Michael||03-01-2008|
|Score Comparability of Online and Paper Administrations of the Texas Assessment of Knowledge and Skills||
The comparability studies presented in this paper illustrate how responsible and psychometrically defensible comparability analyses can be incorporated within the constraints of a high-stakes, operational testing program like TAKS.
|Fitzpatrick, Steven, Laughlin Davis, Laurie, Way, Walter D.||04-01-2006|
|A Comparison of Item and Testlet Selection Procedures in Computerized Adaptive Testing||
Testlet response theory (TRT) is a measurement model that can capture local dependency in testlet-based tests.
|Chen, Tzu-An Ann, Dodd, Barbara G., Ho, Tsung-Han, Keng, Leslie|
|Response Probability Criterion and Subgroup Performance||
In the standard setting literature, there has been much debate about the most appropriate response probability (RP) to use in an item mapping procedure such as the Bookmark Standard Setting Procedure.
|Egan, Karla, Mueller, Canda D., Schneider, M. Christina|
|Exploring the Use of Item Bank Information to Improve IRT Item Parameter Estimation||
On occasion, the sample of students available for calibrating a set of assessment items may not be optimal.
|Ansley, Timothy, Hall, Erika|
|A Generalization of Stratified a that Allows for Correlated Measurement Errors between Subtests||
This paper presents a generalization of Stratified a that allows for correlated measurement errors between some subtest scores that make up a composite score.
|Keng, Leslie, Miller, G. Edward, O'Malley, Kimberly, Turhan, Ahmet|
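For context on the quantity being generalized: stratified alpha estimates the reliability of a composite from the subtest variances and subtest reliabilities, under the standard assumption that measurement errors are uncorrelated across subtests (the assumption the paper's generalization relaxes). This is a minimal illustrative sketch of the standard (non-generalized) coefficient; the subtest variances, reliabilities, and composite variance below are invented for demonstration:

```python
def stratified_alpha(subtest_vars, subtest_alphas, composite_var):
    """Stratified coefficient alpha for a composite of subtests.

    alpha_strat = 1 - sum(var_i * (1 - alpha_i)) / var_composite.
    Each subtest contributes error variance var_i * (1 - alpha_i);
    the standard form assumes these errors are uncorrelated across
    subtests.
    """
    error = sum(v * (1 - a) for v, a in zip(subtest_vars, subtest_alphas))
    return 1 - error / composite_var

# Hypothetical composite of two subtests: variances 25 and 16,
# reliabilities .80 and .75, composite variance 60.
print(stratified_alpha([25, 16], [0.80, 0.75], 60))  # → 0.85
```

The generalization described in the paper would add covariance terms for subtest pairs whose errors are allowed to correlate.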