Analysis of biology midterm exam items using a comparison of classical test theory and the Rasch model
DOI: https://doi.org/10.22219/jpbi.v10i3.34345

Keywords: difficulty level, distractor effectiveness, item discrimination, reliability, validity

Abstract
In biology learning, test instruments are essential for assessing students' understanding of complex concepts. Although a well-constructed test instrument is crucial to learning evaluation, systematic item analysis is still rarely carried out in practice. This descriptive quantitative study analyzes the quality of test items using classical test theory, in terms of validity, reliability, difficulty index, discrimination power, and distractor effectiveness, alongside Rasch model analysis. The data consist of responses from 40 students to 30 multiple-choice questions on a biology midterm exam. Classical test analysis was performed in Microsoft Excel, and Rasch model analysis in Winsteps. Both approaches identify 14 valid items and 16 invalid ones. Reliability under the classical approach is 0.619 (adequate) by Cronbach's alpha, while the Rasch model yields an item reliability of 0.85 (good) and a person reliability of 0.65 (weak). Both classical test theory and the Rasch model categorize item difficulty into four levels. The classical approach produces five categories of item discrimination, while the Rasch model identifies three groups of items based on the item separation index (H = 3.45) and two groups of respondents based on ability (H = 1.96). Distractor analysis shows 93.3% functioning distractors under the classical approach and 80% under the Rasch model. The Rasch model offers greater precision in measuring student ability and detecting bias, so the two approaches should be integrated for comprehensive item analysis. Future test development should focus on revising the invalid items and improving distractor quality.
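To make the classical analysis concrete, the sketch below computes the three classical statistics the study reports (difficulty index, upper-lower discrimination index, and Cronbach's alpha) for a dichotomously scored response matrix. This is a minimal Python illustration with randomly generated data; it is not the study's Excel workbook or its actual responses.

```python
# Classical test theory item statistics: a minimal sketch.
# The 40 x 30 matrix mirrors the study's dimensions but is random,
# illustrative data, not the study's responses.
import numpy as np

rng = np.random.default_rng(0)
scores = (rng.random((40, 30)) > 0.4).astype(int)  # rows = students, cols = items

# Difficulty index p: proportion of students answering each item correctly.
p = scores.mean(axis=0)

# Discrimination index D: proportion correct in the upper 27% group minus
# the lower 27% group, with groups formed by ranking on total score.
totals = scores.sum(axis=1)
order = np.argsort(totals)
k = int(round(0.27 * scores.shape[0]))
lower, upper = scores[order[:k]], scores[order[-k:]]
D = upper.mean(axis=0) - lower.mean(axis=0)

# Cronbach's alpha: internal-consistency reliability of the whole test.
n_items = scores.shape[1]
alpha = n_items / (n_items - 1) * (1 - scores.var(axis=0, ddof=1).sum() / totals.var(ddof=1))

print(f"Cronbach's alpha = {alpha:.3f}")
for i, (pi, di) in enumerate(zip(p, D), start=1):
    print(f"item {i:2d}: difficulty p = {pi:.2f}, discrimination D = {di:+.2f}")
```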
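On the Rasch side, Winsteps estimates item difficulties and person abilities by joint maximum likelihood (JMLE). The toy function below, a sketch under simplifying assumptions, implements the basic alternating Newton-Raphson updates for the dichotomous Rasch model; it omits Winsteps' bias corrections and assumes rows and columns with perfect scores have already been removed (JMLE diverges for them). As a note on the reported indices: in standard Rasch reporting, the separation ratio G and reliability R are related by G = sqrt(R / (1 - R)), and the number of statistically distinct strata is H = (4G + 1) / 3, which, up to rounding, is how an item reliability near 0.85 supports about three item strata and a person reliability near 0.65 about two person strata.

```python
# Dichotomous Rasch model via joint maximum likelihood: a toy sketch of the
# kind of estimation Winsteps performs (the function name and structure are
# illustrative, not the Winsteps implementation).
import numpy as np

def rasch_jmle(X, n_iter=50):
    """Estimate person abilities (theta) and item difficulties (beta) for a
    0/1 response matrix X by alternating Newton-Raphson steps."""
    theta = np.zeros(X.shape[0])
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # P[i, j] = probability that person i answers item j correctly.
        P = 1 / (1 + np.exp(-(theta[:, None] - beta[None, :])))
        # Ability update: gradient is observed minus expected raw score,
        # information is sum of P(1 - P) over items.
        theta += (X - P).sum(axis=1) / np.maximum((P * (1 - P)).sum(axis=1), 1e-9)
        P = 1 / (1 + np.exp(-(theta[:, None] - beta[None, :])))
        # Difficulty update (opposite sign), then center betas at zero
        # to fix the scale's origin.
        beta -= (X - P).sum(axis=0) / np.maximum((P * (1 - P)).sum(axis=0), 1e-9)
        beta -= beta.mean()
    return theta, beta

# Usage with the simulated matrix from the previous sketch:
#   theta, beta = rasch_jmle(scores)
```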
License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with JPBI (Jurnal Pendidikan Biologi Indonesia) agree to the following terms:
- For all articles published in JPBI, copyright is retained by the authors, who grant the publisher permission to publish the work under the stated conditions. When a manuscript is accepted for publication, the authors agree that the publishing right transfers automatically to the publisher.
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution-ShareAlike 4.0 International License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).