Validity in evaluation: where is the argument-based approach heading?




psychometrics; validity; educational assessment; generative artificial intelligence


This article examines the evolution of the concept of validity in the context of the integration of Generative Artificial Intelligence (GenAI) and ethical stances and, with them, informed decision-making. The methodology draws on the history of concepts as laid out by Koselleck, treating validity as a fundamental concept. The method is a literature review that analyzes historical and contemporary perspectives and the arguments of influential authors such as Messick and Kane. This conceptual journey shows that validity is not a monolithic entity but a complex fabric of theoretical and practical threads, ranging from the internal logic of evaluations to the repercussions of their application in society. Validity is likewise recognized as a complex construct that cannot be reduced to a single aspect or characteristic of a test or evaluation, and it is distinguished from validation. The literature distinguishes five historical periods that reflect paradigmatic changes in the understanding of validity: gestation, crystallization, fragmentation, reunification, and deconstruction, culminating in a period of diffusion. The most relevant conclusion is that validity is not static but dynamic, evolving with context and application. The article also emphasizes the need for continuous validation adapted to emerging challenges such as GenAI, so that evaluations remain accurate and fair amid growing interest in quantum computing.

Author Biographies

Karla Karina Ruiz Mendoza, Universidad Autónoma de Baja California - UABC, Mexico

Professor and Researcher at Universidad Autónoma de Baja California - UABC, Mexico. PhD candidate in Educational Sciences at the IIDE-UABC, Mexico.

Luis Horacio Pedroza Zúñiga, Universidad Autónoma de Baja California - UABC, Mexico

Professor and Researcher at Universidad Autónoma de Baja California - UABC, Mexico.

Alma Yadhira López García, Universidad Autónoma de Baja California - UABC, Mexico

Professor and Researcher at Universidad Autónoma de Baja California - UABC, Mexico. Researcher at The Learning Bar, Inc., Canada.


Acree, J., Hoeve, K.B., Weir, J.B. (2016). Approaching the validation of accountability systems. Unpublished paper and presentation. ERM 600: Validity and Validation, University of North Carolina at Greensboro.

Aloisi, C. (2023). The future of standardised assessment: Validity and trust in algorithms for assessment and scoring. European Journal of Education, 58, 98–110.

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2018). Estándares para pruebas educativas y psicológicas (M. Lieve, Trans.). American Educational Research Association.

Bachman, L. (2005). Building and supporting a case for test use. Language Assessment Quarterly, 2, 1–34.

Bachman, L., & Palmer, A. (2010). Language assessment in practice. Oxford University Press.

Bardin, L. (2011). Análisis de contenido (3ª ed.). Ediciones Akal.

Bachman, L. (1990). Fundamental considerations in language testing. Oxford University Press.

Borsboom, D. (2009). Educational Measurement (4th ed.). Structural Equation Modeling: A Multidisciplinary Journal, 16(4), 702–711.

Borsboom, D., Cramer, A., Kievit, R., Scholten, A., & Franic, S. (2009). The end of construct validity. In The concept of validity (pp. 135-170).

Borsboom, D., Mellenbergh, G. J., & van Heerden, J. (2004). The concept of validity. Psychological Review, 111(4), 1061–1071.

Brennan, R. (2001a). An essay on the history and future of reliability from the perspective of replications. Journal of Educational Measurement, 36, 295–317.

Briggs, D. C. (2004). Comment: Making an argument for design validity before interpretive validity. Measurement: Interdisciplinary Research and Perspectives, 2(3), 171–191.

Carrillo, B., Sánchez, M., & Leenen, I. (2020). El concepto moderno de validez y su uso en educación médica. Investigación en Educación Médica, 98–106.

Chapelle, C. (2012). Validity argument for language assessment: The framework is simple…. Language Testing, 29(1), 19-27.

Chapelle, C. (2021). Argument-Based Validation in Testing and Assessment. SAGE.

Chapelle, C., & Sauro, S. (2017). Introduction to the Handbook of Technology and Second Language Teaching and Learning. The Handbook of Technology and Second Language Teaching and Learning, 1–9.

Chapelle, C., Enright, M., & Jamieson, J. (2008). Building a Validity Argument for the Test of English as a Foreign Language. Routledge.

Chapelle, C., Enright, M., & Jamieson, J. (2010). Does an Argument-Based Approach to validity make a difference? Educational Measurement: Issues and Practice, 29(1), 3-13.

Chomsky, N., Roberts, I., & Watumull, J. (2023, March 8). Noam Chomsky: The False Promise of ChatGPT. The New York Times.

Cizek, G. J., Kosh, A. E., & Toutkoushian, E. K. (2018). Gathering and Evaluating Validity Evidence: The Generalized Assessment Alignment Tool. Journal of Educational Measurement, 55(4), 477–512.

Cohen, L., Manion, L. & Morrison, K. (2007). Research methods in education. Routledge.

Cook, D., Brydges, R., Ginsburg, S., & Hatala, R. (2015). A contemporary approach to validity arguments: a practical guide to Kane’s framework. Medical Education, 49, 560-575. doi: 10.1111/medu.12678

Cronbach, L. J. (1971). Test Validation. In R. Thorndike (Ed.), Educational Measurement (2nd ed., p. 443). American Council on Education.

Cronbach, L. J. (1982). Designing evaluations of educational and social programs. Jossey-Bass.

Cronbach, L. J. (1989). Construct validation after thirty years. In R. E. Linn (Ed.), Intelligence: Measurement, theory, and public policy (pp. 147–171). Urbana: University of Illinois Press.

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302.

Cureton, E. E. (1951). Validity. In E. F. Lindquist (Ed.), Educational measurement (pp. 621–694). Washington, DC: American Council on Education.

De Jong Gierveld, J. (1987). Developing and testing a model of loneliness. Journal of Personality and Social Psychology, 53(1), 119-128.

De Jong Gierveld, J., & van Tilburg, T. G. (1992). Triangulatie in operationalization method. In G. J. N. Bruinsma & M. A. Zwanenburg (Eds.), Methodologie voor Bestuurskundigen: Stromingen en Methoden (pp. 273-298).

De Jong Gierveld, J., & van Tilburg, T. G. (2011). Manual of the Loneliness Scale 1999. Vrije Universiteit, Department of Social Research Methodology.

Delgado-Rico, E., Carretero-Dios, H., & Ruch, W. (2012). Content validity evidences in test development: An applied perspective. International Journal of Clinical and Health Psychology, 12(3), 449–459.

Derrida, J. (1997). Una filosofía deconstructiva. Zona erógena, 35.

Arya, D., Clairmont, A., Katz, D., & Maul, A. (2020). Measuring Reading Strategy Use. Educational Assessment, 25(1), 5–30.

Embretson, S. (2007). Construct Validity: A Universal Validity System or Just Another Test Evaluation Procedure? Educational Researcher, 36, 449-455.

Embretson, S. (2016). An Integrative Framework for Construct Validity.

Embretson, S., & Gorin, J. (2001). Improving Construct Validity With Cognitive Psychology Principles. Journal of Educational Measurement, 38(4), 343–368.

Johnson, E. S., Crawford, A., Moylan, L. A., & Zheng, Y. (2020). Validity of a Special Education Teacher Observation System. Educational Assessment, 25(1), 31–46. doi:10.1080/10627197.2019.1702461

Fabrigar, L. R., Wegener, D. T., & Petty, R. E. (2020). A Validity-Based Framework for Understanding Replication in Psychology. Personality and Social Psychology Review, doi:10.1177/1088868320931366

Fan, J. (2014). Chinese test takers’ attitudes towards the Versant English Test: a mixed-methods approach. Language Testing in Asia, 4(1). doi:10.1186/s40468-014-0006-9

Ferrara, S. (2007). Our field needs a framework to guide development of validity research agendas and identification of validity research questions and threats to validity. Measurement: Interdisciplinary Research and Perspectives, 5(2–3), 156–164.

Gafni, N. (2016). Comments on implementing validity theory. Assessment in Education: Principles, Policy & Practice.

Gallent-Torres, C., Zapata-González, A., & Ortego-Hernando, J.L. (2023). El impacto de la inteligencia artificial generativa en educación superior: una mirada desde la ética y la integridad académica. RELIEVE, 29(2), art. M5.

García-Medina, A., Martínez-Rizo, F., Cordero-Arroyo, G., & Caso-Niebla, J. (2017). Evolución del concepto de validez en la medición educativa.

Garfield, E. (1979). Citation Indexing—Its Theory and Application in Science, Technology, and Humanities. Wiley.

Haertel, E. (2013). How is testing supposed to improve schooling? Measurement: Interdisciplinary Research and Perspectives, 11(1-2), 1-18.

Hoeve, K. B. (2022). A validity framework for accountability: educational measurement and language testing. Language Testing in Asia, 12, 3.

Hornberger, M., Bewersdorff, A., & Nerdel, C. (2023). What do university students know about Artificial Intelligence? Development and validation of an AI literacy test. Computers and Education: Artificial Intelligence, 5, 100165.

Jawhar, S., Al, M., Alhawsawi, S., & Alkushi, A. (2021). Validating English Language Entrance Test at a Saudi University for Health Sciences. Arab World English Journal (AWEJ), 12(2), 49–71.

Kane, M. (2002). Inferences about Variance Components and Reliability-Generalizability Coefficients in the Absence of Random Sampling. Journal of Educational Measurement, 39 (2), 165-181.

Kane, M. (2006a). Content-Related Validity Evidence in Test Development. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development (pp. 131–153). Lawrence Erlbaum Associates Publishers.

Kane, M. (2006b). Current Concerns in Validity Theory. Journal of Educational Measurement, 38(4), 319-342.

Kane, M. (2011). Validating score interpretations and uses. Language Testing, 29(1), 3–17. doi:10.1177/0265532211417210

Kane, M. (2013a). Validating the interpretations and Uses of Test Scores. Journal of Educational Measurement, 50(1), 1-73.

Kane, M. (2013b). The Argument-Based Approach to Validation. School Psychology Review, 42(4), 448–457.

Kane, M. T. (2001). Current concerns in validity theory. Journal of Educational Measurement, 38(4), 319-342.

Kerlinger, F., & Lee, H. (2001). Investigación del comportamiento: métodos de investigación en ciencias sociales. McGraw Hill.

Koretz, D. (2008). Measuring up. What educational testing really tells us. Harvard University Press.

Koselleck, R. (2000). Los estratos del tiempo: estudios sobre la historia. Paidós Ibérica.

LaFlair, G. T., & Staples, S. (2017). Using corpus linguistics to examine the extrapolation inference in the validity argument for a high-stakes speaking assessment. Language Testing, 34(4), 451–475. doi:10.1177/0265532217713951

Lavery, M., Bostic, J., Kruse, L., Krupa, E., & Carney, M. (2020). Argumentation Surrounding Argument‐Based Validation: A Systematic Review of Validation Methodology in Peer‐Reviewed Articles. Educational Measurement: Issues and Practice. doi:10.1111/emip.12378

Lindquist, E. F. (Ed.). (1951). Educational measurement. American Council on Education.

Lingard, L. Writing with ChatGPT: An Illustration of its Capacity, Limitations & Implications for Academic Writers. Perspectives on Medical Education, 12(1), 261–270.

Lissitz, R. (2009). The concept of validity: Revisions, new directions, and applications. Information Age Publishing.

Markus, K. & Borsboom, D. (2013). Frontiers of Test Validity Theory. Measure, Causation and Meaning. Routledge.

Messick, S. (1989). Meaning and values in test validation: The science and ethics of assessment. Educational Researcher, 18(2), 5–11.

Mislevy, R., Steinberg, L., & Almond, R. (2003). On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives, 1, 3–62.

Nasution, N. E. A. (2023). Using artificial intelligence to create biology multiple choice questions for higher education. Agricultural and Environmental Education, 2(1), em002.

Newton, P., & Shaw, S. (2014). Validity in educational & psychological assessment. SAGE.

Newton, P. E., & Baird, J.-A. (2016). The great validity debate. Assessment in Education: Principles, Policy & Practice, 23(2), 173–177.

Pedrosa, I., Suárez-Álvarez, J., & García-Cueto, E. (2013). Evidencias sobre la validez de contenido: avances teóricos y métodos para su estimación. Acción Psicológica, 10(2), 3-18.

Pellegrino, J. W., DiBello, L. V., & Goldman, S. R. (2016). A framework for conceptualizing and evaluating the validity of instructionally relevant assessments. Educational Psychologist, 51(1), 59–81.

Santamaría, F. (2012). De la analítica al (neo) pragmatismo. El giro de la filosofía anglosajona. Revista Colombiana de Humanidades, 80, 105-143.

Schilling, S. G. (2004). Conceptualizing the Validity Argument: An Alternative Approach. Measurement: Interdisciplinary Research and Perspectives, 2(3), 178–182.

Schmidt, T., & Strasser, T. (2022). Artificial Intelligence in Foreign Language Learning and Teaching. Anglistik, 33(1), 165–184.

Shepard, L. (2016). Evaluating test validity: reprise and progress. Assessment in Education: Principles, Policy & Practice, 23(2), 268–280.

Sijtsma, K. (2009). Correcting Fallacies in Validity, Reliability, and Classification. International Journal of Testing, 9, 167–194.

Sireci, S. (2013). Agreeing on validity arguments. Journal of Educational Measurement, 50(1), 99–104.

Sireci, S. G. (2007). On Validity Theory and Test Validation. Educational Researcher, 36(8), 477–481.

Sireci, S. G. (2016). On the validity of useless tests. Assessment in Education: Principles, Policy and Practice, 23.

Sireci, S., & Faulkner-Bond, M. (2014). Validity evidence based on test content. Psicothema, 26(1), 100–107.

Sireci, S., & Doğan, N. (2017). Interview with Stephen G. Sireci on Validity. Eğitimde ve Psikolojide Ölçme ve Değerlendirme, 8, 158–168.

Thorndike, R. M. (1997). Measurement and evaluation in psychology and education (6th ed.). Merrill Publishing Co/Prentice-Hall.

Toulmin, S. (1958). The uses of argument. Cambridge University Press.

Watson, P. (2002). Introducción: la evolución de las leyes del pensamiento. In Historia intelectual del siglo XX (pp. 11–15). Crítica.

Zumbo, B., & Chan, E. (Eds.). (2014). Validity and Validation in Social, Behavioral, and Health Sciences. Springer.




How to Cite

Ruiz Mendoza, K. K., Pedroza Zúñiga, L. H., & López García, A. Y. (2024). Validity in evaluation: where is the argument-based approach heading? Sapienza: International Journal of Interdisciplinary Studies, 5(3), e24048.



Economic & Social Sciences - Original Articles