
Methods for Establishing Validity and Reliability of the SPEAK Assessment


The SPEAK Assessment is a computer-based test of spoken English. Its purpose is to assess an applicant’s English language competence in professional and academic environments where English is necessary for successful communication. This paper explains the methods used to establish the validity claim that the SPEAK Assessment is suitable for that purpose, and the methods that ensure the reliability of its results.

Structure of the Test

The test is composed of adaptive multiple-choice questions, short-answer comprehension questions, and open-ended questions requiring extended answers. Applicants are rated on vocabulary range, phonological control, fluency, cohesion, grammatical accuracy, and comprehension.

The Construct Validity of the SPEAK Assessment

Validity of Levels

Construct validity is "the degree to which a test measures what it claims, or purports, to be measuring" (Brown, 1996, p. 231). In this case, the claim is that the test measures the applicant’s level of competence for successful oral communication in a professional or academic context. Part of the validity of the SPEAK Assessment is obtained through alignment with the CEFR (the Common European Framework of Reference for Languages). The CEFR is an international standard that provides an established range of language proficiency levels required for various purposes. For example, B2 is widely seen as “the first level of proficiency of interest for office work,” C1 is described as “effective operational proficiency,” and C2 as “practical mastery of the language as a non-native speaker” (North, 2007). Because these levels have been tested and validated in a wide range of contexts and countries, they provide a robust measure of validity for the SPEAK Assessment.

The use of CEFR descriptors helps a prospective employer or educator determine whether a given applicant has the appropriate level of English to communicate effectively in the given setting. Studies have validated the B2 level, for example, as a requirement for academic success (Carlsen, 2018). Clearly, different jobs and tasks require different levels of proficiency. Accordingly, SPEAK does not establish a “passing score” but instead describes the applicant’s proficiency so that the prospective employer or educator can determine whether it is sufficient.

Validity of Test Items

Test items (i.e., questions and tasks) are written by native English-speaking writers with experience in both instruction and assessment of language learners. Item writers undergo training that familiarizes them with the CEFR levels, and items are written to test skills at different CEFR levels. Each item goes through peer review for level, wording, and difficulty, and every item is field-tested with language learners at various levels of proficiency.

All test items are subjected to Rasch analysis, which measures the relative difficulty of the questions and ensures that applicants at lower levels of proficiency receive questions that would not be challenging for those at higher levels, while applicants at higher levels receive questions that elicit the full range of their abilities. The use of Rasch analysis provides further validation of our levels (Karlin & Karlin, 2018).
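At its core, the Rasch model expresses the probability of a correct response as a logistic function of the difference between a person’s ability and an item’s difficulty, both on the same logit scale. The sketch below is illustrative only; the function name is an assumption, and the parameter estimation that the actual analysis performs on field-test data is not shown.

```python
import math

def rasch_p_correct(ability: float, difficulty: float) -> float:
    """Rasch model: probability that a person with the given ability
    (in logits) answers an item of the given difficulty correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# A person whose ability equals the item's difficulty has a 50% chance
# of answering correctly; the probability rises as ability exceeds
# difficulty and falls as difficulty exceeds ability.
```

Fitting item difficulties to field-test responses places all items on one scale, which is what allows the adaptive test to give lower-level applicants items that higher-level applicants would find trivial, and vice versa.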

Validity of the Assessment Format

The cornerstone of the SPEAK Assessment is its open-response questions. These questions range in complexity and are assigned based on an adaptive measure of grammatical, lexical, and pragmatic proficiency. The adaptive questions generate an estimate of the applicant’s overall receptive language level as beginner, independent, or proficient. Based on this estimate, applicants are assigned questions designed to elicit their highest productive abilities, both lexical and grammatical. The open-ended questions increase in level of abstraction: questions at higher levels demand more linguistic complexity and the ability to take a stance on an issue, while lower-level questions concern day-to-day, familiar topics and require less grammatical and lexical sophistication.

Using open-ended questions allows the assessment to approximate authentic language-use tasks while retaining the degree of control that makes automated assessment possible.
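The routing step described above can be sketched as a simple mapping from the adaptively estimated receptive level to a pool of open-ended prompts. The level names follow the text; the prompts, pool structure, and function name here are hypothetical, not SPEAK’s actual item bank.

```python
# Hypothetical prompt pools keyed by the adaptively estimated level.
# Note how abstraction increases: routine description -> narration of a
# solved problem -> taking and defending a stance on an issue.
QUESTION_POOLS = {
    "beginner": ["Describe what you do on a typical day."],
    "independent": ["Describe a problem you solved at work and how you solved it."],
    "proficient": ["Should companies require employees to work on site? Defend your position."],
}

def assign_open_ended(estimated_level: str) -> list:
    """Route an applicant to the open-ended prompts matching the
    receptive level estimated by the adaptive items."""
    if estimated_level not in QUESTION_POOLS:
        raise ValueError("unknown level: " + repr(estimated_level))
    return QUESTION_POOLS[estimated_level]
```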

Validity of Measures Tested

The SPEAK assessment focuses on the accuracy, range, and intelligibility of spoken English. These measures cover the domains of linguistic competence and some aspects of pragmatic competence. All of these measures have been shown to significantly influence raters’ (both professional and lay) judgments of language proficiency.


Vocabulary range has been established as a necessary skill for success in academic and professional contexts. Research has established a threshold of a productive vocabulary of approximately 3000 words for success at the university level (Nation, 1993; Ab Manan et al., 2017).


Features related to fluency, such as the location and duration of pauses, have also been shown to contribute to judgments of a speaker’s proficiency, as do grammar and vocabulary, to a lesser extent (Saito et al., 2016).


Pronunciation is frequently cited as a source of negative judgments of non-native speakers of English in both the scholarly and popular literature, with regard to international teaching assistants (e.g., Isaacs, 2008) and business professionals (Executive Education, 2013). While many of these judgments may reflect biases or judgments of non-linguistic proficiencies, presenting pronunciation scores together with the rest of the applicant’s language proficiencies allows a determination of the speaker’s intelligibility, of the extent to which the accent may or may not impede communication, and of the degree to which the pronunciation reflects a more general linguistic proficiency. In this way, the SPEAK Assessment can be used in conjunction with training of HR personnel to mitigate bias against high-level non-native speakers of English.


According to most linguists and language teachers, “the primary goal of language learning today is to foster communicative competence, or the ability to communicate effectively and spontaneously in real-life settings” (Purpura, 2004). The SPEAK Assessment therefore distinguishes between the ability to judge grammatical accuracy and the ability to produce grammatically accurate speech. The first measure, in conjunction with questions about vocabulary and communicative knowledge, is used to estimate a general proficiency level (beginner, independent, or proficient). The ability to use grammar effectively in speech is judged through assessment of the open-ended questions.

Concurrent Validity of the SPEAK Assessment

The SPEAK Assessment uses rubrics similar to those of other tests of spoken English and measures similar parameters. Accordingly, its results are comparable to those of other spoken-English tests.

Reliability of the SPEAK Assessment

Rating the Exams

Exams are rated by the SPEAK in-house algorithm, which is normed to the ratings of trained human raters. To ensure high-quality data for training the algorithm, human raters undergo a training process in the application of the SPEAK rubrics, which are closely aligned with the CEFR and with exams that rate based on the CEFR. Raters were trained in consultation with an experienced trainer of raters for similar exams. Rater training includes familiarization with the SPEAK rubrics, practice identifying the rubric features present in leveled audio files, and blind rating of a set of audio files. Training continues until the rater can consistently and reliably rate a set of audio files.

Benchmark exams at each level were prepared for training through a blind rating process in which multiple expert raters, including external consultants, rated sets of exams. Their scores were compared, and benchmark scores were assigned based on the median or the mode, whichever was more applicable. The exams were then annotated and became the training sets used to train new raters. Raters are trained with these sets until they achieve consistency of over 85%.
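As a sketch of the benchmarking step, one might combine numeric sub-scores with the median and categorical CEFR labels with the mode. The function below is a hypothetical illustration of that rule, not SPEAK’s actual code.

```python
import statistics

def benchmark(ratings):
    """Collapse several expert ratings into one benchmark score:
    the median for numeric ratings, the mode for categorical labels
    (e.g., CEFR levels such as "B2")."""
    if all(isinstance(r, (int, float)) for r in ratings):
        return statistics.median(ratings)
    return statistics.mode(ratings)
```

The median resists a single outlying numeric score, while the mode picks the level most experts agreed on, which matches the median-or-mode choice described above.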

Human raters are used on an ongoing basis to ensure the accuracy of the algorithm. At least 10% of all exams, whether rated by a human rater or by machine grading, are subjected to a second rating by a different human rater. If the first two ratings disagree, a third human rater is employed. Ratings are continually monitored to ensure consistent assignment of CEFR levels; when a rater’s consistency drops below 80%, the rater is retrained.
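The double-rating workflow above can be sketched as follows. The function and its tie-breaking rule (majority across three ratings) are an illustrative assumption, not a specification of the production logic.

```python
from collections import Counter

def resolve_final_level(ratings):
    """If the first two ratings agree, that CEFR level is final.
    Otherwise a third rating is required and the majority level wins.
    Returns None while further ratings are still needed."""
    if len(ratings) >= 2 and ratings[0] == ratings[1]:
        return ratings[0]
    if len(ratings) >= 3:
        level, _count = Counter(ratings).most_common(1)[0]
        return level
    return None  # escalate: another human rating is required
```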

Human raters also review any exams that have been flagged as problematic for any of a variety of reasons (questionable audio, suspicious behavior, etc.). This process ensures that the results provided to customers, and those used to refine the SPEAK algorithm, are consistent and of high quality.

Our inter-rater reliability statistics are shown in Table 1.

Table 1: Inter-rater reliability by measure

Measure             Spearman’s ρ
CEFR Total Score    0.851843
Fluency             0.939273
Pronunciation       0.925404
Vocabulary          0.933235
Grammar             0.934466
Mean Score          0.969923


Works Cited

Ab Manan, N., Azizan, N., & Mohs Nasir, N. F. W. (2017). Receptive and Productive Vocabulary Level of Diploma Students from a Public University in Malaysia. Journal of Applied Environmental and Biological Sciences, 7, 53-59.

Carlsen, C. H. (2018). The Adequacy of the B2 Level as University Entrance Requirement. Language Assessment Quarterly, 15(1), 75–89. doi: 10.1080/15434303.2017.1405962

Executive Education. (2013, December 7). The Glass Ceiling Facing Nonnative English Speakers -- K@W. Retrieved from https://knowledge.wharton.upenn.edu/article/glass-ceiling-facing-nonnative-english-speakers/

Isaacs, T. (2008). Towards Defining a Valid Assessment Criterion of Pronunciation Proficiency in Non-Native English-Speaking Graduate Students. Canadian Modern Language Review, 64(4), 555–580. doi: 10.3138/cmlr.64.4.555

Jaroszek, M. (2011). The Development of Conjunction Use in Advanced L2 Speech. Studies in Second Language Learning and Teaching, 1(4), 533–553. Retrieved from https://search-ebscohost-com.ezproxy.snhu.edu/login.aspx?direct=true&db=eric&AN=EJ1136573&site=eds-live&scope=site

Karlin, O., & Karlin, S. (2018). Making Better Tests with the Rasch Measurement Model. InSight: A Journal of Scholarly Teaching, 13, 76–100. Retrieved from https://files.eric.ed.gov/fulltext/EJ1184946.pdf

Nation, I.S.P. (1993) Vocabulary size, growth and use. In The Bilingual Lexicon. R. Schreuder and B. Weltens (eds.), Amsterdam/Philadelphia: John Benjamins: 115-134.

North, B. (2007, February 6). Common European Framework of Reference for Languages (CEFR). Retrieved from https://www.coe.int/en/web/common-european-framework-reference-languages/documents

Purpura, J. (2004). Differing notions of ‘grammar’ for assessment. In Assessing Grammar (Cambridge Language Assessment, pp. 1-23). Cambridge: Cambridge University Press. doi:10.1017/CBO9780511733086.002

Sato, T. (2013). The Influential Features on Linguistic Laypersons’ Evaluative Judgments of Second Language Oral Communication Ability. JLTA Journal, 16, 107–126. doi: 10.20622/jltajournal.16.0_107

Saito, K., Trofimovich, P., & Isaacs, T. (2016). Second language speech production: Investigating linguistic correlates of comprehensibility and accentedness for learners at different ability levels. Applied Psycholinguistics, 37(2), 217-240. doi:http://dx.doi.org.ezproxy.snhu.edu/10.1017/S0142716414000502
