Test Development and Content
LanguageScreen was developed to provide education professionals with a quick and accurate way of assessing children's language skills, with a particular emphasis on identifying children who would likely benefit from language support. Initial selection of items was guided by linguistic and psycholinguistic factors. Subsequently, based on extensive pilot data, items were retained or replaced to ensure good coverage of the range of ability targeted by the test. Pictures were selected as being culturally appropriate for the British context. It is acknowledged that adaptations to the test may be required for use in different cultures.
Expressive Vocabulary (EV). The starting point for this test was a graded set of 20 items for naming (from Snowling et al., 1988), supplemented by items chosen from "age of acquisition" tables (Ellis & Morrison, 1998). Pictures of the items that were considered unambiguous were arranged in order of difficulty for piloting. When implemented in the app, the child sees a series of stylized colored pictures and is asked to name each one. The assessor presses a button on the screen to indicate whether the response is correct or incorrect. The test contains 24 items ranging in age of acquisition from 22.1 to 140 months (Morrison & Ellis, 2000): bed, castle, ladder, umbrella, bell, glove, sword, drawer, scarecrow, whale, volcano, fence, wheelbarrow, acorn, plug, anchor, stool, handcuffs, parachute, eyelash, envelope, needle, stethoscope, and pliers. Testing discontinues after eight consecutive errors.
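For illustration only, a discontinuation rule of this kind can be implemented as a running count of consecutive incorrect responses; the sketch below is not taken from the app itself, and the function name and response encoding are hypothetical.

```python
# Illustrative sketch of a consecutive-error discontinuation rule of the kind
# described above (eight consecutive errors for EV and RV, five for SR).
# The function name and response encoding are hypothetical, not from the app.

def should_discontinue(responses: list[int], max_consecutive_errors: int) -> bool:
    """Return True once a run of incorrect responses (coded 0) reaches the threshold."""
    run = 0
    for score in responses:          # scores arrive in administration order
        run = run + 1 if score == 0 else 0
        if run >= max_consecutive_errors:
            return True
    return False

# Example: seven errors in a row does not trigger the rule; an eighth would.
print(should_discontinue([1, 0, 0, 0, 0, 0, 0, 0], max_consecutive_errors=8))  # False
print(should_discontinue([1] + [0] * 8, max_consecutive_errors=8))             # True
```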
Receptive Vocabulary (RV). The choice of target items for the receptive vocabulary test followed the same process as for the expressive vocabulary test. Following the work of Snowling et al. (1988), each target was paired with a similar-sounding (phonological) distractor, a meaning-related (semantic) distractor, and an unrelated distractor (see Table 1). The selection of distractors was based on confusability with reference to phonology and semantics; distractors were not closely matched for frequency of occurrence with the targets. When implemented in the app, the child hears a word and is asked to touch one of the four stylized colored pictures that corresponds to the word presented. There are 23 items ranging in age of acquisition from 22.1 to 140 months (Morrison & Ellis, 2000). Testing discontinues after eight consecutive errors.
Sentence Repetition (SR). For this test, the child hears a spoken sentence and is asked to repeat it verbatim. Twenty-two items, chosen to reflect a range of sentence structures from an experimental sentence repetition test, were piloted; these were arranged in order of difficulty according to data from 260 children assessed at the ages of 6 and 8 years participating in the Wellcome Language and Reading Project (Snowling et al., 2019). Accuracy was scored following each item (correct/incorrect), and a single error made by the child rendered that item incorrect. Following item analyses, 14 items were chosen for use in the app (see Table 2). Testing discontinues after five consecutive errors.
Listening Comprehension (LC). The listening comprehension test is an adapted version of one used in an evaluation of the Nuffield Early Language Intervention program (Fricke et al., 2013). The child hears three short stories (without pictorial support), and immediately after hearing each story, they answer questions posed about the content of the story. There are 16 questions that include both literal (factual) and inferential questions. The examiner is presented with acceptable responses for each question on the screen to facilitate scoring. Each question is scored as correct/incorrect (1/0) by tapping buttons on the screen. Testing continues provided the child answers at least one question correctly on the first two passages (the test is discontinued if they answer all questions on the first two passages incorrectly).
Administration
The app and website, to which data are uploaded, are designed to be highly secure. To use the app for assessments, the user first creates an account and enters the details of the children to be assessed (name, gender, date of birth). Once an account is created and details of the children are uploaded, the user can download a set of QR codes for the children to be assessed. The user then downloads the app to an Android or Apple tablet or phone. To begin an assessment, the examiner scans the QR code to identify the child, and the assessment begins. The instructions make clear that children's responses should be scored for accuracy discounting dialectal variation. An assessment can be paused if necessary and then restarted at the point of pausing by rescanning the child's QR code. The app stores no personally identifiable information about the child being tested.
The four subscales take roughly 10 min to administer, and data are automatically uploaded to a secure server where the child's test data are linked to their personal details (including the child's name, date of birth, and date of testing). A report of the scores for each child can then be downloaded from the user's account. The report provides lists of the children who have been assessed, ranked by overall language standard scores for each year group, along with instructions on how to interpret scores.
Participants
The LanguageScreen app was supplied to approximately 10,000 schools as part of a COVID-19 catch-up scheme in English preschools and primary schools. Screening was carried out by teachers and their assistants in these schools, who did not receive training to use the app. Schools were asked to test all children in each classroom that was screened. Data were available from 8,273 schools containing 348,944 children for the present analyses, indicating that schools tested approximately 42 children on average (where a typical class size is in the region of 25, but many schools had only one class per year group). All pupils up to the age of 9 years were eligible for testing. In practice, most children assessed were in the first year of formal schooling (referred to as reception in England, with pupils entering reception at the age of 4.5 years). Of the sample, 168,931 were identified as female and 178,907 were identified as male (a total of 1,106 were identified as either “unknown” or “other” in terms of gender).
Statistical Model and Analysis Plan
The study was designed to be analyzed using the Rasch model to determine item characteristics and was conceived within a theoretical framework that views language as a unitary (latent) trait (e.g., Tomblin & Zhang, 2006). LanguageScreen consists of 77 dichotomously scored items from the four subscales. The items in each subscale are presented in order of difficulty (easiest to hardest), as determined in earlier pilot phases in which the items were administered to smaller samples, evaluated for fit using the Rasch model, and refined or omitted where appropriate. Below, we present data concerning the psychometric properties of LanguageScreen based on a very large standardization sample.
Cronbach's alpha (α) coefficient and McDonald's omega hierarchical (ωh) coefficient were used to evaluate total score reliability, and the latter was calculated according to the procedure suggested by Flora (2020). First, a confirmatory bifactor analysis was applied in which all items loaded on a general factor as well as a specific factor for their respective subscale. This confirmatory model showed excellent fit to the test data (comparative fit index = .95, Tucker–Lewis index = .94, root-mean-square error of approximation [RMSEA] = .02) and was used to calculate the omega hierarchical coefficient according to Green and Yang's (2009) formulation. Unlike the alpha coefficient, the omega hierarchical coefficient provides a reliability estimate for the variance accounted for by just the general factor and thus provides evidence of the degree of unidimensionality across the items (Revelle & Zinbarg, 2009).
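As a rough illustration of these two coefficients, the sketch below computes Cronbach's alpha directly from the item scores and omega hierarchical from a set of bifactor loadings using the simpler continuous-indicator decomposition (the full Green & Yang, 2009, formulation for categorical items additionally uses thresholds and polychoric correlations). The array names are illustrative; in the actual analysis, the loadings would come from the fitted confirmatory bifactor model.

```python
# Minimal sketch of the two reliability coefficients described above, using
# NumPy only. The continuous-indicator form of omega hierarchical is a
# simplification of the Green and Yang (2009) categorical formulation.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: (n_persons, n_items) matrix of 0/1 scores."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

def omega_hierarchical(general: np.ndarray, specific: np.ndarray, uniqueness: np.ndarray) -> float:
    """general: loadings on the general factor (length k);
    specific: (k, n_subscales) loadings on the subscale-specific factors;
    uniqueness: residual (unique) variances (length k).
    Omega_h = variance due to the general factor / total modelled variance."""
    gen_var = general.sum() ** 2
    spec_var = (specific.sum(axis=0) ** 2).sum()
    total_var = gen_var + spec_var + uniqueness.sum()
    return gen_var / total_var
```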
The item response data were analyzed using the Rasch model to evaluate the reliability and sufficiency of the total test score; the item difficulties and their fit to the model; and the invariance of the assessment across age, gender, and English as an additional language (EAL) status (Andrich, 2005). The Rasch model was chosen because LanguageScreen was developed to provide a total score that gives a reliable measure of a unidimensional language construct. LanguageScreen was developed, piloted, and refined in accordance with Rasch measurement theory to establish the sufficiency, reliability, and validity of this total score (Andrich, 2018). Reliability was evaluated in terms of the person separation reliability (PSR) statistic, which is analogous to the alpha coefficient and is an estimate of the ratio of true variance to observed variance. Overall model fit was evaluated in terms of the RMSEA (cutoff value of .06) and standardized root-mean-square residual (SRMSR; cutoff value of .08) values, and item fit was evaluated using the infit mean-square residual statistic, with critical values of less than 0.8 and greater than 1.2, as well as by graphical inspection of the item characteristic curves.
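The sketch below illustrates how the infit mean-square and PSR statistics referred to above are defined, under the simplifying assumption that person abilities, their standard errors, and item difficulties have already been estimated from a fitted Rasch model; the function and variable names are illustrative rather than drawn from the software used for the analysis.

```python
# Sketch of the Rasch-based fit and reliability statistics described above,
# assuming person abilities (theta), their standard errors (se_theta), and
# item difficulties (b) are already estimated. Names are illustrative.
import numpy as np

def rasch_prob(theta: np.ndarray, b: np.ndarray) -> np.ndarray:
    """P(correct) under the dichotomous Rasch model; persons in rows, items in columns."""
    return 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))

def infit_mean_square(responses: np.ndarray, theta: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Information-weighted mean-square residual per item
    (values below ~0.8 or above ~1.2 flag misfit, as in the text above)."""
    p = rasch_prob(theta, b)
    w = p * (1 - p)                       # binomial variance of each response
    sq_resid = (responses - p) ** 2
    return sq_resid.sum(axis=0) / w.sum(axis=0)

def person_separation_reliability(theta: np.ndarray, se_theta: np.ndarray) -> float:
    """PSR = (observed person variance - mean error variance) / observed person variance."""
    obs_var = theta.var(ddof=1)
    err_var = (se_theta ** 2).mean()
    return (obs_var - err_var) / obs_var
```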
The invariance of the assessment was evaluated in terms of differential item functioning (DIF) by age, gender, and EAL status using a logistic regression approach. For the latter two variables, only those who identified as male or female and those identified as EAL or non-EAL were included in the analysis. This approach to estimating DIF has been broadly applied and was chosen here because it allows for both continuous (age in months) and categorical (gender, EAL status) predictors, and it enables the investigation of both uniform DIF, which indicates differences in the items' difficulty across groups, and nonuniform DIF, which indicates differences in the items' discrimination across groups (Swaminathan & Rogers, 1990). This approach involves estimating three logistic regression models for each item: (a) a base model that includes only the ability estimate as a predictor, which, in this case, was the ability estimate from the Rasch analysis; (b) a model that includes both the ability estimate and the group factor as predictors, which is used to evaluate uniform DIF through comparison with the base model; and (c) a model that includes the ability estimate, the group factor, and their interaction as predictors, which is used to evaluate nonuniform DIF through comparison with the second model. Given the extremely large sample size, trivial differences in item difficulties between the groups will be statistically significant. Thus, the magnitude of each item's uniform and nonuniform DIF was evaluated in terms of differences in Nagelkerke's (1991) pseudo-R² effect size measure across the models, and these pseudo-R² differences were further categorized using Jodoin and Gierl's (2001) recommendations into three categories: A = negligible, B = moderate, and C = large.
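A minimal sketch of this three-model procedure is given below, assuming a dichotomously scored item, a Rasch ability estimate, and a binary group indicator; the variable names are illustrative, and the classification uses the commonly cited Jodoin and Gierl (2001) cutoffs of .035 and .070 for the pseudo-R² differences.

```python
# Sketch of the logistic-regression DIF procedure described above, written
# with statsmodels. Inputs: item (0/1 scores), ability (Rasch estimates),
# group (0/1 indicator). Names and cutoffs are as stated in the lead-in.
import numpy as np
import statsmodels.api as sm

def nagelkerke_r2(result) -> float:
    """Nagelkerke pseudo-R2 from a fitted statsmodels Logit result."""
    n = result.nobs
    cox_snell = 1 - np.exp((2 / n) * (result.llnull - result.llf))
    max_r2 = 1 - np.exp((2 / n) * result.llnull)
    return cox_snell / max_r2

def dif_effect_sizes(item: np.ndarray, ability: np.ndarray, group: np.ndarray):
    """Return (uniform DIF delta-R2, nonuniform DIF delta-R2) for one item."""
    base = sm.add_constant(np.column_stack([ability]))
    with_group = sm.add_constant(np.column_stack([ability, group]))
    with_interaction = sm.add_constant(np.column_stack([ability, group, ability * group]))
    r2 = [nagelkerke_r2(sm.Logit(item, X).fit(disp=0))
          for X in (base, with_group, with_interaction)]
    return r2[1] - r2[0], r2[2] - r2[1]

def classify(delta_r2: float) -> str:
    """A = negligible, B = moderate, C = large."""
    return "A" if delta_r2 < 0.035 else ("B" if delta_r2 < 0.070 else "C")
```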