The Dyslexia Screening Test App provides users with 3 kinds of information:
- the likelihood that a particular test subject will experience dyslexia-related challenges at school or work
- information about a subject’s cognitive learning and processing features, and especially those that often accompany a dyslexic processing style, which will be relevant to providing that individual with appropriate learning and instruction
- individualized recommendations regarding instruction, accommodations, remediation, and/or need for additional assessment
The two most commonly used definitions of dyslexia in the English-speaking world agree in two key respects:
- That dyslexia’s core diagnostic feature is difficulty developing fast and accurate reading and spelling skills (particularly at the level of sounding out and spelling words) which are unexpected in relation to an individual’s intelligence, age, and education.
- That challenges with the phonological component of language (essentially the way the brain identifies and processes sounds in words) are a key component of the brain-based differences underlying dyslexic reading and spelling challenges. (See Shaywitz, S. Overcoming Dyslexia for a good review of the phonological aspects of dyslexia.)
The Rose Review, which is used in the UK, also notes that challenges with verbal memory and processing speed have been shown to play a role in dyslexic learning challenges. (See ref. 2)
Each definition was based on a comprehensive review of the dyslexia research available when it was formulated, and each reflected the broad consensus among dyslexia researchers at that time.
As additional research has accumulated, several important additional points about dyslexia have been recognized:
- Dyslexia has increasingly been recognized as a syndrome that is multifactorial in origin, having many possible underlying cognitive contributors rather than just one (i.e., phonological impairments). These factors include not only verbal memory and processing speed (as per the Rose Review), but also word retrieval and naming speed, visual attention, and overall language ability. (See ref. 4)
- Consistent with the growing recognition of this multifactorial origin, it has been increasingly recognized that the overall likelihood that an individual will show dyslexic reading and spelling challenges rises with the number of risk factors that the individual possesses. As a result, dyslexia risk assessment requires a clear and well-defined way of integrating the results from a wide variety of assessments of different cognitive skills.
- It has also become increasingly clear that students with high verbal ability may experience highly significant challenges with reading and spelling (especially with speed or fluency of reading, stamina for reading, ability to decode unfamiliar words, spelling and writing) that are clearly dyslexic in nature, despite scoring at or above population means on tests of phonological awareness, reading comprehension, and/or other traditional dyslexia assessments. (Verbal IQ, as measured on the WISC, has a correlation of approximately 80% with Vocabulary, as measured both on the WISC and our app, so our app uses Vocabulary as a stand-in for Verbal IQ.) For such students, proper identification depends on the recognition of discrepancies between their own personal areas of strength and weakness, rather than on comparisons of their weaknesses with population norms. For example, a very bright third-grade student who tests into a gifted program with a verbal IQ of 135 (99th percentile), and who tests at the 80th percentile on an untimed reading comprehension test (i.e., reading short stories or informational paragraphs) but who also tests at the 50th percentile on tests of phonological awareness and single word decoding and spelling, will very likely also show academically significant difficulties keeping up with class work in reading and writing, and will show difficulties spelling, making “silly mistakes” in reading test items, etc., that are dyslexic in nature and require special intervention, even though by most traditional measures they don’t qualify for a diagnosis of dyslexia. Similarly, among the age 5-6 pre-reading population, individual discrepancies between verbal ability and word and sub-word level reading skills can already be seen, and these often provide the only tip-off that the child is at high risk of dyslexia-related reading and spelling challenges. No other assessment currently in use identifies these children.
- In addition, our series of more than 100 consecutive dyslexic students examined in our own clinic has revealed patterns of relationships between scores on different cognitive measures that are highly distinctive and robust for dyslexic students, and highly predictive both of academic challenges and potentially successful interventions. We’ve appended a graph to this document showing how these students scored on WISC-IV IQ and WIAT-III achievement tests. This graph reveals the characteristic relationships between scores on the various subtests that dyslexic students show, and which are almost entirely preserved across the range of IQs. Note, for example, the consistent broad gap between relatively higher verbal and non-verbal comprehension scores on WISC IQ, and the relatively lower working memory and processing speed scores. Also note the across-the-board reductions in academic fluency skills in such areas as math calculation, reading rate, and essay production. These internal relationships are highly consistent at the individual as well as the population level, and form another signature for dyslexic processing that we have found to have value both in diagnosis, and in recommending interventions.
- Finally, we have found that an individual student’s variations from these characteristic patterns have predictive value in deciding what form of intervention they’ll be most likely to benefit from. For example, in the areas of reading and spelling, there are many different kinds of interventions available, some of which use muscle memory, others speech-related memory, others training of auditory processing functions, others visualization or verbal memory (mnemonics), etc. Based upon a student’s unique combination of results in the kinds of subtests featured in our app, we can guide instructors toward the use of interventions that will be especially likely to work successfully with an individual student (and away from less successful methods). The specificity with which we can deliver these results also means that our highly individualized recommendations will fit quite well in instructional settings using UDL concepts and methods. [These patterns will be used to make recommendations, but will mostly not be part of the risk assessment algorithm]
In contrast with our assessment tool, most current dyslexia assessments (like the CTOPP and RAN/RAS), and early reading screeners that predict reading acquisition without using the term dyslexia (like DIBELS), do not take into account these more recent discoveries about dyslexia. As such, they often do a poor job both of identifying dyslexic students and of suggesting appropriate interventions.
For example, these tests attempt to measure various aspects of phonological processing (like the ability to segregate, manipulate, identify, retrieve, and remember the kinds of basic sound units that make up words) and/or word retrieval or naming speed, but they all fall short in several ways:
- Although each of these tests measures one or more cognitive skills known to predispose to dyslexic reading and spelling challenges, no single test assesses all or even most of these cognitive skills.
- Some of the skills that predispose to reading and spelling challenges are not included in any current common test (e.g., visual attention).
- Though all currently common tests provide normative information on where an examinee falls in relation to others in their age group on certain cognitive skills, all fail to provide strict criteria for using these results to make a concrete assessment of the likelihood of dyslexia. This is because they all fail to provide examiners with concrete guidance in how to integrate information from the various subtests, or how to use the data generated to calculate an overall likelihood assessment. One result is examiner variability, along with uncertainty and inaccuracy in making a diagnosis; consequently, an estimated one in four dyslexic students is not being identified.
- All currently used tests fail to provide guidance regarding individual discrepancy and strength-weakness patterns as discussed above. As a consequence, all routinely fail to identify high verbal “stealth” dyslexic students, or to take advantage of the full range of cognitive skills that can be used to learn basic reading skills and/or build compensatory skills.
In addition, these tests typically require highly time- and resource-intensive one-on-one administration, and their formats are relatively unengaging for students.
Our screener provides specific results and recommendations in line with the most up to date research-based thinking on dyslexia.
 National Institute of Child Health (2002). Definition of dyslexia. Washington, DC: National Institute of Child Health and Human Development. The key part of the definition reads: ‘Dyslexia…. is characterized by difficulties with accurate and/or fluent word recognition, and by poor spelling and decoding abilities. These difficulties typically result from a deficit in the phonological component of language that is often unexpected in relation to other cognitive abilities and the provision of effective classroom instruction’.
 Rose, J. (2009). Identifying and teaching children and young people with dyslexia and literacy difficulties. Available from: http://www.teachernet.gov.uk/wholeschool/sen/ [last accessed 5 July 2009]. This definition reads: ‘Dyslexia is a learning difficulty that primarily affects the skills involved in accurate and fluent word reading and spelling. Characteristic features of dyslexia are difficulties in phonological awareness, verbal memory and verbal processing speed’.
 Shaywitz, S. (2003). Overcoming Dyslexia. New York: Simon and Schuster.
 Snowling, M., & Hulme, C. (2012). Annual Research Review: The nature and classification of reading disorders – a commentary on proposals for DSM-5. Journal of Child Psychology and Psychiatry 53:5, pp 593–607.
 Wolf, M., & Bowers, P. (1999). The double-deficit hypothesis for the developmental dyslexias. Journal of Educational Psychology 91:3, pp 415–438.
 Schneps MH, Thomson JM, Sonnert G, Pomplun M, Chen C, et al. (2013) Shorter Lines Facilitate Reading in Those Who Struggle. PLoS ONE 8: e71161 doi:10.1371/journal.pone.0071161.
 Pennington, B.F. (2006). From single to multiple deficit models of developmental disorders. Cognition, 101, 385–413.
 Eide, BL, & Eide, F. (2006). The Mislabeled Child. New York: Hyperion. Eide, BL, & Eide, F. (2011). The Dyslexic Advantage. New York: Hudson Street Press. Eide, BL, & Eide, F., 2e Newsletter. October 2005. Hoeft, Fumiko, in preparation.
 WISC-IV. (2003). San Antonio: Pearson.
 Turner, M. (1997). Psychological Assessment of Dyslexia. London: Whurr.
 Eide, BL, & Eide, F. Unpublished observation.
Estimation of Dyslexia Risk Using Matrix Factorization
Mark H. Moulton, Ph.D.
Educational Data Systems, Inc.
April 11, 2019
The Problem: Measuring Dyslexia Risk
Dyslexia refers to a cluster of related learning disabilities with roots in visual and language processing that can affect reading, spelling, speaking, writing, sounding out words mentally, and comprehension (see website for the International Dyslexia Association). In current clinical practice there is no single widely agreed upon gold standard assessment test or measure used to diagnose dyslexia. Instead, dyslexia is typically diagnosed by experts with training and experience in relevant fields of cognition or learning using a variety of tests that in turn contain a wide range of item types, including timed items. This inherent complexity has hindered the identification of a single measure or clinical result that can be used to make a diagnosis of dyslexia. Individuals who score low on one dyslexia-related dimension may score high on others. Therefore, measurement of dyslexia risk requires the ability to synthesize observations of person performance across a range of constructs (Peterson et al., 2012).
In 2016, Neurolearning SPC of Edmonds, WA, contracted with Educational Data Systems (EDS) of Morgan Hill, CA, to develop tools for measuring dyslexia risk using psychometric methods. Psychometrics employs latent trait models from the field of measurement theory, such as the Rasch model, to measure mental traits in a variety of settings, such as educational assessment, with emphasis on the following objectives (see “The Rasch Model”, Wikipedia, “Rasch Modeling”, Population Health Methods):
- Construct validity. It should be possible to show that all tests intended to measure a given mental construct contain questions (“items”) that do in fact measure that construct and no others.
- Reliability. It should be possible to show that all “valid” tests measure the intended construct to an acceptable degree of statistical precision, i.e., have standard errors that are acceptably small relative to the standard deviation of the person measures.
- Reproducibility. Person measures should be reproducible across tests, so that measures for a sample of persons obtained from one test can be expected to correlate highly with those that would be obtained if the same persons took another test of the same construct.
- Comparability. Measures derived from one test should be directly comparable to those derived from another test, even if they contain different questions—in principle equivalent to what one would expect if everyone took exactly the same test. Group averages from different years should be comparable. The same person measured at different times should yield an accurate representation of that person’s growth or decline along a scale. The procedure for making measures comparable across tests is called “equating”.
- Scale linearity. All person measures should represent positions along a linear interval scale, so that a unit change at one position of the scale has the same meaning as a unit change at any other position along the scale.
- Fairness. Tests should distinguish persons according to one or more specifically defined constructs, and no others. Measures should not be distorted by ethnicity, gender, or other extraneous factors.
The goal, in other words, was to generate dyslexia risk measures that would have the same measurement and statistical properties as those expected in the field of educational assessment. Neurolearning would provide data obtained by administering their own dyslexia risk measurement instruments to a reasonably diverse sample of persons. EDS would identify an appropriate measurement model, analyze the data, and provide one tool to “calibrate” the items (calculate their psychometric characteristics for future use) and a second tool to “score” individuals on a one-by-one basis who take those items (or a subset) in the future.
Neurolearning administered a computer application-based dyslexia screening instrument containing some 577 items to a diverse sample of 828 persons ranging in age from 7 to 82, of whom approximately 50% were chosen to have some small or large degree of dyslexia. Individuals took only those items appropriate to their age, exposing them to roughly 76% (or 443) of the item pool on average, though this varied per individual. There were 19 distinct item types intended to get at different aspects of dyslexia. These included “timed” item types in which the length of time required to answer each item was recorded and used to shed light on dyslexia risk.
Most educational assessments are built to be unidimensional—to measure proficiency along a single, well-defined construct like geometry or reading comprehension. The items are strongly correlated and selected to differ only in difficulty. Such datasets are analyzed using unidimensional models such as the Rasch, 1-PL, 2-PL, and 3-PL models from Item Response Theory. When assessments aim at multiple constructs, such as reading, writing, listening, and speaking, items are usually written—and person measures calculated—for each construct separately, creating multiple subscales. Measures across subscales may then be weighted and combined to yield a single overall measure. All items, even across subscales, are positively correlated and responses tend to be dichotomous (correct/incorrect) or polytomous (a rating scale).
The Neurolearning dyslexia dataset did not fit cleanly into this unidimensional paradigm:
- Construct. Unlike most educational constructs, dyslexia is explicitly hypothesized to be a multidimensional construct, reflecting a cluster of semi-related factors. If true, we can expect persons to score inconsistently across items, possibly even within a given item type.
- Many subscales. The relatively high number of subscales underscored the multidimensionality of the construct in question. It also posed a difficulty for potential equating designs, as equating would have to be applied separately to each subscale. This can be done when the pool of available items per subscale is large, but as the pool falls below, say, 20 items, equating designs become tenuous, as does the reliability of each subscale. The subscales in the dyslexia dataset were near this boundary in many cases.
- Timed items. Items defined as the time required to respond to a given question pose special challenges. First, there is a negative correlation between performing well on an item and the time required to answer it. With unidimensional models, this can be addressed through reverse coding, but it is not always advisable to do so. Second, time variables are continuous and not easily handled by psychometric models that assume ordinal scales. Third, time required to respond to an item is a potent and fundamentally different kind of construct than mere success on the item, and quite important for diagnosing dyslexia risk. This greatly expands the multidimensionality of the person performance space.
- Aggregation. When there are multiple subscales it is not clear how they should be weighted and combined to calculate an overall person measure. This is particularly true when there is variation in how the subscales are correlated with each other—neither fully correlated nor fully uncorrelated. If all subscales are weighted equally but most of those subscales are strongly targeted on one dyslexia dimension, then the remaining dyslexia dimensions will be poorly represented in the overall scale—and it won’t necessarily be clear that this is the case. In addition, as subscales are added or dropped from the test, or even as items are added or dropped within individual subscales, the meaning of the overall dyslexia measure may change in unpredictable and invisible ways, undermining the goal of “reproducibility”. Such issues can be quite difficult to manage over the long term, even for experts.
- Age effects. The calibration sample included persons ranging from age 7 to 82, reflecting the need for a dyslexia risk instrument capable of measuring dyslexia risk for persons of any age on a common scale. However, age has a powerful effect on how persons respond to items, what items they are able to respond to, and the probability of their success. Age is therefore an essential secondary dimension that must be modeled and disentangled from the primary dyslexia dimensions.
To address these issues, EDS recommended an alternative measurement approach drawn from the field of Machine Learning that is robust to multidimensionality and cleanly sidesteps the problem of how to weight and combine subscales—a psychometric adaptation of Matrix Factorization for use as an “expert recommender system”.
Since 2003, EDS has been developing psychometric procedures for dealing with highly multidimensional datasets such as the Dyslexia dataset, implemented through its open-source Damon software package written in Python. Damon uses a matrix factorization algorithm called Alternating Least Squares that EDS has adapted for use in the field of psychometrics to support the goals listed above: reliability, reproducibility, comparability, fairness, and a linear interval scale (Moulton, 2013). Alternating Least Squares was one of the strongest single methodologies to compete in the Netflix Prize movie ratings contest in 2009 (Zhou et al.).
Damon decomposes a person-by-item response (n-by-i) array X into an n-by-D person coordinates array R (for rows) of rank D (for dimensionality) and a D-by-i item coordinates array C (for columns), also of rank D. Their matrix product E (for estimates) approximates X, differing from it by the residuals or error matrix e.
X = RC + e
E = RC
For example, if we find that a given dataset is “best” modeled as 3-dimensional (i.e., D = 3), then we imagine each person to have a 3-dimensional “ability” and each item to have a 3-dimensional “easiness”. We may not know what these dimensions are, only that they exist and have numerical values. Suppose person A has ability [1, 2, 3] and item I has easiness [5, 2, 4]. We multiply them (calculate their dot product) as follows to get an estimate of how person A is likely to perform on item I:
E[A, I] = 1*5 + 2*2 + 3*4 = 21
In practice, the numbers in question will be in a log-odds unit (logit) metric ranging from negative to positive that is convertible into probabilities. These numbers are obtained by Alternating Least Squares as follows:
- Observation array X[raw] is normalized in one of several ways to ensure that all columns are in the same metric.
- C is populated with random numbers.
- For each row n of observations in X we calculate R[n] from C and X[n] using a simple “ordinary least squares” regression.
- Do this for all rows in R. The result is an initial estimate of R across all rows—necessarily rough since C is random.
- Now repeat the process for each column i of X to calculate C[i] from R and X[i] using least squares. This yields a second version of C which is more precise than the first version.
- Repeat these steps back and forth between R and C, iteratively improving them, until a specified stopping condition is met. The estimates matrix E (=RC) will now be as close as possible (within the specified tolerance) to observations matrix X given the constraint of dimensionality D.
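The steps above can be sketched in a few lines of NumPy. This is a minimal illustration of generic Alternating Least Squares, not Damon's actual implementation or API: the function name `als_factorize`, its parameters, and the stopping rule (change in root-mean-square residual) are assumptions made for the example, and the input is assumed to be already normalized, with missing cells marked as NaN.

```python
import numpy as np

def als_factorize(X, D, n_iter=50, tol=1e-8, seed=0):
    """Approximate X (persons x items) by R @ C using D dimensions.

    Hypothetical helper for illustration only; X is assumed to be
    normalized already, with NaN marking missing cells.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = X.shape
    observed = ~np.isnan(X)
    C = rng.normal(size=(D, n_cols))   # C starts as random numbers
    R = np.zeros((n_rows, D))
    prev_err = np.inf
    for _ in range(n_iter):
        # Solve for each person's coordinates R[n] from C and row X[n].
        for n in range(n_rows):
            m = observed[n]
            R[n] = np.linalg.lstsq(C[:, m].T, X[n, m], rcond=None)[0]
        # Solve for each item's coordinates C[i] from R and column X[i].
        for i in range(n_cols):
            m = observed[:, i]
            C[:, i] = np.linalg.lstsq(R[m], X[m, i], rcond=None)[0]
        # Stop when the fit to the observed cells stops improving.
        E = R @ C
        err = np.sqrt(np.mean((X[observed] - E[observed]) ** 2))
        if prev_err - err < tol:
            break
        prev_err = err
    return R, C, R @ C

# Simulated, noise-free data with a true dimensionality of 2.
demo_rng = np.random.default_rng(1)
X_demo = demo_rng.normal(size=(30, 2)) @ demo_rng.normal(size=(2, 12))
R, C, E = als_factorize(X_demo, D=2)
rmse = float(np.sqrt(np.mean((X_demo - E) ** 2)))
print("RMSE between estimates and observations:", rmse)
```

Note that R and C are only determined up to an invertible D-by-D transformation, so the coordinates themselves differ from run to run even when their product E is the same.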
The most important unknown is dimensionality D. If D is set too high, the estimates matrix E will be close to X but biased by its noise e, causing the results to be unreproducible. If D is set too low, E (and R and C) will be more reproducible but will miss important sources of variation and yield poor predictions. Damon uses two criteria to identify optimal dimensionality: a) ability to predict cell values that have been made missing (called “accuracy”); b) similarity of the R (or C) coordinates when calculated from different subsets of X (called “stability”). The root-product of accuracy and stability, which EDS calls objectivity, will hit its maximum, in theory, at the “true” dimensionality, making it possible to identify optimal dimensionality D.
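The “accuracy” criterion can be illustrated with a small simulation. The sketch below is not Damon's procedure: it hides a random 10% of cells, refits a plain Alternating Least Squares model at several candidate dimensionalities, and correlates the predictions with the hidden values. The “stability” criterion and the combined objectivity statistic are omitted, and all names, sample sizes, and noise levels are assumptions chosen for the example.

```python
import numpy as np

def als_estimates(X, D, n_iter=15, seed=0):
    """Factor X (NaN = missing) into R @ C with D dimensions; return R @ C."""
    rng = np.random.default_rng(seed)
    observed = ~np.isnan(X)
    C = rng.normal(size=(D, X.shape[1]))
    R = np.zeros((X.shape[0], D))
    for _ in range(n_iter):
        for n in range(X.shape[0]):
            m = observed[n]
            R[n] = np.linalg.lstsq(C[:, m].T, X[n, m], rcond=None)[0]
        for i in range(X.shape[1]):
            m = observed[:, i]
            C[:, i] = np.linalg.lstsq(R[m], X[m, i], rcond=None)[0]
    return R @ C

def heldout_accuracy(X, D, frac=0.1, seed=0):
    """Hide a fraction of cells, refit, and correlate predictions with them."""
    rng = np.random.default_rng(seed)
    hide = rng.random(X.shape) < frac
    X_train = X.copy()
    X_train[hide] = np.nan            # make the chosen cells missing
    E = als_estimates(X_train, D, seed=seed)
    return float(np.corrcoef(E[hide], X[hide])[0, 1])

# Simulated data: three true dimensions plus random noise.
sim_rng = np.random.default_rng(42)
signal = sim_rng.normal(size=(120, 3)) @ sim_rng.normal(size=(3, 30))
X_sim = signal + sim_rng.normal(scale=0.5, size=signal.shape)

accuracy = {D: heldout_accuracy(X_sim, D) for D in range(1, 6)}
for D, acc in accuracy.items():
    print(D, round(acc, 3))
```

In a simulation like this, accuracy would be expected to rise as D approaches the true dimensionality and to flatten or decline beyond it, which is the pattern the dimensionality search relies on.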
The primary outputs of Damon’s matrix factorization are coordinates R and C (person and item parameters that can be used for scoring in the future), estimates array E, and standard errors SE that can be calculated for each cell estimate. The procedure is robust to large amounts of missing data and automatically generates estimates for missing cells—no additional missing data imputation procedures are used or needed. Person measures are calculated by averaging estimates across all columns and converting these averages to a linear interval scale as needed. Subscale measures are calculated by averaging estimates across only those columns whose items correspond to the specified subtest.
Because estimates for cells in a given column are calculated using information from the whole array, each column’s estimates will be more precise and reliable—often to a marked degree—than the original raw data for those columns. That means the corresponding subscale measures will be more precise and reliable as well. Thus, by drawing on information from items outside the subtest, a Damon subscale consisting of 5 items may have the precision and reliability of a 20-item test.
Conveniently, each set of overall person measures or subscale measures has its own distinct C coordinates that can be used for scoring in the future with the assurance that the resulting measures will retain comparability and content validity even as the underlying items are switched out over time.
In order for measures calculated using Damon’s matrix factorization routine to enjoy these measurement properties, several requirements must be met, the most important of which is that all items in the testing pool must fit in the common multidimensional space. In other words, if Damon finds that a test is best modeled using a dimensionality D = 5, then every item on that test must be sensitive to each of those five ability/easiness dimensions and no others. Items may be much more sensitive to some dimensions than others, but all must have some degree of sensitivity to each dimension, and none should be sensitive to dimensions outside of those five.
In practice, these conditions are never fully met. This manifests as increased standard error and decreased reliability, as well as model “misfit”—statistically significant discrepancies between a row or column’s observed values and the corresponding model estimates. Misfitting items are items that for whatever reason do not share the same dimensions as the other items, either because they are unduly sensitive to extraneous dimensions or insensitive to the shared dimensions.
A useful property of Damon’s factorization model is that it is generally possible: a) to flag misfitting items; and b) to remove them from the analysis. Conventional statistical models rely on assumptions that individual cases are “statistically representative” of (i.e., randomly chosen from) some well-defined population (of persons and items). Psychometric models such as the Rasch model and Damon’s factorization model do not rely on such assumptions, nor could they given the frequent lack of representative samples of persons and items and the requirement for fairness in educational testing. Therefore, removing misfitting items is not merely permissible; it can actually increase test reliability and reproducibility, making “analysis of fit” an important step in the item calibration process.
An Expert Recommender System
The primary objective of the Neurolearning project was to derive a single, generalizable Dyslexia Risk Scale that not only maintains objective measurement properties over time but reflects the field’s best understanding of what dyslexia is. Ordinarily, as stated above, one would simply average cell estimates across the columns to obtain each person’s dyslexia risk measure, and these measures would form the basis of a scale. However, the meaning of the scale would then be weighted toward those item types that happen to appear most frequently on the instrument, which may or may not be related to how dyslexia measurement is best operationalized therapeutically or understood in the field. The scale, in other words, would have poor content validity.
Because current best practice for the diagnosis of dyslexia involves expert judgment based upon a global assessment of subject performance across a broad range of relevant assessment measures, it was decided to approach the problem by asking the following question: Given the data collected through the Neurolearning dyslexia screening test application, which is representative of the type of data obtained through interaction with a clinician seeking to make a diagnosis of dyslexia for each person, how would a diagnostic expert in dyslexia classify that individual on a dyslexia incidence or risk rating scale from 0 to 10?
Once expert ratings are assigned, Damon is used to predict what the expert’s diagnostic rating would be in future cases, were that expert present. In other words, expert diagnosis is applied only during the calibration phase of the analysis, then delegated to Damon to be performed automatically in place of the expert during the scoring phase of the analysis when individuals take the test and need a real-time dyslexia risk score.
The task of generating ratings was delegated to the dyslexia expert at Neurolearning. For each individual in the 828 person sample, a dyslexia rating was assigned on a scale from 0 to 10 that represented a diagnostic synthesis of all the data (more than 400 items) collected about that person, taking into account their age and other relevant factors. The expert ratings were appended as an extra column to the matrix of observations and then analyzed as part of the item calibration process, yielding coordinates for that column C[expert] that were then stored away in an item coordinates database for future reference.
In order to score a new person n who takes the dyslexia instrument, person n is administered all or some of the items on a computer. Person coordinates R[n] are calculated for that person using their data and the item parameters banked as a result of the calibration analysis. The dyslexia measure is then calculated as the dot product of R[n] and C[expert] (suitably transformed to the desired metric). In this way, we obtain an estimate of how an expert would have diagnosed person n, even though no such expert was available. That is the sense in which this is an “expert recommender system” (see the Data Science Made Simpler article for a helpful description).
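The scoring step can be sketched as follows. This is an illustration under stated assumptions rather than Damon's code: `score_person`, the banked coordinate values, and the new person's responses are all hypothetical, and the final transformation to the reporting metric is left out.

```python
import numpy as np

def score_person(x_n, C_items, c_expert):
    """Estimate the expert rating for one person from banked coordinates.

    x_n      : responses to the items, NaN where an item was not taken
    C_items  : banked item coordinates, shape (D, number of items)
    c_expert : banked coordinates of the expert-rating column, shape (D,)
    """
    answered = ~np.isnan(x_n)
    # Person coordinates R[n]: least-squares fit of the answered
    # responses on the coordinates of the answered items.
    R_n = np.linalg.lstsq(C_items[:, answered].T, x_n[answered], rcond=None)[0]
    # Predicted expert rating: dot product of R[n] and C[expert].
    return R_n, float(R_n @ c_expert)

# Hypothetical bank with D = 2 dimensions and 8 items.
bank_rng = np.random.default_rng(0)
C_items = bank_rng.normal(size=(2, 8))
c_expert = np.array([1.5, -0.5])

# Hypothetical new person: noise-free responses, two items not administered.
true_R = np.array([0.8, -1.2])
x_new = true_R @ C_items
x_new[[1, 4]] = np.nan

R_n, predicted_rating = score_person(x_new, C_items, c_expert)
print(R_n, predicted_rating)
```

Because only the answered items enter the least-squares fit, the same function works whether the person takes all of the items or a subset.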
Bear in mind, however, that the system is not merely trying to replicate an expert rating; it is trying to improve on it. Instead of a predicted raw rating, Damon yields a value on a continuous scale that can be interpreted as what would be obtained if a panel of experts submitted ratings and averaged them. Unlike the raw ratings, this measure is also accompanied by a standard error.
In addition to calculating an overall measure of dyslexia risk, a goal of the Neurolearning contract was to calculate subscale measures for diagnostic and research purposes. Originally, subscale measures were defined in terms of the timed and untimed items associated with each subtest in the instrument, 19 in all. The procedure did not follow the “expert recommender” pattern described above. Instead, each cluster of items identified as a subtest by Neurolearning was treated as representative of what a complete test in that content area might look like, an assumption that becomes increasingly tenuous as the number of items in a subtest falls below, say, 20. Each person’s subscale score was obtained by averaging the cell estimates for the items within that subscale, mathematically equivalent to multiplying the person n coordinates R[n] by the average of the item coordinates C[sub] for that subscale. The subscale average of C[sub] was stored in a bank of item coordinates and used to calculate measures on the subscale in the future.
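The equivalence mentioned above (averaging a person's cell estimates across a subscale gives the same result as multiplying the person coordinates by the average item coordinates) follows from the linearity of the dot product. A quick numerical check, using randomly generated coordinates that are purely hypothetical:

```python
import numpy as np

check_rng = np.random.default_rng(0)
R_n = check_rng.normal(size=3)         # person coordinates, D = 3
C_sub = check_rng.normal(size=(3, 5))  # coordinates of 5 items in one subscale

# Route 1: estimate every cell in the subscale, then average the estimates.
avg_of_estimates = float((R_n @ C_sub).mean())

# Route 2: average the item coordinates first, then take one dot product.
estimate_from_avg_coords = float(R_n @ C_sub.mean(axis=1))

print(avg_of_estimates, estimate_from_avg_coords)
```

This is why a single averaged coordinate vector per subscale is enough to store in the item bank.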
In March 2019, it was decided to change course and redefine the subscales to comply with the “expert recommender” pattern described above. The goal was to increase the interpretability and diagnostic power of each subscale and to improve methodological consistency across all aspects of the test.
A Neurolearning expert specified six dyslexia subscales that were focused on different aspects of dyslexia and implemented a manual procedure for examining each person’s data in the calibration sample and generating an expert risk rating from 0 to 10 on each subscale. These subscale ratings were added as new columns in the dataset. Damon was then delegated the task of predicting how an expert would rate each person on each subscale in a process exactly analogous to that used for the main dyslexia risk measure.
During calibration, Damon generates subscale coordinates C[sub] and stores them in a bank for future reference. When person n submits response data X[n], the applicable item coordinates C[i] are looked up to calculate person coordinates R[n]. The dot product of person coordinates R[n] and subscale coordinates C[sub] yields predictions of how an expert would rate person n on those subscales. These are rescaled to be linear and have the same mean and standard deviation as the overall dyslexia risk measure.
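The prediction and rescaling steps just described can be sketched as follows. This is an illustration under stated assumptions rather than Damon's implementation: the helper `fit_rescaling`, the array sizes, and the target mean and standard deviation (50 and 10) are hypothetical, and only a simple linear match of mean and standard deviation is shown.

```python
import numpy as np

def fit_rescaling(raw_calib, target_mean, target_sd):
    """Return (slope, intercept) mapping raw predictions onto the target scale.

    Works column by column, so each subscale gets its own linear transform.
    """
    slope = target_sd / raw_calib.std(axis=0)
    intercept = target_mean - slope * raw_calib.mean(axis=0)
    return slope, intercept

rng = np.random.default_rng(1)
n_persons, n_dims, n_subscales = 50, 4, 6        # illustrative sizes
R = rng.normal(size=(n_persons, n_dims))         # calibration person coordinates
C_sub = rng.normal(size=(n_dims, n_subscales))   # banked subscale coordinates

# Predicted expert rating for every person on every subscale.
raw_calib = R @ C_sub

# Hypothetical mean and SD of the overall dyslexia risk measure.
overall_mean, overall_sd = 50.0, 10.0
slope, intercept = fit_rescaling(raw_calib, overall_mean, overall_sd)
calib_measures = raw_calib * slope + intercept

# Scoring a new person reuses the banked coordinates and the same linear transform.
R_new = rng.normal(size=n_dims)
new_measures = (R_new @ C_sub) * slope + intercept
```

Fitting the slope and intercept once on the calibration sample keeps measures for later examinees on the same scale.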
Note that once the subscale coordinates are calculated, subscale measures can be calculated that will be comparable regardless of how items are added to or removed from the test. It is even possible, in principle, to predict person performance on a subscale calculated from a subtest containing items the person never took. This remarkable property holds to the degree that it can be assumed that the items in that subtest “fit” in the same dimensional space as the test as a whole. To approximate this property in practice, it is necessary to perform “analysis of fit” to remove all items that do not “fit” in the multidimensional space defined by the test and discovered by Damon. Items that “fit the model” are all sensitive to the same set of dyslexia-related latent dimensions and to no others.
Accounting for Age
A young examinee receives a low score across the items. Is it because the examinee is young or because the examinee has dyslexia? It is essential to work this out. One common approach is to test each age group separately and only make comparisons between persons within an age group, as is often done with IQ testing. Such an approach abandons the possibility of a single scale for measuring dyslexia risk and replaces it with multiple scales, loosely related. It also abandons the possibility of measuring change in dyslexia risk for a given individual over time.
The EDS approach was to treat age as a known dimension impacting item responses and separate it from the various dyslexia dimensions. This was done in Damon by performing the factorization in multiple steps:
- Log Age. Take the log of each person’s reported age, which has the effect of more or less equalizing the effect of age differences across the age range.
- First factorization. Treat log(age) as a fixed dimensional coordinate of R (also known as “anchoring”). Applying least squares, calculate C[i] for each column i using R and the observed (but normalized) data X[i]. This yields C and estimates matrix E = RC as well as residual matrix Res = X – E.
- Second factorization. Apply Damon’s alternating least squares routine to residuals matrix Res to obtain the optimal dimensionality D and coordinates R and C. D is the number of dyslexia dimensions, the age dimension having been stripped out.
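- Merge the two sets of person coordinates. Append the dyslexia dimensions of R (from the second factorization) to the age dimension of R (from the first factorization) to create an (almost) final version of R; call it R[penultimate].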
- Third factorization. Anchoring on R[penultimate] (both the age dimension and the dyslexia dimensions), use least squares to calculate a final version of C and use C to calculate the final version of R. This causes the person coordinates to be calculated last in the iterative process, ensuring that the person measures (instead of the item measures, which we are less interested in) reflect the best possible fit between X and E.
The effect of this procedure is to provide an age-adjusted “bump” to the success rates of younger examinees (resulting in lower dyslexia risk scores), a bump that experts already apply as part of their evaluation. This allows Damon to reduce misfit in the expert ratings column and to adjust dyslexia risk scores to take into account the age of the examinee.
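The multi-step factorization listed above can be sketched in NumPy. This is a simplified illustration, not Damon's code: the `als` helper is a plain alternating least squares loop standing in for Damon's routine, the dimensionality D is passed in rather than searched for, X is assumed to be already normalized and free of missing cells, and the synthetic data exist only to exercise the functions.

```python
import numpy as np

def als(X, D, n_iter=50, seed=0):
    """Plain alternating least squares so that X ~= R @ C with D dimensions."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(X.shape[0], D))
    for _ in range(n_iter):
        C = np.linalg.lstsq(R, X, rcond=None)[0]          # items, given persons
        R = np.linalg.lstsq(C.T, X.T, rcond=None)[0].T    # persons, given items
    return R, C

def age_adjusted_factorization(X, age, D):
    # Log age: one fixed ("anchored") person coordinate.
    R_age = np.log(age).reshape(-1, 1)
    # First factorization: item coordinates on the age dimension, then residuals.
    C_age = np.linalg.lstsq(R_age, X, rcond=None)[0]
    Res = X - R_age @ C_age
    # Second factorization: dyslexia dimensions from the residuals.
    R_dys, _ = als(Res, D)
    # Merge: age dimension first, then the D dyslexia dimensions.
    R_pen = np.hstack([R_age, R_dys])
    # Third factorization: final C anchored on R_pen, then final R from C.
    C = np.linalg.lstsq(R_pen, X, rcond=None)[0]
    R = np.linalg.lstsq(C.T, X.T, rcond=None)[0].T
    return R, C

# Synthetic, noise-free data: one age dimension plus D = 2 other dimensions.
rng = np.random.default_rng(2)
n_persons, n_items, D = 200, 30, 2
age = rng.uniform(7, 40, size=n_persons)
X = (np.log(age)[:, None] @ rng.normal(size=(1, n_items))
     + rng.normal(size=(n_persons, D)) @ rng.normal(size=(D, n_items)))

R, C = age_adjusted_factorization(X, age, D)
print(R.shape, C.shape)
```

In this sketch the age dimension occupies the first column of R and the first row of C. That placement is a choice made for the example, not a statement about how Damon orders its dimensions.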
It is important to note that age is only used during the calibration phase. Examinee age is not used when scoring. This is made possible by the fact that, as a result of the calibration process, each item in C has a coordinate (the 6th dimension) that reflects how sensitive that item is in general to the age of the examinee. This age sensitivity coordinate makes it possible for Damon to calculate a coordinate in R[n] that approximately estimates the log(age) of person n. This was an unlooked-for side-effect of the age-adjustment procedure described above. We could have designed the scoring algorithm to use the examinee’s stated age when scoring, but it was found, after much study, to be unnecessary and even counter-productive.
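To illustrate how scoring can proceed without a stated age, the sketch below solves for one person's coordinates from the banked item coordinates alone and then reads an implied age off the age dimension. The function `score_person`, the use of NaN for unanswered items, the position of the age dimension (row 0), and the synthetic numbers are assumptions made for the example.

```python
import numpy as np

def score_person(x_n, C_bank):
    """Least-squares person coordinates R[n] from responses and banked item coordinates.

    x_n    : 1-D array of (normalized) responses, NaN where an item was not taken
    C_bank : array of shape (n_dims, n_items) holding banked item coordinates
    """
    answered = ~np.isnan(x_n)
    R_n, *_ = np.linalg.lstsq(C_bank[:, answered].T, x_n[answered], rcond=None)
    return R_n

rng = np.random.default_rng(3)
C_bank = rng.normal(size=(3, 12))              # row 0: age sensitivity; rows 1-2: other dims
true_R = np.array([np.log(12.0), 0.4, -1.1])   # a hypothetical 12-year-old examinee
x_n = true_R @ C_bank                          # noise-free responses implied by the model
x_n[[2, 7]] = np.nan                           # two items left unanswered

R_n = score_person(x_n, C_bank)
estimated_age = np.exp(R_n[0])                 # age implied by the responses alone
```

With real, noisy responses the recovered age coordinate would only approximate log(age), as the text notes.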
The objective of the EDS contract with Neurolearning was not to perform the actual psychometric analysis that Neurolearning would use but rather to provide software tools that would allow Neurolearning experts to do the analysis themselves. Therefore, the analysis that EDS did and that is superficially described below was purely exploratory and does not reflect the current psychometric properties of the instrument. However, it is understood that Neurolearning has made minimal changes to the EDS analysis.
Damon found that the optimal dimensionality D for modeling the Neurolearning dataset, not including the age dimension, was 5 or 6. We chose D = 5 as the working dimensionality. However, it was also clear that not all item types fit in the same space. When “timing” items were removed, the optimal dimensionality D shrank to 2 or 3, indicating that timing introduces a separate set of important dyslexia-related dimensions, which comes as no surprise. We could have decided at that point to analyze the “timing” and “non-timing” items separately, in parallel, but decided ultimately that the hypothetical mathematical benefits were not worth the cost in analysis complexity.
Analysis of Fit
We performed extensive analysis of fit, removing from analysis various classes of items (and persons) that did not appear to fit the 5-dimensional space, and were able to improve reliability, objectivity, and various other statistics. We found well-fitting and reliable solutions by removing around 100 items and as many persons but also felt the improvements were incremental and that no great loss would be incurred by analyzing the complete dataset.
We identified a class of persons for whom the Damon overall and subscale dyslexia measures were deemed to be inadequate. Upon closer examination, it was found that these were persons whose responses did not fit the 6-dimensional model specified by Damon. In other words, these persons’ raw scores and timings across the test items did not match the corresponding cell estimates, the values predicted for them by the measurement model. This is conceptually similar to a situation that sometimes arises in educational psychometrics where a student taking a test tends to get the hard items right and the easy items wrong. While for practical reasons such students are given scores, psychometrically those scores are meaningless. For whatever reason, the test instrument is unable to measure their ability on that particular construct and the scores that are assigned are uninterpretable.
The same situation can arise when measuring dyslexia risk. For whatever reason, some examinees simply do not engage with the test in a way that can be interpreted. The effects on the resulting dyslexia measures appear to be exacerbated by the examinees being measured in a 6-dimensional instead of a 1-dimensional space, as is customary in educational testing.
How can the problem be addressed? Fortunately, while it is not possible to assign valid scores to such examinees, it is possible to flag them using their person fit statistics. When person n’s misfit (a function of the size of the discrepancies between that person’s observed data and the corresponding cell estimates) exceeds a given threshold, that examinee is flagged for personal attention and their scores are set aside. Depending on how high the threshold is set, we estimated that the number of such persons ranged from 1% to 10% of the calibrating sample, but this needs to be explored further.
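A minimal sketch of such a flagging rule follows. The misfit statistic used here, a root-mean-square residual per person, is an assumption chosen for simplicity; the text only says that misfit is a function of the discrepancies, and Damon's actual fit statistic may differ. The function names, data, and threshold are likewise illustrative.

```python
import numpy as np

def person_misfit(X, E):
    """Root-mean-square residual per person, ignoring missing cells (NaN)."""
    return np.sqrt(np.nanmean((X - E) ** 2, axis=1))

def flag_misfits(X, E, threshold):
    """Boolean array: True where a person's misfit exceeds the threshold."""
    return person_misfit(X, E) > threshold

# Illustrative data: 4 persons by 5 items, with estimates E and observations X.
E = np.zeros((4, 5))
X = E.copy()
X[2] += 3.0          # person 2 departs strongly from the model's estimates
X[0, 1] = np.nan     # one missing response for person 0

flags = flag_misfits(X, E, threshold=1.0)
print(flags)
```

Raising or lowering `threshold` changes the share of examinees set aside, which is the trade-off behind the 1% to 10% range mentioned above.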
Age and Item Drift
Because of the importance of removing age as a disturbing factor in dyslexia risk measurement, we performed a wide variety of tests to confirm that the age factor was minimized. The most important of these was to break the dataset into three groups by age, “kids” (age 7–9), “teens” (age 10–15), and “adults” (age 16 and above), and analyze each group separately. This was done by calibrating the items separately for each group and comparing the resulting item coordinates for “drift,” which is similar to what is called “DIF” (Differential Item Functioning) in educational psychometrics. Item drift is the degree to which the item coordinates do not match or correlate across groups. We found that while there might be some minor benefits to analyzing the groups separately, the item drift statistics showed very low item drift for all three age groups, more than sufficient to justify analyzing all ages together in one common coordinate system. This supports the idea of a single dyslexia risk scale applicable to all ages. In short, our method of removing the effect of age on test performance was found to be effective.
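One simple way to quantify drift is to correlate the item coordinates obtained from two calibrations, dimension by dimension. The sketch below does only that and rests on a strong assumption: the two sets of coordinates are taken to be already expressed in a common coordinate frame. The function `item_drift_correlations` and the synthetic coordinates are hypothetical and do not reproduce Damon's drift statistics.

```python
import numpy as np

def item_drift_correlations(C_a, C_b):
    """Correlation between two calibrations' item coordinates, one value per dimension.

    C_a, C_b : arrays of shape (n_dims, n_items) in a common coordinate frame.
    Values near 1 indicate low drift on that dimension.
    """
    return np.array([np.corrcoef(a, b)[0, 1] for a, b in zip(C_a, C_b)])

rng = np.random.default_rng(4)
C_ref = rng.normal(size=(3, 40))                          # illustrative "true" item coordinates
C_kids = C_ref + rng.normal(scale=0.05, size=C_ref.shape)     # calibration on one age group
C_adults = C_ref + rng.normal(scale=0.05, size=C_ref.shape)   # calibration on another group

drift = item_drift_correlations(C_kids, C_adults)
print(drift.round(3))
```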
Depending on the analysis, Damon reported reliability statistics for measurement of overall Dyslexia Risk to be in the neighborhood of 0.92, which would be considered satisfactory in an educational assessment context (0.85 is a widely used minimum). (“Reliability” is similar to the Cronbach’s alpha statistic frequently referred to in psychometrics.) The subscale reliabilities were similarly high.
As mentioned, EDS provided two tools to Neurolearning, the first for calibrating the items using the Neurolearning calibration dataset and storing their coordinates in a bank, the second for scoring individual examinees using their response data and the stored item coordinates. To validate the calibration model, we plotted Damon’s estimated dyslexia measures against the original expert ratings under various conditions (first ensuring that one was not biasing the other). The resulting correlations tended to run from 0.70 to 0.85. However, such correlations can be difficult to interpret because they compare a discrete ordinal scale against a continuous interval scale, and because the Damon measures, by uniformly and mathematically synthesizing responses across the items, are just as likely as the expert ratings to be closer to the “true” dyslexia risk. What the correlations do show is that there is substantial agreement between the expert and the model, and that the model can stand in for the expert if needed. However, this is an area that requires further study and better validation criteria.
To validate the scoring routine, we plotted the examinee measures (and standard errors) on the overall Dyslexia Risk scale as obtained during the calibration phase against the measures obtained by scoring each examinee individually using the scoring tool without access to expert ratings. The correlation was 0.99 and the points hugged the identity line. We repeated the procedure for each subscale and obtained similarly high correlations.
On the basis of data and dyslexia expertise provided by Neurolearning, EDS has shown that it is possible (with a small number of identifiable exceptions) to measure examinees of any age on their risk and incidence of dyslexia with a satisfactory degree of reliability and precision, that a single generalized dyslexia scale based on an expert recommender matrix factorization methodology is both theoretically and practically feasible, that the same methodology can be applied to obtain reliable subscale measures, and that such a methodology can be used to calculate valid dyslexia risk measures in real time for individual examinees with minimal direct involvement by an expert.
“How do you build a ‘People who bought this also bought that’-style recommendation engine?”. Blog: Data Science Made Simpler. Retrieved March 1, 2019, from https://datasciencemadesimpler.wordpress.com/tag/alternating-least-squares/
Moulton, M. H. (2013). Objectivity and Multidimensionality: An Alternating Least Squares Algorithm for Imposing Rasch-like Standards of Objectivity on Highly Multidimensional Datasets. Educational Data Systems whitepaper. Presented at the International Objective Measurement Workshop, Vancouver, BC, 2012. https://eddata.com/wp-content/uploads/2015/11/EDS_Objectivity_Multidimensionality_Whitepaper.pdf.
Peterson, R. L., & Pennington, B. F. (2012). Developmental dyslexia. Lancet (London, England), 379(9830), 1997-2007.
“Testing and Evaluation”. International Dyslexia Association. Retrieved March 1, 2019, from https://dyslexiaida.org/testing-and-evaluation/.
“The Rasch Model”. In Wikipedia. Retrieved March 1, 2019, from https://en.wikipedia.org/wiki/Rasch_model
“Rasch Modeling”. In Population Health Methods: An educational platform for innovative population health methods, and the social, behavioral, and biological sciences. Retrieved March 3, 2019 from https://www.mailman.columbia.edu/research/population-health-methods/rasch-modeling
Zhou, Y., Wilkinson, D., Schreiber, R., Pan, R. (date unknown). “Large-scale Parallel Collaborative Filtering for the Netflix Prize”. HP Labs (Palo Alto, CA). Retrieved March 1, 2019, from http://www.grappa.univ-lille3.fr/~mary/cours/stats/centrale/reco/paper/MatrixFactorizationALS.pdf