The Oxford Children's Corpus: lessons for learning to read

The written word is arguably the greatest cultural invention. Orthography (the conventional writing system of a language) provides a set of tools that allows us to write words so that others who share our tools can also share our thoughts, ideas, and dreams. The written word allows us to create narratives that play to our imaginations or teach us about the world. It transcends space and time, and it is almost impossible to imagine the world without it. For skilled readers, the connection between the letters on the page and the image they conjure in our minds is so fast, so rich in content and so automatic that we rarely stop to think about the underlying complexities of what we do as we read, or about the enormous task facing children as they learn to read.

The scientific study of reading has revealed a good deal about the early stages of reading development. Children need to grasp the fundamental insight that in a language like English, letters are a code for meaning via sound. Armed with this insight, they can “sound out” words and discover spelling-sound correspondences. Evidence from a large research base has translated to the classroom: a phonics-based reading curriculum characterises the teaching of reading in the early years in the UK.

Critically, however, we know relatively little about how children move beyond the laborious process of decoding individual words. Reading is a skill that needs to be learned and, as with other skills, practice is critical. It is not surprising to find that children who read more frequently become better readers than those who read less often. But what is it that children learn from experience? Our project seeks answers to this question.

Our first task is to quantify reading experience. Although every child is different, we can estimate average reading experience using methods from corpus linguistics – a computational approach that analyses very large datasets of natural language, looking for patterns, distributions and statistical regularities which can tell us about how language works. Our large dataset is provided by the Oxford Children’s Corpus, owned and developed by Oxford University Press. It currently comprises more than 185 million words written for children, or by children themselves as part of BBC Radio 2’s annual 500 Words writing competition. By computing various statistics across this complex database we will gain a unique insight into what children are likely to have been exposed to during reading experience.
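As a rough illustration of the kind of statistic a corpus analysis might produce, the sketch below counts how often each word occurs per million words and in how many texts it appears. It is not the project's actual method: the function name `word_statistics`, the simple tokenisation rule and the two toy sentences are all assumptions made for this example, and the toy sentences stand in for corpus texts.

```python
from collections import Counter
import re


def word_statistics(texts):
    """Hypothetical helper: per-word frequency per million tokens,
    plus the number of texts in which each word appears."""
    token_counts = Counter()  # total occurrences of each word
    text_counts = Counter()   # number of texts containing each word
    for text in texts:
        # Assumed tokenisation: lower-case runs of letters and apostrophes.
        tokens = re.findall(r"[a-z']+", text.lower())
        token_counts.update(tokens)
        text_counts.update(set(tokens))
    total = sum(token_counts.values())
    return {
        word: {
            "per_million": count * 1_000_000 / total,
            "texts": text_counts[word],
        }
        for word, count in token_counts.items()
    }


# Toy stand-in for corpus texts (invented for illustration only).
toy_texts = ["The cat sat on the mat.", "The dog chased the cat."]
stats = word_statistics(toy_texts)
print(stats["the"])
print(stats["mat"])
```

Frequency per million makes counts comparable across corpora of different sizes, while the number of texts containing a word gives a simple measure of how widely it is spread across a child's likely reading material.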

Our second task involves collecting data from children. We will ask hundreds of children to read sets of words so that we can relate their performance to specific properties of those words, as identified in the corpus. We can then make connections between children’s reading experience, what they learn from it and how this relates to reading development.

Our findings will make an important contribution to theoretical models of reading, which have been largely silent regarding the role of experience. In the longer term, our project might have implications for how best to teach reading so as to maximise children’s learning.

Professor Kate Nation
University of Oxford