Telling Our Story, Part 1: Mark Davies, Billion-Word Databases, and “Big” History

That the present era of “big data” should be characterized by a spirit of pragmatism may seem surprising after so many decades of “big theory” about the impact of new technologies. Scholars have grown familiar with arguments concerning the effects of the information age on what it means to be human: Donna Haraway’s 1985 treatise A Cyborg Manifesto, N. Katherine Hayles’s 1999 history of How We Became Posthuman, and Alan Liu’s subversive 2004 polemic The Laws of Cool are just three of dozens, even hundreds, of grand narratives about our brave new world. Meanwhile, electronic tools that process massive amounts of information are making it possible in literary studies, for example, to identify previously undetectable relationships between disparate corners of the canon or within the seams of individual texts. Similar tools are also revolutionizing the study and teaching of language, from linguistic corpora charting lexical and semantic variations to electronic programs improving the testing of language competency. (The latter was the subject of a plenary talk by Troy Cox, Coordinator of Technology and Assessment at BYU’s English Language Center, at the 2013 Conference of the Southwest Association for Language Learning Technology.) The interest inspired by this new techno-wizardry is abundantly evident in user statistics (Google Books, anyone?) as well as in growing numbers of digital humanities symposia, data visualization websites, and conference panels on new media. In this environment, theories about the meaning of our culture of information appear to be receding behind the sublime experience of new scholarly horizons and a growing conviction that we simply need to get on with this new work in these new forms. Indeed, such labor increasingly seems to be an intellectual imperative of the age.

Mark Davies of BYU’s Department of Linguistics and English Language is at the forefront of this revolution in scholarship. A designer of linguistic corpora (that is, of large collections of text used for linguistic analysis), Davies has undertaken work that is eminently useful but that also has long-range consequences for how we think about language and, potentially, how we undertake research in the humanities. He is the recipient of five large national research grants: three from the National Endowment for the Humanities and two from the National Science Foundation. The first of these, which Davies received in 2001-2002, helped create a 100-million-word annotated, online corpus of Spanish from the 1200s through the 1900s. A few years later, Davies would co-create a 45-million-word corpus of Portuguese stretching from the 1300s to the 1900s. But with advancements in technology and greater access to resources, Davies has graduated to corpora of much more imposing size. From 2009 to 2011, the NEH funded Davies in building a 400-million-word corpus of American English (1810-2009), a body more than two hundred times as large as any other structured, historical corpus of English. What is more, this is the only corpus that can map ongoing changes in the language, that is, how the language is evolving right now. And scholars have taken note, outside as well as within the humanities: the corpus attracts hundreds of thousands of unique visitors each month from all over the world, in fields as diverse as business, history, religious studies, and more, and it also forms the basis for 150-200 publications (whether articles or book chapters) each year.

And yet Davies’s horizons are still expanding. He is presently at work on a project called “GloWbE” (Global Web-based English), a two-billion-word corpus of texts drawn from twenty different English-speaking countries. This project will allow researchers to examine dialectal variation in English, making available for the first time a much clearer view of the massive flows of linguistic change within and across regions, nations, and continents. Such a project promises to provide scholars with a wealth of new information that will help them chart huge numbers of linguistic and conceptual developments pertaining to culture as well as language, and hence to the very course of history itself.

