
Culturomics and n-grams

In December, Sharon mentioned Google’s Ngram viewer, a nifty new tool that lets you see how often words or phrases appear in more than five million texts in Google Books. Results appear in the form of a graph, which you can adjust by timeframe (1800–2000), degree of detail (rough–smooth), and corpus type (several languages and regions).
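For readers curious about what the rough–smooth setting might be doing, here is a minimal Python sketch. It assumes the graph plots a word's share of all words printed in each year, and that smoothing is a simple centred moving average. Both are my assumptions for illustration, not a description of Google's actual code. The function names and the counts are made up.

```python
def relative_frequency(matches_by_year, totals_by_year):
    """Share of all words in each year that match the search term."""
    return {
        year: matches_by_year.get(year, 0) / totals_by_year[year]
        for year in sorted(totals_by_year)
    }


def smooth(values, window):
    """Centred moving average.

    Each point is averaged with up to `window` neighbours on either side;
    points near the ends simply use fewer neighbours.
    """
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - window)
        hi = min(len(values), i + window + 1)
        chunk = values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Invented counts, purely for illustration.
matches = {1900: 2, 1901: 5}
totals = {1900: 100, 1901: 100, 1902: 50}

shares = relative_frequency(matches, totals)
rough = list(shares.values())   # window of 0 would leave this unchanged
smoother = smooth(rough, 1)     # each year averaged with its neighbours
```

A window of 0 leaves the line as jagged as the raw data; larger windows flatten short-lived spikes, which is handy for trends but can hide a sudden bump.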

The project was formally introduced in a paper in Science. Its authors announced a new field: culturomics. There was some confusion about its pronunciation: did it have a short ‘o’ as in economics, or a long one as in genomics? The latter, as linguist Ben Zimmer explains: culturomics is the study of the data set called the culturome.



N-gram has a technical meaning in computational linguistics: a sequence of a number (n) of items, generally words or strings of letters, in a text. One item is a 1-gram, two items a 2-gram, and so on. However, as people began sharing graphs of Google Books n-grams that caught their eye or their fancy, ngram (with or without the hyphen) soon took on a new meaning as a name for the graphs themselves.
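To make the technical sense concrete, here is a small Python sketch that pulls n-grams out of a sentence and counts them. The function name `ngrams` and the sample sentence are my own, chosen for illustration; this is not how Google Books builds its data set.

```python
from collections import Counter


def ngrams(items, n):
    """Return every run of n consecutive items as a tuple."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


text = "the cat sat on the mat and the cat slept"
words = text.split()

unigrams = ngrams(words, 1)   # 1-grams: single words
bigrams = ngrams(words, 2)    # 2-grams: pairs of adjacent words

counts = Counter(bigrams)
print(counts.most_common(1))  # [(('the', 'cat'), 2)]

# The same function works on letters: a string is just a sequence of characters.
letter_pairs = ngrams("word", 2)
```

Counting how often each tuple turns up, text by text and year by year, is the kind of tally that a graph of word frequency over time rests on.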

Here are a few: you can see the rise of suburbia around the turn of the century; compare television, radio, cinema and DVD; and note the emerging discussion of certain prejudices. The graph of 1950, 1960 forms a simple and satisfying pattern, while the bump of trenches shows how rapidly historical events can surge and fade in the literature. To see how such graphs might be analysed, have a look at this brief history of neuroscience.

Culturomics is an impressive and valuable undertaking that has received much praise since its inception. It has also attracted constructive criticism: David Crystal and Mark Liberman drew attention to technical shortcomings, Geoffrey Nunberg assessed its possible effect on scholarship, and Mark Davies compared it with his Corpus of Historical American English, which allows more subtle and tailored searches. An official culturomics FAQ addresses some queries and problems.

It’s a fun toy for general readers and word lovers, as well as a useful tool for students and scholars in any field where the quantitative history of words can cast light on the development of the subject – and, in turn, on us. Feel free to share your own n-grams; I’d love to see them!

About the author

Stan Carey

Stan Carey is a freelance editor, proofreader and writer from the west of Ireland. Trained as a scientist and TEFL teacher, he writes about language, words, books and more on Sentence first, Macmillan Dictionary Blog and elsewhere. He tweets at @StanCarey.
