Culturomics and n-grams

Posted by Stan Carey on February 15, 2011

In December, Sharon mentioned Google’s Ngram viewer, a nifty new tool that lets you see how often words or phrases appear in more than five million texts in Google Books. Results appear in the form of a graph, which you can adjust by timeframe (1800–2000), degree of detail (rough–smooth), and corpus type (several languages and regions).

The project was formally introduced in a paper in Science. Its authors announced a new field: culturomics. There was some confusion about its pronunciation: did it have a short ‘o’ like in economics, or a long one like in genomics? The latter, as linguist Ben Zimmer explains; culturomics is the study of the data set called the culturome.

N-gram has a technical meaning in computational linguistics: a sequence of a number (n) of consecutive items, such as letters or words, in a text. One item is a 1-gram, two items a 2-gram, and so on. However, as people began sharing graphs of Google Books n-grams that caught their eye or their fancy, ngram (with or without the hyphen) soon took on a new meaning as a name for the graphs themselves.
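
For readers who like to see the technical sense spelled out, here is a minimal Python sketch of pulling n-grams out of a text. The function name `ngrams` and the sample sentence are made up for illustration, and this is not a description of how Google's tool works internally; it only shows the idea of sliding a window of n items along a sequence.

```python
from collections import Counter


def ngrams(items, n):
    """Return every run of n consecutive items as a tuple."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slide a window of width n along the sequence, one step at a time.
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


# Hypothetical sample text, split naively on whitespace.
text = "the cat sat on the mat and the cat slept"
words = text.split()

unigrams = ngrams(words, 1)   # 1-grams: single words
bigrams = ngrams(words, 2)    # 2-grams: pairs of adjacent words

# Counting how often each n-gram occurs is the raw material for a frequency graph.
bigram_counts = Counter(bigrams)

print(bigrams[:3])
print(bigram_counts.most_common(1))
```

The same function works on letters: passing the string "word" with n set to 2 gives the letter pairs w-o, o-r and r-d. A real corpus would need more careful tokenising than a whitespace split, and counts would be grouped by publication year before plotting.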

Here are a few: you can see the rise of suburbia around the turn of the century; compare television, radio, cinema and DVD; and note the emerging discussion of certain prejudices. The graph of 1950, 1960 forms a simple and satisfying pattern, while the bump of trenches shows how rapidly historical events can surge and fade in the literature. To see how such graphs might be analysed, have a look at this brief history of neuroscience.

Culturomics is an impressive and valuable undertaking that has received much praise since its inception. It has also attracted constructive criticism: David Crystal and Mark Liberman drew attention to technical shortcomings, Geoffrey Nunberg assessed its possible effect on scholarship, and Mark Davies compared it with his Corpus of Historical American English, which allows more subtle and tailored searches. An official culturomics FAQ addresses some queries and problems.

It’s a fun toy for general readers and word lovers, as well as being a useful tool for students and scholars in any field where the quantitative history of words can cast light on the development of the subject – and, in turn, on us. Feel free to share your own n-grams; I’d love to hear them!

Comments (13)
  • [...] This post was mentioned on Twitter by Alejandra, Macmillan Dictionary. Macmillan Dictionary said: Stan Carey on culturomics and n-grams: http://bit.ly/h1vBAh [...]

    Posted by Tweets that mention Culturomics and n-grams | Macmillan -- Topsy.com on 15th February, 2011
  • Another interesting n-gram, mentioned by Michael Rundell in his webinar a couple of weeks ago, is this one for ‘disabled’ vs ‘crippled’: http://ngrams.googlelabs.com/graph?content=disabled%2Ccrippled&year_start=1800&year_end=2000&corpus=0&smoothing=3

    Posted by Kati on 15th February, 2011
  • That is interesting, Kati. I played around a bit with that one, trying related terms like the ones mentioned here. One example, “lame”, is on a gradual decline according to the database, but this decline doesn’t reflect its popularity in colloquial speech.

    You mentioned Michael’s webinar — there’s a curious (and surely erroneous) spike for “webinar” in the early 1900s…

    Posted by Stan on 15th February, 2011
  • Thanks for the link! The first ngrams I did about evolution vs. creationism blew my mind away even more than the history of neuroscience ones, just in terms of the historical events that so clearly can be visualized. http://egosumdaniel.blogspot.com/2010/12/playing-with-google-ngram-evolution-vs.html

    Posted by Daniel on 16th February, 2011
  • You’re welcome, Daniel, and thanks for the link: a very interesting post. You had a lot of fun with those graphs!

    Posted by Stan on 17th February, 2011
  • Stan – you tried webinar and I just tried podcast which allegedly ‘spiked’ in the late nineteenth century! Clearly something odd going on with recent neologisms which don’t appear to always flatline in the way you’d expect. That aside, really interesting for established vocabulary. ‘Nuclear’ is another good one which gives a clear representation of the angst during my student days…

    Posted by Kerry on 18th February, 2011
  • Apologies for lowering the tone of the discussion but I think this one illustrates the point David Crystal makes in his blog about the limitations of n-grams: http://ngrams.googlelabs.com/graph?content=Barbapapa&year_start=1800&year_end=2000&corpus=0&smoothing=3 (No spoken language, no popular culture.) Still, it’s a great tool and lots of fun to play around with.

    Posted by Kati on 18th February, 2011
  • Good point Kati, though ‘Teletubbies’ seems to produce the desired result: http://ngrams.googlelabs.com/graph?content=Teletubbies&year_start=1800&year_end=2000&corpus=0&smoothing=3 – a clear nod to ‘popular’ culture (I’m sorry to say I was one of the many Mums who had to suffer toddlers demanding ‘Tubby toast’ and ‘Tubby custard’ in the mid-nineties!)

    Btw, if you switch language to French – Barbapapa does a bit better….

    Posted by Kerry on 18th February, 2011
  • Posted by Kati on 18th February, 2011
  • Kerry: Oh, I hadn’t tried nuclear. It shows a very pronounced climb. Those early spikes with webinar and others must result from a glitch of some sort.

    Kati: You’re not lowering the tone at all! I’m glad the French graph covers what the English one omitted.

    Sticking with pop culture, there’s no surprise who tops this mini-list of musicians.

    Posted by Stan on 18th February, 2011
  • (I included Madonna at first, but the word’s older sense was clearly interfering.)

    Posted by Stan on 18th February, 2011
  • Stan – yep, definite glitch. Reading round the commentary on this (and there’s been a lot of it!) it looks like our experiences with webinar and podcast might be to do with data quality (typos etc), incorrect dating of sources, and the old chestnut that – duur – computers just can’t bloomin well read. So for instance words ‘pod’ and ‘cast’ were both alive and well in the late 19th century, and may well have occurred, by some quirk of language use or error, adjacent to each other…

    Posted by Kerry on 18th February, 2011
  • [...] explored the word together, chiefly its semantics and etymology. Last week I examined Google’s culturomics/n-grams, a remarkable project based on the Google Books corpus, and this week’s offering is a short [...]

    Posted by Writing at Macmillan Dictionary Blog « Sentence first on 4th April, 2011