
How do words get into the dictionary? Part 3: the future


The lexicographer’s rule of thumb is that things always take longer than you expect. Samuel Johnson underestimated the time it would take him to complete his dictionary, and James Murray – the original Editor of the OED – fared even worse in the prediction business: what started as a 10-year project took over 40 years to complete. But when it comes to technology-driven changes, this has been turned on its head: things tend to happen more quickly than we think they’re going to. Less than two years ago, none of us knew what an iPad was, yet today they are everywhere, and there’s a David Hockney exhibition in London featuring artworks he created on one of these devices. And in the reference field, the speed at which dictionaries have moved from print to digital media has taken most of us by surprise. In the previous two posts in this series, we saw how Web-based technologies have brought enormous changes in the ways that lexicographers create dictionaries and users consult them. Inevitably, this raises questions about what dictionaries are for, and what they should include.

There is huge variation in what different dictionaries do. On the one hand, significant new terms which emerged from the global economic meltdown of 2008 (like credit crunch and quantitative easing) are still absent from several leading online dictionaries, despite passing all the traditional tests. On the other hand, the 2010 edition of Collins’ flagship dictionary proudly advertised its inclusion of the word Cleggmania – a jokey term which came and went within a matter of weeks, and was only ever known in the UK.

Wordnik has a radical solution to the problem of what to include. With its slogan “All the Words”, it welcomes new words without asking too many questions. Crowd-sourcing is central to its approach, and once a word is included in Wordnik, its clever software “populates” the entry by bringing in examples from its corpus, from the Web, and from the Twittersphere, and (when appropriate) grabbing images from Flickr. As Stan said in a recent post, “wordplay is fun”, and Wordnik appeals to people’s enjoyment of language for its own sake. Whether this makes it a reliable source of lexical information is another question, but if you are looking for something that’s either very rare or very new, you are more likely to find it in Wordnik than anywhere else.

And what about Macmillan? Like other dictionaries that started life in print form, we have a core of vocabulary which was selected according to the principles described in earlier posts. But we are in the process of developing an inclusion strategy that is appropriate for the radically changed world we now live in. We aim to keep the main dictionary up to date (and we’ll soon be announcing another round of new additions). Meanwhile, the Open Dictionary adds a valuable element of crowd-sourcing: it not only helps us keep abreast of language change, but also allows us to improve our coverage of more specialized vocabulary from fields such as biology, computer science, and economics – an important trend which will certainly develop further. Finally, the regular Buzzwords feature provides an opportunity for looking at language trends (and their social, political, or linguistic implications) in a bit more detail.

An obvious question that arises is whether we should keep these various strands separate, or whether we should integrate them and provide a single search box. This would mean you would find what you were looking for first time, as long as it existed somewhere in our data collection. That in turn raises other questions. For example, would users want to know the ‘status’ of the entry they are looking at: was it crowd-sourced or was it written by a lexicographer? Or would the gain in efficiency (with a simpler look-up process) offset any concerns about the source of the entry? There is plenty to think about here.

Meanwhile, things are changing fast on the technical front, and the ‘self-updating’ dictionary is on the horizon. A number of technologies are needed in order to achieve this, and most are either in development or becoming available. First, you need a computer system which can trawl the Web and automatically identify new vocabulary as it emerges. Finding completely new words (like showmance or nanodrone) is relatively straightforward, and frequency data can tell us whether these are genuine new items or just examples of creative wordplay (or ‘exploitations’). New compounds (like bounce rate or range anxiety) are a little more difficult to spot, since the words they are composed from usually exist already, so the computer has to work a bit harder to recognize them as distinct lexical items. Hardest of all is getting the software to identify a new meaning of an existing word, like the (fairly) recent senses of cougar or toxic, but work is progressing in this area too. (Macmillan has recently contributed its own data on emerging word senses to a research project aimed at doing exactly this.)
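
To make the simplest of these cases concrete, here is a minimal Python sketch of how frequency data might separate a genuinely new word from a one-off piece of wordplay. It is an illustration only: the function name `find_candidate_words`, the `known_words` set, and the `min_count` threshold are assumptions made for this example, not a description of any system Macmillan or anyone else actually runs. It also handles only single new words; compounds and new senses would need more than a raw count.

```python
import re
from collections import Counter


def find_candidate_words(texts, known_words, min_count=3):
    """Return (word, count) pairs for unknown words seen at least min_count times.

    texts       -- an iterable of strings standing in for trawled Web text
    known_words -- a set of words already in the dictionary (hypothetical)
    min_count   -- illustrative cut-off between 'new item' and one-off wordplay
    """
    counts = Counter()
    for text in texts:
        # Deliberately crude tokenizer: lowercase, then runs of letters a-z.
        for token in re.findall(r"[a-z]+", text.lower()):
            if token not in known_words:
                counts[token] += 1
    # most_common() lists the most frequent unknown words first.
    return [(word, n) for word, n in counts.most_common() if n >= min_count]


# Toy data: 'showmance' recurs, 'glamtastic' appears only once.
sample_texts = [
    "their showmance was all over the news",
    "another showmance on reality tv",
    "is the showmance real",
    "a one-off blend like glamtastic",
]
sample_known = set(
    "their was all over the news another on reality tv is real a one off blend like".split()
)

print(find_candidate_words(sample_texts, sample_known))
# [('showmance', 3)]
```

With the threshold lowered to 1, the one-off coinage would be listed too, which is exactly the kind of ‘exploitation’ the frequency test is meant to filter out.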

Once computers have successfully worked out how to spot new words or meanings as they arise, the next step is to build dictionary entries for them. To do this, the software will use whatever information the language data supplies about what the new item means and how it combines with other words (its syntax, collocates, and so on), and will then find the best examples of the word in use, and add them to the entry too. Many of the elements of this operation are already in place (and we will come back to this in a later post), so it’s only a matter of time before all of this becomes possible.
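
As a rough illustration of what ‘building an entry’ from language data could involve, the Python sketch below gathers a word’s most frequent neighbours (its collocates) and picks a couple of example sentences. Everything in it is an assumption made for the example: the names `build_entry_sketch` and `tokenize`, the two-word window, and above all the ‘shortest sentence first’ rule, which is only a crude stand-in for the much more careful example selection a real system would need.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into runs of letters a-z."""
    return re.findall(r"[a-z]+", sentence.lower())


def build_entry_sketch(word, sentences, stopwords, window=2, max_examples=2):
    """Assemble a skeleton entry (headword, collocates, examples) for one word."""
    collocates = Counter()
    examples = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        if word not in tokens:
            continue
        examples.append(sentence)
        for i, token in enumerate(tokens):
            if token != word:
                continue
            # Count neighbours within `window` tokens either side of the headword,
            # skipping the headword itself and any stopwords.
            start, end = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i and tokens[j] not in stopwords:
                    collocates[tokens[j]] += 1
    # Crude stand-in for choosing 'the best examples': prefer shorter sentences.
    examples.sort(key=len)
    return {
        "headword": word,
        "collocates": [w for w, _ in collocates.most_common(3)],
        "examples": examples[:max_examples],
    }


sample_sentences = [
    "The army tested a tiny nanodrone yesterday.",
    "A nanodrone can fly indoors.",
    "Engineers built a tiny nanodrone for the army to fly over rough ground.",
    "Nothing to see here.",
]
sample_stopwords = {"the", "a", "to", "for", "can", "over"}

entry = build_entry_sketch("nanodrone", sample_sentences, sample_stopwords)
print(entry["collocates"])
print(entry["examples"])
```

A sketch like this says nothing about what the word means. Working out a definition from the data is the hard part, and the code does not attempt it.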

Traditional inclusion criteria served us well during the age of printed dictionaries, but it is clear that they now need a serious rethink. We are in new territory here, and it may take a while before a new methodology emerges. If we are looking for a general principle, it would be hard to improve on the guidelines provided for contributors to the Wiktionary project: “A term should be included if it’s likely that someone would run across it and want to know what it means”. This still leaves room for individual judgement, but selecting words for a dictionary will never be an exact science. As Dr Johnson said, “in lexicography, as in other arts, naked science is too delicate for the purposes of life”.



About the author

Michael Rundell
