
The wisdom of crowds: can it work for dictionaries?

Crowd-sourcing refers to a ‘distributed’ method for solving problems or completing complex tasks, where large numbers of people contribute their time, knowledge and expertise in a collaborative way. The term was coined in 2006 – as I learned when re-reading Kerry Maxwell’s interesting ‘Buzzword’ article on it, written when the word was still quite new. An early and successful use of crowd-sourcing is the Linux operating system, a piece of open-source software whose basic version has been (and continues to be) refined and improved by legions of computer scientists worldwide. In this model, the contributors are experts (even if amateur experts) but increasingly, as the Web becomes ever more widely available, we find ordinary members of the public joining in.

The ‘products’ of crowd-sourcing range from the trivial to the highly technical, and can include things of genuine political or social importance. At the less serious end of the scale, a magazine’s readers might collectively compile a list of ‘favourite R&B tracks’, or people may send in photos of snow-covered gardens to a TV weather show. A more significant recent example was the BBC’s ‘Britain in a day’ project, when thousands of people filmed what they were doing on one day in November 2011: 750 hours of footage were distilled into a moving and entertaining film. (There’s also an online archive where you can see full-length versions of the video clips that made up the programme.) And the work of citizen journalists has often been critical to the reporting of events in the ‘Arab Spring’.

In this last case, ordinary citizens are taking on a job usually done by trained professionals – often because the professionals are banned from the scene. This raises questions about the value and reliability of crowd-sourced content, an issue of particular relevance to those of us working in the reference field. In 2005, the journal Nature ran a comparison between Wikipedia (crowd-sourced) and Encyclopedia Britannica (‘expert-sourced’), selecting 100 articles of comparable length on a range of scientific topics (50 from each encyclopedia), and sending them to experts in the relevant subjects, whose job was to evaluate their accuracy. The experts weren’t told which encyclopedia the articles came from. They found a total of eight ‘serious’ errors in the articles – four from each source. There were plenty of minor points where the experts disagreed with the encyclopedias, but although Britannica performed better, the ‘error rate’ of the two sources wasn’t dramatically different. Broadly speaking, the results provided a vote of confidence for crowd-sourcing. And it’s worth adding that you would expect any errors in the Wikipedia entries to be ironed out over time: the crowd-sourced model means that a reader who spots a mistake can go in and fix it, so entries should be continually improved.

Can this approach work for lexicography? In fact, there has always been an element of crowd-sourcing in dictionary-making. As far back as 1857, an army of volunteers started collecting quotations from books and other sources illustrating how words behave in context – and these ‘citations’ provided the raw materials from which the OED was eventually created. With a huge task shared between large numbers of ordinary people, this was crowd-sourcing 150 years before the word was invented.

Now we have the Web, and an explosion of ‘user-generated content’ (UGC). As in other fields, the quality of crowd-sourced content in dictionaries varies alarmingly. The Urban Dictionary has high entertainment value, and provides a good record of contemporary US slang. But when a single term like Republican includes 256 highly subjective ‘definitions’, we know we’re not dealing with an entirely serious dictionary. Macmillan’s experience with our own crowd-sourced Open Dictionary suggests that the most fruitful areas for UGC are neologisms, regional varieties, and technical terms. This relates to the issue of ‘expertise’: you don’t have to be a language expert to spot a new usage or be familiar with a word or phrase that’s characteristic of the place where you live. Technical terms are a different matter: here, expertise in the subject is what counts – but as Wikipedia demonstrates, there are plenty of experts around. Many of them are multilingual, too. One impressive resource is Eijiro Pro on the Web, a bidirectional Japanese-English dictionary, whose exceptional coverage of technical vocabulary derives largely from crowd-sourced content.

There is probably an argument that lexicographers are still best placed to describe ‘general’ vocabulary. After all, there are plenty of experts out there on transpiration or quantitative easing, but you can’t be an expert on clear or decide – unless you’re a lexicographer armed with large corpus resources and the skills to use them. This is still all quite new, and there’s a debate in the dictionary community about crowd-sourced content and how we deal with it. Hoping it will go away is certainly not the answer. A more optimistic view would be that enlisting enthusiastic amateurs and subject-field specialists could help us to develop even better resources in the long run.


About the author


Michael Rundell

