The future of lexicography: does lexicography even have a future?
Posted by Michael Rundell on November 11, 2011
The conference got off to a rip-roaring start as Simon Krek (one of the organizers) outlined a radical vision for a future in which a range of intelligent language tools would be freely available to make communication easier. The functions Simon mentioned include real-time subtitling (so a person in China could follow the U.S. elections on American TV, since everything would be translated on the spot), automatic summarization of complex documents (in your own language or another one), and (most ambitiously) instant speech-to-speech translation.
Some of this may sound like science fiction. But the same could be said of the ideas that visionaries like John Sinclair were putting forward 25 years ago – some of which are now quite mature technologies that we already take for granted.
Dictionaries – in their familiar form, at least – aren’t necessarily part of this longer-term vision. After all, dictionaries evolved in order to fulfil a number of communicative and informational needs that people have, but there may be more efficient ways of meeting those needs. It’s already the case that many people (especially digital natives) no longer turn to a dictionary to find out what something means or how it is pronounced or spelled. In this new model, the user simply expresses a need to resolve a particular language problem, and the computer does the rest. What matters is getting the answer (quickly and reliably) – and the ‘container’ that holds the answer is of no importance. Adam Kilgarriff (a regular contributor to this blog) put it like this: the dictionary may simply ‘dissolve’ to form one component – albeit an important one – in a much broader operation which can be characterized as ‘search’.
For ‘search’ to work optimally, we need high-quality language resources. Dictionaries are certainly part of this, and so are encyclopedias and corpora. There is interesting work going on in the ‘semantic tagging’ of corpora: the Dutch computational linguist Piek Vossen reported on a project of this type. Essentially, it’s about training computers to do something that human beings are very good at, namely recognising which meaning is intended when several are possible. The challenge is to get the computer to recognise whether the word mouse, in a particular context, refers to a small rodent, a computer device, or a shy, quiet person. This is extremely difficult work, but Piek’s paper demonstrated that the task was doable, and that the computer’s success rate was gradually improving.
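To make the mouse example concrete, here is a toy sketch of one very simple way a program might pick a sense: count how many context words overlap with a set of cue words for each meaning. This is not the method used in Piek Vossen's project, which the talk summary above does not describe in detail; the sense labels, the cue words and the function name `disambiguate` are all invented for illustration, and a real system would draw on far richer resources.

```python
import re

# Hypothetical mini sense inventory for "mouse".
# The cue words are invented for this illustration, not taken from any dictionary.
SENSES = {
    "rodent": {"cheese", "cat", "tail", "trap", "squeak", "fur"},
    "computer device": {"click", "cursor", "keyboard", "screen", "usb", "button"},
    "shy person": {"shy", "quiet", "timid", "meek", "party", "spoke"},
}


def disambiguate(sentence, senses=SENSES):
    """Return the sense whose cue words overlap most with the sentence.

    Returns None when no cue word appears in the sentence at all.
    """
    # Lowercase the sentence and keep only alphabetic tokens.
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    # Score each sense by the number of cue words found in the context.
    scores = {sense: len(tokens & cues) for sense, cues in senses.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    for text in [
        "The cat chased the mouse into a trap",
        "Click the left mouse button to move the cursor",
        "She was a timid mouse who was too shy to speak",
    ]:
        print(text, "->", disambiguate(text))
```

Even this crude approach shows why the task is hard: a sentence such as "I saw a mouse" contains no cue words at all, so the sketch returns None, whereas a human reader would usually resolve it from the wider situation.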
I can’t do justice to the range of excellent papers I’ve attended so far; I feel I’ll need a couple of weeks in a quiet place just to digest it all. Many talks dealt with issues we’ve looked at regularly on this blog. The importance of pragmatics, for example, was highlighted by Mojca Šorli, who had some good proposals for improving the way dictionaries present this information. Collocations, not surprisingly, featured in several sessions: how to extract them automatically from corpora; how to integrate them into reference resources; how to cater for users who need more specialized collocational information (for example when writing academic texts); and how computer systems can be trained to identify ‘miscollocations’ and propose corrections. And as always, there’s a lot to learn from resources being developed for languages other than English.
Erin McKean rounded off the first day with an entertaining talk about the amazing Wordnik site – one example of a resource that goes way beyond what we’d expect in a conventional dictionary, and shows (among other things) how engaging user-generated content can be. As good a demonstration as any of a well-known quote by science fiction writer William Gibson: “The future is already here – it’s just not very evenly distributed”.
There’s no space (or time) to say more, but there will be a final round-up early next week.
(For a final summary of eLEX2011, see this post.)