A dog named ‘Corpus’

Another conference report, this time from the first Asia-Pacific Corpus Linguistics Conference (APCLC), recently held in Auckland, New Zealand. Corpus linguistics involves using corpus data as the raw materials for studying language – so in a sense, dictionary-writers are the ultimate corpus linguists. But while the e-Lexicography conference we covered a few months ago focussed on what dictionaries might look like (or turn into) in a few years’ time, APCLC was very much about language: what we can learn about it by studying corpus data, and how this can be applied to improving the teaching of languages. I won’t try to summarize all the papers, and in any case we’ve invited some of the speakers at the conference to contribute posts to the blog over the next few months. But three of the talks are especially worth mentioning.

Susan Conrad (co-author of a well-known book on corpus linguistics) described a project which applies a corpus linguistic approach to the teaching of writing. Her subjects are civil engineers, and her starting point ‘the mismatch between the writing skills of engineering graduates and the demands of writing in the workplace’. Employers noted that the graduates may be technically very proficient, yet ill-equipped for the writing they have to do. And that was an eye-opener: non-engineers like me tend to think of engineering as a hands-on profession, so it was surprising to hear just how much writing is required: what with reports, surveys, proposals, and technical memoranda, engineers spend a significant amount of their time producing written texts. So it makes sense that their training should address this aspect of the job – but generally it doesn’t. Conrad compared corpora of texts produced by engineering students and practising engineers. She found that the students tended to favour complex sentences, passives, and other devices which (they thought) made their texts sound impressively technical – whereas the practitioners preferred a plainer style and argued for clarity and precision. On the basis of this research, she has set up a website which summarizes her findings and will (when complete) include a range of teaching materials to address recurrent problems and help budding engineers to become effective writers. What was striking was that the engineering companies were so open to advice from a corpus linguist. This was partly because they recognized that poorly written documents could undermine a firm’s credibility, and partly because – as scientists – they could see the value of a ‘scientific’, evidence-based approach to analyzing language and solving problems. Altogether, a nice example of corpus linguistics in the workplace.

Paul Nation is well known for his research in vocabulary acquisition, and he has written extensively on vocabulary size. His talk gave some estimates of how much text (books, magazines etc) you would need to consume in order to encounter the core words of a language often enough for them to be absorbed into your personal lexicon. He accepted there were some problems with his calculations (which are fairly crude), but concluded that somewhere between five and six hours of ‘input’ per week could, within a year, provide the right level of exposure. As he pointed out, ‘learners need a vocabulary of around 7000 to 8000 words before unsimplified written input is likely to be comprehensible’. Encouragingly for us, this figure is in the same ballpark as the Macmillan Dictionary’s ‘core vocabulary’ – the 7500 red words.

Yukio Tono, Japan’s most prolific corpus linguist, gave a brilliant overview of recent and current developments in corpus-based language teaching. These include work going on with the Common European Framework of Reference (CEFR). Researchers (including Tono himself) are using data from native-speaker corpora, learner corpora, and a corpus of language-teaching textbooks to identify the vocabulary, structures, and other linguistic features that can be said to correspond to the different CEFR proficiency levels. Another trend he mentioned was the arrival of language-learning websites from ‘non-traditional’ suppliers – notably, a new site being developed by NTT (Japan’s premier telecoms company). NTT are in the early stages of setting up an online education service (called ‘Education Square by ICT’) which is based on the use of tablet computers. At the moment they are trialling the service on a small scale, but it looks like an ambitious project, and NTT could become a significant player in the language-teaching business.

Though well-known in academic circles as a corpus linguist, Yukio Tono is something of a celebrity in Japan. He devised a long-running TV series in which each programme was devoted to one of the ‘100 English keywords’, with frequent references to ‘the corpus’ as the source of information on how these words behaved. The programme was watched by millions, and Tono himself appeared in it as the language guru. The result has been to popularize the term ‘corpus’, which is now so well known in Japan that there are, apparently, several dogs whose owners have called them ‘Corpus’. Whether this will catch on as a popular doggy name, only time will tell.

(There is a link here to Tono’s TV programme, but this may not work everywhere.)

About the author


Michael Rundell
