Today’s guest post comes from Tony McEnery, Professor of Linguistics and English Language at Lancaster University, and Robbie Love, Research Student at the ESRC Centre for Corpus Approaches to Social Science at Lancaster University.
______________
Twenty years ago, a consortium of researchers from dictionary publishers, universities, and the British Library released the British National Corpus (BNC) – a 100-million-word corpus of written and spoken English. Since then, the BNC has helped to produce dictionaries, and has been used in tens of thousands of academic books, lectures, theses, and research papers. There is no doubt that the efforts of the BNC’s compilers twenty years ago helped to convince many people that corpus linguistics had something to offer that was innovative and, importantly, useful – so much so that this year, Lancaster University launched the first and only massive open online course (MOOC) for corpus linguistics, which began its second run on 29 September 2014. The course has been taken by thousands of students worldwide.
As useful as the BNC has proven to be, there are some applications for which it is becoming less and less suitable. Simply put, we can no longer claim that the language in the BNC is representative of present-day British English. So much has changed in British society in the intervening years that we can no longer accept discussions of the poll tax or Prime Minister John Major as the language of today – not to mention the increasingly conspicuous absence of words like Facebook, smartphone and Google. So, twenty years later, we at the ESRC Centre for Corpus Approaches to Social Science (CASS) at Lancaster University have teamed up with Cambridge University Press, with the aim of solving this problem by collecting a new British National Corpus, called the BNC2014. This will not put the original BNC ‘out of business’. Far from it. The BNC will exist alongside the BNC2014 both as a standalone resource and, through comparison with its successor, as a tool for analysing how British English has changed over the last two decades.
We are starting our ambitious project with the spoken component (the Spoken BNC2014), which will comprise approximately ten percent, or ten million words, of the entire corpus. We are making a collection of conversations between people from across the UK whose first language is British English. We are doing this by inviting participants to record conversations from their home life, using the audio recording feature on their mobile phone, or any other device that records in .MP3 format, and emailing the recordings to us. We are accepting recordings from any and all settings, including family meal times, meeting friends for a coffee, visiting grandparents, car journeys, or simply relaxing in the living room. We will then transcribe the conversations and add them to our collection of real-life language data.
So far, we have collected and transcribed two million words of conversations from over two hundred speakers across England. We compared this data to the original spoken BNC and found out which words had most radically decreased and increased in use between the two. The words which occurred much less in the new data compared to the old data, and which may be said to characterize the conversation of the early 1990s when compared to today, include fortnight, marvellous, fetch, walkman, poll, catalogue, pussy cat, marmalade, drawers, and cheerio. In contrast, the words which, relatively, are much more frequent in the new data, and as such characterize the conversation of today when compared to the early 1990s, include facebook, internet, website, awesome, email, google, smartphone, iphone, essentially, and treadmill. We hope that, once we have collected much more real-life language data, we will be able to analyse in detail the contexts that such words occur in, and ultimately pinpoint exactly how they are defined by British society in the modern day.
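For readers curious about how a comparison of this kind can be carried out, here is a minimal sketch in Python. It is not the procedure used for the figures above: it assumes the log-likelihood keyness statistic, which is a common choice in corpus linguistics, and it runs on two tiny invented token lists rather than on the BNC data. The function names and the sample sentences are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(freq_a, freq_b, total_a, total_b):
    """Log-likelihood keyness score for one word across two corpora."""
    combined = freq_a + freq_b
    # Frequencies we would expect if the word were equally common in both.
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)
    score = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


def keywords(tokens_old, tokens_new, top_n=10):
    """Return (falling, rising) word lists, strongest differences first."""
    counts_old, counts_new = Counter(tokens_old), Counter(tokens_new)
    total_old, total_new = len(tokens_old), len(tokens_new)
    falling, rising = [], []
    for word in set(counts_old) | set(counts_new):
        a, b = counts_old[word], counts_new[word]
        score = log_likelihood(a, b, total_old, total_new)
        # Relative frequency decides the direction of change.
        if a / total_old > b / total_new:
            falling.append((word, score))
        elif a / total_old < b / total_new:
            rising.append((word, score))

    def strongest_first(item):
        return (-item[1], item[0])  # ties broken alphabetically

    return (sorted(falling, key=strongest_first)[:top_n],
            sorted(rising, key=strongest_first)[:top_n])


# Invented toy data, standing in for the transcribed conversations.
old_tokens = "i will fetch the catalogue in a fortnight cheerio".split()
new_tokens = "i will google the website and email you on facebook".split()
falling, rising = keywords(old_tokens, new_tokens)
```

Here `falling` holds the words that are relatively more frequent in the older sample and `rising` those that are relatively more frequent in the newer one. With real corpora of millions of words, the scores become far more meaningful than they are on two short sentences.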
But corpus linguistics is not just about counting words. We believe a corpus the size of the Spoken BNC2014, once complete, could be as useful as, if not for some purposes more useful than, survey data for gauging the tastes and opinions of the public. As an example of how much has changed since the 1990s, we used the data we have collected so far to track how attitudes towards women have changed over time. In the original spoken BNC, the most commonly used adjectives to describe women were old, young, stupid, pretty, big, naked, nice, silly, married, and beautiful. In the Spoken BNC2014 data so far, the commonest adjectives are old, young, other, little, many, international, different, crazy, and fifty-year-old. The most striking difference is that women seem much less likely to be described based on their appearance or, worse, sexualized. In other words, we can see real change taking place before our very eyes in the data already, and in order to confirm our existing findings and, more importantly, make even bigger, more meaningful claims, we need to collect more data.
The Spoken British National Corpus 2014 project is ongoing, and the researchers invite you to participate by emailing corpus@cambridge.org.
