Crawling the Web for new words
Posted by Daphné Kerremans on February 26, 2014
Today’s guest post, from Daphné Kerremans of the Ludwig Maximilian University in Munich, is another in our occasional series on developments in language technology (and how they help us produce better dictionaries). Daphné is a linguist interested in the socio-cognitive mechanisms of language processes, specifically regarding the adoption of linguistic innovations by individuals and the speech community at large. She earned her PhD with a project investigating the diffusion process of English neologisms on the Internet.
On hearing that I’m a linguist, people at parties often ask me – after the obligatory question of how many languages I speak – whether this or that new word they have heard is likely to stay, like selfie or amazeballs, or whether it will fade from active use, like vuvuzela and Y2K. Ten years ago, I would probably have mumbled something about ‘the unpredictability of the language system’ and ‘invisible processes’, before trying to escape and avoid fuelling the image of the quixotic linguist in their stuffy library.
These days, however, answering this question no longer requires crystal-ball-inspired guesswork. We can now investigate neologisms with the help of the Web and a sophisticated toolbox which monitors and visualizes the life cycle of new words as they emerge. At our department we have developed a customized webcrawler tool, the NeoCrawler, which scours the Internet every week for occurrences of any new word, and downloads the relevant pages into a dynamic database. In a standard corpus, time passes between updates (if it is updated at all), and in the meantime words that were once new either become mainstream or simply disappear. The NeoCrawler, by contrast, allows us to continuously track the diffusion of new words in real time.
Prior to storing them, the NeoCrawler sweeps through the web pages with a broom, cleaning up programming code and linguistically irrelevant material (pictures, URLs, lists, ads etc). This is necessary in order to reduce the cluttered surface of web pages, which can complicate further linguistic analysis. Next, duplicates (identical or near-identical pages) are filtered out and the remaining pages are then ‘tokenized’ – that is, the text they contain is split into individual words. Tokenization yields a first, rudimentary overview of the frequency development of the neologism in question. It also serves as the input for the ‘false-positive detection’ tool. If the NeoCrawler does not find a single token of the queried neologism on the retrieved page, the page is removed. Such false positives are an annoying side effect of the reliance on the Google index to access the Web. If this index hasn’t been updated fast enough, changes in Web content may have gone unnoticed, so that a page is still listed as a hit even though the neologism it once contained has since disappeared from it. Since such false positives can distort any conclusions about a neologism’s frequency and dispersion, early detection is of utmost importance.
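For readers who like to see such steps spelled out, here is a minimal Python sketch of the filtering stages described above. It is an illustration only, not the NeoCrawler's actual code: the function names (`normalize`, `tokenize`, `process_pages`) are hypothetical, the input is assumed to be text that has already been cleaned, and the duplicate check only catches pages that are identical once case and whitespace are ignored, which is much cruder than real near-duplicate detection.

```python
import hashlib
import re

# Letters only, allowing internal apostrophes and hyphens (e.g. "don't", "crystal-ball").
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")


def normalize(text):
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def tokenize(text):
    """Split text into lowercase word tokens."""
    return [match.group(0).lower() for match in TOKEN_RE.finditer(text)]


def process_pages(pages, neologism):
    """Filter duplicates and false positives from (url, cleaned_text) pairs.

    Returns a dict mapping each kept URL to the number of tokens of the
    neologism found on that page.
    """
    target = neologism.lower()
    seen_fingerprints = set()
    kept = {}
    for url, text in pages:
        # Step 1: duplicate filter (assumption: exact match after normalization).
        fingerprint = hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()
        if fingerprint in seen_fingerprints:
            continue
        seen_fingerprints.add(fingerprint)

        # Step 2: tokenization.
        tokens = tokenize(text)

        # Step 3: false-positive detection, dropping pages without a single token.
        hits = tokens.count(target)
        if hits == 0:
            continue
        kept[url] = hits
    return kept


if __name__ == "__main__":
    sample = [
        ("page-a", "Those cupcakes were amazeballs! Amazeballs, I say."),
        ("page-b", "Those  cupcakes were AMAZEBALLS! Amazeballs, I say."),
        ("page-c", "Nothing new to see here."),
    ]
    print(process_pages(sample, "amazeballs"))
```

In this toy sample, the second page is discarded as a duplicate of the first and the third as a false positive. Summing the per-page counts for each weekly crawl would give the kind of rudimentary frequency overview mentioned above.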
Although frequency graphs are fine and exciting data, I believe it’s much more valuable for linguists, lexicographers and logophiles to find out who has been using the word and in what contexts. The NeoCrawler toolbox therefore also contains analysis tools for linguistic profiling. A manual text-type classification tool categorizes the results into blogs, portals, social networks, microblogging services and so on. In addition, the NeoCrawler allows us to classify the pages according to discourse topics such as sports, entertainment, business and finance or politics. On a more granular level, it can also capture stylistic, textual and grammatical features, and thus enhance the neologism’s profile.
What can we do with these profiles? I could talk about what I do with them and provide more evidence for the pattern Stan Carey describes regarding the sudden rise in the use of amazeballs. But that’s a different geek story.