The ‘orderliness’ of language

Here’s a made-up sentence. See if you can spot the three most infrequent words in it:

If you eschew expensive restaurants, you will save more than an exiguous amount, and this will ameliorate your financial situation.

That wasn’t too difficult, was it? We’ve discussed before the idea that very few words are exact synonyms of one another, but in semantic terms these three are more or less the same as three more familiar words: eschew means ‘avoid’, exiguous means ‘small’, and ameliorate means ‘improve’. There are differences in register (all three are labelled ‘very formal’) but the really striking difference is in frequency: eschew, exiguous, and ameliorate have a combined frequency of 3793 occurrences in our corpus, whereas improve alone occurs over 385,000 times (and small is twice as frequent as improve).

Why does this matter? If a word is very common, it belongs to the ‘core vocabulary’ of a language. For receptive purposes, you need to be familiar with core vocabulary because a high percentage of everything you read or listen to consists of these words. For productive purposes, you need to really ‘know’ core vocabulary words – not just know what they mean but know how they combine with other words, syntactically or collocationally. Conversely, if a word is rare, there isn’t much point in memorizing it, because the likelihood is that you will never encounter it again.

As most readers will know, we show the ‘core’ words in red in the Macmillan Dictionary, with a ‘star rating’ as a guide to their frequency: avoid, improve, and small are all three-star words, so they belong to the commonest 2500 words in English. With today’s big corpora and smart software, getting reliable frequency data is pretty straightforward. But one man discovered the ‘asymmetrical’ way vocabulary is distributed in languages – with a small number of frequent words and a large number of rare ones – long before computer corpora existed. In 1935, the Harvard professor G.K. Zipf published The Psycho-Biology of Language, in which he applies statistical techniques to investigate the distribution of words in English, Latin and Chinese. In all three languages, a similar pattern emerges: most of the words in a language are rare (and there are tens of thousands of these) – but most of the words in a text are common words, because the same high-frequency words are used over and over (which is why they’re frequent, of course).

Zipf’s observations demonstrated what he refers to as the ‘orderliness’ of language, and he expressed this in statistical terms. His formula indicates a constant relationship between the rank (r) of a word in a frequency list (e.g. whether it is the 5th, 50th, or 5000th most frequent) and its frequency (f) in texts: thus r x f always produces a similar number. To give a concrete example: in the British National Corpus, the 10th most common word form is ‘was’, with a frequency of about 924,000; the 100th most frequent, ‘made’, occurs about 92,000 times; and the 1000th most frequent, ‘advice’, has a frequency of just over 10,000. If you apply the r x f formula, you get (respectively) results of 9,240,000; 9,200,000; and 10,400,000 – not exactly the same, but a good illustration of the orderliness Zipf talked about. With more data at our disposal, we can now establish that the commonest 7500 English words (the red words in the Macmillan Dictionary) provide ‘coverage’ of almost 95% of non-technical texts: that is, when you read a newspaper or novel, for example, about 95% of the text will consist of ‘red words’.
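For anyone who wants to check the arithmetic, here is a minimal Python sketch of the r x f calculation, using only the three British National Corpus figures quoted above. The frequencies are the rounded values from this post, and the figure of 10,400 for ‘advice’ is an assumption inferred from the product of 10,400,000 given in the text; the function and variable names are made up for illustration.

```python
# (rank, word form, approximate BNC frequency) as quoted in the post.
# 10_400 for 'advice' is an assumption inferred from the quoted product
# of 10,400,000; the post only says "just over 10,000".
observations = [
    (10, "was", 924_000),
    (100, "made", 92_000),
    (1000, "advice", 10_400),
]


def rank_times_frequency(rank, frequency):
    """Return r x f, the quantity Zipf found to be roughly constant."""
    return rank * frequency


products = [rank_times_frequency(r, f) for r, _, f in observations]

for (rank, word, freq), product in zip(observations, products):
    print(f"{word:>6}: rank {rank:>4} x frequency {freq:>7,} = {product:,}")

# How far apart are the largest and smallest products?
spread = max(products) / min(products)
print(f"largest / smallest product: {spread:.2f}")
```

The ranks differ by a factor of 100, yet the three products stay within about 13% of one another, which is the ‘similar number’ the formula predicts.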

Subsequent studies have shown that a similar pattern occurs in most languages. There are other implications too. For example, frequent words are usually shorter than rare ones, and generally have several meanings and complex behaviour in terms of their syntax and collocations. Rare words tend to be simpler, but also longer. It is not clear why languages work this way, but it seems to be related to the limits of memory and the ‘principle of least effort’: if a fairly small number of words enables us to communicate most of what we need to say, this is more efficient than having a single distinct word for every possible concept we might want to express. Interestingly, a ‘Zipfian’ model can be seen in other areas, such as the way income is distributed in a country: a small number of rich people have a high percentage of the wealth, while a far larger number of poor or less well-off people share what is left – but that’s a subject I will eschew for now.

About the author

Michael Rundell

