Purses & wallets: corpus sex differences
Posted by Jamie Keddie on May 18, 2009
In the English class, teachers can make use of internet search engines to answer students’ linguistic queries such as:
“Which is more common – look after yourself or take care of yourself?”
In the following pie chart, figures refer to the number of Google hits obtained by running a web search of these two items:
Of course, the Internet was never designed for language investigation, so results like these should be taken with a pinch of salt (i.e. should not be taken too seriously). Apart from the problem of underrepresented genres and registers (e.g. spoken English, literary English, etc.), there is also the problem of knowing exactly what types of language the Internet consists of (e.g. chat room English, blogging English, spam, IT English, pornographic English, etc.) and in what proportions.
Even if we did have an accurate linguistic breakdown of the World Wide Web, we couldn’t be sure which parts the search engines are reaching. And why do results for any given search vary from day to day or even minute to minute?
And then there are what Michael Rundell refers to as ‘distractors’. His example was the book Bored of the Rings. Its prominent online presence would give an unfair advantage to the collocation bored of if we wanted to compare its frequency with that of bored with.
But despite all of these problems, the universal familiarity of Google makes it a fun tool for learners to compare frequencies of words, collocations and other lexical items. As a way of introducing my own students to the corpus principle, I recently made the following clip.
Although I can’t imagine anyone taking this clip too seriously, I now think that it could have been better.
Let’s start with the sentences which are marked by the subject pronouns he and she:
“He talks nonsense” 2,500 Google hits
“She talks nonsense” 619 Google hits
“He likes dogs” 10,800 Google hits
“She likes dogs” 2,980 Google hits
In both of these examples, the ratio of he to she sentences is roughly 4:1. But the interesting thing is not that he talks 4 times more nonsense than she does. Neither is it that he likes dogs 4 times more than she does. The interesting thing is that he is represented about 4 times more often than she is on the Internet, at least according to the results obtained by Google:
he 3,820 million Google hits
she 1,080 million Google hits
This gender disparity is also observable in the British National Corpus (BNC). A simple search gives the following frequencies:
he 640,736 results
she 352,872 results
So perhaps if we want to find corpus sex differences using subject pronouns as markers, it would be an idea to multiply the results of she sentences by roughly 4 when using Google and by roughly 2 when using the BNC.
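As a rough sketch of that adjustment (it was not part of the clip), the baseline he:she ratio can be worked out from the figures quoted above and then divided out of any sentence pair. Dividing the he:she ratio of a sentence pair by the baseline is equivalent to multiplying the she count by it. The function and variable names are purely illustrative, and the counts are simply the ones reported in this post; a fresh search would almost certainly return different numbers.

```python
# Baseline pronoun counts as quoted in this post (Google hits and BNC results).
# These are snapshots from the post, not live figures.
BASELINE = {
    "google": {"he": 3_820_000_000, "she": 1_080_000_000},
    "bnc": {"he": 640_736, "she": 352_872},
}


def baseline_ratio(source: str) -> float:
    """Return how many times more frequent 'he' is than 'she' in a source."""
    counts = BASELINE[source]
    return counts["he"] / counts["she"]


def normalised_ratio(he_hits: int, she_hits: int, source: str) -> float:
    """Return the he:she ratio of a sentence pair with the baseline divided out.

    A value close to 1.0 suggests the pair is no more skewed than the
    pronouns themselves; values well above or below 1.0 hint at a real
    difference (subject to all the caveats about web counts).
    """
    return (he_hits / she_hits) / baseline_ratio(source)


if __name__ == "__main__":
    print("Google baseline:", round(baseline_ratio("google"), 2))
    print("BNC baseline:", round(baseline_ratio("bnc"), 2))
    print("talks nonsense:", round(normalised_ratio(2_500, 619, "google"), 2))
    print("likes dogs:", round(normalised_ratio(10_800, 2_980, "google"), 2))
```

With the figures above, the baselines come out nearer 3.5 (Google) and 1.8 (BNC) than the rounded 4 and 2, and both sentence pairs end up close to 1 once the baseline is removed.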
The same imbalance applies to the possessive markers his and her:
|Google Hits||BNC results|
The problem with these corpus searches, however, is that as well as being possessive markers, her is also an object pronoun and his is also a possessive pronoun:
|Object pronoun||Possessive||Possessive pronoun|
|I love him||I love his mother||I love his|
|I love her||I love her mother||I love hers|
However, the higher frequency of the male possessive marker is confirmed with a few searches of “his/her + noun” pairs:
|Google hits||BNC results|
So, as with the subject pronoun markers, any searches involving possessive markers are compromised from the start.
But despite all this, the search results presented in the clip were never intended to be taken too seriously. Rather, the intention was:
- To demonstrate the corpus principle to language learners.
- To remind students that when dealing with language and life, things are not always black and white (some men have purses, some women have wallets).
- To introduce students to one level of diversity in language (the gender level – women may use the adjective lovely more than men).