What can language research tell us about the ‘real world’? Part 3


  • As promised many moons ago on June 25th, here are the top 20 two-word expressions to be found in quotation marks in the giant ukWaC corpus (1,318,612,719 words). The numbers represent the raw frequency in the corpus:

    1. 413 as is
    2. 319 The Prisoner
    3. 315 thank you
    4. 294 real world
    5. 285 hands on
    6. 239 Third Way
    7. 231 best practice
    8. 226 at risk
    9. 202 out there
    10. 194 Plymouth Brethren
    11. 193 Contact Us
    12. 183 AS IS
    13. 180 must have
    14. 180 how to
    15. 164 I AM
    16. 156 regime change
    17. 129 what if
    18. 128 real life
    19. 127 Thank You
    20. 123 Thank you

    There are quite a few interesting expressions in this list, but also a few oddities. What on earth are the Plymouth Brethren doing there, for instance? This may be an effect of the composition of the corpus. I hope to return to this list in future blog posts.

    For the technically minded, the original corpus search – for two-word expressions enclosed in quotation marks – consisted of a CQL query in SketchEngine:

    [lemma = "\""] [word = "[a-zA-Z][a-zA-Z]*"] [word = "[a-zA-Z][a-zA-Z]*"] [lemma = "\""]

    The sequence [word = "[a-zA-Z][a-zA-Z]*"] represents any single alphabetic word. I needed it twice because I was searching for two-word expressions. (In theory, two sets of empty square brackets, [] [], should have done the trick, but when I tried it the query also found single words in quotation marks. Perhaps someone at SketchEngine can tell me why.)
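
    For readers without access to SketchEngine, the same idea can be approximated over plain text with a regular expression. What follows is a minimal sketch in Python, not the CQL query itself: it assumes straight double quotes and whitespace-separated words, and the names TWO_WORDS_IN_QUOTES and find_quoted_pairs are hypothetical. Note that [a-zA-Z][a-zA-Z]* is the same thing as [a-zA-Z]+.

```python
import re

# Rough plain-text analogue of the CQL pattern: a double quote, two purely
# alphabetic words separated by whitespace, then another double quote.
# Assumption: straight quotes only; a tokenised corpus may behave differently.
TWO_WORDS_IN_QUOTES = re.compile(r'"([a-zA-Z]+)\s+([a-zA-Z]+)"')


def find_quoted_pairs(text):
    """Return every two-word alphabetic expression enclosed in double quotes."""
    return [f"{first} {second}" for first, second in TWO_WORDS_IN_QUOTES.findall(text)]


sample = 'Goods are sold "as is" in the "real world", not "online".'
print(find_quoted_pairs(sample))
```

    A plain regular expression cannot tell an opening quotation mark from a closing one, so text lying between two separately quoted items may occasionally be picked up by mistake.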

    I then downloaded and saved the concordances to a file (let’s call it File.txt) and (after much head scratching) applied the following Unix command to it:

    cat File.txt | grep '< \"' | sed 's/^.*.*$//g' | sort | uniq -c | sort -nr > FileSorted.txt

    If anyone would like to know what each part of that string of algebra is doing, please contact me privately. If you’ve got Linux or Cygwin on your computer, you can try it for yourself.
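
    The last three stages of that pipeline (sort | uniq -c | sort -nr) tally identical lines and list them from most to least frequent. As a rough sketch of that counting stage only, here is a Python equivalent. It assumes the expressions have already been extracted, one per item, and it does not reproduce the grep and sed steps; the function name rank_expressions is hypothetical, and ties are ordered alphabetically here, which may differ from what sort -nr does.

```python
from collections import Counter


def rank_expressions(lines):
    """Count identical lines and return (count, expression) pairs, highest count first."""
    # Counting is case-sensitive, as with uniq, so "Thank you" and "thank you"
    # are tallied separately.
    counts = Counter(line.strip() for line in lines if line.strip())
    # Descending count; ties broken alphabetically (an assumption of this sketch).
    return sorted(((n, expr) for expr, n in counts.items()),
                  key=lambda pair: (-pair[0], pair[1]))


extracted = ["as is", "real world", "as is", "Thank you", "thank you", "as is", "real world"]
for count, expression in rank_expressions(extracted):
    print(count, expression)
```

    Because the tally is case-sensitive, variants such as "as is" and "AS IS" appear as separate entries, as they do in the list above.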

    And so to the prize. We promised a Macmillan Collocations Dictionary to the first person to correctly predict any five of the above top 20 expressions. The undoubted winner is Alexander Bochkov with 7 correct predictions (but Alexander, were you actually predicting, or counting?). Alexander – could you email your address details to medoblog.admin -at- googlemail.com, and a dictionary will find its way to you sometime after August 13th.

    Honourable mentions go to Monika Sobejko and to Caroline, not only for their correct *predictions* and near misses, but also for their excellent comments, nailing down why we actually write some expressions in quotation marks.

    Thank you to everyone who took part.

  • The Unix command did not appear correctly in the above comment. This is the correct version:

    cat File.txt | grep '< \"' | sed 's/^.*.*$//g' | sort | uniq -c | sort -nr > FileSorted.txt


  • No, that’s not the right command either – it’s just the same as before. Something is happening between Copy, Paste, and Submit. Ah, computers. Please email me privately if you would like the correct Unix command. j

  • Great news! Thank you very much. I sent an email to the above-mentioned address but I haven’t received a reply yet. Also, how can I contact you to get the Unix command?