How nouny is vocabulary?
Introduction
I am keen to find out how our vocabulary is distributed among the major word classes (aka Parts of Speech – Noun, Verb and so on), and whether the proportion changes as vocabulary becomes rarer. So what percentage of word types (more accurately, lemmas) are nouns, what percentage are verbs, and so on?
This question is related to a long-standing interest in the distribution of word tokens among the word classes. In particular, the proportion of running words in a text which are nouns (including proper as well as common) seems to be rather constant, but also seems to increase with formality and age (Hudson 1994). Why should nouns be (slightly) more common in more formal output by older people?
One possible explanation is that the proportion of nouns in general vocabulary increases with rarity – in other words, the less common a word is, the more likely it is to be a noun. A priori this seems plausible, but it would be reassuring to have statistical evidence.
After failing to find anything relevant via Google, I broadcast two queries via the Linguist List (18.2686, 15 Sept 2007 and #2009-24, 29 Jan 2009). As usual, the results were very helpful, and are summarised in the following section.
Data
- Thanks to Gwillim Law, of Measurement, Inc. Figures based on Mark Davies’s Corpus of Contemporary American English (http://www.americancorpus.org/). Law writes: “This corpus is tagged with the CLAWS7 tagset (http://ucrel.lancs.ac.uk/claws7tags.html), truncated to three or fewer characters. I started with a purchased dataset containing the most frequent 55,000 lemmas in the corpus, with the frequency and the tag of each wordform. There are 111,821 distinct wordform-tag pairs in the file. For example, ‘feed’ appears 2,837 times in the corpus with the tag ‘nn1’, 11 times with the tag ‘nnu’, 3,396 times tagged ‘vv0’, and 6,219 times tagged ‘vvi’; that accounts for four of the wordform-tag pairs. Since you’re only interested in ‘major’ word classes, I truncated the tags further, leaving only the first character (shown in row 1 of the spreadsheet). That first character relates roughly to a major word class (shown in row 2). Since you wanted to see how the percentage varies with frequency, I calculated percentages for the 256 wordform-tag pairs with the highest frequency, and then the 512 highest-frequency pairs, the 1,024 highest-frequency pairs, and so on (numbers shown in column A of the spreadsheet). The percentages are shown in the body of the table.” Click here for Gwillim Law’s spreadsheet (‘raw’). I then added a second page (called ‘rotated’) where the raw table is rotated and the main word classes are extracted, with a graph showing trends.
- (from Myq Larson) The first, second and third thousand lemmas (in terms of token frequency) for spoken English from Leech, G., Rayson, P., & Wilson, A. (2001). Word Frequencies in Present-day Written and Spoken English: Based on the British National Corpus. Harlow: Longman. Click here for his raw data for the most common word classes (where ‘noun’ includes both common and proper nouns), and a graph showing trends across frequency bands.
- (from Jasper Holmes) On Adam Kilgarriff’s website, the first six thousand lemmas (in terms of frequency) for spoken English, again in the BNC, with word classes as percentages of total tokens and total types; click here for these figures. Also, click here for Adam Kilgarriff’s raw data for all the main word classes (where ‘noun’ includes both common and proper nouns) classified for six frequency bands, and a graph showing trends across frequency bands for the four most common classes.
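The banding procedure that Gwillim Law describes for the first dataset can be sketched in Python as below. This is only an illustration of one reading of his description, not his actual method: the function name `class_shares`, the sample rows other than the four ‘feed’ rows, and the choice to count each wordform-tag pair once (a percentage of pairs rather than of running words) are all assumptions.

```python
from collections import Counter

# (wordform, CLAWS7 tag truncated to three characters, corpus frequency).
# Only the four "feed" rows come from Law's description; the other rows
# and their frequencies are invented for illustration.
ROWS = [
    ("feed", "nn1", 2837),
    ("feed", "nnu", 11),
    ("feed", "vv0", 3396),
    ("feed", "vvi", 6219),
    ("the", "at", 50000),    # invented frequency
    ("of", "io", 30000),     # invented frequency
    ("green", "jj", 1200),   # invented frequency
    ("quickly", "rr", 900),  # invented frequency
]


def class_shares(rows, first_band=256):
    """Percentage of wordform-tag pairs per tag-initial, for cumulative bands.

    Bands double in size (256, 512, 1,024, ...) and each band contains
    every pair from the previous one. Assumption: each pair counts once,
    regardless of its frequency.
    """
    ranked = sorted(rows, key=lambda row: row[2], reverse=True)
    shares = {}
    band = first_band
    while True:
        top = ranked[:band]
        # Truncate each tag to its first character, the rough major class.
        counts = Counter(tag[0] for _, tag, _ in top)
        shares[band] = {cls: 100.0 * n / len(top) for cls, n in counts.items()}
        if band >= len(ranked):
            break  # the last band already covers every pair in the file
        band *= 2
    return shares


if __name__ == "__main__":
    # A tiny first band so that the eight sample rows yield several bands.
    for band, pct in class_shares(ROWS, first_band=2).items():
        print(band, pct)
```

Weighting each pair by its frequency instead would give the proportion of tokens rather than of types, which is the distinction drawn in Kilgarriff’s figures.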
Conclusions
- The proportion of nouns increases with increasing rarity.
- Among common lemmas, nouns increase as minor word classes decrease.
- Other major classes, collectively, account for a roughly constant proportion of vocabulary, regardless of frequency, but:
  - adjectives and verbs increase as adverbs and prepositions decrease.
- Among rare words, the trend for nouns shows a gradual decrease between 8K and 65K, which is reversed between 65K and 130K.
Note
Lemma (aka lexeme or lexical item): a single dictionary entry which may include:
- a number of inflected forms; e.g. was, were, be, am, is, are, being all belong to the same lemma, called BE.
- a number of different but fairly similar meanings; e.g. the lemma CLIMB includes not only physical climbing (He climbed the tree) but also metaphorical climbing (The prices are climbing again).
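To make the difference between counting wordforms and counting lemmas concrete, here is a small Python sketch that folds inflected forms into their lemma before counting. The lookup table `LEMMA_OF` and the function `lemma_counts` are hypothetical; a real study would take its lemmas from a lemmatised corpus, and treating any unlisted form as its own lemma is a simplifying assumption.

```python
from collections import defaultdict

# Hypothetical wordform-to-lemma table covering only the examples above.
LEMMA_OF = {
    "was": "BE", "were": "BE", "be": "BE", "am": "BE",
    "is": "BE", "are": "BE", "being": "BE",
    "climbed": "CLIMB", "climbing": "CLIMB",
}


def lemma_counts(tokens):
    """Count tokens per lemma; unlisted forms stand as their own lemma."""
    counts = defaultdict(int)
    for token in tokens:
        counts[LEMMA_OF.get(token.lower(), token.upper())] += 1
    return dict(counts)


if __name__ == "__main__":
    tokens = "He climbed the tree and the prices are climbing".split()
    print(lemma_counts(tokens))
```

The two senses of CLIMB end up under one entry here, which matches the definition above: a lemma groups meanings as well as forms.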