What if I told you that, using one simple formula, I could estimate the number of times any word appears in this article, or in a book, or even across the entire internet…?
Zipf’s Law allows you to do exactly that with math that even a second grader can understand.
The law states that “Given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table.”
What this essentially means is that the nth most common word will occur about x times, where
x = X / n
and X = the number of times the most common word is used.
This is extremely powerful, mainly because every language, whether English, German, French, or even the ones we have not yet been able to decipher, appears to follow the same peculiar pattern.
The most used word appears about twice as often as the second most used word, three times as often as the third, four times as often as the fourth, and so on.
In the case of English, “The” happens to be the most common word, “of” comes in the second place, “and” in the third.
Now “the” accounts for about 7% of the total words.
So, going by the above formula, “of” should account for about 7%/2 = 3.5% of the total words, and this actually holds true!
“And” should account for about 7%/3 ≈ 2.3% of the total words, and this also comes out roughly right.
Numerically, “the” appears 69,971 times, “of” appears 36,411 times, and “and” appears 28,852 times, out of a total of about a million words.
Go ahead, do the math and verify it for yourself!
(All the above readings are from the Brown Corpus of American English text)
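For anyone who would rather let a computer do the math, here is a minimal Python sketch that compares the Brown Corpus counts quoted above with what the 1/n rule predicts. The helper name `zipf_prediction` is made up for this illustration and is not part of any library.

```python
# Counts quoted above from the Brown Corpus, in rank order.
observed = [("the", 69971), ("of", 36411), ("and", 28852)]


def zipf_prediction(top_count, rank):
    """Predicted count for the word at `rank` under the simple 1/n rule."""
    return top_count / rank


top_count = observed[0][1]
for rank, (word, count) in enumerate(observed, start=1):
    predicted = zipf_prediction(top_count, rank)
    # Ratio of observed to predicted; 1.0 would be a perfect match.
    ratio = count / predicted
    print(f"{rank}. {word!r}: observed {count}, predicted {predicted:.0f}, ratio {ratio:.2f}")
```

The match is approximate rather than exact: “of” lands within a few percent of the prediction, while “and” shows up somewhat more often than the rule suggests.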
Now comes the really fun part.
The graph below shows the number of occurrences of the 7 most popular words in the Brown Corpus.
Notice the curve!
Also, for anybody who knows logarithms, it gets even more interesting.
The graph below is a log-log chart.
It shows practically the same data, just with both axes on a logarithmic scale, where the curve becomes close to a straight line.
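Why does a log-log chart help? If the count of the nth word is exactly X/n, then log(count) = log(X) - log(n), which is a straight line with slope -1. The sketch below fits a line to the three Brown Corpus counts quoted earlier to see how close the slope comes. It uses only those three points, so treat it as an illustration rather than a proper fit; `fit_slope` is a hypothetical helper written for this post.

```python
import math

# Rank and count for the three Brown Corpus words quoted earlier.
ranks = [1, 2, 3]
counts = [69971, 36411, 28852]

log_ranks = [math.log10(r) for r in ranks]
log_counts = [math.log10(c) for c in counts]


def fit_slope(xs, ys):
    """Slope of the least-squares line through the points (xs, ys)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


slope = fit_slope(log_ranks, log_counts)
print(f"log-log slope: {slope:.2f}")
```

With these three points the slope comes out at roughly -0.8, in the neighbourhood of the ideal -1 but not exactly on it.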
Moreover, this law is not limited to languages. It also shows up in:
- Page hits versus the popularity rank of a web page
- City size versus the population rank of the city
- Corporation size versus corporation ranking
- Net income versus income ranking
- TRP versus the popularity rank of a TV channel
It is as if this law is everywhere, as though it were built into human brains and psychology.
The law is among the most mysterious, and there is no single accepted reason why this happens. There are a bunch of theories all over the internet, but none of them does justice to the law, so they have not been included in this article.
This law does not end here. Just like any other principle or law in math, it is connected to many others, some of which are:
1. Benford’s Law
2. Pareto principle
All of those will be covered in the posts that follow. Zipf’s law does not end here…
PS: For the above content:
“The” appears 48 times out of 527 words (approx. 9%)
“Of” appears 24 times (exactly half as often, approx. 4.5%)
“And” appears 15 times (approx. 1/3 as often, about 3%)
While the sample size is very small, the law still holds true.
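If you want to run this kind of count on any text yourself, here is a rough Python sketch. The function names `rank_words` and `compare_with_zipf` are invented for this example, the tokenizer is deliberately crude (lowercase runs of letters and straight apostrophes only), and the sample sentence is a made-up stand-in rather than the text of this article.

```python
import re
from collections import Counter


def rank_words(text, top_n=3):
    """Return the `top_n` most common words in `text` with their counts."""
    # Lowercase, then treat runs of letters/apostrophes as words (crude tokenizer).
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top_n)


def compare_with_zipf(ranked):
    """Pair each observed count with the count predicted by top_count / rank."""
    top_count = ranked[0][1]
    return [
        (word, count, top_count / rank)
        for rank, (word, count) in enumerate(ranked, start=1)
    ]


sample = "the cat and the dog and the bird of the forest"  # stand-in text
for word, count, predicted in compare_with_zipf(rank_words(sample)):
    print(f"{word}: observed {count}, predicted {predicted:.1f}")
```

Paste in a longer text in place of `sample` and the top few words should start to follow the pattern more closely.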
Need I justify this law any further?