Zipf's law is a neat, general fact about the distribution of word frequencies. G. K. Zipf observed that the frequency of the kth most frequent word is approximately proportional to 1/k (Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology (Reading, MA: Addison-Wesley, 1949), cited in Knuth, The Art of Computer Programming, vol. 3: Sorting and Searching (Reading, MA: Addison-Wesley, 1973), 397). The top hundred words in this database adhere to the law quite well.
word          frequency  cumulative  frequency  alphabet
          (per million)   frequency       rank      rank
the            68351.63    68351.63          1    318525
of             33008.66   101360.29          2    212425
and            28651.11   130011.40          3     11331
to             27599.22   157610.62          4    322312
a              23160.48   180771.10          5         1
in             20670.81   201441.91          6    149032
is             10571.15   212013.06          7    156934
that           10549.02   222562.08          8    318470
was             9939.26   232501.34          9    356587
it              9882.90   242384.23         10    157771
for             9309.44   251693.67         11    114281
on              7636.66   259330.33         12    213645
with            7171.07   266501.39         13    361235
he              7167.84   273669.23         14    134413
be              7153.17   280822.40         15     27945
I               7036.88   287859.28         16    146205
by              5866.89   293726.17         17     44040
as              5793.35   299519.52         18     19178
at              5154.12   304673.64         19     20631
you             5043.27   309716.91         20    364651
are             5000.14   314717.05         21     17618
his             4963.47   319680.52         22    139433
had             4922.27   324602.79         23    131212
not             4899.77   329502.56         24    209444
this            4789.41   334291.97         25    319827
have            4685.82   338977.79         26    134106
from            4625.21   343603.01         27    117354
but             4616.26   348219.26         28     43732
which           4131.11   352350.37         29    358956
she             3991.77   356342.14         30    285912
they            3982.95   360325.09         31    319435
or              3975.58   364300.67         32    214838
an              3836.07   368136.73         33     10593
her             3692.13   371828.86         34    137067
were            3482.45   375311.31         35    358233
there           3025.87   378337.18         36    319027
we              2953.92   381291.10         37    357241
their           2929.78   384220.88         38    318680
been            2924.28   387145.16         39     28958
has             2873.74   390018.90         40    133676
will            2775.94   392794.84         41    360225
one             2764.69   395559.53         42    213720
all             2630.80   398190.33         43      7706
would           2617.11   400807.44         44    362548
can             2355.35   403162.80         45     46162
if              2247.43   405410.22         46    147000
who             2226.26   407636.48         47    359548
more            2195.16   409831.64         48    196881
when            2193.48   412025.12         49    358850
said            2149.41   414174.53         50    274265
do              2139.12   416313.65         51     88648
what            2053.98   418367.63         52    358673
about           1907.52   420275.15         53       652
its             1888.51   422163.66         54    157935
so              1844.57   424008.24         55    293328
up              1816.81   425825.05         56    347711
into            1803.28   427628.33         57    155127
no              1789.08   429417.41         58    205310
him             1787.13   431204.53         59    138999
some            1783.31   432987.85         60    294419
could           1753.24   434741.08         61     68666
them            1668.31   436409.39         62    318729
only            1646.85   438056.24         63    213824
time            1609.99   439666.22         64    321515
out             1547.86   441214.09         65    217118
my              1526.21   442740.30         66    200056
two             1514.46   444254.76         67    330909
other           1513.23   445767.98         68    216850
then            1475.27   447243.25         69    318748
may             1455.47   448698.73         70    184593
over            1443.56   450142.28         71    218315
also            1409.47   451551.75         72      8585
new             1404.41   452956.16         73    204064
like            1366.44   454322.60         74    173657
these           1328.58   455651.18         75    319382
me              1316.41   456967.59         76    185895
after           1302.93   458270.52         77      4998
first           1287.14   459557.66         78    111382
your            1285.88   460843.54         79    364711
did             1283.43   462126.98         80     84058
now             1281.59   463408.56         81    209859
any             1279.86   464688.42         82     15074
people          1215.83   465904.26         83    229078
than            1203.22   467107.47         84    318396
should          1172.27   468279.75         85    287398
very            1159.18   469438.93         86    352460
most            1112.14   470551.07         87    197488
see             1097.46   471648.52         88    281471
where           1096.15   472744.67         89    358869
just            1060.74   473805.41         90    160985
made            1050.69   474856.10         91    179480
between         1031.01   475887.12         92     31750
back            1022.58   476909.69         93     24006
way              984.89   477894.58         94    357170
many             981.20   478875.78         95    182122
years            981.16   479856.94         96    364108
being            973.72   480830.66         97     29466
our              970.28   481800.94         98    217097
how              969.81   482770.75         99    142630
work             956.09   483726.84        100    362239
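
One rough way to check the claim against the figures above: if frequency is proportional to 1/k, then the product k * f(k) should stay roughly constant, and a plot of log frequency against log rank should have a slope near -1. The Python sketch below is illustrative only and is not part of the original analysis. It uses a handful of (rank, frequency per million) pairs copied from the table, and the names SAMPLE, rank_frequency_products and loglog_slope are hypothetical helpers introduced here.

```python
import math

# (frequency rank, frequency per million) pairs copied from the table above.
# Only a small sample is used; this is an illustration, not a full fit.
SAMPLE = [
    (1, 68351.63),   # the
    (2, 33008.66),   # of
    (5, 23160.48),   # a
    (10, 9882.90),   # it
    (20, 5043.27),   # you
    (50, 2149.41),   # said
    (100, 956.09),   # work
]


def rank_frequency_products(pairs):
    """Return k * f(k) for each pair; Zipf's law predicts a roughly constant value."""
    return [k * f for k, f in pairs]


def loglog_slope(pairs):
    """Least-squares slope of log f against log k; Zipf's law predicts about -1."""
    xs = [math.log(k) for k, _ in pairs]
    ys = [math.log(f) for _, f in pairs]
    n = len(pairs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


if __name__ == "__main__":
    for (k, f), product in zip(SAMPLE, rank_frequency_products(SAMPLE)):
        print(f"rank {k:3d}  freq {f:9.2f}  k*f {product:10.2f}")
    print(f"log-log slope: {loglog_slope(SAMPLE):.3f}")
```

The products wander rather than staying exactly constant, which is what "quite well" should be taken to mean here: the law is an approximation, and a fit over all hundred rows would give a better picture than this seven-point sample.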