Smaller word clusters used in experiments:

filename | # word types | # tweets | # tokens | # clusters | min count | tweet source |
---|---|---|---|---|---|---|
6mpaths | 111,844 | ~6,000,000 | 1,575,589 | 800 | 10 | sample of 10k tweets/day, 9/10/08 to 7/18/12 |
3mpaths | 124,731 | 3,000,000 | 1,006,324 | 800 | 5 | subsample |
750kpaths | 50,780 | 750,000 | ? | 800 | 5 | subsample |
100kpaths | 21,345 | 100,000 | ? | 800 | 3 | subsample |
10kpaths | 6,944 | 10,000 | ? | 800 | 2 | subsample |
1kpaths | 4,142 | 1,000 | 15,159 | 800 | 1 | subsample |