What I did
- I used Yahoo Site Explorer and wget to download 1000 pages from
dailykos and redstate.com. I believe these are the top 1000 pages on
the site by number of inlinks. I filtered these to get blog entries, including
comments.
- I extracted the words from dkos & redstate blog entries, and the
corresponding comments, using a perl script (that uses an extendable
perl HTML parser, and site-specific "class" tags on the comment and
entry DOM nodes). The redstate comments are a little messier, since
I could not easily strip out signatures.
- I tokenized, stoplisted, counted a bunch of word frequencies, and
saved all the words that appear >= 5 times in dkos entries, redstate
entries, dkos comments, redstate comments, etc.
- I estimated a bunch of relative-frequency/MI sort of statistics.
What seemed most reasonable was to look for "non-general English"
words that are "more common in context X than context Y", which
I express with this score
log[ P(w|X) / (P(w|Y)*P(w|GE)) ]
Stats for "general English" were from the Brown corpus. I smoothed
with a Dirichlet, and probably more importantly, by replacing zero
counts for P(w) with counts of 4 (since I only stored counts>=5).
Then for each X,Y I looked at, I took the top 200 scoring words,
broke them into 10 equal-frequency bins, and built a "tagcloud"
visualization of them. The top 200 ignored a handful of stuff that I
decided was noise: signature line tokens, like ----; words like
"pstin", which seem to be poorly-tokenized dkos words; date, time and
number words; and words like kos, dailykos, entry, diary, fold, and
email.
===================================================================
X                  Y                       file name
===================================================================
dkos entries       redstate blog entries   blue-red-entry.html
dkos comments      redstate blog comments  blue-red-comment.html
dkos anything      redstate blog anything  blue-red-all.html
redstate entries   dkos entries            red-blue-entry.html
redstate comments  dkos comments           red-blue-comment.html
redstate anything  dkos anything           red-blue-all.html
redstate comment   redstate entry          redComment-redEntry.html
dkos comment       dkos entry              blueComment-blueEntry.html
===================================================================
For a few other contexts I scored as
log[ P(w|X)*P(w|Y) / P(w|GE) ]
i.e., "non-general English" words that are "common in both context X and
context Y"
===================================================================
X,Y                       file name
===================================================================
dkos,redstate comments    blue+red-comment.html
dkos,redstate entries     blue+red-entry.html
dkos,redstate anything    blue+red-all.html
===================================================================
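
A rough Python sketch of the two scores above, for anyone who wants to
reproduce them. This is not the original code. The Dirichlet pseudo-count
(ALPHA), the use of one shared vocabulary size, and applying the count-of-4
floor to every context including general English are all assumptions; the
notes only say that a Dirichlet was used and that zero counts became 4.
The toy counts and the word "w1" are made up.

```python
import math

FLOOR_COUNT = 4  # stand-in for words with no stored count (only counts >= 5 were kept)
ALPHA = 1.0      # ASSUMED Dirichlet pseudo-count; the notes do not give the value used


def prob(word, counts, total, vocab_size, alpha=ALPHA):
    """Smoothed estimate of P(word | context) from a count table and a token total."""
    c = counts.get(word, FLOOR_COUNT)
    return (c + alpha) / (total + alpha * vocab_size)


def contrast_score(word, x, y, ge, vocab_size):
    """log[ P(w|X) / (P(w|Y)*P(w|GE)) ]; each context is a (counts, total) pair."""
    return (math.log(prob(word, *x, vocab_size))
            - math.log(prob(word, *y, vocab_size))
            - math.log(prob(word, *ge, vocab_size)))


def shared_score(word, x, y, ge, vocab_size):
    """log[ P(w|X)*P(w|Y) / P(w|GE) ]; high for words common in both X and Y."""
    return (math.log(prob(word, *x, vocab_size))
            + math.log(prob(word, *y, vocab_size))
            - math.log(prob(word, *ge, vocab_size)))


def top_words(score_fn, words, x, y, ge, vocab_size, n=200):
    """Return the n highest-scoring words, best first."""
    return sorted(words,
                  key=lambda w: score_fn(w, x, y, ge, vocab_size),
                  reverse=True)[:n]


# Toy, made-up counts purely for illustration.
x = ({"w1": 50, "the": 1000}, 10000)   # context X: (counts, total tokens)
y = ({"the": 1000}, 10000)             # context Y: "w1" unseen, so it gets the floor
ge = ({"the": 5000}, 100000)           # general English
ranked = top_words(contrast_score, ["the", "w1"], x, y, ge, vocab_size=2)
print(ranked)
```

With these toy numbers "w1" ranks above "the", since it is frequent in X,
unseen in Y, and unseen in general English.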
- I also wrote code to pick up subject-matter 'tags' from dailykos
(like the delicious tagging scheme), which turned out to be pretty
noisy (e.g., "republican" and "republican party" are both tags, as are
"iraq" and "iraq war"). I set up some additional contexts X = "dkos
comments for entries tagged with something that contains the word T"
and compared them to Y = "all dkos comments"
===================================================================
T           file name
===================================================================
elections   blueElections-blue-comment.html
iraq        blueIraq-blue-comment.html
media       blueMedia-blue-comment.html
===================================================================
- Sizes of all of this, in words:
==============================
brown                   480098
dkos-all               3351061
dkos-comment           3311702
dkos-entry               39359
redstate-all           1152883
redstate-comment        940241
redstate-entry          212642
dkos-iraq-comment       341238
dkos-elections-comment  256129
dkos-media-comment      160413
==============================
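
The sizes above are per-context word totals. The tokenize / stoplist /
count step that produces them, along with the stored counts >= 5, might
look roughly like the sketch below. The original was a perl script, so this
Python version is only an illustration: the tokenizer regex and the tiny
stoplist are placeholders, and counting the total after stoplisting
(rather than before) is an assumption.

```python
import re
from collections import Counter

STOPLIST = {"the", "a", "of", "and", "to"}  # placeholder; the real stoplist is not given
MIN_COUNT = 5                               # only words seen >= 5 times are stored


def tokenize(text):
    """Lowercase and split into runs of letters/apostrophes (an assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def count_words(texts, stoplist=STOPLIST, min_count=MIN_COUNT):
    """Return (stored counts, total tokens) for one context, e.g. all dkos comments."""
    counts = Counter()
    total = 0
    for text in texts:
        tokens = [t for t in tokenize(text) if t not in stoplist]
        total += len(tokens)       # ASSUMPTION: totals are taken after stoplisting
        counts.update(tokens)
    stored = {w: c for w, c in counts.items() if c >= min_count}
    return stored, total


# Made-up input: three copies of a short "comment".
stored, total = count_words(["The Iraq war, Iraq"] * 3)
print(stored, total)
```

Words that fall below the threshold are simply absent from the stored
table, which is why a later scoring step needs some stand-in count for them.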
Observations
- Redstate has way less comment text in total than dkos, but way more
entry text. There are also more entries in redstate than dkos (788 vs
351), so dkos may simply have less entry text because a bunch of its
high-inlink pages are not pages that contain comments.
- Dems are way less polite in comments than in blog entries.
Republicans, not so much.
- There are apparently pretty big differences in vocabulary in
comments pertaining to different entries (e.g., media vs Iraq). There
doesn't seem to be a major impact of the actual vocabulary used in the
entries though (e.g., I don't see the term "iraq" in the Iraq-related
comments).
- A lot of the "vocabulary" from the comments may be user names.
- There seems to be a lot of argumentation in the comment sections
(agree, aren't, doesn't, don't, etc)