In stage 1, the cue lexicon is built by running SAGE on ~120 labeled books and magazines.
Obtain ark-sage-0.1.jar from here.
ark-sage is a Java library I have written to make it easier to perform SAGE inference on text.
We will be using the SupervisedSAGE module of the library to perform inference on the ideological book corpus.
Preparing the data. SupervisedSAGE takes as input data of the format

label1 label2 ... labelN<TAB>token1:count1 token2:count2 ... tokenN:countN
label1 label2 ... labelN<TAB>token1:count1 token2:count2 ... tokenN:countN
...

In our case, tokens are bigrams and trigrams (words separated by _), which have undergone typical NLP text processing (stemming, punctuation and stopword removal, vocabulary reduction, etc.).
Due to the nature of our data, we are not able to provide the actual text of the books/magazines. Instead, you can find the labeled input to SupervisedSAGE (with a reduced vocabulary of 28,731 bigrams/trigrams) here.
Furthermore, we have normalized the term counts at the sentence level, i.e., each token in a sentence of length N has count 1/N.
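If you are preparing your own corpus, a minimal sketch of building this input format (assuming one labeled document per line, with the sentence-level count normalization described above; the function name and data layout are my own, not part of ark-sage) might look like:

from collections import defaultdict

def write_sage_input(docs, path):
    # docs: iterable of (labels, sentences) pairs, where labels is a list of
    # label strings and sentences is a list of token lists (bigrams/trigrams
    # joined by _, already preprocessed).
    with open(path, "w") as out:
        for labels, sentences in docs:
            counts = defaultdict(float)
            for sent in sentences:
                if not sent:
                    continue
                for tok in sent:
                    counts[tok] += 1.0 / len(sent)  # each token in a sentence of length N gets count 1/N
            feats = " ".join("%s:%g" % (tok, c) for tok, c in sorted(counts.items()))
            out.write("%s\t%s\n" % (" ".join(labels), feats))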
Extracting the cue lexicon can be accomplished by running

supervised-SAGE.sh --input-counts no_uni+stem+no_stopw+topics.word_counts --output no_uni+stem+no_stopw+topics-30.sage --iterations 20 --config-file l1_weights

where the file l1_weights contains the regularization weight (λ in the paper) for each label (effect) in the form --l1-weight <label>=<value>.
The weights files that we used for the paper (λ=30): without topics, and with topics.
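If you want to experiment with your own weights, the file is just a list of flags in the form described above; a small snippet like the following would generate one (the label names here are placeholder assumptions, not the actual labels in our data):

# Hypothetical label names; substitute the labels that appear in your input file.
labels = ["LEFT", "CENTER", "RIGHT"]
with open("l1_weights", "w") as f:
    for label in labels:
        f.write("--l1-weight %s=30\n" % label)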
The output cue lexicon used in our paper is available as a SAGE file and as a list of terms. You can also explore the terms here.
In stage 2, we use the cue lexicon found in stage 1 to perform inference on candidate speeches to obtain their ideological proportions.
Preparing the candidate speeches. The original collection of candidate speeches is first pre-processed (tokenized, stemmed, normalized, etc.). [stemmed speeches]
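The exact preprocessing scripts are not included, but a rough sketch of that kind of pipeline (here using NLTK, which may differ from what we actually used) is:

import nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def preprocess_speech(raw_text):
    # Produce one sentence per line with space-separated, lowercased, stemmed
    # tokens -- the format expected by create-lag-data.py below.
    # Requires the punkt tokenizer models: nltk.download("punkt")
    lines = []
    for sent in nltk.sent_tokenize(raw_text):
        toks = [stemmer.stem(tok.lower()) for tok in nltk.word_tokenize(sent)]
        lines.append(" ".join(toks))
    return "\n".join(lines)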
Using the stemmed speeches and cue lexicon as input, we build the cue-lag representation using these (hacky) Python scripts: create-model-terms.py and create-lag-data.py.
python create-model-terms.py ${SAGE_FILE} > ${MODEL_FOLDER}/terms.sage
python create-lag-data.py ${TERMS_FILE} ${SPEECH_FOLDER} ${MODEL_FOLDER}

where ${SAGE_FILE} and ${TERMS_FILE} are the outputs of stage 1 as a SAGE file and as a list of terms respectively, ${SPEECH_FOLDER} is the folder containing the stemmed speeches, and ${MODEL_FOLDER} is the folder to write the cue-lag representation to.
create-model-terms.py creates a tabular file where each row is a cue term and each column corresponds to the term's weight under one ideology. create-lag-data.py converts tokenized .txt files (tokens separated by spaces, one sentence on each line) into ".lag" files, which look like
__START_OF_SPEECH__ 0 william_penn 17 found_father 4 bear_wit 13 unit_state 71 state_capitol 12 ben_franklin 29 unit_state 30 presid_obama 27 social_medicin 3 entitl_spend 36 social_engin 6 taxpay_fund 54 deepli_troubl 2 american_citizen 16 polit_power 51 repeal_obamacar 47 republican_senat 14 turn_point 12 futur_gener 26 unit_state 67 state_senat 0 barack_obama 17 obama_polici 0 live_free 31 croni_capit 71 govern_regul 7 tax_code 3 american_energi 11 energi_product 0 american_famili 3 tradit_marriag 13 religi_liberti 9 american_peopl 10 presid_obama 143 liber_polici 3 repeal_obamacar 8 american_peopl 21 repeal_obamacar 98 american_peopl 59 full_measur 25 straw_poll 49 south_carolina 86 god_plan 72 presidenti_campaign 54 god_bless 4 unit_state 6 __END_OF_SPEECH__ 3 __SPEECH_LENGTH__ 1457
You can get the candidate speeches in cue-lag format here.
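If you want to work with these files directly, a small sketch for reading them (assuming, as in the example above, whitespace-separated alternating term/number tokens, where the number following each cue is its lag; the file path below is hypothetical):

def read_lag(path):
    # Returns a list of (term, lag) pairs, including the __START_OF_SPEECH__,
    # __END_OF_SPEECH__ and __SPEECH_LENGTH__ bookkeeping entries.
    toks = open(path).read().split()
    return [(term, int(lag)) for term, lag in zip(toks[::2], toks[1::2])]

pairs = read_lag("bachmann/speech_01.lag")  # hypothetical path
cue_terms = [(t, l) for t, l in pairs if not t.startswith("__")]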
Running CLIP. The Java sampler code for CLIP, along with some configuration files, is available here. To run it on a candidate (do edit run-model.sh to point to the correct directories; they are currently hardcoded to my working directory),
./run-model.sh --config-file candidates.settings --output-dir sampler_output --data-dir model-data/bachmann-no_uni+stem+no_stopw+topics-30 --terms-file model-data/bachmann-no_uni+stem+no_stopw+topics-30/terms.sage
In the sampler code directory, one can find several .settings files, which specify parameters for the Gibbs sampler such as the number of iterations, initial hyperparameters, etc. There are also separate .settings files for the other baselines used in the paper.
After the sampler finishes, the individual samples can be found in sampler_output/samples as individual .gz files, each containing tab-separated lines for every speech with the (ideology, restart) values sampled for every term.
It should be straightforward to write scripts in your favorite language to process these samples and perform the needed analysis. I am not posting the analysis scripts (including those used to create this website) here, as they are too hardcoded and hacky to be useful outside of my working directory.
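For example, something along these lines would tally the sampled ideology assignments per speech from one sample file (the exact field layout may differ from what I assume here, so treat this only as a starting point):

import gzip
from collections import Counter

def ideology_counts(sample_path):
    # One Counter per speech (one speech per line); assumes each tab-separated
    # field holds the sampled (ideology, restart) pair for one cue term,
    # e.g. "RIGHT,1" -- adjust the parsing to the actual layout of the files.
    per_speech = []
    with gzip.open(sample_path, "rt") as f:
        for line in f:
            fields = [x for x in line.rstrip("\n").split("\t") if x]
            per_speech.append(Counter(field.split(",")[0] for field in fields))
    return per_speech

Averaging such counts over the sample files and normalizing per speech is one way to estimate the ideological proportions.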
For the posterior samples, or any other data/requests that you may need/have, feel free to contact me here.