Analysis of Main vs. Print Headlines: Phase 1

Tags: text as data, NYT text analysis project, sentiment analysis, co-occurrence matrix

Text as Data Project: Headline Comparison Research Using the API Query "Afghanistan Withdrawal"

Kristina Becvar (UMass DACSS Program) · Project site: https://kbec19.github.io/NYT-Analysis/ · Academic blog: https://kristinabecvar.com
04/19/2022

Comparing Main & Print Headlines

Research Background

Prior Project Details

During the Fall 2021 semester, my research group hand-coded PDF copies of articles returned by a simple search on the websites of the New York Times and Wall Street Journal, covering February 29, 2020 through September 30, 2021, using the term "Afghanistan withdrawal". The basic process was as follows: for each article in the results that fell into the world or U.S. news sections, step 1 was running the search through the basic New York Times interface and printing the article to PDF, and step 2 was saving the page in a Zotero bibliography for reference.

One thing I noticed when loading the PDF articles into NVivo 12 for coding was that it was difficult to match the New York Times PDF titles generated by the site to the article citation information in Zotero, because for many articles they did not match. I realized that in the process of saving the articles in Zotero, the headline/title was saved from the web version of the article; however, once the article had been preserved using the site's "Print to PDF" function, the article title used as a default file name differed from the web version.

Current Project Initial Plan

This semester, I initially conceived of this project as an expansion of last semester's research, applying machine analysis to articles drawn from a larger time period than the one used previously. For my initial text collection, I gathered articles using the New York Times API for the search query "Afghanistan", hoping to analyze the full text of a larger range of articles. However, I found that the New York Times article search API does not return the entire article. Rather, for each article I was able to pull the abstract/summary, lead paragraph, and snippet, as well as the keywords, authors, sections, and URL (along with other metadata). In an unexpected turn of events, I found that the API also returns the article title text for both the print and online versions of each article.

The API's lack of full article text was not optimal for my purpose, which was ultimately to examine sentiment and the co-occurrence of various sources. Since sources are not necessarily named in the lead paragraph or abstract of an article, I knew I needed to move to a different research path.

Current Project New Path

Remembering the differences in headlines from our manual coding research, and noting that the article search API provides both headlines, I turned my attention to analyzing the differences between the main and print headlines for articles from the same research period as our first examination. Ideally, I can then expand this to the entirety of the data available from the New York Times API, from September 11, 2001 onward. But I want to start with the same time frame as the last project, because as part of that project we hand-coded a random, stratified sample of the articles obtained through the same search term for sentiment, after reaching an inter-coder reliability of 83.92%. It seems worthwhile to look at how that manual sentiment analysis corresponds to the statistical methods we have been studying throughout this course.

This way, I can potentially use a sample of the full articles collected in our previous research and take the additional step of analyzing the sentiment of the full articles and how it may or may not relate to the differing headlines.

Making Choices on Inclusion of Observations

In my initial look at the headline data, it was clear that not all of the articles had different headlines: for some entries the two are identical, and some have "N/A" in the print version only, indicating they were online-only stories. Although I initially felt inclined to leave the "N/A" observations in the analysis, I removed them, as they are not relevant to my new research question comparing framing for different audiences.

I also removed whole sections where the API returned an observation because the term "Afghanistan withdrawal" apparently appeared somewhere in the article/entry, but the type of entry was clearly not represented in the headline. For example, "Corrections" entries have headlines consisting only of the term "Corrections" and the corresponding date. I made similar choices for the "Arts", "Books", and "Podcasts" sections, where entries are primarily the names of the works being reviewed; these may reference the Afghanistan withdrawal somewhere in the text, but they are not specifically relevant to the withdrawal period being analyzed.

With few exceptions, this left the entirety of the "U.S." and "World" news sections, even where the content related to Afghanistan is not readily observable from the headline alone.
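For reference, a minimal dplyr sketch of this filtering logic (hypothetical; it assumes the table and column names created in the collection code below, and the excluded section list is illustrative):

Show code
#filtered_table <- afghanistan_withdrawal_table %>%
  #filter(!is.na(headline.print), headline.print != "N/A",  # drop online-only stories
         #headline.main != headline.print,                  # keep only differing headlines
         #!section.name %in% c("Corrections", "Arts", "Books", "Podcasts"))  # drop non-news sections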

Gathering Data

API Process (Not Run)

To pull the data within the NYT API rate limits, I had to break the query into smaller date ranges that would not time out. I was then able to pull the articles in chunks and bind them together.

Show code
# For articles from February 29, 2020 through April 30, 2021
# (the two-word search term is URL-encoded into a single q= parameter)
#url1 <- ('https://api.nytimes.com/svc/search/v2/articlesearch.json?begin_date=20200229&end_date=20210430&q=afghanistan%20withdrawal&api-key=XXXXX')

#query1 <- fromJSON(url1)

# the API returns 10 results per page, so compute the number of additional pages to request
#max.pages1 <- ceiling(query1$response$meta$hits[1] / 10) - 1

#pages1 <- list()
#for(i in 0:max.pages1){
  #search1 <- fromJSON(paste0(url1, "&page=", i), flatten = TRUE) %>% data.frame()
  #message("Retrieving page ", i)
  #pages1[[i+1]] <- search1
  #Sys.sleep(10)  # pause to stay under the API rate limit
  #}

#afghanistan_withdrawal_articles1 <- rbind_pages(pages1)

#save(afghanistan_withdrawal_articles1,file="afghanistan_withdrawal_articles1.Rdata")

#For May 1 through September 30, 2021

#url2 <- ('https://api.nytimes.com/svc/search/v2/articlesearch.json?begin_date=20210501&end_date=20210930&q=afghanistan%20withdrawal&api-key=XXXXX')

#query2 <- fromJSON(url2)

#max.pages2 <- ceiling(query2$response$meta$hits[1] / 10) - 1

#pages2 <- list()
#for(i in 0:max.pages2){
  #search2 <- fromJSON(paste0(url2, "&page=", i), flatten = TRUE) %>% data.frame()
  #message("Retrieving page ", i)
  #pages2[[i+1]] <- search2
  #Sys.sleep(10)
  #}

#afghanistan_withdrawal_articles2 <- rbind_pages(pages2)

#save(afghanistan_withdrawal_articles2,file="afghanistan_withdrawal_articles2.Rdata")


# Combine both pulls into a single data frame and save for offline use

#afghanistan_withdrawal_articles <- rbind_pages(c(pages1, pages2))
#saveRDS(afghanistan_withdrawal_articles,file="afghanistan_withdrawal_articles_all.Rdata")

After compiling the data, I re-formatted the date column and saved the formatted tibble for offline access.

Show code
#afghanistan_withdrawal_table<- as_tibble(cbind(
  #date=afghanistan_withdrawal_articles$response.docs.pub_date,
  #abstract=afghanistan_withdrawal_articles$response.docs.abstract,
  #lead.paragraph=afghanistan_withdrawal_articles$response.docs.lead_paragraph,
  #snippet=afghanistan_withdrawal_articles$response.docs.snippet,
  #section.name=afghanistan_withdrawal_articles$response.docs.section_name,
  #subsection.name=afghanistan_withdrawal_articles$response.docs.subsection_name,
  #news.desk=afghanistan_withdrawal_articles$response.docs.news_desk,
  #byline=afghanistan_withdrawal_articles$response.docs.byline.original,
  #headline.main=afghanistan_withdrawal_articles$response.docs.headline.main,
  #headline.print=afghanistan_withdrawal_articles$response.docs.headline.print_headline,
  #headline.kicker=afghanistan_withdrawal_articles$response.docs.headline.kicker,
  #material=afghanistan_withdrawal_articles$response.docs.type_of_material,
  #url=afghanistan_withdrawal_articles$response.docs.web_url
  #))

#strip the time portion from the ISO date-time string
#afghanistan_withdrawal_table$date <- substr(afghanistan_withdrawal_table$date, 1, nchar(afghanistan_withdrawal_table$date)-14)

#afghanistan_withdrawal_table$date <- as.Date(afghanistan_withdrawal_table$date, "%Y-%m-%d")

#save(afghanistan_withdrawal_table,file="afghanistan_withdrawal_table.Rdata")

#write.table(afghanistan_withdrawal_table, file = "~/GitHub/DACSS.697D/Text as Data Spring22/afghanistan_withdrawal_table.csv", sep=",", row.names=FALSE)

Load Data

Now, on to the active review of the data, starting by loading the files saved during my collection phase.

Show code
#load data and coerce to data frames
main_headlines <- read.csv("afghanistan_withdrawal_main.csv")
main_headlines <- as.data.frame(main_headlines)
print_headlines <- read.csv("afghanistan_withdrawal_print.csv")
print_headlines <- as.data.frame(print_headlines)
#inspect data
head(main_headlines)
  article_id      date
1          1 2/29/2020
2          2 2/29/2020
3          3  3/1/2020
4          4  3/2/2020
5          5  3/2/2020
6          6  3/3/2020
                                                                 headline_main
1                              4 Takeaways From the U.S. Deal With the Taliban
2    Taliban and U.S. Strike Deal to Withdraw American Troops From Afghanistan
3           Afghanistan War Enters New Stage as U.S. Military Prepares to Exit
4                 At Center of Taliban Deal, a U.S. Envoy Who Made It Personal
5 U.S. Announces Troop Withdrawal in Afghanistan as Respite From Violence Ends
6                                           Trump Speaks With a Taliban Leader
Show code
head(print_headlines)
  article_id      date
1          1 2/29/2020
2          2 2/29/2020
3          3  3/1/2020
4          4  3/2/2020
5          5  3/2/2020
6          6  3/3/2020
                                                        headline_print
1                                 Table Is Set For a Pullout And Talks
2                              U.S. and Taliban Sign Withdrawal Accord
3                                      A Mission Shift for Afghanistan
4 At the Center of the Taliban Deal, a U.S. Envoy Who Made It Personal
5                           U.S. Troop Reduction Begins in Afghanistan
6               Pursuing Exit, Trump Talks  To a Leader Of the Taliban

Notably, the number of observations has been significantly reduced from the ~700 returned by the API pull to 346, because I am examining only headlines that differ and have eliminated instances where there is no 'alternative' print headline.
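As a quick sanity check (not part of the original pipeline), both files should contain the same 346 articles, matched by article_id:

Show code
#both counts should return 346, with identical article_id columns
nrow(main_headlines)
nrow(print_headlines)
identical(main_headlines$article_id, print_headlines$article_id)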

Create Corpus

Show code
main_corpus <- corpus(main_headlines, docid_field = "article_id", text_field = "headline_main")
print_corpus <- corpus(print_headlines, docid_field = "article_id", text_field = "headline_print")

Assign Type to Docvars

Show code
#assign a "type" docvar so the two corpora can be distinguished later
docvars(main_corpus, field = "type") <- "Main Headline"
docvars(print_corpus, field = "type") <- "Print Headline"

Tokenization & Pre-Processing

Next, I need to take the corpus and create tokens, which are lists of character vectors where each element of the list corresponds to an input document. This is where the pre-processing takes place.

After many process posts, I finally figured out how to remove the "�" symbol that has plagued me since I started working with this API: using "remove_symbols = TRUE" in addition to removing punctuation when tokenizing. I also want to remove stopwords. Finally, since removing punctuation and symbols left many orphaned "s" characters in the first run of this process, I am going to remove those specifically as well.

I have decided NOT to engage in the stemming of words on this initial analysis.

Main Headlines

Show code
main_tokens <- tokens(main_corpus,
                      remove_punct = TRUE,
                      remove_numbers = TRUE,
                      remove_symbols = TRUE) %>%
  tokens_remove(stopwords("english")) %>%
  tokens_remove(c("s"))

main_dfm <- dfm(main_tokens)

length(main_tokens)
[1] 346
Show code
print(main_tokens)
Tokens consisting of 346 documents and 2 docvars.
1 :
[1] "Takeaways" "U.S"       "Deal"      "Taliban"  

2 :
[1] "Taliban"     "U.S"         "Strike"      "Deal"       
[5] "Withdraw"    "American"    "Troops"      "Afghanistan"

3 :
[1] "Afghanistan" "War"         "Enters"      "New"        
[5] "Stage"       "U.S"         "Military"    "Prepares"   
[9] "Exit"       

4 :
[1] "Center"   "Taliban"  "Deal"     "U.S"      "Envoy"    "Made"    
[7] "Personal"

5 :
[1] "U.S"         "Announces"   "Troop"       "Withdrawal" 
[5] "Afghanistan" "Respite"     "Violence"    "Ends"       

6 :
[1] "Trump"   "Speaks"  "Taliban" "Leader" 

[ reached max_ndoc ... 340 more documents ]
Print Headlines

Show code
print_tokens <- tokens(print_corpus,
                       remove_punct = TRUE,
                       remove_numbers = TRUE,
                       remove_symbols = TRUE) %>%
  tokens_remove(stopwords("english")) %>%
  tokens_remove(c("s"))

print_dfm <- dfm(print_tokens)

length(print_tokens)
[1] 346
Show code
print(print_tokens)
Tokens consisting of 346 documents and 2 docvars.
1 :
[1] "Table"   "Set"     "Pullout" "Talks"  

2 :
[1] "U.S"        "Taliban"    "Sign"       "Withdrawal" "Accord"    

3 :
[1] "Mission"     "Shift"       "Afghanistan"

4 :
[1] "Center"   "Taliban"  "Deal"     "U.S"      "Envoy"    "Made"    
[7] "Personal"

5 :
[1] "U.S"         "Troop"       "Reduction"   "Begins"     
[5] "Afghanistan"

6 :
[1] "Pursuing" "Exit"     "Trump"    "Talks"    "Leader"   "Taliban" 

[ reached max_ndoc ... 340 more documents ]

Document Feature Matrix

Creating the DFM

In order to perform statistical analysis, I first have to extract a matrix that associates values for certain features with each document, using quanteda's "dfm()" function. This shows the occurrence of words within each 'doc', or headline observation.

Show code
#print dfm
print_dfm <- dfm(print_tokens)
#main dfm
main_dfm <- dfm(main_tokens)
#look at each dfm
print_dfm
Document-feature matrix of: 346 documents, 1,155 features (99.43% sparse) and 2 docvars.
    features
docs table set pullout talks u.s taliban sign withdrawal accord
   1     1   1       1     1   0       0    0          0      0
   2     0   0       0     0   1       1    1          1      1
   3     0   0       0     0   0       0    0          0      0
   4     0   0       0     0   1       1    0          0      0
   5     0   0       0     0   1       0    0          0      0
   6     0   0       0     1   0       1    0          0      0
    features
docs mission
   1       0
   2       0
   3       1
   4       0
   5       0
   6       0
[ reached max_ndoc ... 340 more documents, reached max_nfeat ... 1,145 more features ]
Show code
main_dfm
Document-feature matrix of: 346 documents, 1,208 features (99.40% sparse) and 2 docvars.
    features
docs takeaways u.s deal taliban strike withdraw american troops
   1         1   1    1       1      0        0        0      0
   2         0   1    1       1      1        1        1      1
   3         0   1    0       0      0        0        0      0
   4         0   1    1       1      0        0        0      0
   5         0   1    0       0      0        0        0      0
   6         0   0    0       1      0        0        0      0
    features
docs afghanistan war
   1           0   0
   2           1   0
   3           1   1
   4           0   0
   5           1   0
   6           0   0
[ reached max_ndoc ... 340 more documents, reached max_nfeat ... 1,198 more features ]

I can come back to this function and use it to further pre-process my data, for example, to concatenate multi-word expressions if I see that words which belong together have been tokenized separately. For example:

tokens("New York City is located in the United States.") %>% tokens_compound(pattern = phrase(c("New York City", "United States")))

I can also come back and remove any word that may not have been a stopword but that I later find to be non-significant to my analysis. This is also where I can employ stemming, if I feel it is appropriate. Essentially, any pre-processing done during tokenization can also be done at this stage, if needed. For example (with placeholder terms to remove):

main_dfm <- dfm(main_tokens) %>% dfm_remove(c("placeholder", "terms")) %>% dfm_wordstem()

Creating Word Frequency Rankings

I can take a preliminary look at a data frame of word frequencies for each set of headlines to see the most frequent words after pre-processing.

Show code
#create a word frequency variable and the rankings
#main headlines
main_counts <- as.data.frame(sort(colSums(main_dfm),dec=T))
colnames(main_counts) <- c("Frequency")
main_counts$Rank <- c(1:ncol(main_dfm))
head(main_counts)
            Frequency Rank
u.s               100    1
afghanistan        87    2
afghan             85    3
taliban            65    4
biden              53    5
war                30    6
Show code
#print headlines
print_counts <- as.data.frame(sort(colSums(print_dfm),dec=T))
colnames(print_counts) <- c("Frequency")
print_counts$Rank <- c(1:ncol(print_dfm))
head(print_counts)
            Frequency Rank
u.s               100    1
taliban            66    2
afghan             64    3
afghanistan        56    4
biden              41    5
exit               27    6

Feature Co-Occurrence Matrix

Now I can take a look at the network of feature co-occurrences, via a feature co-occurrence matrix (FCM), for the main headlines:

Show code
#create fcm from dfm
main_fcm <- fcm(main_dfm)
#check the dimensions (i.e., the number of rows and columns) of the matrix we created
dim(main_fcm)
[1] 1208 1208
Show code
#pull the top features
myFeatures <- names(topfeatures(main_fcm, 20))
#retain only those top features as part of our matrix
smaller_main_fcm <- fcm_select(main_fcm, pattern = myFeatures, selection = "keep")
#check dimensions
dim(smaller_main_fcm)
[1] 20 20
Show code
#compute size weight for vertices in network
size <- log(colSums(smaller_main_fcm))
#create plot
textplot_network(smaller_main_fcm, vertex_size = size / max(size) * 3)

and for the print headlines:

Show code
# create fcm from dfm
print_fcm <- fcm(print_dfm)
# check the dimensions (i.e., the number of rows and columns)
# of the matrix we created
dim(print_fcm)
[1] 1155 1155
Show code
# pull the top features
myFeatures <- names(topfeatures(print_fcm, 20))
# retain only those top features as part of our matrix
smaller_print_fcm <- fcm_select(print_fcm, pattern = myFeatures, selection = "keep")
# check dimensions
dim(smaller_print_fcm)
[1] 20 20
Show code
# compute size weight for vertices in network
size <- log(colSums(smaller_print_fcm))
# create plot
textplot_network(smaller_print_fcm, vertex_size = size / max(size) * 3)

This brings me to where I had previously stopped in my comparison and analysis; now that I'm able to produce a cleaner result, I'll move on to further analysis using quanteda dictionaries.

Dictionary Analysis

For my initial sentiment analysis, I am going to use the three dictionaries we used in the course tutorial: the NRC, LSD (2015), and General Inquirer dictionaries.

NRC

I am first using the "liwcalike()" function from the quanteda.dictionaries package to apply the NRC dictionary. I can take a look at the head or tail of the results to see a snapshot of the sentiments that have been applied to the corpus for each text group. Just at first glance, I can see some differences in the scoring.

Show code
#use liwcalike() to estimate sentiment using NRC dictionary
#for main headlines
main_sentiment_nrc <- liwcalike(as.character(main_corpus), data_dictionary_NRC)
head(main_sentiment_nrc)[7:12]
  anger anticipation disgust  fear   joy negative
1  0.00        10.00       0  0.00 10.00     0.00
2  8.33         8.33       0  0.00  8.33    16.67
3  0.00         0.00       0 16.67  0.00     8.33
4  0.00         7.14       0  0.00  7.14     0.00
5  8.33         0.00       0  8.33  8.33     8.33
6  0.00         0.00       0  0.00  0.00     0.00
Show code
#and print headlines
print_sentiment_nrc <- liwcalike(as.character(print_corpus), data_dictionary_NRC)
head(print_sentiment_nrc)[11:16]
   joy negative positive sadness surprise trust
1 0.00        0     0.00       0     0.00  0.00
2 0.00        0    14.29       0     0.00 14.29
3 0.00        0     0.00       0     0.00  0.00
4 6.25        0    12.50       0     6.25 18.75
5 0.00        0     0.00       0     0.00  0.00
6 0.00        0     9.09       0     9.09  9.09

NRC as DFM

I can also put the results into a document feature matrix for each text group:

Show code
# convert tokens from each headline data set to DFM using the dictionary "NRC"
main_nrc <- dfm(main_tokens) %>%
  dfm_lookup(data_dictionary_NRC)
print_nrc <- dfm(print_tokens) %>%
  dfm_lookup(data_dictionary_NRC)

dim(main_nrc)
[1] 346  10
Show code
main_nrc
Document-feature matrix of: 346 documents, 10 features (69.36% sparse) and 2 docvars.
    features
docs anger anticipation disgust fear joy negative positive sadness
   1     0            1       0    0   1        0        1       0
   2     1            1       0    0   1        2        1       1
   3     0            0       0    2   0        1        0       0
   4     0            1       0    0   1        0        2       0
   5     1            0       0    1   1        1        1       1
   6     0            0       0    0   0        0        1       0
    features
docs surprise trust
   1        1     1
   2        1     1
   3        0     0
   4        1     3
   5        0     1
   6        1     1
[ reached max_ndoc ... 340 more documents ]
Show code
dim(print_nrc)
[1] 346  10
Show code
print_nrc
Document-feature matrix of: 346 documents, 10 features (71.47% sparse) and 2 docvars.
    features
docs anger anticipation disgust fear joy negative positive sadness
   1     0            0       0    0   0        0        0       0
   2     0            0       0    0   0        0        1       0
   3     0            0       0    0   0        0        0       0
   4     0            1       0    0   1        0        2       0
   5     0            0       0    0   0        0        0       0
   6     0            0       0    0   0        0        1       0
    features
docs surprise trust
   1        0     0
   2        0     1
   3        0     0
   4        1     3
   5        0     0
   6        1     1
[ reached max_ndoc ... 340 more documents ]

NRC Polarity Plot

And use the information in a data frame to plot the output as a polarity score, calculated as (positive - negative)/(positive + negative); for example, a headline with two positive terms and one negative term scores (2 - 1)/(2 + 1) ≈ 0.33, and headlines with no sentiment terms are set to 0:

Show code
library(cowplot)
#for the main headlines
df_main_nrc <- convert(main_nrc, to = "data.frame")
df_main_nrc$polarity <- (df_main_nrc$positive - df_main_nrc$negative)/(df_main_nrc$positive + df_main_nrc$negative)
df_main_nrc$polarity[which((df_main_nrc$positive + df_main_nrc$negative) == 0)] <- 0

ggplot(df_main_nrc) + 
  geom_histogram(aes(x=polarity), bins = 15) + 
  theme_minimal_hgrid()
Show code
#and the print headlines
df_print_nrc <- convert(print_nrc, to = "data.frame")
df_print_nrc$polarity <- (df_print_nrc$positive - df_print_nrc$negative)/(df_print_nrc$positive + df_print_nrc$negative)
df_print_nrc$polarity[which((df_print_nrc$positive + df_print_nrc$negative) == 0)] <- 0

ggplot(df_print_nrc) + 
  geom_histogram(aes(x=polarity), bins = 15) + 
  theme_minimal_hgrid()

NRC Sample Results

Looking at the headlines scored as "1", or entirely positive in sentiment, I can see why these specific headlines are being evaluated as 'positive'. Some of the aberrations are likely due to the word 'peace' appearing in a headline even when it refers to peace in the past tense.

Show code
head(main_corpus[which(df_main_nrc$polarity == 1)])
Corpus consisting of 6 documents and 2 docvars.
1 :
"4 Takeaways From the U.S. Deal With the Taliban"

4 :
"At Center of Taliban Deal, a U.S. Envoy Who Made It Personal"

6 :
"Trump Speaks With a Taliban Leader"

8 :
"After Tours in Afghanistan, U.S. Veterans Weigh Peace With t..."

9 :
"Javier Pérez de Cuéllar Dies at 100; U.N. Chief Brokered Pea..."

10 :
"From the Afghan Peace Deal, a Weak and Pliable Neighbor for ..."
Show code
head(print_corpus[which(df_print_nrc$polarity == 1)])
Corpus consisting of 6 documents and 2 docvars.
2 :
"U.S. and Taliban Sign Withdrawal Accord"

4 :
"At the Center of the Taliban Deal, a U.S. Envoy Who Made It ..."

6 :
"Pursuing Exit, Trump Talks  To a Leader Of the Taliban"

7 :
"Attacks on Afghans by Taliban Rise After Signing of Peace De..."

8 :
"After Afghanistan Tours,  U.S. Veterans Appraise  Peace Deal..."

9 :
"Javier Pérez de Cuéllar, U.N. Chief  Behind Vital Peace Pact..."

LSD 2015

I am going to want to look at multiple dictionaries to see if one best applies to this data. Next, the LSD 2015 dictionary:

Show code
# convert main corpus to DFM using the LSD2015 dictionary
main_lsd2015 <- dfm(main_tokens) %>%
  dfm_lookup(data_dictionary_LSD2015)
# create main polarity measure for LSD2015
main_lsd2015 <- convert(main_lsd2015, to = "data.frame")
main_lsd2015$polarity <- (main_lsd2015$positive - main_lsd2015$negative)/(main_lsd2015$positive + main_lsd2015$negative)
main_lsd2015$polarity[which((main_lsd2015$positive + main_lsd2015$negative) == 0)] <- 0
# convert print corpus to DFM using the LSD2015 dictionary
print_lsd2015 <- dfm(print_tokens) %>%
  dfm_lookup(data_dictionary_LSD2015)
# create print polarity measure for LSD2015
print_lsd2015 <- convert(print_lsd2015, to = "data.frame")
print_lsd2015$polarity <- (print_lsd2015$positive - print_lsd2015$negative)/(print_lsd2015$positive + print_lsd2015$negative)
print_lsd2015$polarity[which((print_lsd2015$positive + print_lsd2015$negative) == 0)] <- 0

LSD Sample Results

Looking at the headlines that are indicated as “1”, or positive in sentiment, I can again see why these specific headlines are being evaluated as ‘positive’. At least one of the aberrations is likely due to a headline referencing someone being set free, though it also references him having shot someone. So it’s a mixed bag, as usually seems to be the case.

Show code
head(main_corpus[which(main_lsd2015$polarity == 1)])
Corpus consisting of 6 documents and 2 docvars.
8 :
"After Tours in Afghanistan, U.S. Veterans Weigh Peace With t..."

11 :
"A Secret Accord With the Taliban: When and How the U.S. Woul..."

18 :
"To Save Afghan Peace Deal, U.S. May Scale Back C.I.A. Presen..."

22 :
"Afghan Sides Agree to Rare Cease-Fire During Eid al-Fitr"

37 :
"Taliban Announce Brief Cease-Fire, as Afghan Peace Talks Loo..."

51 :
"Afghan Peace Talks Begin This Week. Here’s What to Know."
Show code
head(print_corpus[which(print_lsd2015$polarity == 1)])
Corpus consisting of 6 documents and 2 docvars.
2 :
"U.S. and Taliban Sign Withdrawal Accord"

8 :
"After Afghanistan Tours,  U.S. Veterans Appraise  Peace Deal..."

18 :
"To Save Peace Deal With Taliban, U.S. May Reduce C.I.A. Pres..."

19 :
"Prominent Retired General Joins Taliban, Stunning the Afghan..."

22 :
"For the First Time Since 2018, the Taliban Agree to a Cease-..."

25 :
"Man Who Shot U.S. Advisers Is Set Free In Afghanistan"

LSD Polarity Plot

And again use the polarity calculation in a data frame to plot the distribution:

Show code
#for the main headlines
ggplot(main_lsd2015) + 
  geom_histogram(aes(x=polarity), bins = 15) + 
  theme_minimal_hgrid()
Show code
#and the print headlines
ggplot(print_lsd2015) + 
  geom_histogram(aes(x=polarity), bins = 15) + 
  theme_minimal_hgrid()

General Inquirer

and the General Inquirer dictionary:

Show code
# convert main corpus to DFM using the General Inquirer dictionary
main_geninq <- dfm(main_tokens) %>%
                    dfm_lookup(data_dictionary_geninqposneg)
# create main polarity measure for GenInq
main_geninq <- convert(main_geninq, to = "data.frame")
main_geninq$polarity <- (main_geninq$positive - main_geninq$negative)/(main_geninq$positive + main_geninq$negative)
main_geninq$polarity[which((main_geninq$positive + main_geninq$negative) == 0)] <- 0
# convert print corpus to DFM using the General Inquirer dictionary
print_geninq <- dfm(print_tokens) %>%
                    dfm_lookup(data_dictionary_geninqposneg)
# create print polarity measure for GenInq
print_geninq <- convert(print_geninq, to = "data.frame")
print_geninq$polarity <- (print_geninq$positive - print_geninq$negative)/(print_geninq$positive + print_geninq$negative)
print_geninq$polarity[which((print_geninq$positive + print_geninq$negative) == 0)] <- 0

General Inquirer Sample Results

Looking at the headlines scored as "1", I can again see why these specific headlines are being evaluated as 'positive'. This one is even more of a mixed bag: the rationale behind each score is clear, but it is also clear that the scoring comes at the expense of subtler subject-matter knowledge.

Show code
head(main_corpus[which(main_geninq$polarity == 1)])
Corpus consisting of 6 documents and 2 docvars.
8 :
"After Tours in Afghanistan, U.S. Veterans Weigh Peace With t..."

9 :
"Javier Pérez de Cuéllar Dies at 100; U.N. Chief Brokered Pea..."

14 :
"As U.S. Troops Leave Afghanistan, Diplomats Are Left to Fill..."

22 :
"Afghan Sides Agree to Rare Cease-Fire During Eid al-Fitr"

23 :
"How the Taliban Outlasted a Superpower: Tenacity and Carnage"

24 :
"Trump Wants Troops in Afghanistan Home by Election Day. The ..."
Show code
head(print_corpus[which(print_geninq$polarity == 1)])
Corpus consisting of 6 documents and 2 docvars.
2 :
"U.S. and Taliban Sign Withdrawal Accord"

9 :
"Javier Pérez de Cuéllar, U.N. Chief  Behind Vital Peace Pact..."

19 :
"Prominent Retired General Joins Taliban, Stunning the Afghan..."

20 :
"Afghans’ Power-Sharing Accord  Honors Official Accused in Ra..."

22 :
"For the First Time Since 2018, the Taliban Agree to a Cease-..."

24 :
"Trump Wants Troops in Afghanistan Home by Election Day"

General Inquirer Polarity Plot

And once more, plot the polarity distribution:

Show code
#for the main headlines
ggplot(main_geninq) + 
  geom_histogram(aes(x=polarity), bins = 15) + 
  theme_minimal_hgrid()
Show code
#and the print headlines
ggplot(print_geninq) + 
  geom_histogram(aes(x=polarity), bins = 15) + 
  theme_minimal_hgrid()

Comparison Study

Create Data Frame of All Results

Now I’m going to be able to compare the different dictionary scores in one data frame for each type of headline.

Main Headlines

Show code
# create unique names for each main headline dataframe
colnames(df_main_nrc) <- paste("nrc", colnames(df_main_nrc), sep = "_")
colnames(main_lsd2015) <- paste("lsd2015", colnames(main_lsd2015), sep = "_")
colnames(main_geninq) <- paste("geninq", colnames(main_geninq), sep = "_")
# now let's compare our estimates
main_sent <- merge(df_main_nrc, main_lsd2015, by.x = "nrc_doc_id", by.y = "lsd2015_doc_id")
main_sent <- merge(main_sent, main_geninq, by.x = "nrc_doc_id", by.y = "geninq_doc_id")
head(main_sent)[1:5]
  nrc_doc_id nrc_anger nrc_anticipation nrc_disgust nrc_fear
1          1         0                1           0        0
2         10         0                3           0        0
3        100         0                0           0        0
4        101         0                2           0        0
5        102         1                0           0        2
6        103         1                1           0        1
Show code
# create unique names for each print headline dataframe
colnames(df_print_nrc) <- paste("nrc", colnames(df_print_nrc), sep = "_")
colnames(print_lsd2015) <- paste("lsd2015", colnames(print_lsd2015), sep = "_")
colnames(print_geninq) <- paste("geninq", colnames(print_geninq), sep = "_")
# now let's compare our estimates
print_sent <- merge(df_print_nrc, print_lsd2015, by.x = "nrc_doc_id", by.y = "lsd2015_doc_id")
print_sent <- merge(print_sent, print_geninq, by.x = "nrc_doc_id", by.y = "geninq_doc_id")
head(print_sent)[1:5]
  nrc_doc_id nrc_anger nrc_anticipation nrc_disgust nrc_fear
1          1         0                0           0        0
2         10         0                2           0        0
3        100         0                0           0        0
4        101         0                2           0        0
5        102         1                0           0        4
6        103         1                1           0        1

Correlation

Now that I have them all in a single data frame, it's straightforward to see how well the different measures of polarity agree across approaches by looking at their correlations with the "cor()" function.

It seems that polarity is slightly more highly correlated between dictionaries for the main headlines than for the print headlines.

For Main Headlines

Show code
cor(main_sent$nrc_polarity, main_sent$lsd2015_polarity)
[1] 0.512629
Show code
cor(main_sent$nrc_polarity, main_sent$geninq_polarity)
[1] 0.4939498
Show code
cor(main_sent$lsd2015_polarity, main_sent$geninq_polarity)
[1] 0.5273237

For Print Headlines

Show code
cor(print_sent$nrc_polarity, print_sent$lsd2015_polarity)
[1] 0.4332694
Show code
cor(print_sent$nrc_polarity, print_sent$geninq_polarity)
[1] 0.4983176
Show code
cor(print_sent$lsd2015_polarity, print_sent$geninq_polarity)
[1] 0.4879879

Correlation of NRC Sentiments

I can take a quick visual look at the correlation between the sentiments detected in the two sets of headlines using the "GGally" package. There seems to be very little difference between the two in that regard.
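The CSV files read in below hold just the ten NRC sentiment columns from the merged data frames created earlier. As a sketch of how they could be produced (hypothetical; it assumes the renamed "nrc_" sentiment columns sit contiguously in the merged data frames):

Show code
#main_sent_nrc_only <- main_sent %>% select(nrc_anger:nrc_trust)
#write.csv(main_sent_nrc_only, "main_sent_nrc_only.csv", row.names = FALSE)
#print_sent_nrc_only <- print_sent %>% select(nrc_anger:nrc_trust)
#write.csv(print_sent_nrc_only, "print_sent_nrc_only.csv", row.names = FALSE)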

Main Headlines

Show code
library(GGally)

main_nrc_only<- read.csv("main_sent_nrc_only.csv")
ggcorr(main_nrc_only, method = c("everything", "pearson"))

Print Headlines

Show code
print_nrc_only<- read.csv("print_sent_nrc_only.csv")
ggcorr(print_nrc_only, method = c("everything", "pearson"))

Linear Model Testing

Finally, I want to look visually at the correlations of positive and negative sentiment as a starting point for understanding the relationships between my sentiment analyses and dictionaries. I'll start by pulling the polarity scores from each text source into its own object and renaming the columns so that all are unique except 'doc_id', which I use to join them into one data frame.

Show code
corr_main <- main_sent %>%
  select(nrc_doc_id, nrc_polarity, lsd2015_polarity, geninq_polarity )
colnames(corr_main) <- c("doc_id", "main_nrc", "main_lsd", "main_geninq")
corr_print <- print_sent %>%
  select(nrc_doc_id, nrc_polarity, lsd2015_polarity, geninq_polarity )
colnames(corr_print) <- c("doc_id", "print_nrc", "print_lsd", "print_geninq")

#join the two sets by doc_id (join() comes from the plyr package)
corr_matrix <- join(corr_main, corr_print, by = "doc_id")
head(corr_matrix)
  doc_id   main_nrc main_lsd main_geninq  print_nrc print_lsd
1      1  1.0000000        0   0.0000000  0.0000000         0
2     10  1.0000000        0   0.3333333  1.0000000        -1
3    100 -1.0000000       -1   0.0000000 -1.0000000        -1
4    101 -0.3333333       -1  -1.0000000 -0.3333333        -1
5    102 -1.0000000       -1   0.0000000 -1.0000000        -1
6    103  1.0000000        0   0.0000000  1.0000000         0
  print_geninq
1    0.0000000
2    0.3333333
3   -1.0000000
4   -1.0000000
5    1.0000000
6    1.0000000

Then I can look at the linear model for each relationship.

NRC

Show code
#run the linear model of main vs. print correlation in the NRC dictionary
lm_nrc <- lm(main_nrc~print_nrc, data = corr_matrix)
summary(lm_nrc)

Call:
lm(formula = main_nrc ~ print_nrc, data = corr_matrix)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.3637 -0.4190  0.1087  0.5810  1.5810 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.10868    0.03398  -3.198  0.00151 ** 
print_nrc    0.47236    0.04668  10.119  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6304 on 344 degrees of freedom
Multiple R-squared:  0.2294,    Adjusted R-squared:  0.2272 
F-statistic: 102.4 on 1 and 344 DF,  p-value: < 2.2e-16

LSD

Show code
#run the linear model of main vs. print correlation in the LSD dictionary
lm_lsd <- lm(main_lsd~print_lsd, data = corr_matrix)
summary(lm_lsd)

Call:
lm(formula = main_lsd ~ print_lsd, data = corr_matrix)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.31105 -0.33699  0.01796  0.50931  1.66301 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.17598    0.03452  -5.097  5.7e-07 ***
print_lsd    0.48703    0.04767  10.218  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6117 on 344 degrees of freedom
Multiple R-squared:  0.2328,    Adjusted R-squared:  0.2306 
F-statistic: 104.4 on 1 and 344 DF,  p-value: < 2.2e-16

General Inquirer

Show code
#run the linear model of main vs. print correlation in the General Inquirer dictionary
lm_geninq <- lm(main_geninq~print_geninq, data = corr_matrix)
summary(lm_geninq)

Call:
lm(formula = main_geninq ~ print_geninq, data = corr_matrix)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.1599 -0.5267  0.1567  0.4733  1.4733 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.15671    0.03540  -4.426 1.29e-05 ***
print_geninq  0.31660    0.04799   6.597 1.59e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6479 on 344 degrees of freedom
Multiple R-squared:  0.1123,    Adjusted R-squared:  0.1097 
F-statistic: 43.52 on 1 and 344 DF,  p-value: 1.586e-10

And check whether there is any meaningful difference between the models. There does not seem to be!

Show code
#create a data frame from the NRC model results (tidy() comes from the broom package)
tidynrc <- tidy(lm_nrc, conf.int = FALSE)
#round the results to 3 decimal places
tidynrc <- tidynrc %>%
  mutate_if(is.numeric, round, 3)
tidynrc$model <- c("nrc")

#create a data frame from the LSD model results
tidylsd <- tidy(lm_lsd, conf.int = FALSE) 
#round the results to 3 decimal places
tidylsd <- tidylsd %>%
  mutate_if(is.numeric, round, 3)
tidylsd$model <- c("lsd")

#create a data frame from the Gen Inq model results
tidygeninq <- tidy(lm_geninq, conf.int = FALSE) 
#round the results to 3 decimal places
tidygeninq <- tidygeninq %>%
  mutate_if(is.numeric, round, 3)
tidygeninq$model <- c("geninq")

tidy_all <- do.call("rbind", list(tidynrc, tidylsd, tidygeninq))

tidy_all
# A tibble: 6 x 6
  term         estimate std.error statistic p.value model 
  <chr>           <dbl>     <dbl>     <dbl>   <dbl> <chr> 
1 (Intercept)    -0.109     0.034     -3.20   0.002 nrc   
2 print_nrc       0.472     0.047     10.1    0     nrc   
3 (Intercept)    -0.176     0.035     -5.10   0     lsd   
4 print_lsd       0.487     0.048     10.2    0     lsd   
5 (Intercept)    -0.157     0.035     -4.43   0     geninq
6 print_geninq    0.317     0.048      6.60   0     geninq
