Analysis of Main vs. Print Headlines: Phase 2

text as data NYT text analysis project sentiment analysis co-occurrence matrix

Text as Data Project Headline Comparison Research Using API Query “Afghanistan”

Kristina Becvar https://kbec19.github.io/NYT-Analysis/ (UMass DACSS Program (My Academic Blog Link))https://kristinabecvar.com
2022-04-26

“Afghanistan” Search Term

This is a continuation of my analysis started in my last post.

Making Different Choices on Inclusion of Observations

In my initial analysis of the headline data, I used the scope of the project from a prior term to set the parameters for the headline API search. I decided to expand the term to include just “Afghanistan” rather than “Afghanistan withdrawal” for a couple of reasons. First, to increase the volume of observations and increase reliability. Second, because I want to look at the comparison between the two search terms for any change.

Gathering Data

Previous Process

The data was pulled via API using the same process as in my first phase of the comparison research, with the only change in the query term “Afghanistan” as opposed to “Afghanistan Withdrawal”. This led to a significantly larger dataset for comparison, though most of the increase in count was filtered out due to their classification as not news-related. Still, the number of observed, relevant headlines increased from 346 to 936 (for each type; main and print headlines).

Load Data

Now to the active review of the data. Loading the data from my collection phase:

Show code
#load data
main_headlines <- read.csv("afghanistan_headlines_main.csv")
main_headlines <- as.data.frame(main_headlines)
#turn into data frame
print_headlines <- read.csv("afghanistan_headlines_print.csv")
print_headlines <- as.data.frame(print_headlines)
#inspect data
head(main_headlines)
  doc_id       date
1      1  7/17/2020
2      2  8/30/2020
3      3   6/2/2021
4      4 12/20/2020
5      5  9/11/2021
6      6   9/1/2021
                                                                           text
1  $174 Million Afghan Drone Program Is Riddled With Problems, U.S. Report Says
2          ‘A Hail Mary’: Psychedelic Therapy Draws Veterans to Jungle Retreats
3   ‘Come On In, Boys’: A Wave of the Hand Sets Off Spain-Morocco Migrant Fight
4 ‘Covid Can’t Compete.’ In a Place Mired in War, the Virus Is an Afterthought.
5    ‘Everything Changed Overnight’: Afghan Reporters Face an Intolerant Regime
6      ‘Finally, I Am Safe’: U.S. Air Base Becomes Temporary Refuge for Afghans
Show code
head(print_headlines)
  doc_id       date
1      1  7/17/2020
2      2  8/30/2020
3      3   6/2/2021
4      4 12/20/2020
5      5  9/11/2021
6      6   9/1/2021
                                                                            text
1 $174 Million Drone Program for Afghans Is Riddled With Problems, Pentagon Says
2                Psychedelic Therapy In the Jungle Soothes The Pain for Veterans
3                                 Morocco Sends Spanish Outpost a Migrant Influx
4         ‘It’s a Lie’: Denial and Skepticism Permeate a Nation Embroiled in War
5                                     ‘Everything Changed’: Media Face Crackdown
6         ‘Finally, I Am Safe’: Thousands Find Temporary Refuge at U.S. Air Base
Show code
all_results <- read.csv("all_results.csv")

Create Corpus

Show code
main_corpus <- corpus(main_headlines, docid_field = "doc_id", text_field = "text")
print_corpus <- corpus(print_headlines, docid_field = "doc_id", text_field = "text")

Assign Type to Docvars

Show code
main_corpus$type <- "Main Headline"
print_corpus$type <- "Print Headline"

docvars(main_corpus, field = "type") <- main_corpus$type
docvars(print_corpus, field = "type") <- print_corpus$type

Tokenization

I want to optimize pre-processing by removing the “�” symbol that has plagued me since starting working with this API by using “remove_symbols=TRUE” in addition to removing the punctuation when tokenizing. I also want to remove stopwords. I do NOT want to use stemming at this point.

Main Headlines

Show code
main_tokens <- tokens(main_corpus) %>%
  tokens(main_corpus, remove_punct = TRUE) %>%
  tokens(main_corpus, remove_numbers = TRUE) %>%
  tokens(main_corpus, remove_symbols = TRUE) %>%
  tokens_remove(stopwords("english")) %>%
  tokens_remove(c("s"))

main_dfm <- dfm(main_tokens)

length(main_tokens)
[1] 936
Show code
print(main_tokens)
Tokens consisting of 936 documents and 2 docvars.
1 :
[1] "Million"  "Afghan"   "Drone"    "Program"  "Riddled"  "Problems"
[7] "U.S"      "Report"   "Says"    

2 :
[1] "Hail"        "Mary"        "Psychedelic" "Therapy"    
[5] "Draws"       "Veterans"    "Jungle"      "Retreats"   

3 :
[1] "Come"          "Boys"          "Wave"          "Hand"         
[5] "Sets"          "Spain-Morocco" "Migrant"       "Fight"        

4 :
[1] "Covid"        "Compete"      "Place"        "Mired"       
[5] "War"          "Virus"        "Afterthought"

5 :
[1] "Everything" "Changed"    "Overnight"  "Afghan"     "Reporters" 
[6] "Face"       "Intolerant" "Regime"    

6 :
[1] "Finally"   "Safe"      "U.S"       "Air"       "Base"     
[6] "Becomes"   "Temporary" "Refuge"    "Afghans"  

[ reached max_ndoc ... 930 more documents ]
Show code
print_tokens <- tokens(print_corpus) %>%
  tokens(print_corpus, remove_punct = TRUE) %>%
  tokens(print_corpus, remove_numbers = TRUE) %>%
  tokens(print_corpus, remove_symbols = TRUE) %>%
  tokens_remove(stopwords("english")) %>%
  tokens_remove(c("s"))

main_dfm <- dfm(print_tokens)

length(print_tokens)
[1] 936
Show code
print(print_tokens)
Tokens consisting of 936 documents and 2 docvars.
1 :
[1] "Million"  "Drone"    "Program"  "Afghans"  "Riddled"  "Problems"
[7] "Pentagon" "Says"    

2 :
[1] "Psychedelic" "Therapy"     "Jungle"      "Soothes"    
[5] "Pain"        "Veterans"   

3 :
[1] "Morocco" "Sends"   "Spanish" "Outpost" "Migrant" "Influx" 

4 :
[1] "Lie"        "Denial"     "Skepticism" "Permeate"   "Nation"    
[6] "Embroiled"  "War"       

5 :
[1] "Everything" "Changed"    "Media"      "Face"       "Crackdown" 

6 :
[1] "Finally"   "Safe"      "Thousands" "Find"      "Temporary"
[6] "Refuge"    "U.S"       "Air"       "Base"     

[ reached max_ndoc ... 930 more documents ]

Document Feature Matrix

Again, this will show me the occurrence of words within each ‘doc’ or headline observation.

Show code
#print dfm
print_dfm <- dfm(print_tokens)
#main dfm
main_dfm <- dfm(main_tokens)
#look at each dfm
print_dfm
Document-feature matrix of: 936 documents, 2,661 features (99.75% sparse) and 2 docvars.
    features
docs million drone program afghans riddled problems pentagon says
   1       1     1       1       1       1        1        1    1
   2       0     0       0       0       0        0        0    0
   3       0     0       0       0       0        0        0    0
   4       0     0       0       0       0        0        0    0
   5       0     0       0       0       0        0        0    0
   6       0     0       0       0       0        0        0    0
    features
docs psychedelic therapy
   1           0       0
   2           1       1
   3           0       0
   4           0       0
   5           0       0
   6           0       0
[ reached max_ndoc ... 930 more documents, reached max_nfeat ... 2,651 more features ]
Show code
main_dfm
Document-feature matrix of: 936 documents, 2,745 features (99.74% sparse) and 2 docvars.
    features
docs million afghan drone program riddled problems u.s report says
   1       1      1     1       1       1        1   1      1    1
   2       0      0     0       0       0        0   0      0    0
   3       0      0     0       0       0        0   0      0    0
   4       0      0     0       0       0        0   0      0    0
   5       0      1     0       0       0        0   0      0    0
   6       0      0     0       0       0        0   1      0    0
    features
docs hail
   1    0
   2    1
   3    0
   4    0
   5    0
   6    0
[ reached max_ndoc ... 930 more documents, reached max_nfeat ... 2,735 more features ]

Word Frequency Ratings

Again, I can take a preliminary look at the data frame from each of the headlines to see the most frequent words after pre-processing.

The only significant change by removing “withdrawal” from my search term is that the term “exit” is not present any longer on the print headline frequency header. This is logical.

Show code
#create a word frequency variable and the rankings
#main headlines
main_counts <- as.data.frame(sort(colSums(main_dfm),dec=T))
colnames(main_counts) <- c("Frequency")
main_counts$Rank <- c(1:ncol(main_dfm))
head(main_counts)
            Frequency Rank
u.s               166    1
afghan            151    2
afghanistan       135    3
taliban           110    4
biden              95    5
war                64    6
Show code
#print headlines
print_counts <- as.data.frame(sort(colSums(print_dfm),dec=T))
colnames(print_counts) <- c("Frequency")
print_counts$Rank <- c(1:ncol(print_dfm))
head(print_counts)
            Frequency Rank
u.s               171    1
afghan            109    2
taliban           108    3
afghanistan        84    4
biden              76    5
war                53    6

Feature Co-Occurrence Matrix

Now I can take a look at this network of feature co-occurrences again. This time, I’m also going to increase the feature count from 20 to 30.

First, for the main headlines:

Show code
# create fcm from dfm
main_fcm <- fcm(main_dfm)
# check the dimensions (i.e., the number of rows and the number of columnns)
# of the matrix we created
dim(main_fcm)
[1] 2745 2745
Show code
# pull the top features
myFeatures <- names(topfeatures(main_fcm, 30))
# retain only those top features as part of our matrix
smaller_main_fcm <- fcm_select(main_fcm, pattern = myFeatures, selection = "keep")
# check dimensions
dim(smaller_main_fcm)
[1] 30 30
Show code
# compute size weight for vertices in network
size <- log(colSums(smaller_main_fcm))
# create plot
textplot_network(smaller_main_fcm, vertex_size = size / max(size) * 3)

and for the print headlines:

Show code
# create fcm from dfm
print_fcm <- fcm(print_dfm)
# check the dimensions (i.e., the number of rows and the number of columnns)
# of the matrix we created
dim(print_fcm)
[1] 2661 2661
Show code
# pull the top features
myFeatures <- names(topfeatures(print_fcm, 30))
# retain only those top features as part of our matrix
smaller_print_fcm <- fcm_select(print_fcm, pattern = myFeatures, selection = "keep")
# check dimensions
dim(smaller_print_fcm)
[1] 30 30
Show code
# compute size weight for vertices in network
size <- log(colSums(smaller_print_fcm))
# create plot
textplot_network(smaller_print_fcm, vertex_size = size / max(size) * 3)

The resulting matrices have definitely changed, at least slightly.

Dictionary Analysis

To compare equally both this and my initial sentiment analysis, I am going to use the three dictionaries we used in the course tutorial, the NRC, LSD(2015) and General Inquiry dictionaries.

NRC

I am first using the “liwcalike()” function from the quanteda.dictionaries package to apply the NRC dictionary. I can take a look at the head or tail and choose to look at a snapshot of the sentiments that have been applied to the corpus for each text group. Just at first glance, I can again see some differences in the scoring.

Show code
#use liwcalike() to estimate sentiment using NRC dictionary
#for main headlines
main_sentiment_nrc <- liwcalike(as.character(main_corpus), data_dictionary_NRC)
head(main_sentiment_nrc)[7:12]
  anger anticipation disgust fear   joy negative
1  0.00         0.00    0.00 0.00  0.00    13.33
2  0.00         0.00    0.00 7.69  0.00     7.69
3  5.00         0.00    0.00 5.00  0.00     5.00
4  0.00         0.00    0.00 5.26  0.00    10.53
5  8.33         0.00    8.33 8.33  0.00     8.33
6  0.00         5.88    5.88 0.00 11.76     0.00
Show code
#and print headlines
print_sentiment_nrc <- liwcalike(as.character(print_corpus), data_dictionary_NRC)
head(print_sentiment_nrc)[11:16]
    joy negative positive sadness surprise trust
1  0.00    14.29     0.00    0.00     0.00  0.00
2  0.00    10.00     0.00   10.00     0.00  0.00
3  0.00     0.00     0.00    0.00     0.00  0.00
4  0.00    26.67     0.00    6.67     0.00  6.67
5  0.00     0.00     0.00    0.00     0.00  0.00
6 11.76     0.00    11.76    0.00     5.88 17.65

NRC as DFM

I can also put the results into a document feature matrix for each text group:

Show code
#convert tokens from each headline data set to DFM using the dictionary "NRC"
main_nrc <- dfm(main_tokens) %>%
  dfm_lookup(data_dictionary_NRC)
print_nrc <- dfm(print_tokens) %>%
  dfm_lookup(data_dictionary_NRC)

dim(main_nrc)
[1] 936  10
Show code
main_nrc
Document-feature matrix of: 936 documents, 10 features (67.61% sparse) and 2 docvars.
    features
docs anger anticipation disgust fear joy negative positive sadness
   1     0            0       0    0   0        2        0       0
   2     0            0       0    1   0        1        1       0
   3     1            0       0    1   0        1        0       0
   4     0            0       0    1   0        2        0       0
   5     1            0       1    1   0        1        0       1
   6     0            1       1    0   2        0        2       0
    features
docs surprise trust
   1        0     0
   2        0     1
   3        0     0
   4        0     0
   5        0     0
   6        1     3
[ reached max_ndoc ... 930 more documents ]
Show code
dim(print_nrc)
[1] 936  10
Show code
print_nrc
Document-feature matrix of: 936 documents, 10 features (69.21% sparse) and 2 docvars.
    features
docs anger anticipation disgust fear joy negative positive sadness
   1     0            0       0    0   0        2        0       0
   2     0            0       0    2   0        1        0       1
   3     0            0       0    1   0        0        0       0
   4     1            0       1    1   0        4        0       1
   5     0            0       0    0   0        0        0       0
   6     0            1       1    0   2        0        2       0
    features
docs surprise trust
   1        0     0
   2        0     0
   3        0     0
   4        0     1
   5        0     0
   6        1     3
[ reached max_ndoc ... 930 more documents ]

NRC Polarity Plot

And use the information in a data frame to plot the output as represented by a calculation for polarity:

Show code
library(cowplot)
#for the main headlines
df_main_nrc <- convert(main_nrc, to = "data.frame")
df_main_nrc$polarity <- (df_main_nrc$positive - df_main_nrc$negative)/(df_main_nrc$positive + df_main_nrc$negative)
df_main_nrc$polarity[which((df_main_nrc$positive + df_main_nrc$negative) == 0)] <- 0

ggplot(df_main_nrc) + 
  geom_histogram(aes(x=polarity), bins = 15) + 
  theme_minimal_hgrid()
Show code
#and the print headlines
df_print_nrc <- convert(print_nrc, to = "data.frame")
df_print_nrc$polarity <- (df_print_nrc$positive - df_print_nrc$negative)/(df_print_nrc$positive + df_print_nrc$negative)
df_print_nrc$polarity[which((df_print_nrc$positive + df_print_nrc$negative) == 0)] <- 0

ggplot(df_print_nrc) + 
  geom_histogram(aes(x=polarity), bins = 15) + 
  theme_minimal_hgrid()

NRC Sample Results

Looking at the headlines that are indicated as “1”, or positive in sentiment, this expanded corpus reflects more positivity than the top results from the smaller corpus.

Show code
head(main_corpus[which(df_main_nrc$polarity == 1)])
Corpus consisting of 6 documents and 2 docvars.
6 :
"‘Finally, I Am Safe’: U.S. Air Base Becomes Temporary Refuge..."

8 :
"‘Football Is Like Food’: Afghan Female Soccer Players Find a..."

18 :
"‘Is Austin on Your List?’: Biden’s Pentagon Pick Rose Despit..."

29 :
"‘We Have to Try’: Lawmakers Rush to Assist in Afghanistan Ev..."

40 :
"4 Takeaways From the U.S. Deal With the Taliban"

43 :
"98 Countries Pledge to Accept Afghans After U.S. Military De..."
Show code
head(print_corpus[which(df_print_nrc$polarity == 1)])
Corpus consisting of 6 documents and 2 docvars.
6 :
"‘Finally, I Am Safe’: Thousands Find Temporary Refuge at U.S..."

15 :
"Veterans Feel Urgency to Aid Afghan Allies"

18 :
"How Biden’s Defense Nominee Overcame Barriers to Diversity"

25 :
"How Biden, by Turns Genial and Blunt, Built Diplomatic Bridg..."

45 :
"Rescue Flight To Germany Inspires Name For Newborn"

50 :
"A Call for the Return of Civility, And Truth as a Guiding Li..."

LSD 2015

I am going to want to look at multiple dictionaries to see if one can best apply to this data. Next, the LSD 2015 dictionary:

Show code
# convert main corpus to DFM using the LSD2015 dictionary
main_lsd2015 <- dfm(main_tokens) %>%
  dfm_lookup(data_dictionary_LSD2015)
# create main polarity measure for LSD2015
main_lsd2015 <- convert(main_lsd2015, to = "data.frame")
main_lsd2015$polarity <- (main_lsd2015$positive - main_lsd2015$negative)/(main_lsd2015$positive + main_lsd2015$negative)
main_lsd2015$polarity[which((main_lsd2015$positive + main_lsd2015$negative) == 0)] <- 0
# convert print corpus to DFM using the LSD2015 dictionary
print_lsd2015 <- dfm(print_tokens) %>%
  dfm_lookup(data_dictionary_LSD2015)
# create print polarity measure for LSD2015
print_lsd2015 <- convert(print_lsd2015, to = "data.frame")
print_lsd2015$polarity <- (print_lsd2015$positive - print_lsd2015$negative)/(print_lsd2015$positive + print_lsd2015$negative)
print_lsd2015$polarity[which((print_lsd2015$positive + print_lsd2015$negative) == 0)] <- 0

LSD Sample Results

Looking at the headlines that are indicated as “1”, or positive in sentiment, I can again see why these specific headlines are being evaluated as ‘positive’ despite more aberrations than in the NRC dictionary.

Show code
head(main_corpus[which(main_lsd2015$polarity == 1)])
Corpus consisting of 6 documents and 2 docvars.
6 :
"‘Finally, I Am Safe’: U.S. Air Base Becomes Temporary Refuge..."

8 :
"‘Football Is Like Food’: Afghan Female Soccer Players Find a..."

19 :
"‘It’s Like Falling in Love’: Israeli Entrepreneurs Welcomed ..."

25 :
"‘Strategic Empathy’: How Biden’s Informal Diplomacy Shaped F..."

29 :
"‘We Have to Try’: Lawmakers Rush to Assist in Afghanistan Ev..."

43 :
"98 Countries Pledge to Accept Afghans After U.S. Military De..."
Show code
head(print_corpus[which(print_lsd2015$polarity == 1)])
Corpus consisting of 6 documents and 2 docvars.
6 :
"‘Finally, I Am Safe’: Thousands Find Temporary Refuge at U.S..."

15 :
"Veterans Feel Urgency to Aid Afghan Allies"

29 :
"As Panicked Afghans Seek Help, Lawmakers Say, ‘We Have to Tr..."

31 :
"‘She Was Alone’: One Official’s Harrowing Escape From Kabul"

43 :
"98 Countries Pledge to Accept Afghans After U.S. Departure"

45 :
"Rescue Flight To Germany Inspires Name For Newborn"

LSD Polarity Plot

And use the information in a data frame to plot the output as represented by a calculation for polarity:

Show code
#for the main headlines
ggplot(main_lsd2015) + 
  geom_histogram(aes(x=polarity), bins = 15) + 
  theme_minimal_hgrid()
Show code
#and the print headlines
ggplot(print_lsd2015) + 
  geom_histogram(aes(x=polarity), bins = 15) + 
  theme_minimal_hgrid()

General Inquirer

and the General Inquirer dictionary:

Show code
# convert main corpus to DFM using the General Inquirer dictionary
main_geninq <- dfm(main_tokens) %>%
                    dfm_lookup(data_dictionary_geninqposneg)
# create main polarity measure for GenInq
main_geninq <- convert(main_geninq, to = "data.frame")
main_geninq$polarity <- (main_geninq$positive - main_geninq$negative)/(main_geninq$positive + main_geninq$negative)
main_geninq$polarity[which((main_geninq$positive + main_geninq$negative) == 0)] <- 0
# convert print corpus to DFM using the General Inquirer dictionary
print_geninq <- dfm(print_tokens) %>%
                    dfm_lookup(data_dictionary_geninqposneg)
# create print polarity measure for GenInq
print_geninq <- convert(print_geninq, to = "data.frame")
print_geninq$polarity <- (print_geninq$positive - print_geninq$negative)/(print_geninq$positive + print_geninq $negative)
print_geninq$polarity[which((print_geninq$positive + print_geninq$negative) == 0)] <- 0

General Inquirer Sample Results

Looking at the headlines that are indicated as “1”, or positive in sentiment, again - this one is even more of a mixed bag, with the sentiment rationale clear. However, it is also clear why the rationale is being used at the expense of subtle subject matter knowledge.

Show code
head(main_corpus[which(main_geninq$polarity == 1)])
Corpus consisting of 6 documents and 2 docvars.
6 :
"‘Finally, I Am Safe’: U.S. Air Base Becomes Temporary Refuge..."

8 :
"‘Football Is Like Food’: Afghan Female Soccer Players Find a..."

12 :
"‘I Forget About the World:’ Afghan Youth Find Escape in a Vi..."

19 :
"‘It’s Like Falling in Love’: Israeli Entrepreneurs Welcomed ..."

27 :
"‘They Just Left Us’: Greece Is Accused of Setting Migrants A..."

43 :
"98 Countries Pledge to Accept Afghans After U.S. Military De..."
Show code
head(print_corpus[which(print_geninq$polarity == 1)])
Corpus consisting of 6 documents and 2 docvars.
6 :
"‘Finally, I Am Safe’: Thousands Find Temporary Refuge at U.S..."

15 :
"Veterans Feel Urgency to Aid Afghan Allies"

18 :
"How Biden’s Defense Nominee Overcame Barriers to Diversity"

33 :
"‘Why Do We Deserve to Die?’ Burying Hazara Girls in Kabul"

42 :
"New Wave of Refugees Faces A Much Chillier U.S. Welcome"

43 :
"98 Countries Pledge to Accept Afghans After U.S. Departure"

General Inquirer Polarity Plot

And use the information in a data frame to plot the output as represented by a calculation for polarity:

Show code
#for the main headlines
ggplot(main_geninq) + 
  geom_histogram(aes(x=polarity), bins = 15) + 
  theme_minimal_hgrid()
Show code
#and the print headlines
ggplot(print_geninq) + 
  geom_histogram(aes(x=polarity), bins = 15) + 
  theme_minimal_hgrid()

Comparison Study

Create Data Frame of All Results

Now I’m going to be able to compare the different dictionary scores in one data frame for each type of headline.

Main Headlines

Show code
# create unique names for each main headline dataframe
colnames(df_main_nrc) <- paste("nrc", colnames(df_main_nrc), sep = "_")
colnames(main_lsd2015) <- paste("lsd2015", colnames(main_lsd2015), sep = "_")
colnames(main_geninq) <- paste("geninq", colnames(main_geninq), sep = "_")
# now let's compare our estimates
main_sent <- merge(df_main_nrc, main_lsd2015, by.x = "nrc_doc_id", by.y = "lsd2015_doc_id")
main_sent <- merge(main_sent, main_geninq, by.x = "nrc_doc_id", by.y = "geninq_doc_id")
head(main_sent)[1:5]
  nrc_doc_id nrc_anger nrc_anticipation nrc_disgust nrc_fear
1          1         0                0           0        0
2         10         0                0           0        2
3        100         0                0           0        1
4        101         0                2           0        0
5        102         0                1           0        0
6        103         1                1           1        1
Show code
# create unique names for each print headline dataframe
colnames(df_print_nrc) <- paste("nrc", colnames(df_print_nrc), sep = "_")
colnames(print_lsd2015) <- paste("lsd2015", colnames(print_lsd2015), sep = "_")
colnames(print_geninq) <- paste("geninq", colnames(print_geninq), sep = "_")
# now let's compare our estimates
print_sent <- merge(df_print_nrc, print_lsd2015, by.x = "nrc_doc_id", by.y = "lsd2015_doc_id")
print_sent <- merge(print_sent, print_geninq, by.x = "nrc_doc_id", by.y = "geninq_doc_id")
head(print_sent)[1:5]
  nrc_doc_id nrc_anger nrc_anticipation nrc_disgust nrc_fear
1          1         0                0           0        0
2         10         0                0           0        2
3        100         0                0           0        1
4        101         0                2           0        0
5        102         0                1           0        0
6        103         1                0           1        1
Show code
write.csv(main_sent, file="main_sent.csv")
write.csv(print_sent, file="print_sent.csv")

Correlation

Now that I have them all in a single data frame, it’s straightforward to figure out a bit about how well our different measures of polarity agree across the different approaches by looking at their correlation using the “cor()” function.

It seems like the polarity of the headlines are more similar in this expanded analysis.

For Main Headlines

Show code
cor(main_sent$nrc_polarity, main_sent$lsd2015_polarity)
[1] 0.4619394
Show code
cor(main_sent$nrc_polarity, main_sent$geninq_polarity)
[1] 0.4702233
Show code
cor(main_sent$lsd2015_polarity, main_sent$geninq_polarity)
[1] 0.5321535

For Print Headlines

Show code
cor(print_sent$nrc_polarity, print_sent$lsd2015_polarity)
[1] 0.4700115
Show code
cor(print_sent$nrc_polarity, print_sent$geninq_polarity)
[1] 0.4824528
Show code
cor(print_sent$lsd2015_polarity, print_sent$geninq_polarity)
[1] 0.4933739

Correlation of NRC Sentiments

I can take a quick visual look at the correlation between sentiments detected in both sets of headlines using the “GGally” package. There seems to be very little difference in that regard.

Main Headlines

Show code
library(GGally)

main_nrc_only<- read.csv("main_sent_nrc_only.csv")
ggcorr(main_nrc_only, method = c("everything", "pearson"))

Show code
print_nrc_only<- read.csv("print_sent_nrc_only.csv")
ggcorr(print_nrc_only, method = c("everything", "pearson"))

Linear Model Testing

Finally, I want to visually look at the correlations or positive and negative sentiments as my starting point for understanding relationships between both my sentiment analyses and dictionaries. I’ll start by dividing the sentiment scores for positive and negative from each text source into its own object and change column names to make them unique except for ‘doc_id’ for joining them into one data frame.

Show code
corr_main <- main_sent %>%
  select(nrc_doc_id, nrc_polarity, lsd2015_polarity, geninq_polarity )
colnames(corr_main) <- c("doc_id", "main_nrc", "main_lsd", "main_geninq")
corr_print <- print_sent %>%
  select(nrc_doc_id, nrc_polarity, lsd2015_polarity, geninq_polarity )
colnames(corr_print) <- c("doc_id", "print_nrc", "print_lsd", "print_geninq")

corr_matrix <- join(corr_main, corr_print, by = "doc_id")
head(corr_matrix)
  doc_id main_nrc main_lsd main_geninq  print_nrc  print_lsd
1      1       -1       -1           0 -1.0000000 -1.0000000
2     10       -1       -1          -1 -1.0000000 -1.0000000
3    100       -1       -1          -1 -1.0000000 -1.0000000
4    101        1        0           1  0.3333333 -0.3333333
5    102        1        1           1  1.0000000  1.0000000
6    103        0        0           0 -1.0000000 -1.0000000
  print_geninq
1    0.0000000
2   -1.0000000
3   -1.0000000
4    0.3333333
5    1.0000000
6   -1.0000000

Then I can look at the model for each relationship

NRC

Show code
#run the linear model of main vs. print correlation in the NRC dictionary
lm_nrc <- lm(main_nrc~print_nrc, data = corr_matrix)
summary(lm_nrc)

Call:
lm(formula = main_nrc ~ print_nrc, data = corr_matrix)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.36631 -0.44844  0.09262  0.55156  1.55156 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.09262    0.02090  -4.432 1.05e-05 ***
print_nrc    0.45893    0.02838  16.173  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6348 on 934 degrees of freedom
Multiple R-squared:  0.2188,    Adjusted R-squared:  0.2179 
F-statistic: 261.6 on 1 and 934 DF,  p-value: < 2.2e-16

LSD

Show code
#run the linear model of main vs. print correlation in the LSD dictionary
lm_lsd <- lm(main_lsd~print_lsd, data = corr_matrix)
summary(lm_lsd)

Call:
lm(formula = main_lsd ~ print_lsd, data = corr_matrix)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.3257 -0.3098 -0.3098  0.3516  1.6903 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.18226    0.02113  -8.627   <2e-16 ***
print_lsd    0.50799    0.02739  18.549   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5908 on 934 degrees of freedom
Multiple R-squared:  0.2692,    Adjusted R-squared:  0.2684 
F-statistic: 344.1 on 1 and 934 DF,  p-value: < 2.2e-16

General Inquiry

Show code
#run the linear model of main vs. print correlation in the General Inquiry dictionary
lm_geninq <- lm(main_geninq~print_geninq, data = corr_matrix)
summary(lm_geninq)

Call:
lm(formula = main_geninq ~ print_geninq, data = corr_matrix)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2563 -0.4979  0.1229  0.5021  1.5021 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.12292    0.02168  -5.669 1.91e-08 ***
print_geninq  0.37920    0.02979  12.730  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6409 on 934 degrees of freedom
Multiple R-squared:  0.1478,    Adjusted R-squared:  0.1469 
F-statistic:   162 on 1 and 934 DF,  p-value: < 2.2e-16

And try to look at if there is any meaningful difference between the models. Despite the differences in the expanded dataset from the primary one, there stil does not seem to be any meaningful difference between main and print headlines based on these analyses.

Show code
#create a data frame from the NRC model results
tidynrc <- tidy(lm_nrc, conf.int = FALSE) 
#round the results to 3 decimal points
tidynrc <- tidynrc %>%
  mutate_if(is.numeric, round, 3)
tidynrc$model <- c("nrc")

#create a data frame from the LSD model results
tidylsd <- tidy(lm_lsd, conf.int = FALSE) 
#round the results to 3 decimal points
tidylsd <- tidylsd %>%
  mutate_if(is.numeric, round, 3)
tidylsd$model <- c("lsd")

#create a data frame from the Gen Inq model results
tidygeninq <- tidy(lm_geninq, conf.int = FALSE) 
#round the results to 3 decimal points
tidygeninq <- tidygeninq %>%
  mutate_if(is.numeric, round, 3)
tidygeninq$model <- c("geninq")

tidy_all <- do.call("rbind", list(tidynrc, tidylsd, tidygeninq))

tidy_all
# A tibble: 6 x 6
  term         estimate std.error statistic p.value model 
  <chr>           <dbl>     <dbl>     <dbl>   <dbl> <chr> 
1 (Intercept)    -0.093     0.021     -4.43       0 nrc   
2 print_nrc       0.459     0.028     16.2        0 nrc   
3 (Intercept)    -0.182     0.021     -8.63       0 lsd   
4 print_lsd       0.508     0.027     18.5        0 lsd   
5 (Intercept)    -0.123     0.022     -5.67       0 geninq
6 print_geninq    0.379     0.03      12.7        0 geninq

Visualizing NRC Sentiment

Main Headlines

Show code
#main headlines
head(main_nrc)
Document-feature matrix of: 6 documents, 10 features (65.00% sparse) and 2 docvars.
    features
docs anger anticipation disgust fear joy negative positive sadness
   1     0            0       0    0   0        2        0       0
   2     0            0       0    1   0        1        1       0
   3     1            0       0    1   0        1        0       0
   4     0            0       0    1   0        2        0       0
   5     1            0       1    1   0        1        0       1
   6     0            1       1    0   2        0        2       0
    features
docs surprise trust
   1        0     0
   2        0     1
   3        0     0
   4        0     0
   5        0     0
   6        1     3
Show code
#transpose
main_df <-data.frame(t(main_nrc))
#The function rowSums computes column sums across rows for each level of a grouping variable.
df_new <- data.frame(rowSums(main_df[2:937]))
#Transformation and cleaning
names(df_new)[1] <- "count"
df_new <- cbind("sentiment" = rownames(df_new), df_new)
rownames(df_new) <- NULL
df_new2<-df_new[1:10,]
df_new2 <- read.csv("df_new2.csv")
#Plot One - count of words associated with each sentiment
quickplot(sentiment, data=df_new2, weight=count, geom="bar", fill=sentiment, ylab="count")+ggtitle("Main Headline Sentiments")

Show code
#print headlines
head(print_nrc)
Document-feature matrix of: 6 documents, 10 features (71.67% sparse) and 2 docvars.
    features
docs anger anticipation disgust fear joy negative positive sadness
   1     0            0       0    0   0        2        0       0
   2     0            0       0    2   0        1        0       1
   3     0            0       0    1   0        0        0       0
   4     1            0       1    1   0        4        0       1
   5     0            0       0    0   0        0        0       0
   6     0            1       1    0   2        0        2       0
    features
docs surprise trust
   1        0     0
   2        0     0
   3        0     0
   4        0     1
   5        0     0
   6        1     3
Show code
#transpose
main_df2 <-data.frame(t(print_nrc))
#The function rowSums computes column sums across rows for each level of a grouping variable.
df_new3 <- data.frame(rowSums(main_df2[2:937]))
#Transformation and cleaning
names(df_new3)[1] <- "count"
df_new3 <- cbind("sentiment" = rownames(df_new3), df_new3)
rownames(df_new3) <- NULL
df_new4<-df_new3[1:10,]
df_new4 <- read.csv("df_new4.csv")
#Plot One - count of words associated with each sentiment
quickplot(sentiment, data=df_new4, weight=count, geom="bar", fill=sentiment, ylab="count")+ggtitle("Print Headline Sentiments")

Clustered Bar Chart

Show code
#load data frame with both headline sentiments from NRC
df_dual <- read.csv("df_dual.csv")

ggplot(df_dual,
       aes(x = sentiment,
           y = count,
           fill = headline)) +
  geom_bar(stat = "identity",
           position = "dodge") +
  scale_fill_manual(values=c("#993333", "#336699")) +
  theme_minimal()

Summary

Although I was not able to find any statistically meaningful, measurable difference between the sentiments of print vs. main headlines analyzed in this project, it is still a valid observation that there is an overall pattern to be observed.

Specifically, there is a pattern that the print headlines carry a lower level of emotionally weighted words than the online headlines. I can’t make a factual observation, but I would like to do further, expanded studies in articles in this research path to investigate the hypothesis.

Citations