Text as Data Project: Headline Comparison Research Using API Query “Afghanistan”
This is a continuation of my analysis started in my last post.
In my initial analysis of the headline data, I used the scope of a prior term’s project to set the parameters for the headline API search. I decided to broaden the query to just “Afghanistan” rather than “Afghanistan withdrawal” for two reasons: first, to increase the volume of observations and thereby the reliability of the analysis; second, to compare results between the two search terms and look for any change.
The data was pulled via the API using the same process as in the first phase of this comparison research, with the only change being the query term: “Afghanistan” rather than “Afghanistan withdrawal”. This produced a significantly larger dataset, though much of the added volume was filtered out because those results were classified as not news-related. Still, the number of relevant headlines observed increased from 346 to 936 (for each type: main and print headlines).
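For reference, a minimal sketch of that pull-and-filter step, assuming the same NYT Article Search API workflow as phase one (jsonlite, an API key stored in an NYT_KEY environment variable, and paging over results omitted; type_of_material is the field the NYT response schema uses to flag news items):
#query the Article Search API for "Afghanistan" (single page shown; the real
#pull loops over pages and date ranges)
library(jsonlite)
base_url <- "https://api.nytimes.com/svc/search/v2/articlesearch.json"
res <- fromJSON(paste0(base_url, "?q=Afghanistan&api-key=", Sys.getenv("NYT_KEY")))
docs <- res$response$docs
#keep only news material, mirroring the filtering described above
docs <- docs[docs$type_of_material == "News", ]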
Now, on to the active review of the data. I start by loading the data from my collection phase:
#load required packages (assumed loaded from the earlier posts in this series)
library(quanteda)
library(quanteda.textplots) #textplot_network()
library(quanteda.dictionaries) #liwcalike(), data_dictionary_NRC
library(quanteda.sentiment) #data_dictionary_LSD2015, data_dictionary_geninqposneg
library(ggplot2)
#load data and coerce to data frames
main_headlines <- read.csv("afghanistan_headlines_main.csv")
main_headlines <- as.data.frame(main_headlines)
print_headlines <- read.csv("afghanistan_headlines_print.csv")
print_headlines <- as.data.frame(print_headlines)
#inspect data
head(main_headlines)
doc_id date
1 1 7/17/2020
2 2 8/30/2020
3 3 6/2/2021
4 4 12/20/2020
5 5 9/11/2021
6 6 9/1/2021
text
1 $174 Million Afghan Drone Program Is Riddled With Problems, U.S. Report Says
2 ‘A Hail Mary’: Psychedelic Therapy Draws Veterans to Jungle Retreats
3 ‘Come On In, Boys’: A Wave of the Hand Sets Off Spain-Morocco Migrant Fight
4 ‘Covid Can’t Compete.’ In a Place Mired in War, the Virus Is an Afterthought.
5 ‘Everything Changed Overnight’: Afghan Reporters Face an Intolerant Regime
6 ‘Finally, I Am Safe’: U.S. Air Base Becomes Temporary Refuge for Afghans
head(print_headlines)
doc_id date
1 1 7/17/2020
2 2 8/30/2020
3 3 6/2/2021
4 4 12/20/2020
5 5 9/11/2021
6 6 9/1/2021
text
1 $174 Million Drone Program for Afghans Is Riddled With Problems, Pentagon Says
2 Psychedelic Therapy In the Jungle Soothes The Pain for Veterans
3 Morocco Sends Spanish Outpost a Migrant Influx
4 ‘It’s a Lie’: Denial and Skepticism Permeate a Nation Embroiled in War
5 ‘Everything Changed’: Media Face Crackdown
6 ‘Finally, I Am Safe’: Thousands Find Temporary Refuge at U.S. Air Base
all_results <- read.csv("all_results.csv")
I want to optimize pre-processing by removing the “�” symbol (a mis-encoded character that has plagued me since I started working with this API) by adding “remove_symbols = TRUE” alongside punctuation removal when tokenizing. I also want to remove stopwords. I do NOT want to use stemming at this point.
#build the corpus from the main-headline data frame (text and doc_id columns)
main_corpus <- corpus(main_headlines)
#tokenize, dropping punctuation, numbers, and symbols, then remove stopwords
main_tokens <- tokens(main_corpus,
                      remove_punct = TRUE,
                      remove_numbers = TRUE,
                      remove_symbols = TRUE) %>%
  tokens_remove(stopwords("english")) %>%
  tokens_remove(c("s"))
main_dfm <- dfm(main_tokens)
length(main_tokens)
[1] 936
print(main_tokens)
Tokens consisting of 936 documents and 2 docvars.
1 :
[1] "Million" "Afghan" "Drone" "Program" "Riddled" "Problems"
[7] "U.S" "Report" "Says"
2 :
[1] "Hail" "Mary" "Psychedelic" "Therapy"
[5] "Draws" "Veterans" "Jungle" "Retreats"
3 :
[1] "Come" "Boys" "Wave" "Hand"
[5] "Sets" "Spain-Morocco" "Migrant" "Fight"
4 :
[1] "Covid" "Compete" "Place" "Mired"
[5] "War" "Virus" "Afterthought"
5 :
[1] "Everything" "Changed" "Overnight" "Afghan" "Reporters"
[6] "Face" "Intolerant" "Regime"
6 :
[1] "Finally" "Safe" "U.S" "Air" "Base"
[6] "Becomes" "Temporary" "Refuge" "Afghans"
[ reached max_ndoc ... 930 more documents ]
#same pre-processing for the print headlines
print_corpus <- corpus(print_headlines)
print_tokens <- tokens(print_corpus,
                       remove_punct = TRUE,
                       remove_numbers = TRUE,
                       remove_symbols = TRUE) %>%
  tokens_remove(stopwords("english")) %>%
  tokens_remove(c("s"))
print_dfm <- dfm(print_tokens)
length(print_tokens)
[1] 936
print(print_tokens)
Tokens consisting of 936 documents and 2 docvars.
1 :
[1] "Million" "Drone" "Program" "Afghans" "Riddled" "Problems"
[7] "Pentagon" "Says"
2 :
[1] "Psychedelic" "Therapy" "Jungle" "Soothes"
[5] "Pain" "Veterans"
3 :
[1] "Morocco" "Sends" "Spanish" "Outpost" "Migrant" "Influx"
4 :
[1] "Lie" "Denial" "Skepticism" "Permeate" "Nation"
[6] "Embroiled" "War"
5 :
[1] "Everything" "Changed" "Media" "Face" "Crackdown"
6 :
[1] "Finally" "Safe" "Thousands" "Find" "Temporary"
[6] "Refuge" "U.S" "Air" "Base"
[ reached max_ndoc ... 930 more documents ]
Again, this will show me the occurrence of words within each ‘doc’, or headline observation.
print_dfm
Document-feature matrix of: 936 documents, 2,661 features (99.75% sparse) and 2 docvars.
features
docs million drone program afghans riddled problems pentagon says
1 1 1 1 1 1 1 1 1
2 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0
features
docs psychedelic therapy
1 0 0
2 1 1
3 0 0
4 0 0
5 0 0
6 0 0
[ reached max_ndoc ... 930 more documents, reached max_nfeat ... 2,651 more features ]
main_dfm
Document-feature matrix of: 936 documents, 2,745 features (99.74% sparse) and 2 docvars.
features
docs million afghan drone program riddled problems u.s report says
1 1 1 1 1 1 1 1 1 1
2 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0
5 0 1 0 0 0 0 0 0 0
6 0 0 0 0 0 0 1 0 0
features
docs hail
1 0
2 1
3 0
4 0
5 0
6 0
[ reached max_ndoc ... 930 more documents, reached max_nfeat ... 2,735 more features ]
Again, I can take a preliminary look at the most frequent words in each set of headlines after pre-processing.
The only significant change from removing “withdrawal” from my search term is that the term “exit” no longer appears in the print-headline frequency table. This is logical.
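A sketch of one way to produce such frequency tables, assuming the quanteda.textstats package (its textstat_frequency() output has slightly different columns than the tables shown below):
library(quanteda.textstats)
#top six features for the main and print headlines, respectively
head(textstat_frequency(main_dfm), 6)
head(textstat_frequency(print_dfm), 6)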
For the main headlines:
            Frequency Rank
u.s               166    1
afghan            151    2
afghanistan       135    3
taliban           110    4
biden              95    5
war                64    6
And for the print headlines:
            Frequency Rank
u.s               171    1
afghan            109    2
taliban           108    3
afghanistan        84    4
biden              76    5
war                53    6
Now I can take a look at this network of feature co-occurrences again. This time, I’m also going to increase the feature count from 20 to 30.
First, for the main headlines:
#create a feature co-occurrence matrix from the dfm
main_fcm <- fcm(main_dfm)
dim(main_fcm)
[1] 2745 2745
# pull the top features
myFeatures <- names(topfeatures(main_fcm, 30))
# retain only those top features as part of our matrix
smaller_main_fcm <- fcm_select(main_fcm, pattern = myFeatures, selection = "keep")
# check dimensions
dim(smaller_main_fcm)
[1] 30 30
# compute size weight for vertices in network
size <- log(colSums(smaller_main_fcm))
# create plot
textplot_network(smaller_main_fcm, vertex_size = size / max(size) * 3)
and for the print headlines:
#create a feature co-occurrence matrix from the dfm
print_fcm <- fcm(print_dfm)
dim(print_fcm)
[1] 2661 2661
# pull the top features
myFeatures <- names(topfeatures(print_fcm, 30))
# retain only those top features as part of our matrix
smaller_print_fcm <- fcm_select(print_fcm, pattern = myFeatures, selection = "keep")
# check dimensions
dim(smaller_print_fcm)
[1] 30 30
# compute size weight for vertices in network
size <- log(colSums(smaller_print_fcm))
# create plot
textplot_network(smaller_print_fcm, vertex_size = size / max(size) * 3)
The resulting matrices have definitely changed, at least slightly.
To compare this analysis and my initial sentiment analysis on equal footing, I am going to use the three dictionaries we used in the course tutorial: the NRC, LSD(2015), and General Inquirer dictionaries.
I am first using the “liwcalike()” function from the quanteda.dictionaries package to apply the NRC dictionary. I can take a look at the head or tail for a snapshot of the sentiment scores applied to the corpus for each text group. Just at first glance, I can again see some differences in the scoring.
#use liwcalike() to estimate sentiment using NRC dictionary
#for main headlines
main_sentiment_nrc <- liwcalike(as.character(main_corpus), data_dictionary_NRC)
head(main_sentiment_nrc)[7:12]
anger anticipation disgust fear joy negative
1 0.00 0.00 0.00 0.00 0.00 13.33
2 0.00 0.00 0.00 7.69 0.00 7.69
3 5.00 0.00 0.00 5.00 0.00 5.00
4 0.00 0.00 0.00 5.26 0.00 10.53
5 8.33 0.00 8.33 8.33 0.00 8.33
6 0.00 5.88 5.88 0.00 11.76 0.00
#and print headlines
print_sentiment_nrc <- liwcalike(as.character(print_corpus), data_dictionary_NRC)
head(print_sentiment_nrc)[11:16]
joy negative positive sadness surprise trust
1 0.00 14.29 0.00 0.00 0.00 0.00
2 0.00 10.00 0.00 10.00 0.00 0.00
3 0.00 0.00 0.00 0.00 0.00 0.00
4 0.00 26.67 0.00 6.67 0.00 6.67
5 0.00 0.00 0.00 0.00 0.00 0.00
6 11.76 0.00 11.76 0.00 5.88 17.65
I can also put the results into a document feature matrix for each text group:
#convert tokens from each headline data set to DFM using the dictionary "NRC"
main_nrc <- dfm(main_tokens) %>%
dfm_lookup(data_dictionary_NRC)
print_nrc <- dfm(print_tokens) %>%
dfm_lookup(data_dictionary_NRC)
dim(main_nrc)
[1] 936 10
main_nrc
Document-feature matrix of: 936 documents, 10 features (67.61% sparse) and 2 docvars.
features
docs anger anticipation disgust fear joy negative positive sadness
1 0 0 0 0 0 2 0 0
2 0 0 0 1 0 1 1 0
3 1 0 0 1 0 1 0 0
4 0 0 0 1 0 2 0 0
5 1 0 1 1 0 1 0 1
6 0 1 1 0 2 0 2 0
features
docs surprise trust
1 0 0
2 0 1
3 0 0
4 0 0
5 0 0
6 1 3
[ reached max_ndoc ... 930 more documents ]
dim(print_nrc)
[1] 936 10
print_nrc
Document-feature matrix of: 936 documents, 10 features (69.21% sparse) and 2 docvars.
features
docs anger anticipation disgust fear joy negative positive sadness
1 0 0 0 0 0 2 0 0
2 0 0 0 2 0 1 0 1
3 0 0 0 1 0 0 0 0
4 1 0 1 1 0 4 0 1
5 0 0 0 0 0 0 0 0
6 0 1 1 0 2 0 2 0
features
docs surprise trust
1 0 0
2 0 0
3 0 0
4 0 1
5 0 0
6 1 3
[ reached max_ndoc ... 930 more documents ]
And use the information in a data frame to plot the output as represented by a calculation for polarity:
library(cowplot)
#for the main headlines
df_main_nrc <- convert(main_nrc, to = "data.frame")
df_main_nrc$polarity <- (df_main_nrc$positive - df_main_nrc$negative)/(df_main_nrc$positive + df_main_nrc$negative)
df_main_nrc$polarity[which((df_main_nrc$positive + df_main_nrc$negative) == 0)] <- 0
ggplot(df_main_nrc) +
geom_histogram(aes(x=polarity), bins = 15) +
theme_minimal_hgrid()
#and the print headlines
df_print_nrc <- convert(print_nrc, to = "data.frame")
df_print_nrc$polarity <- (df_print_nrc$positive - df_print_nrc$negative)/(df_print_nrc$positive + df_print_nrc$negative)
df_print_nrc$polarity[which((df_print_nrc$positive + df_print_nrc$negative) == 0)] <- 0
ggplot(df_print_nrc) +
geom_histogram(aes(x=polarity), bins = 15) +
theme_minimal_hgrid()
Looking at the headlines scored as “1”, i.e. positive in sentiment, this expanded corpus reflects more positivity than the top results from the smaller corpus did.
Corpus consisting of 6 documents and 2 docvars.
6 :
"‘Finally, I Am Safe’: U.S. Air Base Becomes Temporary Refuge..."
8 :
"‘Football Is Like Food’: Afghan Female Soccer Players Find a..."
18 :
"‘Is Austin on Your List?’: Biden’s Pentagon Pick Rose Despit..."
29 :
"‘We Have to Try’: Lawmakers Rush to Assist in Afghanistan Ev..."
40 :
"4 Takeaways From the U.S. Deal With the Taliban"
43 :
"98 Countries Pledge to Accept Afghans After U.S. Military De..."
Corpus consisting of 6 documents and 2 docvars.
6 :
"‘Finally, I Am Safe’: Thousands Find Temporary Refuge at U.S..."
15 :
"Veterans Feel Urgency to Aid Afghan Allies"
18 :
"How Biden’s Defense Nominee Overcame Barriers to Diversity"
25 :
"How Biden, by Turns Genial and Blunt, Built Diplomatic Bridg..."
45 :
"Rescue Flight To Germany Inspires Name For Newborn"
50 :
"A Call for the Return of Civility, And Truth as a Guiding Li..."
I want to look at multiple dictionaries to see which one applies best to this data. Next, the LSD(2015) dictionary:
# convert main corpus to DFM using the LSD2015 dictionary
main_lsd2015 <- dfm(main_tokens) %>%
dfm_lookup(data_dictionary_LSD2015)
# create main polarity measure for LSD2015
main_lsd2015 <- convert(main_lsd2015, to = "data.frame")
main_lsd2015$polarity <- (main_lsd2015$positive - main_lsd2015$negative)/(main_lsd2015$positive + main_lsd2015$negative)
main_lsd2015$polarity[which((main_lsd2015$positive + main_lsd2015$negative) == 0)] <- 0
# convert print corpus to DFM using the LSD2015 dictionary
print_lsd2015 <- dfm(print_tokens) %>%
dfm_lookup(data_dictionary_LSD2015)
# create print polarity measure for LSD2015
print_lsd2015 <- convert(print_lsd2015, to = "data.frame")
print_lsd2015$polarity <- (print_lsd2015$positive - print_lsd2015$negative)/(print_lsd2015$positive + print_lsd2015$negative)
print_lsd2015$polarity[which((print_lsd2015$positive + print_lsd2015$negative) == 0)] <- 0
Looking at the headlines scored as “1”, positive in sentiment, I can again see why these specific headlines are being evaluated as ‘positive’, though with more aberrations than under the NRC dictionary.
Corpus consisting of 6 documents and 2 docvars.
6 :
"‘Finally, I Am Safe’: U.S. Air Base Becomes Temporary Refuge..."
8 :
"‘Football Is Like Food’: Afghan Female Soccer Players Find a..."
19 :
"‘It’s Like Falling in Love’: Israeli Entrepreneurs Welcomed ..."
25 :
"‘Strategic Empathy’: How Biden’s Informal Diplomacy Shaped F..."
29 :
"‘We Have to Try’: Lawmakers Rush to Assist in Afghanistan Ev..."
43 :
"98 Countries Pledge to Accept Afghans After U.S. Military De..."
Corpus consisting of 6 documents and 2 docvars.
6 :
"‘Finally, I Am Safe’: Thousands Find Temporary Refuge at U.S..."
15 :
"Veterans Feel Urgency to Aid Afghan Allies"
29 :
"As Panicked Afghans Seek Help, Lawmakers Say, ‘We Have to Tr..."
31 :
"‘She Was Alone’: One Official’s Harrowing Escape From Kabul"
43 :
"98 Countries Pledge to Accept Afghans After U.S. Departure"
45 :
"Rescue Flight To Germany Inspires Name For Newborn"
And use the information in a data frame to plot the output as represented by a calculation for polarity:
#for the main headlines
ggplot(main_lsd2015) +
geom_histogram(aes(x=polarity), bins = 15) +
theme_minimal_hgrid()
#and the print headlines
ggplot(print_lsd2015) +
geom_histogram(aes(x=polarity), bins = 15) +
theme_minimal_hgrid()
and the General Inquirer dictionary:
# convert main corpus to DFM using the General Inquirer dictionary
main_geninq <- dfm(main_tokens) %>%
dfm_lookup(data_dictionary_geninqposneg)
# create main polarity measure for GenInq
main_geninq <- convert(main_geninq, to = "data.frame")
main_geninq$polarity <- (main_geninq$positive - main_geninq$negative)/(main_geninq$positive + main_geninq$negative)
main_geninq$polarity[which((main_geninq$positive + main_geninq$negative) == 0)] <- 0
# convert print corpus to DFM using the General Inquirer dictionary
print_geninq <- dfm(print_tokens) %>%
dfm_lookup(data_dictionary_geninqposneg)
# create print polarity measure for GenInq
print_geninq <- convert(print_geninq, to = "data.frame")
print_geninq$polarity <- (print_geninq$positive - print_geninq$negative)/(print_geninq$positive + print_geninq$negative)
print_geninq$polarity[which((print_geninq$positive + print_geninq$negative) == 0)] <- 0
Looking at the headlines scored as “1”, positive in sentiment, this dictionary is even more of a mixed bag. The rationale behind each score is clear, but it is equally clear that the scoring comes at the expense of subtle subject-matter knowledge.
Corpus consisting of 6 documents and 2 docvars.
6 :
"‘Finally, I Am Safe’: U.S. Air Base Becomes Temporary Refuge..."
8 :
"‘Football Is Like Food’: Afghan Female Soccer Players Find a..."
12 :
"‘I Forget About the World:’ Afghan Youth Find Escape in a Vi..."
19 :
"‘It’s Like Falling in Love’: Israeli Entrepreneurs Welcomed ..."
27 :
"‘They Just Left Us’: Greece Is Accused of Setting Migrants A..."
43 :
"98 Countries Pledge to Accept Afghans After U.S. Military De..."
Corpus consisting of 6 documents and 2 docvars.
6 :
"‘Finally, I Am Safe’: Thousands Find Temporary Refuge at U.S..."
15 :
"Veterans Feel Urgency to Aid Afghan Allies"
18 :
"How Biden’s Defense Nominee Overcame Barriers to Diversity"
33 :
"‘Why Do We Deserve to Die?’ Burying Hazara Girls in Kabul"
42 :
"New Wave of Refugees Faces A Much Chillier U.S. Welcome"
43 :
"98 Countries Pledge to Accept Afghans After U.S. Departure"
And use the information in a data frame to plot the output as represented by a calculation for polarity:
#for the main headlines
ggplot(main_geninq) +
geom_histogram(aes(x=polarity), bins = 15) +
theme_minimal_hgrid()
#and the print headlines
ggplot(print_geninq) +
geom_histogram(aes(x=polarity), bins = 15) +
theme_minimal_hgrid()
Now I can combine the different dictionary scores into one data frame for each type of headline and compare them.
# create unique names for each main headline dataframe
colnames(df_main_nrc) <- paste("nrc", colnames(df_main_nrc), sep = "_")
colnames(main_lsd2015) <- paste("lsd2015", colnames(main_lsd2015), sep = "_")
colnames(main_geninq) <- paste("geninq", colnames(main_geninq), sep = "_")
# now let's compare our estimates
main_sent <- merge(df_main_nrc, main_lsd2015, by.x = "nrc_doc_id", by.y = "lsd2015_doc_id")
main_sent <- merge(main_sent, main_geninq, by.x = "nrc_doc_id", by.y = "geninq_doc_id")
head(main_sent)[1:5]
nrc_doc_id nrc_anger nrc_anticipation nrc_disgust nrc_fear
1 1 0 0 0 0
2 10 0 0 0 2
3 100 0 0 0 1
4 101 0 2 0 0
5 102 0 1 0 0
6 103 1 1 1 1
# create unique names for each print headline dataframe
colnames(df_print_nrc) <- paste("nrc", colnames(df_print_nrc), sep = "_")
colnames(print_lsd2015) <- paste("lsd2015", colnames(print_lsd2015), sep = "_")
colnames(print_geninq) <- paste("geninq", colnames(print_geninq), sep = "_")
# now let's compare our estimates
print_sent <- merge(df_print_nrc, print_lsd2015, by.x = "nrc_doc_id", by.y = "lsd2015_doc_id")
print_sent <- merge(print_sent, print_geninq, by.x = "nrc_doc_id", by.y = "geninq_doc_id")
head(print_sent)[1:5]
nrc_doc_id nrc_anger nrc_anticipation nrc_disgust nrc_fear
1 1 0 0 0 0
2 10 0 0 0 2
3 100 0 0 0 1
4 101 0 2 0 0
5 102 0 1 0 0
6 103 1 0 1 1
Now that I have them all in a single data frame, it’s straightforward to see how well the different measures of polarity agree across approaches by looking at their correlation using the “cor()” function.
It seems that the polarity scores of the headlines agree more closely in this expanded analysis.
cor(main_sent$nrc_polarity, main_sent$lsd2015_polarity)
[1] 0.4619394
cor(main_sent$nrc_polarity, main_sent$geninq_polarity)
[1] 0.4702233
cor(main_sent$lsd2015_polarity, main_sent$geninq_polarity)
[1] 0.5321535
cor(print_sent$nrc_polarity, print_sent$lsd2015_polarity)
[1] 0.4700115
cor(print_sent$nrc_polarity, print_sent$geninq_polarity)
[1] 0.4824528
cor(print_sent$lsd2015_polarity, print_sent$geninq_polarity)
[1] 0.4933739
I can take a quick visual look at the correlation between sentiments detected in both sets of headlines using the “GGally” package. There seems to be very little difference in that regard.
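A sketch of the kind of call that produces that view, assuming GGally’s ggpairs() applied to the three polarity columns of each combined data frame:
library(GGally)
#pairwise scatterplots and correlations for the three polarity measures
ggpairs(main_sent[, c("nrc_polarity", "lsd2015_polarity", "geninq_polarity")])
ggpairs(print_sent[, c("nrc_polarity", "lsd2015_polarity", "geninq_polarity")])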
Finally, I want to look visually at the correlations of positive and negative sentiments as my starting point for understanding relationships between my sentiment analyses and dictionaries. I’ll start by pulling the polarity scores from each text source into its own object and changing the column names so they are unique except for ‘doc_id’, which I use to join them into one data frame.
#select() is from dplyr, join() from plyr
library(plyr)
library(dplyr)
corr_main <- main_sent %>%
  select(nrc_doc_id, nrc_polarity, lsd2015_polarity, geninq_polarity)
colnames(corr_main) <- c("doc_id", "main_nrc", "main_lsd", "main_geninq")
corr_print <- print_sent %>%
  select(nrc_doc_id, nrc_polarity, lsd2015_polarity, geninq_polarity)
colnames(corr_print) <- c("doc_id", "print_nrc", "print_lsd", "print_geninq")
corr_matrix <- join(corr_main, corr_print, by = "doc_id")
head(corr_matrix)
doc_id main_nrc main_lsd main_geninq print_nrc print_lsd
1 1 -1 -1 0 -1.0000000 -1.0000000
2 10 -1 -1 -1 -1.0000000 -1.0000000
3 100 -1 -1 -1 -1.0000000 -1.0000000
4 101 1 0 1 0.3333333 -0.3333333
5 102 1 1 1 1.0000000 1.0000000
6 103 0 0 0 -1.0000000 -1.0000000
print_geninq
1 0.0000000
2 -1.0000000
3 -1.0000000
4 0.3333333
5 1.0000000
6 -1.0000000
Then I can fit a linear model for each relationship:
#regress each main-headline polarity measure on its print counterpart
lm_nrc <- lm(main_nrc ~ print_nrc, data = corr_matrix)
lm_lsd <- lm(main_lsd ~ print_lsd, data = corr_matrix)
lm_geninq <- lm(main_geninq ~ print_geninq, data = corr_matrix)
summary(lm_nrc)
Call:
lm(formula = main_nrc ~ print_nrc, data = corr_matrix)
Residuals:
Min 1Q Median 3Q Max
-1.36631 -0.44844 0.09262 0.55156 1.55156
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.09262 0.02090 -4.432 1.05e-05 ***
print_nrc 0.45893 0.02838 16.173 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6348 on 934 degrees of freedom
Multiple R-squared: 0.2188, Adjusted R-squared: 0.2179
F-statistic: 261.6 on 1 and 934 DF, p-value: < 2.2e-16
summary(lm_lsd)
Call:
lm(formula = main_lsd ~ print_lsd, data = corr_matrix)
Residuals:
Min 1Q Median 3Q Max
-1.3257 -0.3098 -0.3098 0.3516 1.6903
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.18226 0.02113 -8.627 <2e-16 ***
print_lsd 0.50799 0.02739 18.549 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5908 on 934 degrees of freedom
Multiple R-squared: 0.2692, Adjusted R-squared: 0.2684
F-statistic: 344.1 on 1 and 934 DF, p-value: < 2.2e-16
summary(lm_geninq)
Call:
lm(formula = main_geninq ~ print_geninq, data = corr_matrix)
Residuals:
Min 1Q Median 3Q Max
-1.2563 -0.4979 0.1229 0.5021 1.5021
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.12292 0.02168 -5.669 1.91e-08 ***
print_geninq 0.37920 0.02979 12.730 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6409 on 934 degrees of freedom
Multiple R-squared: 0.1478, Adjusted R-squared: 0.1469
F-statistic: 162 on 1 and 934 DF, p-value: < 2.2e-16
Now I can check whether there is any meaningful difference between the models. Despite the differences between the expanded dataset and the original one, there still does not seem to be any meaningful difference between main and print headlines based on these analyses.
#create a data frame from the NRC model results; tidy() is from the broom package
library(broom)
tidynrc <- tidy(lm_nrc, conf.int = FALSE)
#round the results to 3 decimal points
tidynrc <- tidynrc %>%
mutate_if(is.numeric, round, 3)
tidynrc$model <- c("nrc")
#create a data frame from the LSD model results
tidylsd <- tidy(lm_lsd, conf.int = FALSE)
#round the results to 3 decimal points
tidylsd <- tidylsd %>%
mutate_if(is.numeric, round, 3)
tidylsd$model <- c("lsd")
#create a data frame from the Gen Inq model results
tidygeninq <- tidy(lm_geninq, conf.int = FALSE)
#round the results to 3 decimal points
tidygeninq <- tidygeninq %>%
mutate_if(is.numeric, round, 3)
tidygeninq$model <- c("geninq")
tidy_all <- do.call("rbind", list(tidynrc, tidylsd, tidygeninq))
tidy_all
# A tibble: 6 x 6
term estimate std.error statistic p.value model
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 (Intercept) -0.093 0.021 -4.43 0 nrc
2 print_nrc 0.459 0.028 16.2 0 nrc
3 (Intercept) -0.182 0.021 -8.63 0 lsd
4 print_lsd 0.508 0.027 18.5 0 lsd
5 (Intercept) -0.123 0.022 -5.67 0 geninq
6 print_geninq 0.379 0.03 12.7 0 geninq
Finally, I want to visualize the count of words associated with each NRC sentiment for the two headline types, starting from the NRC document-feature matrices.
#main headlines
head(main_nrc)
Document-feature matrix of: 6 documents, 10 features (65.00% sparse) and 2 docvars.
features
docs anger anticipation disgust fear joy negative positive sadness
1 0 0 0 0 0 2 0 0
2 0 0 0 1 0 1 1 0
3 1 0 0 1 0 1 0 0
4 0 0 0 1 0 2 0 0
5 1 0 1 1 0 1 0 1
6 0 1 1 0 2 0 2 0
features
docs surprise trust
1 0 0
2 0 1
3 0 0
4 0 0
5 0 0
6 1 3
#transpose so that sentiments are rows and documents are columns
main_df <- data.frame(t(main_nrc))
#rowSums totals each sentiment's count across all 936 documents
df_new <- data.frame(rowSums(main_df[1:936]))
#Transformation and cleaning
names(df_new)[1] <- "count"
df_new <- cbind("sentiment" = rownames(df_new), df_new)
rownames(df_new) <- NULL
df_new2 <- df_new[1:10,]
#reload the counts from csv (saved out in a previous step)
df_new2 <- read.csv("df_new2.csv")
#Plot One - count of words associated with each sentiment
quickplot(sentiment, data=df_new2, weight=count, geom="bar", fill=sentiment, ylab="count")+ggtitle("Main Headline Sentiments")
#print headlines
head(print_nrc)
Document-feature matrix of: 6 documents, 10 features (71.67% sparse) and 2 docvars.
features
docs anger anticipation disgust fear joy negative positive sadness
1 0 0 0 0 0 2 0 0
2 0 0 0 2 0 1 0 1
3 0 0 0 1 0 0 0 0
4 1 0 1 1 0 4 0 1
5 0 0 0 0 0 0 0 0
6 0 1 1 0 2 0 2 0
features
docs surprise trust
1 0 0
2 0 0
3 0 0
4 0 1
5 0 0
6 1 3
#transpose so that sentiments are rows and documents are columns
print_df <- data.frame(t(print_nrc))
#rowSums totals each sentiment's count across all 936 documents
df_new3 <- data.frame(rowSums(print_df[1:936]))
#Transformation and cleaning
names(df_new3)[1] <- "count"
df_new3 <- cbind("sentiment" = rownames(df_new3), df_new3)
rownames(df_new3) <- NULL
df_new4 <- df_new3[1:10,]
#reload the counts from csv (saved out in a previous step)
df_new4 <- read.csv("df_new4.csv")
#Plot One - count of words associated with each sentiment
quickplot(sentiment, data=df_new4, weight=count, geom="bar", fill=sentiment, ylab="count")+ggtitle("Print Headline Sentiments")
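The df_dual data frame loaded below was assembled from the two count frames; a sketch of that step, assuming df_new2 and df_new4 hold the main and print sentiment counts with ‘sentiment’ and ‘count’ columns:
#stack the two count frames, tagging each row with its headline type
df_dual <- rbind(
  cbind(df_new2, headline = "main"),
  cbind(df_new4, headline = "print")
)
write.csv(df_dual, "df_dual.csv", row.names = FALSE)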
#load data frame with both headline sentiments from NRC
df_dual <- read.csv("df_dual.csv")
ggplot(df_dual,
aes(x = sentiment,
y = count,
fill = headline)) +
geom_bar(stat = "identity",
position = "dodge") +
scale_fill_manual(values=c("#993333", "#336699")) +
theme_minimal()
Although I was not able to find a statistically meaningful, measurable difference between the sentiments of print vs. main headlines analyzed in this project, there is still an overall pattern worth noting.
Specifically, the print headlines carry a lower level of emotionally weighted words than the online headlines. I can’t state that as a factual finding yet, but I would like to pursue further, expanded studies of full articles along this research path to investigate the hypothesis.
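As a first rough check in that direction (a heuristic, not a statistical test), the rate of NRC dictionary hits per token could be compared across headline types; a sketch, assuming df_new2 and df_new4 hold the sentiment counts as above:
#total dictionary hits divided by total tokens, per headline type
sum(df_new2$count) / sum(ntoken(main_tokens)) #main headlines
sum(df_new4$count) / sum(ntoken(print_tokens)) #print headlines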
This research makes use of the NRC Word-Emotion Association Lexicon, created by Saif Mohammad and Peter Turney at the National Research Council Canada.
This research makes use of the Lexicoder Sentiment Dictionary (LSD), first published in Young, L. & Soroka, S. (2012). Affective News: The Automated Coding of Sentiment in Political Texts. Political Communication, 29(2), 205–231. doi: 10.1080/10584609.2012.671234.
This research makes use of the General Inquirer Lexicon: Stone, P. J. (1962). The General Inquirer: A computer system for content analysis and retrieval based on the sentence as a unit of information. Laboratory of Social Relations, Harvard University.