Adding on to my initial sentiment analysis with new methods and dictionaries.
This time I'll work through a different process than in my first two analyses, using an alternative sequence to compare three sentiment lexicons available through the “tidytext” package.
The dictionaries are “bing” and “nrc” (which I used previously) and “AFINN”.
Loading the libraries and the data for the expanded analysis:
#load packages: unnest_tokens() and get_sentiments() come from tidytext;
#the joins and pipes used throughout come from dplyr
library(tidytext)
library(dplyr)
#load data and make sure each file is a plain data frame
main_headlines <- read.csv("afghanistan_headlines_main.csv")
main_headlines <- as.data.frame(main_headlines)
print_headlines <- read.csv("afghanistan_headlines_print.csv")
print_headlines <- as.data.frame(print_headlines)
The bing lexicon sorts words into binary “positive” and “negative” categories.
First I’ll create tokens for the main and print headlines.
#create tokens without stop words for main headlines
tkn_l_main <- apply(main_headlines, 1, function(x) { data.frame(text=x, stringsAsFactors = FALSE) %>% unnest_tokens(word, text)})
main_news_tokens <- lapply(tkn_l_main, function(x) {anti_join(x, stop_words)})
str(main_news_tokens, list.len = 5)
List of 936
$ :'data.frame': 12 obs. of 1 variable:
..$ word: chr [1:12] "1" "7" "17" "2020" ...
$ :'data.frame': 12 obs. of 1 variable:
..$ word: chr [1:12] "2" "8" "30" "2020" ...
$ :'data.frame': 12 obs. of 1 variable:
..$ word: chr [1:12] "3" "6" "2" "2021" ...
$ :'data.frame': 11 obs. of 1 variable:
..$ word: chr [1:11] "4" "12" "20" "2020" ...
$ :'data.frame': 10 obs. of 1 variable:
..$ word: chr [1:10] "5" "9" "11" "2021" ...
[list output truncated]
main_news_tokens[[1]]
word
doc_id 1
date...2 7
date...3 17
date...4 2020
text...5 174
text...6 million
text...7 afghan
text...8 drone
text...9 program
text...10 riddled
text...11 u.s
text...12 report
#create tokens without stop words for print headlines
tkn_l_print <- apply(print_headlines, 1, function(x) { data.frame(text=x, stringsAsFactors = FALSE) %>% unnest_tokens(word, text)})
print_news_tokens <- lapply(tkn_l_print, function(x) {anti_join(x, stop_words)})
str(print_news_tokens, list.len = 5)
List of 936
$ :'data.frame': 11 obs. of 1 variable:
..$ word: chr [1:11] "1" "7" "17" "2020" ...
$ :'data.frame': 10 obs. of 1 variable:
..$ word: chr [1:10] "2" "8" "30" "2020" ...
$ :'data.frame': 10 obs. of 1 variable:
..$ word: chr [1:10] "3" "6" "2" "2021" ...
$ :'data.frame': 12 obs. of 1 variable:
..$ word: chr [1:12] "4" "12" "20" "2020" ...
$ :'data.frame': 7 obs. of 1 variable:
..$ word: chr [1:7] "5" "9" "11" "2021" ...
[list output truncated]
print_news_tokens[[1]]
word
doc_id 1
date...2 7
date...3 17
date...4 2020
text...5 174
text...6 million
text...7 drone
text...8 program
text...9 afghans
text...10 riddled
text...11 pentagon
Next I need a scoring function that turns each headline's lexicon matches into a single number, and then I can apply it to both headline data sets. (The numeric tokens from the doc_id and date columns won't match any lexicon word, so they drop out at the join.)
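The scoring helper itself doesn't appear above, so here is a minimal sketch of what compute_sentiment() presumably does, assuming it counts matched positive words minus matched negative words and returns NA when a headline contains no lexicon words at all (the source of the NA's in the summaries below):
compute_sentiment <- function(d) {
  #no lexicon matches for this headline: nothing to score
  if (nrow(d) == 0) {
    return(NA)
  }
  #the score is the count of positive matches minus the count of negative matches
  neg_score <- d %>% filter(sentiment == "negative") %>% nrow()
  pos_score <- d %>% filter(sentiment == "positive") %>% nrow()
  pos_score - neg_score
}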
sentiments_bing <- get_sentiments("bing")
#apply sentiment to main headlines
main_news_sentiment_bing <- sapply(main_news_tokens, function(x) { x %>% inner_join(sentiments_bing) %>% compute_sentiment()})
#apply sentiment to print headlines
print_news_sentiment_bing <- sapply(print_news_tokens, function(x) { x %>% inner_join(sentiments_bing) %>% compute_sentiment()})
The summaries of each show the number of NA's (headlines with no bing matches):
summary(main_news_sentiment_bing)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-4.0000 -1.0000 -1.0000 -0.5945 1.0000 2.0000 349
summary(print_news_sentiment_bing)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-4.0000 -1.0000 -1.0000 -0.6209 0.0000 3.0000 353
Now I can look at the first 10 headlines and their corresponding bing scores; even within these 10, the scores vary.
#head 10 main headlines with bing analysis scores
main_news_sentiment_bing_df <- data.frame(main_text=main_headlines$text, score = main_news_sentiment_bing)
head(main_news_sentiment_bing_df, 10)
main_text
1 $174 Million Afghan Drone Program Is Riddled With Problems, U.S. Report Says
2 ‘A Hail Mary’: Psychedelic Therapy Draws Veterans to Jungle Retreats
3 ‘Come On In, Boys’: A Wave of the Hand Sets Off Spain-Morocco Migrant Fight
4 ‘Covid Can’t Compete.’ In a Place Mired in War, the Virus Is an Afterthought.
5 ‘Everything Changed Overnight’: Afghan Reporters Face an Intolerant Regime
6 ‘Finally, I Am Safe’: U.S. Air Base Becomes Temporary Refuge for Afghans
7 ‘Find Him and Kill Him’: An Afghan Pilot’s Desperate Escape
8 ‘Football Is Like Food’: Afghan Female Soccer Players Find a Home in Italy
9 ‘Go Big’ on Coronavirus Stimulus, Trump Says, Pitching Checks for Americans
10 ‘Hospital Needs to Be Quarantined,’ but Works On in Country at War
score
1 NA
2 1
3 NA
4 -1
5 NA
6 1
7 -2
8 NA
9 1
10 NA
#head 10 print headlines with bing analysis scores
print_news_sentiment_bing_df <- data.frame(print_text=print_headlines$text, score = print_news_sentiment_bing)
head(print_news_sentiment_bing_df, 10)
print_text
1 $174 Million Drone Program for Afghans Is Riddled With Problems, Pentagon Says
2 Psychedelic Therapy In the Jungle Soothes The Pain for Veterans
3 Morocco Sends Spanish Outpost a Migrant Influx
4 ‘It’s a Lie’: Denial and Skepticism Permeate a Nation Embroiled in War
5 ‘Everything Changed’: Media Face Crackdown
6 ‘Finally, I Am Safe’: Thousands Find Temporary Refuge at U.S. Air Base
7 ‘Find Him and Kill Him’: A Pilot’s Desperate Escape From Kabul
8 Soccer Players Under Threat Escape to Italy
9 Plan Would Inject $1 Trillion Into Economy
10 As Pandemic Takes Toll on Afghan Doctors, Hospitals Still Tend to War Wounded
score
1 NA
2 -1
3 NA
4 -4
5 NA
6 1
7 -2
8 -1
9 NA
10 -1
As I saw in my first two analyses, the NRC lexicon uses 10 different sentiments: negative and positive, plus eight more specific emotions.
sentiments_nrc <- get_sentiments("nrc")
(unique_sentiments_nrc <- unique(sentiments_nrc$sentiment))
[1] "trust" "fear" "negative" "sadness"
[5] "anger" "surprise" "positive" "disgust"
[9] "joy" "anticipation"
Next I will again create a function to assign sentiment labels, this time mapping each of the eight emotion categories onto a binary “positive” or “negative” interpretation.
compute_pos_neg_sentiments_nrc <- function(the_sentiments_nrc) {
  #the mapped_sentiment vector is ordered to match unique()'s output above:
  #trust, fear, negative, sadness, anger, surprise, positive, disgust, joy, anticipation
  s <- unique(the_sentiments_nrc$sentiment)
  df_sentiments <- data.frame(sentiment = s,
                              mapped_sentiment = c("positive", "negative", "negative", "negative",
                                                   "negative", "positive", "positive", "negative",
                                                   "positive", "positive"))
  #join on the argument (not the global lexicon) so the rows line up one to one
  ss <- the_sentiments_nrc %>% inner_join(df_sentiments, by = "sentiment")
  the_sentiments_nrc$sentiment <- ss$mapped_sentiment
  the_sentiments_nrc
}
nrc_sentiments_pos_neg_scale <- compute_pos_neg_sentiments_nrc(sentiments_nrc)
Then I can apply the remapped lexicon to the headline data sets:
#calculating NRC sentiment for main headlines
main_news_sentiment_nrc <- sapply(main_news_tokens, function(x) { x %>% inner_join(nrc_sentiments_pos_neg_scale) %>% compute_sentiment()})
#calculating NRC sentiment for print headlines
print_news_sentiment_nrc <- sapply(print_news_tokens, function(x) { x %>% inner_join(nrc_sentiments_pos_neg_scale) %>% compute_sentiment()})
The summaries of each show the number of NA's is much smaller, since NRC matches many more words:
summary(main_news_sentiment_nrc)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-12.0000 -3.0000 -1.0000 -0.7417 1.0000 13.0000 150
summary(print_news_sentiment_nrc)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-12.0000 -3.0000 -1.0000 -0.4994 2.0000 13.0000 151
Now I can look at the first 10 headlines and the corresponding NRC scores. The scores vary here as well.
#data frame of main NRC sentiment
main_news_sentiment_nrc_df <- data.frame(main_text=main_headlines$text, score = main_news_sentiment_nrc)
head(main_news_sentiment_nrc_df, 10)
main_text
1 $174 Million Afghan Drone Program Is Riddled With Problems, U.S. Report Says
2 ‘A Hail Mary’: Psychedelic Therapy Draws Veterans to Jungle Retreats
3 ‘Come On In, Boys’: A Wave of the Hand Sets Off Spain-Morocco Migrant Fight
4 ‘Covid Can’t Compete.’ In a Place Mired in War, the Virus Is an Afterthought.
5 ‘Everything Changed Overnight’: Afghan Reporters Face an Intolerant Regime
6 ‘Finally, I Am Safe’: U.S. Air Base Becomes Temporary Refuge for Afghans
7 ‘Find Him and Kill Him’: An Afghan Pilot’s Desperate Escape
8 ‘Football Is Like Food’: Afghan Female Soccer Players Find a Home in Italy
9 ‘Go Big’ on Coronavirus Stimulus, Trump Says, Pitching Checks for Americans
10 ‘Hospital Needs to Be Quarantined,’ but Works On in Country at War
score
1 -2
2 0
3 -3
4 -3
5 -5
6 8
7 -4
8 6
9 1
10 -3
#data frame of print NRC sentiment
print_news_sentiment_nrc_df <- data.frame(print_text=print_headlines$text, score = print_news_sentiment_nrc)
head(print_news_sentiment_nrc_df, 10)
print_text
1 $174 Million Drone Program for Afghans Is Riddled With Problems, Pentagon Says
2 Psychedelic Therapy In the Jungle Soothes The Pain for Veterans
3 Morocco Sends Spanish Outpost a Migrant Influx
4 ‘It’s a Lie’: Denial and Skepticism Permeate a Nation Embroiled in War
5 ‘Everything Changed’: Media Face Crackdown
6 ‘Finally, I Am Safe’: Thousands Find Temporary Refuge at U.S. Air Base
7 ‘Find Him and Kill Him’: A Pilot’s Desperate Escape From Kabul
8 Soccer Players Under Threat Escape to Italy
9 Plan Would Inject $1 Trillion Into Economy
10 As Pandemic Takes Toll on Afghan Doctors, Hospitals Still Tend to War Wounded
score
1 -2
2 -4
3 -1
4 -7
5 NA
6 8
7 -4
8 -3
9 2
10 -5
The AFINN lexicon has valence ratings between -5 (negative) and +5 (positive).
sentiments_afinn <- get_sentiments("afinn")
#rename AFINN's numeric "value" column to "sentiment" to match the other lexicons
colnames(sentiments_afinn) <- c("word", "sentiment")
This time no separate scoring function is needed: I join the tokens to AFINN and sum the valence values directly, keeping NA for headlines with no matches.
#applying AFINN sentiment to main headlines
main_news_sentiment_afinn_df <- lapply(main_news_tokens, function(x) { x %>% inner_join(sentiments_afinn)})
main_news_sentiment_afinn <- sapply(main_news_sentiment_afinn_df, function(x) {
ifelse(nrow(x) > 0, sum(x$sentiment), NA)
})
#applying AFINN sentiment to print headlines
print_news_sentiment_afinn_df <- lapply(print_news_tokens, function(x) { x %>% inner_join(sentiments_afinn)})
print_news_sentiment_afinn <- sapply(print_news_sentiment_afinn_df, function(x) {
ifelse(nrow(x) > 0, sum(x$sentiment), NA)
})
The summaries of each show the number of NA's is similar to that of the bing lexicon.
summary(main_news_sentiment_afinn)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-10.000 -3.000 -2.000 -1.769 -1.000 5.000 368
summary(print_news_sentiment_afinn)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-10.00 -3.00 -2.00 -1.52 -1.00 6.00 359
Now I can look at the first 10 headlines and the corresponding AFINN analysis scores. I can see that the scores vary a lot less than in the first two lexicons.
#data frame of AFINN main headlines
main_news_sentiment_afinn_df <- data.frame(main_text=main_headlines$text, score = main_news_sentiment_afinn)
head(main_news_sentiment_afinn_df, 10)
main_text
1 $174 Million Afghan Drone Program Is Riddled With Problems, U.S. Report Says
2 ‘A Hail Mary’: Psychedelic Therapy Draws Veterans to Jungle Retreats
3 ‘Come On In, Boys’: A Wave of the Hand Sets Off Spain-Morocco Migrant Fight
4 ‘Covid Can’t Compete.’ In a Place Mired in War, the Virus Is an Afterthought.
5 ‘Everything Changed Overnight’: Afghan Reporters Face an Intolerant Regime
6 ‘Finally, I Am Safe’: U.S. Air Base Becomes Temporary Refuge for Afghans
7 ‘Find Him and Kill Him’: An Afghan Pilot’s Desperate Escape
8 ‘Football Is Like Food’: Afghan Female Soccer Players Find a Home in Italy
9 ‘Go Big’ on Coronavirus Stimulus, Trump Says, Pitching Checks for Americans
10 ‘Hospital Needs to Be Quarantined,’ but Works On in Country at War
score
1 NA
2 2
3 -1
4 -2
5 NA
6 1
7 -7
8 NA
9 NA
10 -2
#data frame of AFINN print headlines
print_news_sentiment_afinn_df <- data.frame(print_text=print_headlines$text, score = print_news_sentiment_afinn)
head(print_news_sentiment_afinn_df, 10)
print_text
1 $174 Million Drone Program for Afghans Is Riddled With Problems, Pentagon Says
2 Psychedelic Therapy In the Jungle Soothes The Pain for Veterans
3 Morocco Sends Spanish Outpost a Migrant Influx
4 ‘It’s a Lie’: Denial and Skepticism Permeate a Nation Embroiled in War
5 ‘Everything Changed’: Media Face Crackdown
6 ‘Finally, I Am Safe’: Thousands Find Temporary Refuge at U.S. Air Base
7 ‘Find Him and Kill Him’: A Pilot’s Desperate Escape From Kabul
8 Soccer Players Under Threat Escape to Italy
9 Plan Would Inject $1 Trillion Into Economy
10 As Pandemic Takes Toll on Afghan Doctors, Hospitals Still Tend to War Wounded
score
1 NA
2 -2
3 NA
4 -4
5 NA
6 1
7 -7
8 -3
9 NA
10 -2
Having obtained three sentiment evaluations for each headline data set, next I will calculate their congruence.
By congruence I mean whether all of the lexicons agree on a positive or a negative result: the scores signal the same sentiment when they share the same sign, independently of each lexicon's scale of magnitude. If “NA” values are present, congruence is still computed as long as at least two non-“NA” values are available; otherwise the value is “NA”.
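The compute_congruence() helper used below is not defined above either; here is a minimal sketch consistent with that description (an assumed reconstruction, not necessarily the code that produced the output below):
compute_congruence <- function(x, y, z) {
  v <- c(sign(x), sign(y), sign(z))
  #with fewer than two non-NA scores there is nothing to compare
  if (sum(is.na(v)) >= 2) {
    return(NA)
  }
  v <- na.omit(v)
  #TRUE exactly when all remaining signs agree and none of them are zero
  abs(sum(v)) == length(v)
}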
Then I compute the final news sentiment from the sum of the three lexicon scores.
compute_final_sentiment <- function(x, y, z) {
  if (is.na(x) && is.na(y) && is.na(z)) {
    return(NA)
  }
  s <- sum(x, y, z, na.rm = TRUE)
  #positive sentiments have score strictly greater than zero
  #negative sentiments have score strictly less than zero
  #neutral sentiments have score equal to zero
  ifelse(s > 0, "positive", ifelse(s < 0, "negative", "neutral"))
}
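For example, compute_final_sentiment(1, -3, NA) sums the non-NA scores to -2 and returns “negative”, while compute_final_sentiment(NA, NA, NA) returns NA.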
Now I will put the sentiment results in new data frames and apply the analyses.
main_sentiments_results <- data.frame(main_text = main_headlines$text,
bing_score = main_news_sentiment_bing,
nrc_score = main_news_sentiment_nrc,
afinn_score = main_news_sentiment_afinn,
stringsAsFactors = FALSE)
print_sentiments_results <- data.frame(print_text = print_headlines$text,
bing_score = print_news_sentiment_bing,
nrc_score = print_news_sentiment_nrc,
afinn_score = print_news_sentiment_afinn,
stringsAsFactors = FALSE)
main_sentiments_results <- main_sentiments_results %>% rowwise() %>%
mutate(final_sentiment = compute_final_sentiment(bing_score, nrc_score, afinn_score),
congruence = compute_congruence(bing_score, nrc_score, afinn_score))
print_sentiments_results <- print_sentiments_results %>% rowwise() %>%
mutate(final_sentiment = compute_final_sentiment(bing_score, nrc_score, afinn_score),
congruence = compute_congruence(bing_score, nrc_score, afinn_score))
head(main_sentiments_results, 10)
# A tibble: 10 x 6
# Rowwise:
main_text bing_score nrc_score afinn_score final_sentiment
<chr> <int> <int> <dbl> <chr>
1 $174 Million Afgh~ NA -2 NA negative
2 ‘A Hail Mary’: Ps~ 1 0 2 negative
3 ‘Come On In, Boys~ NA -3 -1 negative
4 ‘Covid Can’t Comp~ -1 -3 -2 negative
5 ‘Everything Chang~ NA -5 NA negative
6 ‘Finally, I Am Sa~ 1 8 1 negative
7 ‘Find Him and Kil~ -2 -4 -7 negative
8 ‘Football Is Like~ NA 6 NA negative
9 ‘Go Big’ on Coron~ 1 1 NA negative
10 ‘Hospital Needs t~ NA -3 -2 negative
# ... with 1 more variable: congruence <lgl>
head(print_sentiments_results, 10)
# A tibble: 10 x 6
# Rowwise:
print_text bing_score nrc_score afinn_score final_sentiment
<chr> <int> <int> <dbl> <chr>
1 "$174 Million Dro~ NA -2 NA negative
2 "Psychedelic Ther~ -1 -4 -2 negative
3 "Morocco Sends Sp~ NA -1 NA negative
4 "‘It’s a Lie’: De~ -4 -7 -4 negative
5 "‘Everything Chan~ NA NA NA negative
6 "‘Finally, I Am S~ 1 8 1 negative
7 "‘Find Him and Ki~ -2 -4 -7 negative
8 "Soccer Players U~ -1 -3 -3 negative
9 "Plan Would Injec~ NA 2 NA negative
10 "As Pandemic Take~ -1 -5 -2 negative
# ... with 1 more variable: congruence <lgl>
It seems like I need to do more work on the congruence function, as I have all “NA” results. The final_sentiment column is also suspect: it reads “negative” even for rows whose scores sum positive (row 6, for instance), so the rowwise computation needs another look.
It will be useful to replace the numeric scores with the same {negative, neutral, positive} scale.
replace_score_with_sentiment <- function(v_score) {
  #map each numeric score to a label; NA scores stay NA
  ifelse(v_score > 0, "positive", ifelse(v_score < 0, "negative", "neutral"))
}
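A quick check: replace_score_with_sentiment(c(-2, 0, 3, NA)) returns “negative”, “neutral”, “positive”, NA.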
I'll apply this scale to the scores from all three lexicons in the results data frame for each headline set, converting those columns to factors.
#apply scale to main results
main_sentiments_results$bing_score <- replace_score_with_sentiment(main_sentiments_results$bing_score)
main_sentiments_results$nrc_score <- replace_score_with_sentiment(main_sentiments_results$nrc_score)
main_sentiments_results$afinn_score <- replace_score_with_sentiment(main_sentiments_results$afinn_score)
main_sentiments_results[,2:5] <- lapply(main_sentiments_results[,2:5], as.factor)
head(main_sentiments_results, 40)
# A tibble: 40 x 6
# Rowwise:
main_text bing_score nrc_score afinn_score final_sentiment
<chr> <fct> <fct> <fct> <fct>
1 $174 Million Afgh~ <NA> negative <NA> negative
2 ‘A Hail Mary’: Ps~ positive neutral positive negative
3 ‘Come On In, Boys~ <NA> negative negative negative
4 ‘Covid Can’t Comp~ negative negative negative negative
5 ‘Everything Chang~ <NA> negative <NA> negative
6 ‘Finally, I Am Sa~ positive positive positive negative
7 ‘Find Him and Kil~ negative negative negative negative
8 ‘Football Is Like~ <NA> positive <NA> negative
9 ‘Go Big’ on Coron~ positive positive <NA> negative
10 ‘Hospital Needs t~ <NA> negative negative negative
# ... with 30 more rows, and 1 more variable: congruence <lgl>
#apply scale to print results
print_sentiments_results$bing_score <- replace_score_with_sentiment(print_sentiments_results$bing_score)
print_sentiments_results$nrc_score <- replace_score_with_sentiment(print_sentiments_results$nrc_score)
print_sentiments_results$afinn_score <- replace_score_with_sentiment(print_sentiments_results$afinn_score)
print_sentiments_results[,2:5] <- lapply(print_sentiments_results[,2:5], as.factor)
head(print_sentiments_results, 40)
# A tibble: 40 x 6
# Rowwise:
print_text bing_score nrc_score afinn_score final_sentiment
<chr> <fct> <fct> <fct> <fct>
1 "$174 Million Dro~ <NA> negative <NA> negative
2 "Psychedelic Ther~ negative negative negative negative
3 "Morocco Sends Sp~ <NA> negative <NA> negative
4 "‘It’s a Lie’: De~ negative negative negative negative
5 "‘Everything Chan~ <NA> <NA> <NA> negative
6 "‘Finally, I Am S~ positive positive positive negative
7 "‘Find Him and Ki~ negative negative negative negative
8 "Soccer Players U~ negative negative negative negative
9 "Plan Would Injec~ <NA> positive <NA> negative
10 "As Pandemic Take~ negative negative negative negative
# ... with 30 more rows, and 1 more variable: congruence <lgl>
I'll join the overall sentiment results from both headline sets into one data frame and visualize them. Taking the “positive” or “negative” value held by the majority of the three evaluations, the dataset is overwhelmingly “negative” (100%).
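The join itself isn't shown above; here is a sketch of how the comparison frame could be assembled (the compare_results name is mine, and the column names follow the printout below):
compare_results <- data.frame(article = 1:nrow(main_sentiments_results),
                              print_bing = print_sentiments_results$bing_score,
                              print_nrc = print_sentiments_results$nrc_score,
                              print_afinn = print_sentiments_results$afinn_score,
                              print_final = print_sentiments_results$final_sentiment,
                              main_bing = main_sentiments_results$bing_score,
                              main_nrc = main_sentiments_results$nrc_score,
                              main_afinn = main_sentiments_results$afinn_score,
                              main_final = main_sentiments_results$final_sentiment)
head(compare_results)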
article print_bing print_nrc print_afinn print_final main_bing
1 1 <NA> negative <NA> negative <NA>
2 2 negative negative negative negative positive
3 3 <NA> negative <NA> negative <NA>
4 4 negative negative negative negative negative
5 5 <NA> <NA> <NA> negative <NA>
6 6 positive positive positive negative positive
main_nrc main_afinn main_final
1 negative <NA> negative
2 neutral positive negative
3 negative negative negative
4 negative negative negative
5 negative <NA> negative
6 positive positive negative
#load the per-date sentiment scores prepared for plotting
main_graph <- read.csv("main_graph.csv")
library(ggplot2)
main_plot <- main_graph %>%
  ggplot(aes(date, sentiment, fill = lexicon)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~lexicon, ncol = 1, scales = "free_y") +
  scale_fill_manual(values = c("#993333", "#336699", "#669900")) +
  theme_minimal()
main_plot
This research makes use of the NRC Word-Emotion Association Lexicon, created by Saif Mohammad and Peter Turney at the National Research Council Canada.
This research makes use of the Bing lexicon, first published in Minqing Hu and Bing Liu, “Mining and summarizing customer reviews,” Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-2004), 2004.
This research makes use of the AFINN lexicon: Nielsen, F. Å. (2011). A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. arXiv preprint arXiv:1103.2903.