Analysis of PDF Articles

text as data NYT text analysis project pdf analysis

Text as Data Project-Article Sentiment Research

Kristina Becvar (https://kbec19.github.io/NYT-Analysis/), UMass DACSS Program (My Academic Blog Link: https://kristinabecvar.com)
04/17/2022

Getting Started

The primary goal of this part of the research is to refine a process for examining the full text of the articles whose main and print headlines differ the most from each other in the primary project analysis.

Pulling in the PDF docs

I have the PDF files in my working directory. Using base R’s “list.files()” function, I can create a vector of PDF file names, specifying only files that end in “.pdf”. The “pdf_text()” function from the “pdftools” package then reads the text of each file.
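As a small illustration of how the “pattern” argument of “list.files()” behaves, the sketch below builds a throwaway directory with made-up file names (all of them assumptions for the example, not files from this project) and compares a loose pattern with one that requires a literal dot before the extension.

```r
# Hypothetical file names in a temporary directory, for illustration only
demo_dir <- file.path(tempdir(), "pdf_pattern_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir,
                      c("article_1.pdf", "article_2.pdf", "notes.txt", "notpdf")))

# "pdf$" matches any file name that ends in the letters "pdf"
loose <- list.files(demo_dir, pattern = "pdf$")

# "\\.pdf$" also requires a literal dot, so only real .pdf files match
strict <- list.files(demo_dir, pattern = "\\.pdf$")

loose
strict
```

In a folder that holds only PDF files the two patterns return the same names; the stricter one simply guards against stray files.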


Extracting PDF Files being examined (random at this time - exploratory)

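#load the package that provides pdf_text()
library(pdftools)

#create a vector of file names ending in ".pdf"
files <- list.files(pattern = "\\.pdf$")

#extract the pdf file data (one character string per page)
nyt_articles <- lapply(files, pdf_text)

#count the pages in each article
lapply(nyt_articles, length)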
[[1]]
[1] 4

[[2]]
[1] 2

[[3]]
[1] 3

[[4]]
[1] 4

[[5]]
[1] 10

[[6]]
[1] 6

[[7]]
[1] 2

[[8]]
[1] 5

[[9]]
[1] 5

[[10]]
[1] 2
#view the structure of the list
str(nyt_articles)
List of 10
 $ : chr [1:4] "                                 https://www.nytimes.com/2021/04/13/us/politics/samantha-power-biden.html\n\n\n"| __truncated__ "                                Secretary of State Antony J. Blinken at the opening session of talks with China"| __truncated__ "The mostly benign prodding by Democrats and Republicans during the hearing signaled how countering China has be"| __truncated__ "“There is so much that can be done between bombing and nothing,” Mr. Prendergast said, paraphrasing Luis Moreno"| __truncated__
 $ : chr [1:2] "                            https://www.nytimes.com/2021/04/29/world/asia/central-asia-border-\n               "| __truncated__ "In announcing the cease-fire, the Kyrgyz Ministry of Interior said that it “does not have\ndesigns on foreign t"| __truncated__
 $ : chr [1:3] "                                https://www.nytimes.com/2021/08/05/us/politics/taliban-afghanistan-peace-deal.h"| __truncated__ "The statement came as Taliban representatives met with Afghan government officials, including Mr. Abdullah, for"| __truncated__ "“The Taliban is not interested in negotiating seriously right now because of what’s happening on the battlefiel"| __truncated__
 $ : chr [1:4] "                                https://www.nytimes.com/2021/08/08/us/politics/taliban-afghanistan-united-state"| __truncated__ "Over the past week, Taliban fighters have moved swiftly to retake cities around Afghanistan, assassinated gover"| __truncated__ "                                 Ms. Psaki speaking to reporters at the White House, on Friday. Tom Brenner for"| __truncated__ "Mr. Biden, declaring that the United States had long ago accomplished its mission of denying terrorists a haven"| __truncated__
 $ : chr [1:10] "                                https://www.nytimes.com/2021/08/30/world/asia/us-withdrawal-afghanistan-kabul.h"| __truncated__ "Old Soviet tanks litter the grounds of Bala Hissar, outside Kunduz. Jim Huylebroek for The New York Times\n" "  Khalil Haqqani, a Taliban leader, appeared at Friday prayers in Kabul this month with an American-made M-4 ri"| __truncated__ "The Taliban’s leverage, earned after years of fighting the world’s most advanced military, multiplied as they c"| __truncated__ ...
 $ : chr [1:6] "                                 https://www.nytimes.com/2021/09/01/world/asia/afghanistan-taliban-government-l"| __truncated__ "  Internally displaced Afghans fleeing the fighting in the north still live at a camp in the Sarawi Shomali par"| __truncated__ "  A vendor selling Taliban flags in Kabul on Friday near posters of the senior Taliban officials Amir Khan Mutt"| __truncated__ "The Taliban are also fighting stubborn opposition forces led by National Resistance Front leaders in Panjshir P"| __truncated__ ...
 $ : chr [1:2] "                               https://www.nytimes.com/2021/09/02/us/politics/congress-pentagon-budget-biden.ht"| __truncated__ "The lopsided vote underscored another reality: Even as the hard-charging liberal bloc of lawmakers pledging to "| __truncated__
 $ : chr [1:5] "                                 https://www.nytimes.com/2021/09/07/us/politics/afghan-war-iraq-veterans.html\n"| __truncated__ "                                Jen Burch said the doctors who examined her in 2014 found ground glass nodules "| __truncated__ "                                 Melissa Gauntner has dealt with dual traumas and has at times been gripped wit"| __truncated__ "In military families, scholars find what they call secondary traumatic distress, symptoms of anxiety stemming f"| __truncated__ ...
 $ : chr [1:5] "                                 https://www.nytimes.com/2020/10/05/world/asia/afghan-peace-talks-children.html"| __truncated__ "                                   Fatima Gailani, whose father was one of the leaders of the mujahedeen resist"| __truncated__ "                                Anas Haqqani, the youngest son of the insurgent chief Jalaluddin Haqqani, is pa"| __truncated__ "                                 Jalaluddin Haqqani in an undated photo from a video released by the Taliban on"| __truncated__ ...
 $ : chr [1:2] "                                https://www.nytimes.com/2020/03/04/world/asia/afghanistan-taliban-violence.html"| __truncated__ "     Understand the Taliban Takeover in Afghanistan\n\n     Who are the Taliban? The Taliban arose in 1994 amid"| __truncated__
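The page counts shown above can also be collected in a single named vector, which is easier to scan than a list of ten separate results. The following is a minimal sketch on mock data; “demo_files” and “demo_articles” are hypothetical stand-ins for “files” and “nyt_articles”.

```r
# Mock stand-in for nyt_articles: one character vector of pages per article
# (hypothetical file names and text, for illustration only)
demo_files <- c("article_a.pdf", "article_b.pdf")
demo_articles <- list(
  c("page one text", "page two text"),
  c("single page text")
)

# Label each article with its file name so the output is easier to read
names(demo_articles) <- demo_files

# lengths() returns the number of pages per article as a named integer vector
page_counts <- lengths(demo_articles)
page_counts
```

Naming the list also means that “str()” would show a file name next to each element instead of a bare “$”.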

Inspect the first article

head(nyt_articles[1])
[[1]]
[1] "                                 https://www.nytimes.com/2021/04/13/us/politics/samantha-power-biden.html\n\n\n\nAfter Backing Military Force in Past, U.S.A.I.D. Nominee Focuses on Deploying Soft\nPower\nIf confirmed to oversee the U.S. Agency for International Development, Samantha Power will confront adversaries by bolstering\ndemocracy and human rights. China is an early focus.\n\n\n          By Lara Jakes\n\nPublished April 13, 2021   Updated April 14, 2021\n\n\nWASHINGTON — Near the end of the 2014 documentary “Watchers of the Sky,” which chronicles the origins of the legal definition\nof genocide, Samantha Power grows emotional. At the time, Ms. Power was President Barack Obama’s ambassador to the United\nNations, and, she said, had “great visibility into a lot of the pain” in the world.\n\nFrom that perch, preventing mass atrocities abroad required “thinking through what we can do about it, to exhaust the tools at your\ndisposal,” Ms. Power said in the film. “And I always think about the privilege of, you know, of getting to try — just to try.”\n\nFew doubt Ms. Power’s zeal — given her career as a war correspondent, human rights activist, academic expert and foreign policy\nadviser — even if it has meant advocating military force to stop widespread killings.\n\nNow, as President Biden’s nominee to lead the United States Agency for International Development, she is preparing to rejoin the\ngovernment as an administrator of soft power, and resist using weapons as a means of deterrence and punishment that she has\npushed for in the past.\n\nA Senate committee is expected to vote Thursday on her nomination to lead one of the world’s largest distributors of humanitarian\naid.\n\nIf she is confirmed, Mr. Biden will also seat her on the National Security Council, where during the Obama administration she\npressed for military intervention to protect civilians from state-sponsored attacks in Libya in 2011 and Syria in 2013. 
(However, she\nalso opposed the 2003 invasion of Iraq.)\n\nThat she will be back at the table at the council — and again almost certain to be debating whether to entangle American forces in\nenduring conflicts — has concerned some officials, analysts and think tank experts who demand military restraint from the Biden\nadministration. Mr. Biden appears to be leaning that way: He has embraced economic sanctions as a tool of hard power and is\nexpected to announce a full withdrawal of American troops from Afghanistan by Sept. 11, ending the United States’ longest war.\n\n“If you’re talking about humanitarianism, famine, the wars — really, other than natural causes, war is the No. 1 cause of famine\naround the world,” Senator Rand Paul, Republican of Kentucky, told Ms. Power last month during her Senate confirmation hearing.\n“Are you willing to admit that the Libyan and Syrian interventions that you advocated for were a mistake?”\n\nMs. Power did not. “When these situations arise, it’s a question almost of lesser evils — that the choices are very challenging,” she\nsaid.\n\nBy its very nature, the U.S. aid agency takes a long-term view of the world compared with the immediacy of military action. Beyond\nthe roughly $6 billion in humanitarian aid it is delivering this year to disaster-ridden nations, the agency seeks to prevent conflict at\nits roots, largely bolstering economies, countering state corruption and fostering democracy and human rights.\n\nThat mission is central to Mr. Biden’s foreign policy, and will perhaps prove nowhere more pivotal than in his global competition\nwith China.\n\nLast month, Secretary of State Antony J. Blinken assured allies that they would not be backed into an “‘us-or-them’ choice with\nChina” as the two superpowers vie for economic, diplomatic and military advantage.\n"
[2] "                                Secretary of State Antony J. Blinken at the opening session of talks with China at the\n                                Captain Cook hotel in Anchorage. Pool photo by Frederic J. Brown\n\n\n\nInstead, the United States is highlighting what officials call China’s malign ideology and self-interests as it expands an influence\ncampaign across Africa, Europe and South America with financial loans, infrastructure funds, coronavirus vaccines and advanced\ntechnology.\n\nThe Trump administration also seized on China’s human rights abuses — particularly against ethnic Uyghurs in the country’s\nwestern region of Xinjiang — to persuade allies to turn against Beijing. On the Trump administration’s final day in office, Mike\nPompeo, the secretary of state, declared China’s oppression against Uyghurs as an act of genocide, and he criticized Beijing’s\nviolent suppression of dissidents in Hong Kong and military harassment of Taiwan.\n\n\n                                Sign Up for On Politics A guide to the political news cycle, cutting\n                                through the spin and delivering clarity from the chaos. Get it sent to your\n                                inbox.\n\n\nOfficials said China’s much-debated Belt and Road Initiative was a prime battleground for U.S.A.I.D. to challenge Beijing.\n\nRepresentative Tom Malinowski, Democrat of New Jersey and a former assistant secretary of state for democracy and human\nrights for Mr. Obama, described a “perception that China is exporting corruption” with its loans and development projects.\n\nFor example, a study in February by the International Republican Institute, a private nonprofit group that receives government\nfunding and promotes democracy, concluded that Panama’s decision in 2017 to sever diplomatic ties with Taiwan “appears to have\nbeen driven by payoffs” from China. 
It also noted that Nepal regularly revoked the legal status of Tibetan refugees after becoming\neconomically reliant on Beijing.\n\nThe American aid agency alone cannot match the funds that China has seeded in developing countries. But Mr. Malinowski said its\nsupport to journalists, legal advisers and legitimate opposition groups could “expose and combat” corrosive foreign leaders who\nhad benefited from Beijing’s financial backing and playbook for how to remain in power.\n\n“There is one issue that has risen to the top in this administration that I know she is very focused on, and that’s fighting corruption,”\nMr. Malinowski said of Ms. Power. “And U.S.A.I.D. has a very important role to play there, potentially.”\n\nAt her confirmation hearing in March, Ms. Power told senators she was moved to pursue a career in foreign policy after the 1989\nmassacre of protesters in Tiananmen Square in Beijing. She described China’s “coercive and predatory approach, which is so\ntransactional” in its dealings with developing countries that ultimately become dependent on Beijing through what she called “debt-\ntrap diplomacy.”\n\n“I think it’s not going over that well, and that creates an opening for the United States,” Ms. Power told Senator Todd Young,\nRepublican of Indiana.\n"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
[3] "The mostly benign prodding by Democrats and Republicans during the hearing signaled how countering China has become a rare,\nif reliable, issue of bipartisanship in Congress. “It’s absolutely essential that our development dollars, I think, be used to advance\nour geostrategic priorities,” Mr. Young said.\n\nThe aid agency and the State Department have budgeted about $2 billion on programs to foster democracy, human rights and open\ngovernance abroad in the 2021 fiscal year — one-third as much as funding for humanitarian assistance.\n\nIt is an area that Ms. Power is expected to expand. The Biden administration’s first budget blueprint, released on Friday, asserted it\nwould commit an unspecified but “significant increase in resources” to advance human rights and democracy while thwarting\ncorruption and authoritarianism.\n\n\n\n\n                               Asylum seekers from Central America crossing the Paso del Norte International Bridge,\n                               in Ciudad Juarez, Mexico. One of Ms. Power’s priorities will be to target corruption,\n                               violence and poverty in the region. Jose Luis Gonzalez/Reuters\n\n\n\nThe spending plan also will support another of Ms. Power’s priorities: targeting corruption, violence and poverty in Central\nAmerica as a means to curb the flow of thousands of migrants who head to the southwestern border each year. The Biden\nadministration is banking on a $4 billion strategy through 2025 — including an initial tranche of $861 million proposed this year — to\nhelp stabilize the region.\n\nIn El Salvador, for example, homicides dropped 61 percent after a U.S.A.I.D. effort to reduce violence from 2015 to 2017, Ms. Power\ntold the senators, and the agency’s programs in Honduras have yielded similar results. 
The programs not only supported local\nprosecutors but also brought together government officials, businesses and church and community leaders to divert young people\nfrom gangs through job training, tutoring and artistic activities.\n\nShe was met with some skepticism.\n\nSenator Rob Portman, Republican of Ohio, noted that the number of children from Central America at the border had steadily\nincreased since January, even though the United States spent $3.6 billion over the past five years on similar efforts.\n\n“The results are not impressive,” Mr. Portman said. “It’s an economic issue, primarily,” and “people will still be looking to come to\nthe United States.”\n\nExplaining foreign policy decisions to the American people, and making it relevant to their lives, is a driving theme of the State\nDepartment under Mr. Biden. Ms. Power can reach back to her own experiences as both an immigrant from Ireland and a\nstoryteller to make the case for easing the border crisis by attacking its root causes.\n\n“That’s part of the job, too — you’ve got to be a salesperson, you’ve got to go out there and explain to people, ‘Here’s why we need\nmore resources to do this work, and here’s where U.S.A.I.D. can be an incredibly important partner,’” said John Prendergast, a\nlongtime human rights and anticorruption activist and close friend to Ms. Power.\n"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
[4] "“There is so much that can be done between bombing and nothing,” Mr. Prendergast said, paraphrasing Luis Moreno Ocampo, the\nformer prosector of the International Criminal Court who was featured in the same documentary about genocide as Ms. Power.\n“And Samantha’s whole work and life has been between those two extremes.”\n\nGayle Smith, who ran the aid agency for Mr. Obama and is now the State Department’s coronavirus vaccine envoy, put it more\nbluntly.\n\n“It’s not like U.S.A.I.D. is going to invade somebody,” she said.\n"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   

Inspecting Individual Articles

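Now I’m going to use “pluck()” from “purrr” to pull out each article as its own vector and create a corpus of each article to examine. Because “pdf_text()” returns one string per page, each page becomes one document in the corpus.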

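#load purrr (pluck, as_vector, and the pipe) and quanteda (corpus)
library(purrr)
library(quanteda)

#pull the first article out of the list
article_111 <- nyt_articles %>% 
  pluck(1)
article_111 <- as_vector(article_111)

#build a corpus with one document per page and summarize it
article_111_corpus <- corpus(article_111)
article_111_summary <- summary(article_111_corpus)
article_111_summary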
Corpus consisting of 4 documents, showing 4 documents:

  Text Types Tokens Sentences
 text1   357    688        24
 text2   289    523        19
 text3   304    562        18
 text4    76    105         4
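The summary above treats each PDF page as its own document. If one document per article is preferred, the pages can be collapsed into a single string before the corpus is built. This is only a sketch under that assumption: “demo_pages” is mock text, “article_demo” is a hypothetical document name, and the corpus step runs only when “quanteda” is installed.

```r
# Mock pages standing in for one article (hypothetical text)
demo_pages <- c("First page of the article.", "Second page of the article.")

# Collapse the pages into a single string so the article is one document
demo_text <- paste(demo_pages, collapse = " ")

# Build the corpus only if quanteda is installed
if (requireNamespace("quanteda", quietly = TRUE)) {
  # A named character vector gives the document its name
  demo_corpus <- quanteda::corpus(c(article_demo = demo_text))
  print(summary(demo_corpus))
}
```

Whether page-level or article-level documents are more useful depends on the analysis; sentiment per article would likely call for the collapsed form.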

I also found a very interesting way to pull the text from each PDF and save it as an individual .txt file, but for now I’m just going to note that as an alternative process. Reading the PDF text has been considerably harder than reading the headlines.

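#the pipe comes from magrittr (also attached by purrr or dplyr)
library(magrittr)

convertpdf2txt <- function(dirpath){
  #list every file in the directory, with full paths
  files <- list.files(dirpath, full.names = TRUE)
  #read each PDF and collapse its pages into one cleaned string
  sapply(files, function(x){
    pdftools::pdf_text(x) %>%
      stringr::str_replace_all(stringr::fixed("\n"), " ") %>%
      stringr::str_replace_all(stringr::fixed("\r"), " ") %>%
      stringr::str_replace_all(stringr::fixed("\t"), " ") %>%
      stringr::str_replace_all(stringr::fixed("\""), " ") %>%
      paste(collapse = " ") %>%
      stringr::str_squish() %>%
      #drop "- " left where a word was hyphenated across a line break
      stringr::str_replace_all("- ", "")
  })
}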
# apply function
txts <- convertpdf2txt("./files")
# inspect the structure of the txts element
str(txts)
 Named chr [1:10] "https://www.nytimes.com/2021/04/13/us/politics/samantha-power-biden.html After Backing Military Force in Past, "| __truncated__ ...
 - attr(*, "names")= chr [1:10] "./files/article_111.pdf" "./files/article_132.pdf" "./files/article_193.pdf" "./files/article_196.pdf" ...
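To make the cleaning steps easier to follow, here is a base-R sketch of the same sequence applied to two mock pages. It swaps the “stringr” calls for “gsub()” and “trimws()” so it runs without extra packages; the sample text is invented for the example.

```r
# Two mock pages with the kinds of characters the helper removes
pages <- c("  Officials described debt-\ntrap diplomacy.\n",
           "A second\tpage.\r\n")

# Replace newlines, carriage returns, tabs and double quotes with spaces
step1 <- gsub("[\n\r\t\"]", " ", pages)

# Join the pages into one string
step2 <- paste(step1, collapse = " ")

# Squish: trim the ends and reduce runs of whitespace to a single space
step3 <- gsub("\\s+", " ", trimws(step2))

# Drop "- " so words split across a line break are rejoined
cleaned <- gsub("- ", "", step3, fixed = TRUE)
cleaned
```

Note that the last step turns “debt-\ntrap” into “debttrap”: it rejoins words hyphenated at a line break, but it also removes the hyphen from genuinely hyphenated terms that happen to break across lines.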
#apply length functions
lapply(txts, length)
$`./files/article_111.pdf`
[1] 1

$`./files/article_132.pdf`
[1] 1

$`./files/article_193.pdf`
[1] 1

$`./files/article_196.pdf`
[1] 1

$`./files/article_278.pdf`
[1] 1

$`./files/article_288.pdf`
[1] 1

$`./files/article_293.pdf`
[1] 1

$`./files/article_300.pdf`
[1] 1

$`./files/article_56.pdf`
[1] 1

$`./files/article_7.pdf`
[1] 1
#view the structure of the list
str(txts)
 Named chr [1:10] "https://www.nytimes.com/2021/04/13/us/politics/samantha-power-biden.html After Backing Military Force in Past, "| __truncated__ ...
 - attr(*, "names")= chr [1:10] "./files/article_111.pdf" "./files/article_132.pdf" "./files/article_193.pdf" "./files/article_196.pdf" ...
# add names to txt files
names(txts) <- paste0("nyt", seq_along(txts))
# save result to disk, one file per article inside the ./txts folder
# (the folder must already exist)
lapply(seq_along(txts), function(i) writeLines(text = unlist(txts[i]),
    con = file.path("./txts", paste0(names(txts)[i], ".txt"))))
[[1]]
NULL

[[2]]
NULL

[[3]]
NULL

[[4]]
NULL

[[5]]
NULL

[[6]]
NULL

[[7]]
NULL

[[8]]
NULL

[[9]]
NULL

[[10]]
NULL
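The list of “NULL” values above is just the return value of “writeLines()”, printed once per file. A loop avoids that output, and “file.path()” keeps the folder and file name separated correctly. The sketch below writes mock text to a temporary folder; “txts_demo” and “out_dir” are hypothetical names used only here.

```r
# Mock named character vector standing in for txts (hypothetical text)
txts_demo <- c(nyt1 = "first article text", nyt2 = "second article text")

# Write into a temporary folder, creating it if needed
out_dir <- file.path(tempdir(), "txts")
dir.create(out_dir, showWarnings = FALSE)

# One .txt file per article; a for loop prints nothing to the console
for (nm in names(txts_demo)) {
  writeLines(txts_demo[[nm]], con = file.path(out_dir, paste0(nm, ".txt")))
}

# Check which files were written
list.files(out_dir)
```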

Unlist

For now, I’m documenting the places where I’m struggling so I can find out why. Primarily, I’m struggling with the “unlist()” function as it applies to documents originating as PDF files, though it has not been an issue in other situations.

#convert list to vector
#nyt_vector <- unlist(nyt_articles, recursive = TRUE)
#put articles into data frame
#nyt_df <- as.data.frame(nyt_vector, row.names = NULL, stringsAsFactors = FALSE)
#create corpus
#nyt_corpus <- corpus(txts)
#confirming class of corpus
#class(nyt_corpus)
#confirm length of corpus
#length(nyt_corpus)
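One possible source of the difficulty, offered as an assumption rather than a diagnosis: because each article is a vector of pages, “unlist()” flattens the list into one element per page, so the article boundaries are lost. The mock example below (invented text, hypothetical “nyt_demo” object) contrasts that with collapsing each article to a single string first.

```r
# Mock list: article 1 has two pages, article 2 has one (hypothetical text)
nyt_demo <- list(c("a1 p1", "a1 p2"), c("a2 p1"))

# unlist() returns one element per page, not one per article
flat <- unlist(nyt_demo)
length(flat)

# Collapsing the pages of each article keeps one element per article
per_article <- vapply(nyt_demo, paste, character(1), collapse = " ")

# A data frame with one row per article
nyt_df <- data.frame(doc_id = paste0("nyt", seq_along(per_article)),
                     text = per_article,
                     stringsAsFactors = FALSE)
nyt_df
```

A data frame with “doc_id” and “text” columns is a common starting shape for corpus construction, though the right structure depends on the package being used.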

Conclusion

This type of analysis will not fit within the scope of my current project, but I plan to use it in further studies.