Applying Amazon Comprehend on foreign news articles in R

Karola Takács
7 min read · Dec 9, 2020

Introduction

In this short exploratory project I decided to combine two of Amazon’s currently available machine learning services, namely Amazon Translate and Amazon Comprehend. The former is a text translation service that uses advanced machine learning to provide high-quality translation on demand; the language pairs (input-output) used here were German-English and Spanish-English. The latter uses natural language processing (NLP) to extract insights about the content of documents: with its help the user can recognize the most common entities, key phrases and sentiments in a text.

My idea for the workflow was to search for two non-English news articles on the same topic, scrape them with R and then apply the Amazon services to better understand the standpoints and general “feelings” about the current topic of COVID vaccination. I was interested in how different countries report on it and how they approach this new and sometimes controversial topic. For this short analysis I picked two European countries which I believe have rather contrasting characteristics. I am not going to try to typify Germany and Spain, to avoid generalization; rather, let’s jump into the setup of the analysis.

Setup

‘aws.translate’ and ‘aws.comprehend’ are R packages that allow users to easily access the AWS Translate and Comprehend services. Before using them it is necessary to have AWS credentials, and more importantly an access key pair created via IAM, Amazon’s Identity and Access Management system. To connect to AWS from R the following code is needed:

# accessKeys.csv == the CSV downloaded from AWS containing your Access & Secret keys
keyTable <- read.csv("accessKeys.csv", header = TRUE)
AWS_ACCESS_KEY_ID <- as.character(keyTable$Access.key.ID)
AWS_SECRET_ACCESS_KEY <- as.character(keyTable$Secret.access.key)

# activate the credentials for this session
Sys.setenv("AWS_ACCESS_KEY_ID" = AWS_ACCESS_KEY_ID,
           "AWS_SECRET_ACCESS_KEY" = AWS_SECRET_ACCESS_KEY,
           "AWS_DEFAULT_REGION" = "eu-west-1")

For installation of the two packages:

# install.packages("aws.translate", repos = c(getOption("repos"), "http://cloudyr.github.io/drat"))
# install.packages("aws.comprehend", repos = c(cloudyr = "http://cloudyr.github.io/drat", getOption("repos")))

Since I don’t speak Spanish and only a little German, the choice of articles was fairly random. I translated the same phrase “Vaccination for COVID-19” into Spanish and German and searched news outlets. The Spanish text can be found on BBC Mundo, which is part of the BBC World Service’s foreign-language output for the Spanish-speaking world;

while the German text is on dw.com: Deutsche Welle is a German state-owned international broadcaster funded by the German federal tax budget, whose content is intended to be independent of government influence.
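Reading the two pages into R is a single rvest call each; a minimal sketch, where spanish_article_url and german_article_url are placeholders for the two article links:

# spanish_article_url and german_article_url are placeholders for the actual article links
url_spanish <- read_html(spanish_article_url)
url_german <- read_html(german_article_url)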

After reading the HTML from the websites, I used CSS selectors to scrape the relevant parts of these pages. SelectorGadget is a Chrome extension that makes scraping quite easy: you simply point and click the desired parts of the webpage and then paste the resulting selector path into R as:

description_html_spanish <- html_nodes(url_spanish, '.e57qer20:nth-child(34) .e1cc2ql70 , .e57qer20:nth-child(33) .e1cc2ql70 , .e57qer20:nth-child(32) .e1cc2ql70 , .e57qer20:nth-child(30) .e1cc2ql70 , .e57qer20:nth-child(29) .e1cc2ql70 , .e57qer20:nth-child(28) .e1cc2ql70 , .e57qer20:nth-child(26) .e1cc2ql70 , .e57qer20:nth-child(25) .e1cc2ql70 , .e57qer20:nth-child(24) .e1cc2ql70 , .e57qer20:nth-child(23) .e1cc2ql70 , .e57qer20:nth-child(21) .e1cc2ql70 , .e57qer20:nth-child(20) .e1cc2ql70 , .e57qer20:nth-child(19) .e1cc2ql70 , .e57qer20:nth-child(16) .e1cc2ql70 , .ewc4zcb0 , .e57qer20:nth-child(5) .e1cc2ql70 , .e57qer20:nth-child(14) .e1cc2ql70 , .css-1fct1nm-GridComponent+ .e57qer20 .e1cc2ql70')

description_html_german <- html_nodes(url_german, '.longText > p , .intro , h1')
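The selected nodes still carry their HTML markup; converting them to plain text is roughly one html_text() call per article (a sketch that assumes the variable names used in the translation step below):

# keep only the text of the selected nodes
description_spanish <- html_text(description_html_spanish)
description_german <- html_text(description_html_german)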

This way I ended up with a plain-text description for each article, ready for the next steps. The translation itself consisted of basically two blocks and resulted in two character vectors:

trans1 <- NULL
tr <- NULL
# for each element of the German article
for (i in 1:length(description_german)) {
  # translate from the automatically detected language to English and keep only the translation (omitting attributes)
  tr <- translate(description_german[i], from = "auto", to = "en")[1]
  trans1 <- rbind(trans1, tr)
}
# save the translation as a character vector
trans_from_de <- as.character(trans1)

trans2 <- NULL
tr <- NULL
# for each element of the Spanish article
for (i in 1:length(description_spanish)) {
  # translate from the automatically detected language to English and keep only the translation (omitting attributes)
  tr <- translate(description_spanish[i], from = "auto", to = "en")[1]
  trans2 <- rbind(trans2, tr)
}
# save the translation as a character vector
trans_from_sp <- as.character(trans2)

Analysis

Sentiments

Amazon Comprehend can determine the emotional sentiment of a document, returning a likelihood score for each sentiment category in the given text. Sentiment detection can in fact be performed in any of the primary languages supported by Amazon Comprehend, as long as all documents in a request are in the same language. This means that I could also just plug in the original texts right after scraping, since both German and Spanish are supported, but that would not help me understand the text itself, and thus the phrases and entities later on (except person names and institutions, of course).
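Just for illustration, one of the scraped German paragraphs could be scored directly; a sketch that assumes the language argument of detect_sentiment() in the aws.comprehend package (it defaults to English):

# score one original German paragraph without translating it first
# (language argument assumed per the aws.comprehend documentation)
detect_sentiment(description_german[1], language = "de")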

Similarly to the translation, I could perform the sentiment analysis in a loop:

df1 <- NULL
a1 <- NULL
# for each element of the German story do the following
for (i in 1:length(trans1)) {
  # but only if the element is a non-empty string
  if (trans1[i] > "") {
    df1 <- detect_sentiment(trans1[i])  # get the sentiment
    a1 <- rbind(a1, df1)
  }
}

df2 <- NULL
a2 <- NULL
# for each element of the Spanish story do the following
for (i in 1:length(trans2)) {
  # but only if the element is a non-empty string
  if (trans2[i] > "") {
    df2 <- detect_sentiment(trans2[i])  # get the sentiment
    a2 <- rbind(a2, df2)
  }
}

I visualized the two data frames: a1 (German text) contains 15 observations, while a2 (Spanish text) contains 47.
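A comparison of the detected sentiment categories can be drawn with a few lines of dplyr and ggplot2; this is a minimal sketch, assuming the Sentiment column in the data frames returned by detect_sentiment():

library(dplyr)
library(ggplot2)

# label and combine the two result sets (a1 = German article, a2 = Spanish article)
sentiments <- bind_rows(
  mutate(a1, article = "German"),
  mutate(a2, article = "Spanish")
)

# relative frequency of each detected sentiment category per article
sentiments %>%
  count(article, Sentiment) %>%
  group_by(article) %>%
  mutate(share = n / sum(n)) %>%
  ggplot(aes(Sentiment, share, fill = article)) +
  geom_col(position = "dodge") +
  labs(y = "relative frequency")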

The returned data frames contain the detected sentiment and a likelihood score for each category. For the first article the text is 99.95% likely to have a neutral sentiment, while for the second article this figure is 94.8%. Both results indicate that the articles are neutral, but there are subtle differences: the relative frequency of neutral sentiments is about 12% higher for the German article than for the Spanish one, and it is also easy to spot that positive sentiments are missing from the German article altogether. Overall I would say that the Spanish article carries more mixed feelings, with a bit of positive and mixed voices at the expense of neutral expressions.

Entities and phrases

We have the opportunity to dive deeper into the exact phrases, words and frequent entities/actors in the texts and extract additional insights. The code for this is simple:

# get entities from the German text
df3 <- detect_entities(trans_from_de)
# get entities from the Spanish text
df4 <- detect_entities(paste(trans_from_sp, collapse = ''))

I plotted the entities identified by Comprehend, so let’s compare the two texts again:
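Again, a minimal sketch of how such a plot could be produced, assuming the Type column in the data frames returned by detect_entities():

library(dplyr)
library(ggplot2)

# combine the entity tables and count the entity types found in each article
entity_types <- bind_rows(
  mutate(df3, article = "German"),
  mutate(df4, article = "Spanish")
)

ggplot(entity_types, aes(x = Type, fill = article)) +
  geom_bar(position = "dodge") +
  labs(y = "number of entities")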

Here we can see a bit more diversity than previously with the sentiments. The German article mentions more dates (which might signal more planning, or more frequent references back to earlier dates) and refers a lot more to persons. It seems logical that quantities come second in both pieces, since the topic mostly revolves around how many people can be vaccinated and how many doses are available. The most interesting “finding” is that in the Spanish article the ‘other’ entity type dominates and ‘organization’ comes only third, whereas in the German text the ‘other’ category is hardly relevant and organizations play a crucial role.

In order to get to the exact phrases I applied the following code:

# key phrases from the German text
df5 <- detect_phrases(trans_from_de)
# key phrases from the Spanish text
df6 <- detect_phrases(paste(trans_from_sp, collapse = ''))

The output for the German article has 218 observations while for the Spanish one 144, but this is simply due to the different lengths of the articles.
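To eyeball which phrases come up most often, a quick tally is enough; a sketch, assuming the Text column in the output of detect_phrases():

library(dplyr)

# ten most frequent key phrases in the German article
df5 %>%
  count(Text, sort = TRUE) %>%
  head(10)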

Conclusion

The scope of this project was not intended to go deeper, to the level of individual words, so to sum up I would say that the German piece of news may rely heavily on and expect actions from organizations/institutions, and its most important aspects are time and quantity. The second article is mostly concerned with ‘other’ and quantity, and with institutions to some extent. It turned out that ‘other’ consists of just the words covid-19/European/Scandinavian, so it probably mentions some other European countries for reference. This also means that the latter two should rather be characterized as the ‘Location’ type and not as ‘other’ (Italy and Baltic, as I can see, are already under ‘Location’), which would already balance out the dominance of the ‘other’ type somewhat. I think the 25 observations that comprise this output are not much to draw conclusions from, and this last example also suggests that the categories might not be that accurate (‘European’ had a score of 0.9243979 for being ‘Other’).
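The check behind that remark can be done directly on the entity table; a sketch, assuming the Type, Text and Score columns returned by detect_entities() (Comprehend reports entity types in upper case):

library(dplyr)

# inspect what Comprehend labelled as 'Other' in the Spanish article
df4 %>%
  filter(Type == "OTHER") %>%
  select(Text, Score)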

For me the most interesting result is how much information can be extracted about the articles without even reading them or speaking the language. The Germans also seem a bit less positive/optimistic, as well as more detached, than the Spaniards.
