Extract text from dynamic Web page using R

Question

I am working on a data prep tutorial, using data from this article: https://www.nytimes.com/interactive/2021/01/19/upshot/trump-complete-insult-list.html#

None of the text is hard-coded; everything is generated dynamically, and I don't know where to start. I've tried a few things with the rvest and xml2 packages, but I can't even tell whether I'm making progress.

I've used copy/paste and regexes in Notepad++ to get a tabular structure like this:

Target	Attack
AAA News	Fake News
AAA News	Fake News
AAA News	A total disgrace
...	...
Mr. ZZZ	A real nut job

but I'd like to show how to do everything programmatically (no copy/paste).

My main question is as follows: is that even possible with reasonable effort? And if so, any clues on how to get started?

PS: I know that this could be a duplicate, I just can't tell of which question since there are totally different approaches out there :\

Dave2e · Answer 1 · 2021-01-30T19:50:30.320

I have used up my free article allocation at The NY Times for the month, but here is some guidance. It looks like the web page uses several scripts to create and display the page.

If you use your browser's developer tools and look at the network tab, you will find 2 CSV files:

tweets-full.csv located here: https://static01.nyt.com/newsgraphics/2021/01/10/trump-insult-complete/8afc02d17b32a573bf1ceed93a0ac21b232fba7a/tweets-full.csv
tweets-reduced.csv located here: https://static01.nyt.com/newsgraphics/2021/01/10/trump-insult-complete/8afc02d17b32a573bf1ceed93a0ac21b232fba7a/tweets-reduced.csv

It looks like tweets-reduced.csv supplies the table quoted above and tweets-full.csv holds the full text of each tweet. You can read these files directly with read.csv() and then process the data as needed.
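A minimal sketch of that download step, assuming the two URLs above are still live and that the files parse as ordinary comma-separated text. The helper names `read_nyt_csv` and `fetch_insult_data` are made up for illustration, and the column names are whatever the files actually contain, so check them with `str()` before relying on any of them.

```r
# Hypothetical helper: read one CSV from a URL or a local path.
read_nyt_csv <- function(src) {
  read.csv(src, stringsAsFactors = FALSE, encoding = "UTF-8")
}

# Base path copied from the two URLs listed above.
base_url <- paste0(
  "https://static01.nyt.com/newsgraphics/2021/01/10/trump-insult-complete/",
  "8afc02d17b32a573bf1ceed93a0ac21b232fba7a/"
)

# Wrapped in a function so nothing is downloaded until it is called.
fetch_insult_data <- function() {
  list(
    reduced = read_nyt_csv(paste0(base_url, "tweets-reduced.csv")),
    full    = read_nyt_csv(paste0(base_url, "tweets-full.csv"))
  )
}

# Usage (needs network access):
# insults <- fetch_insult_data()
# str(insults$reduced)
```

Keeping the read step in its own function makes it easy to point the same code at a locally saved copy if the URLs ever change.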

Be sure to read the terms of service before scraping any webpage.

The reduced list has 4 columns: the target, the insult, the tweet date and the tweet number (in tweets-full). For example, the first line shows that Thomas Frieden was called a fool in tweet 1347 on 10-9-2014 — Dave2e, Jan 30 '21 at 15:16
Good catch, thanks. Looks like the data cleaning tutorial is going to take a turn into reverse engineering! — Dominic Comtois, Jan 30 '21 at 15:55
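Given the column layout described in the comment above, attaching the full tweet to each insult could look roughly like the sketch below. Everything here is an assumption: the toy data frames stand in for the two files, the column names `target`, `insult`, `tweet_id` and `text` are hypothetical, and it assumes the tweet number is stored as a matching column in tweets-full rather than being a row position.

```r
# Toy stand-ins for tweets-reduced.csv and tweets-full.csv.
# All column names are hypothetical; check the real files first.
reduced <- data.frame(
  target   = c("AAA News", "Mr. ZZZ"),
  insult   = c("Fake News", "A real nut job"),
  tweet_id = c(2L, 1L),
  stringsAsFactors = FALSE
)
full <- data.frame(
  tweet_id = 1:2,
  text     = c("placeholder tweet one", "placeholder tweet two"),
  stringsAsFactors = FALSE
)

# Left join: keep every insult row and attach the matching full tweet.
joined <- merge(reduced, full, by = "tweet_id", all.x = TRUE)
joined[, c("target", "insult", "text")]
```

If the tweet number turns out to be a row index instead, plain indexing such as `full[reduced$tweet_id, ]` would be the equivalent lookup.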

Ian Campbell · Accepted Answer · 2021-01-31T14:18:36.090

Here's a programmatic approach with RSelenium and rvest:

library(RSelenium)
library(rvest)
library(tidyverse)
driver <- rsDriver(browser="chrome", port=4234L, chromever ="87.0.4280.87")
client <- driver[["client"]]
client$navigate("https://www.nytimes.com/interactive/2021/01/19/upshot/trump-complete-insult-list.html#")
page.source <- client$getPageSource()[[1]]

#Extract nodes for each letter using XPath
Letters <- read_html(page.source) %>%
  html_nodes(xpath = '//*[@id="mem-wall"]/div[2]/div') 

#Extract Entities using CSS
Entities <- map(Letters, ~ html_nodes(.x, css = 'div.g-entity-name') %>%
                  html_text)

#Extract quotes using CSS
Quotes <- map(Letters, ~ html_nodes(.x, css = 'div.g-twitter-quote-container') %>%
                            map(html_nodes, css = 'div.g-twitter-quote-c') %>%
                            map(html_text))

#Bind the entities and quotes together. There are two letters that are blank, so fall back to NA
map2_dfr(Entities, Quotes,
         ~ map2_dfr(.x, .y, ~ {
             if (length(.x) > 0 && length(.y) > 0) {
               data.frame(Entity = .x, Insult = .y)
             } else {
               data.frame(Entity = NA, Insult = NA)
             }
           })) -> Result

#Strip out the quotes
Result %>%
  mutate(Insult = str_replace_all(Insult,"(^“)|([ .,!?]?”)","") %>% str_trim) -> Result

#Take a look at the result
Result %>%
  slice_sample(n=10)
                   Entity                                                              Insult
1             Mitt Romney                                       failed presidential candidate
2         Hillary Clinton                                                             Crooked
3  The “mainstream” media                                                           Fake News
4               Democrats                                             on a fishing expedition
5           Pete Ricketts                                             illegal late night coup
6  The “mainstream” media                                                   anti-Trump haters
7     The Washington Post do nothing but write bad stories even on very positive achievements
8               Democrats                                                                weak
9             Marco Rubio                                                         Lightweight
10     The Steele Dossier                                                      a Fake Dossier

The XPath was obtained by inspecting the page in the browser's developer tools (F12 in Chrome), hovering over elements until the correct one was highlighted, right-clicking it, and choosing Copy > Copy XPath.
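To see what that XPath and the CSS selectors pick out without starting a browser, here is a small sketch on a made-up HTML snippet. The markup only imitates the nesting implied by the selectors in the code above; it is not the real page source, and the entity names are the placeholders from the question.

```r
library(rvest)

# Made-up HTML that imitates the nesting the selectors rely on.
html <- '
<div id="mem-wall">
  <div>header</div>
  <div>
    <div>
      <div class="g-entity-name">AAA News</div>
      <div class="g-twitter-quote-container">
        <div class="g-twitter-quote-c">Fake News</div>
        <div class="g-twitter-quote-c">A total disgrace</div>
      </div>
    </div>
    <div>
      <div class="g-entity-name">Mr. ZZZ</div>
      <div class="g-twitter-quote-container">
        <div class="g-twitter-quote-c">A real nut job</div>
      </div>
    </div>
  </div>
</div>'

page <- read_html(html)

# Same XPath as in the answer: the div children of the second div
# directly under the element with id "mem-wall".
letter_nodes <- html_nodes(page, xpath = '//*[@id="mem-wall"]/div[2]/div')

# CSS selectors then search inside those nodes.
entities <- html_text(html_nodes(letter_nodes, css = "div.g-entity-name"))
quotes   <- html_text(html_nodes(letter_nodes, css = "div.g-twitter-quote-c"))
```

A positional XPath like `div[2]` breaks as soon as the page layout changes, so the class-based CSS selectors are usually the more durable half of the approach.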

I got it to work with Firefox... RSelenium + Chrome was very unstable. It's exactly what I wanted to accomplish. Can I ask how you got to `html_nodes(xpath = '//*[@id="mem-wall"]/div[2]/div')`? Thanks! — Dominic Comtois, Jan 31 '21 at 09:49
