I am struggling to scrape data from GDELT.

http://data.gdeltproject.org/events/index.html

I want to write code that automatically downloads, unzips, and merges the files for specific periods, but despite numerous attempts I have not managed to do so.

Although the "gdeltr2" package exists, it does not retrieve some variables correctly from the original data.

I need your help.

Phil
Injae Jeon

1 Answer

The rvest package has the appropriate tools for this. We extract the `href` attribute from all link (`<a href = ...>...</a>`) nodes, filter down to those ending in ".CSV.zip", and build the full URLs. Then we download each file, and `readr::read_tsv()` does the rest: it can read directly from zip archives, and (since readr 2.0) it accepts a vector of file paths and row-binds them into one data frame.

library(rvest)
library(tidyverse)

gdelt_index_url <- 
  "http://data.gdeltproject.org/events"

gdelt_dom <- read_html(gdelt_index_url)

url_df <- 
  gdelt_dom |> 
  html_elements("a") |> # html_nodes() is the deprecated alias
  html_attr("href") |> 
  tibble() |> 
  set_names("path") |> 
  filter(str_detect(path, "\\.CSV\\.zip$")) |> # escape the dots, which are regex wildcards
  mutate(url = file.path(gdelt_index_url, path)) |> 
  slice(1:3) # For the purpose of demonstration we use only the first three files
  
# download.file() is called only for its side effect, so walk2() fits better than map2()
walk2(url_df$url,
      url_df$path,
      download.file)

# The GDELT event files have no header row, hence col_names = FALSE;
# read_tsv() unpacks each zip and row-binds all files into one tibble.
gdelt_event_data <- 
  read_tsv(url_df$path, col_names = FALSE)
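
Since you asked about specific periods: the daily file names in the index start with the date (e.g. `20230101.export.CSV.zip`), so instead of `slice(1:3)` you can filter `url_df` by a date range before downloading. A sketch (the date range here is arbitrary; adjust it to your needs):

```r
library(dplyr)
library(stringr)

# Parse the leading YYYYMMDD from each file name and keep one month of files.
# The early yearly/monthly archives lack a full date prefix, parse to NA,
# and are filtered out.
url_df_period <- 
  url_df |> 
  mutate(date = as.Date(str_extract(path, "^\\d{8}"), format = "%Y%m%d")) |> 
  filter(!is.na(date),
         date >= as.Date("2023-01-01"),
         date <= as.Date("2023-01-31"))
```

The download and `read_tsv()` steps then work unchanged on `url_df_period`.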
Till