
I have a Twitter developer account with access to the v2 API and am trying to count the number of tweets posted by certain organisations (replaced with POTUS in the example code) over the past few years. However, I can only count the tweets from the last month. If I want to see the months prior to that, I need to manually add the next_token to the script. This would be very time-consuming. Instead, I would like a script with automatic pagination.

My knowledge is very rudimentary and I don't understand how others have fixed this issue. I know I should write some kind of loop, but this is above my head.

library(httr)

bearer_token = ""

headers = c(
  `Authorization` = sprintf('Bearer %s', bearer_token)
)

params = list(
  `query` = 'from:POTUS',
  `start_time` = '2017-01-01T00:00:00Z',
  `end_time` = '2022-01-01T00:00:00Z',
  `granularity` = 'day'
)

response <- httr::GET(url = 'https://api.twitter.com/2/tweets/counts/all', httr::add_headers(.headers=headers), query = params)

body <-
  content(
    response,
    as = 'parsed',
    type = 'application/json',
    simplifyDataFrame = TRUE
  )


View(body$data)
sum(body$data$tweet_count)

1 Answer


The Twitter API documentation notes:

The process of looking for a next_token and including it in a subsequent request can be repeated until all (or some number of) Tweets are collected, or until a specified number of requests have been made. If data fidelity (collecting all matches of your query) is key to your use case, a simple "repeat until request.next_token is null" design will suffice.

What they're trying to communicate with the last sentence is that you need to implement some code that repeats itself until either (a) no token is in the response (indicating you've collected all matches) or (b) you have enough data.

In R, we call this type of "repeat until" a while loop or repeat loop. Here's the structure of your desired loop:

  1. Make a request.
  2. Store the results in some object/file.
  3. Extract the next_token from the "meta" attribute of the JSON response.
  4. Construct a new query with the next_token string.
  5. Repeat steps 1-4 until either no next_token is in the response OR you've gone through as many pages as you desire.

Since (a) it sounds like you know how to make a call and get the token but are worried about how long it would take to do all of this manually, and (b) I cannot access your data, I'm going to focus on a stylized solution.

Below, I introduce a function auto_paginate(). If you replace the placeholder functions I've inserted with code that accomplishes the specified tasks, it will paginate automatically. The loop is embedded within the function, and the annotations mark where the loop begins/ends, how it exits, etc.

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Placeholder functions
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# Function that takes a query as input and outputs a results object
# can be replaced with e.g. httr::GET() but need to add additional
# arguments to auto_paginate function to make sure you can pass
# everything you need to GET().
F_GET <- function(query) {
  result <- query # replace this with a function that calls the API
  return(result)
}

# Function that takes the result of F_GET (or replacement function) and
# extracts the next_token value if the result contains a token. Must return
# NULL if no token is found.
F_find_next_token <- function(result) {
  token <- result # replace this, ensuring NULL is returned if no token is found
  return(token)
}

###############################################################################
# FUNCTION: auto_paginate
# Given a query and page limit, returns a list where each element of the
# list is the result for a unique page. Either returns all pages or returns
# no more than the page limit of pages.
#
# Arguments:
# - page_limit: numeric value for maximum number of pages to query
# - query: string containing an initial API call
###############################################################################

auto_paginate <- function(page_limit = NULL, query) {
  # Initialize objects needed for the loop or that do not need to be repeated
  null_limit <- is.null(page_limit)
  result_list <- list()
  page_counter <- 1
  
  # Begin loop: everything within the brackets repeats until exit condition met
  repeat {
    # Request and save result
    result <- F_GET(query)
    result_list[[page_counter]] <- result
    
    # Increment the page counter
    page_counter <- page_counter + 1
    
    # CONDITIONALLY EXIT LOOP: if desired page limit has been met
    # note: nested IF statements because the first being true is required for
    # the second test
    if (null_limit == FALSE) {
      if (page_counter > page_limit) {
        break
      }
    }
    
    # Look for next_token
    next_token <- F_find_next_token(result)
    
    # CONDITIONALLY EXIT LOOP: if there is no next_token (no more pages)
    if (is.null(next_token)) {
      break
    }
    
    # Create next query by:
    # (1) removing the next_token if previously added (uses regular expression)
    # (2) adding a next_token
    query <- gsub("&next_token=.*", "", query)
    query <- paste0(query, "&next_token=", next_token)
  }
  
  # Loop ended, return results
  
  return(result_list)
}
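
For concreteness, here is a minimal sketch of how the placeholders might be filled in for your counts query. It reuses the headers object from your question, builds the initial call as a single URL string so auto_paginate() can append &next_token=... to it, and assumes the counts endpoint returns the pagination token under meta$next_token and accepts it as a next_token parameter (check the current API docs); I have not run this against the live API, so adjust as needed.

# Placeholder replacement (sketch): GET the URL and parse the JSON body
F_GET <- function(query) {
  response <- httr::GET(url = query, httr::add_headers(.headers = headers))
  httr::content(
    response,
    as = 'parsed',
    type = 'application/json',
    simplifyDataFrame = TRUE
  )
}

# Placeholder replacement (sketch): returns NULL automatically if "meta" or
# "next_token" is missing from the parsed body
F_find_next_token <- function(result) {
  result$meta$next_token
}

# Initial call as a single URL string, mirroring the params in the question
initial_query <- paste0(
  'https://api.twitter.com/2/tweets/counts/all',
  '?query=', utils::URLencode('from:POTUS', reserved = TRUE),
  '&start_time=2017-01-01T00:00:00Z',
  '&end_time=2022-01-01T00:00:00Z',
  '&granularity=day'
)

pages <- auto_paginate(query = initial_query)

# Total tweet count across all pages
sum(unlist(lapply(pages, function(p) sum(p$data$tweet_count))))

The URL-string approach is just one way to satisfy the query argument that auto_paginate() expects; you could instead pass the params list and add the token to it on each iteration.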