I have a large set of tweets on bigquery and now want to filter those that contain at least one of a list of hashtags. The hashtags are saved in an array column (uploaded from a list column in R). How can I select rows that contain one of multiple values in any place in that array?
Below the code that I would use for the analysis in R. Unsurprisingly, dbplyr
cannot translate the purrr
part, and I am happy to learn to create the SQL myself, but haven't yet found a good starting point. Thanks for any pointers.
PS: I have not yet uploaded the Tweets to bigquery, they currently live in 80 GB of RDS filed. If any simple data transformation would make this easier, I could still include that while uploading.
tweets_sample <- tibble::tribble(
~text, ~hashtags,
"Hello", list("World", "You"),
"Goodbye", list("Friend", "You"),
"Not", list("interested")
)
hashtag_list <- c("World", "interested")
tweets_sample %>% filter(purrr::map_lgl(hashtags, ~ .x %in% hashtag_list %>%
any()))