Filter tibble column to only include values found in separate tibble

Question

Question

I have a tibble containing basic stock symbol information (available here as a .csv file: https://www.nasdaq.com/market-activity/stocks/screener).

How do I filter this tibble (call it symbolData) for only the companies listed in a second, much smaller tibble (call it DowJones)? Note, the company names are not perfectly consistent between the datasets (i.e. 'Apple Inc. - Common Stock' in symbolData vs. 'Apple Inc.' in DowJones).

Reprex

#packages
library(dplyr)       library(tibble)
library(httr)        library(utils)
library(reshape2)    library(xml2)
library(rvest)

remove_arrows <- function(x) {sub("[[:space:]]↑", "", x)}

DowJones <- "https://en.wikipedia.org/wiki/Historical_components_of_the_Dow_Jones_Industrial_Average" %>% 
  GET(config = config(ssl_verifypeer = FALSE)) %>% read_html() %>% html_node(".wikitable") %>% html_table(fill = TRUE) %>% 
  as_tibble() %>%
  filter(!grepl('↓|Dropped', X1)) %>%
  rowid_to_column("index") %>% 
  melt(id.vars="index", value.name="Dow Jones Industrial Average") %>%
  select(-c("variable","index")) %>%
  mutate(across("Dow Jones Industrial Average", remove_arrows)) %>% as_tibble()

symbolData <- read.csv("~/nasdaq_screener.csv") %>% as_tibble()

> head(DowJones)
# A tibble: 6 × 1
  `Dow Jones Industrial Average`
  <chr>                         
1 3M Company                    
2 American Express Company      
3 Amgen Inc.                    
4 Apple Inc.                    
5 The Boeing Company            
6 Caterpillar Inc.

> head(symbolData)  
# A tibble: 6 × 11
  Symbol Name                                                   Last.Sale  Net.Change X..Change  Market.Cap Country         IPO.Year  Volume Sector Industry
  <chr>  <chr>                                                  <chr>           <dbl> <chr>           <dbl> <chr>              <int>   <int> <chr>  <chr>   
1 A      "Agilent Technologies Inc. Common Stock"               "$133.73 "     5.58   4.35%     40167959890 "United States"     1999 3144474 "Capi… "Electr…
2 AA     "Alcoa Corporation Common Stock "                      "$77.85 "      4.55   6.21%     14332165382 ""                  2016 7327361 "Basi… "Metal …
3 AAC    "Ares Acquisition Corporation Class A Ordinary Shares" "$9.76 "       0.01   0.10%      1220000000 ""                  2021   99883 "Fina… "Busine…
4 AACG   "ATA Creativity Global American Depositary Shares"     "$1.36 "       0.02   1.49%        42672611 "China"               NA    7920 "Misc… "Servic…
5 AACI   "Armada Acquisition Corp. I Common Stock"              "$9.81 "       0.01   0.10%       203160195 "United States"     2021     264 ""     ""      
6 AACIW  "Armada Acquisition Corp. I Warrant"                   "$0.23 "      -0.0599 -20.66%             0 "United States"     2021  184363 ""     ""

Previous Attempts

I've already tried a number of methods, including %in%/%chin%, grep/grepl, agrepl/agrep, str_detect, turning the DowJones data frame into a list, and all sorts of other methods I can't remember. Everything I've tried so far has returned either an empty tibble, or an error message related to differing column lengths. Some examples:

filter(symbolData, sapply(1:nrow(.), function(i) grepl(DowJones$`Dow Jones Industrial Average`[i], symbolData$Security.Name[i])))
#returns empty tibble

filter(symbolData, str_detect(symbolData$Security.Name, DowJones$`Dow Jones Industrial Average`) == TRUE)
Warning message:
In stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)) :
  longer object length is not a multiple of shorter object length
#returns empty tibble

filter(symbolData, unlist(Map(function(x, y) grepl(x, y), DowJones$`Dow Jones Industrial Average`, symbolData$Security.Name)))
Warning message:
In mapply(FUN = f, ..., SIMPLIFY = FALSE) :
  longer argument not a multiple of length of shorter
#returns empty tibble

filter(symbolData, map2_lgl(symbolData$Security.Name, DowJones$`Dow Jones Industrial Average`,  str_detect))
Error: Problem with `filter()` input `..1`.
ℹ Input `..1` is `map2_lgl(...)`.
x Mapped vectors must have consistent lengths:
* `.x` has length 5587
* `.y` has length 30

filter(symbolData, agrepl(DowJones$`Dow Jones Industrial Average`, symbolData$Security.Name, ignore.case = T, fixed = F))
#returns empty tibble

score 0 · Answer 1 · answered Feb 25 '22 at 22:21

How many should it return? This returns 7

symbolData %>% filter(str_remove_all(Security.Name, " - .*") %in% DowJones$`Dow Jones Industrial Average`)

# A tibble: 7 x 8
  Symbol Security.Name                                 Market.Category Test.Issue Financial.Status Round.Lot.Size ETF   NextShares
  <chr>  <chr>                                         <chr>           <chr>      <chr>                     <int> <chr> <chr>     
1 AAPL   Apple Inc. - Common Stock                     Q               N          N                           100 N     N         
2 AMGN   Amgen Inc. - Common Stock                     Q               N          N                           100 N     N         
3 CSCO   Cisco Systems, Inc. - Common Stock            Q               N          N                           100 N     N         
4 HON    Honeywell International Inc. - Common Stock   Q               N          N                           100 N     N         
5 INTC   Intel Corporation - Common Stock              Q               N          N                           100 N     N         
6 MSFT   Microsoft Corporation - Common Stock          Q               N          N                           100 N     N         
7 WBA    Walgreens Boots Alliance, Inc. - Common Stock Q               N          N                           100 N     N

There should be 30 total! But as I tested your solution, I realized my `symbolData` is incomplete, so I've found a new source and I'm updating my reprex to reflect that. However, even with the complete data set, there are still companies missing using this method. Thanks! I hadn't thought to alter the strings being searched/searched for... — SourceCoda, Feb 25 '22 at 22:32
you are pulling symbols from nasdaq only, so yeah, you will need to expand your source of SymbolData. — langtang, Feb 25 '22 at 22:38
Mhmm, exactly! So in the new `symbolData` I see all 30 companies listed, but I'm still only returning 15 of them when I use "[[:space:]]Common Stock" as the pattern in `str_remove_all`. Is there a way to remove multiple patterns at once using this function? — SourceCoda, Feb 25 '22 at 22:47

Filter tibble column to only include values found in separate tibble

Question

Reprex

Previous Attempts

1 Answers1