
I would like to merge 2 dataframes:

df1:

                                              cik0        cik1        cik2  
'MKTG, INC.'                            0001019056        None        None   
1 800 FLOWERS COM INC                   0001104659  0001437749        None   
11 GOOD ENERGY INC                      0000930413        None        None   
1347 CAPITAL CORP                       0001144204        None        None   
1347 PROPERTY INSURANCE HOLDINGS, INC.  0001387131        None        None 

df2:

              cik Ticker
0      0001144204   AABB
1      0001019056      A
2      0001387131   AABC
3      0001437749     AA
4      0000930413  AAACU

Expected result:

                                              cik0        cik1  cik2 ticker
'MKTG, INC.'                            0001019056        None  None      A
1 800 FLOWERS COM INC                   0001104659  0001437749  None     AA
11 GOOD ENERGY INC                      0000930413        None  None  AAACU
1347 CAPITAL CORP                       0001144204        None  None   AABB
1347 PROPERTY INSURANCE HOLDINGS, INC.  0001387131        None  None   AABC

I would like to match cik0 with df2['cik']; if there is no match, I would like to fall back to cik1, and so on.

Thanks for your help!

Roger
    SEC Edgar data! I work with this a lot. What's the end goal? (I upvoted @jpp's solution, but might be able to help out more generally if you need it) – David Jan 22 '19 at 14:44
  • Thanks @David! Generally, I downloaded SEC data from BigQuery, however the "numbers" dataset contains only the CIK number but not the ticker. Since I would like to add the stock prices to the dataset, I would need the tickers. But some companies have multiple CIK, so I'm trying to clean that up! – Roger Jan 22 '19 at 15:33
  • Got it. There will be loads of CIK numbers without tickers. For example, JP Morgan has a big, complex legal hierarchy, so they have multiple CIK numbers, but not all have tickers associated with them. If you need a good mapping, the most watertight way is probably to query 10-Q filings for each CIK and get the ticker from there. You will also need to look up which exchange the ticker is for, as you'll generally need a prefix when looking up market data to avoid mismatches where two companies have the same ticker on different exchanges.... – David Jan 24 '19 at 17:15

2 Answers


You can use pd.Series.map with fillna a few times:

ticker_map = df2.set_index('cik')['Ticker']

df1['ticker'] = df1['cik0'].map(ticker_map)\
                           .fillna(df1['cik1'].map(ticker_map))\
                           .fillna(df1['cik2'].map(ticker_map))

This, however, is a bit tedious. You can define a function to do this iteratively:

def apply_map_on_cols(df, cols, mapper):
    s = df[cols[0]].map(mapper)
    for col in cols[1:]:
        s = s.fillna(df[col].map(mapper))
    return s

df1['ticker'] = df1.pipe(apply_map_on_cols,
                        cols=[f'cik{i}' for i in range(3)],
                        mapper=df2.set_index('cik')['Ticker'])
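A compact variant of the same idea (not part of jpp's answer, just a sketch with trimmed sample data): map every cik column at once, then back-fill across columns and take the first value per row.

```python
import pandas as pd

# Trimmed sample data from the question
df1 = pd.DataFrame({
    'cik0': ['0001019056', '0001104659'],
    'cik1': [None, '0001437749'],
    'cik2': [None, None],
}, index=["'MKTG, INC.'", '1 800 FLOWERS COM INC'])
df2 = pd.DataFrame({'cik': ['0001019056', '0001437749'],
                    'Ticker': ['A', 'AA']})

ticker_map = df2.set_index('cik')['Ticker']

# Map each cik column against the lookup, giving one mapped column per cik column
mapped = df1[['cik0', 'cik1', 'cik2']].apply(lambda s: s.map(ticker_map))

# Back-fill left-to-right so the first successful match ends up in the first column
df1['ticker'] = mapped.bfill(axis=1).iloc[:, 0]
```

This avoids the chained `fillna` calls but computes the same "first column that matches" result.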
jpp
  • Thanks jpp, it works. Only problem is that my df2 dataframe doesn't seem to contain the good cik numbers... – Roger Jan 22 '19 at 15:30
  • @Roger, Then you should get `NaN` values for tickers that cannot be mapped. Input data isn't something I can help with, unfortunately. – jpp Jan 22 '19 at 15:31
  • 1
    Yes, thanks again! Anyway the solution your proposed works. I just need to fix the data ! – Roger Jan 22 '19 at 15:32

Another possibility is to use pd.merge to merge the dataframes:

dfs = []  # list to temporarily store partially merged dfs
df1.reset_index(inplace=True)  # keep the original index as a regular column
for col in ['cik0', 'cik1', 'cik2']:  # iterate over the cik columns only
    # append the rows that match df2['cik'] on this column
    dfs.append(pd.merge(df1, df2, left_on=col, right_on='cik'))
# concat all partial results:
df_result = pd.concat(dfs, axis=0)
df_result.set_index('index', inplace=True)  # restore the original index
df_result.drop('cik', axis=1, inplace=True)  # drop the redundant 'cik' column

This should be several times faster than any method using map when df1.shape[0] >> df1.shape[1] (>> means substantially larger), which should be true for most real use-case data sets.
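To make the approach above self-contained, here is a runnable sketch on a trimmed two-row sample of the question's data (only cik0 and cik1 are included, which is enough to exercise the fallback):

```python
import pandas as pd

# Trimmed sample data from the question
df1 = pd.DataFrame({
    'cik0': ['0001019056', '0001104659'],
    'cik1': [None, '0001437749'],
}, index=["'MKTG, INC.'", '1 800 FLOWERS COM INC'])
df2 = pd.DataFrame({'cik': ['0001019056', '0001437749'],
                    'Ticker': ['A', 'AA']})

dfs = []
df1 = df1.reset_index()  # keep the company names as a regular column
for col in ['cik0', 'cik1']:
    # inner merge keeps only the rows whose value in `col` matches df2['cik'];
    # rows with None in `col` simply don't match and fall through to the next column
    dfs.append(pd.merge(df1, df2, left_on=col, right_on='cik'))

df_result = pd.concat(dfs, axis=0)
df_result = df_result.set_index('index').drop('cik', axis=1)
```

Each company appears in exactly one partial merge (the first cik column that matches), so the concatenation reassembles the full frame with its ticker attached.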

JE_Muc