I'd like to split a column of a dataframe into two separate columns. Here is what my dataframe looks like (only the first 3 rows):

[Image: first three rows of the dataframe]

I'd like to split the column referenced_tweets into two columns, type and id, so that, for example, in the first row the value of type would be replied_to and the value of id would be 1253050942716551168.

Here is what I've tried:

df[['type', 'id']] = df['referenced_tweets'].str.split(',', n=1, expand=True)

but I get the error:

ValueError: Columns must be the same length as key

I think I get this error because the type in the referenced_tweets column is NOT always replied_to (e.g., it can be retweeted), and therefore the lengths would be different.

mOna

1 Answer

Why not get the values from the dict and add them to two new columns?

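def unpack_column(df_series, key):
    """Return `key` from the first dict in each cell; None for NaN or empty cells."""
    # isinstance() rather than pd.isna(): pd.isna() on a list returns an array,
    # which is ambiguous in a boolean test when the list has several elements.
    return [value[0][key] if isinstance(value, list) and value else None
            for value in df_series]


df['type'] = unpack_column(df['referenced_tweets'], 'type')
df['id'] = unpack_column(df['referenced_tweets'], 'id')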

or in a one-liner:

df[['type', 'id']] = pd.DataFrame([(x[0]['type'], x[0]['id']) if isinstance(x, list) else (None, None)
                                   for x in df['referenced_tweets']], index=df.index)
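
For anyone who wants to try this without the original data, here is a minimal, self-contained sketch. The sample frame is made up: it assumes each cell of referenced_tweets is either a list of dicts or NaN, that ids are stored as strings, and the `text` column, the second id, and the helper name `first_reference` are purely illustrative. Only the first referenced tweet in each list is used.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data shaped like the question's column: each cell is
# either a list of dicts or NaN (a tweet that references nothing).
df = pd.DataFrame({
    "text": ["a reply", "an original tweet", "a retweet"],
    "referenced_tweets": [
        [{"type": "replied_to", "id": "1253050942716551168"}],
        np.nan,
        [{"type": "retweeted", "id": "1253000000000000000"}],  # made-up id
    ],
})


def first_reference(value, key):
    """Return `key` from the first referenced tweet, or None if there is none."""
    if isinstance(value, list) and value:
        return value[0].get(key)
    return None  # NaN or empty list: keep the row, leave the new column empty


# Rows with NaN are kept, so the other columns stay available for them.
df["type"] = [first_reference(v, "type") for v in df["referenced_tweets"]]
df["id"] = [first_reference(v, "id") for v in df["referenced_tweets"]]

print(df[["text", "type", "id"]])
```

Checking `isinstance(value, list)` instead of dropping rows means no data is removed; rows without a referenced tweet simply get missing values in type and id.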
chatax
  • Thanks for your answer. I tried your code, but I get this error: `TypeError: 'float' object is not subscriptable` – mOna Aug 06 '21 at 20:23
  • Seems that you have some NaN values (missing data) in your column. You need to clean that first – chatax Aug 06 '21 at 20:24
  • Yes, you are right, but I can't remove those NaN values (I need the values of other columns where the `referenced_tweets` is NaN) – mOna Aug 06 '21 at 20:26
  • Of which column? And what does it contain? The same kind of dict lists as well? – chatax Aug 06 '21 at 20:34
  • @mOna my edit does take NaN into account and skips over it – chatax Aug 06 '21 at 20:42