1

I have column of dtype objects which look ostensibly like lists:

import pandas as pd
import numpy as np

raw = '/******/*******/******/data.txt'
df = pd.read_csv(raw, sep='\t')
df.head()

id  val_0  val_1  val_2  feat_0  feat_1  feat_2  \
0  a      2      0      2       2       2       0   
1  b      1     -1      1       1       1      -2   
2  c      0     -2     -2       0       2       1   
3  d     -1      1     -1      -1       1      -2   
4  e     -2      2      0      -2       0       2       

                              objs_0                             objs_1  \
0  [u'word_0', u'word_1', u'word_2']  [u'word_0', u'word_1', u'word_2']   
1  [u'word_0', u'word_1', u'word_2']  [u'word_0', u'word_1', u'word_2']   
2  [u'word_0', u'word_1', u'word_2']  [u'word_0', u'word_1', u'word_2']   
3  [u'word_0', u'word_1', u'word_2']  [u'word_0', u'word_1', u'word_2']   
4  [u'word_0', u'word_1', u'word_2']  [u'word_0', u'word_1', u'word_2']       

                              objs_2  
0  [u'word_0', u'word_1', u'word_2']  
1  [u'word_0', u'word_1', u'word_2']  
2  [u'word_0', u'word_1', u'word_2']  
3  [u'word_0', u'word_1', u'word_2']  
4  [u'word_0', u'word_1', u'word_2']  

df['objs_0'].values

array(["[u'word_0', u'word_1', u'word_2']",
   "[u'word_0', u'word_1', u'word_2']",
   "[u'word_0', u'word_1', u'word_2']",
   "[u'word_0', u'word_1', u'word_2']",
   "[u'word_0', u'word_1', u'word_2']"], dtype=object)

Ultimately, I need to convert this df to "long" format, and I want to run to run this using the code here: pandas: When cell contents are lists, create a row for each element in the list

But the problem is that I cannot convert these strings to lists.

I have already tried:

df['objs_0'] = df['objs_0'].apply(lambda row: list(row))
df['objs_0']

But this just breaks the entire string up by character. Also, my "string lists" are of unpredictable length, so I cannot rely on the str.partition() method. Any help on this would be greatly appreciated!

2 Answers2

1

You can use a pandas function, pd.wide_to_long:

df = pd.DataFrame({'id':[0,1],
                  'val_0':[2,1],
                  'val_1':[0,-1],
                  'feat_0':[2,1],
                  'feat_1':[2,1],
                  'objs_0':[['word_0','word_1'],['aword_0','aword_1']],
                  'objs_1':[['word_2','word_3'],['bword_0','bword_1']]})

pd.wide_to_long(df,['val','feat','objs'],'id','varcount','_','\w+').reset_index()

Output:

   id  varcount  val  feat                objs
0   0         0    2     2    [word_0, word_1]
1   1         0    1     1  [aword_0, aword_1]
2   0         1    0     2    [word_2, word_3]
3   1         1   -1     1  [bword_0, bword_1]

And, continuing on break make objs in the long format:

pd.DataFrame(df_out['objs'].values.tolist()).stack().to_frame(name='obj')\
  .reset_index(level=1, drop=True)\
  .join(df_out)

Output:

       obj  id  varcount  val  feat                objs
0   word_0   0         0    2     2    [word_0, word_1]
0   word_1   0         0    2     2    [word_0, word_1]
1  aword_0   1         0    1     1  [aword_0, aword_1]
1  aword_1   1         0    1     1  [aword_0, aword_1]
2   word_2   0         1    0     2    [word_2, word_3]
2   word_3   0         1    0     2    [word_2, word_3]
3  bword_0   1         1   -1     1  [bword_0, bword_1]
3  bword_1   1         1   -1     1  [bword_0, bword_1]
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
  • Thank you for your response; however, it was not able to work with my data. I did some more digging around and discovered that I needed to convert my "string lists" into actual lists. To do this, I used AST import ast def obj_to_list(row): string = str(row) list_ = ast.literal_eval(string) return list_ df['objs_0_cln'] = df[objs_0'].apply(lambda x: obj_to_list(x)) – datFrugalBoi Aug 20 '18 at 17:42
  • I think you might can achieve this using yaml also. import yaml. df = df.applymap(yaml.load) or try do just that column. df['objs'] = df['objs'].apply(yaml.load) – Scott Boston Aug 20 '18 at 18:05
0

You can use the following code for this condition

df[["new1","new2","new3"]]=pd.DataFrame(df.objs_0.values.tolist(),index=df.index)

so u can split the columns that are a list into new columns named new1,new2,new3 here

do realise that the number of columns that you are assigning should be exactly equal to the number of values the lists have in the column that you are splitting into new columns.

The code above will keep the original columns intact just add new columns in the dataframe, you can choose to drop the original columns if you like.

Inder
  • 3,711
  • 9
  • 27
  • 42