
I'm trying to transform a dataframe with this data:

test1   test2                 test3
test    [test1, test2]        [testbelongsto1, testbelongsto2]

into something like this:

test1   test2                 test3
test    test1                 testbelongsto1
test    test2                 testbelongsto2

I found this answer https://stackoverflow.com/a/38652414 -- it looks like exactly what I need, right? There are a lot of questions that answer my question...

However, whatever I try, I'm stuck with this error:

TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe'

with this function (from the linked answer):

def explode(self, df, columns):
    idx = np.repeat(df.index, df[columns[0]].str.len())
    a = df.T.reindex_axis(columns).values
    concat = np.concatenate([np.concatenate(a[i]) for i in range(a.shape[0])])
    p = pd.DataFrame(concat.reshape(a.shape[0], -1).T, idx, columns)
    return pd.concat([df.drop(columns, axis=1), p], axis=1).reset_index(drop=True)
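
For reference, this exact casting error is most often reported on Windows, where NumPy's default integer type is 32-bit, so the int64 counts returned by .str.len() cannot be cast under the 'safe' rule inside np.repeat. Below is a minimal, untested sketch of one commonly suggested workaround (casting the repeat counts explicitly); the 'users' column here is just a stand-in for the real data:

import numpy as np
import pandas as pd

# Toy frame standing in for the real data (hypothetical 'users' column of lists)
df = pd.DataFrame({'users': [[1, 2], [3]]})

counts = df['users'].str.len()                      # comes back as int64
idx = np.repeat(df.index, counts.astype('int32'))   # explicit int32 cast may avoid the rejected int64 -> int32 cast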

Important note: the data comes from the read_csv function. The columns I need to explode are strings, so I wrote this piece of code to convert them to lists:

from ast import literal_eval
df['users'] = df['users'].apply(literal_eval)

I've tried everything from converting dtypes to saving the data in other formats, but nothing solves the issue...

Please help

UPDATE: A 'real' dataset example of a few rows is displayed below ('test2' => 'users' and 'test3' => 'interests'; the arrays are the same size):

{'index': [0, 1, 2, 3, 4], 'Unnamed: 0': [0, 1, 4, 5, 6], 'users': ['[1, 1, 28, 28, 68]', '[1, 1, 16]', '[32, 37, 66, 67, 54, 117]', '[31, 37, 66, 67, 100, 113, 117]', '[32, 37, 66, 67, 54, 117]'], 'interests': ['[set(), set(), set(), set(), set()]', '[set(), set(), set()]', '[set(), set(), set(), set(), {1535, 1542, 1527}, set()]', '[set(), set(), set(), set(), set(), set(), set()]', '[set(), set(), set(), set(), {1535, 1542, 1527}, set()]']}

UPDATE 2: OK, this is exactly what I'm trying to get. The data I currently get:

index       lift        confidence         interests         users
0                                          {333, 333}        1
0                                          set()             22
0                                          set()             77
0           0           0.75               set()             88
4                                          set()             33
4           3           0.50               set()             44

So it seems like only the last row of each iteration gets added. This is what I want:

index       lift        confidence         interests         users
0           88          0.33               344               1
0           88          0.33               333               1
0           88          0.33               set()             22
0           88          0.33               set()             77
0           88          0.33               set()             88
4           38          0.50               set()             33
4           38          0.50               set()             44

So what I want is for each data row (series) to be repeated per user, and for the interests per user to be repeated as well.

iLuvCode

1 Answer


If you can trust that your data does not contain malicious strings, then you could convert the strings into Python objects using eval. Be very wary though -- eval'ing malicious strings can in theory run arbitrary code on your computer! (The safer ast.literal_eval does not work here, because it cannot parse the set() calls inside your interests strings.)
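
A quick sketch of that difference, using one of the interests strings from your sample data (this snippet is only illustrative, not part of the solution below):

from ast import literal_eval

s = '[set(), {1535, 1542, 1527}, set()]'

print(eval(s))        # a list of real set objects

try:
    literal_eval(s)
except ValueError as exc:
    print(exc)        # literal_eval rejects the set() calls ("malformed node or string")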

Having highlighted the danger of eval, you could parse and reshape your DataFrame using the apply(pd.Series) trick:

import pandas as pd

df = pd.DataFrame({'test': [0, 1, 4, 5, 6], 'test2': [0, 10, 40, 50, 60], 'users': ['[1, 1, 28, 28, 68]', '[1, 1, 16]', '[32, 37, 66, 67, 54, 117]', '[31, 37, 66, 67, 100, 113, 117]', '[32, 37, 66, 67, 54, 117]'], 'interests': ['[set(), set(), set(), set(), set()]', '[set(), set(), set()]', '[set(), set(), set(), set(), {1535, 1542, 1527}, set()]', '[set(), set(), set(), set(), set(), set(), set()]', '[set(), set(), set(), set(), {1535, 1542, 1527}, set()]']})

# Parse the string columns into real Python lists/sets; 'test' and 'test2' stay untouched
for col in df.columns.difference(['test', 'test2']):
    df[col] = df[col].apply(eval)

# Explode each list of sets into columns, then explode each set's elements into columns
interests = df['interests'].apply(pd.Series)
interests = interests.stack().apply(lambda x: pd.Series(list(x)))

# Explode the user lists into columns and stack them into one long Series
users = df['users'].apply(pd.Series)
users = users.stack()

# Align users and interests on the (row, position) index, stack once more,
# and forward-fill users so every exploded interest keeps its user
result = pd.concat({'users': users, 'interests': interests}, axis=1)
result = result.stack()
result['users'] = result['users'].ffill()

# Drop the helper index levels and join the reshaped columns back onto df
result.index = result.index.droplevel(level=[1, 2])
result = df.drop(['interests', 'users'], axis=1).join(result)
print(result)

yields

   test  test2  interests  users
0     0      0        NaN    1.0
0     0      0        NaN    1.0
0     0      0        NaN   28.0
0     0      0        NaN   28.0
0     0      0        NaN   68.0
1     1     10        NaN    1.0
1     1     10        NaN    1.0
1     1     10        NaN   16.0
2     4     40        NaN   32.0
2     4     40        NaN   37.0
2     4     40        NaN   66.0
2     4     40        NaN   67.0
2     4     40     1535.0   54.0
2     4     40     1542.0   54.0
2     4     40     1527.0   54.0
2     4     40        NaN  117.0
3     5     50        NaN   31.0
3     5     50        NaN   37.0
3     5     50        NaN   66.0
3     5     50        NaN   67.0
3     5     50        NaN  100.0
3     5     50        NaN  113.0
3     5     50        NaN  117.0
4     6     60        NaN   32.0
4     6     60        NaN   37.0
4     6     60        NaN   66.0
4     6     60        NaN   67.0
4     6     60     1535.0   54.0
4     6     60     1542.0   54.0
4     6     60     1527.0   54.0
4     6     60        NaN  117.0

The main idea is to use apply(pd.Series) to "explode" the lists into columns:

In [572]: interests = df['interests'].apply(pd.Series); interests
Out[572]: 
    0   1   2    3                   4    5    6
0  {}  {}  {}   {}                  {}  NaN  NaN
1  {}  {}  {}  NaN                 NaN  NaN  NaN
2  {}  {}  {}   {}  {1535, 1542, 1527}   {}  NaN
3  {}  {}  {}   {}                  {}   {}   {}
4  {}  {}  {}   {}  {1535, 1542, 1527}   {}  NaN

Since you wish to "explode" the sets as well, apply the pd.Series trick a second time:

In [573]: interests = interests.stack().apply(lambda x: pd.Series(list(x))); interests
Out[573]: 
          0       1       2
0 0     NaN     NaN     NaN
  1     NaN     NaN     NaN
  2     NaN     NaN     NaN
  3     NaN     NaN     NaN
  4     NaN     NaN     NaN
1 0     NaN     NaN     NaN
  1     NaN     NaN     NaN
  2     NaN     NaN     NaN
2 0     NaN     NaN     NaN
  1     NaN     NaN     NaN
  2     NaN     NaN     NaN
  3     NaN     NaN     NaN
  4  1535.0  1542.0  1527.0
  ...

After doing the same for the users column, combine both DataFrames into one:

result = pd.concat({'users': users, 'interests':interests}, axis=1)

Move the inner column index level to the index, and forward-fill the users column to propagate the user values when a user has multiple interests:

result = result.stack() 
result['users'] = result['users'].ffill()
#        interests  users
# 0 0 0        NaN    1.0
#   1 0        NaN    1.0
#   2 0        NaN   28.0
#   3 0        NaN   28.0
#   4 0        NaN   68.0
# 1 0 0        NaN    1.0
#   1 0        NaN    1.0
#   2 0        NaN   16.0
# 2 0 0        NaN   32.0
#   1 0        NaN   37.0
#   2 0        NaN   66.0
#   3 0        NaN   67.0
#   4 0     1535.0   54.0
#     1     1542.0   54.0
#     2     1527.0   54.0
# ...

Finally, drop the 2 inner-most index levels and join the result back into df:

result.index = result.index.droplevel(level=[1,2])
result = df.drop(['interests','users'], axis=1).join(result)
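
If (as mentioned in the comments below) you also want to drop the rows that came from empty interests sets, one possible extra step, shown here only as a sketch on top of the result above, is:

# Optional: keep only rows where an interest value was actually found
result = result.dropna(subset=['interests'])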
unutbu
  • Thank you for your answer. What is the column 'test'? It's the same value as "Unnamed: 0", but I can't access the column ;) @unutbu – iLuvCode Jul 22 '17 at 19:02
  • Solved that issue: `df['index'] = df['Unnamed: 0']`. I'm trying your code now – iLuvCode Jul 22 '17 at 19:10
  • I get the following error: `TypeError: eval() arg 1 must be a string, bytes or code object` @unutbu. I also have other data besides these two columns; is it possible to repeat those columns? When I try to do that now they are just empty. – iLuvCode Jul 22 '17 at 19:25
  • Ah, other columns that you don't want `eval`'d should be treated like `test`. I'll modify the example above to show what I mean. – unutbu Jul 22 '17 at 19:36
  • I edited my question again; I've tried to make it as clear as possible what I want :) @unutbu – iLuvCode Jul 22 '17 at 19:51
  • Please update the `df.head().to_dict('list')` output to show `lift` and `confidence` columns so we know what we are starting with. Then given the data you posted, show the desired output. – unutbu Jul 22 '17 at 20:03
  • Those are not really special, see my edited question; they're both floats. I almost got it working using your example (adding `lift` and `confidence` to melt's id_vars and the set_index function). The only part where I'm stuck is splitting the `interests` column the same way we did with `users`, so repeating the rows another time, but now with each row containing one interest, and filtering out the empty `interests` lists. – iLuvCode Jul 22 '17 at 20:19
  • Melting does not seem so attractive now that I see the `interests` and `users` columns have to be handled differently. The `interests` need to be "exploded" twice, while the `users` only once. So I've modified the code above accordingly. – unutbu Jul 22 '17 at 21:21