
I have the following dataframe:

network   date       count2    count3  user2     user3 
3         20170721   [6, 7]    [1,3]   [57,88]   [47,58] 
4         20170721   [6]       []      [12]      []
43        20170721   []        [7,2]   []        [57,62]

and I would like to split the lists into one element per row, keeping each count value aligned with its corresponding user value:

network   date       count2   count3  user2    user3 
3         20170721   6        NaN     57       NaN
3         20170721   7        NaN     88       NaN
3         20170721   NaN      1       NaN      47
3         20170721   NaN      3       NaN      58
4         20170721   6        NaN     12       NaN
43        20170721   NaN      7       NaN      57
43        20170721   NaN      2       NaN      62

How can I do this quickly? The user lists are in reality very long (more than 50k entries). Thank you!

eleanor
    What have you tried so far? Could you show me what the results are when you convert the dataframe into an array using .asarray? – Julian Rachman Jul 27 '17 at 16:04
  • 1
    Show us what you have done, even it's slow – minhhn2910 Jul 27 '17 at 16:06
  • What do you mean by "count and user must correspond"? You don't list any examples of corresponding count and user values except `Nan`. What is your expected outcome? What exactly does "split the list per row" mean? What happens to each row of data??? – A Magoon Jul 27 '17 at 16:06
  • I have tried this solution : https://stackoverflow.com/questions/33793622/pandas-dataframe-expand-rows-with-lists-to-multiple-row-with-desired-indexing-f for user 2 then user3 and concat the dataframe but it is not optimal – eleanor Jul 27 '17 at 16:07
  • @AMagoon. I think the example shows what is to be done pretty clearly. – Mad Physicist Jul 27 '17 at 16:07
  • @eleanor. Great. That fixes everything. Why are you even asking a question then? – Mad Physicist Jul 27 '17 at 16:08
  • Use this as well: http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.fillna.html – Julian Rachman Jul 27 '17 at 16:09
    Because groupby then concat is taking a lot of time and I want it to be as fast as possible – eleanor Jul 27 '17 at 16:10
  • Why do you need it to be so fast? Python is python. – Julian Rachman Jul 27 '17 at 16:16
  • @MadPhysicist Ya I see it now. What really threw me off was "count and user must **correspond**", as in match. – A Magoon Jul 27 '17 at 16:25
  • @AMagoon. Me too. I think the relatively poor prose can be forgiven given that expected results are provided. What can not be forgiven is the lack of any attempt to solve the problem: https://meta.stackoverflow.com/q/256328/2988730 – Mad Physicist Jul 27 '17 at 16:27

1 Answer


Here is one way you could do this and achieve the result you are looking for, without all the extra NaN rows:

import pandas as pd

df = pd.DataFrame({'network': [3, 4, 43],
                   'date': ['20170721'] * 3,
                   'count2': [[6, 7], [6], []],
                   'count3': [[1, 3], [], [7, 2]],
                   'user2': [[57, 88], [12], []],
                   'user3': [[47, 58], [], [57, 62]]})

df = df.set_index(['network', 'date'])

# Expand each column of lists into its own DataFrame (short lists are
# padded with NaN), then stack to get one value per row.
(df.apply(lambda x: pd.DataFrame(x.tolist(), index=x.index)
                      .stack()
                      .rename(x.name))
   .reset_index())

Output:

   network      date  level_2  level_0  count2  count3  user2  user3
0        3  20170721        0      0.0     6.0     1.0   57.0   47.0
1        3  20170721        1      NaN     7.0     3.0   88.0   58.0
2        4  20170721        0      1.0     6.0     NaN   12.0    NaN
3       43  20170721        0      2.0     NaN     7.0    NaN   57.0
4       43  20170721        1      NaN     NaN     2.0    NaN   62.0
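If you do need the exact layout shown in the question (separate rows for the count2/user2 and count3/user3 pairs), a variation on the same idea is to stack each count/user pair on its own and concatenate the results. This is a sketch, assuming each countN list always has the same length as its matching userN list, as in the sample data:

```python
import pandas as pd

df = pd.DataFrame({'network': [3, 4, 43],
                   'date': ['20170721'] * 3,
                   'count2': [[6, 7], [6], []],
                   'count3': [[1, 3], [], [7, 2]],
                   'user2': [[57, 88], [12], []],
                   'user3': [[47, 58], [], [57, 62]]})
df = df.set_index(['network', 'date'])

# Stack each count/user pair separately, so a count value only ever
# shares a row with its matching user value.
parts = [df[pair].apply(lambda x: pd.DataFrame(x.tolist(), index=x.index)
                                    .stack()
                                    .rename(x.name))
         for pair in (['count2', 'user2'], ['count3', 'user3'])]

result = (pd.concat(parts)
            .reset_index(level=2, drop=True)          # drop the per-list position
            .reset_index()
            .sort_values('network', kind='mergesort')  # stable sort keeps pair order
            .reset_index(drop=True))

print(result)
```

The stable sort keeps the count2/user2 rows ahead of the count3/user3 rows within each network, matching the ordering in the question; the unmatched columns in each part are filled with NaN by `pd.concat`.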
Scott Boston