0

I have a CSV that contains 1000 versions of the same df/table(each from different sources) with columns Name and Age. Here is a sample df to illustrate what this csv looks like

data = [['11', 'Nick', '10', 'Dave', '4', 'Greg'], ['7', 'Nick', '10', 'Steve', '4', 'Jeff'], 
        ['1', 'Brad', np.nan, 'Jim', '3', "jas"],['1','Steve', '2', 'Jack', np.nan,np.nan]]
  
# Create DataFrame
df = pd.DataFrame(data, columns=['Age','Name', 'Age', 'Name', 'Age', 'Name'])
  
# Print the output.
df

Result]

I want to take this data and make it look like the following - stacking each version of Name + Age beneath the one immediately preceding it to the left. So I go from 2000 columns to just two. Or in the example above, from 6 to 2.

Desired Result

***Name and Age headers are not in the right spot, but please disregard

There are two important caveats

  1. Each version of Name + Age needs to maintain it's unique order.
  2. I don't care about instances where Name and Age are both NaN and I can ignore instances where Name is NaN too.

I've tried

df = pd.concat([df["Name"].stack().reset_index(drop=True), df["Age"].stack().reset_index(drop=True)], axis=1, keys=["Name", "Age"])

However, this does things all out of order and the NaNs are screwing up my results. I end up with this

Bad Result]

sdpro19
  • 11
  • 2
  • Have you tried to stack without dropping na (default behaviour)? ie. `stack(dropna=False)`. Then the ordering should be preserved – LarryBird Mar 16 '23 at 01:07

1 Answers1

0

Given that you can do the operation in memory, you can simply reshape the arrays and get them back into Pandas with this 1 liner:

pd.DataFrame.from_dict({"Name": df.Name.values.reshape(-1, order="F"), "Age": df.Age.values.reshape(-1, order="F")})

That should get you the dataframe you want

cLwill
  • 102
  • 6