I have a CSV that contains 1000 versions of the same df/table(each from different sources) with columns Name and Age. Here is a sample df to illustrate what this csv looks like
data = [['11', 'Nick', '10', 'Dave', '4', 'Greg'], ['7', 'Nick', '10', 'Steve', '4', 'Jeff'],
['1', 'Brad', np.nan, 'Jim', '3', "jas"],['1','Steve', '2', 'Jack', np.nan,np.nan]]
# Create DataFrame
df = pd.DataFrame(data, columns=['Age','Name', 'Age', 'Name', 'Age', 'Name'])
# Print the output.
df
I want to take this data and make it look like the following - stacking each version of Name + Age beneath the one immediately preceding it to the left. So I go from 2000 columns to just two. Or in the example above, from 6 to 2.
***Name and Age headers are not in the right spot, but please disregard
There are two important caveats
- Each version of Name + Age needs to maintain it's unique order.
- I don't care about instances where Name and Age are both NaN and I can ignore instances where Name is NaN too.
I've tried
df = pd.concat([df["Name"].stack().reset_index(drop=True), df["Age"].stack().reset_index(drop=True)], axis=1, keys=["Name", "Age"])
However, this does things all out of order and the NaNs are screwing up my results. I end up with this