Is there a faster, less RAM-intensive way to pool data using Python?

Question

Consider:

https://kin-phinf.pstatic.net/20221001_267/1664597566757fY2pz_PNG/%C8%AD%B8%E9_%C4%B8%C3%B3_2022-10-01_001049.png?type=w750

I want to pool data as shown in the figure above, but my current approach takes too much time and RAM.

Can I make it faster or more memory-efficient?

My code is like this:

data = df.groupby(['Name', 'Age', 'Pet', 'Allergy']).apply(lambda x: pd.Series(range(x['Amount'].squeeze()))).reset_index()
data = data[['Name', 'Age', 'Pet', 'Allergy']]

This is an abbreviated example; my actual dataset is 3.5 GB, so it takes a really long time. Is there a faster way to do this?

please do not post links or images, a question should be self-contained — juanpa.arrivillaga, Jan 24 '23 at 23:20
Can you post that input as an initialized dataframe so we can experiment? I take it the goal is to duplicate rows based on the number in Amount? — tdelaney, Jan 24 '23 at 23:20
Please read; [Please do not upload images of code/data/errors.](//meta.stackoverflow.com/q/285551) *(Then replace the images with formatted text.)* Also useful to read; [mre] — MatBailie, Jan 24 '23 at 23:20
Please review *[Should we edit a question to transcribe code from an image to text?](https://meta.stackoverflow.com/questions/415040)* and *[Why not upload images of code/errors when asking a question?](https://meta.stackoverflow.com/questions/285551/)* (e.g., *"Images should only be used to illustrate problems that* ***can't be made clear in any other way,*** *such as to provide screenshots of a user interface."*) and [do the right thing](https://stackoverflow.com/posts/75228254/edit). Thanks in advance. — Peter Mortensen, Jan 28 '23 at 03:44

score 0 · Answer 1 · answered Jan 25 '23 at 00:38

You could preallocate the final dataframe, then iterate over the original dataframe, assigning each row to its slots in the final one.

import pandas as pd
import numpy as np

df = pd.DataFrame({"Name":["Male", "Female"],
    "Age":[29, 43], "Pet":["Cat", "Dog"],
    "Allergy":["Negative", "Positive"],
    "Amount":[2, 4]})

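# Keep the repeat counts aside, then drop the column (this mutates df in place)
amounts = df["Amount"]
df.drop("Amount", axis=1, inplace=True)
counts = amounts.sum()  # total number of rows in the result

# Preallocate an empty frame with one slot per output row
new_df = pd.DataFrame(columns=df.columns, index=np.arange(counts))
new_index = 0

# Copy each source row into `amount` consecutive slots
for amount, (_, row) in zip(amounts, df.iterrows()):
    for i in range(new_index, new_index + amount):
        new_df.iloc[i] = row
    new_index += amount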

del df, amounts, row

print(new_df)
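
If the Python-level loop turns out to be the bottleneck, one possible alternative is to build the repeated row positions in a single vectorized step with np.repeat and select them with iloc. The following is only a sketch, not a measured solution: the helper name expand_rows and its count_col parameter are made up for illustration, it assumes Amount holds non-negative integers, and the result still has to fit in memory.

```python
import numpy as np
import pandas as pd


def expand_rows(df, count_col="Amount"):
    """Hypothetical helper: repeat each row `count_col` times, then drop that column."""
    # Positions 0..n-1, each repeated by the matching count (assumed non-negative ints)
    positions = np.repeat(np.arange(len(df)), df[count_col].to_numpy())
    # Positional selection avoids surprises if the original index has duplicates
    expanded = df.iloc[positions].drop(columns=count_col)
    return expanded.reset_index(drop=True)


df = pd.DataFrame({"Name": ["Male", "Female"],
    "Age": [29, 43], "Pet": ["Cat", "Dog"],
    "Allergy": ["Negative", "Positive"],
    "Amount": [2, 4]})

new_df = expand_rows(df)
print(new_df)
```

Unlike the loop above, this leaves the original df untouched and keeps each column's dtype instead of producing object columns. Whether it is fast enough for a 3.5 GB input would need to be checked on the real data.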
