2

I have a huge pandas DataFrame df, sorted by id and then year:

id        gender        year
3         male          1983
3         male          1983
3         male          1985
3         male          1990
6         female        1991
6         female        1992
7         male          1980
...
592873    female        1989
592873    female        1996
593001    male          2001
593428    female        2007
593428    female        2009

My goal is to create another column ca which is computed by:

  • year - minimum year of that id

Hence, the first six rows of df should return:

id        gender        year        ca
3         male          1983        0
3         male          1983        0
3         male          1985        2
3         male          1990        7
6         female        1991        0
6         female        1992        1

(In other words, I'm searching for a Pythonic answer to this question.)


One solution I could think of is to make a list and use a for loop:

ca_list = []

for i in range(len(df)):
  if df['id'][i] != df['id'][i-1]:
    num = df['year'][i]
    ca_list.append(0)
  else:
    ca_list.append(df['year'][i] - num)

df['ca'] = ca_list

But I believe there is a more optimal way to devise this. Any insights are much appreciated.

smci
  • 32,567
  • 20
  • 113
  • 146
jstaxlin
  • 517
  • 4
  • 18
  • Since you're grouping by 'id' column, both pandas and R generally use [Split-Apply-Combine](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html) approaches; please skim that quickstart. – smci Jul 06 '21 at 01:06
  • Since your dataframe is already sorted by 'id' then 'year', the min(year) will occur on the first line for that 'id' value. So you could determine it at read time, by wrapping and chunking the reader code. And if you have an issue with memory, just chunk the reader code. – smci Jul 06 '21 at 01:11

1 Answers1

3

Try:

df["ca"] = df.groupby("id")["year"].transform(lambda x: x - x.min())
print(df)

Prints:

   id  gender  year  ca
0   3    male  1983   0
1   3    male  1983   0
2   3    male  1985   2
3   3    male  1990   7
4   6  female  1991   0
5   6  female  1992   1
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91
  • Thank you, that definitely works; out of curiosity, as it takes up to about a minute due to the sample size, is there another way to reduce the runtime? – jstaxlin Jul 06 '21 at 01:01
  • 1
    @raven What is the sample size? Try `df["ca"] = df["year"] - df.groupby("id")["year"].transform("min")` – Andrej Kesely Jul 06 '21 at 01:03
  • 1
    @raven: how large is your dataframe? (rows x columns)? If you want performance comparison of alternatives on large dataframes, please post code that generates a big seeded-random dataframe, using `np.random.seed()`. – smci Jul 06 '21 at 01:03
  • 1
    ...and if it's because you're reading in a huge dataframe which blows out memory, you could **chunk the read, and compute `min(year)` as you read it in** (it'll be the first line for that 'id' value, since your dataframe is sorted). So you wouldn't need to spend any extra memory for this computation. – smci Jul 06 '21 at 01:08