
I am new to pandas, so sorry for the naiveté.

I have two dataframes. One is out.hdf:

999999  2014    1   2   15  19  45.19   14.095  -91.528 69.7    4.5 0.0 0.0 0.0 603879074
999999  2014    1   2   23  53  57.58   16.128  -97.815 23.2    4.8 0.0 0.0 0.0 603879292
999999  2014    1   9   12  27  10.98   13.265  -89.835 55.0    4.5 0.0 0.0 0.0 603947030
999999  2014    1   9   20  57  44.88   23.273  -80.778 15.0    5.1 0.0 0.0 0.0 603947340

and another one is out.res (the first column is station name):

061Z    56.72   0.0 P   603879074
061Z    29.92   0.0 P   603879074
0614    46.24   0.0 P   603879292
109C    87.51   0.0 P   603947030
113A    66.93   0.0 P   603947030
113A    26.93   0.0 P   603947030
121A    31.49   0.0 P   603947340

The last column in both dataframes is an ID. I want to create a new dataframe that puts the rows with the same ID from the two dataframes together, in this way: first it writes a line from hdf, then it puts the lines from res with the same ID beneath it (dropping the ID column from res).

The new dataframe:

"999999 2014    1   2   15  19  45.19   14.095  -91.528 69.7    4.5 0.0 0.0 0.0 603879074"
061Z    56.72   0.0 P
061Z    29.92   0.0 P
"999999 2014    1   2   23  53  57.58   16.128  -97.815 23.2    4.8 0.0 0.0 0.0 603879292"
0614    46.24   0.0 P
"999999 2014    1   9   12  27  10.98   13.265  -89.835 55.0    4.5 0.0 0.0 0.0 603947030"
109C    87.51   0.0 P
113A    66.93   0.0 P
113A    26.93   0.0 P
"999999 2014    1   9   20  57  44.88   23.273  -80.778 15.0    5.1 0.0 0.0 0.0 603947340"
121A    31.49   0.0 P
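As a side note, the interleaving described above can also be sketched without re-reading the files, by grouping res by its ID column once and looking up each hdf ID in that grouping. This is a minimal, self-contained sketch using a few rows of the data shown above inlined via StringIO (the column positions 4 and 14 are taken from the question; everything else is illustrative):

```python
import pandas as pd
from io import StringIO

# Inline samples mirroring out.hdf and out.res above.
hdf = pd.read_csv(StringIO(
    "999999\t2014\t1\t2\t15\t19\t45.19\t14.095\t-91.528\t69.7\t4.5\t0.0\t0.0\t0.0\t603879074\n"
    "999999\t2014\t1\t2\t23\t53\t57.58\t16.128\t-97.815\t23.2\t4.8\t0.0\t0.0\t0.0\t603879292\n"
), sep="\t", header=None)
res = pd.read_csv(StringIO(
    "061Z\t56.72\t0.0\tP\t603879074\n"
    "061Z\t29.92\t0.0\tP\t603879074\n"
    "0614\t46.24\t0.0\tP\t603879292\n"
), sep="\t", header=None)

# Group res rows by ID (column 4) once, for cheap lookup per hdf row.
res_by_id = {k: g for k, g in res.groupby(4)}

lines = []
for _, row in hdf.iterrows():
    lines.append("\t".join(map(str, row)))            # full hdf line
    for _, r in res_by_id.get(row[14], pd.DataFrame()).iterrows():
        lines.append("\t".join(map(str, r.iloc[:4])))  # res line minus the ID

# lines now alternates each hdf row with its matching res rows
```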

My code to do this is:

import csv
import pandas as pd

path = './'
hdf = pd.read_csv(path + 'out.hdf', delimiter='\t', header=None)
res = pd.read_csv(path + 'out.res', delimiter='\t', header=None)


### creating input in the format of ph2dt-jp/ph
with open('./new_df', 'w', encoding='UTF8') as f:
    writer = csv.writer(f, delimiter='\t')

    for i in range(len(hdf)):
        # write the whole hdf row as a single tab-joined field
        writer.writerow(['\t'.join(map(str, hdf.iloc[i]))])
        # write matching res rows (ID is res column 4, hdf column 14), minus the ID
        for j in range(len(res)):
            if res.iloc[j, 4] == hdf.iloc[i, 14]:
                writer.writerow(res.iloc[j, [0, 1, 2, 3]])

The goal is to keep just the unique stations in the third dataframe. I used these commands on res to keep unique stations before creating the third dataframe:

res.drop_duplicates([0], keep = 'last', inplace = True)

and

res.groupby([0], as_index = False).last()

and it works fine. The problem is that for a large dataset, with thousands of lines, using these commands causes some lines of the res file to be omitted from the third dataframe. Could you please let me know what I should do to get the same result for a large dataset? I am going crazy, and thanks for your time and help in advance.

  • 2
    for the purpose of asking the question it might help you to shrink the problem down to its minimal size so that users can identify the issue for you. This is a useful guide: https://stackoverflow.com/help/minimal-reproducible-example – D.L Mar 13 '22 at 15:26
  • "it omits some station names" => I don't see that. More explanation required. – mcsoini Mar 13 '22 at 16:22
  • @mcsoini thanks I edited my post and showed this omitted station names. – Sonia Bazargan Mar 13 '22 at 16:32
  • Try hdf.groupby("station_name").last() to see if that changes anything (or whatever the name of the dataframe with station names is). – mcsoini Mar 13 '22 at 16:49
  • @mcsoini I added as_index: funiq_station = df.groupby([20], as_index = False).last() and it worked fine but just for a small set of data. For a large dataset, it worked like the drop_duplicates and omits some station names. – Sonia Bazargan Mar 13 '22 at 20:21
  • You'll need to provide a minimal reproducible example (see @D.L's comment) for people to work with, otherwise it's a shot in the dark. – mcsoini Mar 13 '22 at 20:54
  • @mcsoini. I did it. Hope it is a minimal reproducible example right now. – Sonia Bazargan Mar 13 '22 at 23:22

1 Answer


I found the problem, and I hope this is helpful for others in the future. In the large dataset, the duplicated stations repeated many times, but not consecutively. drop_duplicates() was keeping just one of them, whereas I wanted to drop only consecutive duplicate stations, not all of them. I did this using shift:

unique_stations = res.loc[res[0].shift() != res[0]]
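To illustrate the difference on a toy station column (the station names are taken from the question; the ordering is made up to show a non-adjacent repeat):

```python
import pandas as pd

# '061Z' repeats consecutively at the start and reappears later, non-adjacently.
res = pd.DataFrame({0: ['061Z', '061Z', '0614', '113A', '113A', '061Z']})

# drop_duplicates keeps one row per station overall...
all_unique = res.drop_duplicates([0], keep='last')

# ...while comparing each value to its shifted neighbor drops only
# consecutive repeats, so the later, non-adjacent '061Z' survives.
consecutive_unique = res.loc[res[0].shift() != res[0]]

print(list(all_unique[0]))          # ['0614', '113A', '061Z']
print(list(consecutive_unique[0]))  # ['061Z', '0614', '113A', '061Z']
```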