
I am trying to retrieve data every few hours. Since the data will contain a lot of duplicates, I want to remove all rows that share the same date and keep only the first instance. Here is my code to do this:

import datetime
import os
import sys
from os import path

import pandas as pd
import matplotlib.pyplot as plt
from pandas_datareader import data as web
from alpha_vantage.foreignexchange import ForeignExchange
from alpha_vantage.timeseries import TimeSeries

while True:
    if path.exists('stockdata.csv') == True:
        data1 = pd.read_csv('stockdata.csv', sep= ' ', index_col=0)
        ts = TimeSeries(key='1ORS1XLM1YK1GK9Y', output_format='pandas')
        data, meta_data = ts.get_intraday(symbol = 'spy', interval='1min', outputsize='full')
        data = data.rename(columns={ '1. open':'Open','2. high': 'High','3. low': 'Low', '4. close':'Close', '5. volume': 'Volume'})
        data1 = data1.append(data)
        data1 = data1.loc[data1.duplicated(keep='first'), :] # This should remove the rows which have duplicate dates
        data1.to_csv('stockdata.csv', sep= ' ')
        break
    else:
        data1 = pd.DataFrame()
        data1.to_csv('stockdata.csv', sep= ' ')

`data1 = data1.loc[data1.duplicated(keep='first'), :]` should remove one instance of the rows that have duplicate dates, but if I run the code twice I still end up with double the rows.

An example to demonstrate what I am trying to accomplish:

"2019-11-20 09:57:00" 311.405 311.44 311.4 311.41 42081.0
"2019-11-20 09:56:00" 311.38 311.425 311.33 311.41 63551.0
"2019-11-20 09:55:00" 311.34 311.4525 311.34 311.39 98011.0
"2019-11-20 09:53:00" 311.55 311.59 311.46 311.4832 125541.0
"2019-11-20 09:57:00" 311.405 311.44 311.4 311.41 42081.0
"2019-11-20 09:56:00" 311.38 311.425 311.33 311.41 63551.0
"2019-11-20 09:55:00" 311.34 311.4525 311.34 311.39 98011.0
"2019-11-20 09:54:00" 311.49 311.51 311.33 311.33 85857.0
"2019-11-20 09:53:00" 311.55 311.59 311.46 311.4832 125541.0

As you can see, the data is being appended, so in the example above each row appears twice. New unique entries will appear as fresh data is appended, so I am trying to remove all duplicates and keep only one copy of each row.

If I run `data1.loc[data1.index.duplicated(keep='last'), :]` in a different cell, all the entries are removed.
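For what it's worth, the selection behaviour can be reproduced on a tiny frame (a minimal sketch with made-up values): `df.loc[df.duplicated(keep='first'), :]` *selects* the duplicate rows rather than dropping them, which is why the file keeps growing.

```python
import pandas as pd

# Minimal frame: the 09:57:00 row appears twice (made-up values)
idx = pd.to_datetime(["2019-11-20 09:57:00",
                      "2019-11-20 09:56:00",
                      "2019-11-20 09:57:00"])
df = pd.DataFrame({"Close": [311.41, 311.38, 311.41]}, index=idx)

# duplicated() marks repeats, so .loc[...] keeps ONLY the repeated rows
dupes_only = df.loc[df.duplicated(keep='first'), :]
print(len(dupes_only))  # 1 -- just the second 09:57:00 row

# Negating an index-based mask keeps the first copy of each timestamp
deduped = df.loc[~df.index.duplicated(keep='first'), :]
print(len(deduped))  # 2
```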

    you really need to post a snippet of code that anyone can copy and paste to see your actual vs expected output .... no one can duplicate your code since it uses data not included ... **just include a small dataframe or series and what you get vs what you expect** .... this wall of text is hard to read through and grok – Joran Beasley Nov 27 '19 at 21:11
    @JoranBeasley The reason I have not included a snippet of the dataframe is that in its current form it contains over 3731 rows and some are duplicates. –  Nov 27 '19 at 21:29
    so make one that looks like `[[1,2,3],[2,3,4]]` instead and use that to demonstrate the problem ... surely you don't need 4000 rows nor the real data entries to show the problem – Joran Beasley Nov 27 '19 at 21:33

1 Answer


Consider the following DataFrame (2 rows for each day):

Dat               Val
2019-06-03 12:07  20
2019-06-03 16:10  23
2019-06-04 10:04  30
2019-06-04 14:47  32
2019-06-06 11:42  49
2019-06-06 18:17  51

To leave only the first reading from each day, you can:

  • group by Dat column with D frequency,
  • print only the first row from each group,
  • drop rows from dates where there was no input row (optional).

The code to do it is:

df.groupby(pd.Grouper(key='Dat', freq='D')).first().dropna()

The result is:

             Val
Dat             
2019-06-03  20.0
2019-06-04  30.0
2019-06-06  49.0

If your input column is of integer type and you want to keep it as integer, you can append `.astype(int)` to the above code.
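A self-contained version of the above, building the sample frame inline so it can be run as-is:

```python
import pandas as pd

df = pd.DataFrame({
    "Dat": pd.to_datetime(["2019-06-03 12:07", "2019-06-03 16:10",
                           "2019-06-04 10:04", "2019-06-04 14:47",
                           "2019-06-06 11:42", "2019-06-06 18:17"]),
    "Val": [20, 23, 30, 32, 49, 51],
})

# First reading per calendar day; dropna removes the empty 2019-06-05 slot
result = df.groupby(pd.Grouper(key="Dat", freq="D")).first().dropna()

# Restore the integer dtype lost through the NaN intermediate
result = result.astype(int)
print(result)  # Val is [20, 30, 49] for the three non-empty days
```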

Edit

If the above solution runs too slowly, try:

df.groupby(df.Dat.dt.date).first().drop(columns='Dat')

This time no "empty dates" are inserted, so there is no need for `dropna`.

Valdi_Bo