In Python, I have a DataFrame with a 'Date' column (format e.g. 2020-06-26), sorted in descending order: 2020-06-26, 2020-06-25, 2020-06-24... The other column, 'Reviews', contains text reviews of a website. There can be several reviews on one date and none on another. I want to find which dates are missing from 'Date' and then, for each missing date, add one row with the date in `format='%Y-%m-%d'` and an empty 'Reviews' value, so that I can plot them. How should I do this?

from datetime import date, timedelta
d = data['Date'] 
print(d[0])
print(d[-1])

date_set = set(d[-1] + timedelta(x) for x in range((d[0] - d[-1]).days))
missing = sorted(date_set - set(d))
missing = pd.to_datetime(missing, format='%Y-%m-%d')

idx = pd.date_range(start=min(data.Date), end=max(data.Date), freq='D')

#tried this
data = data.reindex(idx, fill_value=0)
data.head()

#Got TypeError: 'fill_value' ('0') is not in this Categorical's categories.


#also tried this
df2 = (pd.DataFrame(data.set_index('Date'), index=idx).fillna(0) + data.set_index('Date')).ffill().stack()
df2.head()

#Got ValueError: cannot reindex from a duplicate axis
uk_butterfly

1 Answer

This is my code:

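from datetime import timedelta

# Assumes df has a default 0..n-1 index and 'Date' sorted in descending order.
for i in range(1, len(df)):
    prev = df.loc[i - 1, "Date"]
    current = df.loc[i, "Date"]
    # Append one row with an empty review for each date strictly between the two.
    for a in range(1, (prev - current).days):
        df.loc[len(df)] = [prev - timedelta(days=a), None]

df = df.sort_values("Date", ascending=False)
print(df)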
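
As a possible alternative to the loop above for finding the gaps (a sketch, not part of the original answer), the missing dates can be computed as a difference between a full daily range and the dates that occur. The sample frame is made up for illustration, and the sketch assumes 'Date' already has a datetime64 dtype.

```python
import pandas as pd

# Hypothetical sample data: descending dates, with a duplicate and gaps.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-06-26", "2020-06-26", "2020-06-24", "2020-06-21"]),
    "Reviews": ["great", "ok", "bad", "fine"],
})

# Every calendar day between the earliest and the latest review.
all_days = pd.date_range(df["Date"].min(), df["Date"].max(), freq="D")

# Days in the full range that never occur in the 'Date' column.
missing = all_days.difference(pd.DatetimeIndex(df["Date"]))

print(missing.strftime("%Y-%m-%d").tolist())
```

Duplicate dates in the column do not matter here, because the difference only checks membership.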
NumberC
I have tried this and it worked partially. All rows with the same date end up in a random order: each time I run the program, the 'Reviews' rows that share a 'Date' come out in a different index order. I want to keep the initial order within each day ('Date') and insert the missing dates where appropriate. How can I do this? If I use `df = df.reset_index(drop=True)`, the initial order is not kept... – uk_butterfly Jul 14 '20 at 09:37
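
One likely cause of the changing order is that `sort_values` defaults to `kind='quicksort'`, which is not a stable sort, so rows with equal dates may be reordered. A minimal sketch of one way around this, assuming 'Date' is datetime64 and using made-up sample data: build the filler rows separately, concatenate, and sort with the stable `kind='mergesort'`.

```python
import pandas as pd

# Hypothetical sample data: the within-day order a, b and c, d should survive.
df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2020-06-26", "2020-06-26", "2020-06-24", "2020-06-24", "2020-06-21"]
    ),
    "Reviews": ["a", "b", "c", "d", "e"],
})

# Dates in the full daily range that have no review.
all_days = pd.date_range(df["Date"].min(), df["Date"].max(), freq="D")
missing = all_days.difference(pd.DatetimeIndex(df["Date"]))

# One filler row per missing date, with an empty review.
filler = pd.DataFrame({"Date": missing, "Reviews": ""})

# A stable sort keeps rows with equal dates in their original relative order.
combined = (
    pd.concat([df, filler], ignore_index=True)
    .sort_values("Date", ascending=False, kind="mergesort")
    .reset_index(drop=True)
)

print(combined)
```

Calling `reset_index(drop=True)` after the stable sort only renumbers the rows; it does not change their order.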