1

Goal: From an excel file, I want to get all the records which have dates that fall within a range and write them to a new excel file. The infile I'm working with has 500K+ rows and 21 columns.

What I've tried: I've read the infile to a Pandas dataframe then returned the DatetimeIndex. If I print the range variable I get the desired records.

import pandas as pd

in_excel_file = r'path\to\infile.xlsx'
out_excel_file = r'path\to\outfile.xlsx'

df = pd.read_excel(in_excel_file)
range = (pd.date_range(start='1910-1-1', end='2021-1-1'))
print(range)

##prints
DatetimeIndex(['1990-01-01', '1990-01-02', '1990-01-03', '1990-01-04',
               '1990-01-05', '1990-01-06', '1990-01-07', '1990-01-08',
               '1990-01-09', '1990-01-10',
               ...
               '2020-12-23', '2020-12-24', '2020-12-25', '2020-12-26',
               '2020-12-27', '2020-12-28', '2020-12-29', '2020-12-30',
               '2020-12-31', '2021-01-01'],
              dtype='datetime64[ns]', length=11324, freq='D')

Where I'm having trouble is getting the above DatetimeIndex to the outfile. The following gives me an error:

range.to_excel(out_excel_file, index=False)
AttributeError: 'DatetimeIndex' object has no attribute 'to_excel'

I'm pretty sure that when writing to excel it has to be a dataframe. So, my question is how do I get the range variable to a dataframe object?

Pfalbaum
  • 586
  • 3
  • 10
  • 26
  • 1
    It seems that you need to look up how to put a sequence of values into a data frame. You may need to convert to a list, but it's a straightforward process. Is that enough of a hint? – Prune May 20 '21 at 21:51

2 Answers2

1

Goal: From an excel file, I want to get all the records which have dates that fall within a range and write them to a new excel file. The infile I'm working with has 500K+ rows and 21 columns.

You could use an indexing operation to select only the data you need from the original DataFrame and save the result in an Excel file.

In order to do that first you need to check if the date column from your original DataFrame is already converted to a datetime/date object:

import numpy as np

date_column = "date"  # Suppose this is your date column name
if not np.issubdtype(df[date_column].dtype, np.datetime64):
    df.loc[:, date_column] = pd.to_datetime(df[date_column], format="%Y-%m-%d")

Now you can use a regular indexing operation to get all values you need:

mask = (df[date_column] >= '1910-01-01') & (df[date_column] <= '2021-01-01')  # Creates mask for date range
out_dataframe = df.loc[mask]  # Here we select the indices using our mask
out_dataframe.to_excel(out_excel_file)
guimorg
  • 85
  • 4
  • Thanks, that worked! One thing though, I'm getting the time too. Here's how a field looks in Excel ```1975-04-03 00:00:00```. I'll have to look for a way to change the format to only show only Y-m-d. – Pfalbaum May 21 '21 at 15:49
  • 1
    OK! I've just added a little fix for the response. The function `pd.to_datetime` takes a `format` argument. Specifying the date format will fix your issue – guimorg May 21 '21 at 20:58
1

You can try to create a dataframe from the DatetimeIndex before writing it to Excel, as follows:

range_df = pd.DataFrame(index=range).rename_axis(index='range').reset_index()

or as suggested by @guimorg, we can also do it as:

range_df = range.to_frame(index=False, name='range')

Then, continue with your code to write it to Excel:

range_df.to_excel(out_file, index=False)
SeaBean
  • 22,547
  • 3
  • 13
  • 25
  • One could also use simply `range.to_frame(index=False)` here. – guimorg May 20 '21 at 22:37
  • 1
    @guimorg Nice suggestion! Thank you very much! – SeaBean May 20 '21 at 22:45
  • @guimorg ```range.to_frame(index=false)``` got me the desired fields, but by themselves. I need those fields along with the rest of the columns. – Pfalbaum May 21 '21 at 14:31
  • @SeaBean your suggestion resulted in the same thing. After writing to Excel I only have the range column, but I need its corresponding columns too. – Pfalbaum May 21 '21 at 14:38
  • @Pfalbaum check my other answer that it may help you. – guimorg May 21 '21 at 15:41
  • 1
    @Pfalbaum Oh, I overlooked your ultimate goal but concentrated on the problem you faced on your trial. You can refer to @guimorg answer which is in the right direction for your problem. One thing you can also try is that now you created a date range, you can make use of it by checking the date is in the range. e.g. `mask = df[date_column].isin(range)` In this way, you can make use of the defined range and no need to hard code dates in the boolean mask. Good luck! – SeaBean May 21 '21 at 18:11