-1

I have data stored in a S3 bucket which uses "yyyy/MM/dd" format to store the files per date, like in this sample S3a path: s3a://mybucket/data/2018/07/03. The files in these buckets are in json.gz format and I would like to import all these files to a spark dataframe per day. After that I want to feed these spark dfs to some written code via a for loop:

for date in date_range:
    s3a = 's3a://mybucket/data/{}/{}/{}/*.json.gz'.format(date.year, date.month, date.day)
    df = spark.read.format('json').option("header", "true").load(s3a)
    # Execute code here

In order to read the data, I tried to format the date_range like below:

from datetime import datetime
import pandas as pd
def return_date_range(start_date, end_date):
    return pd.date_range(start=start_date, end=end_date).to_pydatetime().tolist()

date_range = return_date_range(start_date='2018-03-06', end_date='2018-03-12')
date_range

[datetime.datetime(2018, 3, 6, 0, 0),
 datetime.datetime(2018, 3, 7, 0, 0),
 datetime.datetime(2018, 3, 8, 0, 0),
 datetime.datetime(2018, 3, 9, 0, 0),
 datetime.datetime(2018, 3, 10, 0, 0),
 datetime.datetime(2018, 3, 11, 0, 0),
 datetime.datetime(2018, 3, 12, 0, 0)]

The problem is that pydatetime() returns the days and months without a '0'. How do I make sure that my code returns a list of values with '0's, like below:

[datetime.datetime(2018, 03, 06, 0, 0),
 datetime.datetime(2018, 03, 07, 0, 0),
 datetime.datetime(2018, 03, 08, 0, 0),
 datetime.datetime(2018, 03, 09, 0, 0),
 datetime.datetime(2018, 03, 10, 0, 0),
 datetime.datetime(2018, 03, 11, 0, 0),
 datetime.datetime(2018, 03, 12, 0, 0)]
sampeterson
  • 459
  • 4
  • 16

1 Answers1

0

This is one approach using .strftime("%Y/%m/%d")

Ex:

from datetime import datetime
import pandas as pd
def return_date_range(start_date, end_date):
    return pd.date_range(start=start_date, end=end_date).strftime("%Y/%m/%d").tolist()

date_range = return_date_range(start_date='2018-03-06', end_date='2018-03-12')
print(date_range)

Output:

['2018/03/06',
 '2018/03/07',
 '2018/03/08',
 '2018/03/09',
 '2018/03/10',
 '2018/03/11',
 '2018/03/12']

for date in date_range:
    s3a = 's3a://mybucket/data/{}/*.json.gz'.format(date)
    print(s3a)

s3a://mybucket/data/2018/03/06/*.json.gz
s3a://mybucket/data/2018/03/07/*.json.gz
s3a://mybucket/data/2018/03/08/*.json.gz
s3a://mybucket/data/2018/03/09/*.json.gz
s3a://mybucket/data/2018/03/10/*.json.gz
s3a://mybucket/data/2018/03/11/*.json.gz
s3a://mybucket/data/2018/03/12/*.json.gz
Rakesh
  • 81,458
  • 17
  • 76
  • 113