0

I am trying to format the column 'Data' to make a pattern with dates.

The formats I have are:

1/30/20 16:00
1/31/2020 23:59
2020-02-02T23:43:02

Here is the code for the dataframe.

import requests
import pandas as pd
import numpy as np
url = "https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports"
csv_only  = [i.split("=")[1][1:-1] for i in requests.get(url).text.split(" ") if '.csv' in i and 'title' in i]

combo = [pd.read_csv(url.replace("github","raw.githubusercontent").replace("/tree/","/")+"/"+f) for f in csv_only]

one_df = pd.concat(combo,ignore_index=True)

one_df["País"] = one_df["Country/Region"].fillna(one_df["Country_Region"])
one_df["Data"] = one_df["Last Update"].fillna(one_df["Last_Update"])

I tried adding the code bellow but it doesn't bring the result I wanted

pd.to_datetime(one_df['Data'])
one_df.style.format({"Data": lambda t: t.strftime("%m/%d/%Y")})

Any help?

UPDATE

This is the complete code, but it doesn't work. Many exceptions printed with different date formats.

import requests
import pandas as pd
import numpy as np
from datetime import datetime
url = "https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports"
csv_only  = [i.split("=")[1][1:-1] for i in requests.get(url).text.split(" ") if '.csv' in i and 'title' in i]

combo = [pd.read_csv(url.replace("github","raw.githubusercontent").replace("/tree/","/")+"/"+f) for f in csv_only]

one_df = pd.concat(combo,ignore_index=True)

df = pd.DataFrame()
DATE_FORMATS = ["%m/%d/%y %H:%M", "%m/%d/%Y %H:%M", "%Y-%m-%dT%H:%M:%S", "%Y-%m-%d %H:%M:%S", "%Y-%m-%d %H:%M:%S", "%Y-%m-%d  %H:%M:%S"]

df["Região"] = one_df["Province/State"].fillna(one_df["Admin2"])
df["País"] = one_df["Country/Region"].fillna(one_df["Country_Region"])
df["Data"] = one_df["Last Update"].fillna(one_df["Last_Update"])
df["Confirmados"] = one_df["Confirmed"]
df["Mortes"] = one_df["Deaths"]
df["Recuperados"] = one_df["Recovered"]

def parse(x_):
    for fmt in DATE_FORMATS :
        try:
            tmp = datetime.strptime(x_, fmt).strftime("%m/%d/%Y")
            return tmp
        except ValueError:
            print(x_)

pd.to_datetime(df['Data'])
df['Data'] = df['Data'].apply(lambda x: parse(x))

#df['Data'].strftime('%m/%d/%Y')
#df['Data'] = df['Data'].map(lambda x: x.strftime('%m/%d/%Y') if x else '')

df.to_excel(r'C:\Users\guilh\Downloads\Covid2\Covid-19.xlsx', index=False,  encoding="utf8")
print(df)
FObersteiner
  • 22,500
  • 8
  • 42
  • 72
guialmachado
  • 506
  • 5
  • 17

2 Answers2

3
from datetime import datetime
import pandas as pd

You could save all possible formats in a list as -

DATE_FORMATS = ["%Y-%m-%d %H:%M:%S", "%Y-%m-%dT%H:%M:%S", "%m/%d/%y %H:%M", "%m/%d/%Y %H:%M"]

Define a function that loops through the formats and tries to parse it. (Fixed a bug, where the print statement should have been outside the for loop)

issues = set()
def parse(x_):
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(x_, fmt).strftime("%m/%d/%Y")
        except ValueError:
            pass
    issues.add(x_)


sample = ["1/30/20 16:00", "1/31/2020 23:59", "2020-02-02T23:43:02"]

df = pd.DataFrame({'data': sample})
df['data'] = df['data'].apply(lambda x: parse(x))

assert df['Data'].isna().sum() == len(issues) == 0, "Issues observed, nulls observed in dataframe"

print("Done")

Output

         data
0  01/30/2020
1  01/31/2020
2  02/02/2020

If df.apply() comes across a particular date format that hasn't been defined in the list, it would simply print None since nothing would be returned by the function parse()

tidakdiinginkan
  • 918
  • 9
  • 18
  • How could I add a line to print the exception dates so I can map them in DATE_FORMATS? I can't get None on the dates since the data analysis depends on precise dates and countries – guialmachado Apr 25 '20 at 14:56
  • within the `except` clause you can add a `print(x_)` (remove the `pass`), this will print the date with the unrecognized format. – tidakdiinginkan Apr 25 '20 at 16:21
  • I got many exceptions, with `2020-02-24T14:43:03`, `2020-02-24 14:43:03` and other date formats. I updated the question with my complete code including this answer of yours, am i using it wrong? – guialmachado Apr 25 '20 at 17:59
  • @guialmachado So, basically there was a flaw in my code, all the dates were being parsed but were needlessly printed out due to the incorrect position of the statement `print(x_)`, kindly check my edited code now. – tidakdiinginkan Apr 25 '20 at 20:13
0

also here, letting pd.to_datetime infer the format does the trick:

import pandas as pd

s = pd.to_datetime(["1/30/20 16:00", "1/31/2020 23:59", "2020-02-02T23:43:02"])

print(s)
# DatetimeIndex(['2020-01-30 16:00:00', '2020-01-31 23:59:00',
#                '2020-02-02 23:43:02'],
#               dtype='datetime64[ns]', freq=None)

Note that if your date/time format generally provides the day first (e.g. 30.1.2021 for Jan 30th 2021), set keyword dayfirst=True.

FObersteiner
  • 22,500
  • 8
  • 42
  • 72