The problem you’re having is that cells in excel have datatypes. So here the data type is a date or a time, and it’s formatted for display only. Loading it “directly” means loading a datetime type*.
This means that, whatever you do with the dtype=
argument, the data will be loaded as a date, and then converted to string, giving you the result you see:
>>> pd.read_excel('test.xlsx').head()
Date Time Datetime
0 2020-03-08 10:00:00 2020-03-08 10:00:00
1 2020-03-09 11:00:00 2020-03-09 11:00:00
2 2020-03-10 12:00:00 2020-03-10 12:00:00
3 2020-03-11 13:00:00 2020-03-11 13:00:00
4 2020-03-12 14:00:00 2020-03-12 14:00:00
>>> pd.read_excel('test.xlsx').dtypes
Date datetime64[ns]
Time object
Datetime datetime64[ns]
dtype: object
>>> pd.read_excel('test.xlsx', dtype='string').head()
Date Time Datetime
0 2020-03-08 00:00:00 10:00:00 2020-03-08 10:00:00
1 2020-03-09 00:00:00 11:00:00 2020-03-09 11:00:00
2 2020-03-10 00:00:00 12:00:00 2020-03-10 12:00:00
3 2020-03-11 00:00:00 13:00:00 2020-03-11 13:00:00
4 2020-03-12 00:00:00 14:00:00 2020-03-12 14:00:00
>>> pd.read_excel('test.xlsx', dtype='string').dtypes
Date string
Time string
Datetime string
dtype: object
Only in csv files are datetime data stored as string in the file. There, loading it “directly” as a string makes sense. In an excel file, you may as well load it as a date and format it with .dt.strftime()
That’s not to say that you can’t load the data as it is formatted, but you’ll need 2 steps:
- load data
- re-apply formatting
There is some translation to be done between formatting types, and you can’t use pandas directly − however you can use the engine that pandas uses as a backend:
import datetime
import openpyxl
import re
date_corresp = {
'dd': '%d',
'mm': '%m',
'yy': '%y',
'yyyy': '%Y',
}
time_corresp = {
'hh': '%h',
'mm': '%M',
'ss': '%S',
}
def datecell_as_formatted(cell):
if isinstance(cell.value, datetime.time):
dfmt, tfmt = '', cell.number_format
elif isinstance(cell.value, (datetime.date, datetime.datetime)):
dfmt, tfmt, *_ = cell.number_format.split('\\', 1) + ['']
else:
raise ValueError('Not a datetime cell')
for fmt in re.split(r'\W', dfmt):
if fmt:
dfmt = re.sub(f'\\b{fmt}\\b', date_corresp.get(fmt, fmt), dfmt)
for fmt in re.split(r'\W', tfmt):
if fmt:
tfmt = re.sub(f'\\b{fmt}\\b', time_corresp.get(fmt, fmt), tfmt)
return cell.value.strftime(dfmt + tfmt)
Which you can then use as follows:
>>> wb = openpyxl.load_workbook('test.xlsx')
>>> ws = wb.worksheets[0]
>>> datecell_as_formatted(ws.cell(row=2, column=1))
'08/03/20'
(You can also complete the _corresp
dictionaries with more date/time formatting items if they are incomplete)
* It is stored as a floating-point number, which is the number of days since 1/1/1900, as you can see by formatting a date as number or on this excelcampus page.