3

I have been trying to use panda to parse date as string from my csv.

What I am trying to achieve here is to extract a row (from the csv) which contains two type of data, string and integer.

From here, I want to use docx-mailmerge to produce Microsoft Word reports.

For context, I was referring to the solutions here when figuring out: datetime dtypes in pandas read_csv

Thank you.

Here is error I am getting:

ValueError: invalid literal for int() with base 10: 'Num'

Here is my code:

import pandas as pd   

headers = ['AA', 'Num']
dtypes = {'AA': 'str', 'Num': 'int'}
pdate = ['AA']

df = pd.read_csv('test1.csv', 
header=None,
names = headers,
dtype = dtypes,
parse_dates = pdate, 
date_parser=None) 

top = df.head()
print(top)

Here is my csv file:

AA,Num
1/1/2020,5
2/1/2020,10
3/1/2020,15
4/1/2020,20
5/1/2020,25
6/1/2020,30
7/1/2020,35
8/1/2020,40
9/1/2020,45
10/1/2020,50
11/1/2020,55

2 Answers2

1

I think problem is parameter header=None, it is used if no headers in csv. So remove it. Also parameter names = headers should be omit, if need parse all columns:

#here is set new columns names, so original headers are shift to first row of data
df = pd.read_csv('test1.csv',
header=None,
names = headers,
#dtype = dtypes,
#parse_dates = pdate, 
date_parser=None) 

top = df.head()
print(top)
         AA  Num
0        AA  Num
1  1/1/2020    5
2  2/1/2020   10
3  3/1/2020   15
4  4/1/2020   20

Correct parsing:

df = pd.read_csv('test1.csv',
#header=None,
#names = headers,
dtype = dtypes,
parse_dates = pdate, 
date_parser=None) 

top = df.head()
print(top)
          AA  Num
0 2020-01-01    5
1 2020-02-01   10
2 2020-03-01   15
3 2020-04-01   20
4 2020-05-01   25

If need filter only some columns instead names = headers use usecols=headers parameter.

jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • 1
    Thank you for this! & I dont think it is allowable to use 'use_cols'. – Syahmi Shaarani Mar 20 '20 at 07:58
  • 1
    @SyahmiShaarani - Sorry, my typo. It is `usecols` and used if many columns and need parse only few. – jezrael Mar 20 '20 at 07:58
  • Yes it is working! But now I am getting "2020-01-01 00:00:00" when I print the exact cell (row and column) but in the dataframe it shows only the date. How do I remove the timestamp? – Syahmi Shaarani Mar 20 '20 at 08:09
  • 1
    @SyahmiShaarani - In pandas if no times are not displayed `00:00:00`, you can check it `print (df['AA'].tolist())`. If want no times, convert column `AA` to dates like `df['AA'] = df['AA'].dt.date` – jezrael Mar 20 '20 at 08:12
  • Another question, what if I am to use Excel directly? The prob is the date is not considered as 'str' then how do I go about my code? I am receiving this error " dateparse = lambda x: pd.datetime.strptime(x, '%d/%m/%Y') TypeError: strptime() argument 1 must be str, not datetime.datetime" when I used excel as the date column is not parsed by pandas. – Syahmi Shaarani Mar 20 '20 at 08:27
  • @SyahmiShaarani - I tjink error means already datetimes, not necessary parsed. – jezrael Mar 20 '20 at 08:30
  • When I did "df.iloc[0,0]", it does not extract the date column as it is not considered a part of the dataframe? Same goes when I did "df.columns", it does not show the date column. – Syahmi Shaarani Mar 20 '20 at 08:33
  • it extract one value only - intersection of first row and first column – jezrael Mar 20 '20 at 08:34
  • Yes, the first row and first column is the date item but it is not extracted. Any idea why? was it because it is not a string/object? – Syahmi Shaarani Mar 20 '20 at 08:38
  • @SyahmiShaarani - I guess datetimeindex, then need `print (df.index[0])` – jezrael Mar 20 '20 at 08:41
  • I think I got it working but I am getting "2020-01-30T00:00:00.000000000", how do I remove the T00.. and only keep the date? I try parsing it but it says TypeError: strptime() argument 1 must be str, not datetime.datetime. Got it, I used split string to remove the characters after T, thanks a bunch! – Syahmi Shaarani Mar 20 '20 at 08:47
  • @SyahmiShaarani - Why is used `strptime` ? Mybe problem because used it. In pandas is used [`to_datetime`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html), I hope it help – jezrael Mar 20 '20 at 08:50
0

Here is the working code:

headers = ['AA', 'Num']
dtypes = {'AA': 'str', 'Num': 'int'}
pdate = ['AA']

df = pd.read_csv('test1.csv', 
dtype = dtypes,
parse_dates = pdate, 
date_parser=None)

top = df.head()
print(top)