1

I'm trying to convert the value of a column named date in my csv file to another format, for instance:

Original records

transfer id,player id,player name,season,date
732058,1126,,12/13,Jul 1- 2012
581951,1126,,11/12,Jun 3- 2011
295000,1126,,09/10,Aug 12- 2009
98459,1126,,06/07,Nov 6- 2006
7267,1126,,03/04,Jul 2- 2003
...

and I'd like to get a result like

transfer id,player id,player name,season,date
732058,1126,,12/13,2012-07-01
581951,1126,,11/12,2011-06-03
295000,1126,,09/10,2009-08-12
98459,1126,,06/07,2006-11-06
7267,1126,,03/04,2003-07-02
...

as the data stored in csv file are string, I have written a method which can convert the date format:

import time

# convert date from original format to new format
def date_convert(_date,fmt_original,fmt_new):
    if date_validate(_date,fmt_original):
        timeArray=time.strptime(_date,fmt_original)
        return time.strftime(fmt_new,timeArray)
    else:
        return '0001-01-01'

def date_validate(_date,fmt_original):
    try:
        time.strptime(_date, fmt_original)
        return True
    except ValueError:
        return False

then I try to change the date in csv file,and I try to utilize pandas, as is told by @MaxU:

and I write a code like

import pandas as pd
import date_format

df=pd.read_csv('the_transfer_info_test.csv',delimiter=',')
df.date=date_format.date_convert(df.date, '%b %d- %Y', '%Y-%m-%d')
print df

and at first I got an exception like this:

TypeError: expected string or buffer

I thought it may be related to the data type,as the df.date get a type of Series in pandas, so I coded as

df.date=date_format.date_convert(str(df.date), '%b %d- %Y', '%Y-%m-%d')

but it returns all 0001-01-01 which is the exception date in date_format, therefore I searched how to convert the Series to String and find an answer like @Amit, and I tried methods below:

df['date'].astype(basestring)
df.date.apply(str)
df['date'].astype(str)
df['date'].astype('str')

but they don't work for me, I got the same exception like:

TypeError: expected string or buffer

I wonder how I can convert specific column values in csv file,either utilize pandas or not.

BTW, my python version is 2.7.12 with IDE PyCharm and Anoconda 4.0.0 and pandas 0.18.0.

Any help is appreciated,thank you.


Thanks to @jezrael, for my sample above it both work well, it's my fault that I meant to simplify my issue and simplified my question,actually my original data is like:

transfer id,player id,player name,season,date,move from,move from id,move to,move to id,market value,transfer fee
732058,1126,,12/13,Jul 1- 2012,e-frankfurt,24,1-fc-koln,3,£1.06m,Free transfer
581951,1126,,11/12,Jul 1- 2011,fc-st-pauli,35,eintracht-frankfurt,24,£1.70m,£425k
295000,1126,,09/10,Jul 1- 2009,alem-aachen,8,fc-st-pauli,35,£850k,Free transfer
98459,1126,,06/07,Jul 1- 2006,1860-munich,72,alemannia-aachen,8,£1.36m,£765k
7267,1126,,03/04,Jul 1- 2003,stuttgart-ii,102,tsv-1860-munich,72,-,£21k
...

and actually these approaches work well with part of my data I mean if I test it with afew lines with the same format,but when it comes to the original data which is about 40000 records,it's wired that those approaches do not work any more, for to_datetime method,I got an exception like

ValueError: time data '-' does not match format '%b %d- %Y' (match)

while with the second method as parse_dates, the date format remains the same like Jun 11- 2016.

Again,any help will be appreciated.

Community
  • 1
  • 1
LancelotHolmes
  • 659
  • 1
  • 10
  • 31

1 Answers1

1

I think you need to_datetime:

df.date = pd.to_datetime(df.date, format='%b %d- %Y')
print (df)
   transfer id  player id  player name season       date
0       732058       1126          NaN  12/13 2012-07-01
1       581951       1126          NaN  11/12 2011-06-03
2       295000       1126          NaN  09/10 2009-08-12
3        98459       1126          NaN  06/07 2006-11-06
4         7267       1126          NaN  03/04 2003-07-02

but is seems you can use parameter parse_dates in read_csv:

import pandas as pd
from pandas.compat import StringIO

temp=u"""transfer id,player id,player name,season,date
732058,1126,,12/13,Jul 1- 2012
581951,1126,,11/12,Jun 3- 2011
295000,1126,,09/10,Aug 12- 2009
98459,1126,,06/07,Nov 6- 2006
7267,1126,,03/04,Jul 2- 2003
"""
#after testing replace StringIO(temp) to filename
df = pd.read_csv(StringIO(temp), parse_dates=['date'])

print (df)
   transfer id  player id  player name season       date
0       732058       1126          NaN  12/13 2012-07-01
1       581951       1126          NaN  11/12 2011-06-03
2       295000       1126          NaN  09/10 2009-08-12
3        98459       1126          NaN  06/07 2006-11-06
4         7267       1126          NaN  03/04 2003-07-02

EDIT by comment:

You need parameter errors='coerce' for replacing bad data (which cannot match format to NaT):

df.date = pd.to_datetime(df.date, format='%b %d- %Y', errors='coerce')

print (df)
   transfer id  player id  player name season         date     move from  \
0       732058       1126          NaN  12/13  Jul 1- 2012   e-frankfurt   
1       581951       1126          NaN  11/12  Jul 1- 2011   fc-st-pauli   
2       295000       1126          NaN  09/10  Jul 1- 2009   alem-aachen   
3        98459       1126          NaN  06/07  Jul 1- 2006   1860-munich   
4         7267       1126          NaN  03/04  Jul 1- 2003  stuttgart-ii   
5         7267       1126          NaN  03/04            -  stuttgart-ii   

   move from id              move to  move to id market value   transfer fee  
0            24            1-fc-koln           3       £1.06m  Free transfer  
1            35  eintracht-frankfurt          24       £1.70m          £425k  
2             8          fc-st-pauli          35        £850k  Free transfer  
3            72     alemannia-aachen           8       £1.36m          £765k  
4           102      tsv-1860-munich          72            -           £21k  
5           102      tsv-1860-munich          72            -           £21k 
df.date = pd.to_datetime(df.date, format='%b %d- %Y', errors='coerce')
print (df)
   transfer id  player id  player name season       date     move from  \
0       732058       1126          NaN  12/13 2012-07-01   e-frankfurt   
1       581951       1126          NaN  11/12 2011-07-01   fc-st-pauli   
2       295000       1126          NaN  09/10 2009-07-01   alem-aachen   
3        98459       1126          NaN  06/07 2006-07-01   1860-munich   
4         7267       1126          NaN  03/04 2003-07-01  stuttgart-ii   
5         7267       1126          NaN  03/04        NaT  stuttgart-ii   

   move from id              move to  move to id market value   transfer fee  
0            24            1-fc-koln           3       £1.06m  Free transfer  
1            35  eintracht-frankfurt          24       £1.70m          £425k  
2             8          fc-st-pauli          35        £850k  Free transfer  
3            72     alemannia-aachen           8       £1.36m          £765k  
4           102      tsv-1860-munich          72            -           £21k  
5           102      tsv-1860-munich          72            -           £21k  
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thank you for your help, your method work well with sample records but failed with a comparative larger csv file like 40000 records csv file,and I have updated my description, it will be nice of you to offer a further help. – LancelotHolmes Oct 27 '16 at 11:25
  • I add sample data, check last row of sample - `-` is convert to `NaT` – jezrael Oct 27 '16 at 11:32
  • Yeah it works well,thanks,so the reason for the failure of your another method like `parse_dates` is also for the wrong data in records? – LancelotHolmes Oct 27 '16 at 11:39
  • Well thanks again and I'll explore the pandas since it's quite new for me – LancelotHolmes Oct 27 '16 at 11:41