0

I have a large csv file contains some bus network information.

The stop code are made of a large number with a certain letter in the end. However, some of them are only numbers. When I read them into pandas, the large numbers become in scientific notion. like

code_o  lat_o   lon_o   code_d
490016444HN 51.56878    0.1811568   490013271R
490013271R  51.57493    0.1781319   490009721A
490009721A  51.57708    0.1769355   490010407C
490010407C  51.57947    0.1775409   490011659G
490011659G  51.5806 0.1831088   490009810M
490009810M  51.57947    0.1848733   490014448S
490014448S  51.57751    0.185111    490001243Y
490001243Y  51.57379    0.1839945   490013654S
490013654S  51.57143    0.184776    490013482E
490013482E  51.57107    0.187039    490015118E
490015118E  51.5724 0.1923417   490011214E
490011214E  51.57362    0.1959939   490006980E
490006980E  51.57433    0.1999537   4.90E+09
4.90E+09    51.57071    0.2087701   490003049E
490003049E  51.5631 0.2146196   490004001A
490004001A  51.56314    0.2165552   490015350F

The type of them are object, however I need them to be a normal number in order to cross join other tables.

Since the column is not an 'int' or 'float', I cannot modify them by a whole column.

Any suggestion?

I attached the file from dropbox

https://www.dropbox.com/s/jhbxsncd97rq1z4/gtfs_OD_links_L.csv?dl=0
GDI
  • 647
  • 1
  • 7
  • 17
  • Do you actually want them to be numbers, or do you want them to be strings (like the ones with letters are)? – BrenBarn Jul 17 '16 at 01:30
  • Please post a few rows from the csv file, not a link to the whole file. I (and many other people) don't want to download unknown files from the Interwebz --- viruses, malware and other weeds. – Alicia Garcia-Raboso Jul 17 '16 at 02:30
  • I want them to be string to be a future key for joining. – GDI Jul 17 '16 at 03:05

2 Answers2

0

IIUC, try forcing object type for the code_d column on import:

import numpy as np
import pandas as pd

df = pd.read_csv('your_original_file.csv', dtype={'code_d': 'object'})

You can then parse that column, discarding the letter at the end and casting the result to integer type:

df['code_d'] = df['code_d'].str[:-1].astype(np.int)
Alicia Garcia-Raboso
  • 13,193
  • 1
  • 43
  • 48
  • Not working. If I use df['code_d'] = df['code_d'].str[:-1].astype(np.int), I get a error ValueError: invalid literal for long() with base 10: '490016444H' which refers to the first row. If I use If I use df['code_d'] = df['code_d'].str[:-2].astype(np.int) I get ValueError: invalid literal for long() with base 10: '4.90E+'. – GDI Jul 17 '16 at 02:06
  • Please post the corresponding rows *before* importing into a pandas DataFrame. Also, `490016444H` is in the `code_o` column, not `code_d`. – Alicia Garcia-Raboso Jul 17 '16 at 02:10
  • It is also the only case with two letters at the end instead of one. Without you defining things properly, it is impossible to answer your question. – Alicia Garcia-Raboso Jul 17 '16 at 02:16
  • I changed to code_o, as this should be the primary key. And since I have thousands of rows below which I have not put it on, it seems inevitable this two letter case would happen again. – GDI Jul 17 '16 at 02:18
  • @ZXie I have the same problem. Did you ever find a solution to this problem? – stanm87 Feb 26 '19 at 22:44
0

Keep it simple: df=pd.read_csv('myfile.csv',dtype=str) and it will read everything in as strings. Or as was posted earlier by @Alberto to specify that column only just: df=pd.read_csv('myfile.csv',dtype={'code_o':str})

Matt
  • 2,602
  • 13
  • 36