Import text file from geonames using pandas python

Question

I downloaded one of the country datasets from geonames and I used this line to parse the dataset into columns:

data = pd.read_csv("C:/Users/Documents/TR.txt", sep="\t", header = None)

But for some reason this doesn't parse all of the rows correctly. Most of the rows are correctly parsed and about 2K are not. I used this line to be able to see that it's not parsed correctly:

data.to_csv("C:/Users/Documents/output.csv")

I then opened the output.csv in excel and saw that some of the rows are not parsed. But when I open the original TR.txt dataset on excel and use tab delimiter all the rows are correctly shown as parsed. So I am doing something wrong in my python code but I can't figure out what. Am I outputting the dataset wrong?? Thank you

Opening TR.txt in excel as tab-delimited shows at least 7 lines not parsing properly, first one being line 15191 with value 311071 in first column - do you also get this? — Stefan, Dec 28 '15 at 18:36

score 6 · Accepted Answer · edited May 29 '23 at 00:57

Always read the readme.txt file.

In this particular case, there are two noteworthy issues going on.

The elevation (column 15) is expected as an int, but contains blanks. If you specify an int as the datatype, this will generate an error because there is no NaN values for ints. The work around is to cast it as a float. If you really want an int, then create a sentinal value for missing fields (e.g. -99999), fillna() with this value, and then cast as an int.

Some column contain comma separated lists (e.g. column 3, alternate names). When you used data.to_csv("C:/Users/Documents/output.csv") you destroyed the tab delimited parsing. You need to specify sep='\t'.

dtypes_dict = {
    0: int, # geonameid
    1: unicode,  # name
    2: str,  # asciiname
    3: str,  # alternatenames
    4: float, # latitude
    5: float, # longitude
    6: str, # feature class
    7: str, # feature code
    8: str, # country code
    9: str, # cc2
    10: str, # admin1 code
    11: str, # admin2 code
    12: str, # admin3 code
    13: str, # admin4 code
    14: int, # population
    15: int, # elevation
    16: int, # dem (digital elevation model)
    17: str, # timezone
    18: str # modification date yyyy-MM-dd
}

data = pd.read_csv("TR.txt", sep="\t", header = None, dtype=dtypes_dict)
data.to_csv('output.txt', sep='\t')

I didn't parse the final date column because it is probably not relevant.

I didn't use the dtypes_dict you have here since elevation is recognized as a float when I import and the other variable datatypes are good as is. I didn't know that I had to specify the delimiter when outputting though so Thanks so much for that! — hope288, Dec 28 '15 at 20:33
Specifying the dtypes results in faster reads as the parser doesn't have to guess the datatype. — Alexander, Dec 28 '15 at 20:37

Import text file from geonames using pandas python

1 Answers1