4

I downloaded one of the country datasets from geonames and I used this line to parse the dataset into columns:

data = pd.read_csv("C:/Users/Documents/TR.txt", sep="\t", header = None)

But for some reason this doesn't parse all of the rows correctly. Most of the rows are correctly parsed and about 2K are not. I used this line to be able to see that it's not parsed correctly:

data.to_csv("C:/Users/Documents/output.csv")

I then opened the output.csv in excel and saw that some of the rows are not parsed. But when I open the original TR.txt dataset on excel and use tab delimiter all the rows are correctly shown as parsed. So I am doing something wrong in my python code but I can't figure out what. Am I outputting the dataset wrong?? Thank you

hope288
  • 725
  • 12
  • 23
  • Opening TR.txt in excel as tab-delimited shows at least 7 lines not parsing properly, first one being line 15191 with value 311071 in first column - do you also get this? – Stefan Dec 28 '15 at 18:36
  • What is size of your `RAM`? – jezrael Dec 28 '15 at 18:46

1 Answers1

6

Always read the readme.txt file.

In this particular case, there are two noteworthy issues going on.

  1. The elevation (column 15) is expected as an int, but contains blanks. If you specify an int as the datatype, this will generate an error because there is no NaN values for ints. The work around is to cast it as a float. If you really want an int, then create a sentinal value for missing fields (e.g. -99999), fillna() with this value, and then cast as an int.

  2. Some column contain comma separated lists (e.g. column 3, alternate names). When you used data.to_csv("C:/Users/Documents/output.csv") you destroyed the tab delimited parsing. You need to specify sep='\t'.

    dtypes_dict = {
        0: int, # geonameid
        1: unicode,  # name
        2: str,  # asciiname
        3: str,  # alternatenames
        4: float, # latitude
        5: float, # longitude
        6: str, # feature class
        7: str, # feature code
        8: str, # country code
        9: str, # cc2
        10: str, # admin1 code
        11: str, # admin2 code
        12: str, # admin3 code
        13: str, # admin4 code
        14: int, # population
        15: int, # elevation
        16: int, # dem (digital elevation model)
        17: str, # timezone
        18: str # modification date yyyy-MM-dd
    }
    
    data = pd.read_csv("TR.txt", sep="\t", header = None, dtype=dtypes_dict)
    data.to_csv('output.txt', sep='\t')
    

I didn't parse the final date column because it is probably not relevant.

Makyen
  • 31,849
  • 12
  • 86
  • 121
Alexander
  • 105,104
  • 32
  • 201
  • 196
  • I didn't use the dtypes_dict you have here since elevation is recognized as a float when I import and the other variable datatypes are good as is. I didn't know that I had to specify the delimiter when outputting though so Thanks so much for that! – hope288 Dec 28 '15 at 20:33
  • 2
    Specifying the dtypes results in faster reads as the parser doesn't have to guess the datatype. – Alexander Dec 28 '15 at 20:37
  • Oh ok, I'll make sure to do that then. Thanks a lot! – hope288 Dec 28 '15 at 20:59
  • note `unicode` in python 3+ can just be `str` – dangel May 29 '23 at 01:29