1

I have a big csv file in which one column contains a list of gps coordinates as float-tuples. Of course if I read in the file as a pandas dataframe, their type is simply String, which is not that useful.

What I want to do is to convert the Strings to lists of float-tuples and save them in a format that I can simply load into a dataframe and run queries on. The important constraint on this is, that I cannot compute this conversion for the whole big dataset, so dumping the binary object is no option.

From what I read, it seems that HDF-stores should be able to save the data in the required format AND allow me to append rows, so I can convert the strings to GPS-entries bit by bit, thereby not getting problems with my memory limitations.

However, when I try to create the HDF-file I get:

TypeError: Cannot serialize the column [POLYLINE] because
its data contents are [mixed] object dtype

The code that produces this error is:

df.to_hdf(filename, 'data', mode='w', format='table')

df contains a slice of the original data, after mapping the conversion function on the POLYLINE column, which should at least mean that all the data in this column has the same type.

How can I solve the error generated or which other ways are there that might work well?

EDIT: df.head() before conversion results in:

TRIP_ID CALL_TYPE  ORIGIN_CALL  ORIGIN_STAND   TAXI_ID  \
0  1372636858620000589         C          NaN           NaN  20000589   
1  1372637303620000596         B          NaN             7  20000596   
2  1372636951620000320         C          NaN           NaN  20000320   
3  1372636854620000520         C          NaN           NaN  20000520   
4  1372637091620000337         C          NaN           NaN  20000337   

    TIMESTAMP DAY_TYPE MISSING_DATA  \
0  1372636858        A        False   
1  1372637303        A        False   
2  1372636951        A        False   
3  1372636854        A        False   
4  1372637091        A        False   

                                        POLYLINE  
0  [[-8.618643, 41.141412], [-8.618499, 41.141376...  
1  [[-8.639847, 41.159825999999995], [-8.64035099...  
2  [[-8.612964, 41.140359000000004], [-8.613378, ...  
3  [[-8.574678, 41.151951], [-8.574705, 41.151942...  
4  [[-8.645994, 41.18049], [-8.645949, 41.180517]...  

and after converting the POLYLINE column (leaving out the other columns to save space):

                                            POLYLINE  
0  [[-8.618643, 41.141412], [-8.618499, 41.141376...  
1  [[-8.639847, 41.159826], [-8.640351, 41.159871...  
2  [[-8.612964, 41.140359], [-8.613378, 41.14035]...  
3  [[-8.574678, 41.151951], [-8.574705, 41.151942...  
4  [[-8.645994, 41.18049], [-8.645949, 41.180517]...  

I've changed it now so that the resulting output is also a list of lists instead of a list of tuples. Inspecting a single element of these shows that the entry before conversion is a string and a single item in the inner list is a float, as it should be, although the way it's printed out doesn't make this clear.

SGer
  • 544
  • 4
  • 18
  • 1
    can you show is `df.head()`. tbh I think you should just split this column into two floats. – Andy Hayden May 23 '15 at 20:31
  • I added the output in the original post. The problem is, that the the column contains a LIST of coordinate pairs. Ignore that the conversion seems to be totally wrong for a moment :P I could of course create a new column and then turn the list of GPS coordinates into new entries, but then I'd also have to introduce a new column to keep track of the original order of the coordinates which would blow up the data by quite a bit. Also I wouldn't mind turning it into a list of lists instead of a list of tuples if that makes things easier. – SGer May 23 '15 at 22:15
  • Yeah, I'd break it to another DataFrame. Having lists as items in a column is bad business/pandas. You might want to checkout geopandas and see how they deal with these polyline objects. I think how you store it is going to depend completely on what you want to do with it (atm it's a list of lists, and you can't serialize that)... but like I say, I reckon a two col DataFrame is your best bet. – Andy Hayden May 24 '15 at 04:39
  • Okay thank you. For now I'll add another column that contains the number of GPS-Coordinates, as they're taken after a specified time interval and then I'll probably create a dict that maps from TRIP_ID to converted POLYLINE objects, so I will only load them if I need them – SGer May 26 '15 at 10:47

0 Answers0