I have a big csv file in which one column contains a list of gps coordinates as float-tuples. Of course if I read in the file as a pandas dataframe, their type is simply String, which is not that useful.
What I want to do is to convert the Strings to lists of float-tuples and save them in a format that I can simply load into a dataframe and run queries on. The important constraint on this is, that I cannot compute this conversion for the whole big dataset, so dumping the binary object is no option.
From what I read, it seems that HDF-stores should be able to save the data in the required format AND allow me to append rows, so I can convert the strings to GPS-entries bit by bit, thereby not getting problems with my memory limitations.
However, when I try to create the HDF-file I get:
TypeError: Cannot serialize the column [POLYLINE] because
its data contents are [mixed] object dtype
The code that produces this error is:
df.to_hdf(filename, 'data', mode='w', format='table')
df contains a slice of the original data, after mapping the conversion function on the POLYLINE column, which should at least mean that all the data in this column has the same type.
How can I solve the error generated or which other ways are there that might work well?
EDIT: df.head() before conversion results in:
TRIP_ID CALL_TYPE ORIGIN_CALL ORIGIN_STAND TAXI_ID \
0 1372636858620000589 C NaN NaN 20000589
1 1372637303620000596 B NaN 7 20000596
2 1372636951620000320 C NaN NaN 20000320
3 1372636854620000520 C NaN NaN 20000520
4 1372637091620000337 C NaN NaN 20000337
TIMESTAMP DAY_TYPE MISSING_DATA \
0 1372636858 A False
1 1372637303 A False
2 1372636951 A False
3 1372636854 A False
4 1372637091 A False
POLYLINE
0 [[-8.618643, 41.141412], [-8.618499, 41.141376...
1 [[-8.639847, 41.159825999999995], [-8.64035099...
2 [[-8.612964, 41.140359000000004], [-8.613378, ...
3 [[-8.574678, 41.151951], [-8.574705, 41.151942...
4 [[-8.645994, 41.18049], [-8.645949, 41.180517]...
and after converting the POLYLINE column (leaving out the other columns to save space):
POLYLINE
0 [[-8.618643, 41.141412], [-8.618499, 41.141376...
1 [[-8.639847, 41.159826], [-8.640351, 41.159871...
2 [[-8.612964, 41.140359], [-8.613378, 41.14035]...
3 [[-8.574678, 41.151951], [-8.574705, 41.151942...
4 [[-8.645994, 41.18049], [-8.645949, 41.180517]...
I've changed it now so that the resulting output is also a list of lists instead of a list of tuples. Inspecting a single element of these shows that the entry before conversion is a string and a single item in the inner list is a float, as it should be, although the way it's printed out doesn't make this clear.