
When I use petl to wrangle my data and then write it to a pandas DataFrame, I notice that pandas's wonderful ability to auto-recognize types goes away. Is there any way to use the features of petl while also preserving pandas's ability to auto-recognize data types?

import pandas as pd

df = pd.read_csv(csvpath)

df
Out[17]: 
  cis  boom  bah   bish
0   A     1  NaN   True
1   B     4  1.0  False
2   C    11  NaN  False
df.dtypes
Out[18]: 
cis      object
boom      int64
bah     float64
bish       bool
dtype: object

import petl as etl

tbl = etl.fromcsv(csvpath)
df2 = etl.todataframe(tbl)


df2
Out[20]: 
  cis boom bah   bish
0   A    1       TRUE
1   B    4   1  FALSE
2   C   11      FALSE
df2.dtypes
Out[21]: 
cis     object
boom    object
bah     object
bish    object
dtype: object
Selah

2 Answers


I've not used petl so I'm just guessing here. I suspect that petl is guessing at data types and passing those guesses to pandas... and I bet it's not as good at guessing as pandas is. You could write a little wrapper function that dumps the petl data to a CSV, then sucks it back in using pd.read_csv(). That way pandas is forced to do the type inference. You could also inspect the CSV and make sure petl is not mangling the date format in some odd way.
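
Something along these lines would do the round trip (a minimal, untested sketch; etl.tocsv() and pd.read_csv() are the real calls, the wrapper itself is mine):

import os
import tempfile

import pandas as pd
import petl as etl

def petl_to_df(tbl):
    """Round-trip a petl table through CSV so pandas infers the types."""
    fd, path = tempfile.mkstemp(suffix='.csv')
    os.close(fd)
    try:
        etl.tocsv(tbl, path)       # dump the petl table to a temp CSV
        return pd.read_csv(path)   # let pandas re-read it and guess dtypes
    finally:
        os.remove(path)

df2 = petl_to_df(tbl)  # df2.dtypes should now match the plain read_csv() result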

More involved, you could write out only a few hundred rows, use pandas to read those back in, then inspect the resulting DataFrame to determine the column types. Then you could assign those to the petl container with convert(), e.g.:

tbl = etl.convert(tbl, 'bar', int)
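
A rough sketch of that sampling idea (etl.head(), etl.tocsv() and etl.convert() are real petl functions; the dtype-kind-to-converter mapping is my own rough guess):

import os
import tempfile

import pandas as pd
import petl as etl

def infer_and_convert(tbl, nrows=500):
    """Let pandas type a small sample, then convert the full petl table."""
    fd, path = tempfile.mkstemp(suffix='.csv')
    os.close(fd)
    try:
        etl.tocsv(etl.head(tbl, nrows), path)  # write out only a sample
        sample = pd.read_csv(path)             # pandas infers the dtypes
    finally:
        os.remove(path)
    # Crude mapping from numpy dtype kinds to Python converters;
    # anything unmapped (e.g. strings) is left alone.
    kind_map = {'i': int, 'f': float, 'b': bool}
    for name, dtype in sample.dtypes.items():
        conv = kind_map.get(dtype.kind)
        if conv is not None:
            tbl = etl.convert(tbl, name, conv)
    return tbl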
JD Long
  • That definitely sounds possible, but I imagine it would take ~3x as long, which would be significant for the data I am working with. – Selah May 04 '17 at 17:51

This might be a basic approach to what you need, although I don't know how resource-hungry it might prove to be.

import petl as etl
import pandas as pd

table = etl.frompickle('temp.pickle')

print(table)

# Work out a single dtype for each column using petl's typeset(),
# ignoring the NoneType entries produced by missing values.
dtypes = {}
for fieldName in table.fieldnames():
    typeset = [t for t in etl.typeset(table, fieldName) if t != 'NoneType']
    if len(typeset) > 1:
        print('Warning: more than one type found, using the first type found')
    dtypes[fieldName] = typeset[0]

cols = etl.columns(table)

# Build the DataFrame one column at a time so that each column
# can be given its own dtype.
first = True
for fieldname in table.fieldnames():
    if first:
        df = pd.DataFrame.from_dict({fieldname: cols[fieldname]},
                                    dtype=dtypes[fieldname])
        first = False
    else:
        column = pd.DataFrame.from_dict({fieldname: cols[fieldname]},
                                        dtype=dtypes[fieldname])
        df = df.join(column)

print(df)

Here is the output for your example, slightly modified.

+-----+------+------+-------+
| cis | boom | bah  | bish  |
+=====+======+======+=======+
| A   |    1 | None | True  |
+-----+------+------+-------+
| B   | None |  1.0 | False |
+-----+------+------+-------+
| C   |   11 | None | False |
+-----+------+------+-------+

  cis  boom  bah   bish
0   A     1  NaN   True
1   B  None  1.0  False
2   C    11  NaN  False

I omitted one of the integer values (for 'boom') because there is no NaN for integers and I wanted to learn how to cope with this. I'm not clear what the answer might be for numpy (some answers seemed quite ugly), so I avoided it. Without a declared dtype, pandas converts boom to float, which is why I've included the code that exercises petl's typeset machinery. This is a very basic approach. If you have columns that include both integer and float values then you might wish to expand this so that float is chosen over int and passed to pandas, as sketched below.
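
For instance, a widening rule like this could replace the typeset loop above (it reuses the table, etl and dtypes names from the code earlier; the priority table is my own choice, not petl's):

# Prefer the "widest" type seen in a mixed column: float over int, etc.
priority = {'bool': 0, 'int': 1, 'float': 2, 'str': 3}
for fieldName in table.fieldnames():
    types = [t for t in etl.typeset(table, fieldName) if t != 'NoneType']
    dtypes[fieldName] = max(types, key=lambda t: priority.get(t, 3))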

Bill Bell