
Big Data file formats like Parquet, Feather and HDF5 work with column-oriented tables, which accelerates reading individual columns.

In my use case I would like to switch from netCDF4 files to the Feather format because I can read some columns about 10 times faster than with netCDF4. But unfortunately I lose the dtype specification, which increases the size of the file.

So my idea is to define dtypes per row, but pandas only accepts dtypes per column.

Is there a way to handle DataFrames more like a column-oriented table and specify a dtype for each row?
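
For reference, here is a minimal sketch of the kind of round trip I mean (the file name and the `to_feather`/`read_feather` calls via pyarrow are just illustrative, not my actual pipeline):

import pandas as pd
import numpy as np

# Illustrative table: every column can be float32, but each *row* would
# ideally get its own dtype (e.g. int16 for one variable, float32 for another).
df = pd.DataFrame(np.random.rand(4, 3).astype('float32'),
                  columns=['var_a', 'var_b', 'var_c'])

df.to_feather('meteo.feather')      # requires pyarrow
back = pd.read_feather('meteo.feather')
print(back.dtypes)                  # float32 for every column (dtypes are per column only)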

dl.meteo
  • Hi, what do you mean by "pandas only accepts dtypes per column"? – Laurent Feb 05 '22 at 15:06
  • @Laurent you can only define a dtype for a column and not for a row. – dl.meteo Feb 07 '22 at 06:06
  • My understanding is that Pandas `astype` method works for columns as well as rows, so I suppose you could transpose your dataframe and define new types for rows (columns before transposition)? – Laurent Feb 07 '22 at 12:25
  • `KeyError: 'Only a column name can be used for the key in a dtype mappings argument.'` – dl.meteo Feb 09 '22 at 07:30

1 Answer


Pandas DataFrames are a collection of Series objects, so you can't have more than one data type per column: a column containing [2, 'dog', 3] will have dtype object because of the string, and likewise [2, 2.5, 3] can't be of type int because of the 2.5.
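
You can see that inference directly on a Series (a quick illustrative check, not from your data):

>>> import pandas as pd
>>> pd.Series([2, 'dog', 3]).dtype
dtype('O')
>>> pd.Series([2, 2.5, 3]).dtype
dtype('float64')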

If you want to work row-based you'll need to transpose your DataFrame using df.transpose() (or the shorthand df.T), which turns your columns into rows. If you're importing your data, transpose the DataFrame and then cast each column to the dtype you want; if you're preparing data to be exported, transpose as the last step before exporting.

For example:

import pandas as pd

df = pd.DataFrame({'col_1': [1, 'cat', 3],
                   'col_2': [4, 'dog', 6]},
                  index=['row_1', 'row_2', 'row_3'])

>>> df
      col_1 col_2
row_1     1     4
row_2   cat   dog
row_3     3     6

# Due to the strings, both columns are dtype object
>>> df.dtypes
col_1    object
col_2    object

# Transpose the df
>>> df.T
      row_1 row_2 row_3
col_1     1   cat     3
col_2     4   dog     6

# Now our data is in columns but still dtype object
>>> df.T.dtypes
row_1    object
row_2    object
row_3    object

# We can cast our columns (originally rows) to new dtypes now
>>> df.T.astype({'row_1': 'int', 'row_2': str, 'row_3': 'int'})
       row_1 row_2  row_3
col_1      1   cat      3
col_2      4   dog      6

>>> df.T.astype({'row_1': 'int', 'row_2': str, 'row_3': 'int'}).dtypes
row_1     int64
row_2    object
row_3     int64
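
To tie this back to your Feather use case: after the transpose and the cast, the per-column dtypes are exactly what Feather stores, so they survive the round trip. A minimal sketch continuing the example above (assumes pyarrow is installed; the file name is just a placeholder):

typed = df.T.astype({'row_1': 'int', 'row_2': str, 'row_3': 'int'})

# Feather requires a default RangeIndex, so move the index into a column first
typed.reset_index().to_feather('transposed.feather')

back = pd.read_feather('transposed.feather').set_index('index')
print(back.dtypes)   # row_1 int64, row_2 object, row_3 int64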
Jason