
Pandas DataFrames are generally represented in either long (many rows) or wide (many columns) format.

I'm wondering which format is faster to read and occupies less memory when saved as an HDF file (df.to_hdf).

Is there a general rule, or are there cases where one of the formats should be preferred?

Donbeo
  • What kind of dtypes are you going to use? Is it really not important for you how to store your DFs or are you going to transpose them? – MaxU - stand with Ukraine Nov 11 '16 at 10:18
  • I have different dataframes. Some have only floats while others have strings and floats. They are quite large (100GB) and I want to reduce the memory usage and the reading time as much as possible. – Donbeo Nov 11 '16 at 10:23

1 Answer


IMO the long format is preferable, as you will have much less metadata overhead (information about column names, dtypes, etc.).

In terms of memory usage they are going to be more or less the same:

In [19]: import sys

In [20]: import numpy as np

In [21]: import pandas as pd

In [22]: long = pd.DataFrame(np.random.randint(0, 10**6, (10**4, 4)))

In [23]: wide = pd.DataFrame(np.random.randint(0, 10**6, (4, 10**4)))

In [24]: long.shape
Out[24]: (10000, 4)

In [25]: wide.shape
Out[25]: (4, 10000)

In [26]: sys.getsizeof(long)
Out[26]: 160104

In [27]: sys.getsizeof(wide)
Out[27]: 160104

In [28]: wide.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Columns: 10000 entries, 0 to 9999
dtypes: int32(10000)
memory usage: 156.3 KB

In [29]: long.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
0    10000 non-null int32
1    10000 non-null int32
2    10000 non-null int32
3    10000 non-null int32
dtypes: int32(4)
memory usage: 156.3 KB
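To check the on-disk side of the question directly, you can save both frames with to_hdf and compare the resulting file sizes. This is a minimal sketch, assuming PyTables is installed (pip install tables); the file names are illustrative. The wide frame has to store labels and dtype metadata for 10,000 columns, so its file tends to come out somewhat larger even though the in-memory footprints match:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Same 10_000 x 4 example as above, in both orientations.
long_df = pd.DataFrame(np.random.randint(0, 10**6, (10**4, 4)))
wide_df = pd.DataFrame(np.random.randint(0, 10**6, (4, 10**4)))

# In memory the two layouts hold the same data and take the same space.
mem_long = long_df.memory_usage(index=True, deep=True).sum()
mem_wide = wide_df.memory_usage(index=True, deep=True).sum()
print("in-memory:", mem_long, "vs", mem_wide)

# On disk, compare the HDF5 files produced by to_hdf.
# Requires PyTables; skip gracefully if it is not installed.
try:
    with tempfile.TemporaryDirectory() as tmp:
        long_path = os.path.join(tmp, "long.h5")
        wide_path = os.path.join(tmp, "wide.h5")
        long_df.to_hdf(long_path, key="df", mode="w")
        wide_df.to_hdf(wide_path, key="df", mode="w")
        print("long file:", os.path.getsize(long_path), "bytes")
        print("wide file:", os.path.getsize(wide_path), "bytes")
except ImportError:
    print("PyTables not installed; skipping the on-disk comparison")
```

For the 100GB frames mentioned in the comments, it is also worth timing pd.read_hdf on both layouts, since read speed was part of the question.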
MaxU - stand with Ukraine