3

What are the reasons that pandas DataFrame values are implemented as C-contiguous numpy array for numeric data types and as F-contiguous numpy array for string (object) data types?

import numpy as np, pandas as pd
ai = np.array([[0,1,2],[3,4,5]], dtype=np.int64)
af = np.array([[0.,1.,2.],[3.,4.,5.]], dtype=np.float64)
ao = np.array([['000','111','222'],['333','444','555']], dtype=object)
print('C-contiguous:',
' i=',ai.flags['C_CONTIGUOUS'],
' f=',af.flags['C_CONTIGUOUS'],
' o=',ao.flags['C_CONTIGUOUS'],sep='')
print('C-contiguous:',
' i=',pd.DataFrame(ai).values.flags['C_CONTIGUOUS'],
' f=',pd.DataFrame(af).values.flags['C_CONTIGUOUS'],
' o=',pd.DataFrame(ao).values.flags['C_CONTIGUOUS'],sep='')

gives

C-contiguous: i=True f=True o=True
C-contiguous: i=True f=True o=False

Thank you for your help!

S.V
  • 2,149
  • 2
  • 18
  • 41
  • `pd.DataFrame(ai).values` is a `view` of `ai`. Modifying an element in `ai` changes the corresponding element in the dataframe. The same isn't true for the object case. `values` is a copy. It's as though the column oriented `series` is the primary data store, rather than `_values`. But I don't know how to test that. – hpaulj Aug 06 '19 at 16:17
  • @hpaulj Thank you! I was wondering if there is any advantage of storing the `values` array of a DataFrame as an F-contiguous array for objects. I find myself having to worry about what typed memory view I need to use for numeric and object values in Cython and was hoping to understand why it is done this way. Why it is a bad idea to just have all data types to be stored in C-contiguous arrays? – S.V Aug 06 '19 at 16:45
  • I don't know how much control we have over the `values` attribute when creating a dataframe. Are you thinking of `cython` typed memory views? There's a good relationship between `ndarray` and these, but I haven't seen anything about `pandas` and `cython`. – hpaulj Aug 06 '19 at 16:59
  • @hpaulj Yes, I am actually working with pandas DataFrames, but since, as you pointed out, Cython does not have integration with pandas, in reality I am working with the underlying numpy arrays, and because the arrays implement the Python buffer protocol, the best way work with the arrays are typed memory views which represent the buffer at the Cython level. – S.V Aug 06 '19 at 19:06
  • Just guessing here, but since each column of the dataframe may have a different type in pandas, it may be more convenient to store the values of each column together when dealing with objects, i.e., as Fortran contiguous. – bmello Apr 02 '20 at 08:07

0 Answers0