15

I have a very simple function like this one:

import numpy as np
from numba import jit
import pandas as pd

@jit
def f_(n, x, y, z):
    for i in range(n):
        z[i] = x[i] * y[i] 

f_(df.shape[0], df["x"].values, df["y"].values, df["z"].values)

To which I pass

df = pd.DataFrame({"x": [1, 2, 3], "y": [3, 4, 5], "z": np.NaN})

I expected that function will modify data z column in place like this:

>>> f_(df.shape[0], df["x"].values, df["y"].values, df["z"].values)
>>> df

   x  y     z
0  1  3   3.0
1  2  4   8.0
2  3  5  15.0

This works fine most of the time, but somehow fails to modify data in others.

I double checked things and:

  • I haven't determined any problems with data points which could cause this problem.
  • I see that data is modified as expected when I print the result.
  • If I return z array from the function it is modified as expected.

Unfortunately I couldn't reduce the problem to a minimal reproducible case. For example removing unrelated columns seems to "fix" the problem making reduction impossible.

Do I use jit in a way that is not intended to be used? Are there any border cases I should be aware of? Or is it likely to be a bug?

Edit:

I found the source of the problem. It occurs when data contains duplicated column names:

>>> df_ = pd.read_json('{"schema": {"fields":[{"name":"index","type":"integer"},{"name":"v","type":"integer"},{"name":"y","type":"integer"},
... {"name":"v","type":"integer"},{"name":"x","type":"integer"},{"name":"z","type":"number"}],"primaryKey":["index"],"pandas_version":"0.20.
... 0"}, "data": [{"index":0,"v":0,"y":3,"v":0,"x":1,"z":null}]}', orient="table")
>>> f_(df_.shape[0], df_["x"].values, df_["y"].values, df_["z"].values)
>>> df_
   v  y  v  x   z
0  0  3  0  1 NaN

If duplicate is removed the function works like expected:

>>> df_.drop("v", axis="columns", inplace=True)
>>> f_(df_.shape[0], df_["x"].values, df_["y"].values, df_["z"].values)
>>> df_
   y  x    z
0  3  1  3.0
Tai
  • 7,684
  • 3
  • 29
  • 49

1 Answers1

8

Ah, that's because in your "failing case" the df["z"].values returns a copy of what is stored in the 'z' column of df. It has nothing to do with the numba function:

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame([[0, 3, 0, 1, np.nan]], columns=['v', 'y', 'v', 'x', 'z'])
>>> np.shares_memory(df['z'].values, df['z'])
False

While in the "working case" it's a view into the 'z' column:

>>> df = pd.DataFrame([[0, 3, 1, np.nan]], columns=['v', 'y', 'x', 'z'])
>>> np.shares_memory(df['z'].values, df['z'])
True

NB: It's actually quite funny that this works, because the copy is made when you do df['z'] not when you access the .values.

The take-away here is that you cannot expect that indexing a DataFrame or accessing the .values of a Series will always return a view. So updating the column in-place may not change the values of the original. Not only duplicate column names could be a problem. When the property values returns a copy and when it returns a view is not always clear (except for pd.Series then it's always a view). But these are just implementation details. So it's never a good idea to rely on a specific behavior here. The only guarantee that .values is making is that it returns a numpy.ndarray containing the same values.

However it's pretty easy to avoid that problem by simply returning the modified z column from the function:

import numba as nb
import numpy as np
import pandas as pd

@nb.njit
def f_(n, x, y, z):
    for i in range(n):
        z[i] = x[i] * y[i] 
    return z  # this is new

Then assign the result of the function to the column:

>>> df = pd.DataFrame([[0, 3, 0, 1, np.nan]], columns=['v', 'y', 'v', 'x', 'z'])
>>> df['z'] = f_(df.shape[0], df["x"].values, df["y"].values, df["z"].values)
>>> df
   v  y  v  x    z
0  0  3  0  1  3.0

>>> df = pd.DataFrame([[0, 3, 1, np.nan]], columns=['v', 'y', 'x', 'z'])
>>> df['z'] = f_(df.shape[0], df["x"].values, df["y"].values, df["z"].values)
>>> df
   v  y  x    z
0  0  3  1  3.0

In case you're interested what happened in your specific case currently (as I mentioned we're talking about implementation details here so don't take this as given. It's just the way it's implemented now). If you have a DataFrame it will store the columns that have the same dtype in a multidimensional NumPy array. This can be seen if you access the blocks attribute (deprecated because the internal storage may change in the near future):

>>> df = pd.DataFrame([[0, 3, 0, 1, np.nan]], columns=['v', 'y', 'v', 'x', 'z'])
>>> df.blocks
{'float64':
     z
  0  NaN
 , 
 'int64':
     v  y  v  x
  0  0  3  0  1}

Normally it's very easy to create a view into that block, by translating the column name to the column index of the corresponding block. However if you have a duplicate column name the accessing an arbitrary column cannot be guaranteed to be a view. For example if you want to access 'v' then it has to index the Int64 Block with index 0 and 2:

>>> df = pd.DataFrame([[0, 3, 0, 1, np.nan]], columns=['v', 'y', 'v', 'x', 'z'])
>>> df['v']
   v  v
0  0  0

Technically it could be possible to index the non-duplicated columns as views (and in this case even for the duplicated column, for example by using Int64Block[::2] but that's a very special case...). Pandas opts for the safe option to always return a copy if there are duplicate column names (makes sense if you think about it. Why should indexing one column return a view and another returns a copy). The indexing of the DataFrame has an explicit check for duplicate columns and treats them differently (resulting in copies):

    def _getitem_column(self, key):
        """ return the actual column """

        # get column
        if self.columns.is_unique:
            return self._get_item_cache(key)

        # duplicate columns & possible reduce dimensionality
        result = self._constructor(self._data.get(key))
        if result.columns.is_unique:
            result = result[key]

    return result

The columns.is_unique is the important line here. It's True for your "normal case" but "False" for the "failing case".

MSeifert
  • 145,886
  • 38
  • 333
  • 352
  • 1
    _when `.values `returns a copy and when it returns a view is not always clear. ... (it's an implementation detail).`_ - that's extremely useful information. I wish I could upvote twice. Do you know what is the reason for using two different implementations here? –  Jul 21 '18 at 19:57
  • 1
    @user10102398 Thanks for your comment. Maybe I focused a bit too much on the `.values`. The actual copy happens when you index the DataFrame, not when you get the `values`. I updated the answer accordingly (and included a bit more of in-depth information). I'm not too familiar with pandas but when I looked into the problem I found several (lots) of pandas issues where it people were suprised that indexing or accessing the `values` could return a view or a copy. And the answer almost always was: These are implementation details (which you really shouldn't rely on). – MSeifert Jul 21 '18 at 23:54