
I have a dataframe in which I'd like to update a column with values from an array. The array is a different length from the dataframe, but I have the indices of the dataframe rows that I'd like to update.

I can do this with a loop over the rows (below), but I expect there is a much more efficient vectorized approach. I can't seem to get the syntax right.

In the example below I fill the column with NaN and then assign the values one at a time in a loop over the indices.

df['newcol'] = np.nan

j = 0
for i in update_idx:
    df['newcol'][i] = new_values[j]
    j+=1
anthr
  • is this an array or a series/df? you could just assign the series directly: `df['newcol'] = new_values` or construct a series: `df['newcol'] = pd.Series(new_values)` the extra rows in `new_values` will be ignored – EdChum Dec 22 '15 at 23:56
  • The values to update are currently in an array but could be transformed if the solution requires it. Maybe I'm wrong but wouldn't your solution ignore the fact I only want to update certain indices? For example, I may want to update the 2nd, 8th, 20th.. index (in the example these are in update_idx) but wouldn't your approach just update the first N rows of the dataframe (where N is the length of new_values)? – anthr Dec 22 '15 at 23:59
  • then I think `df.loc[update_idx, 'new_col'] = new_values` should work – EdChum Dec 23 '15 at 00:00
  • Perfect - thanks very much. If you care to submit that as an answer I can accept it! – anthr Dec 23 '15 at 00:06

1 Answer


If you already have a list of index labels, you can use `loc` to select those rows by label and pass the new column name in the same call. Rows that are not selected are assigned NaN in the new column:

df.loc[update_idx, 'newcol'] = new_values

Example:

In [4]:
df = pd.DataFrame({'a':np.arange(5), 'b':np.random.randn(5)}, index = list('abcde'))
df

Out[4]:
   a         b
a  0  1.800300
b  1  0.351843
c  2  0.278122
d  3  1.387417
e  4  1.202503

In [5]:    
idx_list = ['b','d','e']
df.loc[idx_list, 'c'] = np.arange(3)
df

Out[5]:
   a         b   c
a  0  1.800300 NaN
b  1  0.351843   0
c  2  0.278122 NaN
d  3  1.387417   1
e  4  1.202503   2
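
Note that `loc` selects by label. If `update_idx` holds row positions instead, as the loop in the question suggests, the two coincide only for a default integer index. The following is a minimal sketch of one way to handle that case, assuming positional indices and a string index; the data and the name `update_pos` are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data: string labels, so row positions and labels differ.
df = pd.DataFrame({'a': np.arange(5)}, index=list('abcde'))
update_pos = [1, 3, 4]                       # row positions, not labels
new_values = np.array([10.0, 20.0, 30.0])

# Translate positions to labels, then assign in a single call.
# Rows that are not selected get NaN in the new column.
df.loc[df.index[update_pos], 'newcol'] = new_values

# Reference result built with an explicit loop, as in the question.
expected = pd.Series(np.nan, index=df.index)
for j, i in enumerate(update_pos):
    expected.iloc[i] = new_values[j]
```

With a default integer index, `df.index[update_pos]` returns the same integers, so this reduces to the one-liner above.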
EdChum
  • Does loc use a vectorised approach to access the numpy elements? I'd heard loc should be avoided to prevent using for loops under the hood. I'd been told to use numpy [Boolean] style indexing. Thanks :) – Chogg Apr 26 '19 at 19:18
  • No, loc does label based indexing, it has nothing to do with vectorisation. It's the operation on the result of loc that may or may not be vectorised. Don't know what the context of what you heard was but this presumption is wrong – EdChum Apr 26 '19 at 19:24
  • This timing agrees with you: `testa = pd.DataFrame(np.arange(10000000), columns=['q'])`, then `%timeit testb = testa.loc[testa.q>6]` and `%timeit testc = testa[testa.q>7]` give `1 loop, best of 3: 207 ms per loop` and `1 loop, best of 3: 208 ms per loop` – Chogg Apr 26 '19 at 19:28
  • Ok. I take it from what you say that the label based indexing is not done by a for loop for loc. What would stop the operation from then being vectorised? Thanks – Chogg Apr 26 '19 at 19:30
  • Using .apply or iterating using for or iterrows for instance is not vectorised. Sorry but if you have a question then you should post a question, using comments as a discussion is bad form for SO – EdChum Apr 26 '19 at 19:33
  • Thanks. It's here https://stackoverflow.com/questions/55874138/does-loc-in-pandas-use-vectorised-logic-or-a-for-loop/55874171#55874171 – Chogg Apr 26 '19 at 19:53