
There is a statement in my code that goes:

df.loc[i] = [df.iloc[0][0], i, np.nan]

where i is the iteration variable of the for loop that contains this statement, np is my imported numpy module, and df is a DataFrame that looks something like:

   build_number   name  cycles
0           390  adpcm   21598
1           390    aes    5441
2           390  dfadd     463
3           390  dfdiv    1323
4           390  dfmul     167
5           390  dfsin   39589
6           390    gsm    6417
7           390   mips    4205
8           390  mpeg2    1993
9           390    sha  348417

So as you can see, the statement in my code serves to insert new rows into my DataFrame df and fill the very last column (within that newly inserted row) under cycles with a NaN value.

However, in so doing, I get the following warning message:

/usr/local/bin/ipython:28: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

Looking at the Docs, I still don't understand what's the problem or risk that I'm incurring here. I thought that using loc and iloc follows the recommendation already?

Thank you.

EDIT: At the request of @EdChum, I have added the function that uses the above statement below:

def patch_missing_benchmarks(refined_dataframe):
  '''
  Patches up a given DataFrame, ensuring that all build_numbers have the complete
  set of benchmark names, inserting NaN values at the column where the data is
  supposed to be residing in.

  Accepts:
  --------
  * refined_dataframe
  DataFrame that was returned from the remove_early_retries() function and that
  contains no duplicates of benchmarks within a given build number and also has been
  sorted nicely to ensure that build numbers are in alphabetical order.
  However, this function can also accept the DataFrame that has not been sorted, so
  long as it has no repetition of benchmark names within a given build number.

  Returns:
  -------
  * patched_benchmark_df
  DataFrame with all Build numbers filled with the complete set of benchmark data,
  with those previously missing benchmarks now having NaN values for their data.
  '''
  patched_df_list = []
  benchmark_list = ['adpcm', 'aes', 'blowfish', 'dfadd', 'dfdiv', 'dfmul',
                    'dfsin', 'gsm', 'jpeg', 'mips', 'mpeg2', 'sha']
  benchmark_series = pd.Series(data = benchmark_list)

  for number in refined_dataframe['build_number'].drop_duplicates().values:
    # df must be a DataFrame whose data has been sorted according to build_number
    # followed by benchmark name
    df = refined_dataframe.query('build_number == %d' % number)

    # Now we compare the benchmark names present in our section of the DataFrame
    # with the Series containing the complete collection of Benchmark names and
    # get back a boolean Series telling us precisely what benchmark names
    # are missing
    boolean_bench = benchmark_series.isin(df['name'])
    list_names = []
    for i in range(0, len(boolean_bench)):
      if boolean_bench[i] == False:
        name_to_insert = benchmark_series[i]
        list_names.append(name_to_insert)
      else:
        continue
    print 'These are the missing benchmarks for build number',number,':'
    print list_names

    for i in list_names:
      # create a new row with index that is benchmark name itself to avoid overwriting
      # any existing data, then insert the right values into that row, filling in the
      # space name with the right benchmark name, and missing data with NaN
      df.loc[i] = [df.iloc[0][0], i, np.nan]

    patched_for_benchmarks_df = df.sort_index(by=['build_number',
                                              'name']).reset_index(drop = True)

    patched_df_list.append(patched_for_benchmarks_df)

  # we make sure we call a dropna method at threshold 2 to drop those rows whose benchmark
  # names as well as cycles names are NaN, leaving behind the newly inserted rows with
  # benchmark names but that now have the data as NaN values
  patched_benchmark_df = pd.concat(objs = patched_df_list, ignore_index =
                                   True).sort_index(by= ['build_number',
                                  'name']).dropna(thresh = 2).reset_index(drop = True)

  return patched_benchmark_df
Nick ODell
AKKO
  • I don't even know anything about pandas, but reading that documentation you linked makes me think you need to change `df.iloc[0][0]` to `df.iloc[:, (0, 0)]`. – Two-Bit Alchemist Feb 26 '15 at 06:17
  • Even though you are using `iloc` you are double subscripting which is producing the warning, can you show your code that is using this line, it's a little unclear to me, edit it into your question – EdChum Feb 26 '15 at 08:38
  • I think to make your code more readable it'd be better to do `df.iloc[0]['build_number']` – EdChum Feb 26 '15 at 10:49
  • Yes this is much more readable - however is doing that still doing double subscripting? Also, I kind of noticed that the way `.iloc` works is like `.iloc[index_number][column_name OR column_number]` is that the case? The first subscript tells it which row as indicated by the index it should get, then the second subscript tells which column's value in that particular row that we want to get? Just want to verify my understanding. – AKKO Feb 27 '15 at 02:11
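On the `.iloc` question in the last comment, here is a minimal sketch, assuming a three-row toy frame that only mirrors the columns from the question. `df.iloc[0]` returns the first row as a Series, and a second subscript then looks a value up in that Series, so it is two separate operations. Passing the row and the column to a single indexer does the same lookup in one step. As far as I can tell, a chained read like this on the right-hand side is not what sets off SettingWithCopyWarning by itself; the warning is about assigning into a frame that pandas believes was derived from another one.

```python
import pandas as pd

# Toy data (an assumption for illustration, not the asker's real frame).
df = pd.DataFrame({
    "build_number": [390, 390, 390],
    "name": ["adpcm", "aes", "dfadd"],
    "cycles": [21598, 5441, 463],
})

# Chained access: step 1 builds a Series for row 0,
# step 2 picks one entry out of that Series by column label.
first_row = df.iloc[0]
chained_value = first_row["build_number"]

# Single indexer: row position 0 and column position 0 in one call.
single_value = df.iloc[0, 0]

# Label-based equivalent: first index label plus column name.
label_value = df.loc[df.index[0], "build_number"]

print(chained_value, single_value, label_value)
```

All three reads return the same build number; the single-indexer forms just avoid creating the intermediate row Series.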

1 Answer


Without seeing how you are doing this: if you just want to set the 'cycles' column, then the following would work without raising any warning:

In [344]:

for i in range(len(df)):
    df.loc[i,'cycles'] = np.nan
df
Out[344]:
   build_number   name  cycles
0           390  adpcm     NaN
1           390    aes     NaN
2           390  dfadd     NaN
3           390  dfdiv     NaN
4           390  dfmul     NaN
5           390  dfsin     NaN
6           390    gsm     NaN
7           390   mips     NaN
8           390  mpeg2     NaN
9           390    sha     NaN

If you just want to set the entire column, there is no need to loop. Just do this: df['cycles'] = np.nan
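For the row-insertion case in the edited question, a likely cause of the warning is that df is created with refined_dataframe.query(...), so pandas regards it as derived from refined_dataframe and cannot tell whether an assignment into it is meant to reach the original. The sketch below assumes a small toy frame in place of refined_dataframe and takes an explicit `.copy()` before adding rows, which should stop the SettingWithCopyWarning because df is then an independent object. The variable `missing_name` and the toy data are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for refined_dataframe (an assumption for illustration).
refined_dataframe = pd.DataFrame({
    "build_number": [390, 390, 391],
    "name": ["adpcm", "aes", "adpcm"],
    "cycles": [21598, 5441, 20000],
})

# .copy() makes df independent of refined_dataframe, so later
# assignments are unambiguous about which object they modify.
df = refined_dataframe.query("build_number == 390").copy()

# Read the build number once, with a single indexer.
build = df.iloc[0, 0]

# Add one row per missing benchmark, using the name as the new index label.
for missing_name in ["blowfish", "jpeg"]:
    df.loc[missing_name] = [build, missing_name, np.nan]

print(df)
```

The original frame is left untouched, which is usually what is intended when patching one build number at a time.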

EdChum
  • Thank you for your suggestion, but I'm not trying to set the entire column to get `NaN`. Rather my situation is such that I have missing `name` that I need inserted to complete the `build_number` and when I insert them in I want to make their corresponding `cycles` value at that same row be `NaN`. For example, with reference to my df above in my question, I want to insert additional rows of `390 jpeg NaN` and `390 blowfish NaN` to complete the listing of all `names` for the given `build_number 390`. – AKKO Feb 26 '15 at 10:31
  • So you just want to append new rows, can you post desired output to your question – EdChum Feb 26 '15 at 10:34
  • For my desired output, please see my post on http://stackoverflow.com/questions/28739931/multiplying-just-one-column-from-each-of-the-2-input-dataframes-together/28740030#28740030 in that question, the 2 dataframes that I'm trying to multiply together are both the desired outputs. I had no problems getting my desired outputs, but I am just concerned with the Warning message and whether I can and should avoid it. – AKKO Feb 26 '15 at 10:40
  • Well you should follow my answer semantic and avoid double subscripting, that is probably why you get this error – EdChum Feb 26 '15 at 10:42
  • Ok this is really funny but now I don't get any errors at all even when I do my own way of double subscripting with `[0][0]`. This is weird... But thank you for your suggestion. Your answer schematic `df.loc[i,'cycles'] = np.nan` might not work because I can't hardcode `cycles` in; it must be able to be `fmax` too. And apart from that, I'm not trying to iterate over each index; I want to iterate over a list of missing benchmark names and insert each of them into the dataframe.
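Since the data column cannot be hardcoded and the goal is to insert missing names rather than loop over the index, one possible alternative is to reindex on the full list of benchmark names. This is only a sketch under stated assumptions: `fill_missing_names` is a hypothetical helper, it expects the rows of a single build number with no repeated names, and it leaves every non-key column (cycles, fmax, or anything else) as NaN for the inserted rows.

```python
import pandas as pd

def fill_missing_names(df, all_names):
    """Hypothetical helper: return the rows of one build number with
    every name in all_names present; data for missing names is NaN."""
    build = df["build_number"].iloc[0]
    out = (
        df.set_index("name")
          .reindex(all_names)      # rows for missing names are filled with NaN
          .rename_axis("name")     # make sure the index keeps its name
          .reset_index()
    )
    out["build_number"] = build    # restore the build number on every row
    return out[list(df.columns)]   # put columns back in their original order

benchmark_list = ["adpcm", "aes", "blowfish", "dfadd", "dfdiv", "dfmul",
                  "dfsin", "gsm", "jpeg", "mips", "mpeg2", "sha"]

# The frame shown in the question.
df = pd.DataFrame({
    "build_number": [390] * 10,
    "name": ["adpcm", "aes", "dfadd", "dfdiv", "dfmul",
             "dfsin", "gsm", "mips", "mpeg2", "sha"],
    "cycles": [21598, 5441, 463, 1323, 167, 39589, 6417, 4205, 1993, 348417],
})

patched = fill_missing_names(df, benchmark_list)
print(patched)
```

The rows come back in the order of `benchmark_list`, so no separate sort is needed for a single build. For several build numbers the helper could be applied to each group from `groupby('build_number')` and the pieces concatenated.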