0

I have recently started using Pandas library to validate and manipulate .CSV-files and Excel-files. There is an issue I have just whitnessed. When assigning values to a numpy-series (column of the Pandas DataFrame) and the values I'd like to assign start with '=', then it seems that Pandas interprets those values as formulas.

Source (.CSV-file):

ID | Name | Size
--------------------------
 0 |  Foo | =500.000 cells
 1 |  Bar | Baz

Result (Excel-file):

ID | Name | Size
----------------
 0 |  Foo | 0     <- wrong value
 1 |  Bar | Baz

When I open the resulted Excel-file an error message is being displayed. Unfortunately the error message is in german, so I'll try to translate it as well as possible: "We have detected a problem with some of the content of 'FILENAME.xlsx'. Would you like to restore the content? If you trust this workbook, then press 'Yes'." When pressing 'Yes' the workbook is being opened and all the values that should be '=something' are '0'. Every other value was assigned correctly!

This is the line of code is use to assign the values.

ws.ix[some_row_index, some_col_index] = edit_set.ix[some_other_row_index, some_other_col_index].values

Is there a way to bypass this? Maybe by using another method to assign parameters or by setting a parameter?

EDIT: It seems that the problem occurs even if I don't change/set the value. So if the initial value, that the Excel-file that I open with Pandas, starts with '=' and I don't overwrite/change the value at all, the error still occurs when writing the DataFrame to_excel. I still haven't found a way to tell Pandas to "ignore formulas" when writing to Excel-file.

So I think the change in my code should happen in the following two lines:

writer = pd.ExcelWriter(path + name, engine='xlsxwriter', options={'strings_to_urls': False})
edit_final_file.to_excel(writer, sheet_name = 'Main', index=False, encoding='utf-8')
fmedv
  • 153
  • 1
  • 2
  • 12
  • Not sure if it is an answer, but the way to start cell data with a `=` in excel is to prepend a `'` to make the content literal. So that is the cell contains `'=500 ...` instead of `=500 ...` – Moberg Aug 13 '18 at 13:02
  • To me it sounds like pandas couldn't parse an integer from the string "=something". Try `int("4")` and `int("=4")` to see what I mean. So your solution might be to find the cells prefixed with `=` and remove the `=`. – nitzel Aug 13 '18 at 13:16
  • Thank you for your comments, but unfortunately I'm not able to get any further to reaching my goal.I'm going to edit something in my initial question, because I have just noticed something regarding my problem – fmedv Aug 16 '18 at 14:48

2 Answers2

1

As mentioned in the comments, prepend ' to the beginning of cells that start with =:

df["Size"] = df["Size"].apply(lambda x: "'" + str(x) if str(x)[0] == "=" else x)
sundance
  • 2,905
  • 4
  • 21
  • 31
0

Should avoid the call to apply. Pandas has built-in str methods.

Something like this:

mask = df['Size'].str.startswith('=')
df.loc[mask, 'Size']  = "'" + df.loc[mask, 'Size']
PMende
  • 5,171
  • 2
  • 19
  • 26
  • 1
    Agree that it's best to avoid `apply` in general, but I don't believe the pandas string methods are vectorized, so this will just introduce overhead (calculating the mask, etc.). See answer [here](https://stackoverflow.com/questions/37687806/pandas-vectorized-operation-to-get-the-length-of-string). The advantage of string methods is that they skip over null values, but that isn't an issue here since we're only applying the operation to values that start with `=` – sundance Aug 16 '18 at 15:45
  • Interesting. I had been under the impression that the methods were vectorized. Thanks for the info! – PMende Aug 16 '18 at 15:59
  • Yeah, see `str.cat` source [here](https://github.com/pandas-dev/pandas/blob/v0.23.4/pandas/core/strings.py#L114). Looks like it just ends up looping over the elements – sundance Aug 16 '18 at 16:24