I am working with a large dataset with a million rows and 1000 columns. I have already looked at this post here, so please don't mark this as a duplicate.

If sample data is required, you can use the following:

import numpy as np
import pandas as pd

m = pd.DataFrame(np.array([[1, 0],
                           [2, 3]]))

I have some continuous variables with 0 values in them.

I would like to apply a log transformation to all of those continuous variables.

However, I run into a divide-by-zero error, so I tried the suggestions below, based on the post linked above.

df['salary'] = np.log(df['salary'], where=0<df['salary'], out=np.nan*df['salary'])  # not working: "python stopped working" problem

from numpy import ma
ma.log(df['app_reg_diff'])  # error

My questions are as follows:

a) How can I avoid the divide-by-zero error when applying the log to 1000 columns, i.e. to all continuous columns?

b) How can I exclude the zeros from the log transformation and still get log values for the remaining non-zero observations?

The Great

1 Answer

You can replace the zero values with a value of your choice and then take the logarithm as usual.

import numpy as np
import pandas as pd

m = pd.DataFrame(np.array([[1,0], [2,3]]))

m[m == 0] = 1

print(np.log(m))

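With this replacement, the entries that were zero come out as 0, because log(1) = 0. If you would rather get NaN for them, you can replace the zeros with -1 instead: the log of a negative number is NaN (NumPy may also emit a RuntimeWarning).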
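If NaN is the preferred result for zero entries, one option that does not overwrite the original data is to mask the non-positive values before taking the log. This is only a minimal sketch on the sample frame `m`; the names `positive_only` and `logged` are made up for illustration.

```python
import numpy as np
import pandas as pd

m = pd.DataFrame(np.array([[1, 0], [2, 3]]))

# Keep strictly positive entries; every other entry becomes NaN.
positive_only = m.where(m > 0)

# The log of NaN is NaN, so the zeros never reach np.log
# and no divide-by-zero warning is triggered.
logged = np.log(positive_only)
print(logged)
```

The original `m` is left untouched, at the cost of one extra frame-sized copy for the masked data.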

Shahriar
  • Thanks, upvoted. Will try it and update you. – The Great Jan 24 '22 at 13:34
  • But we have to replace the `zero` with `1`, as you have done, right? Only then will the log be zero. If I replace zero with some other value, it will produce a log value for that replacement value. Am I right? – The Great Jan 24 '22 at 13:38
  • Additionally, how can I apply this to a dataframe with a million rows and 1000 columns? – The Great Jan 24 '22 at 13:39
  • @TheGreat You can replace it with any value. The value depends on what you will do with the output. For example, setting it to -1 makes sense because it gives `NaN`, and the logarithm of `0` is undefined. About applying it to large data: replacing zero values doesn't need new memory allocation. Maybe the way I did it in the example would allocate new memory. If it does, maybe a way can be found to eliminate the allocation for `m == 0`. – Shahriar Jan 24 '22 at 15:04
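On the memory question raised in the comments, one possible approach is to work one column at a time, so that temporary arrays are column-sized rather than frame-sized. The sketch below is not from the answer: `log_positive_columns` is a hypothetical helper, the small `df` is made-up sample data, and it assumes the continuous columns are the ones with `float64` dtype.

```python
import numpy as np
import pandas as pd

def log_positive_columns(df, columns):
    """Hypothetical helper: log-transform the given columns in df,
    writing NaN wherever the value is not strictly positive."""
    for col in columns:
        values = df[col].to_numpy(dtype="float64")
        out = np.full(values.shape, np.nan)        # NaN where the log is skipped
        np.log(values, out=out, where=values > 0)  # zeros are never passed to log
        df[col] = out
    return df

# Made-up sample data; "name" stands in for a non-continuous column.
df = pd.DataFrame({"salary": [1.0, 0.0, 2.0], "name": ["a", "b", "c"]})

# Assumption: continuous columns are stored as float64.
float_cols = df.select_dtypes(include="float64").columns
log_positive_columns(df, float_cols)
print(df)
```

Unlike the `np.log(..., where=..., out=...)` attempt in the question, this passes plain NumPy arrays rather than pandas Series to `np.log`, and pre-fills `out` with NaN so that skipped positions have a defined value.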