
I'm doing some feature engineering on my data. Each data point is a set of values, namely the transaction amounts for a customer's account. I need to reduce each set to a single value per feature, for example by taking the maximum, mean, minimum, or other statistics. However, the data is right-skewed and contains some large values, so I want to apply a logarithmic transformation.

Note: The large values are important as well, and I do not want to exclude them as outliers. Also, the data approximately follows a log-normal distribution, so a logarithmic transformation makes it almost normal.
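As a rough illustration of that note, here is a minimal sketch that compares the skewness before and after the log transform. The data is synthetic (drawn from a log-normal distribution with assumed parameters mu=5 and sigma=1), and `sample_skewness` is a hypothetical helper written for this example, not a NumPy function.

```python
import numpy as np

def sample_skewness(values):
    """Return the (biased) sample skewness: the mean of the cubed z-scores."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3))

# Synthetic stand-in for the real amounts (assumption: log-normal, mu=5, sigma=1).
rng = np.random.default_rng(0)
synthetic_amounts = rng.lognormal(mean=5.0, sigma=1.0, size=10_000)

skew_raw = sample_skewness(synthetic_amounts)          # strongly positive
skew_log = sample_skewness(np.log(synthetic_amounts))  # close to zero
print(f"skewness before log: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

On real transaction data the post-transform skewness would show how close to normal the logged amounts actually are.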

Here comes the question: Do I first take the logarithm of amounts and then take the sum/mean/etc., or first take the sum/mean/etc. and then transform with a logarithm?

import numpy as np

# first option: transform each amount, then aggregate
np.log(amounts).sum()

# second option: aggregate, then transform
np.log(np.sum(amounts))

I know that the order does not matter for the maximum (or the median and other quantiles), because the logarithm is monotonic, but it does matter for the sum, the mean, and some other statistics. Which way is the right one? Thanks :)
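To make the difference concrete, here is a small sketch with made-up amounts (the values are purely illustrative). Taking the log first and then the mean gives the log of the geometric mean, while taking the mean first gives the log of the arithmetic mean. Because the logarithm is concave, the first is never larger than the second (Jensen's inequality).

```python
import numpy as np

# Hypothetical transaction amounts for one account (illustrative values only).
amounts = np.array([120.0, 250.0, 480.0, 100000.0, 1.0])

log_then_mean = np.log(amounts).mean()  # log of the geometric mean
mean_then_log = np.log(amounts.mean())  # log of the arithmetic mean

# The geometric mean computed directly, for comparison with log_then_mean.
geometric_mean = amounts.prod() ** (1.0 / len(amounts))

# log is monotonically increasing, so it commutes with max and with the
# median of an odd-length sample (a single order statistic).
max_commutes = np.isclose(np.log(amounts).max(), np.log(amounts.max()))
median_commutes = np.isclose(np.median(np.log(amounts)), np.log(np.median(amounts)))

print(f"mean of logs: {log_then_mean:.3f}, log of mean: {mean_then_log:.3f}")
print(f"max commutes: {max_commutes}, median commutes: {median_commutes}")
```

So the two options answer different questions: the first summarizes the typical transaction on a multiplicative scale, while the second summarizes the total (or average) volume, which a few large amounts dominate.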

1 Answer


I think the second option is better, because it reduces the effect of outliers.

I also think you should use the median rather than the mean, max, min, or sum, because the effect of outliers on the median is considerably lower.

For example, consider an account with 100 transactions: 98 of them are between 100 and 500 USD, one is higher than 100,000 USD, and one is 1 USD. Relative to the other 98 transactions, the 100,000 USD and 1 USD values are very likely to be outliers.

If you take the max or min, it will select one of the outliers.
If you take the mean or sum, the 100,000 USD transaction still has a large effect compared with the other 98 transactions.

Therefore, use np.log(np.median(amounts)).
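The example above can be sketched as follows. The 98 typical amounts are drawn uniformly between 100 and 500 USD, which is an assumption made only for illustration; the two extreme values are set to exactly 100,000 USD and 1 USD.

```python
import numpy as np

# Assumption: the 98 "typical" transactions are uniform on [100, 500] USD.
rng = np.random.default_rng(0)
typical = rng.uniform(100.0, 500.0, size=98)
amounts = np.concatenate([typical, [100_000.0, 1.0]])

aggregates = {
    "min": amounts.min(),
    "median": np.median(amounts),
    "mean": amounts.mean(),
    "max": amounts.max(),
    "sum": amounts.sum(),
}

# Print each aggregate on the raw scale and after the log transform.
for name, value in aggregates.items():
    print(f"{name:>6}: raw={value:12.2f}  log={np.log(value):6.2f}")
```

The median stays inside the typical range, while the mean is pulled well above 500 USD by the single large transaction.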

If the data is still right-skewed, you have to identify and remove outliers. For that you can use box plots, among other tools.
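A box plot usually draws its whiskers at 1.5 times the interquartile range, so the same rule can be applied numerically. This is a minimal sketch; `iqr_outlier_mask` and the factor `k=1.5` are assumptions chosen for this example, and whether flagged values should actually be removed depends on the data.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR], the usual box plot whiskers."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Illustrative amounts: five typical values and one very large one.
amounts = np.array([100.0, 200.0, 300.0, 400.0, 500.0, 100_000.0])
mask = iqr_outlier_mask(amounts)
print(amounts[mask])  # the values the rule would flag
```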

Otherwise, you can use a different function to create the feature. Try something like np.log(median(amounts)) / np.log(agg_func(amounts)), where agg_func is an aggregate other than the median (sum, max, mean, etc.). It may be a better feature in your case.
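A minimal sketch of that ratio feature is below. The function name `log_ratio_feature` is hypothetical, and the guards are my own additions: the logarithm needs strictly positive amounts, and the ratio is undefined when the aggregate equals 1 (its logarithm is 0).

```python
import numpy as np

def log_ratio_feature(amounts, agg_func=np.max):
    """Return log(median(amounts)) / log(agg_func(amounts)), or NaN if undefined."""
    amounts = np.asarray(amounts, dtype=float)
    if np.any(amounts <= 0):
        raise ValueError("amounts must be strictly positive to take logarithms")
    denominator = np.log(agg_func(amounts))
    if denominator == 0:
        # agg_func(amounts) == 1, so the ratio is undefined.
        return float("nan")
    return float(np.log(np.median(amounts)) / denominator)

example = log_ratio_feature([100.0, 200.0, 400.0], agg_func=np.max)
print(example)  # log(200) / log(400)
```

Note that the ratio can change sign or blow up when the aggregate is close to 1, so it is probably most useful when all amounts are well above 1.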

YJR
  • Thanks for the suggestion. I do not want to exclude the outliers, because they are not really outliers: they carry important information. The reason for the log transformation is mainly that the data follows an almost log-normal distribution, so the transformed data is almost normal. Do note, though, that log(median(amounts)) = median(log(amounts)), because the logarithm is a monotonically increasing function. Therefore the answer is still to be found... – Michaela Mašková Sep 13 '22 at 07:28
  • No. What I am suggesting is that if you use the sum or mean, then the second option is better, and that the median is better than the other aggregate functions whether you use option 1 or 2. – YJR Sep 13 '22 at 07:54
  • I updated the answer. Check whether it works for you. – YJR Sep 13 '22 at 07:58