I'm doing some feature engineering for my data. Each datapoint is a set of values, namely the transaction amounts for a customer's account. I need to create a single feature from each set; for example, I take the maximum, mean, minimum, or another statistic. However, the data is right-skewed and contains some large values, so I want to apply a logarithmic transformation.
Note: The large values are important as well, and I do not want to exclude them as outliers. Also, the data approximately follows a log-normal distribution, so the logarithmic transformation makes it almost normal.
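For illustration, here is a minimal sketch of what I mean, using simulated (not my real) amounts: log-normal data is right-skewed, and the log transform makes it roughly symmetric.

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
amounts = rng.lognormal(mean=3.0, sigma=1.0, size=10_000)  # simulated transaction amounts

print(skew(amounts))          # strongly positive: right-skewed
print(skew(np.log(amounts)))  # close to 0: roughly symmetric / normal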
Here is the question: do I first take the logarithm of the amounts and then take the sum/mean/etc., or do I first take the sum/mean/etc. and then apply the logarithm?
import numpy as np

# amounts: the transaction amounts for one customer, as a numpy array

# first option: log-transform each amount, then aggregate
np.log(amounts).sum()
# second option: aggregate first, then log-transform
np.log(np.sum(amounts))
I know that the order of the logarithm and the maximum (or the median and other quantiles) does not matter, since the logarithm is monotonically increasing, but it does matter for the sum/mean and some other statistics (see the quick check below). Which way is the right one? Thanks :)
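A quick sanity check of that claim, again with simulated amounts:

import numpy as np

rng = np.random.default_rng(0)
amounts = rng.lognormal(mean=3.0, sigma=1.0, size=1_000)

# log commutes with the max (and other quantiles) because it is monotonically increasing
print(np.isclose(np.log(amounts.max()), np.log(amounts).max()))  # True

# but not with the mean: the mean of the logs is <= the log of the mean (Jensen's inequality)
print(np.mean(np.log(amounts)), np.log(np.mean(amounts)))        # two different numbers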