
I'm in the process of converting my existing R code to Python as a way to teach myself, but I've run into something that I can't seem to crack.

Here's an example of the R code, which works as expected:

library(data.table)

var <- 0.08

a <- data.frame(a = runif(10, 0, 1), 
                b = runif(10, 0, 1), 
                c = runif(10, 0, 1), 
                d = runif(10, 0, 1))

b <- data.frame(a = c(0,4,6,8,10,12,12,14,16,18), 
                b = c(2,6,8,10,12,14,14,16,18,20), 
                c = c(4,8,10,12,14,16,16,18,20,22),
                d = c(6,10,12,14,16,18,18,20,22,24))

output <- data.table(total = seq(0, 10))

output[total%%2==0, prob:= apply(output[total%%2==0], 1, function(x) { sum(a[, 1:4] * (b[, 1:4]==x[1]))})]
output[total%%2==1, prob:= apply(output[total%%2==1], 1, function(x) { sum(a[, 1:4] * (b[, 1:4]==(x[1]-1))) * var/(1-var)})]

And here's what I tried in Python, which returns NaN in the 'prob' column:

import numpy as np
import pandas as pd

var = 0.08

a = pd.DataFrame(np.random.uniform(0, 1, size=(10, 4)), columns=['a', 'b', 'c', 'd'])

b = pd.DataFrame({'a': [0, 4, 6, 8, 10, 12, 12, 14, 16, 18],
                  'b': [2, 6, 8, 10, 12, 14, 14, 16, 18, 20],
                  'c': [4, 8, 10, 12, 14, 16, 16, 18, 20, 22],
                  'd': [6, 10, 12, 14, 16, 18, 18, 20, 22, 24]})

output = pd.DataFrame({'total': range(0, 11)})

output.loc[output['total'] % 2 == 0, 'prob'] = output[output['total'] % 2 == 0].apply(lambda x: np.sum(a.iloc[:, 0:4] * (b.iloc[:, 0:4] == x[0])), axis=1)
output.loc[output['total'] % 2 == 1, 'prob'] = output[output['total'] % 2 == 1].apply(lambda x: np.sum(a.iloc[:, 0:4] * (b.iloc[:, 0:4] == (x[0] - 1))) * var / (1 - var), axis=1)

Any help would be appreciated!

Thanks

  • Please explain the logic of the R code in detail. Also provide the expected output (use `np.random.seed(0)` to make your input reproducible) – mozway Feb 25 '23 at 07:14

1 Answer

Unfortunately, this is one of the things you need to learn when migrating R code to Python. In R, calling sum on a data.frame adds up every element; this is not the case with pandas. For example, see this question.
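To make the difference concrete, here is a minimal sketch on a tiny frame. The frame `df` and the variable names are made up for illustration and are not part of the question's data. It only uses `DataFrame.sum` and `DataFrame.to_numpy`, since what `np.sum` returns for a DataFrame depends on how pandas handles the call and may differ between versions.

```python
import pandas as pd

# Hypothetical 2x2 frame, used only for illustration.
df = pd.DataFrame({"a": [1, 2], "b": [10, 20]})

# Default axis=0: one sum per column, returned as a Series.
col_sums = df.sum()

# Two ways to get a single grand total, as R's sum() would give.
total_chained = df.sum().sum()     # sum the per-column sums
total_numpy = df.to_numpy().sum()  # sum every element of the underlying array

print(col_sums)
print(total_chained, total_numpy)
```

The first print shows a Series with one entry per column, while the last two values are plain scalars.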

By default, calling sum on a DataFrame sums each column separately (axis=0) and returns a Series with one value per column, not a single total. So the function you pass to apply returns a Series for each row, when in fact you were expecting a single value, which is why 'prob' ends up as NaN. You can check this by printing the result in each iteration.

output.loc[output['total'] % 2 == 0, 'prob'] = output[output['total'] % 2 == 0].apply(
    # x is one row of output; look up 'total' by label rather than by position
    lambda x: print(np.sum(a.iloc[:, 0:4] * (b.iloc[:, 0:4] == x['total']))),
    axis=1
)

You will see a bunch of Series. The solution is either to sum the values of each Series again, or to convert the DataFrame into a NumPy array before summing.

import numpy as np
import pandas as pd

var = 0.08

a = pd.DataFrame(np.random.uniform(0, 1, size=(10, 4)), columns=['a', 'b', 'c', 'd'])

b = pd.DataFrame({'a': [0, 4, 6, 8, 10, 12, 12, 14, 16, 18],
                  'b': [2, 6, 8, 10, 12, 14, 14, 16, 18, 20],
                  'c': [4, 8, 10, 12, 14, 16, 16, 18, 20, 22],
                  'd': [6, 10, 12, 14, 16, 18, 18, 20, 22, 24]})

output = pd.DataFrame({'total': range(0, 11)})

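# Even totals: sum the entries of a where b equals the total.
# The trailing .sum() collapses the per-column sums into a single value.
output.loc[output['total'] % 2 == 0, 'prob'] = output[output['total'] % 2 == 0].apply(
    lambda x: np.sum(a.iloc[:, 0:4] * (b.iloc[:, 0:4] == x['total'])).sum(),
    axis=1
)

# Odd totals: same sum for total - 1, scaled by var / (1 - var).
output.loc[output['total'] % 2 == 1, 'prob'] = output[output['total'] % 2 == 1].apply(
    lambda x: np.sum(a.iloc[:, 0:4] * (b.iloc[:, 0:4] == (x['total'] - 1))).sum() * var / (1 - var),
    axis=1
)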

output
    total       prob
0       0   0.503596
1       1  0.0437909
2       2    0.20748
3       3  0.0180417
4       4   0.666049
5       5  0.0579173
6       6    1.35971
7       7   0.118235
8       8    1.33156
9       9   0.115787
10     10     2.5496

Which I guess is what you want. You should definitely provide the desired output in future questions, as it makes it much easier to help.
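As an aside, the same numbers can probably be computed without apply. The following is only a sketch of one way to do it, not necessarily what the R code's author intended: it sums the entries of `a` grouped by the matching value in `b`, then looks up each total. The seed and the helper names `weights`, `by_value`, `even_total` and `factor` are my own additions for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumption: seeded only so this sketch is reproducible
var = 0.08

a = pd.DataFrame(np.random.uniform(0, 1, size=(10, 4)), columns=['a', 'b', 'c', 'd'])
b = pd.DataFrame({'a': [0, 4, 6, 8, 10, 12, 12, 14, 16, 18],
                  'b': [2, 6, 8, 10, 12, 14, 14, 16, 18, 20],
                  'c': [4, 8, 10, 12, 14, 16, 16, 18, 20, 22],
                  'd': [6, 10, 12, 14, 16, 18, 18, 20, 22, 24]})

# Flatten both frames in the same (row-major) order and sum the entries
# of a for each distinct value found in b.
weights = pd.Series(a.to_numpy().ravel())
by_value = weights.groupby(b.to_numpy().ravel()).sum()

output = pd.DataFrame({'total': range(0, 11)})

# Odd totals reuse the sum for total - 1, scaled by var / (1 - var).
even_total = output['total'] - output['total'] % 2
factor = np.where(output['total'] % 2 == 0, 1.0, var / (1 - var))

# Values of total that never occur in b get a probability of 0.
output['prob'] = even_total.map(by_value).fillna(0.0).to_numpy() * factor

print(output)
```

The grouping is done once, so the cost no longer grows with a full pass over `a` and `b` for every row of `output`.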

fvall