Repair Data for Markov Chain Monte Carlo Simulation

Question

As is well known, all probabilities need to sum to 1. I have a pandas DataFrame in which the probability of one event is sometimes missing.
Since I know that all elements of a row must sum to 1, I want to replace each NaN with a calculated value, with something like the following for each row of my DataFrame:

for index, row in df.iterrows():
    row.replace(NaN, 1 - sum_of_row)  # pseudocode

As an example, here is the array I am using as test data at the moment:

    matrixsum
     e    f    g
a  0.3  0.2  Nan
b  0.2  0.2  0.6
c  0.7  0.1  Nan

By using df.fillna(0) I get this:

  matrixsum
     e    f    g
a  0.3  0.2  0.0
b  0.2  0.2  0.6
c  0.7  0.1  0.0

An additional problem is that only rows of float or int values can be summed to 1, but the NaN entries are stored as strings. At the moment I just use df.fillna(0), but this is a bad thing to do.

Expected output:

  matrixsum
     e    f    g
a  0.3  0.2  0.5
b  0.2  0.2  0.6
c  0.7  0.1  0.2

Where's the `nan` in your sample dataframe? Please share a proper one with expected output. — Mayank Porwal, Apr 26 '21 at 13:09
Thanks for your advice. I changed the question and tried to implement the things you asked for. — Hans Peter, Apr 26 '21 at 13:18
If a row contains more than one NaN there is no solution, and the data can't be repaired. But I want to reduce the amount of data the user has to enter. — Hans Peter, Apr 26 '21 at 15:26
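
The constraint in the last comment can be checked before any repair is attempted. A minimal sketch, assuming the "Nan" entries are plain strings as described in the question; the helper name `unrepairable_rows` is hypothetical and not part of pandas:

```python
import pandas as pd

# Test data from the question; "Nan" is assumed to be a plain string.
df = pd.DataFrame(
    {"e": [0.3, 0.2, 0.7], "f": [0.2, 0.2, 0.1], "g": ["Nan", 0.6, "Nan"]},
    index=["a", "b", "c"],
)


def unrepairable_rows(frame):
    """Return the index labels of rows with more than one missing value."""
    # Coerce everything to numbers; anything unparseable becomes a real NaN.
    numeric = frame.apply(pd.to_numeric, errors="coerce")
    missing_per_row = numeric.isna().sum(axis=1)
    return list(missing_per_row[missing_per_row > 1].index)


print(unrepairable_rows(df))  # [] for the test data above
```

Rows returned by this check would have to be sent back to the user, since a single equation per row cannot determine two unknowns.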

score 2 · Answer 1 · answered Apr 26 '21 at 13:12

If you are sure that the NaN in every row always appears in a single column (let's say g), you can do this:

Consider the df below:

In [21]: df
Out[21]: 
     e    f    g
a  0.3  0.2  Nan
b  0.2  0.2  0.6
c  0.7  0.1  Nan

In [22]: df['g'] = 1 - df[['e', 'f']].sum(axis=1)

In [23]: df
Out[23]: 
     e    f    g
a  0.3  0.2  0.5
b  0.2  0.2  0.6
c  0.7  0.1  0.2
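
Note that this assignment overwrites every value in g, including the ones that were already present. A variant that only fills the gaps could look like the sketch below, assuming the missing entries may be strings such as "Nan" and that g is the only affected column:

```python
import pandas as pd

# Test data from the question; "Nan" is assumed to be a plain string.
df = pd.DataFrame(
    {"e": [0.3, 0.2, 0.7], "f": [0.2, 0.2, 0.1], "g": ["Nan", 0.6, "Nan"]},
    index=["a", "b", "c"],
)

# Turn string placeholders into real NaN so the column becomes float.
df["g"] = pd.to_numeric(df["g"], errors="coerce")

# Sum only the other columns, then fill just the missing entries of g.
remainder = 1 - df.drop(columns="g").sum(axis=1)
df["g"] = df["g"].fillna(remainder)

print(df.round(10))
```

Because `fillna` aligns the remainder on the row index, existing values such as the 0.6 in row b are left untouched.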

Mayank Porwal

The missing values are randomly distributed throughout the whole dataset, so your solution does not really solve my problem right now, but I can imagine someone else might be very happy about it. Maybe I should have clarified this in my question, so that's my bad. Thanks for your effort and your help in improving my question text. – Hans Peter Apr 26 '21 at 15:35

score 1 · Accepted Answer · answered Apr 26 '21 at 13:16

You can first convert your dataframe to numeric values, and then fill the NaNs of each row with 1 - row.sum():

import pandas as pd

df = df.apply(pd.to_numeric, errors="coerce")
df = df.apply(lambda row: row.fillna(1 - row.sum()), axis=1)

or equivalently, you can combine these two in a function:

def markovize(row):
    row = pd.to_numeric(row, errors="coerce")
    return row.fillna(1 - row.sum())

df = df.apply(markovize, axis=1)

Before:

     e    f    g
a  0.3  0.2  Nan
b  0.2  0.2  0.6
c  0.7  0.1  Nan

After:

     e    f    g
a  0.3  0.2  0.5
b  0.2  0.2  0.6
c  0.7  0.1  0.2
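
For larger frames, the same repair can be sketched column by column instead of row by row. This is an alternative formulation, not part of the answer above. It assumes at most one missing value per row, the test frame is a made-up variation of the question's data with the gaps spread across columns, and the tolerance check at the end is an added sanity test:

```python
import numpy as np
import pandas as pd

# Variation of the question's data with gaps in different columns (assumed).
df = pd.DataFrame(
    {"e": [0.3, "Nan", 0.7], "f": [0.2, 0.2, "Nan"], "g": ["Nan", 0.6, 0.2]},
    index=["a", "b", "c"],
)

# Coerce string placeholders to real NaN values.
numeric = df.apply(pd.to_numeric, errors="coerce")

# sum() skips NaN by default, so this is 1 minus the known values per row.
remainder = 1 - numeric.sum(axis=1)

# Fill each column's gaps with the remainder of the matching row.
repaired = numeric.apply(lambda col: col.fillna(remainder))

# Every row should now sum to 1 up to floating-point error.
assert np.allclose(repaired.sum(axis=1), 1.0)
print(repaired.round(10))
```

If a row had two or more gaps, each would receive the same remainder and the final assertion would fail, which makes the unrepairable case visible instead of silently producing wrong probabilities.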

I tried this solution and, as far as I could check, it works for me. I think I will accept the answer after some more compatibility checks tomorrow. Thanks for your help and have a nice evening. — Hans Peter, Apr 26 '21 at 15:28
