Let's say I have a pandas dataframe:

   rid category
0   0       c2
1   1       c3
2   2       c2
3   3       c3
4   4       c2
5   5       c2
6   6       c1
7   7       c3
8   8       c1
9   9       c3

I want to add two columns, pid and nid, such that for each row pid contains a random id (other than rid) from the same category as rid, and nid contains a random id from a different category than rid.

An example dataframe would be:

   rid category pid nid
0   0       c2   2   1
1   1       c3   7   4
2   2       c2   0   1
3   3       c3   1   5
4   4       c2   5   7
5   5       c2   4   6  
6   6       c1   8   5
7   7       c3   9   8
8   8       c1   6   2
9   9       c3   1   2

Note that pid should not be the same as rid. Right now, I am just brute forcing it by iterating through the rows and sampling each time, which seems very inefficient.

Is there a better way to do this?

EDIT 1: For simplicity let us assume that each category is represented at least twice, so that at least one id can be found that is not rid but has the same category.

EDIT 2: For further simplicity, let us assume that in a large dataframe the probability of ending up with the same id as rid is zero. If that is the case, I believe the solution should be easier. I would prefer not to make this assumption, though.

Vikash Balasubramanian

3 Answers

For the pid column, use Sattolo's algorithm, which shuffles each group so that no element stays in its original position. For the nid column, take the set difference between all rid values and the rid values of the current group, then sample from it with numpy.random.choice:

from random import randrange

import numpy as np

# https://stackoverflow.com/questions/7279895
def sattoloCycle(items):
    items = list(items)
    i = len(items)
    while i > 1:
        i = i - 1
        j = randrange(i)  # 0 <= j <= i-1
        items[j], items[i] = items[i], items[j]
    return items

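def outsideGroupRand(x):
    # sample ids that are not in the current group; with replace=False
    # there must be at least len(x) ids outside the group
    return np.random.choice(list(set(df['rid']).difference(x)),
                            size=len(x),
                            replace=False)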


df['pid1'] = df.groupby('category')['rid'].transform(sattoloCycle)
df['nid1'] =  df.groupby('category')['rid'].transform(outsideGroupRand)
print (df)
   rid category  pid  nid  pid1  nid1
0    0       c2    2    1     4     6
1    1       c3    7    4     7     4
2    2       c2    0    1     5     3
3    3       c3    1    5     1     0
4    4       c2    5    7     2     9
5    5       c2    4    6     0     8
6    6       c1    8    5     8     3
7    7       c3    9    8     9     5
8    8       c1    6    2     6     5
9    9       c3    1    2     3     6
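
One property of this approach may be worth checking before relying on it: Sattolo's algorithm produces a single cyclic permutation, so within each category every rid is used as a pid exactly once and no row is paired with itself. That is not the same as sampling each pid independently. The following is a minimal, self-contained sketch of that check; the name sattolo_cycle and the sample frame (copied from the question) are illustrative only.

```python
from random import randrange

import pandas as pd


def sattolo_cycle(items):
    """Return a shuffled copy of items that forms one cycle (no fixed points)."""
    items = list(items)
    i = len(items)
    while i > 1:
        i -= 1
        j = randrange(i)  # 0 <= j <= i-1, so position i never keeps its element
        items[j], items[i] = items[i], items[j]
    return items


# Sample data from the question.
df = pd.DataFrame({
    "rid": range(10),
    "category": ["c2", "c3", "c2", "c3", "c2", "c2", "c1", "c3", "c1", "c3"],
})
df["pid"] = df.groupby("category")["rid"].transform(sattolo_cycle)

# No row is paired with itself ...
no_fixed_points = bool((df["pid"] != df["rid"]).all())
# ... and within each category the pids are a rearrangement of the rids.
is_permutation = all(
    sorted(g["pid"]) == sorted(g["rid"]) for _, g in df.groupby("category")
)
print(no_fixed_points, is_permutation)
```

If independent draws are needed (so that the same id may appear as pid for several rows), a per-row sampling approach is a better fit.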
jezrael
import pandas as pd
import numpy as np

## generate dummy data
raw = {
    "rid": range(10),
    "cat": np.random.choice("c1,c2,c3".split(","), 10)   
}

df = pd.DataFrame(raw)


def get_random_ids(x):
    pids = []

    sh = x.copy()
    for _ in x:
        ## do circular shift choose random value except cur_val
        cur_value = sh.iloc[0]
        sh = sh.shift(-1)
        sh[-1:] = cur_value
        pids.append(np.random.choice(sh[:-1]))

    ## randomly choose from values from other cat
    nids = np.random.choice(df[df["cat"]!=x.name]["rid"], len(x))

    return pd.DataFrame({"pid": pids, "nid": nids}, index=x.index)

## group_keys=False keeps the original row index so the join below lines up
new_ids = df.groupby("cat", group_keys=False)["rid"].apply(get_random_ids)
df.join(new_ids).sort_values("cat")

output

    rid cat pid nid
5   5   c1  8.0 9
8   8   c1  5.0 6
0   0   c2  6.0 1
2   2   c2  0.0 8
3   3   c2  0.0 9
6   6   c2  2.0 4
7   7   c2  3.0 1
1   1   c3  9.0 5
4   4   c3  9.0 0
9   9   c3  4.0 2
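
The pid values in the output above come back as floats, most likely because shift(-1) temporarily introduces a NaN into the series. As a possible alternative (not the method used above), the same "pick any other member of the group" idea can be sketched with a random offset on a NumPy array, which keeps the integer dtype and avoids the Python loop. The helper name pick_other_in_group is hypothetical, and the sketch assumes every category has at least two rows, as stated in EDIT 1 of the question.

```python
import numpy as np
import pandas as pd


def pick_other_in_group(values, rng):
    """For each position, pick a value from a different position in the group."""
    values = np.asarray(values)
    n = len(values)
    # An offset in 1..n-1 always lands on another position, each equally likely.
    # Assumes n >= 2; rng.integers raises ValueError for a single-row group.
    offsets = rng.integers(1, n, size=n)
    return values[(np.arange(n) + offsets) % n]


rng = np.random.default_rng(0)
df = pd.DataFrame({
    "rid": range(10),
    "cat": ["c2", "c3", "c2", "c3", "c2", "c2", "c1", "c3", "c1", "c3"],
})
df["pid"] = df.groupby("cat")["rid"].transform(
    lambda s: pick_other_in_group(s, rng)
)
print(df)
```

Because the rid values within a group are distinct, a different position always means a different id.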
Dev Khadka

Start by defining a function that computes pid:

def getPid(elem, grp):
    return grp[grp != elem].sample().values[0]

Parameters:

  • elem - the current rid from the group,
  • grp - the whole group of rid values.

The idea is to:

  • select "other" elements from the current group (for some category),
  • call sample,
  • return the single value from the Series returned by sample.
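
The steps above can be tried on a small hand-made group. This is only an illustrative sketch: the group values and the snake_case name get_pid are made up for the example.

```python
import pandas as pd


def get_pid(elem, grp):
    # Drop the current element, draw one row, and unwrap the single value.
    return grp[grp != elem].sample().values[0]


grp = pd.Series([0, 2, 4, 5])  # rids of one hypothetical category
picks = [get_pid(2, grp) for _ in range(20)]
print(sorted(set(int(p) for p in picks)))  # never contains 2
```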

Then define a second function that generates both new ids:

def getIds(grp):
    pids = grp.rid.apply(getPid, grp=grp.rid)
    rowNo = grp.rid.size
    currGrp = grp.name
    nids = df.query('category != @currGrp').rid\
        .sample(rowNo, replace=True)
    return pd.DataFrame({'pid': pids, 'nid': nids.values}, index=grp.index)

Note that:

  • all nid values for the current group can be computed with a single call to sample,
  • drawing from a Series of rids for "other" categories.

But pid values must be computed separately, applying getPid to each element (rid) of the current group.

The reason is that each time a different element should be eliminated from the current group, before sample is called.

And to get the result, run a single instruction:

# group_keys=False keeps the original row index so the columns align with df
pd.concat([df, df.groupby('category', group_keys=False).apply(getIds)], axis=1)
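
To confirm that a result satisfies the constraints from the question, a small check along these lines could be used. It is a sketch only: the function name check_ids is hypothetical, it assumes the columns are named rid, category, pid and nid, and the test data is the example frame from the question.

```python
import pandas as pd


def check_ids(frame):
    """Return True if every pid/nid respects the constraints from the question."""
    cat_of = frame.set_index("rid")["category"]
    pid_cat = frame["pid"].map(cat_of)
    nid_cat = frame["nid"].map(cat_of)
    # pid: a different id from the same category.
    pid_ok = (frame["pid"] != frame["rid"]) & (pid_cat == frame["category"])
    # nid: a known id from a different category.
    nid_ok = nid_cat.notna() & (nid_cat != frame["category"])
    return bool(pid_ok.all() and nid_ok.all())


example = pd.DataFrame({
    "rid": range(10),
    "category": ["c2", "c3", "c2", "c3", "c2", "c2", "c1", "c3", "c1", "c3"],
    "pid": [2, 7, 0, 1, 5, 4, 8, 9, 6, 1],
    "nid": [1, 4, 1, 5, 7, 6, 5, 8, 2, 2],
})
print(check_ids(example))
```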
Valdi_Bo