11

Working with data in Python 3+ with pandas. It seems like there should be an easy way to check if two columns have a one-to-one relationship (regardless of column type), but I'm struggling to think of the best way to do this.

Example of expected output:

A    B     C
0    'a'   'apple'
1    'b'   'banana'
2    'c'   'apple'

A & B are one-to-one? TRUE

A & C are one-to-one? FALSE

B & C are one-to-one? FALSE

user1895076
  • 709
  • 8
  • 19
  • what is A&B are one to one? what is the logic behind this, could you explain little more – Mohamed Thasin ah Jun 01 '18 at 12:28
  • Can you please tell, is that A and B are values or column names ? Seems confusing – user96564 Jun 01 '18 at 12:28
  • One-to-one is a commonly used term to talk about the relationship between two data objects in a database. They are one-to-one if there is a unique relationship between values. In the above case, A and B are 1:1 because 0 always corresponds to 'a', 1 always corresponds to 'b', etc. And yes, A, B and C are column names. https://en.wikipedia.org/wiki/One-to-one_(data_model) – user1895076 Jun 01 '18 at 12:39
  • For my work it would also be useful to know this in a non-reciprocal way. For example, it is okay that different A values map to a particular C value, but I don't want any single A value to map to multiple C values. I don't need a 1:1 correspondence, but the direction is important. – Liz Apr 15 '21 at 14:53

5 Answers5

10

Well, you can create your own function to check it:

def isOneToOne(df, col1, col2):
    first = df.groupby(col1)[col2].count().max()
    second = df.groupby(col2)[col1].count().max()
    return first + second == 2

isOneToOne(df, 'A', 'B')
#True
isOneToOne(df, 'A', 'C')
#False
isOneToOne(df, 'B', 'C')
#False

In case you data is more like this:

df = pd.DataFrame({'A': [0, 1, 2, 0],
                   'C': ["'apple'", "'banana'", "'apple'", "'apple'"],
                   'B': ["'a'", "'b'", "'c'", "'a'"]})
df
#   A    B         C
#0  0  'a'   'apple'
#1  1  'b'  'banana'
#2  2  'c'   'apple'
#3  0  'a'   'apple'

Then you can use:

def isOneToOne(df, col1, col2):
    first = df.drop_duplicates([col1, col2]).groupby(col1)[col2].count().max()
    second = df.drop_duplicates([col1, col2]).groupby(col2)[col1].count().max()
    return first + second == 2
zipa
  • 27,316
  • 6
  • 40
  • 58
  • Thanks! This did the trick. You even caught an edge case I didn't think of. The second solution (with duplicates removed) works on my data. Thanks again! – user1895076 Jun 01 '18 at 12:57
  • 1
    Rather than adding drop_duplicates, I changed `.count()` to `.nunique()`. – JDenman6 Oct 23 '20 at 18:35
8
df.groupby(col1)[col2]\
  .apply(lambda x: x.nunique() == 1)\
  .all()

should work fine if you want a true or false answer.

A nice way to visualize the relationship between two columns with discrete / categorical values (in case you are using Jupyter notebook) is :

df.groupby([col1, col2])\
  .apply(lambda x : x.count())\
  .iloc[:,0]\
  .unstack()\
  .fillna(0)

This matrix will tell you the correspondence between the column values in the two columns.

In case of a one-to-one relationship there will be only one non-zero value per row in the matrix.

4
df.groupby('A').B.nunique().max()==1 #Output: True

df.groupby('B').C.nunique().max()==1 #Output: False

Within each value in [groupby column], count the number of unique values in [other column], then check that the maximum for all such counts is one

Trevor B
  • 41
  • 2
1

Here is my solution (only two or three lines of codes) to check for any number of columns to see whether they are one to one match (duplicated matches are allowed, see the example bellow).

cols = ['A', 'B'] # or any number of columns ['A', 'B', 'C']
res = df.groupby(cols).count()
uniqueness = [res.index.get_level_values(i).is_unique 
              for i in range(res.index.nlevels)]
all(uniqueness)

Let's make it a function and add some docs:

def is_one_to_one(df, cols):
    """Check whether any number of columns are one-to-one match.

    df: a pandas.DataFrame
    cols: must be a list of columns names

    Duplicated matches are allowed:
        a - 1
        b - 2
        b - 2
        c - 3
    (This two cols will return True)
    """
    if len(cols) == 1:
        return True
        # You can define you own rules for 1 column check, Or forbid it

    # MAIN THINGs: for 2 or more columns check!
    res = df.groupby(cols).count()
    # The count number info is actually bootless.
    # What maters here is the grouped *MultiIndex*
    # and its uniqueness in each level
    uniqueness = [res.index.get_level_values(i).is_unique
                  for i in range(res.index.nlevels)]
    return all(uniqueness)

By using this function, you can do the one-to-one match check:

df = pd.DataFrame({'A': [0, 1, 2, 0],
                   'B': ["'a'", "'b'", "'c'", "'a'"],
                   'C': ["'apple'", "'banana'", "'apple'", "'apple'"],})

is_one_to_one(df, ['A', 'B'])
is_one_to_one(df, ['A', 'C'])
is_one_to_one(df, ['A', 'B', 'C'])
# Outputs:
# True
# False
# False
YaOzI
  • 16,128
  • 9
  • 76
  • 72
0

one way to solve this ,

df['A to B']=df.groupby('B')['A'].transform(lambda x:x.nunique()==1)
df['A to C']=df.groupby('C')['A'].transform(lambda x:x.nunique()==1)
df['B to C']=df.groupby('C')['B'].transform(lambda x:x.nunique()==1)

Output:

   A  B       C  A to B  A to C  B to C
0  0  a   apple    True   False   False
1  1  b  banana    True    True    True
2  2  c   apple    True   False   False

To check column by column:

print (df['A to B']==True).all()
print (df['A to C']==True).all()
print (df['B to C']==True).all()

True
False
False
Mohamed Thasin ah
  • 10,754
  • 11
  • 52
  • 111