How to find differences between two dataframes of different lengths?

Question

The following code will compare differences in two dataframes (synthetically imported from Excel):

import pandas as pd
import numpy as np

a = pd.DataFrame(
    {
        "A": ["1", 2, "3", 4, "5"],
        "B": ["abcd", "efgh", "ijkl", "uhyee", "uhuh"],
        "C": ["jamba", "refresh", "portobello", "performancehigh", "jackalack"],
    }
)

b = pd.DataFrame(
    {
        "A": ["1", 2, "3", 4, "5"],
        "Z": ["dah", "fupa", "ijkl", "danju", "uhuh"],
        "C": ["jamba", "dimez", "pocketfresh", "reverbb", "jackalack"],
    }
)

comparevalues = a.values == b.values

rows,cols = np.where(comparevalues == False)

for item in zip(rows, cols):
    a.iloc[item[0], item[1]] = " {} --> {} ".format(
        a.iloc[item[0], item[1]], b.iloc[item[0], item[1]]
    )

However, as soon as I extend dataframe b by another line, the code breaks:

b = pd.DataFrame(
    {
        "A": ["1", 2, "3", 4, "5", 6],
        "B": ["dah", "fupa", "ijkl", "danju", "uhuh", "freshhhhhhh"],
        "C": [
            "jamba",
            "dimez",
            "pocketfresh",
            "reverbb",
            "jackalack",
            "boombackimmatouchit",
        ],
    }
)

And I have the same problem if I extend a with an additional column:

a = pd.DataFrame(
    {
        "A": ["1", 2, "3", 4, "5"],
        "B": ["abcd", "efgh", "ijkl", "uhyee", "uhuh"],
        "C": ["jamba", "refresh", "portobello", "performancehigh", "jackalack"],
        "D": ["OQEWINVSKD", "DKVLNQIOEVM", "asdlikvn", "asdkvnddvfvfkdd", np.nan],
    }
)

How do I still compare these two data frames for differences?

How do you want the extra line to be handled? I mean, if you have nothing to compare with the extra line, it should be simply reported as the last line of the output or the last line of output should be filled with NaN? — Drakax, Jun 02 '22 at 21:57
Something like from ```np.nan``` --> ```extra item``` @Drakax — HelpMeCode, Jun 03 '22 at 02:06

Laurent · Accepted Answer · 2022-06-08T08:58:25.477

You could define several helper functions to adjust the length and widths of the two dataframes:

def equalize_length(short, long):
    return pd.concat(
        [
            short,
            pd.DataFrame(
                {
                    col: ["nan"] * (long.shape[0] - short.shape[0])
                    for col in short.columns
                }
            ),
        ]
    ).reset_index(drop=True)


def equalize_width(short, long):
    return pd.concat(
        [
            short,
            pd.DataFrame({col: [] for col in long.columns if col not in short.columns}),
        ],
        axis=1,
    ).reset_index(drop=True)


def equalize(df, other_df):
    if df.shape[0] <= other_df.shape[0]:
        df = equalize_length(df, other_df)
    else:
        other_df = equalize_length(other_df, df)
    if df.shape[1] <= other_df.shape[1]:
        df = equalize_width(df, other_df)
    else:
        other_df = equalize_width(other_df, df)
    df = df.fillna("nan")
    other_df = other_df.fillna("nan")
    return df, other_df

And then, in your code:

a, b = equalize(a, b)

comparevalues = a.values == b.values

rows, cols = np.where(comparevalues == False)

for item in zip(rows, cols):
    a.iloc[item[0], item[1]] = " {} --> {} ".format(
        a.iloc[item[0], item[1]], b.iloc[item[0], item[1]]
    )

print(a)  # with 'a' being shorter in lenght but longer in width than 'b'
# Output
             A                      B                              C                          D
0            1          abcd --> dah                           jamba        OQEWINVSKD --> nan
1            2         efgh --> fupa              refresh --> dimez        DKVLNQIOEVM --> nan
2            3                   ijkl    portobello --> pocketfresh           asdlikvn --> nan
3            4       uhyee --> danju    performancehigh --> reverbb    asdkvnddvfvfkdd --> nan
4            5                   uhuh                      jackalack                        nan        
5   nan --> 6    nan --> freshhhhhhh    nan --> boombackimmatouchit                         nan

if I add a new column to ```a``` or ```b```, the code breaks again. Is this something that can be updated to account for new columns (ie, from ```np.nan --> [new value]```? @Laurent — HelpMeCode, Jun 07 '22 at 17:20
Example: Updated ```a``` to ```a = {'A': ['1',2,'3',4,'5'], 'B' : ['abcd', 'efgh', 'ijkl', 'uhyee', 'uhuh'], 'C' : ['jamba','refresh','portobello','performancehigh','jackalack'], 'D': ['OQEWINVSKD', 'DKVLNQIOEVM', 'asdlikvn', 'asdkvnddvfvfkdd', np.nan]}``` @Laurent — HelpMeCode, Jun 07 '22 at 17:21
In your opinion, what's the best way to learn all these concepts that you used to transform the data? @Laurent — HelpMeCode, Jun 10 '22 at 04:57
would it be possible to distinguish ```Adds``` vs ```Removals``` vs ```Mods```? An Add would be anything from np.nan --> [value or string]; a Removal would be anything from [value or string] to np.nan; Mod would be anything from [value or string] to [different value or string] @Laurent perhaps define it through color? — HelpMeCode, Jun 13 '22 at 15:27
As per SO guidelines, we are not supposed to go beyond the initial question by extending comments too much. You should certainly post a new question, which will make it much easier to help you. Cheers. — Laurent, Jun 13 '22 at 17:56
Thanks, I've done just that! Check it out if you'd be so kind :-) @Laurent https://stackoverflow.com/questions/72607385/creating-conditions-in-python?noredirect=1#comment128257371_72607385 — HelpMeCode, Jun 13 '22 at 18:38

How to find differences between two dataframes of different lengths?

1 Answers1