0

I have two lists (columns from two separate pandas dataframes) and want to find the intersection of both lists while preserving the order, or ordering based on a condition. Consider the following example:

x = ['0 MO', '1 YR', '10 YR', '15 YR', '2 YR', '20 YR', '3 MO', '3 YR',
     '30 YR', '4 YR', '5 YR', '6 MO', '7 YR', '9 MO', 'Country']
y = ['Industry', '3 MO', '6 MO', '9 MO', '1 YR', '2 YR', '3 YR',
       '4 YR', '5 YR', '7 YR', '10 YR', '15 YR', '20 YR', '30 YR']

answer = set(x).intersection(y)

The variable answer yields the overlapping columns, yet the order is not preserved. Is there a way of sorting the solution such that the answer yields:

answer = ['3 MO', '6 MO', '9 MO', '1 YR', '2 YR', '3 YR',
          '4 YR', '5 YR', '7 YR', '10 YR', '15 YR', '20 YR',
          '30 YR']

i.e first sorting the intersected list by month ("MO") and integers, and then by year ("YR") and its integers?

Alternatively, is there a pandas method to obtain the same result with two dataframes of overlapping columns (preserving or stating order)?

Paddy
  • 91
  • 7

3 Answers3

1

You could simply with list comprehensions:

[this_name for this_name in x if this_name in y]

and

[this_name for this_name in y if this_name in x]
nikeros
  • 3,302
  • 2
  • 10
  • 26
1

You can use sorted function to sort answer by passing a custom function as a key. Since you want to sort first by whether it's MO or YR and then by the integer value, you can split on white space and evaluate by the second part (MO or YR) and then the integer value of the first part.

def sorter(x):
    s = x.split()
    return (s[1],int(s[0]))

out = sorted(set(x).intersection(y), key=sorter)

Output:

['3 MO', '6 MO', '9 MO', '1 YR', '2 YR', '3 YR', '4 YR', '5 YR', '7 YR', '10 YR', '15 YR', '20 YR', '30 YR']
  • This does not give the desired output. The order of the list should be all the elements with "MO" first (["3 MO", "6 MO", "9 MO"]), then the elements with "YR" in them, in the same list of course. – Paddy Jan 08 '22 at 21:53
  • @Paddy I see what the problem is. I edited to fix the issue. Now it works. –  Jan 09 '22 at 02:47
0

I don't know what you are trying to do exactly, but my answer will be for the use case you described. If you want to work with pandas, I think the following code will do what you want. If you have more complex data, I think you might need to change the columns types to timedelta to have more flexibility. The sorting is working in this case because MO is alphabetically before YR.

import pandas as pd
df1 = pd.DataFrame({'x': ['0 MO', '1 YR', '10 YR', '15 YR', '2 YR', '20 YR', '3 MO', '3 YR',
     '30 YR', '4 YR', '5 YR', '6 MO', '7 YR', '9 MO', 'Country']})
df2 = pd.DataFrame({'y': ['Industry', '3 MO', '6 MO', '9 MO', '1 YR', '2 YR', '3 YR',
       '4 YR', '5 YR', '7 YR', '10 YR', '15 YR', '20 YR', '30 YR']})

# drop 'non-standard' data 
df1["x"] = df1["x"].apply(lambda x: x if x[0].isdigit() else None)
df2["y"] = df2["y"].apply(lambda x: x if x[0].isdigit() else None)
df1.dropna(inplace=True)
df2.dropna(inplace=True)

# make two columns to sort 
df1["value"] = df1["x"].apply(lambda x: int(x[:-2]))
df1["unit"] = df1["x"].apply(lambda x: x[-2:])

df2["value"] = df2["y"].apply(lambda x: int(x[:-2]))
df2["unit"] = df2["y"].apply(lambda x: x[-2:])

# sort by unit and value
df1 = df1.sort_values(by=["unit", "value"]).drop("x", axis=1)
df2 = df2.sort_values(by=["unit", "value"]).drop("y", axis=1)

# merge 
df = pd.merge(df1, df2, on=["unit", "value"])
df["result"] = df.apply(lambda x: str(x["value"]) + " " + x["unit"], axis=1)
df.drop(["unit", "value"], axis=1, inplace=True)
df



Ahmed Elashry
  • 389
  • 2
  • 12