-1

I have 2 very large data sets each of 5 columns( h, k, l, intensity, sigma). I would like to compare them row by row with respect to h,k,l values only. I need then to remove the nonmatching rows from each data set and rewrite each data set with only the matching rows having same h,k,l values. I am using datacompy but not sure how to proceed to get the final output.

Sara
  • 13
  • 4

1 Answers1

0
  • you are really describing and inner join (relational theory terminology)
  • in pandas this is a merge() with kind="inner"
df1 = pd.DataFrame(
    {
        "h": range(0, 100, 2),
        "k": range(0, 200, 4),
        "l": range(0, 400, 8),
        "intensity": np.random.uniform(1, 5, 50),
        "sigma": np.random.uniform(0, 1, 50),
    }
)

df2 = pd.DataFrame(
    {
        "h": range(0, 200, 4),
        "k": range(0, 400, 8),
        "l": range(0, 800, 16),
        "intensity": np.random.uniform(1, 5, 50),
        "sigma": np.random.uniform(0, 1, 50),
    }
)

# get intersecting rows... in this case 25
df1 = pd.merge(df1, df2.loc[:,["h","k","l"]], on=["h","k","l"], how="inner")

# get intersecting rows... again 25 rows
df2 = pd.merge(df1.loc[:,["h","k","l"]], df2, on=["h","k","l"], how="inner")


Rob Raymond
  • 29,118
  • 3
  • 14
  • 30
  • Thank you so much. I still have problem with repeating rows having same hkl appearing several times in the data. For exapmle (100) in second row is not common. But, then it is common in a later line let's say (100) matching in line 98. In this case, all (100) are not deleted and the "intensity" and "sigma" entities are randomly and wrongly distributed. – Sara Jun 20 '21 at 19:26
  • ok, so h,k,l doesn't identify a row. How do you want it to behave when there are multiple rows with repeating combinations? Take the first? Last? mean of sigma and intensity? – Rob Raymond Jun 20 '21 at 20:07
  • Even for repeated hkl, as long as (hkl) is present in both data sets at some line, I just want to keep exact corresponding values of Intensity and sigma in each data set. At the end I want to plot and compare both sigmas together and (hkl) will be taken as categories only along x axis. – Sara Jun 20 '21 at 22:38
  • Ok you need to really think through your requirements and express them in a way that a solution can be found. "delete none matching rows, implies there are candidate keys (indexes) that can be used to identify rows, then keep required ones and filter / delete unwanted ones – Rob Raymond Jun 21 '21 at 15:00