I have 2 very large data sets each of 5 columns( h, k, l, intensity, sigma). I would like to compare them row by row with respect to h,k,l values only. I need then to remove the nonmatching rows from each data set and rewrite each data set with only the matching rows having same h,k,l values. I am using datacompy but not sure how to proceed to get the final output.
Asked
Active
Viewed 321 times
1 Answers
0
- you are really describing and inner join (relational theory terminology)
- in pandas this is a
merge()
withkind="inner"
df1 = pd.DataFrame(
{
"h": range(0, 100, 2),
"k": range(0, 200, 4),
"l": range(0, 400, 8),
"intensity": np.random.uniform(1, 5, 50),
"sigma": np.random.uniform(0, 1, 50),
}
)
df2 = pd.DataFrame(
{
"h": range(0, 200, 4),
"k": range(0, 400, 8),
"l": range(0, 800, 16),
"intensity": np.random.uniform(1, 5, 50),
"sigma": np.random.uniform(0, 1, 50),
}
)
# get intersecting rows... in this case 25
df1 = pd.merge(df1, df2.loc[:,["h","k","l"]], on=["h","k","l"], how="inner")
# get intersecting rows... again 25 rows
df2 = pd.merge(df1.loc[:,["h","k","l"]], df2, on=["h","k","l"], how="inner")

Rob Raymond
- 29,118
- 3
- 14
- 30
-
Thank you so much. I still have problem with repeating rows having same hkl appearing several times in the data. For exapmle (100) in second row is not common. But, then it is common in a later line let's say (100) matching in line 98. In this case, all (100) are not deleted and the "intensity" and "sigma" entities are randomly and wrongly distributed. – Sara Jun 20 '21 at 19:26
-
ok, so h,k,l doesn't identify a row. How do you want it to behave when there are multiple rows with repeating combinations? Take the first? Last? mean of sigma and intensity? – Rob Raymond Jun 20 '21 at 20:07
-
Even for repeated hkl, as long as (hkl) is present in both data sets at some line, I just want to keep exact corresponding values of Intensity and sigma in each data set. At the end I want to plot and compare both sigmas together and (hkl) will be taken as categories only along x axis. – Sara Jun 20 '21 at 22:38
-
Ok you need to really think through your requirements and express them in a way that a solution can be found. "delete none matching rows, implies there are candidate keys (indexes) that can be used to identify rows, then keep required ones and filter / delete unwanted ones – Rob Raymond Jun 21 '21 at 15:00