-1

Imagine we have 2 dataframes with coordinates ['X','Y']:

df1 :

 X            Y          House №
2531        2016           175
2219        2196           11
2901        3426           201
6901        4431           46
7891        1126           89

df2 :

 X            Y      Delivery office №
2534        2019            O1
6911        4421            O2
2901        3426            O3
7894.5      1120            O4 

My idea is to merge them and get:

df3

 X            Y          House №    Delivery office №
2531        2016           175            01
2219        2196           11             NA
2901        3426           201            03
6901        4431           46             02
7891        1126           89             04

So we wants to realise 'fuzzy' merge by threshold (this param should be given by user). You can see that house number 11 didn't get any delivery office number because it located to much away from all of presented offices in df2.

So I need all rows from df2 'find' it's closest row from df1 and add it's 'Cost' value to it You can see that usual in-box pd.merge do not work there as well as custom packages that realize fuzzy logic relates to string values using levenshtein distance and so on

  • 1
    what is the threshold when you say closet value – anky Jan 15 '20 at 14:23
  • @anky_91 it's a great question. We have no threshold there. We want ALL df2's rows finds it's CLOSEST (in terms of coordinate distance) row from df1 (im not yelling ((: ) – Андрей Севостьянов Jan 15 '20 at 14:24
  • @АндрейСевостьянов All houses are more or less closest in some sense... You need to define threshold. Or say 3-5-10 closest. – Sergey Bushmanov Jan 15 '20 at 14:27
  • @SergeyBushmanov you are right! sorry. We can set this param in func you would suggest)) But actually if we take into account aforementioned condition : rows from df2 have to be fully used in merging – Андрей Севостьянов Jan 15 '20 at 14:31
  • then we would understand that we do not need threshold – Андрей Севостьянов Jan 15 '20 at 14:44
  • (1) Do you mean you need cartesian merge between the two sorted by distance (2) Distance, euclidean? – Sergey Bushmanov Jan 15 '20 at 15:16
  • BTW, why your "A" house has three different coordinates? Isn't it a mistake? – Sergey Bushmanov Jan 15 '20 at 15:39
  • Let df1['Dist'] be shortest distance from that point from df1 to closest point from df2. Any ideas how to calculate that? – Андрей Севостьянов Jan 15 '20 at 15:53
  • @АндрейСевостьянов There are many answers to the latest comment. The simplest is `math.dist()`. I believe you need more focus in asking your question. – Sergey Bushmanov Jan 15 '20 at 16:02
  • @АндрейСевостьянов If you edit your question with answers to the questions raised in the comments I will vote for re-opening your question. If opened, I will post my answer to the question, as I understand it now, with resulting df having either all costs sorted by the desired distance, or any number of top n closest for every house of interest. Please, make it sure houses in "House" column have unique ID's/names. – Sergey Bushmanov Jan 15 '20 at 19:10
  • @SergeyBushmanov add some more details in example. Please comment if it helps – Андрей Севостьянов Jan 16 '20 at 06:11
  • @АндрейСевостьянов Your question does not seem to be opening. You may try running `df1.assign(key=0).merge((df2.assign(key=0)), how = "outer", on="key",suffixes=("","_"))` and `from scipy.spatial.distance import cdist; df["distance"] = df.apply(lambda x: cdist([[x[0],x[1]]],[[x[4],x[5]]], metric="euclidean")[0][0], axis=1); df.drop(["X_","Y_","key"], axis=1, inplace=True); df.sort_values(["House", "distance"])` and `df.groupby("House").apply(lambda x: x.sort_values("distance").head(2).mean())` and report back if it helps. – Sergey Bushmanov Jan 17 '20 at 10:53
  • This code will give you all the necessary tools and set you in the right direction. It can easily be adapted to a threshold param. – Sergey Bushmanov Jan 17 '20 at 10:57
  • @SergeyBushmanov same approach was proposed on another resource. The problem is that I have df1 and df2 shaped like [10^5,:], so when I try merge (from your first batch of code) it crash because of memory error. – Андрей Севостьянов Jan 17 '20 at 13:30
  • Run in a loop through `df1` – Sergey Bushmanov Jan 17 '20 at 13:31
  • Actually I thought "apply lambda" function can solve my problem. We need calculate distance between first coordinates of each home and each delivery office and take then take minimum for each house. It looks really easy at the first sight... – Андрей Севостьянов Jan 17 '20 at 13:33
  • It's O(n^2) problem. Either you do it at once under the hood and have enough memory or iterate through it. – Sergey Bushmanov Jan 17 '20 at 13:36
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/206150/discussion-between---and-sergey-bushmanov). – Андрей Севостьянов Jan 17 '20 at 13:37

1 Answers1

0

No silver bullet, but a way to do this is to turn the Y values in categories using pd.cut. Using this method, it will place the values in different bins. You need to tune the bins manually, for example set it at 20.

Load the data:

df1 = pd.DataFrame({'X':[2531, 2219, 2901, 6901, 7891], 'Y':[2016, 2196, 3426, 4431, 1126], 'House':['A', 'B', 'J', 'A', 'A']})

df2 = pd.DataFrame({'X':[2534, 6911, 2901, 7894.5], 'Y':[2019, 4421, 3426, 1120], 'Cost':[1200, 3100, 800, 600]})

Make new categories:

df1['Y2'] = pd.cut(df1['Y'], 20, labels=False)

df2['Y2'] = pd.cut(df2['Y'], 20, labels=False)

df3 = pd.merge(df1, df2, on=['Y2'], how='left')
Ruthger Righart
  • 4,799
  • 2
  • 28
  • 33