
I was wondering whether there is a way to compare each element of one numpy array against every element of another, regardless of index position. I often work with arrays from pandas dataframes and I'd like to use the underlying numpy arrays for the comparison. I know I can do a fast element-wise comparison like this:

import numpy as np
import pandas as pd

dfarr1 = pd.DataFrame(np.arange(0, 1000))
dfarr2 = pd.DataFrame(np.arange(1000, 0, -1))
# .values is 2-D here, so flatten the boolean mask before indexing with .loc
dfarr1.loc[(dfarr1.values == dfarr2.values).ravel()]
# outputs the single matching row: value 500 at index 500

(The above is just a toy example, obviously.) But what I'd like is rather the equivalent of two nested loops over all the elements, in a way that is as fast as possible:

for ir in df.itertuples():
    for ir2 in country_df.itertuples():
        # ir[0] and ir2[0] are the row indexes yielded by itertuples
        if df['city'][ir[0]] == country_df['Capital'][ir2[0]]:
            df.loc[ir[0], 'country'] = country_df['Country'][ir2[0]]

The thing is that my dataframes contain many thousands of elements, and the above is simply too slow (not least because I'm sure I'll perform similar operations in the future on other, similarly long dataframes, so settling this once and for all would be good). Concretely, I've parsed a few thousand files and extracted their geodata (= df above), and I have a fairly massive lookup file with cities and their corresponding countries (= country_df). The idea is to check whether the cities in df match those in the lookup and, if so, to add the corresponding country in a new column (at the same row index) of df. Anyway, this is just one example of what I'd need at (ideally much) higher speed than the loop above. Many thanks!
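
As an aside, one common vectorized pattern for precisely this kind of lookup is `Series.map` with a city-indexed Series built from the lookup table. A minimal sketch, reusing the hypothetical column names from the loop above ('city', 'Capital', 'Country'); the toy data here is made up purely for illustration:

import pandas as pd

# Toy frames mirroring the loop above: df has a 'city' column,
# country_df maps 'Capital' -> 'Country'.
df = pd.DataFrame({'city': ['Paris', 'Oslo', 'Lima']})
country_df = pd.DataFrame({'Capital': ['Lima', 'Paris'],
                           'Country': ['Peru', 'France']})

# Build the lookup once (its index must be unique for map to work),
# then map it over the whole column in one vectorized pass;
# cities with no match become NaN.
lookup = country_df.set_index('Capital')['Country']
df['country'] = df['city'].map(lookup)

This replaces the Python-level double loop (O(n·m) comparisons) with one hashed lookup per row.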

  • If I understand correctly, you want to compare each element to every other element, in which case I think that `np.equal.outer` might help. Take a look at [this](https://stackoverflow.com/questions/42724795/fastest-way-to-compare-every-element-with-every-other-in-np-array-of-strings) post if you need to do it for strings – user15270287 Mar 07 '21 at 16:04
  • You say you want to "compare each element regardless of indexical position" and then in the next sentence you say you want to do element-wise comparisons. Which is it? – ddejohn Mar 07 '21 at 16:19
  • Yes, sorry, I meant over all elements regardless of index position, i.e. the equivalent of the two naive loops (as shown in the second example). If `np.equal.outer` could be utilised, that would be great. I haven't quite understood how the solution in the link would work for me, but I'll try it some more tomorrow or so (see the sketch after these comments). – jackewiebohne Mar 07 '21 at 20:21
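
For reference, a minimal sketch of the `np.equal.outer` idea from the first comment, on made-up numeric arrays (plain string dtypes need an object dtype or the workarounds from the linked post):

import numpy as np

a = np.array([1, 2, 3])
b = np.array([3, 2, 5, 2])

# pairwise[i, j] is True where a[i] == b[j] -- every element of a
# compared against every element of b, regardless of position
pairwise = np.equal.outer(a, b)

# row/column indices of the matches, usable to align two
# unequal-length arrays
rows, cols = np.where(pairwise)
print(rows, cols)  # [1 1 2] [1 3 0]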

1 Answer


You can try this:

import pandas as pd

df1 = pd.DataFrame({'city': ['New York City', 'Los Angeles', 'Paris', 'Berlin', 'Beijing'],
                    'country': [None, None, None, None, None]})

df2 = pd.DataFrame({'city': ['New York City', 'Paris', 'Berlin', 'Beijing', 'Los Angeles', 'Rome'],
                    'country': ['USA', 'France', 'Germany', 'China', 'USA', 'Italy']})

Now we use the fillna method on df1, with the df2['country'] series (indexed by city) as the filling values:

df1['country'] = df1.set_index('city')['country'].fillna(df2.set_index('city')['country'])\
                    .reset_index(drop=True)

print(df1)

            city  country
0  New York City      USA
1    Los Angeles      USA
2          Paris   France
3         Berlin  Germany
4        Beijing    China
ashkangh
  • Many thanks! Very useful. Though I still need to figure out how to use this solution to get the indexes of those instances where there is a match between cities in df1 and df2, and then use that index to add the corresponding country (in another column of df2) to a column in df1. So let's say the city at row, column indexes i,k in df1 matches the city at row, column indexes j,l in df2; then I want to add the corresponding country in df2 at j,l+1 to df1 at i,k+1. It'd be nice to use the above solution as a Boolean array to copy the countries from df2 to df1, but df1 and df2 are of unequal length. – jackewiebohne Mar 07 '21 at 20:33
  • You're very welcome! Could you make a small sample dataframe for both df1 and df2, in a way that we can reproduce them on our systems, and also show exactly what your desired output is? By doing so, I guarantee you will get your answer in less than an hour, either from me or from other amazing Stack Overflow community members. – ashkangh Mar 07 '21 at 20:46
  • OK, many thanks! I'm representing the dfs as dicts here: df1 = {'city': ['New York City', 'Los Angeles', 'Paris', 'Berlin', 'Beijing'], 'country': [None, None, None, None, None]} and df2 = {'city': ['New York City', 'Paris', 'Berlin', 'Beijing', 'Los Angeles', 'Rome'], 'country': ['USA', 'France', 'Germany', 'China', 'USA', 'Italy']}. The goal is to go from a match like df1['city'][2] == df2['city'][1] (= 'Paris') to entering the corresponding country df2['country'][1] into df1['country'][2] (= France), as fast as possible. df1 and df2 have unequal lengths. – jackewiebohne Mar 08 '21 at 05:19
  • I just edited my answer with new data you provided. Let me know if it works. – ashkangh Mar 08 '21 at 06:18
  • Many thanks! It works splendidly on the example, but when I run it on my actual dataframes I unfortunately get: "ValueError: cannot reindex from a duplicate axis" – jackewiebohne Mar 08 '21 at 15:29
  • OK, so the issue seems to be (which may be obvious to some) that the dataframe with the cities and corresponding countries contains duplicates (this is because it also contains city districts in another column, and since cities have multiple districts, cities occur multiple times). – jackewiebohne Mar 09 '21 at 08:16
  • So this doesn't work if there are duplicates in df2 (whereas duplicates in df1 are OK): df1 = pd.DataFrame({'city': ['New York City', 'Los Angeles', 'Paris', 'Paris', 'Berlin', 'Beijing'], 'country' : [None, None, None, None, None, None] }). df2 = pd.DataFrame({'city' : ['New York City','New York City', 'Paris', 'Berlin', 'Beijing', 'Los Angeles', 'Rome'], 'country': ['USA', 'USA','France', 'Germany', 'China', 'USA', 'Italy']}) df1['country'] = df1.set_index('city')['country'].fillna(df2.set_index('city')['country']).reset_index(drop=True) – jackewiebohne Mar 09 '21 at 08:18
  • The solution is to get the cities and countries into a new dataframe, run drop_duplicates(subset='city'), and then run the above operation (see the sketch at the end of these comments). This may be obvious, but in case it wasn't, this is what worked for me. – jackewiebohne Mar 09 '21 at 08:21
  • Regarding duplicates, there are some ways to handle them, but first you need to decide what you want to do with the duplicates in the `country` column of `df2`. If you tell me, I can help you out. For instance, in the example you gave, if 'USA' came with two different cities like `New York City` and `Boston`, which one would you want to keep? But if you're sure that there won't be any such cases and each `country` always comes with the same city, let me know as well, so I can edit my answer. – ashkangh Mar 09 '21 at 14:56
  • Many thanks! I'm happy with duplicates in the "country" column (in fact they're essential, since I need the matching country for each city), but for purposes of matching the cities in df1 to the corresponding countries in df2 I dropped the duplicates in df2 with drop_duplicates(subset='city'). Many thanks for all your efforts. This really helped and is massively faster than looping. – jackewiebohne Mar 10 '21 at 12:48
  • I'm really glad that I could help. Please accept my answer and give it an upvote if you find it useful. Good luck! – ashkangh Mar 10 '21 at 16:48
  • OK, I think I've now accepted your answer, but I don't seem to have any voting privileges yet :( – jackewiebohne Mar 10 '21 at 22:03
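
For reference, a sketch of the deduplication workaround described in the comments above, using the duplicate-laden example frames from the Mar 09 '21 comments:

import pandas as pd

df1 = pd.DataFrame({'city': ['New York City', 'Los Angeles', 'Paris', 'Paris', 'Berlin', 'Beijing'],
                    'country': [None, None, None, None, None, None]})
df2 = pd.DataFrame({'city': ['New York City', 'New York City', 'Paris', 'Berlin', 'Beijing', 'Los Angeles', 'Rome'],
                    'country': ['USA', 'USA', 'France', 'Germany', 'China', 'USA', 'Italy']})

# Deduplicate the lookup first (keeping the first row per city) so that
# set_index('city') yields a unique index; otherwise fillna raises
# "ValueError: cannot reindex from a duplicate axis".
df2_unique = df2.drop_duplicates(subset='city')

df1['country'] = df1.set_index('city')['country']\
                    .fillna(df2_unique.set_index('city')['country'])\
                    .reset_index(drop=True)

Duplicate cities in df1 (here 'Paris' appears twice) remain fine; only the filler series built from df2 needs a unique index.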