0

I am trying to merge two pandas data frames (one is a spatial dataframe -sdf- and the other a simple dataframe) on a common field, GEOID using pd.merge. In the sdf, the GEOID is a string, in the df it was an int. I used the .astype('str') to convert the df GEOID field to a string as well. Still, when calling pd.merge the output is either blank or I get an error that says 'You are trying to merge on object and int64 columns.' I have confirmed with .dtypes() that both are strings. Any idea why the merge is not functioning?

I have tried converting both to string, as well as both to int. I also tried using pd.join and pd.concat but neither work as expected.

import pandas as pd
#Read in CSV with updates (would already be df from Socrata pull in real version)
updated_csv= r"C:\Users\mad10412\Desktop\Active_Business_Data_Edited.csv"
updated_csv_df = pd.read_csv(updated_csv)
updated_csv_df.head(5)
updated_csv_df['GEOID10']=updated_csv_df['GEOID10'].astype(str)
updated_csv_df.dtypes
output_layer_name = 'Join_Features_Test5'
actbus=gis.content.search(output_layer_name)
ActiveBusinesses_item = actbus[0]
ActiveBusinesses_item
ActiveBusinesses_flayer = ActiveBusinesses_item.layers[0]
ActiveBusinesses_flayer
ActiveBusinesses_fset = ActiveBusinesses_flayer.query() #querying without any conditions returns all the features
ActiveBusinesses_fset.sdf.head()
ActiveBusinesses_fset.sdf.shape
ActiveBusinesses_fset.sdf.dtypes
##Attempt 1: Includes original data and Adds Column names but no data

overlap_rows = ActiveBusinesses_fset.sdf.join(updated_csv_df.set_index('GEOID10'),on='GEOID10', lsuffix='_left', rsuffix='_right')
overlap_rows.head(10)
overlap_rows.to_csv("C:\\Users\\mad10412\\Desktop\\ConcatDF.csv")

##Attempt 2: Only includes column name. no data at all
overlap_rows = pd.merge(left = ActiveBusinesses_fset.sdf, 
                        right = updated_csv_df, 
                        how='inner',
                        on = 'GEOID10')
overlap_rows.head(5)
overlap_rows.to_csv("C:\\Users\\mad10412\\Desktop\\ConcatDF2.csv")

##Attempt 3: Includes all columns and all data, but GEOIDs don't match
result = pd.concat([ActiveBusinesses_fset.sdf, updated_csv_df], axis=1, join='inner')
result.head(5)
result.to_csv("C:\\Users\\mad10412\\Desktop\\ConcatDF3.csv")


##Attempt 4:  Only includes column name. no data at all
left=ActiveBusinesses_fset.sdf
right=updated_csv_df
result = pd.merge(left, right, how='inner',on=['GEOID10', 'GEOID10'])
result.head(5)
result.to_csv("C:\\Users\\mad10412\\Desktop\\ConcatDF4.csv")

The data for both dataframes looks like this:

df=pd.DataFrame({'GEOID': ['060372932023', '060372941201', '060372932022'],
               'Mining': [6, 4,2 ],
               'Agriculture': [10, 12, 4]})
df

The only difference between the data frames is one has a shape column containing the geometry. Essentially, I am trying to merge these data frames together to find instances where the values for the fields such as Agriculture and Mining are different.

df=pd.DataFrame({'GEOID': ['060372932023', '060372941201', '060372932022'],
               'Mining': [6, 4,2 ],
               'Agriculture': [10, 12, 4],
                'Mining2': [8, 3 , 1],
               'Agriculture2': [14, 0, 6]})
df

This should yield in one row for each GEOID, with data from both dataframes. See the last code snippet comments for what the outputs actually look like.

  • It is better if you can give examples of the data you are using. – 1__ Nov 05 '19 at 20:13
  • What do the dataframes look like right before you attempt to merge them? See [How to make good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and provide sample input and output – G. Anderson Nov 05 '19 at 20:13
  • `join` is notoriously bad when trying to merge on columns. You can see: https://stackoverflow.com/questions/52373285/pandas-join-on-string-datatype/52373954#52373954, basically it tries to coerce the merge_keys to a suitable type which isn't great and leads to problems. So Attempt 1 is bound to fail. But the `merge` should work. – ALollz Nov 05 '19 at 20:16
  • It's probably because the representations aren't identical. Given your sample the proposed merge works... But there I assume it's `on='GEOID'` because there's no `'GEOID10'` column.It's *very* difficult to troubleshoot the problem without having access to the actual data. My guess is that when you're re-creating test data it's not representative of the data. I doubt the values are `'060372932023'` in both. The display doesn't really show the representation, you can look at the output of something like df.GEOID.unique() instead, or provide `df.head().to_dict()` for us to see the real data. – ALollz Nov 05 '19 at 20:48
  • 1
    @ALollz you were correct, one had dropped the leading 0 so it was reading in as 60372932023. I added a leading 0 to the string and the merge was successful. Thank you! – Maddie Haynes Nov 05 '19 at 21:03

0 Answers0