pd.merge not working with converted data type

Question

I am trying to merge two pandas data frames (one is a spatial dataframe -sdf- and the other a simple dataframe) on a common field, GEOID using pd.merge. In the sdf, the GEOID is a string, in the df it was an int. I used the .astype('str') to convert the df GEOID field to a string as well. Still, when calling pd.merge the output is either blank or I get an error that says 'You are trying to merge on object and int64 columns.' I have confirmed with .dtypes() that both are strings. Any idea why the merge is not functioning?

I have tried converting both to string, as well as both to int. I also tried using pd.join and pd.concat but neither work as expected.

import pandas as pd
#Read in CSV with updates (would already be df from Socrata pull in real version)
updated_csv= r"C:\Users\mad10412\Desktop\Active_Business_Data_Edited.csv"
updated_csv_df = pd.read_csv(updated_csv)
updated_csv_df.head(5)
updated_csv_df['GEOID10']=updated_csv_df['GEOID10'].astype(str)
updated_csv_df.dtypes

output_layer_name = 'Join_Features_Test5'
actbus=gis.content.search(output_layer_name)
ActiveBusinesses_item = actbus[0]
ActiveBusinesses_item
ActiveBusinesses_flayer = ActiveBusinesses_item.layers[0]
ActiveBusinesses_flayer
ActiveBusinesses_fset = ActiveBusinesses_flayer.query() #querying without any conditions returns all the features
ActiveBusinesses_fset.sdf.head()
ActiveBusinesses_fset.sdf.shape
ActiveBusinesses_fset.sdf.dtypes

##Attempt 1: Includes original data and Adds Column names but no data

overlap_rows = ActiveBusinesses_fset.sdf.join(updated_csv_df.set_index('GEOID10'),on='GEOID10', lsuffix='_left', rsuffix='_right')
overlap_rows.head(10)
overlap_rows.to_csv("C:\\Users\\mad10412\\Desktop\\ConcatDF.csv")

##Attempt 2: Only includes column name. no data at all
overlap_rows = pd.merge(left = ActiveBusinesses_fset.sdf, 
                        right = updated_csv_df, 
                        how='inner',
                        on = 'GEOID10')
overlap_rows.head(5)
overlap_rows.to_csv("C:\\Users\\mad10412\\Desktop\\ConcatDF2.csv")

##Attempt 3: Includes all columns and all data, but GEOIDs don't match
result = pd.concat([ActiveBusinesses_fset.sdf, updated_csv_df], axis=1, join='inner')
result.head(5)
result.to_csv("C:\\Users\\mad10412\\Desktop\\ConcatDF3.csv")


##Attempt 4:  Only includes column name. no data at all
left=ActiveBusinesses_fset.sdf
right=updated_csv_df
result = pd.merge(left, right, how='inner',on=['GEOID10', 'GEOID10'])
result.head(5)
result.to_csv("C:\\Users\\mad10412\\Desktop\\ConcatDF4.csv")

The data for both dataframes looks like this:

df=pd.DataFrame({'GEOID': ['060372932023', '060372941201', '060372932022'],
               'Mining': [6, 4,2 ],
               'Agriculture': [10, 12, 4]})
df

The only difference between the data frames is one has a shape column containing the geometry. Essentially, I am trying to merge these data frames together to find instances where the values for the fields such as Agriculture and Mining are different.

df=pd.DataFrame({'GEOID': ['060372932023', '060372941201', '060372932022'],
               'Mining': [6, 4,2 ],
               'Agriculture': [10, 12, 4],
                'Mining2': [8, 3 , 1],
               'Agriculture2': [14, 0, 6]})
df

This should yield in one row for each GEOID, with data from both dataframes. See the last code snippet comments for what the outputs actually look like.

It is better if you can give examples of the data you are using. — 1__, Nov 05 '19 at 20:13
What do the dataframes look like right before you attempt to merge them? See [How to make good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and provide sample input and output — G. Anderson, Nov 05 '19 at 20:13
`join` is notoriously bad when trying to merge on columns. You can see: https://stackoverflow.com/questions/52373285/pandas-join-on-string-datatype/52373954#52373954, basically it tries to coerce the merge_keys to a suitable type which isn't great and leads to problems. So Attempt 1 is bound to fail. But the `merge` should work. — ALollz, Nov 05 '19 at 20:16
It's probably because the representations aren't identical. Given your sample the proposed merge works... But there I assume it's `on='GEOID'` because there's no `'GEOID10'` column.It's *very* difficult to troubleshoot the problem without having access to the actual data. My guess is that when you're re-creating test data it's not representative of the data. I doubt the values are `'060372932023'` in both. The display doesn't really show the representation, you can look at the output of something like df.GEOID.unique() instead, or provide `df.head().to_dict()` for us to see the real data. — ALollz, Nov 05 '19 at 20:48
@ALollz you were correct, one had dropped the leading 0 so it was reading in as 60372932023. I added a leading 0 to the string and the merge was successful. Thank you! — Maddie Haynes, Nov 05 '19 at 21:03

pd.merge not working with converted data type

0 Answers0