
I have a pandas DataFrame with a list of incomplete addresses. I ran each address through the Google Maps API to get as much data about it as possible and stored the result in a column called Components. Other functions then parse this column to extract the area name, postal code, etc.

This is what one entry looks like:

df['Components'][0]:

"{'access_points': [],
 'address_components': [{'long_name': '350',
   'short_name': '350',
   'types': ['subpremise']},
  {'long_name': '1313', 'short_name': '1313', 'types': ['street_number']},
  {'long_name': 'Broadway', 'short_name': 'Broadway', 'types': ['route']},
  {'long_name': 'New Tacoma',
   'short_name': 'New Tacoma',
   'types': ['neighborhood', 'political']},
  {'long_name': 'Tacoma',
   'short_name': 'Tacoma',
   'types': ['locality', 'political']},
  {'long_name': 'Pierce County',
   'short_name': 'Pierce County',
   'types': ['administrative_area_level_2', 'political']},
  {'long_name': 'Washington',
   'short_name': 'WA',
   'types': ['administrative_area_level_1', 'political']},
  {'long_name': 'United States',
   'short_name': 'US',
   'types': ['country', 'political']},
  {'long_name': '98402', 'short_name': '98402', 'types': ['postal_code']}],
 'formatted_address': '1313 Broadway #350, Tacoma, WA 98402, USA',
 'geometry': {'location': {'lat': 47.250653, 'lng': -122.43913},
  'location_type': 'ROOFTOP',
  'viewport': {'northeast': {'lat': 47.2520019802915,
    'lng': -122.4377810197085},
   'southwest': {'lat': 47.2493040197085, 'lng': -122.4404789802915}}},
 'place_id': 'ChIJcysCMHtVkFQRRUkEIPwScyk',
 'plus_code': {'compound_code': '7H26+78 Tacoma, Washington, United States',
  'global_code': '84VV7H26+78'},
 'types': ['establishment', 'finance', 'point_of_interest']}"

Then I use the following function to get the area name:

def get_area(address_data):
    for item in address_data['address_components']:
        typs = set(item['types'])
        if typs == set(['neighborhood', 'political']):
            return item['long_name']

    return None

df.loc[:10000, 'area'] = df['Components'][:10000].apply(get_area)

But this fails with the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-233-eb2932e010e3> in <module>
----> 1 dfm.loc[:10000, 'area'] = dfm['Components'][:10000].apply(get_area)
      2 dfm['area'].value_counts()

~/virt_env/virt2/lib/python3.6/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
   4040             else:
   4041                 values = self.astype(object).values
-> 4042                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   4043 
   4044         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()

<ipython-input-232-ede4aa629b42> in get_area(address_data)
    149 
    150 def get_area(address_data):
--> 151     for item in address_data['address_components']:
    152         typs = set(item['types'])
    153         if typs == set(['neighborhood', 'political']):

TypeError: string indices must be integers

How do I fix this to be able to run this and other functions on the Components column?

user4718221
  • `df['Components'][N]` (where 0 <= N <= 10000) is not a dict, as you might expect; it's a string. You need to convert it to a dict (an indexable structure) first. Look into string-to-dict conversion. – dvlper May 19 '20 at 00:48
  • @dvlper I figured that's what is causing the issue. Could you suggest the best way to convert it to a dictionary? – user4718221 May 19 '20 at 01:38
  • `import json def get_area(address_data_raw): address_data= json.loads(address_data_raw) for item in address_data['address_components']: .... ` Something of this nature, maybe! it's not a clean way BTW! – dvlper May 19 '20 at 01:44
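To confirm dvlper's point before changing anything, it may help to look at the type of a single cell. This is a minimal sketch, assuming the column was stored as text; the `cell` value below is a shortened, made-up stand-in for `df['Components'][0]`, not real data from the question:

```python
# Hypothetical stand-in for one cell of df['Components']: a dict that was
# saved as text, so it is a str rather than a dict.
cell = "{'address_components': [{'long_name': 'New Tacoma', 'types': ['neighborhood', 'political']}]}"

print(type(cell).__name__)  # str

try:
    # This is what get_area does on its first line.
    cell['address_components']
except TypeError as exc:
    # Indexing a str with a str key raises the TypeError from the traceback.
    error_message = str(exc)
    print(error_message)
```

If `type(df['Components'][0])` also reports `str`, the column needs to be parsed before `get_area` can index into it.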

1 Answer


The error occurs because each value in df['Components'] is a string, not a dict. Here are a few ways to fix it. First way:

import json

def get_area(address_data_raw):
    # Parse the string into a dict before indexing into it
    address_data = json.loads(address_data_raw)
    for item in address_data['address_components']:
        ...

Second way:

import json

def get_area(address_data):
    ...

# Parse each string into a dict first, then extract the area from the dicts
parsed = df['Components'][:10000].apply(json.loads)
df.loc[:10000, 'area'] = parsed.apply(get_area)

These are a few ways to have it working!

dvlper
  • For both versions, I get the following: JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1) – user4718221 May 19 '20 at 02:02
  • Then something is wrong with the string. I noticed you have a # in the string – dvlper May 19 '20 at 07:57
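
The JSONDecodeError is most likely not caused by the `#`. The stored text looks like the `repr` of a Python dict, with single-quoted keys, and JSON only accepts double-quoted strings. One option, sketched here under that assumption, is `ast.literal_eval` from the standard library, which parses Python literals. `parse_components` is a hypothetical helper name, and `raw` is a shortened sample based on the question's data:

```python
import ast
import json

def parse_components(raw):
    """Turn the stored text back into a dict; pass real dicts through."""
    if isinstance(raw, dict):
        return raw
    return ast.literal_eval(raw)

def get_area(address_data):
    # Same matching rule as in the question.
    for item in address_data['address_components']:
        if set(item['types']) == {'neighborhood', 'political'}:
            return item['long_name']
    return None

# Shortened sample in the same single-quoted form as the question's data.
raw = (
    "{'address_components': ["
    "{'long_name': 'Tacoma', 'short_name': 'Tacoma', 'types': ['locality', 'political']}, "
    "{'long_name': 'New Tacoma', 'short_name': 'New Tacoma', 'types': ['neighborhood', 'political']}], "
    "'formatted_address': '1313 Broadway #350, Tacoma, WA 98402, USA'}"
)

# json.loads rejects the single-quoted keys.
try:
    json.loads(raw)
except json.JSONDecodeError as exc:
    json_error = str(exc)

# ast.literal_eval accepts them, including the '#' inside the address.
area = get_area(parse_components(raw))
print(area)  # New Tacoma

# With the DataFrame from the question, this would presumably become:
# df['area'] = df['Components'].apply(parse_components).apply(get_area)
```

If the column can contain missing values (NaN), those would need to be skipped before parsing, since `ast.literal_eval` expects a string.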