4

I've poured over this post, but the answers don't seem to work for my needs. However, I'm very new to Python so that might be the problem as well.

Here's some lines from the output.csv:
Case Parties Address
25 THOMAS ST., PORTAGE, IN
67 CHESTNUT ST., MILLBROOK, NJ
1 EMPIRE DR., AUSTIN, TX, 11225
111 WASHINGTON AVE. #404, VALPARAISO, AK
89 E. JERICHO TPKE., Scarssdale, AZ

ORIGINAL POST CODE

import usaddress
import csv

with open('output.csv') as csvfile:
reader = csv.DictReader(csvfile)
    for row in reader:
        addr=row['Case Parties Address']
        data = usaddress.tag(addr)
        print(data)

(OrderedDict([('AddressNumber', u'4167'), ('StreetNamePreType', u'Highway'), ('StreetName', u'319'), ('StreetNamePostDirectional', u'E'), ('PlaceName', u'Conway'), ('StateName', u'SC'), ('ZipCode', u'29526-5446')]), 'Street Address'

Much like the previous post, I need to output the parsed data into csv. As best as I can tell, I need to do the following steps:

  1. Provide the headers as a list for reference. (They're listed here in 'Details'.)
  2. Using Usaadress.tag(), parse the source_csv as "data" BUT keep their corresponding "keys."
  3. Map the key:data to the header_reference
  4. Export into output_csv that has one header row.

I'm using the Python module, usaaddress, to parse a large csv (200k+). The module outputs the parsed data using OrderedDict. The aforementioned post only works if all the fields map to the same headers for all records. However, one of the MANY benefits of usaddress is that it parses out data even when there isn't fields to parse. So, for example, "123 Fake St, Maine, PA" maps to address,city,state headers perfectly. But "123 Jumping Block, Suite 600, Maine, PA" will put the "Suite 600" in the "city" column since it's matching statically based on position. If I parse the latter by itself, usaddress provides address, occupancy identifier (e.g. "suite #"), city, state headers.

The output format that I need is provided when I use the parserator's online parser, but it only can accommodate 500 rows at a time.

It seems like my code won't know what each data point is until it gets routed through the module; a chicken-or-the-egg situation. How do I write rows to a CSV file when each row might have a different subset of columns?

For reference, the error that I get when I try the closest solution (provided by isosceleswheel) is valueerror: I/O(...) and they reference lines 107 and 90 of the csv.py library, both of which pertain to fieldnames.

with open('output.csv') as csvfile:
reader = csv.DictReader(csvfile)

with open('myoutputfile', 'w') as o:  # this will be the new file you write to
    for row in reader:
        addr=row['Case Parties Address']
        data = usaddress.tag(addr)
        header = ','.join(data.keys()) + '\n'  # this will make a string of the header separated by comma with a newline at the end
        data_string = ','.join(data.values()) + '\n' # this will make a string of the values separated by comma with a newline at the end
        o.write(header + data_string)  # this will write the header and then the data on a new line with each field separated by commas
davidism
  • 121,510
  • 29
  • 395
  • 339
JoeB
  • 43
  • 5

2 Answers2

2

see this github issue for a solution

since we know all the possible labels in usaddress, we can use them to define the fields in the output.

I can't comment on an answer b/c I don't have enough reputation, but I'd recommend against using the usaddress parse method for this task. The tag method will parse an address and then concatenate consecutive address tokens when they have the same label, and will raise an error if there are non-consecutive tokens with the same label - it'd be good to capture tagging errors in the output.

Cathy D.
  • 96
  • 3
1

You want to parse each address separately and store in a list. Then you can use a Pandas DataFrame to align the outputs. Something like this:

import pandas as pd

data = ['Robie House, 5757 South Woodlawn Avenue, Chicago, IL 60637',
        'State & Lake, Chicago']

tagged_addresses = [usaddress.parse(line) for line in data]

address_df = pd.DataFrame(tagged_addresses)

print(address_df)

  AddressNumber BuildingName IntersectionSeparator PlaceName SecondStreetName StateName StreetName StreetNamePostType StreetNamePreDirectional ZipCode
0          5757  Robie House                   NaN   Chicago              NaN        IL   Woodlawn             Avenue                    South   60637
1           NaN          NaN                     &   Chicago             Lake       NaN      State                NaN                      NaN     NaN
maxymoo
  • 35,286
  • 11
  • 92
  • 119
  • This could work, but I can't confirm it since I don't know how to change the data list to the stored rows from the output.csv. I've been trying for the past 6 hours, but everything that I try still errors out on: tagged_addresses = [usaddress.parse(line) for line in data] – JoeB Aug 04 '15 at 18:07
  • 1
    you can read your data in with pandas: `data = pd.read_csv('output.csv')['Case Parties Address']` – maxymoo Aug 04 '15 at 23:25
  • Thanks but the output isn't providing the headers or subsequently parsing out the data. When I use address_df.to_csv(), it provides 0-5 for the column headers, while bunching (u'THOMAS','Streetname') into one cell for example. It does the same when I use your original code directly. Is there a module that I need to import? It seems like we're missing the LABELS header from the usaddress module. Thanks again. – JoeB Aug 05 '15 at 01:39
  • Can you post (an anonymised version of )the first few lines of 'output.csv' in your question? – maxymoo Aug 05 '15 at 01:42
  • There ya' go! :) I also can't replicate your original post's pandas output table. Thanks!! – JoeB Aug 05 '15 at 01:53