
The full title would be: Convert a tuple containing an OrderedDict with tagged parts to a table with columns named from the tagged parts (variable number of tagged parts and variable number of occurrences of each tag).

I know more about address parsing than Python, which is probably the underlying source of the problem. How to do this might be obvious. The usaddress library intentionally returns results in this form, which is presumably useful.

I'm using usaddress which "is a python library for parsing unstructured address strings into address components, using advanced NLP methods," and seems to work very well. Here is the usaddress source and website.

So I run it on a file like:

2244 NE 29TH DR
1742 NW 57TH ST
1241 NE EAST DEVILS LAKE RD 
4239 SW HWY 101, UNIT 19 
1315 NE HARBOR RIDGE 
4850 SE 51ST ST 
1501 SE EAST DEVILS LAKE RD 
1525 NE REGATTA WAY 
6458 NE MAST AVE 
4009 SW HWY 101 
814 SW 9TH ST 
1665 SALMON RIVER HWY 
3500 NE WEST DEVILS LAKE RD, UNIT 18 
1912 NE 56TH DR 
3334 NE SURF AVE 
2734 SW DUNE CT
2558 NE 33RD ST 
2600 NE 33RD ST 
5617 NW JETTY AVE 

I want to convert those results into something more like a table (CSV or database eventually).

I was not sure what datatypes are returned. The docs say that the tag method returns a tuple containing an OrderedDict with tagged parts. The parse method seems to return a slightly different type. This question helped me determine that it is a list of tuples (each apparently a token with its tag). Searching for how to convert a Python list with tagged parts to a table was unsuccessful.

Searching for how to convert a tuple containing an OrderedDict doesn't turn up much. This is the closest that I found. I also found that pandas is good at various formatting tasks, although it was not clear to me how to apply pandas to this. Many of the closest questions I've found, like the opposite question or one with named tuples, have very low scores.

I also tried some exploratory attempts to see if it would just work (below). I was able to see a few ways to access the data and using zip from this Matrix Transpose question got a little closer to a table since the data and named tags are now separate, although not uniform. Is there a way to take these results in tagged lists or tuples containing an OrderedDict with tagged parts to a table? Is there a fairly direct way from the returned results?

Here is the parse method:

## Get a library
import usaddress

## Open the file with read-only permission
f = open('address_sample.txt')

## Read the first line 
line = f.readline()

## If the file is not empty keep reading line one at a time
## until the file is empty
while line:
    ## Try the parse method
    parsed = usaddress.parse(line)
    ## See what the parse results look like
    zippy = [list(i) for i in zip(*parsed)]
    print(zippy)
    ## read the next line
    line = f.readline()

## close the file
f.close()

And the results produced (notice that when there are multiple parts to a tag it is repeated).

[['2244', 'NE', '29TH', 'DR'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['1742', 'NW', '57TH', 'ST'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['1241', 'NE', 'EAST', 'DEVILS', 'LAKE', 'RD'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetName', 'StreetName', 'StreetNamePostType']]
[['4239', 'SW', 'HWY', '101,', 'UNIT', '19'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetNamePreType', 'StreetName', 'OccupancyType', 'OccupancyIdentifier']]
[['1315', 'NE', 'HARBOR', 'RIDGE'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['4850', 'SE', '51ST', 'ST'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['1501', 'SE', 'EAST', 'DEVILS', 'LAKE', 'RD'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetName', 'StreetName', 'StreetNamePostType']]
[['1525', 'NE', 'REGATTA', 'WAY'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['6458', 'NE', 'MAST', 'AVE'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['4009', 'SW', 'HWY', '101'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetNamePreType', 'StreetName']]
[['814', 'SW', '9TH', 'ST'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['1665', 'SALMON', 'RIVER', 'HWY'], ['AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType']]
[['3500', 'NE', 'WEST', 'DEVILS', 'LAKE', 'RD,', 'UNIT', '18'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetName', 'StreetName', 'StreetNamePostType', 'OccupancyType', 'OccupancyIdentifier']]
[['1912', 'NE', '56TH', 'DR'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['3334', 'NE', 'SURF', 'AVE'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['2734', 'SW', 'DUNE', 'CT'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['2558', 'NE', '33RD', 'ST'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['2600', 'NE', '33RD', 'ST'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['5617', 'NW', 'JETTY', 'AVE'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
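
The repeated tags in the parse output above could be collapsed by hand before building a table. What follows is only a minimal sketch under stated assumptions: the (token, tag) pairs are hard-coded from the output above so that it runs without usaddress, and the helper name `merge_repeated_tags` is hypothetical, not part of the library. It joins tokens that share a tag with a space and strips trailing commas.

```python
from collections import OrderedDict


def merge_repeated_tags(pairs):
    """Collapse (token, tag) pairs into one OrderedDict keyed by tag.

    Hypothetical helper: tokens sharing a tag are joined with a space,
    in the order they appear, and trailing commas are stripped.
    """
    merged = OrderedDict()
    for token, tag in pairs:
        token = token.rstrip(',')  # e.g. 'RD,' becomes 'RD'
        if tag in merged:
            merged[tag] += ' ' + token
        else:
            merged[tag] = token
    return merged


# Pairs copied from the parse output for "3500 NE WEST DEVILS LAKE RD, UNIT 18"
pairs = [
    ('3500', 'AddressNumber'),
    ('NE', 'StreetNamePreDirectional'),
    ('WEST', 'StreetName'),
    ('DEVILS', 'StreetName'),
    ('LAKE', 'StreetName'),
    ('RD,', 'StreetNamePostType'),
    ('UNIT', 'OccupancyType'),
    ('18', 'OccupancyIdentifier'),
]

row = merge_repeated_tags(pairs)
print(dict(row))
```

Note that this sketch also merges tokens whose identical tags are not adjacent, which may or may not be what is wanted for a given address.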

Here is the tag method:

## Get a library
import usaddress

## Open the file with read-only permission
f = open('address_sample.txt')

## Read the first line 
line = f.readline()

## If the file is not empty keep reading line one at a time
## until the file is empty
while line:
    ## Try tag method
    tagged = usaddress.tag(line)
    ## See what the tag results look like
    items_ = list(tagged[0].items())
    zippy2 = [list(i) for i in zip(*items_)]
    print(zippy2)
    ## read the next line
    line = f.readline()

## close the file
f.close()

produces the following output which better handles the combining of multiple parts with the same tag:

[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['2244', 'NE', '29TH', 'DR']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1742', 'NW', '57TH', 'ST']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1241', 'NE', 'EAST DEVILS LAKE', 'RD']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetNamePreType', 'StreetName', 'OccupancyType', 'OccupancyIdentifier'], ['4239', 'SW', 'HWY', '101', 'UNIT', '19']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1315', 'NE', 'HARBOR', 'RIDGE']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['4850', 'SE', '51ST', 'ST']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1501', 'SE', 'EAST DEVILS LAKE', 'RD']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1525', 'NE', 'REGATTA', 'WAY']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['6458', 'NE', 'MAST', 'AVE']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetNamePreType', 'StreetName'], ['4009', 'SW', 'HWY', '101']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['814', 'SW', '9TH', 'ST']]
[['AddressNumber', 'StreetName', 'StreetNamePostType'], ['1665', 'SALMON RIVER', 'HWY']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType', 'OccupancyType', 'OccupancyIdentifier'], ['3500', 'NE', 'WEST DEVILS LAKE', 'RD', 'UNIT', '18']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1912', 'NE', '56TH', 'DR']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['3334', 'NE', 'SURF', 'AVE']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['2734', 'SW', 'DUNE', 'CT']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['2558', 'NE', '33RD', 'ST']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['2600', 'NE', '33RD', 'ST']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['5617', 'NW', 'JETTY', 'AVE']]
elil
  • Please provide a [_Minimal, Complete, and Verifiable example_](http://stackoverflow.com/help/mcve). – martineau Apr 21 '15 at 20:35
  • @martineau I edited to make it closer to a [Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve). The code for the two methods is now separate and their results are shown too. Partly I'm looking for ideas of how to do this (the code is functioning; it just doesn't do what I want and there are probably several steps to go still). Thanks for reading and the pointer to that link. Generally I use [How to ask questions the smart way](http://catb.org/~esr/faqs/smart-questions.html) as a guide but haven't read it for a while and it isn't specific to SO. – elil Apr 21 '15 at 21:14

1 Answer


Just use the csv.DictWriter class with your tag method:

from csv import DictWriter
import usaddress

tagged_lines = []
fields = set()
# Note 1: Use the 'with' statement instead of worrying about opening
# and closing your file manually
with open('address_sample.txt') as in_file:
    # Note 2: You don't need to mess with readline() and while loops; 
    # just iterate over the file handle directly, it produces lines.
    for line in in_file:
        tagged = usaddress.tag(line)[0]
        tagged_lines.append(tagged)
        fields.update(tagged.keys()) # keep track of all field names we see

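# newline='' is what the csv docs recommend for output files on Python 3
with open('address_sample.csv', 'w', newline='') as out_file:
    # sorted() gives a stable column order; a bare set has arbitrary order
    writer = DictWriter(out_file, fieldnames=sorted(fields))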
    writer.writeheader()
    writer.writerows(tagged_lines)

Note that this is inefficient for large files as it holds the entire contents of your input in memory at once; the only reason for that is that the set of fieldnames (i.e. csv column headers) is unknown beforehand.

If you know the full set you could just do it in one streaming pass, writing tagged output as you read each line. Alternatively, you could do one pass over the file to generate the set of headers, and then a second pass to do the conversion.
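
The two-pass alternative might look roughly like the sketch below. It is an illustration under assumptions, not a tested usaddress recipe: `two_pass_to_csv`, `read_lines`, `tag_line` and `FAKE_TAGS` are hypothetical names, and the tagger here is a stand-in lookup table built from the tag output shown in the question so that the example runs without the library.

```python
import csv
import io


def two_pass_to_csv(read_lines, tag_line, out_file):
    """Write tagged lines as CSV without holding every row in memory.

    read_lines: callable returning a fresh iterable of lines on each call.
    tag_line:   callable mapping one line to a dict of tag -> value.
    """
    # Pass 1: collect column names only, in first-seen order.
    fields = []
    for line in read_lines():
        for key in tag_line(line):
            if key not in fields:
                fields.append(key)

    # Pass 2: tag each line again and write it out immediately.
    writer = csv.DictWriter(out_file, fieldnames=fields, restval='')
    writer.writeheader()
    for line in read_lines():
        writer.writerow(tag_line(line))
    return fields


# Stand-in for the real tagger, using values from the question's tag output.
FAKE_TAGS = {
    '4009 SW HWY 101': {
        'AddressNumber': '4009',
        'StreetNamePreDirectional': 'SW',
        'StreetNamePreType': 'HWY',
        'StreetName': '101',
    },
    '1665 SALMON RIVER HWY': {
        'AddressNumber': '1665',
        'StreetName': 'SALMON RIVER',
        'StreetNamePostType': 'HWY',
    },
}

out = io.StringIO()
fields = two_pass_to_csv(lambda: list(FAKE_TAGS), FAKE_TAGS.get, out)
csv_text = out.getvalue()
print(csv_text)
```

With the real library, `tag_line` would presumably be something like `lambda s: usaddress.tag(s)[0]`, and `read_lines` would reopen the input file on each call. The trade-off is that every line is tagged twice, so this saves memory at the cost of roughly double the parsing time.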

tzaman
  • Line 14, `fields.update(tagged.keys()) # keep track of all field names we see` is giving an AttributeError: 'tuple' object has no attribute 'keys' – elil Apr 21 '15 at 22:00
  • @elil - fixed; seems the `tag` method returns a tuple with the tagged result and the address type. – tzaman Apr 21 '15 at 22:04