0

I am trying to save the results of a regex .findall() search into a csv but am having difficulties appending the result to an output file.

As I am still very new to Python I am attempting to limit this problem to using only the csv and re libraries - but if there is a far easier way (i.e. in pandas) that would also be helpful to know.


  1. How would one copy the entire contents of the input CSV to the output CSV and add the postcode / found regular expression to the line it was found on?

  2. Are there any obvious forms of error checking or other things that I am missing?

  3. Does a better method exist to add the header of the input CSV automatically to the output CSV without explicitly specifying them?

  4. Is it possible to do this using DictWriter? As I had originally attempted.


import csv, re

pattern = r'[A-Z]{1,2}[0-9R][0-9A-Z]?[0-9][A-Z]{2}'
postcodes = []
with open(r'Postcode/addressin.csv', 'r') as csvinput:
    csv_reader = csv.DictReader(csvinput)

    with open(r'Postcode/addressout.csv', 'w', newline='') as csvoutput:
        fieldnames = ['Address', 'Name']
        csv_writer = csv.writer(csvoutput)

        csv_writer.writerow(fieldnames)

        for line in csv_reader:
            postcodes = re.findall(pattern, line["Address"])
            csv_writer.writerow(postcodes)

Example Data:

Address,Name,Lat,Long,2016 Sales,Type
48  Park Avenue, LATTON, SN6 4SZ,Nikki Yellowbeard,-23.17549,36.74641,9727,AA
IV21 1TD 116  Walwyn Rd CHARLESTOWN,Jonh Doe,-10.98309,156.41854,11932,AE
olinox14
  • 6,177
  • 2
  • 22
  • 39
Scottie
  • 5
  • 5
  • 1
    providing an example data for the input and output CSV files would be helpful in answering the question. – mohd4482 Jun 30 '19 at 12:50

3 Answers3

2

It seems to me that there are commas in the first field addresses that would create irregularities, and I'm not so sure what might be the best method to bypass those, yet this expression:

(.*),(.*),\s*([0-9.-]+)\s*,\s*([0-9.]+)\s*,([0-9]{4,5}(?:-[0-9]{4})?)\s*,\s*([A-Z]{2})

might be an approach to look into.


Demo


US Zip codes are usually in these formats:

([0-9]{5}(?:-[0-9]{4})?)

just for demoing, I have included:

[0-9]{4,5}

which you can simply remove that.

Example

import re

regex = r"(.*),(.*),\s*([0-9.-]+)\s*,\s*([0-9.]+)\s*,([0-9]{4,5}(?:-[0-9]{4})?)\s*,\s*([A-Z]{2})"

test_str = ("Address,Name,Lat,Long,2016 Sales,Type\n"
    "48  Park Avenue, LATTON, SN6 4SZ,Nikki Yellowbeard,-23.17549,36.74641,9727,AA\n"
    "IV21 1TD 116  Walwyn Rd CHARLESTOWN,Jonh Doe,-10.98309,156.41854,11932,AE")

matches = re.finditer(regex, test_str, re.MULTILINE)

for matchNum, match in enumerate(matches, start=1):
    
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))

If we would not be validating values, then simply this expression

(.*),(.*),(.*),(.*),(.*),(.*)

might likely work.

Demo

Community
  • 1
  • 1
Emma
  • 27,428
  • 11
  • 44
  • 69
  • 2
    Interesting, but the regex expression I am using seems to work well. The chief issue is exporting the end result to csv appending the postcode in a separate column with each line. – Scottie Jun 30 '19 at 19:48
  • The regex looks like one for UK or Canadian (or I'm guessing random other current or former Commonwealth member nations) postcodes. Forcing US-only ZIP codes on data which clearly doesn't have them just seems imperialistic. (Ask us about trying to write our addresses into order forms on American sites!) – tripleee Jul 01 '19 at 06:19
1

you are probably better of by reading your input csv file into a data frame and then use pandas.str.extract() to extract your post code from your address column.

  1. read csv: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
  2. extract postscode: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html
  3. write csv: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
Raphael
  • 1,731
  • 2
  • 7
  • 23
  • This is useful for future reference. Still hoping to discover if a solution with just csv and re exists. – Scottie Jun 30 '19 at 17:27
1

The CSV in your sample is not valid; it looks like you are lacking quoting around the Address field.

Also, re.findall() can return more than one result - CSV can't really accommodate more than one value in a column (and when you try you get into the sort of mess you are trying to get out of now); a better solution in general is to normalize your data so that every field contains a minimal, atomic piece of data which cannot be further divided into smaller units of information.

If you are trying to represent nested or hierarchical data, maybe look at JSON or XML instead of CSV as your storage format.

With that out of the way, here is a refactoring which adds one field to the end of each line, and embeds a semicolon-separated list of post codes (or nothing at all, if the regex match was unsuccessful) into that field.

import csv, re

# Precompile the pattern
pattern = reccompile(r'[A-Z]{1,2}[0-9R][0-9A-Z]?[0-9][A-Z]{2}')

with open(r'Postcode/addressin.csv', 'r') as csvinput, open(r'Postcode/addressout.csv', 'w') as csvoutput:
    csv_reader = csv.DictReader(csvinput)
    csv_writer = csv.writer(csvoutput)

    outputfieldnames = ['Address', 'Name', 'Postcode']
    csv_writer.writerow(outputfieldnames)

    for line in csv_reader:
        postcodes = ';'.join(pattern.findall(line["Address"]))
        csv_writer.writerow([line["Address"], line["Name"], postcodes])
tripleee
  • 175,061
  • 34
  • 275
  • 318