
I am looking for an algorithm or example code for parsing postal addresses from a text file and converting them to an Excel report.

The text file I receive will have many postal addresses in different formats, as given below:

Unit 8/25 Bright ST, NSW, 2010 
UNIT 5, 77 CHAPMAL STREET
UNIT 4, 75 GREAT STREET
95 OAKILANDS WAY
AVOCADO AVE
628 BRIDGEWATER ABCE ROAD

I have to read this file and assign the details to variables for further usage:

For example:
the street name from the text address should be assigned to 'streetName',
the unit number should be assigned to 'unitNumber', etc.

I have pattern matcher which can recognise the values if there are specific details in the string:

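// Matches a street number followed by the rest of the line, e.g. "75 GREAT STREET".
Pattern p1 = Pattern.compile("([0-9]+\\s+[a-zA-Z])+.*");
Matcher m1 = p1.matcher(str);
while (m1.find()) {
    // Keep the last match found in the string.
    printer = m1.group();
}
System.out.println("Street : " + printer);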

For example, given the address "4, 75 GREAT STREET", the code above identifies "75 GREAT STREET" as the street. However, the number "4" before the street address should be identified as the unit/flat number, regardless of whether "4" is preceded by "unit", "flat", etc.

I have added one more pattern match to get the number before the street address:

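// Intended to pick up the number before the comma as the unit number.
Pattern p2 = Pattern.compile("([0-9])");   // a single digit
Matcher m2 = p2.matcher(str);
Pattern p3 = Pattern.compile("([,])");     // a comma
Matcher m3 = p3.matcher(str);
// Each iteration consumes one digit and one comma; the last digit read wins.
while (m2.find() && m3.find()) {
    printer = m2.group();
}
System.out.println("Unit : " + printer);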

This gives me output as:

Street : 75 GREAT STREET
Unit : 4

However, this approach does not work when the number is prefixed with "unit", "flat", etc. Can someone help with a solution for this?

Denim Datta
  • Post your current solution, perhaps someone can help you get it up to the "best" category. – Evan Trimboli Dec 04 '13 at 05:14
  • You're not going to be able to use regular expressions to handle this with any sort of accuracy - since the format is completely *irregular*. Have you given some thought to using an API service for this? Something like [Google Maps API](http://stackoverflow.com/questions/7764244/correct-address-format-to-get-the-most-accurate-results-from-google-geocoding-ap) – brandonscript Dec 04 '13 at 05:14
  • Yes, I have tried the Google Maps API, but it only helps if we pass enough data to recognise the address. In my case, sometimes I only get the street name. – Vinay Kumar Dec 04 '13 at 07:21

1 Answer


I would use a more flexible two-step algorithm:

Step 1: Read one line from the file

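// Requires java.io.BufferedReader and java.io.FileReader.
// try-with-resources closes the reader even if an exception is thrown.
try (BufferedReader br = new BufferedReader(new FileReader(file))) {
    String line;
    while ((line = br.readLine()) != null) {
        // process the line.
    }
}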

Step 2: Process the line

I would keep the various regexes used to extract the desired information from each line in an XML file.

My file would look like this:

<address-templates>
    <address name="type1">
        <!-- The regex of this first type of address -->
        <regex><![CDATA[UNIT\s+(\d+),(.+)]]></regex>

        <!-- The capturing groups position of the various known parts -->
        <groups>
           <group name="unit" pos="1"/>
           <group name="street" pos="2"/>
           ...
        </groups>
    </address>

    <address name="type2">
        <!-- The regex of this second type of address -->
        <regex><![CDATA[(\d+)\s+(.+)]]></regex>

        <!-- The capturing groups position of the various known parts -->
        <groups>
           <group name="unit" pos="1"/>
           <group name="street" pos="2"/>
           ...
        </groups>
    </address>

    ...
</address-templates>

Then, in my code, I would parse the small config file above and check each line against each address template. Later, if a new address template appears, I would simply update the config file.
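A minimal sketch of that idea, using the JDK's built-in DOM parser. The class name `AddressTemplates`, the helpers `load` and `parse`, and the inlined `SAMPLE_XML` string are hypothetical names introduced for illustration. Two further assumptions of mine: templates are tried in document order with the first full match winning, and matching is case-insensitive so that "Unit" and "UNIT" are treated alike.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class AddressTemplates {

    // The two templates from the config above, inlined so the sketch is self-contained.
    static final String SAMPLE_XML =
          "<address-templates>"
        + "<address name=\"type1\">"
        + "<regex><![CDATA[UNIT\\s+(\\d+),(.+)]]></regex>"
        + "<groups><group name=\"unit\" pos=\"1\"/><group name=\"street\" pos=\"2\"/></groups>"
        + "</address>"
        + "<address name=\"type2\">"
        + "<regex><![CDATA[(\\d+)\\s+(.+)]]></regex>"
        + "<groups><group name=\"unit\" pos=\"1\"/><group name=\"street\" pos=\"2\"/></groups>"
        + "</address>"
        + "</address-templates>";

    /** One <address> entry: a compiled regex plus its named group positions. */
    static final class Template {
        final String name;
        final Pattern pattern;
        final Map<String, Integer> groups = new LinkedHashMap<>();

        Template(String name, Pattern pattern) {
            this.name = name;
            this.pattern = pattern;
        }
    }

    /** Reads every <address> element into a Template, in document order. */
    static List<Template> load(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<Template> templates = new ArrayList<>();
            NodeList addresses = doc.getElementsByTagName("address");
            for (int i = 0; i < addresses.getLength(); i++) {
                Element address = (Element) addresses.item(i);
                String regex = address.getElementsByTagName("regex")
                        .item(0).getTextContent().trim();
                // Case-insensitive matching is an assumption, not part of the config.
                Template t = new Template(address.getAttribute("name"),
                        Pattern.compile(regex, Pattern.CASE_INSENSITIVE));
                NodeList groups = address.getElementsByTagName("group");
                for (int j = 0; j < groups.getLength(); j++) {
                    Element g = (Element) groups.item(j);
                    t.groups.put(g.getAttribute("name"),
                            Integer.parseInt(g.getAttribute("pos")));
                }
                templates.add(t);
            }
            return templates;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read address templates", e);
        }
    }

    /** Returns the named parts from the first template matching the whole line, else an empty map. */
    static Map<String, String> parse(String line, List<Template> templates) {
        for (Template t : templates) {
            Matcher m = t.pattern.matcher(line.trim());
            if (m.matches()) {
                Map<String, String> parts = new LinkedHashMap<>();
                for (Map.Entry<String, Integer> e : t.groups.entrySet()) {
                    parts.put(e.getKey(), m.group(e.getValue()).trim());
                }
                return parts;
            }
        }
        return new LinkedHashMap<>();
    }

    public static void main(String[] args) {
        List<Template> templates = load(SAMPLE_XML);
        System.out.println(parse("UNIT 4, 75 GREAT STREET", templates)); // {unit=4, street=75 GREAT STREET}
        System.out.println(parse("AVOCADO AVE", templates));             // {} (no template matches)
    }
}
```

In a real run, `load` would read the config from a file and `parse` would be called from the line-reading loop in Step 1. Note that the second template, as written in the config, labels the leading number as "unit"; for a line such as "95 OAKILANDS WAY" a different group name may be more appropriate.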

Stephan