
I get daily data feeds with data that is only loosely structured. I need to import it into a database so I can run a report that finds new records and changes to existing records.

The data looks like this:

--------------------------------
blah:
foo
bar
lorum: ipsum
dolor: sit
foo: bar
bar: foo
123-555-1212
Lorum / Ipsum / Dolor / Sit
Foo / Bar
--------------------------------

As you can see, there are some field headings like "blah" and "lorum", but some data lacks a heading, like the phone number or the slash-delimited list. Some values are on the same line as their heading and others start on the next line.

Just to keep us on our toes, the records do not have the same number of fields.

So I'm thinking that the parser needs at least 3 ways to handle the data, like:

if a line matches "heading:$", grab the following lines as its value until the next ".*:" line is read; if a line matches "heading: value", grab both parts; if a line starts with a number, assume a heading of "phone"; and if a line contains a slash-delimited list, assume a heading of "features", up to the "--------..." separator.

But I have no idea how to start coding something like this. The language is open at this point, although I have to run the code on macOS.

I suppose Perl might be good for this, but I have very poor Perl-fu.

Don't even know where to start with this one.

Paul Ericson

1 Answer


You always need to assume something about your text; otherwise you have an exercise in NLP.

Can we assume that the non-key/value part comes at the end? If so, the following regexes will help you:

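 # split the text into records on the dashed separator lines
 # (any run of 5 or more dashes; grep drops empty leading chunks):
 my @records = grep { /\S/ } split /^-{5,}[ \t]*\n?/m, $text;

 # this will find a key/value pair that has another key/value pair after it
 # (the value may span several lines):
 my $pair_re = qr/\A(\w+):(.*?)(?=\n\w+:)/ms;

 # then the last key/value, which probably must be on one line:
 my $last_re = qr/^(\w+):(.*)/;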

I recommend that each time, after a successful match, you remove the matched text (and the newline that follows it) and continue with the rest of the record.
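
Since the language is open, here is a minimal sketch of that match-remove-continue loop in Python rather than Perl. The two patterns are straight translations of the regexes above; the names PAIR_RE, LAST_RE and peel_pairs are made up for illustration, and the sketch assumes all headed fields come before the heading-less lines.

```python
import re

# Python translations of the two Perl patterns above (hypothetical names).
PAIR_RE = re.compile(r"\A(\w+):(.*?)(?=\n\w+:)", re.S)  # pair followed by another pair
LAST_RE = re.compile(r"\A(\w+):(.*)")                    # final pair, single line


def peel_pairs(record):
    """Peel key/value pairs off the front of one record.

    Returns (pairs, rest), where rest is the heading-less text left over.
    """
    pairs = []
    rest = record.strip("\n")
    while True:
        m = PAIR_RE.match(rest) or LAST_RE.match(rest)
        if not m:
            break
        key, value = m.group(1), m.group(2)
        # A multi-line value becomes a list of its non-empty lines.
        lines = [ln.strip() for ln in value.splitlines() if ln.strip()]
        pairs.append((key, lines[0] if len(lines) == 1 else lines))
        # Remove the matched text plus the newline after it, then continue.
        rest = rest[m.end():].lstrip("\n")
    return pairs, rest
```

The pairs come back as a list of tuples rather than a dict, so a heading that repeats inside one record is not silently overwritten.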

Other useful assumptions: that the phone number appears only once in a record (and not as part of another key/value pair), and that the tags come at the end.
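
Under those assumptions, the heading-less lines left at the end of a record could be sorted out roughly like this. It is only a sketch in Python: the NNN-NNN-NNNN phone format is guessed from the sample, the "phone" and "features" headings are the ones proposed in the question, and classify_tail is a hypothetical helper name.

```python
import re

# Assumed phone format, guessed from the sample record (NNN-NNN-NNNN).
PHONE_RE = re.compile(r"^\d{3}-\d{3}-\d{4}$")


def classify_tail(rest):
    """Sort the heading-less trailing lines into a phone number and features."""
    phone = None
    features = []
    unknown = []
    for line in rest.splitlines():
        line = line.strip()
        if not line:
            continue
        if phone is None and PHONE_RE.match(line):
            # Assumption: at most one phone number per record.
            phone = line
        elif "/" in line:
            # Slash-delimited list: collect every non-empty item.
            features.extend(p.strip() for p in line.split("/") if p.strip())
        else:
            # Keep anything unrecognized so it can be reviewed by hand.
            unknown.append(line)
    return {"phone": phone, "features": features, "unknown": unknown}
```

Keeping an "unknown" bucket makes it easy to spot records that break the assumptions instead of dropping their data silently.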

Shmuel Fomberg