I am looking into options for parsing data from text files. We receive invoices from different clients, and the format is not predefined. Each file contains a table-like structure with varying columns, as shown below, and we need to extract the data from it.

Right now we have an IExtractor interface with a Parse method, which is implemented by one parser class per client. Depending on the file, the appropriate class is instantiated, and the logic for retrieving the data is hard-coded in it.

Since the number of clients is increasing, we are looking for a more robust and easier-to-code way to extract the information from the text files.

Is it recommended to use regular expressions to identify the header and footer, and another expression to extract the information from each row? I would appreciate it if anyone could suggest better alternatives.

<addition text>.....
    Date          Document            Invoice               Deductions     Paid Amount
    --------------------------------------------------------------------------------------------
    21.03.2014    9289                9280                  0.00                        48,000.00
    10.01.2013    21389               9402                  3.00                        4,000.00
    21.03.2014    9289                9280                  0.00                        48,000.00
    10.01.2013    21389               9402                  3.00                        4,000.00

    Sum Total
    Please ....<text>
Sunny
  • If you just need the data, why not a simple CSV file? Not as pretty for humans to read, but it gets the data with no "fluff". – gunr2171 Sep 22 '14 at 17:04
  • We receive *.txt files from clients – Sunny Sep 22 '14 at 17:07
  • It's not clear to me what you're asking. Are you asking what format of data you should tell your clients to provide the data in? XML and JSON are often good formats for exchanging data, though they're not human friendly so you'd want to create a program that generates those for you. Otherwise CSV or Excel are good choices. – mason Sep 22 '14 at 17:08
  • I am sorry, I have edited the question. We need to extract info. from text files which are not in a standard format. Right now, we have hardcoded the logic for each client by looping through each line. Please let me know if it is still unclear. – Sunny Sep 22 '14 at 17:14
  • If you need a standardized way to parse the data, give your clients a standardized format that they must use to send the data. Otherwise, you're stuck with writing specialized parsers for the various formats. – Ken White Sep 22 '14 at 17:16
  • @Ken White, Unfortunately it is not an option. – Sunny Sep 22 '14 at 17:20
  • Hmmm... I'm not sure why it wouldn't be. There's an entire standardized system, including standards for various specialized types of data, available for EDI transactions between entities. – Ken White Sep 22 '14 at 17:38

1 Answer

If you have too many clients for a coded solution (i.e. one IExtractor.Parse implementation per client, as you mention), then I would go for an embedded scripting language.

You can then write a script per client.

I would use JavaScript as the language (it has built-in regex support), hosted with the Jint project from CodePlex.
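
As a rough illustration, a per-client script for the sample layout in the question might look something like the sketch below. It is only a sketch under assumptions: the entry-point name `parse`, the field names in the returned objects, the dd.mm.yyyy reading of the dates, and the idea that the host application loads the script and calls `parse(text)` are all hypothetical, not something the question specifies. It is written in plain ES5 style so that an embedded interpreter is likely to accept it.

```javascript
// Hypothetical per-client script. The host application is assumed to
// load this file and call parse(text) with the full file contents.

// One pattern marks the start of the table, one marks the end.
var HEADER = /^\s*Date\s+Document\s+Invoice\s+Deductions\s+Paid\s+Amount\s*$/;
var FOOTER = /^\s*Sum Total/;

// One pattern per data row: date, two integer ids, two money amounts.
// Dates are assumed to be dd.mm.yyyy; amounts use "," as thousands separator.
var ROW = /^\s*(\d{2})\.(\d{2})\.(\d{4})\s+(\d+)\s+(\d+)\s+([\d,]+\.\d{2})\s+([\d,]+\.\d{2})\s*$/;

// Strip thousands separators before converting to a number.
function toNumber(s) {
  return parseFloat(s.replace(/,/g, ""));
}

function parse(text) {
  var lines = text.split(/\r?\n/);
  var rows = [];
  var inTable = false;

  for (var i = 0; i < lines.length; i++) {
    var line = lines[i];

    if (!inTable) {
      // Ignore everything until the header line is seen.
      if (HEADER.test(line)) {
        inTable = true;
      }
      continue;
    }

    // Stop at the footer; anything after it is free text.
    if (FOOTER.test(line)) {
      break;
    }

    // Separator and blank lines simply fail to match and are skipped.
    var m = ROW.exec(line);
    if (m) {
      rows.push({
        date: m[3] + "-" + m[2] + "-" + m[1], // yyyy-mm-dd
        document: m[4],
        invoice: m[5],
        deductions: toNumber(m[6]),
        paidAmount: toNumber(m[7])
      });
    }
  }
  return rows;
}
```

With this shape, onboarding a new client usually means copying the script and adjusting the three patterns and the field mapping, without recompiling the host application.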

pm100