I'm attempting to build a WinForms application that can do the following:

  1. Take in a PDF file
  2. Extract data (based on some sort of template or configuration file)
  3. Build data tables
  4. Serialize and upload the data tables to a web service

As of right now I have the PDF file converted into a text string, but I am having trouble coming up with a format for the template. At first I tried making my own custom XML configuration files. While this would satisfy the requirements of the project, I am finding it extremely difficult to express the necessary instructions in a way that is general enough. My first approach was to process the text line by line, using a series of flags for the various instructions. This seemed like it would work until I realized that the data tables often span multiple pages with extraneous text in between. My initial processing attempt went like this:

  1. Load the first instruction (start flag, end flag, action (e.g. create table), and table structure)
  2. When the end flag is reached, load the next instruction
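
For readers who want something concrete, the two steps above can be sketched roughly as follows. This is a minimal illustration, not the original project's code: the `Instruction` class, its property names, and the whitespace-based row splitting are all assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

// Hypothetical instruction shape; the names are illustrative only.
public sealed class Instruction
{
    public string StartFlag { get; set; }
    public string EndFlag { get; set; }
    public string TableName { get; set; }
    public string[] Columns { get; set; }
}

public static class FlagProcessor
{
    public static List<DataTable> Process(IEnumerable<string> lines, IList<Instruction> instructions)
    {
        var tables = new List<DataTable>();
        int index = 0;             // instruction currently loaded
        DataTable current = null;  // non-null while between a start flag and its end flag

        foreach (string line in lines)
        {
            if (index >= instructions.Count) break;
            Instruction ins = instructions[index];

            if (current == null)
            {
                // Step 1: wait for the start flag, then create the table.
                if (line.Contains(ins.StartFlag))
                {
                    current = new DataTable(ins.TableName);
                    foreach (string column in ins.Columns)
                        current.Columns.Add(column, typeof(string));
                }
                continue;
            }

            // Step 2: the end flag closes the table and loads the next instruction.
            if (line.Contains(ins.EndFlag))
            {
                tables.Add(current);
                current = null;
                index++;
                continue;
            }

            // Assumed row format: whitespace-separated cells. Other lines are skipped.
            string[] cells = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
            if (cells.Length == ins.Columns.Length)
                current.Rows.Add((object[])cells);
        }
        return tables;
    }
}
```

Each instruction is consumed exactly once here, so a table that resumes on a later page after its end flag is never picked up again.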

Unfortunately, this doesn't account for looping or offer enough control over how the processing works. In some cases I need to get information that is appended to every row of data. I worked out how to do this by queuing instructions and then going back to process them again once the rest of the table is built. The looping issue still remains, though, since each table is named based on the instruction.

Now I am looking into VTL and trying to see if a project like Vici would help me. It is getting to the point where I'm creating a pseudo-scripting language just to accomplish what I need, and it is getting far too difficult.

TL;DR version: Are there any libraries or projects that will help me build data tables from plain text using some sort of template or configuration files?

emd
  • Have you thought of the prospect of *NOT* using a template or configuration file? What are the advantages of using such a file? Can't you, for example, create an impromptu library and just write the actual processing code in C#? I did the same thing you're doing now, once, and in retrospect, this is what I should have done. – GregRos May 30 '12 at 18:47
  • @GregRos The complete concept involves a collection of configurations which are auto-updated on the client application from a central source. The reason being, the documents being parsed are subject to change, and a configuration- or template-based system allows for updates without code changes and redeployment of the client application. The next question would be: Why not do it server side? The answer being: uploading full PDF files is not that great of an idea. – emd May 30 '12 at 18:50
  • Yes, but that's not possible. You said it yourself, you're developing some sort of scripting language. That already means code changes. Whatever you use, if the scenario is complex enough, it's bound to end up as code changes or a similar effort. You could package the processing code separately from the library code, and update only the assembly that contains it. – GregRos May 30 '12 at 18:53
  • @GregRos Good point, I guess what I would like to do really isn't feasible in the way I want to do it. Keeping the processing code separate from the library is a better idea. – emd May 30 '12 at 18:56
  • In that case, let me add the comment to the answers below to benefit *future generations*. – GregRos May 30 '12 at 19:10

1 Answer

Have you thought of the prospect of NOT using a template or configuration file? What are the advantages of using such a file? Can't you, for example, create an impromptu library and just write the actual processing code in C#? I did the same thing you're doing now, once, and in retrospect, this is what I should have done.

You said it yourself, you're developing some sort of scripting language. That already means code changes. Whatever you use, if the scenario is complex enough, it's bound to end up as code changes or a similar effort. You could package the processing code separately from the library code, and update only the assembly that contains it.
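
One possible way to arrange that split is sketched below. The interface `IDocumentProcessor`, the `ProcessorLoader` helper, and the sample `InvoiceProcessor` are hypothetical names invented for this illustration. The assumption is that the interface lives in the stable library while the processors live in a separate DLL that can be replaced on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;
using System.Reflection;

// Hypothetical contract, assumed to be defined in the stable library assembly.
public interface IDocumentProcessor
{
    bool CanProcess(string documentText);
    DataSet Process(string documentText);
}

public static class ProcessorLoader
{
    // Instantiates every concrete IDocumentProcessor that has a public parameterless constructor.
    public static List<IDocumentProcessor> FromAssembly(Assembly assembly)
    {
        return assembly.GetTypes()
            .Where(t => typeof(IDocumentProcessor).IsAssignableFrom(t)
                        && t.IsClass && !t.IsAbstract
                        && t.GetConstructor(Type.EmptyTypes) != null)
            .Select(t => (IDocumentProcessor)Activator.CreateInstance(t))
            .ToList();
    }

    // The path is an assumption, e.g. a DLL downloaded from the central source.
    public static List<IDocumentProcessor> FromFile(string path)
    {
        return FromAssembly(Assembly.LoadFrom(path));
    }
}

// Example processor; in practice this would live in the separately updated assembly.
public sealed class InvoiceProcessor : IDocumentProcessor
{
    public bool CanProcess(string documentText)
    {
        return documentText.Contains("INVOICE");
    }

    public DataSet Process(string documentText)
    {
        // Placeholder logic: one row per non-empty line of text.
        var table = new DataTable("Lines");
        table.Columns.Add("Text", typeof(string));
        foreach (string line in documentText.Split('\n'))
        {
            string trimmed = line.Trim();
            if (trimmed.Length > 0) table.Rows.Add(trimmed);
        }
        var set = new DataSet("Invoice");
        set.Tables.Add(table);
        return set;
    }
}
```

With something like this in place, the client would call `ProcessorLoader.FromFile` on the updated DLL and pick the first processor whose `CanProcess` returns true for the extracted text. Note that an assembly loaded this way stays loaded for the lifetime of the process, so picking up a newer version generally means restarting the application.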

GregRos