1

I want to create an application that can parse doc/docx files. The structure of these files is shown below:

par-000.01 - some content
par-000.21 - some content
par-000.31 - some content
par-001.32 - some content

The content can span multiple lines and is not regular. I want to put this content into a database: for the first record, par-000.01 goes into the code column and some content goes into the text column.
I cannot do this manually because I have about 15 docs, each containing about 10 pages of paragraphs that I want to put into my database.
I cannot find any article on how to parse a whole doc file, so I believe it might be possible if I write a proper regular expression. Can anyone point me to an article on how to do this? I can't find anything that suits me; I am probably using the wrong keywords.

Guy Coder
Mithrand1r

2 Answers

3

You have a reasonable amount of data: 15 docs * 10 pages/doc * ~100 lines/page = 15,000 lines, which is manageable in a Word document. You also did not say that this is a repeating data feed, so I assume it is a one-time conversion. I would do it in an editor that supports global find and replace, and convert the text to comma-separated values (CSV) format. Most databases I know can load a CSV file.

I know you asked for a C# app, but that is overkill in time and effort for your problem.

So

  1. Convert '<start of line>' to '<start of line>"'
    for MS Word with Find and replace
    find: ^p
    replace: ^&"

  2. Convert ' - ' to '","'
    for MS Word with Find and replace
    find: ' - ' Note: type space, hyphen, space; don't type the tick marks.
    replace: ","

  3. Convert '<end of line>' to '"<end of line>'
    for MS Word with Find and replace
    find: ^p
    replace: "^&

  4. Manually fix up start of first line and end of last line.

You should get:

"par-000.01","some content"
"par-000.21","some content"

Now just load that into a DB using its CSV load.
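
One caveat with plain find and replace: a quotation mark inside the content, or a second ' - ' within a paragraph, would produce a malformed CSV row. If that turns out to matter for your data, a small helper can build the rows instead. The following is only a minimal sketch; the names CsvWriter, Quote and ToCsvLine are made up for this example, and it assumes your database's CSV loader accepts the common convention of doubling embedded quotation marks.

```csharp
using System;

public static class CsvWriter
{
    // Wrap a field in quotation marks and double any quotation marks inside it.
    public static string Quote(string field)
    {
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Build one two-column row: code, text.
    public static string ToCsvLine(string code, string text)
    {
        return Quote(code) + "," + Quote(text);
    }
}

public static class Program
{
    public static void Main()
    {
        // Prints: "par-000.01","some content"
        Console.WriteLine(CsvWriter.ToCsvLine("par-000.01", "some content"));
        // Prints: "par-000.21","he said ""hi"" - twice"
        Console.WriteLine(CsvWriter.ToCsvLine("par-000.21", "he said \"hi\" - twice"));
    }
}
```

Because the code and the text are passed in as separate values, a ' - ' inside the text no longer splits the row.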

If you insist on doing this with C#, note that you can probably save the document as a plain *.txt file, without any of the Word markup, and it will then be much easier to take apart with a C# app. Don't get fixated on the Word tags; just sidestep the problem with some creative thinking.
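
As a rough illustration of the plain-text route, the sketch below groups the lines of such a *.txt file into code/text pairs. It assumes every record starts with a code shaped like the four samples in the question (par-, three digits, a dot, two digits) followed by ' - ', and that any other non-empty line continues the previous record. The names ParagraphParser and Parse are made up for this example; adjust the pattern if your real codes look different.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public static class ParagraphParser
{
    // Assumed record start: a code such as "par-000.01", then " - ", then text.
    private static readonly Regex RecordStart =
        new Regex(@"^(par-\d{3}\.\d{2}) - (.*)$");

    public static List<KeyValuePair<string, string>> Parse(IEnumerable<string> lines)
    {
        var records = new List<KeyValuePair<string, string>>();
        string code = null;
        var text = new List<string>();

        foreach (string raw in lines)
        {
            string line = raw.TrimEnd();
            Match m = RecordStart.Match(line);
            if (m.Success)
            {
                // A new code begins, so store the record collected so far.
                if (code != null)
                    records.Add(new KeyValuePair<string, string>(code, string.Join("\n", text)));
                code = m.Groups[1].Value;
                text = new List<string> { m.Groups[2].Value };
            }
            else if (code != null && line.Length > 0)
            {
                // Continuation line of a multi-line paragraph (blank lines are dropped).
                text.Add(line);
            }
        }

        // Store the last record, if there was one.
        if (code != null)
            records.Add(new KeyValuePair<string, string>(code, string.Join("\n", text)));
        return records;
    }
}

public static class Program
{
    public static void Main(string[] args)
    {
        // With no file argument, fall back to a small inline sample so the sketch runs as-is.
        IEnumerable<string> lines = args.Length > 0
            ? File.ReadLines(args[0])
            : new[] { "par-000.01 - some content", "continued on a second line", "par-000.21 - some content" };

        foreach (var record in ParagraphParser.Parse(lines))
            Console.WriteLine(record.Key + " => " + record.Value);
    }
}
```

Each pair can then be written out as a CSV row or inserted into the code and text columns directly.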

Guy Coder
0

You can automate the parsing of Word documents (.doc or .docx) in C# using the GroupDocs.Parser for .NET API. The text can be extracted from a document either line by line or as a whole. This is how you can do it:

// extracting all the text 
WordsTextExtractor extractor = new WordsTextExtractor("sample.docx");
Console.Write(extractor.ExtractAll());

// OR

// Extract text line by line
string line = extractor.ExtractLine();

// If the line is null, then the end of the file is reached
while (line != null)
{
    // Print a line to the console
    Console.Write(line);
    // Extract another line
    line = extractor.ExtractLine();
}

Disclosure: I work as a Developer Evangelist at GroupDocs.

Usman Aziz