.net program to parse .doc file

Question

I want to create an application that can parse doc/docx files. The structure of such a file is shown below:

par-000.01 - some content
par-000.21 - some content
par-000.31 - some content
par-001.32 - some content

The content can span multiple lines and is not regular. I want to put this content into a database: for the first record, par-000.01 goes into the code column and "some content" goes into the text column.
I cannot do this manually because I have about 15 docs, each containing about 10 pages of paragraphs that I want to put into my database.
I cannot find any article on how to parse a whole doc file, but I believe it should be possible if I write a proper regular expression. Can anyone point me to an article on how to do this? I can't find anything that suits me; I am probably using the wrong keywords.

Actually, by now I can load the file line by line and store it in a StringBuilder variable. But using regex this way is not really efficient. — Mithrand1r, Mar 12 '13 at 18:27
Why do you need a RegEx if you are already able to read the doc line by line? Just find the paragraph break and save it? — D.R., Mar 12 '13 at 18:37
15 docs 10 pages each. What performance problems are you experiencing? — paparazzo, Mar 12 '13 at 18:56
Perhaps the Docx library is what you're looking for? http://docx.codeplex.com/ — Jim Mischel, Mar 12 '13 at 20:15

Guy Coder · Accepted Answer · 2013-03-12T21:20:33.000

You have a reasonable amount of data (15 docs * 10 pages/doc * ~100 lines/page = 15,000 lines), which is manageable in a Word document, and you did not say this is a repeating data feed, so I assume it is a one-time conversion. I would therefore do it in an editor that supports global find and replace and convert the text to a comma-separated values (CSV) format. Most databases I know can load a CSV file.

I know you asked for a C# app, but that is overkill in time and effort for your problem.

So:

1. Convert '<start of line>' to '<start of line>"'.
   For MS Word with Find and Replace:
   find: ^p
   replace: ^&"
2. Convert ' - ' to '","'.
   For MS Word with Find and Replace:
   find: ' - ' Note: don't add the tick marks.
   replace: ","
3. Convert '<end of line>' to '"<end of line>'.
   For MS Word with Find and Replace:
   find: ^p
   replace: "^&
4. Manually fix up the start of the first line and the end of the last line.

You should get:

"par-000.01","some content"
"par-000.21","some content"

Now just load that into a DB using its CSV load.
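If the content itself can contain ' - ' or double quotes, a blanket find and replace would split or break those lines. A minimal C# sketch of the same conversion, assuming each input line already holds one complete "code - text" record, is shown here. The class and method names are hypothetical, and doubling embedded quotes follows the common CSV convention (RFC 4180), which your database's loader may or may not expect.

```csharp
using System;

// Hypothetical helper for illustration; not part of the original answer.
public static class CsvSketch
{
    // Wraps a field in double quotes and doubles any embedded quotes.
    public static string Quote(string field)
    {
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Splits at the FIRST " - " only, so later dashes stay inside the text.
    public static string ToCsvLine(string line)
    {
        int split = line.IndexOf(" - ", StringComparison.Ordinal);
        if (split < 0)
            throw new FormatException("No ' - ' separator in: " + line);

        string code = line.Substring(0, split);
        string text = line.Substring(split + 3); // 3 = length of " - "
        return Quote(code) + "," + Quote(text);
    }
}
```

For example, CsvSketch.ToCsvLine("par-000.01 - some content") gives "par-000.01","some content", matching the output above.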

Also, if you insist on doing this with C#, realize that you can probably save the document as a *.txt file without all of the Word tags, and it will be much easier to take apart with a C# app. Don't get fixated on the Word tags; just sidestep the problem with some creative thinking.
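As a rough sketch of that C# route, the following groups the lines of the saved text file into (code, text) records. It assumes every record starts with a code shaped like par-000.01 followed by " - ", and that any other non-empty line continues the previous record. The pattern and the names ParagraphParser and Parse are my own assumptions, so adjust them to your real data.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original answer.
public static class ParagraphParser
{
    // Assumed code shape: "par-", three digits, ".", two digits, then " - ".
    private static readonly Regex RecordStart =
        new Regex(@"^(par-\d{3}\.\d{2})\s+-\s+(.*)$", RegexOptions.Compiled);

    public static List<(string Code, string Text)> Parse(IEnumerable<string> lines)
    {
        var records = new List<(string Code, string Text)>();
        bool inRecord = false;
        string code = "";
        var text = new StringBuilder();

        foreach (var raw in lines)
        {
            var line = raw.TrimEnd();
            var match = RecordStart.Match(line);
            if (match.Success)
            {
                // A new code closes the record collected so far.
                if (inRecord) records.Add((code, text.ToString()));
                inRecord = true;
                code = match.Groups[1].Value;
                text.Clear().Append(match.Groups[2].Value);
            }
            else if (inRecord && line.Length > 0)
            {
                // Continuation line of multi-line content; blank lines are skipped.
                text.Append('\n').Append(line);
            }
        }

        if (inRecord) records.Add((code, text.ToString()));
        return records;
    }
}
```

You could call it as ParagraphParser.Parse(File.ReadLines("paragraphs.txt")), where the file name is a placeholder, and then insert each record into the code and text columns with a parameterized query.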

Yeah, just save it as text and parse it from there. — Jim Mischel, Mar 12 '13 at 22:08

score 0 · Answer 2 · answered Sep 19 '19 at 07:13

You can automate parsing of Word documents (.doc or .docx) in C# using the GroupDocs.Parser for .NET API. The text can be extracted either line by line or as a whole. This is how you can do it:

// extracting all the text 
WordsTextExtractor extractor = new WordsTextExtractor("sample.docx");
Console.Write(extractor.ExtractAll());

// OR

// Extract text line by line
string line = extractor.ExtractLine();

// If the line is null, then the end of the file is reached
while (line != null)
{
    // Print a line to the console
    Console.Write(line);
    // Extract another line
    line = extractor.ExtractLine();
}

Disclosure: I work as Developer Evangelist at GroupDocs.
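For .docx files only (not legacy .doc), a dependency-free alternative is possible because a .docx file is a ZIP package of XML parts. The sketch below is a different technique from the GroupDocs API above. It assumes the main part is stored at word/document.xml, which is the usual location but is strictly determined by the package relationships, and the class and method names are hypothetical. It also ignores tabs, line breaks and other non-text elements, and paragraphs nested in text boxes may show up twice.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration; handles .docx only.
public static class DocxTextSketch
{
    // WordprocessingML main namespace used by w:p (paragraph) and w:t (text).
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the concatenated text runs of each paragraph, in document order.
    public static List<string> ReadParagraphs(Stream docx)
    {
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            // Assumption: the main document part lives at this conventional path.
            var entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found");

            using (var part = entry.Open())
            {
                var xml = XDocument.Load(part, LoadOptions.PreserveWhitespace);
                return xml.Descendants(W + "p")
                    .Select(p => string.Concat(
                        p.Descendants(W + "t").Select(t => t.Value)))
                    .ToList();
            }
        }
    }
}
```

Opening a file with File.OpenRead("sample.docx") and passing the stream to DocxTextSketch.ReadParagraphs gives one string per paragraph, which can then be split into code and text.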
