0

Hello creative Developers and night rangers of StackOverflow, I have a customer who has around 20 thousand words for a dictionary in Microsoft Document files.

He created it around 1 decade ago, now I have to load these *.doc files contents to a database to create a dictionary for the customer.

My Question is where to start for changing a Text based in Columns to any sort of Database?

I'm thinking about using RegEx and using some patterns. So any cool suggestions?

curtisk
  • 19,950
  • 4
  • 55
  • 71
0bserver07
  • 3,390
  • 1
  • 28
  • 56
  • How many files are they? less than hundred? more than hundred? – d_inevitable Mar 19 '13 at 22:42
  • 1
    How is the data stored within the document? In a single table for each document? Multiple columns in tables within the document? – Zev Spitz Mar 19 '13 at 22:45
  • Or is it stored not as tables, but as simple text? If the data is stored in tables, then convert it to plaintext using the ConvertToText method. Once in plaintext, you can save it into a text file, and connect to it via an Access linked table, an ADODB connection, or an ADO.NET connection with the OleDb provider objects. All these support reading from a formatted text file as if it were a database table, with SQL. – Zev Spitz Mar 19 '13 at 22:53
  • @d_inevitable , there are like 20 to 24 files, but lost of words and their meanings. – 0bserver07 Mar 20 '13 at 08:57
  • @ZevSpitz I tried saving it in HTML to see if I can take them off by using DIV s didn't work, then I tried changing it to a plaintext file, but now I have no idea what to do with it. – 0bserver07 Mar 20 '13 at 08:58
  • 1
    What programming environments are you familiar/most comfortable with? .NET - C#/VB.NET? VBA/VB6 - Access? VBScript? Javascript? Something else? Let me know and I'll post some sample code. – Zev Spitz Mar 24 '13 at 10:39
  • @ZevSpitz C#, JS, Python, Ruby, pretty much any other Language similar to those :)! Probably I should update my question, because I tried RegEx and picking the English words into a column and their meanings and phonetic pronunciation to another column. At some places it gets messy, and its hard to check because its alot. – 0bserver07 Mar 25 '13 at 05:03
  • Again, the central question is how the words and meanings are stored within the documnent. Is each word-meaning pair in a single paragraph, with some character or formatting separating them? Is each word-pair a row in a table? More detail is needed. – Zev Spitz Mar 26 '13 at 19:08
  • You mentioned before that you got a plaintext file. What is the structure of that file? – Zev Spitz Mar 26 '13 at 19:09

2 Answers2

1

The main problem here is not that the data is stored in text, but that it is stored in .doc files and in tables there and that they are in many files.

So what you will need to do is:

  • Combine it into one file.
  • Convert it into sql text
  • Convert it into a text file

You can do this in any order, but the order will change the methodology a lot.

You could create MS-Word macros (in Basic), that would convert it into SQL text and combines the documents into one.

Or you could convert the document into RTF, and then run write script in any language you like to do the rest.

Regular expressions surely will be handy, but can't say how they should look like, because you did not specify how the files look like.

If there are not so many files, you could consider using copy & paste to put it into a simple text file. That will get rid of the table too. The result might be ugly, but it would still be structure so that I can be converted into sql.

d_inevitable
  • 4,381
  • 2
  • 29
  • 48
  • RTF is probably overkill, since a simple text file would be easier to parse in any language. Also, converting the text into SQL statements seems overcomplicated, when there are various methods available to read the file as-is. – Zev Spitz Mar 19 '13 at 22:58
  • @ZevSpitz well you can you can put them into standard formats so that an existing method will be able to parse them. I am assuming this is not case now. Converting directly into text from word? I not sure if that is possible. Where as RTF is. Depending on how the data is structure it might be the easiest way to parse it using some RTF lib. SQL, CSV will almost have the complexity. Only that SQL will have an extra prefix, and suffix. – d_inevitable Mar 19 '13 at 23:06
  • Microsoft Word has a rich object model, which will allow automating the process of opening each Word document, converting the table into delimited text, and saving said text as a plain text file - either using the [SaveAs2 method](http://msdn.microsoft.com/en-us/library/ff836084%28v=office.14%29.aspx) while passing in the `wdFormatText` constant; or reading the converted contents into a text file using the built-in APIs available (for example, under .NET, using File.WriteAllLines). Another possibility, if Word is not installed, is to use the Office Open XML SDK. – Zev Spitz Mar 19 '13 at 23:15
  • @d_inevitable that is a great reply, but first, there is a huge mass of data. However, my best choice write now is trying the Macro and trying the delimiter thing. – 0bserver07 Mar 20 '13 at 09:02
  • @ZevSpitz for somebody who is fluent in .NET your suggestion is very good. I think you should provide that as an answer with more details. However to somebody who would need to learn .NET first, it would be an overkill. My suggestion brings it into a domain in which any programming environment can be used. I mean we must not forget, that we expect the program to be run only once. So it doesn't need to be neat or efficient. – d_inevitable Mar 20 '13 at 10:27
  • This has nothing to do with .NET (except for the Office Open XML SDK, which I don't know if it can be used outside of .NET). Both Word and ADODB are COM components, which means that they are accessible from VBScript/JScript under Windows Scripting Host, from VBA within another Office program, Python (with the PyWin32 extensions), PHP, and of course the .NET languages - C#, VB.NET, F# etc. – Zev Spitz Mar 24 '13 at 10:37
1

Sample in C#:

For starters, add a reference to Microsoft.Office Interop.Word. Then you can do some basic parsing:

var wdApp = new Application();
var dict = new Dictionary<string, string>();
//paths is some collection of paths to the Word documents
//You can use Directory.EnumerateFiles to get such a collection from a folder
//EnumerateFiles also allows you to filter the files, say to only .doc
foreach (var path in paths) {
    var wdDoc = wdApp.Documents.Open(path);
    foreach (Paragraph p in wdDoc.Paragraphs) {
        var text = p.Range.Text;
        var delimiterPos = text.IndexOf(";");
        dict.Add(
            text.Substring(0, delimiterPos - 1),
            text.Substring(delimiterPos + 1)
        );
    }
    wdDoc.Close();
}
//This can be done more cleanly using LINQ, but Dictionary<TKey,TValue> doesn't have an AddRange method.
//OTOH, such a method can be easily added as an extension method, taking IEnumerable<KeyValuePair<TKey,TValue>>

For more complex parsing, you can save each item as a new textfile:

var newPaths =
    from path in paths
    select new {
        path,
        //If needed, add some logic to put the textfile in a different folder
        newPath = Path.ChangeExtension(path, ".txt")
    };
var wdApp = new Application();
foreach (var item in newPaths) {
    var wdDoc = wdApp.Documents.Open(item.path);
    wdDoc.SaveAs2(
        FileName: item.newPath,
        FileFormat: WdSaveFormat.wdFormatText
    );
    wdDoc.Close();
}

You may also need to create a file named schema.ini and put it in the same folder as the text files (more details on the syntax here):

//assuming the delimiter is a ;
File.WriteAllLines(schemaPath,
    from item in newPaths
    select String.Format(@"
        [{0}]
        Format=Delimited(;)
    ", item.filename)
);

Then, you can query the resulting text files using SQL statements, via the OleDbConnection, OleDbCommand, and OleDbReader classes.

foreach (var item in newPaths) {
    var connectionString = @"
        Provider=Microsoft.Jet.OLEDB.4.0;
        Extended Properties=""text;HDR=NO;IMEX=1;""
        Data Source=" + item.newPath;
    using (var conn = new OleDbConnection(connectionString)) {
        using (var cmd = conn.CreateCommand()) {
            cmd.CommandText = String.Format(@"
                SELECT *
                FROM [{0}]
            ", item.newPath);
            using (var rdr = cmd.ExecuteReader()) {
                //parse file contents here
            }
        }
    }
}
Zev Spitz
  • 13,950
  • 6
  • 64
  • 136