
I am trying to read .docx and .txt files in C#. The content of ABC.docx is:

Test1

Test2

My code does read ABC.docx, but when the data is stored in SQL Server the output looks like this:

[image: screenshot of the rows stored in SQL Server]

Below is my code:

 void WalkDirectoryTree(System.IO.DirectoryInfo root)
    {
        //System.IO.FileInfo[] files = null;

        System.IO.DirectoryInfo[] subDirs = null;

        //need to add-in more extension file such as .doc,  .ppt, .xlsx
        //files = root.GetFiles("*.txt");


        var files = root.GetFiles().Where(a => a.Extension.Contains(".docx") || a.Extension.Contains(".txt"));

        //  files = new string[] { "*.txt", "*.docx" }
        //.SelectMany(i => root.GetFiles(i, SearchOption.AllDirectories))
        //.ToArray();

        //if file is not null, read filename & file extension
        if (files != null)
        {
            foreach (System.IO.FileInfo fi in files)
            {
                StringBuilder text = new StringBuilder();
                Microsoft.Office.Interop.Word.Application word = new Microsoft.Office.Interop.Word.Application();
                object miss = System.Reflection.Missing.Value;
                //object path = @"I:\def.docx";
                object path = fi.FullName;
                object readOnly = true;
                Microsoft.Office.Interop.Word.Document docs = word.Documents.Open(ref path, ref miss, ref readOnly, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss);

                for (int i = 0; i < docs.Paragraphs.Count; i++)
                {
                    text.Append(" \r\n " + docs.Paragraphs[i + 1].Range.Text.ToString());
                }


                //Read all lines of the file, using its full path
                string[] lines = System.IO.File.ReadAllLines(fi.FullName);
                //TextReader reader = new FilterReader(fi.FullName);
                //StreamReader m = new StreamReader(fi.FullName);



                foreach (string line in lines)
                {

                    String[] substrings = fi.FullName.Split('\\');
                    string strFileName = string.Empty;
                    string strFileExtension = string.Empty;


                    if (substrings.Length > 0)
                    {
                        strFileName = substrings[ substrings.Length -1 ];
                        if( !string.IsNullOrEmpty(strFileName) )
                        {
                            string[] extensionSplit = strFileName.Split('.');
                            if (extensionSplit.Length > 0)
                            {
                                strFileExtension = extensionSplit[extensionSplit.Length - 1];
                            }
                        }
                    }
                    else
                    {
                        strFileName = fi.FullName;
                    }

                     InsertData(strFileName, line.Replace("'",""),  fi.FullName,strFileExtension);
                }
            }

            //After searched from root, continue search from subDirectories
            subDirs = root.GetDirectories();

            #region Exclude all the hidden files from drives
            foreach (System.IO.DirectoryInfo dirInfo in subDirs)
            {
                if ((dirInfo.Attributes & FileAttributes.Hidden) == 0)
                {
                    WalkDirectoryTree(dirInfo);
                }
            }
            #endregion
        }
    }

Please advise how to store the content in SQL Server. Thanks.

coder
  • What is the type of `Content`? You are creating a row for each line, which doesn't look right. You probably want to read the file as a byte array and create just one row per file. – muratgu May 06 '16 at 01:25
  • Content from the .docx file is: Test1 Test2 – coder May 06 '16 at 01:27
  • What did you expect it to show? `docx` files are binary files, not text files (actually they are zipped collections of XML files). – Matt Burland May 06 '16 at 01:28
  • You're trying to treat a Word document as a text file. Open one in Notepad and you'll see it is nothing like a text file. – Ken White May 06 '16 at 01:36
  • @coder I was asking about the type of the table column `Content`. – muratgu May 06 '16 at 01:43
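
To illustrate the point made in the comments above, that a .docx is a ZIP archive of XML parts and not a text file, here is a minimal sketch that opens the archive and pulls the paragraph text out of `word/document.xml`. The class name `DocxText` and the method `ReadParagraphs` are made up for this illustration, and the sketch assumes a plain document whose text lives in `w:t` elements of the main part; it is not what the question's code does and it ignores headers, footnotes, tables of contents and similar parts.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, for illustration only.
static class DocxText
{
    // WordprocessingML main namespace used by document.xml.
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns one string per paragraph (w:p) found in the main document part.
    public static string[] ReadParagraphs(Stream docx)
    {
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found; not a .docx?");

            using (Stream part = entry.Open())
            {
                XDocument xml = XDocument.Load(part);
                // Concatenate the text runs (w:t) inside each paragraph.
                return xml.Descendants(W + "p")
                          .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)))
                          .ToArray();
            }
        }
    }
}
```

It could be called as `DocxText.ReadParagraphs(File.OpenRead(fi.FullName))`, which avoids both `File.ReadAllLines` on a binary file and starting a Word instance per file.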

1 Answer


Save the Word document's data as a Base64 string in the database.

With the Base64 string of the document you can not only store the document, but also open it again later by converting the string back to bytes.

Read the file and encode it, then save the returned string to the database:

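public string GetDocumentBinary()
    {
        // Requires: using System.IO;
        // Placeholder: replace with the real path of the document.
        string docPath = "DocumentPath";
        // Read the raw bytes; a .docx is binary, so do not read it as text.
        byte[] binarydata = File.ReadAllBytes(docPath);
        // Encode the bytes as Base64 text so they fit in a string column.
        string base64 = System.Convert.ToBase64String(binarydata, 0, binarydata.Length);
        return base64;
    }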

Then, when you need to display the document, convert the string back to bytes and (optionally) save it to disk:

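public void SaveBinaryAsDocument(string filePath, string base64String)
    {
        // Decode the Base64 text back into the original bytes.
        byte[] bytes = Convert.FromBase64String(base64String);
        // Write the bytes to disk; the file is identical to the original.
        File.WriteAllBytes(filePath, bytes);
    }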
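
As a quick check of the two steps above, the following sketch encodes a file to Base64 and decodes it into a second file, then compares the bytes. The class name `Base64RoundTrip` and its method names are invented for this example, and the database step is left out on purpose; the string returned by `Encode` is what you would store.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper, for illustration only.
static class Base64RoundTrip
{
    // Reads a file as raw bytes and returns them as Base64 text.
    public static string Encode(string path)
    {
        return Convert.ToBase64String(File.ReadAllBytes(path));
    }

    // Decodes Base64 text and writes the resulting bytes to a file.
    public static void Decode(string path, string base64)
    {
        File.WriteAllBytes(path, Convert.FromBase64String(base64));
    }

    // Returns true when both files contain exactly the same bytes.
    public static bool SameBytes(string pathA, string pathB)
    {
        return File.ReadAllBytes(pathA).SequenceEqual(File.ReadAllBytes(pathB));
    }
}
```

For example, `Base64RoundTrip.Decode(copyPath, Base64RoundTrip.Encode(docPath))` followed by `Base64RoundTrip.SameBytes(docPath, copyPath)` should confirm that nothing is lost. Note that Base64 text is roughly a third larger than the raw bytes, which is worth keeping in mind when sizing the column.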
Hexie
  • Why would you do that rather than just save it as a blob in SQL server? – Matt Burland May 06 '16 at 03:29
  • @MattBurland You could do that; however, from personal experience I prefer Base64. See this write-up as well: http://stackoverflow.com/questions/29284266/mysql-base64-vs-blob – Hexie May 06 '16 at 04:14