We have some content in MS Word .docx format, prepared by our customers. These documents may contain equations, images, etc.

We want to transfer this content to our web environment.

At first, I planned to use the TinyMCE "paste from word" plugin and the fmath editor plugin. That did not work...

Then I decided to add an upload button that transfers the MS Word content and shows the resulting web content in the TinyMCE editor. In effect, this is like writing a new plugin.

I am using the "SaveAs" method of the Microsoft.Office.Interop.Word.Document class, but I have the following problems:

1) I cannot change the path of the document resources folder. Word generates a "..._files" folder in the same location as the generated HTML file. I want to move all resources to the appropriate places on the server.

2) I cannot change the image source paths to absolute paths.

3) The generated HTML file contains too many garbage styles and markup.

I may be going about this in entirely the wrong way, so I decided to ask for your advice before continuing in this direction. I am open to any suggestion.

Regards,

Here is a draft version of the code:



    // Assumes: using System.IO; using System.Web;
    //          using Word = Microsoft.Office.Interop.Word;

    // Keep only the file name so the upload cannot escape the temp folder.
    var fileName = Path.GetFileName(Request["docfilename"]);
    var file = Request.Files[0];
    var root = HttpContext.Current.Server.MapPath(@"~/saveddata/_temp/");
    var path = Path.Combine(root, fileName);

    // Write the whole upload to disk; a single Stream.Read call may return
    // fewer bytes than requested.
    file.SaveAs(path);

    object missing = System.Reflection.Missing.Value;
    object filename = path;
    object path2 = Path.Combine(root, "test.html");
    object fileFormat = Word.WdSaveFormat.wdFormatFilteredHTML;

    Word.Application oWord = new Word.Application();
    Word.Document oDoc = null;
    try
    {
        oDoc = oWord.Documents.Open(ref filename, ref missing, ref missing, ref missing,
            ref missing, ref missing, ref missing, ref missing,
            ref missing, ref missing, ref missing, ref missing,
            ref missing, ref missing, ref missing, ref missing);
        oDoc.Activate();

        // Every optional argument must be passed by ref, including "missing".
        oDoc.SaveAs(ref path2, ref fileFormat, ref missing, ref missing,
            ref missing, ref missing, ref missing, ref missing,
            ref missing, ref missing, ref missing, ref missing,
            ref missing, ref missing, ref missing, ref missing);
    }
    finally
    {
        // Always close the document and quit Word so no WINWORD.EXE process is left behind.
        if (oDoc != null) oDoc.Close(ref missing, ref missing, ref missing);
        oWord.Quit(ref missing, ref missing, ref missing);
    }

EmRe

1 Answer

This is a delicate matter. I faced the same problem, since Word documents carry a lot of style tags. If you share a URL whose page contains Word document content on Facebook, the unwanted tags used to show up in the URL's description/summary :) So I guess the issue persists there too. I would suggest going through the basics of Information Retrieval and trying to strip the style tags intelligently. You will need to write most of your stripping code with regular expressions.
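
As a rough illustration of the regex-based stripping suggested above, here is a minimal sketch. `WordHtmlCleaner` and `Clean` are hypothetical names made up for this example, not part of TinyMCE or the Word interop API, and the patterns are only a guess at what typically clutters Word's HTML output (style blocks, comments, `o:p`-style namespaced tags, and `style`/`class`/`lang` attributes). Regular expressions are not a real HTML parser, so treat this as a starting point rather than a complete cleaner.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class WordHtmlCleaner
{
    private const RegexOptions Opts = RegexOptions.IgnoreCase | RegexOptions.Singleline;

    public static string Clean(string html)
    {
        // Drop whole <style>...</style> blocks.
        html = Regex.Replace(html, @"<style\b[^>]*>.*?</style>", "", Opts);

        // Drop HTML comments, including Word's conditional comments.
        html = Regex.Replace(html, @"<!--.*?-->", "", Opts);

        // Drop namespaced Office tags such as <o:p> and </o:p>.
        html = Regex.Replace(html, @"</?\w+:\w+\b[^>]*>", "", Opts);

        // Drop style, class and lang attributes whose values are
        // double-quoted, single-quoted, or unquoted (e.g. class=MsoNormal).
        // Assumption: such text does not also occur in visible content.
        html = Regex.Replace(
            html,
            @"\s+(?:style|class|lang)\s*=\s*(?:""[^""]*""|'[^']*'|[^\s>]+)",
            "",
            Opts);

        return html;
    }
}
```

For example, `WordHtmlCleaner.Clean("<p class=MsoNormal style='margin:0'>Hi<o:p></o:p></p>")` returns `<p>Hi</p>`. In the question's setup, it could be applied to the text of the generated test.html before loading it into the editor.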

gaurav.singharoy