I have an HTML document that, after being parsed, contains only formatted text. Is it possible to get its text the same way I would if I mouse-selected it, copied it, and pasted it into a new text document?

I know that this is possible with Microsoft.Office.Interop, where there is an .ActiveSelection property that selects the content of the open Word document.

I need to find a way to load the HTML somehow (maybe in a browser object), then copy all of its content and assign it to a string.

var doc = new HtmlAgilityPack.HtmlDocument();
// Read the file using code page 1251 (Windows Cyrillic).
var documentText = File.ReadAllText("myhtmlfile.html", Encoding.GetEncoding(1251));
// Clean up the messy markup before parsing it.
documentText = this.PerformSomeChangesOverDocument(documentText);
doc.LoadHtml(documentText);
var stringWriter = new StringWriter();
// Write the converted text of the whole document into the StringWriter.
AgilityPackEntities.AgilityPack.ConvertTo(doc.DocumentNode, stringWriter);
stringWriter.Flush();
// SelectNodes returns null when the document has no <title> element.
var titleNode = doc.DocumentNode.SelectNodes("//title");
if (titleNode != null)
{
    // Remove the title text from the converted output.
    var titleToBeRemoved = titleNode[0].InnerText;
    document.DocumentContent = stringWriter.ToString().Replace(titleToBeRemoved, string.Empty);
}
else
{
    document.DocumentContent = stringWriter.ToString();
}

Then I return the document object. The problem is that the string is not always formatted the way I want it to be.

Corak
mathinvalidnik
  • Do you have any code to show? What have you tried? – htxryan Aug 27 '13 at 14:37
  • I have tried the HTML Agility Pack. My HTML files are originally a little bit messed up, so I am doing some replacements over them. After that I am trying to write the doc.DocumentNode to a StringWriter. I am going to update my question. – mathinvalidnik Aug 27 '13 at 14:42

1 Answer

You should be able to just use a StreamReader and, as you read each line, write it out using a StreamWriter.

Something like this will read until the end of your file and save it to a new one. If you need to do extra logic on the file, I have inserted a comment to show where to do that.

    private void button4_Click(object sender, EventArgs e)
    {
        // The using blocks flush and close both files, even if an exception is thrown.
        using (System.IO.StreamReader reader = new System.IO.StreamReader("C:\\XXX\\XXX\\XXX\\test.html"))
        using (System.IO.StreamWriter writer = new System.IO.StreamWriter("C:\\XXX\\XXX\\XXX\\test2.html"))
        {
            String line;
            // Read until the end of the file
            while ((line = reader.ReadLine()) != null)
            {
                // You can insert extra logic here if you need to omit lines or change them
                writer.WriteLine(line);
            }
        }
    }

You can also save it to a string and then do whatever you want with it. You can use newlines to keep the same format.
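As a rough sketch of the read-into-a-string variant: the class and method names below are made up for illustration, and the encoding is left to the caller (the question uses code page 1251). It joins the lines with "\n" so the line structure of the file is kept.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper, named here only for illustration.
public static class FileTextSketch
{
    // Reads the file line by line and joins the lines with '\n',
    // so the result keeps the same line structure as the file.
    public static string ReadAsString(string path, Encoding encoding)
    {
        // File.ReadLines streams the lines instead of loading them all up front.
        return string.Join("\n", File.ReadLines(path, encoding));
    }
}
```

A call such as `FileTextSketch.ReadAsString("test.html", Encoding.UTF8)` returns the raw file content, tags included, so any tag handling still has to happen afterwards.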

EDIT: The code below will take your tags into account.

    private void button4_Click(object sender, EventArgs e)
    {
        String line;
        var filetext = new System.Text.StringBuilder();
        bool first = true;
        using (System.IO.StreamReader reader = new System.IO.StreamReader("C:\\XXXX\\XXXX\\XXXX\\test.html"))
        {
            while ((line = reader.ReadLine()) != null)
            {
                // Skip lines that start with a tag (ignoring indentation), they are formatting
                if (line.TrimStart().StartsWith("<", StringComparison.Ordinal))
                {
                    continue;
                }
                // No newline before the first line that is kept
                if (!first)
                {
                    filetext.Append("\n");
                }
                filetext.Append(line);
                first = false;
            }
        }
        System.Diagnostics.Trace.WriteLine(filetext.ToString());
    }
sealz
  • I think OP wants to modify the rendered HTML and not just copy a text file. – Corak Aug 27 '13 at 14:56
  • "possible to get its text like I would do if I was mouse-selecting it + copy + paste in new Text Document?" My answer will do that. If he needs to perform extra logic he can do so in the loop. It seems to me that he has a document that has already been parsed and he just wants to copy that to a new file? Not sure if I am missing something. – sealz Aug 27 '13 at 14:59
  • From his earlier question http://stackoverflow.com/questions/18350208/how-to-copy-html-text-selection-and-assign-it-to-a-string-in-c-sharp I figured that by "parsed" he means "rendered in a browser". – Corak Aug 27 '13 at 15:03
  • @Corak I did not know about the original question. If the OP can comment and say this is not what he wants, I will remove it. – sealz Aug 27 '13 at 15:08
  • @sealz I still don't have the document parsed. You are using .txt files for the reading, but this cannot happen with the .html file. – mathinvalidnik Aug 27 '13 at 15:10
  • @mathinvalidnik You can use .html extensions and it will still read the text in. – sealz Aug 27 '13 at 15:12
  • @sealz Yeah, but I think it is going to read it including the element tags, rather than only the plain text (InnerText). – mathinvalidnik Aug 27 '13 at 15:13
  • @mathinvalidnik Currently, yes, it will include the tags. You will have to add logic for that. You can either check line by line for tags and remove/replace them, or take the long file string and read until you hit your starting or ending tags. – sealz Aug 27 '13 at 15:15
  • I have the feeling that this is going to be lots and lots of logic :) – mathinvalidnik Aug 27 '13 at 15:19
  • @mathinvalidnik You can simply check for brackets before recording each line, as in my edit. The solution is getting jumbled now, but this would work and can definitely be broken down to be easier on the eyes :) – sealz Aug 27 '13 at 15:21
  • With `var titleNode = doc.DocumentNode.SelectNodes("//title");` you got the title node of the document. What would happen if you try `var bodyNode = doc.DocumentNode.SelectSingleNode("//body");` and then look at `bodyNode.InnerText`? – Corak Aug 27 '13 at 15:26
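
The "check for tags and remove them" idea from the comments above can be sketched without any extra library. This is only an approximation under stated assumptions: it uses regular expressions rather than a real HTML parser, the class and method names are hypothetical, and the list of tags treated as line breaks is a guess at what copy and paste from a browser would produce. It will misbehave on text that contains a literal "<" or on comments that contain ">".

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of HtmlAgilityPack or of the answer above.
public static class HtmlTextSketch
{
    // Drops <head>, <script> and <style> elements together with their contents.
    private static readonly Regex NonVisibleElement = new Regex(
        @"<(head|script|style)\b[^>]*>.*?</\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Tags assumed to end a line when the page is rendered.
    private static readonly Regex LineBreakingTag = new Regex(
        @"<\s*(br|/p|/div|/li|/tr|/h[1-6])\b[^>]*>",
        RegexOptions.IgnoreCase);

    // Any remaining tag.
    private static readonly Regex AnyTag = new Regex(@"<[^>]+>");

    public static string ToPlainText(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        string text = NonVisibleElement.Replace(html, string.Empty);
        text = LineBreakingTag.Replace(text, "\n");
        text = AnyTag.Replace(text, string.Empty);

        // Turn entities such as &amp; back into characters.
        text = WebUtility.HtmlDecode(text);
        return text.Trim();
    }
}
```

Because the whole `<head>` element is dropped, the title text does not have to be removed with a separate `Replace` call. For markup as messy as the question describes, parsing with HtmlAgilityPack and reading `InnerText` from the body node, as suggested in the last comment, is likely to be more robust than this.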