
I'm trying to add a title attribute to every tag that has an alt attribute, across 300-400 files. 90% of these files are .asp files and the rest are .aspx, .html and others.

I decided to do this with HtmlAgilityPack (HAP) and wrote a small C# program for it. I list the file names in a .txt file and then loop over that list to load each file. The program works fine, except that HAP keeps adding closing tags and modifies certain other tags. I thought I could live with that and just log the parse errors to another .txt file, but then I noticed that not all of these changes end up in the string I use to collect the error messages: some files have clearly been changed, yet my error log contains no message about those changes.

Mostly what gets added is /tr, /td and /table.

This project is fairly large (these files are just a small part of it), and I really don't want to introduce any changes beyond the ones I need.

First, here are the parts of the program that concern my problem:

    using System.IO;
    using System.Text;
    using HtmlAgilityPack;

    static void Main(string[] args)
    {
        string[] files = File.ReadAllLines(@"filelist.txt");
        StringBuilder errors = new StringBuilder();
        HtmlDocument doc = new HtmlDocument();

        doc.OptionCheckSyntax = false;
        doc.OptionReadEncoding = false;
        doc.OptionOutputOriginalCase = true;
        doc.OptionWriteEmptyNodes = true;
        HtmlNode.ElementsFlags.Remove("option");

        foreach (string file in files)
        {
            doc.Load(file);

            // Select every element that carries an alt attribute.
            HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//*[@alt]");
            if (nodes == null)
            {
                continue; // no alt attributes, so this file is not copied
            }

            foreach (HtmlNode node in nodes)
            {
                // Only add a title where none exists yet. The original check,
                // GetAttributeValue("title", false), parsed the value as a bool
                // and so was true for practically every node.
                if (node.Attributes["title"] == null)
                {
                    node.SetAttributeValue("title", node.GetAttributeValue("alt", ""));
                }
            }

            // Mirror the source tree under C:\SLtmp\, creating folders as needed.
            string newfile = file.Replace("C:\\source\\", "C:\\SLtmp\\");
            Directory.CreateDirectory(Path.GetDirectoryName(newfile));
            doc.Save(newfile);

            foreach (HtmlParseError error in doc.ParseErrors)
            {
                errors.Append(newfile + " (" + error.Line + "," + error.LinePosition + "): " + error.Reason + "\n");
            }
        }
        File.WriteAllText("C:\\tmp\\errors.txt", errors.ToString());
    }

What happens in the end is that HAP adds a closing tag for every tag it finds unclosed in the file it is currently reading, even though those tags may be closed in a different file.

So my question is: is it possible to have HAP apply only the changes I make explicitly and skip any fixes it would otherwise make automatically?
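For comparison, here is a minimal sketch of a text-level alternative that never builds a DOM, so everything outside the edited tags stays byte-for-byte identical. It is not an HAP feature. The class name `TitleFromAlt` and the method `AddTitles` are hypothetical, and the regular expressions rest on simplifying assumptions: alt values are quoted, and no tag contains `<` or `>` inside an attribute value. Tags that break those assumptions (for example an inline `<%= ... %>` inside an attribute) are skipped rather than modified.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of HtmlAgilityPack.
static class TitleFromAlt
{
    // One opening tag: "<", a letter, then anything up to ">" that is not "<" or ">".
    // Closing tags and <% ... %> blocks do not start with a letter, so they never match.
    static readonly Regex Tag = new Regex(@"<[A-Za-z][^<>]*>", RegexOptions.Compiled);

    // A quoted alt attribute preceded by whitespace (so "data-alt" is ignored).
    static readonly Regex Alt =
        new Regex(@"\salt\s*=\s*(""[^""]*""|'[^']*')", RegexOptions.IgnoreCase);

    // Any existing title attribute.
    static readonly Regex Title = new Regex(@"\stitle\s*=", RegexOptions.IgnoreCase);

    public static string AddTitles(string html)
    {
        return Tag.Replace(html, m =>
        {
            string tag = m.Value;
            if (Title.IsMatch(tag)) return tag;   // already has a title
            Match alt = Alt.Match(tag);
            if (!alt.Success) return tag;         // no quoted alt attribute
            // Insert title right after alt, reusing the same quotes and value.
            return tag.Insert(alt.Index + alt.Length, " title=" + alt.Groups[1].Value);
        });
    }
}
```

Under these assumptions, `TitleFromAlt.AddTitles(File.ReadAllText(file))` could stand in for the load, modify and save steps, as long as the file is read and written with the same encoding. Unclosed `tr`, `td` and `table` tags are left alone because the surrounding markup is never parsed.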

Cheran Shunmugavel
OMK
  • Short answer: nope. The Html Agility Pack parses the text and creates an in-memory DOM. It's not really "fixing" things, it just doesn't use the errors to write the text like it was, with all its original errors. It outputs the created DOM. – Simon Mourier Jun 25 '13 at 12:07
  • I was fearing this :) I assume there is no way around this? – OMK Jun 25 '13 at 12:23
  • It's open source, so you can get it and change it, but it's not an easy task. Depending on what you need, you could do: 1) combine the opening+closing files, 2) remember where they split, 3) open the combination with HAP, 4) modify it the way you want and 5) split the result again. – Simon Mourier Jun 25 '13 at 13:27
  • Can you post a bit of what the HTML looks like before and after you changed it? – shriek Jun 25 '13 at 20:11
  • I did kind of work around it by creating a small class in my program containing filename, oldvalue and newvalue. I use HAP to copy the node into oldvalue before I add the title, and copy the node into newvalue after I add it. The problem now is that HAP doesn't only add the title, it also changes how the node itself is written, so an img tag that is written one way in the project can end up written differently in the node. So basically I need to either get the original text from inside HAP, or get its length. – OMK Jun 26 '13 at 06:35
  • Is there no way in HAP to extract the unaltered original tag text that has been used to create a node? The nodes keep getting small changes to them, which in general is a good thing, just not in my situation :P – OMK Jun 26 '13 at 10:03
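
The combine-and-split workaround suggested in the comments could be sketched roughly as follows. This is an untested outline and not a confirmed fix. The class `SplitMergeSketch`, the method `AddTitles` and the marker comment are hypothetical names, and the sketch assumes that the marker text appears in none of the real files and that HAP writes the comment node back out unchanged.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper illustrating steps 1-5 from the comment above.
static class SplitMergeSketch
{
    // Assumed not to occur in any real file; it records where the two files meet.
    const string Marker = "<!--HAP-SPLIT-POINT-->";

    // openingPart: content of the file that opens the tags;
    // closingPart: content of the file that closes them.
    public static string[] AddTitles(string openingPart, string closingPart)
    {
        var doc = new HtmlDocument();
        doc.OptionOutputOriginalCase = true;

        // Steps 1-3: combine both halves, remember the split, parse as one document.
        doc.LoadHtml(openingPart + Marker + closingPart);

        // Step 4: add a title to every element that has alt but no title.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//*[@alt and not(@title)]");
        if (nodes != null)
        {
            foreach (HtmlNode node in nodes)
            {
                node.SetAttributeValue("title", node.GetAttributeValue("alt", ""));
            }
        }

        // Step 5: serialize and split at the marker again.
        string combined = doc.DocumentNode.OuterHtml;
        return combined.Split(new[] { Marker }, StringSplitOptions.None);
    }
}
```

Because the parser then sees tags that are already balanced, it should have fewer closing tags to add. HAP may still rewrite the formatting of individual tags, which is the remaining problem described in the last two comments.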

0 Answers