I am trying to parse the following data from an HTML document using HtmlAgilityPack:

<a href="http://abilene.craigslist.org/">abilene</a> <br>
<a href="http://albany.craigslist.org/"><b>albany</b></a> <br>
<a href="http://amarillo.craigslist.org/">amarillo</a> <br>
...

I would like to parse out the URL and the name of each city into two separate files.

Example:

urls.txt
"http://abilene.craigslist.org/"
"http://albany.craigslist.org/"
"http://amarillo.craigslist.org/"

cities.txt
abilene
albany
amarillo

Here is what I have so far:

    public void ParseHtml()
    {
        //Clear text box
        textBox1.Clear();

        //Managed wrapper around the HTML Document Object Model (DOM).
        HtmlAgilityPack.HtmlDocument hDoc = new HtmlAgilityPack.HtmlDocument();

        //Load file
        hDoc.Load(@"c:\AllCities.html");

        try
        {
            //Execute the input XPath query from text box
            foreach (HtmlNode hNode in hDoc.DocumentNode.SelectNodes(xpathText.Text))
            {
                textBox1.Text += hNode.InnerHtml + "\r\n";
            }
        }
        catch (NullReferenceException nre)
        {
            textBox1.Text += "Can't process XPath query, modify it and try again.";
        }
    }

Any help would be greatly appreciated! Thanks guys!

John

1 Answer


I take it you want to parse them directly from craigslist.org?
Here's how I'd do it:

// Requires: using System; using System.Collections.Generic; using System.IO;
//           using System.Net; using HtmlAgilityPack;
List<string> links = new List<string>();
List<string> names = new List<string>();
HtmlDocument doc = new HtmlDocument();
// Download the page and load the HTML; dispose the client and stream when done
using (WebClient client = new WebClient())
using (Stream stream = client.OpenRead("http://geo.craigslist.org/iso/us"))
{
  doc.Load(stream);
}
// Get all links in the div with id 'list' that have an href attribute
HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//div[@id='list']/a[@href]");
// Or, if you already have just the links saved somewhere:
//HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//a[@href]");
// SelectNodes returns null when nothing matches, so check before iterating
if (linkNodes != null)
{
  foreach (HtmlNode link in linkNodes)
  {
    links.Add(link.GetAttributeValue("href", ""));
    names.Add(link.InnerText); // InnerText drops nested HTML tags such as <b>
  }
}
// Write both lists to files, one entry per line
File.WriteAllText("urls.txt", string.Join(Environment.NewLine, links.ToArray()));
File.WriteAllText("cities.txt", string.Join(Environment.NewLine, names.ToArray()));
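
If you are working from HTML you already have (like the c:\AllCities.html file in the question) rather than the live site, the same extraction could be wrapped in a small helper. This is only a sketch: the class name CityLinkParser and the method Parse are made up for illustration and are not part of HtmlAgilityPack, and it assumes the anchors can be matched with the plain "//a[@href]" query. It also relies on SelectNodes returning null by default when nothing matches, which is likely why the question's loop ends up in the NullReferenceException handler.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper (not from the original post or the library).
public static class CityLinkParser
{
    // Returns one (url, name) pair for every <a href="..."> element in the HTML.
    public static List<KeyValuePair<string, string>> Parse(string html)
    {
        var result = new List<KeyValuePair<string, string>>();

        // Fully qualified to avoid a clash with System.Windows.Forms.HtmlDocument.
        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(html);

        // By default SelectNodes returns null, not an empty collection,
        // when the XPath query matches nothing.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return result;
        }

        foreach (HtmlNode anchor in anchors)
        {
            string url = anchor.GetAttributeValue("href", "");
            // InnerText drops nested tags such as <b>; DeEntitize decodes
            // entities such as &amp; in the city name.
            string name = HtmlEntity.DeEntitize(anchor.InnerText).Trim();
            result.Add(new KeyValuePair<string, string>(url, name));
        }

        return result;
    }
}
```

You could call it as CityLinkParser.Parse(File.ReadAllText(@"c:\AllCities.html")) and then write the keys to urls.txt and the values to cities.txt, the same way the two lists are written above.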
shriek