-6

I am trying to scrape specific HTML tags, including their data, from a Google Products page. I want to get all the <li> tags within this ordered list and put them in a list.

Here is the HTML:

   <td valign="top">
        <div id="center_col">
          <div id="res">
            <div id="ires">
              <ol>
                   <li class="g">
                  <div class="pslires">
                    <div class="psliimg">
                      <a href=
                      "https://www.google.com">
                     </a>
                    </div>

                    <div class="psliprice">
                      <div>
                        <b>$59.99</b> used
                      </div><cite>google auctions</cite>
                    </div>

                    <div class="pslimain">
                      <h3 class="r"><a href=
                      "https://www.google.com">
                      google</a></h3>

                      <div>
                 dummy data     </div>
                    </div>
                  </div>
                </li>

                 <li class="g">
                  <div class="pslires">
                    <div class="psliimg">
                      <a href=
                      "https://www.google.com">
                     </a>
                    </div>

                    <div class="psliprice">
                      <div>
                        <b>$59.99</b> used
                      </div><cite>google auctions</cite>
                    </div>

                    <div class="pslimain">
                      <h3 class="r"><a href=
                      "https://www.google.com">
                      google</a></h3>

                      <div>
                 dummy data     </div>
                    </div>
                  </div>
                </li>

              <li class="g">
                  <div class="pslires">
                    <div class="psliimg">
                      <a href=
                      "https://www.google.com">
                     </a>
                    </div>

                    <div class="psliprice">
                      <div>
                        <b>$59.99</b> used
                      </div><cite>google auctions</cite>
                    </div>

                    <div class="pslimain">
                      <h3 class="r"><a href=
                      "https://www.google.com">
                      google</a></h3>

                      <div>
                 dummy data     </div>
                    </div>
                  </div>
                </li>
                <li class="g">
                  <div class="pslires">
                    <div class="psliimg">
                      <a href=
                      "https://www.google.com">
                     </a>
                    </div>

                    <div class="psliprice">
                      <div>
                        <b>$59.99</b> used
                      </div><cite>google auctions</cite>
                    </div>

                    <div class="pslimain">
                      <h3 class="r"><a href=
                      "https://www.google.com">
                      google</a></h3>

                      <div>
                 dummy data     </div>
                    </div>
                  </div>
                </li>
              </ol>
            </div>
          </div>
        </div>

        <div id="foot">
          <p class="flc" id="bfl" style="margin:19px 0 0;text-align:center"><a href=
          "/support/websearch/bin/answer.py?answer=134479&amp;hl=en">Search Help</a>
          <a href=
          "/quality_form?q=Pioneer+Automotive+PF-555-2000&amp;hl=en&amp;tbm=shop">Give us
          feedback</a></p>

          <div class="flc" id="fll" style="margin:19px auto 19px auto;text-align:center">
            <a href="/">Google&nbsp;Home</a> <a href=
            "/intl/en/ads">Advertising&nbsp;Programs</a> <a href="/services">Business
            Solutions</a> <a href="/intl/en/policies/">Privacy &amp; Terms</a> <a href=
            "/intl/en/about.html">About Google</a>
          </div>
        </div>
      </td>

I want to get all the <li class="g"> tags and the data in each of them. Is that possible?

halfer
Laziale
  • Umm. A regex for all of that??? – Cole Tobin May 21 '12 at 15:11
  • [You can't parse HTML with regex](http://stackoverflow.com/a/1732454/26226) – jrummell May 21 '12 at 15:12
  • Not possible, HTML can't be parsed, it needs to be interpreted. Try googling 'c# data from html' (never done anything like it before, sry) – Alex May 21 '12 at 15:12
  • Check out http://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c Basically: http://htmlagilitypack.codeplex.com Check the example – skarmats May 21 '12 at 15:12
  • When you say you want all the "Tags" do you mean HTML tags? How deep do you want to go? Is there any specific format it should follow? I would also suggest removing the divs around the edge of the list; it made it a bit hard to understand what you were actually on about... – Stuart.Sklinar May 21 '12 at 15:12
  • Classic answer: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Chris Dworetzky May 21 '12 at 15:14

3 Answers

2

Instead of a regex, an XML parser may suit your situation better. Load the markup into an XML document and then use something like SelectNodes to pull out the data you are looking for.

http://msdn.microsoft.com/en-us/library/4bektfx9.aspx
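As a rough sketch of that idea, the snippet below loads a fragment into an `XmlDocument` and queries it with `SelectNodes`. It assumes the markup is already well-formed XML, which real pages often are not (an entity such as `&nbsp;` is enough to make `LoadXml` throw). The class name `XmlListItems`, the method name `Extract`, and the trimmed-down sample string are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XmlListItems
{
    // Returns the outer XML of every <li class="g"> element in a well-formed fragment.
    public static List<string> Extract(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the markup is not well-formed XML

        var results = new List<string>();
        foreach (XmlNode node in doc.SelectNodes("//li[@class='g']"))
        {
            results.Add(node.OuterXml);
        }
        return results;
    }

    public static void Main()
    {
        // Hypothetical, trimmed-down version of the markup in the question.
        string xhtml =
            "<ol>" +
            "<li class=\"g\"><b>$59.99</b> used</li>" +
            "<li class=\"g\"><b>$59.99</b> used</li>" +
            "</ol>";

        foreach (string item in Extract(xhtml))
        {
            Console.WriteLine(item);
        }
    }
}
```

The XPath `//li[@class='g']` only matches when the class attribute is exactly `g`; plain `//li` would return every list item instead.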

tam
  • See my comment on the OP. There is a library that is more specific to HTML and more tolerant of errors in the source - HTMLAgilityPack – skarmats May 21 '12 at 15:33
  • I will keep that in mind for future endeavors thanks! – tam May 21 '12 at 15:59
1

I wouldn't use regex for this particular problem.

Instead I would attack it thus:

1) Save the page as an HTML string.
2) Use the aforementioned HtmlAgilityPack or HTML Tidy (my preference) to convert it to XML.
3) Use XDocument to navigate the DOM by tag and save the data.

Trying to create a regex to extract data from a possibly fluid HTML page will break your heart.
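Step 3 might look something like the sketch below. It assumes steps 1 and 2 have already produced a well-formed XML string, so the conversion itself is not shown. The names `XDocumentListItems` and `Extract` and the sample string are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class XDocumentListItems
{
    // Returns every <li class="g"> element found in an already-cleaned XML string.
    public static List<XElement> Extract(string xml)
    {
        XDocument doc = XDocument.Parse(xml);

        // Assumption: the elements carry no XML namespace. If the converter emits
        // XHTML with a default namespace, the name "li" must be qualified with it.
        return doc.Descendants("li")
                  .Where(li => (string)li.Attribute("class") == "g")
                  .ToList();
    }

    public static void Main()
    {
        // Hypothetical output of the HTML-to-XML conversion step.
        string xml =
            "<ol>" +
            "<li class=\"g\"><div class=\"psliprice\"><b>$59.99</b> used</div></li>" +
            "<li class=\"g\"><div class=\"psliprice\"><b>$59.99</b> used</div></li>" +
            "</ol>";

        foreach (XElement item in Extract(xml))
        {
            // Value concatenates all text inside the element.
            Console.WriteLine(item.Value);
        }
    }
}
```

Returning `XElement` objects rather than strings keeps each item navigable, so the price or link can be read from it later with further `Descendants` calls.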

Totero
0

Instead of using regex you can use HtmlAgilityPack to parse the HTML.

using HtmlAgilityPack;

// html is assumed to hold the page source as a string
var doc = new HtmlDocument();
doc.LoadHtml(html);
var listItems = doc.DocumentNode.SelectNodes("//li");

The code above will give you all <li> items in the document. To add them to a list you'll just have to iterate the collection and add each item to the list.
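The iteration step could be sketched roughly as follows. It assumes the HtmlAgilityPack package is referenced; the class name `ListItemScraper`, the method name `Extract`, and the sample markup are invented for the example. The XPath is narrowed to `//li[@class='g']` to match the items in the question, and the null check is there because `SelectNodes` returns null rather than an empty collection when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ListItemScraper
{
    // Collects every <li class="g"> node from an HTML string into a List<HtmlNode>.
    public static List<HtmlNode> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var items = new List<HtmlNode>();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//li[@class='g']");

        // SelectNodes returns null when no node matches the XPath.
        if (nodes != null)
        {
            foreach (HtmlNode node in nodes)
            {
                items.Add(node);
            }
        }
        return items;
    }

    public static void Main()
    {
        // Hypothetical, trimmed-down version of the markup in the question.
        string html =
            "<ol>" +
            "<li class=\"g\"><b>$59.99</b> used</li>" +
            "<li class=\"g\"><b>$59.99</b> used</li>" +
            "</ol>";

        foreach (HtmlNode item in Extract(html))
        {
            // InnerText gives the text content; OuterHtml would give the full markup.
            Console.WriteLine(item.InnerText.Trim());
        }
    }
}
```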

RePierre