I've been working on scraping the following site: http://www.fightingillini.com/schedule.aspx?path=softball

I've had extensive experience using node/cheerio/scraperjs to scrape both static and dynamic content in the past, but I'm not having any luck cracking this site.

        var scraperjs = require('scraperjs');

        scraperjs.DynamicScraper.create('http://www.fightingillini.com/calendar.ashx/calendar.rss?sport_id=9')
            .scrape(function() {
                // Collect the text of the <title> child of every <item>
                return $('item').map(function() {
                    return $(this).children('title').text();
                }).get();
            }, function(list) {
                // Print the scraped titles
                console.log(list);
            });

Any help/feedback/suggestions on libraries to use would be really appreciated! Thanks!

Mark

1 Answer

ASP.NET Web Forms pages can be notoriously difficult to scrape because of the complicated ViewState hidden form input. Sometimes that's even a feature ;)

In this case, I would go for the RSS feed, found via one of the links on the page you are trying to scrape:

http://www.fightingillini.com/calendar.ashx/calendar.rss?sport_id=9

The link gives you the same content, but in a much friendlier, standard XML format. The parsing code will likely be simpler and easier to get right. Most of all, the feed format should stay stable, whereas on the regular page even small tweaks to the site theme could throw your parsing code off.

The point is that RSS links are, in a sense, made for scraping, so look there first.

Here's an example of one of the current entries:

<item>
<title>2/6 11:30 AM [L] Softball vs  Winthrop</title>
<description>L 1-5 http://www.fightingillini.com/calendar.aspx?id=8670</description>
<link>http://www.fightingillini.com/calendar.aspx?id=8670</link>
<guid isPermaLink="true">http://www.fightingillini.com/calendar.aspx?id=8670</guid>
<ev:gameid>8670</ev:gameid>
<ev:location>Athens, Ga.</ev:location>
<ev:startdate>2015-02-06T17:30:00.0000000Z</ev:startdate>
<ev:enddate>2015-02-06T20:30:00.0000000Z</ev:enddate>
<s:localstartdate>2015-02-06T11:30:00.0000000</s:localstartdate>
<s:localenddate>2015-02-06T14:30:00.0000000</s:localenddate>
<s:teamlogo>http://www.fightingillini.com/images/logos/site/site.png</s:teamlogo>
<s:opponentlogo>http://www.fightingillini.com/images/logos/z16.png</s:opponentlogo>
<s:links>
</s:links>
</item>
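
To show how little code an entry like this needs, here is a rough, dependency-free sketch that pulls a few fields out of one `<item>`. The helper names `tagText` and `parseItem` are made up for illustration, and the sample is a trimmed copy of the entry above. A regular expression is only acceptable here because the item is flat; for the whole feed a proper XML parser is the safer choice.

```javascript
// Hypothetical helper: returns the trimmed text of the first <tag>...</tag>
// found in xml, or null when the tag is absent.
function tagText(xml, tag) {
  const pattern = new RegExp('<' + tag + '(?:\\s[^>]*)?>([\\s\\S]*?)</' + tag + '>');
  const match = pattern.exec(xml);
  return match ? match[1].trim() : null;
}

// Hypothetical helper: maps one <item> block to a plain object.
// The start date is kept as the raw string from the feed.
function parseItem(itemXml) {
  return {
    title: tagText(itemXml, 'title'),
    link: tagText(itemXml, 'link'),
    location: tagText(itemXml, 'ev:location'),
    startUtc: tagText(itemXml, 'ev:startdate'),
  };
}

// Trimmed copy of the sample entry shown above.
const sampleItem = [
  '<item>',
  '<title>2/6 11:30 AM [L] Softball vs  Winthrop</title>',
  '<link>http://www.fightingillini.com/calendar.aspx?id=8670</link>',
  '<ev:location>Athens, Ga.</ev:location>',
  '<ev:startdate>2015-02-06T17:30:00.0000000Z</ev:startdate>',
  '</item>',
].join('\n');

const game = parseItem(sampleItem);
console.log(game.title, '@', game.location);
```

The same fields could be read with cheerio, which you already use, as long as the feed is loaded as XML rather than HTML.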

The page also has an iCal link, if that works better for you.

Joel Coehoorn
  • Thanks for the reply! I actually tried looking into the RSS feed, but that seemed to have the same problem as the other page: it returned an empty body. It seems like it may be a lost cause. – Mark Jul 13 '15 at 20:35
  • Try enclosing the url in the code in single quotes: `scraperjs.DynamicScraper.create('http://www.fightingillini.com/....').` – Joel Coehoorn Jul 13 '15 at 20:41
  • I've been attempting to scrape the rss feeds with the regular request module... `request('http://www.fightingillini.com/calendar.ashx/calendar.rss?sport_id=9', cb1);` and I'm still turning up an empty body... thanks for all the help – Mark Jul 13 '15 at 21:32
  • Found the solution: I needed to add a 'User-Agent' header in order to scrape the page. For anyone interested, here was my final request: `request({ url: 'http://www.fightingillini.com/calendar.ashx/calendar.rss?sport_id=9', headers: { 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X)' } }, cb1);` – Mark Jul 14 '15 at 01:07
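
For anyone who would rather not depend on the `request` module, the fix from the last comment can be sketched with Node's built-in `http` module. This is only an illustration: `buildFeedOptions` and `fetchFeed` are hypothetical helper names, and whether the site still rejects requests that lack a browser-like `User-Agent` is an assumption based on the comment above.

```javascript
const http = require('http');

// Hypothetical helper: turns a feed URL into http.get options that carry
// a browser-like User-Agent header.
function buildFeedOptions(feedUrl, userAgent) {
  const { hostname, pathname, search } = new URL(feedUrl);
  return {
    hostname,
    path: pathname + search,
    headers: { 'User-Agent': userAgent },
  };
}

// Hypothetical helper: fetches the feed and hands the status code and body
// to a Node-style callback (err, statusCode, body).
function fetchFeed(feedUrl, userAgent, callback) {
  http.get(buildFeedOptions(feedUrl, userAgent), (res) => {
    let body = '';
    res.setEncoding('utf8');
    res.on('data', (chunk) => { body += chunk; });
    res.on('end', () => callback(null, res.statusCode, body));
  }).on('error', callback);
}

// URL and User-Agent string taken from the comment above.
const FEED_URL = 'http://www.fightingillini.com/calendar.ashx/calendar.rss?sport_id=9';
const USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X)';
const options = buildFeedOptions(FEED_URL, USER_AGENT);
```

Calling `fetchFeed(FEED_URL, USER_AGENT, (err, status, body) => console.log(err || status, body))` would then perform the actual request. If the body still comes back empty, comparing `options.headers` with what a browser sends is a reasonable next step.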