
I have no problem accessing the website in a browser, but when I try to access it programmatically for scraping, I get the following error.

The remote server returned an error: (500) Internal Server Error.

Here is the code I'm using.

using System;
using System.IO;
using System.Net;

string strURL1 = "http://www.covers.com/index.aspx";
WebRequest req = WebRequest.Create(strURL1);

// Get the stream from the returned web response
StreamReader stream = new StreamReader(req.GetResponse().GetResponseStream());
System.Text.StringBuilder sb = new System.Text.StringBuilder();
string strLine;
// Read the stream a line at a time and append each non-empty line to the StringBuilder
while ((strLine = stream.ReadLine()) != null)
{
  if (strLine.Length > 0)
    sb.Append(strLine + Environment.NewLine);
}

stream.Close();

This one has me stumped. TIA

Trey Balut

2 Answers


It's the user agent.

Many sites, like the one you're attempting to scrape, validate the User-Agent string in an attempt to stop you from scraping them. As it has here, that check quickly trips up anyone making a plain, unadorned request. It's not a very solid way of stopping a scrape, but it stumps some people.

Setting the User-Agent string will work. Change the code to:

HttpWebRequest req = (HttpWebRequest)WebRequest.Create(strURL1);
req.UserAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"; // Chrome user agent string

...and it will be fine.
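For completeness, here is a minimal sketch of the question's loop with that change applied. It assumes the same URL and classic WebRequest API as in the question; the user-agent value is just the Chrome string above and any realistic browser string should do:

using System;
using System.IO;
using System.Net;
using System.Text;

class Program
{
    static void Main()
    {
        string strURL1 = "http://www.covers.com/index.aspx";

        // Cast to HttpWebRequest so the UserAgent property is available
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create(strURL1);
        req.UserAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36";

        StringBuilder sb = new StringBuilder();

        // Dispose the response and reader when done
        using (WebResponse resp = req.GetResponse())
        using (StreamReader reader = new StreamReader(resp.GetResponseStream()))
        {
            string strLine;
            while ((strLine = reader.ReadLine()) != null)
            {
                if (strLine.Length > 0)
                    sb.AppendLine(strLine);
            }
        }

        Console.WriteLine(sb.Length + " characters downloaded");
    }
}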

Simon Whitehead

It looks like it's doing some sort of user-agent checking. I was able to replicate your problem in PowerShell, but I noticed that the PowerShell cmdlet Invoke-WebRequest was working fine.

So I hooked up Fiddler, reran it, and stole the user-agent string out of Fiddler.

Try setting the UserAgent property to: Mozilla/5.0 (Windows NT; Windows NT 6.2; en-US) WindowsPowerShell/4.0
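Applied to the code in the question, that would look something like this (a sketch; the only changes from the original are the cast and the UserAgent assignment):

// Cast to HttpWebRequest so the UserAgent property can be set,
// then present the PowerShell user-agent string captured in Fiddler
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(strURL1);
req.UserAgent = "Mozilla/5.0 (Windows NT; Windows NT 6.2; en-US) WindowsPowerShell/4.0";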

Daniel Mann