
I was wondering what would be the most ethical way to consume some bytes of content (386, precisely) from a given Site A, using an application (e.g., on Google App Engine) on some Site B, but doing it right. No scraping intended: I really just need to check the status of a public service, and they currently don't provide any API. The markup on Site A has a JavaScript array with the info I need, and being able to access it, say, once every five minutes would suffice.
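For concreteness, the kind of extraction I have in mind might look like this. The variable name `serviceStatus` and the array format are made up, since the real markup may differ:

```python
import json
import re

# Hypothetical snippet of Site A's markup containing the status array.
# In practice, `html` would be the fetched page body.
html = 'var serviceStatus = ["line1:ok", "line2:delayed"];'

# Pull out the array literal and parse it; the format here happens to
# be valid JSON, which won't always be true of real-world JavaScript.
match = re.search(r'var\s+serviceStatus\s*=\s*(\[.*?\]);', html)
status = json.loads(match.group(1)) if match else None
```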

Any advice will be much appreciated.

UPDATE:

First of all, thanks very much for the feedback. Site A is basically the website of the company that currently runs our public subway network. I'm planning to develop a tiny free Android app that gives anyone not only a map of the whole network and its stations, but also updated information about the availability of the service (those are the bytes I will eventually be consuming), etcetera.


5 Answers


There will be some very different points of view, but hopefully here is some food for thought:

  1. Ask the site owner first; if they know ahead of time, they are less likely to be annoyed.
  2. Is the content on Site A accessible on a public part of the site, i.e. without the need to log in?
  3. If the answer to #2 is that it is public content, then I wouldn't see an issue, as scraping the site for that information is really no different than pointing your browser at the site and reading it yourself.
  4. Of course, the answer to #3 depends on how the site is monetised. If Site A displays advertisements to generate revenue, then it might not be a good idea to start scraping content, as you would be bypassing how the site makes money.

I think the most important thing to do is talk to the site owner first, and determine straight from them:

  1. Whether it is OK for you to be scraping content from their site.
  2. Whether they have an API in the pipeline (simply highlighting the desire may prompt them to consider it).

Just my point of view...

    All good points. I'd add: Offer to attribute the source, with a link. (And if you do this without asking permission, do that as a matter of course. And expect, if you do this without permission, you may get blocked eventually. Every five minutes isn't a DoS, but it's still suspicious activity that might well get blocked by admins.) – T.J. Crowder Jun 18 '11 at 07:03
  • @TJ - Add that as an answer so we can upvote you – Matthew Abbott Jun 18 '11 at 07:05
  • An extra point to add to your list - poll as infrequently as practical. – Nick Johnson Jun 20 '11 at 01:49
  • Don't forget to cache some of the static information so you don't need to read their website too often. – Rudy Jun 20 '11 at 06:32

Update (4 years later): The question specifically asks about the ethical side of the problem, which is why this old answer is written the way it is.

Typically in such a situation you contact them.

If they don't like it, then ethically you can't do it (legally it's another story, depending on whether the site provides a license, what login/anonymity or other access restrictions they have, whether you have to use test/fake data, etc.).

If they allow it, they may provide an API (which might involve costs; it will be up to you to determine how much the feature is worth to your app), or promise some sort of expected behavior for you, which might itself be scraping, or whatever other option they decide on.

If they allow it but aren't ready to help make it easier, then scraping (with its other downsides still applicable) will be all right, at least "ethically".

  1. Use a user-agent header which identifies your service.
  2. Check their robots.txt (and re-check it at regular intervals, e.g. daily).
  3. Respect any Disallow in a record that matches your user agent (be liberal in interpreting the name). If there is no record for your user-agent, use the record for User-agent: *.
  4. Respect the (non-standard) Crawl-delay, which tells you how many seconds you should wait before requesting a resource from that host again.
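The steps above can be sketched with Python's standard `urllib.robotparser`. The user-agent string and the sample robots.txt here are hypothetical; in a real deployment you would fetch the live file with `set_url()` and `read()`, and re-fetch it daily:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical user-agent identifying the service (point 1).
UA = "SubwayStatusBot"

rp = RobotFileParser()
# Parse a sample robots.txt inline to keep the sketch self-contained;
# in production: rp.set_url("https://site-a.example/robots.txt"); rp.read()
rp.parse("""\
User-agent: *
Disallow: /private/
Crawl-delay: 10
""".splitlines())

# Point 3: respect Disallow rules (falls back to the User-agent: * record).
allowed = rp.can_fetch(UA, "https://site-a.example/status")

# Point 4: honor the non-standard Crawl-delay if present (None otherwise).
delay = rp.crawl_delay(UA)
```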

I would not touch it without first emailing the site admin and getting their written permission. That said, if you're consuming the content but not extracting value beyond what a single user gets when viewing the data, it's arguable that you wouldn't be in violation of any terms of use they have. If, however, you get noteworthy value beyond what a single user would get (say, your results end up providing value to 100 times as many users on your own site), then I'd say you need express permission to sleep well at night.

All that's off, however, if the info is already in the public domain (and you can prove it), or if the data you need from them is under some type of open license, such as a GNU license.

Then again, the web is nothing without links to others' content. We all capture and re-post material on various forums: we read an article on CNN, then comment on it in an online forum, maybe quote the article, and provide a link back to it. It just depends on how flexible and open-minded the site's admin and owner are. But really, to avoid being sued if push comes to shove, I'd get permission.


"no scraping intended" - You are intending to scrape. =)

The only reasonable ethics-based reasons one should not take it from their website are:

  1. They may wish to display advertisements or important security notices to users
  2. This may make their statistics inaccurate

In terms of hammering their site, it is probably not an issue. But if it is:

  • You probably wish to scrape the minimal amount necessary (e.g. make the minimal number of HTTP requests), and not hammer the server too often.
  • You probably do not wish to have all your apps query the website directly; instead, have your own website query them via a cron job. This gives you better control if they change their formatting, and lets you show "service currently unavailable" errors to your users just by changing your website. It introduces another point of failure, but it's probably worth it; this way, if there's a bug, people don't need to update their apps.
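A minimal sketch of that middle-tier idea: your own server caches the upstream value and refreshes it at most once per interval, so app traffic never hits Site A directly. The status URL and interval are assumptions; a real App Engine deployment would use a scheduled cron job and a datastore rather than an in-memory dict:

```python
import time
import urllib.request

STATUS_URL = "https://site-a.example/status"  # hypothetical upstream page
POLL_INTERVAL = 300  # seconds: poll at most once every five minutes

_cache = {"fetched_at": 0.0, "status": None}

def get_status(now=None, fetch=None):
    """Return the cached status, refreshing it only if it is stale."""
    now = time.time() if now is None else now
    if now - _cache["fetched_at"] >= POLL_INTERVAL:
        try:
            _cache["status"] = (fetch or _default_fetch)(STATUS_URL)
            _cache["fetched_at"] = now
        except Exception:
            # Upstream failed: keep serving the last known value, or let
            # callers show "service currently unavailable" if it is None.
            pass
    return _cache["status"]

def _default_fetch(url):
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()
```

Note that `get_status` accepts injectable `now` and `fetch` parameters purely so the caching behavior can be exercised without network access.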

But the best thing you can do is to talk to the website, asking them what is best. They may have a hidden API they would allow you to use, and perhaps have allowed others to use as well.
