I think too much significance is given to the point that it is 'impossible' to prevent a determined and technically savvy user from scraping a website. @Drew Noakes states that the website contains information that, taken in aggregate, has some 'value'. If a website has aggregate data that is readily accessible to unconstrained anonymous users, then yes, preventing scraping may be near 'impossible'.
I would suggest that the problem to be solved is not how to prevent users from scraping the aggregate data, but rather what approaches could be used to remove the aggregate data from public access, thereby eliminating the target of the scrapers without the need to do the 'impossible': prevent scraping.
The aggregate data should be treated like proprietary company information. Proprietary company information is generally not available publicly to anonymous users in aggregate or raw form. I would argue that the way to prevent the taking of valuable data is to restrict and constrain access to the data, not to prevent scraping of whatever is presented to the user.
1] User accounts/access – no one should ever have access to all the data within a given time period (data/domain specific). Users should be able to access the data that is relevant to them, but clearly, from the question, no user would have a legitimate purpose to query all the aggregate data. Without knowing the specifics of the site, I suspect that a legitimate user needs only some small subset of the data within some time period. Requests that significantly exceed typical user needs should be blocked, or alternatively throttled so as to make scraping prohibitively time consuming and the scraped data potentially stale.
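As an illustration only, here is a minimal per-user quota sketch; the names (QuotaTracker, MAX_ROWS_PER_DAY, the 24-hour window) are hypothetical and would need to be tuned to what a legitimate user of your domain actually needs:

```python
# Minimal sketch of a per-user quota check, assuming all requests pass through
# a single gateway. All names and limits below are illustrative, not prescriptive.
import time
from collections import defaultdict, deque

MAX_ROWS_PER_DAY = 5_000        # tune to what a legitimate user actually needs
WINDOW_SECONDS = 24 * 60 * 60   # sliding 24-hour window

class QuotaTracker:
    def __init__(self):
        # per-user deque of (timestamp, rows_returned) records
        self._usage = defaultdict(deque)

    def allow(self, user_id: str, rows_requested: int) -> bool:
        now = time.time()
        window = self._usage[user_id]
        # drop usage records that have fallen outside the window
        while window and now - window[0][0] > WINDOW_SECONDS:
            window.popleft()
        used = sum(rows for _, rows in window)
        if used + rows_requested > MAX_ROWS_PER_DAY:
            return False  # block, or queue/throttle, the request
        window.append((now, rows_requested))
        return True
```

Whether you block outright or merely slow the response down is a product decision; either way the scraper's cost per row goes up sharply.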
2] Operations teams often monitor metrics to ensure that large, distributed, complex systems are healthy. Unfortunately, it becomes very difficult to identify the causes of sporadic and intermittent problems, and often it is even difficult to tell that there is a problem at all, as opposed to normal operational fluctuation. Operations teams therefore compare statistically analysed historical data from numerous metrics against current values to spot significant deviations in system health, whether in uptime, load, CPU utilization, etc.
Similarly, requests from users for data in amounts significantly greater than the norm can help identify individuals who are likely scraping data; such an approach can be automated and even extended to look across multiple accounts for patterns that indicate scraping: user 1 scrapes 10%, user 2 scrapes the next 10%, user 3 scrapes the next 10%, and so on. Patterns like that (and others) can provide strong indicators of malicious use of the system by a single individual or a group operating multiple accounts.
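As a rough sketch of that kind of automated check, the following compares each account's daily volume to a historical baseline and flags groups of accounts whose requests together cover most of the dataset. The function names, thresholds, and data shapes are all assumptions:

```python
# Hedged sketch: flag accounts whose daily volume deviates sharply from the
# historical baseline, and detect groups of accounts whose requested ranges
# jointly cover most of the data partitions. Thresholds are illustrative.
import statistics

def volume_outliers(daily_rows_by_user: dict[str, int],
                    baseline_rows: list[int],
                    z_threshold: float = 3.0) -> list[str]:
    # Flag users whose daily volume is more than z_threshold standard
    # deviations above the historical mean.
    mean = statistics.mean(baseline_rows)
    stdev = statistics.pstdev(baseline_rows) or 1.0
    return [user for user, rows in daily_rows_by_user.items()
            if (rows - mean) / stdev > z_threshold]

def coordinated_coverage(partitions_by_user: dict[str, set[int]],
                         total_partitions: int,
                         coverage_threshold: float = 0.8) -> bool:
    # If a handful of accounts together touch most partitions of the data,
    # treat it as one actor splitting the scrape across multiple accounts.
    touched = set().union(*partitions_by_user.values()) if partitions_by_user else set()
    return len(touched) / total_partitions >= coverage_threshold
```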
3] Do not make the raw aggregate data directly accessible to end users. Specifics matter here, but simply put, the data should reside on back-end servers and be retrieved through some domain-specific API. Again, I am assuming that you are not just serving up raw data, but rather responding to user requests for subsets of it. For example, if the data you have is detailed population demographics for a particular region, a legitimate end user would be interested in only a subset of that data: say, the addresses of households with teenagers living with both parents in multi-unit housing, or the data for a specific city or county. Such a request requires processing the aggregate data to produce a result set that is of interest to the end user. It would be prohibitively difficult to scrape every result set produced by the numerous possible permutations of the input query and reconstruct the aggregate data in its entirety. A scraper would also be constrained by the website's security, taking into account the number of requests per unit of time, the total size of the result set, and other potential markers. A well-developed API that incorporates domain-specific knowledge is critical to ensure the API is comprehensive enough to serve its purpose but not so general that it returns large raw data dumps.
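To make the idea concrete, here is a hedged sketch of such a narrow endpoint, using Flask purely as an example framework; the route, parameters, and the query_households helper are hypothetical and stand in for your real domain-specific back end:

```python
# Sketch of a narrow, domain-specific endpoint: it requires a scoped filter
# and hard-caps the result size so no single call returns a bulk data dump.
from flask import Flask, request, jsonify, abort

app = Flask(__name__)
MAX_RESULT_ROWS = 500  # hard cap on any single response

def query_households(city, housing_type, limit):
    """Placeholder for the real back-end query (hypothetical)."""
    return []  # the real implementation would filter the aggregate data server-side

@app.route("/households")
def households():
    city = request.args.get("city")
    housing_type = request.args.get("housing_type")  # e.g. "multi-unit"
    if not city:
        abort(400, "A specific city is required; unscoped queries are rejected.")
    rows = query_households(city=city, housing_type=housing_type,
                            limit=MAX_RESULT_ROWS + 1)
    if len(rows) > MAX_RESULT_ROWS:
        abort(422, "Query too broad; narrow the filters and try again.")
    return jsonify(rows)
```

The point is not the framework but the shape of the interface: every request must be scoped to a legitimate question, and responses are processed result sets, never slices of the raw aggregate.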
The incorporation of user accounts into the site, the establishment of usage baselines for users, the identification and throttling (or other mitigation) of users who deviate significantly from typical usage patterns, and the creation of an interface for requesting processed/digested result sets (rather than raw aggregate data) would create significant complexity for malicious individuals intent on stealing your data. It may be impossible to prevent scraping of website data, but that 'impossibility' is predicated on the aggregate data being readily accessible to the scraper. You can't scrape what you can't see. So unless your aggregate data is raw, unprocessed text (for example, library e-books), end users should not have access to it in raw form. Even in the library e-book example, significant deviation from acceptable usage patterns, such as requesting a large number of books in their entirety, should be blocked or throttled.