
Let's say I have a WordPress site with many blog posts. I discovered there is someone out there copying the content of every page of my site and pasting it on his own site. I believe he is not doing it manually, as the amount is huge. I tried to replicate what he did and found it's actually quite easy to do the same in PHP with cURL and some DOM parsing (given that I know the class name of the element where the useful text resides).
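To give an idea of how little effort this takes, here is roughly what I mean; the URL and the `entry-content` class are only placeholders (that class just happens to be what most default WordPress themes use for the post body):

    <?php
    // Rough sketch of the scraper: fetch a post with cURL and pull the
    // text out of the post container with DOMXPath. URL and class name
    // are placeholders.
    $ch = curl_init('https://example.com/some-blog-post/');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0');  // look like a browser
    $html = curl_exec($ch);
    curl_close($ch);

    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // real-world HTML is rarely valid XML
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//div[contains(@class, "entry-content")]') as $node) {
        echo trim($node->textContent), PHP_EOL;
    }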

Is there any way to prevent this, or at least make it harder for them to do it in the future? Thanks!

user2335065
  • Despite the downvotes, I don't think it's impossible to at least increase the difficulty of it. I can curl my own site, but for some fiction websites in China I only get garbage through cURL. They are not using JavaScript to unscramble scrambled text as suggested in the answer, because even if I disable JS in my browser I can still view the normal version of the site with no problem. I will try to see how they achieve this and may post an answer if I find out. – user2335065 Sep 09 '15 at 14:09

3 Answers


Remember that whatever information you expose to viewers of your site is always retrieved programmatically. All web browsers connect to the web server and request information using HTTP.

You could try to block the user agent of whatever software he is using (if it provides a user agent at all), but this would likely be in vain. Your blog posts are exposed to the public because you intend for the public to read them. Once this information is client-side, you have no further control over it.

Luke Joshua Park

Since browsers are just machines downloading your content to show it to the user, there is really nothing you can do to completely prevent it.

There are things that you can do to make it more difficult, but they also increase the risk that your normal readers experience some problems.

Here are some ideas that I have seen in the past:

  1. Images: Not suitable for complete articles, but still popular for things like e-mail addresses: don't put the text itself up, but an image of the text.

  2. Scrambled text: Publish a scrambled version of the content that gets unscrambled with JavaScript. Anybody who pulls the content with cURL or something similar won't execute the JavaScript and gets only garbage (see the first sketch after this list).

  3. Mutating images: Often those copycats fetch images and other media directly from the original source. You can check the Referer header on your server and serve them different images, e.g. an image with the message "This content was stolen from ..." (see the second sketch below).

  4. Hire a lawyer and sue them. This might be difficult, especially when international law is involved, but I have seen it done successfully.
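
For illustration, a minimal sketch of idea 2, using plain Base64 as the "scrambling" step (any reversible transform would do; note that atob() only handles Latin-1 cleanly, so real content needs a UTF-8-aware scheme):

    <?php
    // Sketch of idea 2: emit the post body Base64-encoded and decode it in
    // the browser. Anyone pulling the raw HTML with cURL only sees the blob.
    $post    = 'The article text that should be hard to scrape.';
    $encoded = base64_encode($post);
    ?>
    <div id="post-body" data-scrambled="<?php echo htmlspecialchars($encoded); ?>"></div>
    <script>
        // Runs in the reader's browser; curl and friends never execute it.
        var el = document.getElementById('post-body');
        el.textContent = atob(el.getAttribute('data-scrambled'));
    </script>

The drawback mentioned above applies in full: readers with JavaScript disabled see nothing, and a scraper who notices the trick can simply run the decode step himself.

And a sketch of idea 3, routing image requests through a small script that checks the Referer header (the domain and file names are placeholders):

    <?php
    // Sketch of idea 3: hotlinked requests from a foreign page get a
    // "stolen from ..." notice instead of the real image.
    $referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';

    // Treat an empty Referer as legitimate: many browsers and privacy tools
    // suppress it, and you don't want to punish your own readers.
    $foreign = ($referer !== '' && stripos($referer, 'myblog.example') === false);

    header('Content-Type: image/png');
    // In practice you would map a requested image name to a whitelisted file.
    readfile($foreign ? 'images/stolen-notice.png' : 'images/real-photo.png');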

Jens Schauder

If the person scraping your site is not doing much configuration with cURL, then you could use some user-agent string parsing to detect a cURL user and throw a 404, or do whatever other kind of handling you want. (More information: http://www.useragentstring.com/pages/curl/)
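
A minimal check along those lines, assuming the scraper kept cURL's defaults (the command-line tool sends a `curl/x.y.z` user agent, and PHP's cURL sends no User-Agent header at all):

    <?php
    // Sketch: turn away requests that look like stock cURL. This only
    // catches scrapers that didn't bother to set CURLOPT_USERAGENT.
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

    if ($ua === '' || stripos($ua, 'curl/') === 0) {
        header('HTTP/1.1 404 Not Found');
        exit;
    }
    // ...otherwise render the page as usual.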

Keep in mind, however, that cURL lets you craft requests and spoof your user agent and most other details of a web request, so that it becomes indistinguishable from regular web traffic.

Other than that, you could block the specific person's IP address, but that is a very specific fix and does not address the wider concern of anyone scraping your content.
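
If you do go that route, the check itself is trivial (the address below is a documentation-range placeholder):

    <?php
    // Sketch: reject known-bad addresses. A real setup would keep a small
    // blocklist, and a determined scraper can simply switch IPs.
    $blocked = array('203.0.113.42');

    if (in_array($_SERVER['REMOTE_ADDR'], $blocked, true)) {
        header('HTTP/1.1 403 Forbidden');
        exit;
    }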

Damon Swayn