
I am currently trying to analyze the traffic of a website.

Besides specifics regarding the requested resource and timestamps, the tracking system only provides the request's HTTP referrer.

In most instances the referrer is null. Given that the website in question has an SSL certificate, can I assume that this traffic is mostly due to web crawlers?

If the referral data is not enough, what additional (accessible) data can I gather to identify web crawlers?

Thanks

  • Some browsers send an empty `Referer` header when one uses the "Open in a new tab/window" feature, and some browser privacy plugins strip the `Referer` header from requests. So it is often a crawler, but not nearly 100% of the time. – Tero Kilkanen Nov 15 '20 at 16:11
  • @TeroKilkanen Thanks for your comment. Interesting, I did not know that. Are there any tricks you recommend to get a better picture? –  Nov 16 '20 at 00:08

1 Answer


Try adding a robots.txt file to your public HTML directory with the rules below. This instructs crawlers not to index your pages (though this is only a convention and robots can still ignore it). Then check whether the traffic went down:

    User-agent: * 
    Disallow: /

So it is better to set the `X-Robots-Tag` HTTP response header in your web server with the values below, then check the traffic again:

    X-Robots-Tag: noindex, noarchive, nosnippet, nofollow
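
As a sketch, in NGINX this header can be added inside your site's server block (the `listen` and `server_name` values here are placeholders for your own configuration):

```nginx
server {
    listen 443 ssl;
    server_name example.com;  # placeholder for your domain

    # Send the robots directives on every response, including error pages
    add_header X-Robots-Tag "noindex, noarchive, nosnippet, nofollow" always;
}
```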

Also, a null referer usually means a direct request was made.

You can use access logs to track incoming requests and analyze them, or better, use a tool like Collectd-web.
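
As a rough sketch, assuming NGINX's default `combined` log format (the log path is just an example), you can list the user agents behind empty-referer requests:

```shell
# Group empty-referer requests by user agent.
# With the "combined" format, splitting on double quotes puts the
# referer in field 4 and the user agent in field 6.
awk -F'"' '$4 == "-" {print $6}' /var/log/nginx/access.log \
  | sort | uniq -c | sort -rn | head
```

Crawlers usually identify themselves in the user agent (Googlebot, bingbot, and so on), so this quickly separates bot traffic from browsers that simply omit the referer.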

Edit your NGINX configuration (/etc/nginx/nginx.conf) to configure access logs (note that `gzip` is the parameter for compressed logs):

    access_log <path_to_your_log_dir>/access.log combined buffer=32k gzip;

Reload the NGINX configuration:

    systemctl reload nginx
    # or
    service nginx reload
Reda Salih
  • Hey there. Thanks for your response. Actually we welcome spiders, so I did not add any rules to our robots.txt. The issue is getting an idea of real website usage. The tracking data is related to specific resources with IDs in their URLs, so it is unlikely someone accesses them directly, unless of course they access the resource from an email or app. This is why I concluded that empty referrals indicate a spider visit. Are access logs like this standard with web servers like NGINX, or do I have to enable them? –  Nov 16 '20 at 00:02
  • I have edited the answer thanks ! – Reda Salih Nov 16 '20 at 00:12
  • Thanks mate. You helped me out here :) Now I don't have to mess with the app's tracking. –  Nov 16 '20 at 00:20