I happened to compare the Google Analytics report with the Apache access logs, and it shows a startling drop-off of about 250%.

We have a WordPress installation hosted on AWS with two web servers behind an ELB, plus an NFS server, RDS and ElastiCache.

The way I carried out the analysis is as follows:

  1. On every page I put a simple JavaScript snippet that pings my server on page ready, i.e. on the DOMContentLoaded event, and I record the IP address and the page URL. As this is about the simplest JavaScript possible, my assumption is that it should run in most browsers, and the results are very close to the ones generated by Google Analytics.
  2. I examine the legitimate requests in the access logs (eliminating requests without a user agent, without a referrer URL, etc.) and consider only the requests that generate 200, 206, 301 or 302 response codes.
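
To make step 2 concrete, here is a minimal sketch of that filter. It assumes the logs use Apache's "combined" format and that a request is dropped when either the user agent or the referrer is missing; the function name and the status set are only illustrative, not the exact script I ran.

```python
import re

# Assumed layout (Apache "combined" format):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

ACCEPTED_STATUSES = {"200", "206", "301", "302"}


def is_legitimate(line):
    """Return True if the log line passes the step 2 filter."""
    m = LOG_RE.match(line)
    if m is None:
        return False  # line is not in the assumed format
    if m["status"] not in ACCEPTED_STATUSES:
        return False
    # Apache logs "-" when the header was absent.
    if m["agent"] in ("", "-") or m["referer"] in ("", "-"):
        return False
    return True
```

Note that requiring a referrer also drops direct visits (bookmarks, typed URLs), so the filter is on the strict side.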

When I compare the server pings generated by the client (the custom JavaScript from step 1) with the Apache access logs, the drop-off seems to be close to 250%.

So this means that the clients at those missing IPs did not execute the JavaScript, but the puzzling part is that the server is sending a 200 status code. So I arrived at the conclusion that the server is sending an empty response to most of them. (I have accounted for the few users who turn off JavaScript, some errors, etc.) However, I am unable to test this assumption, if it is the case at all:

  • mod_dumpio doesn't let me map the response body to the client IP.

  • Audit log doesn't seem to support logging of response body.

Taking these things into consideration, could anyone please point me in the right direction?

Clarification:

As I don't have the reputation to add a comment, I would like to add a few points here.

I looked only at document requests, i.e. excluding all CSS, JS and image files, and I filtered out Googlebot and other suspicious crawlers. Even accounting for all this, there is a clear drop-off of up to 250%.

chicks
Varun

2 Answers

examine only the requests which generate 200,206,301,302 response codes.

This will overcount. How much it overcounts depends on how many 301s and 302s you serve. A browser that receives a 301 or 302 follows the redirect without running your JavaScript ping, and will presumably then generate a 200 for the target URL, so each redirected visit is counted twice.
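
To get a feel for how large that effect is, you could tally status codes and see what share of the counted requests are redirects. This is only a rough sketch: it assumes the status code is the three-digit field right after the quoted request line, as in Apache's common and combined formats, and the function names are made up for illustration.

```python
import re
from collections import Counter

# The status code follows the closing quote of the request line.
STATUS_RE = re.compile(r'" (\d{3}) ')


def status_counts(lines):
    """Count how many log lines carry each HTTP status code."""
    counts = Counter()
    for line in lines:
        m = STATUS_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts


def redirect_share(counts):
    """Fraction of the 200/206/301/302 requests that are redirects."""
    redirects = counts["301"] + counts["302"]
    total = redirects + counts["200"] + counts["206"]
    return redirects / total if total else 0.0
```

If the share is high, dropping 301 and 302 from the comparison should bring the two numbers noticeably closer.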

Filtering out requests from bots, and requests for CSS, JavaScript and images, can be error-prone. Instead, I would recommend choosing a single page on your site where you know the JS analytics are working (for instance, the home page), and counting only requests for that page. Additionally, pick one common user agent from your logs that typically represents a real browser, and count only requests from it. If the numbers come closer to matching, you can broaden your scope a bit.
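
As a sketch of that narrowing step, something like the following would count only successful GETs for one page from one browser family. The combined log format is assumed, and the path and user agent substring in the usage note are just placeholders to replace with values from your own logs.

```python
def count_page_hits(lines, path, agent_substring):
    """Count 200 responses to GET `path` whose user agent contains `agent_substring`."""
    hits = 0
    for line in lines:
        # In the combined format, splitting on '"' yields:
        # [prefix, request, " status size ", referer, " ", user agent, ...]
        parts = line.split('"')
        if len(parts) < 6:
            continue  # not in the assumed format
        request = parts[1].split()
        status = parts[2].split()
        if len(request) < 2 or not status:
            continue
        if (
            request[0] == "GET"
            and request[1] == path
            and status[0] == "200"
            and agent_substring in parts[5]
        ):
            hits += 1
    return hits
```

For example, `count_page_hits(open("access.log"), "/", "Chrome/")` (the file name is hypothetical) gives a number you can compare directly with the pings recorded for the home page from the same browser.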

It's also possible your JS doesn't function properly in every browser. Try setting up a test instance of your site and then using a service like https://www.browserstack.com/ to load it in multiple browsers. Group the logs by user agent. Any user agent that makes the main request but doesn't send a ping likely has problems executing your JS. Fire up a copy of that user agent and test out your JS.
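
Grouping by user agent could look roughly like this. The sketch assumes the ping is an ordinary request that shows up in the same access log, and `PING_PATH` is a hypothetical endpoint name; substitute whatever URL your script actually calls.

```python
from collections import Counter

PING_PATH = "/ping"  # hypothetical ping endpoint, adjust to your setup


def agents_without_pings(lines, page_path="/"):
    """List user agents that requested `page_path` but never sent a ping."""
    pages = Counter()
    pings = Counter()
    for line in lines:
        parts = line.split('"')  # combined format: the user agent is parts[5]
        if len(parts) < 6:
            continue
        request = parts[1].split()
        if len(request) < 2:
            continue
        agent = parts[5]
        path = request[1].split("?", 1)[0]  # ignore the query string
        if path == PING_PATH:
            pings[agent] += 1
        elif path == page_path:
            pages[agent] += 1
    return sorted(agent for agent in pages if pings[agent] == 0)
```

Any real browser that turns up in the returned list is a good candidate for manual testing.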

jsha

Your Apache logs will record a number of things that analytics does not count. These include:

  • CSS, JavaScript, images and other content included by your content pages. These should be cached, so repeat visitors shouldn't need to retrieve them on subsequent pages. However, you may still see revalidation requests for them (conditional GETs, usually answered with 304 Not Modified) when visitors start a new browser session.
  • Content fetched by bots that are indexing your site. Look for +http:// in the user agent field, although not all spiders follow this convention.
  • Some users will be using tools that disable scripts, so this legitimate traffic will be missing from analytics reports.
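
The first two items can be screened out with simple heuristics, sketched below. The extension list is an assumption to adapt to your site, and matching `+https://` as well as `+http://` is my own addition; neither check will catch everything.

```python
# Illustrative list of static asset extensions; extend it to match your site.
STATIC_EXTENSIONS = (
    ".css", ".js", ".png", ".jpg", ".jpeg", ".gif",
    ".ico", ".svg", ".woff", ".woff2",
)


def looks_like_bot(user_agent):
    """Heuristic: many crawlers put a '+http://...' info URL in their user agent."""
    return "+http://" in user_agent or "+https://" in user_agent


def is_static_asset(path):
    """True if the request path (ignoring any query string) ends in a static extension."""
    return path.split("?", 1)[0].lower().endswith(STATIC_EXTENSIONS)


def is_page_view_candidate(path, user_agent):
    """Keep only requests that are neither static assets nor self-identified bots."""
    return not is_static_asset(path) and not looks_like_bot(user_agent)
```

Script-blocking users cannot be detected from the logs this way; they remain as a residual difference between the two counts.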
BillThor