I happened to compare the Google Analytics report with the Apache access logs, and the comparison shows a startling drop-off of about 250%.
We have a WordPress installation hosted on AWS, with two web servers behind an ELB, plus an NFS server, RDS, and ElastiCache.
The way I carried out the analysis is as follows:
- On every page I put a simple JavaScript snippet that pings my server on page ready (i.e. on the DOMContentLoaded event), and I record the IP address and the page URL. As this is about the simplest JavaScript possible, my assumption is that it should run on most browsers, and the results are very close to those reported by Google Analytics.
- I examine the legitimate requests in the access logs (eliminating requests without a user agent, without a referrer URL, etc.) and consider only the requests that produced a 200, 206, 301, or 302 response code.
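For reference, the access-log filtering in the second step can be sketched roughly as below. This is only a minimal sketch, assuming the logs use Apache's "combined" LogFormat; the names `parse_line`, `is_legitimate`, `STATIC_SUFFIXES`, and `BOT_MARKERS` are hypothetical, and the bot check is deliberately crude.

```python
import re

# Assumed Apache "combined" LogFormat:
#   %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

ACCEPTED_STATUS = {"200", "206", "301", "302"}
# Hypothetical lists; adjust to the site's actual assets and crawlers.
STATIC_SUFFIXES = (".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".ico", ".svg")
BOT_MARKERS = ("bot", "crawl", "spider")


def parse_line(line):
    """Return the fields of one combined-format log line, or None if it does not match."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None


def is_legitimate(entry):
    """Apply the filters from the list above to one parsed log entry."""
    if entry["status"] not in ACCEPTED_STATUS:
        return False
    # Apache writes "-" for a missing header.
    if entry["agent"] in ("", "-") or entry["referer"] in ("", "-"):
        return False
    agent = entry["agent"].lower()
    if any(marker in agent for marker in BOT_MARKERS):
        return False
    # Keep document requests only: drop CSS, JS, and image files.
    path = entry["path"].split("?", 1)[0].lower()
    return not path.endswith(STATIC_SUFFIXES)
```

Each surviving entry then yields an (IP, URL) pair via `entry["ip"]` and `entry["path"]`, which can be matched against the pairs recorded by the JavaScript ping.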
When I compare the server pings generated by the client (the custom JavaScript from the first step above) with the Apache access logs, the drop-off is close to 250%.
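To make the comparison concrete, here is a minimal sketch of how the two data sets could be matched, assuming both have been reduced to (IP, URL) pairs. The function name `compare_hits` and the sample addresses are hypothetical. Because a percentage like this can be computed in more than one way, the sketch reports both the share of logged requests with no matching ping and the excess of logged requests relative to pings.

```python
def compare_hits(ping_hits, log_hits):
    """Compare (ip, url) pairs seen by the JavaScript ping with those in the access log."""
    pings = set(ping_hits)
    logged = set(log_hits)
    missing = logged - pings  # served by Apache, but no ping came back
    return {
        "logged": len(logged),
        "pinged": len(pings),
        "missing": sorted(missing),
        # Share of logged requests without a ping (can never exceed 100%).
        "missing_share_pct": 100.0 * len(missing) / len(logged) if logged else 0.0,
        # Logged requests in excess of pings, relative to pings (can exceed 100%).
        "excess_over_pings_pct": (
            100.0 * (len(logged) - len(pings)) / len(pings) if pings else 0.0
        ),
    }


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    log_hits = [("198.51.100.%d" % i, "/post/") for i in range(1, 8)]
    ping_hits = log_hits[:2]
    print(compare_hits(ping_hits, log_hits))
```

The `missing` list is the useful part for debugging: those are the exact client and URL combinations worth inspecting further.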
This means that the clients at the missing IPs did not execute the JavaScript, yet the puzzling part is that the server sent them a 200 status code. So I arrived at the conclusion that the server is sending an empty response for most of these requests. (I have accounted for the few users who turn off JavaScript, some errors, etc.) However, I am unable to test this assumption, or even to confirm that this is what is happening.
mod_dumpio doesn't let me map the response body to the client IP. The audit log doesn't seem to support logging of the response body.
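One thing that might help probe the empty-response idea without capturing bodies is the response size that the access log already records. The sketch below is an assumption-laden starting point, not a confirmed diagnosis: it assumes the common/combined LogFormat (where `%b` is the body size, written as `-` when no bytes are sent) and that the first field is the real client address (behind an ELB that generally means logging `X-Forwarded-For` instead of `%h`). The function name `small_ok_responses` and the 512-byte threshold are hypothetical.

```python
import re
from collections import Counter

# Leading part of the common/combined LogFormat: %h %l %u %t "%r" %>s %b
LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def small_ok_responses(lines, min_bytes=512):
    """Count, per client IP, the 200 responses whose body is smaller than min_bytes."""
    suspicious = Counter()
    for line in lines:
        match = LINE.match(line)
        if not match or match.group("status") != "200":
            continue
        size = match.group("size")
        # "%b" logs "-" instead of 0 when no body bytes were sent.
        body_bytes = 0 if size == "-" else int(size)
        if body_bytes < min_bytes:
            suspicious[match.group("ip")] += 1
    return suspicious
```

If the IPs that never pinged back also dominate this counter, that would support the empty-response assumption; if their responses are full-sized, the cause is more likely on the client side.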
Taking these things into consideration, could anyone please point me in the right direction?
Clarification:
As I don't have the reputation to add a comment, I would like to add a few points here.
I looked only at document requests (i.e. excluding all CSS, JS, and image files), and I filtered out Googlebot and other suspicious crawlers. Even accounting for all this, there is a clear drop-off of up to 250%.