I have collected all the requests made by a set of websites, with the aim of identifying the third parties each website contacts. I used Selenium WebDriver to do that.
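
A capture of this kind could be sketched as follows, assuming Python, Chrome, and ChromeDriver's performance log. The function names `extract_requests` and `capture` are illustrative and not taken from the actual script.

```python
import json


def extract_requests(log_entries):
    """Pull method, URL and headers out of Chrome performance-log entries."""
    requests = []
    for entry in log_entries:
        # Each entry's "message" field is itself a JSON-encoded string.
        message = json.loads(entry["message"])["message"]
        if message.get("method") != "Network.requestWillBeSent":
            continue
        request = message["params"]["request"]
        requests.append(
            {
                "method": request["method"],
                "url": request["url"],
                "headers": request.get("headers", {}),
            }
        )
    return requests


def capture(url):
    """Load `url` in Chrome and return the requests it triggered."""
    # Imported here so the parsing helper above works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    # Ask ChromeDriver to record DevTools network events.
    options.set_capability("goog:loggingPrefs", {"performance": "ALL"})
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return extract_requests(driver.get_log("performance"))
    finally:
        driver.quit()
```

Calling `capture("https://www.focuscamera.com/")` would then return one record per outgoing request, including requests issued by scripts after the initial page load.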

These requests can be made by JavaScript present in the website's source code, triggered dynamically by advertisements on the page, or initiated by services such as Google, DoubleClick, or Facebook. Looking at them helps to track what data these websites share, with or without the user's consent.

You can see an example of the requests made when the browser loads www.focuscamera.com/ in this Excel file:

https://drive.google.com/file/d/16wNA0dFUehrjPww31TAIj8GZUZ05LsIU/view?usp=sharing

My questions are:

1- Which HTTP header fields can I use in my analysis if I want to gather information about third parties? My goal is to distinguish between different kinds of third-party behavior.

For example, the Content-Length field in a request indicates the size of the entity body. Does a request with a higher Content-Length mean that the third party received and collected more data?

2- What exactly does Content-Length indicate? What exactly does the "HTTP request body data" contain?

3- Are there any other HTTP header fields I can use to distinguish between different kinds of third-party behavior? (A list of the fields I collect can be found in Sheet1 of the Excel file shared above.)

4- Is there any other information on the internet that I can use for this purpose? For example, I use cookiepedia.co.uk to find out what kind of service a third party provides: functionality, performance, or targeting/advertising.

Vy Do

1 Answer

It sounds like you may be reinventing the wheel here. Take a look at https://webbkoll.dataskydd.net; they provide lots of security and privacy analysis on any site you like. Generate nice visual request maps using https://requestmap.webperf.tools:

[Image: request map for focuscamera.com]

Try using that tool on sites like wired.com and forbes.com to see how spectacularly bad it can get!

To answer your questions specifically:

  1. Headers are not massively useful, as they are present in every request (the fact that a request is made at all is more interesting), but the important ones from a privacy perspective are Referer and Set-Cookie (plus Cookie on the request side). Content-Length does indeed tell you how big the request body is. A GET request normally has no body, so the header is usually omitted. Large POST requests indicate that more data is being transmitted, but that may be down to inefficiency rather than anything else.

  2. Content-Length indicates the length of the data (in bytes) within the body of a request, typically a POST. An HTTP request body can contain any kind of data: text, images, video, audio, formatted data.

  3. There are some, but most headers are functional rather than semantic, concerned with making the request actually work. It's more interesting that requests happen at all than what they contain.

  4. You can't necessarily tell what kind of service a third party is providing from the requests themselves; the domains they go to are more informative. For example, anything going to doubleclick.net is going to be ad and tracking related, because of what that domain is known to be used for (Webbkoll lists these as "known trackers"). So you're correct that sites like cookiepedia can help you find out what a particular service does. The divisions between functional/performance/profiling are mostly made up by ad companies to excuse their behaviour. You can't tell what they are using data for, only whether they are receiving data and what data they are receiving (because you can see what's in the requests using browser developer tools). To clarify: a site could receive your full name and address but do absolutely nothing with it, and you can't tell that from looking at the data that's sent. In privacy terms, it's always best to assume the worst (because ad companies absolutely cannot be trusted!), so if they are receiving data, assume it will be abused.
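
If you do want to analyse the captured requests yourself, grouping them by destination domain is probably the most useful first step. Here is a rough sketch in Python (an assumption on my part, as is the record layout of `url` plus a `headers` dict). `site_of` is a crude stand-in for a proper Public Suffix List lookup, and Set-Cookie is not covered because it only appears on responses.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def site_of(host):
    """Crude 'registrable domain': the last two labels of the host name.

    Assumption: fine for .com/.net style names; a real analysis should
    use the Public Suffix List (this gets e.g. .co.uk wrong).
    """
    return ".".join(host.lower().split(".")[-2:])


def summarise_third_parties(page_url, requests):
    """Group requests by third-party site and count what each one receives."""
    first_party = site_of(urlsplit(page_url).hostname or "")
    summary = defaultdict(
        lambda: {"requests": 0, "body_bytes": 0, "with_referer": 0, "with_cookie": 0}
    )
    for req in requests:
        host = urlsplit(req["url"]).hostname
        if not host:
            continue  # e.g. data: or blob: URLs have no host
        site = site_of(host)
        if site == first_party:
            continue
        # HTTP header names are case-insensitive, so normalise them.
        headers = {k.lower(): v for k, v in req.get("headers", {}).items()}
        row = summary[site]
        row["requests"] += 1
        # Content-Length is normally absent when there is no request body.
        row["body_bytes"] += int(headers.get("content-length", 0))
        row["with_referer"] += "referer" in headers
        row["with_cookie"] += "cookie" in headers
    return dict(summary)
```

The output tells you who is being contacted, how often, and whether a Referer or cookie went along with it. As noted above, it still says nothing about what the recipient does with that data.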

Synchro