Short version: How do I know how to phrase additional data (such as options on the page that change which HTML is returned for the same URL) when requesting a URL with urllib?

Long version: I am having trouble figuring out how to handle properties of a URL request that are determined not by the URL itself but probably by other information that the browser usually sends. To be more precise: This page contains a table that I want to read with Python, but the length of the table depends on the number of items per page you choose in the bottom left (i.e. the code I get from urllib.request.urlopen contains the default number of items, 50 or so, not the complete table). Clicking the button for, e.g., 400 items per page doesn't change the URL, so I expect that some information is sent somewhere else. I understand that urllib can send additional data besides the URL, but it is unclear to me how to figure out how to phrase "give me the whole table" (or "give me 400 items per page") in that data.

Studying the .html file I got by saving the webpage in my browser didn't give me any hints, and I lack the vocabulary to search for answers on the web (that is, googling "urllib request parameter" is too vague). Hence I'd be completely satisfied if someone pointed me to a duplicate of this question.

Thanks in advance :)

  • The information is probably stored and sent in a cookie. Inspect the actual request sent in your browser using your browser's debug tools. – deceze Jun 13 '16 at 13:16
  • @deceze This might be what I tried by saving the page as HTML and working through the document. (In Chromium that also works with "view source"; [this link might only work in Chromium](view-source:http://virtonomics.com/mary/main/geo/transport/423083/370074/423081/423082/423083).) I have been unable to spot the information though, but I also don't know exactly what I'm looking for. – SolUmbrae Jun 13 '16 at 14:50
  • I'm talking about this: https://developers.google.com/web/tools/chrome-devtools/profile/network-performance/resource-loading#view-details-for-a-single-resource – You want to replicate an HTTP request, so look at the original HTTP request; not the HTML document. – deceze Jun 13 '16 at 14:52
  • @deceze That looks great, I'll look into it, thanks :) – SolUmbrae Jun 13 '16 at 14:54
  • @deceze I'll just state where I'm at and then you can decide whether answering is worth your time: It looks like there are two requests made, the first one for http://virtonomics.com/mary/main/common/util/setpaging/dbproduct/transportReport/10 which includes the chosen option at the end in the url and the second one for the original url. The request headers do not seem to depend much on the chosen option (i.e. not enough information to determine which option was chosen), so I do not understand how the first request influences the second. – SolUmbrae Jun 13 '16 at 15:24
  • E.g. this is the cookie part of the header of the second request (original URL) for option "200": Cookie: language=en; _vwo_uuid_v2=73DE79322AAAFF3B48532C85F221A7CE|ada5b444309dffb170d78e9f62563ede; gamer=false; _ym_uid=1465824310890855545; _ym_isad=2; _mm_key_=d68ddbf70bd0fda1636ddf6913cae067; _mm_user_=1217393; registred_user=1; traidingHallProductCategory=1535; virtonomics_unitlist_size=50; last_realm=mary; _gat=1; PHPSESSID=tfbmtov1v8pg6sngts4kignmf1; _ga=GA1.2.527327008.1465824310 – SolUmbrae Jun 13 '16 at 15:28
  • And for option "10": Cookie: language=en; _vwo_uuid_v2=73DE79322AAAFF3B48532C85F221A7CE|ada5b444309dffb170d78e9f62563ede; gamer=false; _ym_uid=1465824310890855545; _ym_isad=2; _mm_key_=d68ddbf70bd0fda1636ddf6913cae067; _mm_user_=1217393; registred_user=1; traidingHallProductCategory=1535; virtonomics_unitlist_size=50; last_realm=mary; PHPSESSID=tfbmtov1v8pg6sngts4kignmf1; _ga=GA1.2.527327008.1465824310; _gat=1 They match except for the position of "_gat=1;", but this does not distinguish between the 5 possible options (I checked, and there are different options with the same cookie; I should have posted them). – SolUmbrae Jun 13 '16 at 15:30

1 Answer

For everyone else who finds this question, I'll elaborate on the answer @deceze gave in the comments:

  • Open the webpage you want to read in your browser.
  • Open your browser's developer tools (in Chromium this is [Ctrl+Shift+I] or right-click > Inspect).
  • Go to the "Network" tab (at least in Chromium).
  • Perform the action you want your program to replicate (e.g. choosing a page size); the initially empty request list will fill with a lot of entries.
  • Find your request in the list of events (one of the very first ones is right, I would guess), click it and select "Headers".
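
Once the requests are visible in the Network tab, they can be replayed from Python. The following is only a rough sketch based on the two requests described in the comments (the `setpaging` URL followed by the original URL). It assumes that the server stores the chosen page size in the session identified by the `PHPSESSID` cookie, which I have not verified, and it does not handle logging in, which the page may require. The function names `make_session_opener` and `fetch_table_html` are made up for illustration.

```python
import urllib.request
from http.cookiejar import CookieJar

# URLs as quoted in the comments; the trailing number is the chosen page size.
PAGING_URL = ("http://virtonomics.com/mary/main/common/util/setpaging/"
              "dbproduct/transportReport/{size}")
TABLE_URL = ("http://virtonomics.com/mary/main/geo/transport/"
             "423083/370074/423081/423082/423083")


def make_session_opener():
    """Return an opener that keeps cookies between requests, plus its cookie jar."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch_table_html(opener, size, table_url=TABLE_URL, paging_url=PAGING_URL):
    """Request the paging URL first, then the table page, with the same opener."""
    # Request 1: tell the server which page size to use.
    # Assumption: the server remembers this per session (PHPSESSID cookie).
    with opener.open(paging_url.format(size=size)) as response:
        response.read()
    # Request 2: the original URL, sent with whatever cookies request 1 set.
    with opener.open(table_url) as response:
        return response.read().decode("utf-8", errors="replace")
```

Usage would look like `opener, jar = make_session_opener()` followed by `html = fetch_table_html(opener, 400)`. Sharing one opener matters here: a plain `urllib.request.urlopen` call does not keep cookies, so the second request would start a fresh session and most likely fall back to the default page size.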