I'm trying to scrape a web page with Java, and I plan to eventually move this code into an Android app, so for now I'm using jsoup. Using Chrome's DevTools, I copied the request headers and the equivalent curl command for the request that returns the page data. The following curl command works:
curl 'mySite/campaign/List' -H 'Cookie: __RequestVerificationToken_L0N5YXJhV2ViUG9ydGFs0=IECNY-SOnB09IY9MQMm3xL1bSbASe8Eha9J1fWupurHtmlldojgqpaljhzIuhfFh6zRnOygjsrKyuhj2krWiSSNXif76gRNH_39lGvyMJ0I1; ASP.NET_SessionId=gojtobwzycl0lvs0ip4glf3n; myCompany.WEB.PORTAL.AUTH=40C13BAF08884380F805B99E217754F3D35920CE1861DEBB580DC143DA4249C4682C33A36DD29272A3A844880110E4D0EC1F24298E4D1B2A4A94E3FA2CAC08B934989ACF155616D6CB5665338FF3CFF82EAD87BF93EB46FA3BA6AAE6B00401F9' -H 'Origin: mySite' -H 'Accept-Encoding: gzip, deflate' -H 'Accept-Language: en-US,en;q=0.8' -H 'User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.97 Safari/537.36' -H 'Content-Type: application/json;charset=UTF-8' -H 'Accept: */*' -H 'Referer: mySite/campaign' -H 'X-Requested-With: XMLHttpRequest' -H 'Connection: keep-alive' -H '__RequestVerificationToken: G2RD7FtHMG12j00zNuLtiSZSWquXAOvh1hUNxObxMCFIZclrQueAo4d3cZonI1MZ7hxELl56yi5hci5vpC78m4Sh8PivHwRcKImcCibi9xk1' --data-binary '{"PageNumber":2,"SortColumn":"ScheduledRunDate","SortAscending":false,"PageSize":20,"CollectionSize":308,"SelectedAccountId":"1","SearchTerm":"","ShowInactive":true}' --compressed
I also pulled the request headers from Chrome DevTools:
POST mySite/campaign/List HTTP/1.1
Host: mySite
Connection: keep-alive
Content-Length: 165
Origin: mySite
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.97 Safari/537.36
Content-Type: application/json;charset=UTF-8
Accept: */*
X-Requested-With: XMLHttpRequest
__RequestVerificationToken: G2RD7FtHMG12j00zNuLtiSZSWquXAOvh1hUNxObxMCFIZclrQueAo4d3cZonI1MZ7hxELl56yi5hci5vpC78m4Sh8PivHwRcKImcCibi9xk1
Referer: mySite/campaign
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.8
Cookie: __RequestVerificationToken_L0N5YXJhV2ViUG9ydGFs0=IECNY-SOnB09IY9MQMm3xL1bSbASe8Eha9J1fWupurHtmlldojgqpaljhzIuhfFh6zRnOygjsrKyuhj2krWiSSNXif76gRNH_39lGvyMJ0I1; ASP.NET_SessionId=gojtobwzycl0lvs0ip4glf3n; myCompany.WEB.PORTAL.AUTH=40C13BAF08884380F805B99E217754F3D35920CE1861DEBB580DC143DA4249C4682C33A36DD29272A3A844880110E4D0EC1F24298E4D1B2A4A94E3FA2CAC08B934989ACF155616D6CB5665338FF3CFF82EAD87BF93EB46FA3BA6AAE6B00401F9
I then tried converting that into jsoup, with no luck. I tried sending just the headers, and sending the headers along with PageNumber, SortColumn, etc. as data parameters. Both attempts throw org.jsoup.HttpStatusException: HTTP error fetching URL. Status=500. Here is the code I'm attempting:
Document pageDoc = Jsoup.connect("mySite/campaign/List")
.cookies(loginCookies)
//.header("Cookie",cookieList)
.userAgent("Mozilla/5.0")
.referrer("mySite/campaign")
//.data("Username", username)
//.data("Password", password)
//.followRedirects(true)
.header("Accept","*/*")
.header("Accept-Encoding","gzip, deflate")
.header("Accept-Language","en-US,en;q=0.8")
.header("Connection","keep-alive")
.header("Content-Type", "application/json;charset=UTF-8")
.header("Host","mySite")
.header("Origin", "mySite")
.header("Referer","mySite/campaign")
.header("User-Agent","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.97 Safari/537.36")
.header("X-Requested-With", "XMLHttpRequest")
.header("__RequestVerificationToken", pageToken)
.header("Content-Length", "165") // not sure if this is needed; if it is, I don't know how to compute it
.data("PageNumber","2")
.data("SortColumn", "ScheduledRunDate")
.data("SortAscending", "false")
.data("PageSize", "20")
.data("CollectionSize", "308")
.data("SelectedAccountId", "1")
.data("SearchTerm", "")
.data("ShowInactive", "true")
.ignoreContentType(true)
.post();
I can confirm all my tokens are correct. When I comment out .header("X-Requested-With", "XMLHttpRequest") I receive the general error page (this is expected), so I know I'm connecting, but when I leave it in I get the 500. I can also confirm all the "mySite" links are correct; I just had to redact them per company policy. I'm also not sure whether, or how, I need to pass PageNumber, SortColumn, SortAscending, etc. with jsoup, so I blindly added them as data parameters as shown above.
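For comparison, curl sends these values as one JSON string in the request body (the --data-binary argument), not as separate form fields. The sketch below uses only the Java standard library, not jsoup, and assumes the body is encoded as UTF-8. The class and method names (CampaignListBody, buildJsonBody, contentLength) are made up for illustration. It rebuilds that JSON string and counts its bytes, which appears to be where the Content-Length value comes from.

```java
import java.nio.charset.StandardCharsets;

public class CampaignListBody {

    // Rebuilds the JSON payload that the curl command passes via --data-binary.
    // Only the page number is parameterized; the other values are copied from the curl command.
    static String buildJsonBody(int pageNumber) {
        return "{\"PageNumber\":" + pageNumber
                + ",\"SortColumn\":\"ScheduledRunDate\""
                + ",\"SortAscending\":false"
                + ",\"PageSize\":20"
                + ",\"CollectionSize\":308"
                + ",\"SelectedAccountId\":\"1\""
                + ",\"SearchTerm\":\"\""
                + ",\"ShowInactive\":true}";
    }

    // Content-Length counts the bytes of the encoded body (UTF-8 assumed here),
    // not the number of characters.
    static int contentLength(String body) {
        return body.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String body = buildJsonBody(2);
        System.out.println(body);
        System.out.println("Content-Length: " + contentLength(body));
    }
}
```

Run as-is, this prints the JSON followed by Content-Length: 165, which matches the header Chrome reported. That suggests the length is derived from the body itself rather than being a value I need to obtain separately, and a different page number or search term would change it.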