Questions that pose a similar problem:

Issues with LWP when using HTTP/1.1: bad chunk-size, truncated responses.


I am using the Perl module WWW::Mechanize to scrape web sites. As far as I understand, WWW::Mechanize uses the Net::HTTP module to implement the HTTP protocol.

Here is the issue:

my $url = 'https://somewebsite.com/a/b/c?skey=svalue';
my $browser = WWW::Mechanize->new();
$browser->get($url);

When I execute the above snippet (assuming all imports are in place), I get empty response content, and the response object of WWW::Mechanize carries the following error in its headers:

'x-died' = "Bad chunk-size in HTTP response: { at path/ to/perl/vendor/lib/Net/HTTP/Methods.pm line 542."

Notice the '{' in the exception message. I then tried to debug the Methods.pm module to see what was going on and it looks like the exception happens inside the read_entity_body subroutine.
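Since the status line is still 200 OK, a failure while reading the body is only visible in the response headers. A minimal sketch of checking for it, assuming the object returned by get behaves like an HTTP::Response; the helper name body_read_error is hypothetical and not part of LWP or WWW::Mechanize:

```perl
use strict;
use warnings;

# Hypothetical helper: return the error LWP recorded while reading the
# body (the X-Died header), or undef if the body was read cleanly.
sub body_read_error {
    my ($res) = @_;    # an HTTP::Response or any object with a header() method
    return scalar $res->header('X-Died');
}

# Usage with WWW::Mechanize (not executed here):
#   my $res = $browser->get($url);
#   if (defined(my $err = body_read_error($res))) {
#       warn "Response body is incomplete: $err\n";
#   }
```

Checking this header before using the content avoids silently processing an empty or truncated body.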

I also fetched the URL with curl and got the following response headers:

< HTTP/1.1 200 OK
< Set-Cookie: JSESSIONID=C61B57BA5DD0A05912C98CE1CFBAD435; Path=/; HttpOnly
< X-Frame-Options: DENY
< Transfer-Encoding: chunked
< Strict-Transport-Security: max-age=31536000 ; includeSubDomains
< Server: Apache-Coyote/1.1
< Cache-Control: no-cache, no-store, max-age=0, must-revalidate
< X-Content-Type-Options: nosniff
< Content-Disposition: attachment;filename=f.txt
< Pragma: no-cache
< Expires: 0
< X-XSS-Protection: 1; mode=block
< Date: Thu, 21 Sep 2017 18:31:27 GMT
< Content-Type: application/json;charset=UTF-8
< Transfer-Encoding: chunked

and with the following content:

{
  "total" : 1,
  "page" : 1,
  "records" : 1,
  "rows" : [ {
    "infoPostRptId" : 2,
    "mngPplId" : 1,
    "infoPostRptXsdId" : 1,
    "rptFmtCode" : "XML",
    "createUserId" : 5183202,
    "updateUserId" : 1,
    "statusId" : 309403,
    "seqNbr" : 0,
    "urlAnchor" : null,
  } ],
  "errors" : null
}
* Connection #0 to host xxxxxxx left intact

If I am not wrong, the content that came back from the website is not actually chunk-encoded, even though the headers declare Transfer-Encoding: chunked.

More information regarding the Methods.pm module:

From what I understand, the read_entity_body subroutine decodes the chunks and combines them to form the response content.

I think the problem is that the response headers have Transfer-Encoding: chunked but the content in fact is not chunked encoded.
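To illustrate why a body that starts with { would produce this message, here is a simplified chunked decoder. It is only a sketch and not the code in Net/HTTP/Methods.pm: the name decode_chunked is hypothetical, the error strings other than the one quoted above are made up, and trailers after the last chunk are ignored.

```perl
use strict;
use warnings;

# Illustrative decoder for a chunked body (NOT the Net::HTTP implementation).
# Each chunk is "<hex size>\r\n<data>\r\n"; a chunk of size 0 ends the body.
sub decode_chunked {
    my ($raw) = @_;
    my $body = '';
    my $pos  = 0;
    while (1) {
        my $eol = index($raw, "\n", $pos);
        die "Incomplete chunk-size line\n" if $eol < 0;
        my $line = substr($raw, $pos, $eol - $pos);
        $line =~ s/\r\z//;
        # The size line must start with hexadecimal digits.
        $line =~ /^\s*([0-9A-Fa-f]+)/
            or die "Bad chunk-size in HTTP response: $line\n";
        my $size = hex($1);
        $pos = $eol + 1;
        last if $size == 0;
        die "Truncated chunk\n" if $pos + $size > length($raw);
        $body .= substr($raw, $pos, $size);
        $pos  += $size;
        # Skip the line terminator that follows the chunk data.
        if    (substr($raw, $pos, 2) eq "\r\n") { $pos += 2 }
        elsif (substr($raw, $pos, 1) eq "\n")   { $pos += 1 }
    }
    return $body;
}
```

Fed a plain JSON body, the first "size line" is the single character {, which is not hexadecimal, so the decoder fails on the very first line with the same kind of message as in the X-Died header.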

Any help is highly appreciated. Thanks.

EDIT 1:

Versions:

WWW::Mechanize: 1.83, LWP::UserAgent: 6.15 and Net::HTTP: 6.12

EDIT 2:

Output of curl -s --raw -D - "https://....":

HTTP/1.1 200 OK
Set-Cookie: JSESSIONID=A29B1E0F561F1E4FBAF12583C0C2DE08; Path=/; HttpOnly
X-Frame-Options: DENY
Transfer-Encoding: chunked
Strict-Transport-Security: max-age=31536000 ; includeSubDomains
Server: Apache-Coyote/1.1
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
X-Content-Type-Options: nosniff
Content-Disposition: attachment;filename=f.txt
Pragma: no-cache
Expires: 0
X-XSS-Protection: 1; mode=block
Date: Fri, 22 Sep 2017 02:36:51 GMT
Content-Type: application/json;charset=UTF-8
Transfer-Encoding: chunked

45c
{
  "total" : 1,
  "page" : 1,
  "records" : 1,
  "rows" : [ {
        "infoPostRptId" : 2,
        "mngPplId" : 1,
        "infoPostRptXsdId" : 1,
        "rptFmtCode" : "XML",
        "createUserId" : 5183202,
        "updateUserId" : 1,
        "statusId" : 309403,
        "seqNbr" : 0,
        "urlAnchor" : null,
  } ],
  "errors" : null
}
0

Like the previous JSON content, I have removed/altered some values just to anonymize data.

EDIT 3: This is what I get when I execute the following command:

 perl -MLWP::UserAgent -e'print LWP::UserAgent->new->get($ARGV[0])->as_string' 'https://......'

  HTTP/1.1 200 OK
  Cache-Control: no-cache, no-store, max-age=0, must-revalidate
  Connection: close
  Date: Fri, 22 Sep 2017 04:15:06 GMT
  Pragma: no-cache
  Server: Apache-Coyote/1.1
  Content-Type: application/json;charset=UTF-8
  Expires: 0
  Client-Aborted: die
  Client-Date: Fri, 22 Sep 2017 04:15:06 GMT
  Client-Peer: 67.221.172.5:443
  Client-Response-Num: 1
  Client-SSL-Cert-Issuer: /C=US/ST=Arizona/L=Scottsdale/O=GoDaddy.com, Inc./OU=http://certs.godaddy.com/repository//CN=Go Daddy Secure Certificate Authority - G2
  Client-SSL-Cert-Subject: /OU=Domain Control Validated/CN=*.trellisenergy.com
  Client-SSL-Cipher: ECDHE-RSA-AES128-SHA256
  Client-SSL-Socket-Class: IO::Socket::SSL
  Client-Transfer-Encoding: chunked
  Content-Disposition: attachment;filename=f.txt
  Set-Cookie: JSESSIONID=5CAC35648DBBE25E3229DE9BF21C3794; Path=/; HttpOnly
  Strict-Transport-Security: max-age=31536000 ; includeSubDomains
  X-Content-Type-Options: nosniff
  X-Died: Bad chunk-size in HTTP response: { at /usr/local/share/perl5/Net/HTTP/Methods.pm line 544.
  X-Frame-Options: DENY
  X-XSS-Protection: 1; mode=block

EDIT 4: TCP Dump:

I ran the following command in one terminal window:

perl -MLWP::UserAgent -e'print LWP::UserAgent->new->get($ARGV[0])->as_string' 'https://vgs.trellisenergy.com/ptms/public/infopost/getInfoPostRpts.do?tspId=1&proxyTspId=1&rptId=2&downloadInd=0&searchInd=0&showLatestInd=0&cycleId=10303&startDate=09/20/2017&endDate=09/20/2017&_search=false&nd=1505846852955&rows=10&page=1&sidx=&sord=asc&_=1505846826289'

And the following in another:

tcpdump -w tcpdump.pcap -A -s0 -e -n -vvv -i eth0 host vgs.trellisenergy.com

I pretty-printed the capture using:

tcpick -C -yP -r tcpdump.pcap

TCP Dump:

Starting tcpick 0.2.1 at 2017-09-22 10:24 MDT
Timeout for connections is 600
tcpick: reading from tcpdump.pcap
1      SYN-SENT       10.1.1.10:24876 > 67.221.172.5:https
1      SYN-RECEIVED   10.1.1.10:24876 > 67.221.172.5:https
1      ESTABLISHED    10.1.1.10:24876 > 67.221.172.5:https
[Binary TLS handshake and encrypted application data omitted. The only readable fragments are the server name vgs.trellisenergy.com and the Go Daddy certificate chain for *.trellisenergy.com; the HTTP exchange itself is encrypted and not visible in the capture.]
1      FIN-WAIT-1     10.1.1.10:24876 > 67.221.172.5:https
1      TIME-WAIT      10.1.1.10:24876 > 67.221.172.5:https
1      CLOSED         10.1.1.10:24876 > 67.221.172.5:https
tcpick: done reading from tcpdump.pcap

22 packets captured
1 tcp sessions detected
Athithyaa
  • Could you include your versions of WWW::Mechanize, LWP::UserAgent and Net::HTTP? – oalders Sep 21 '17 at 22:33
  • Also, is the server emitting broken JSON or did the trailing curly bracket get lost in the copy/paste? – oalders Sep 21 '17 at 22:35
  • @oalders I have updated the post with versions. The broken JSON was my mistake. The server actually returns valid JSON. – Athithyaa Sep 21 '17 at 22:40
  • Re "*I think The problem happens when the response headers have Transfer-Encoding: chunked but the content in fact is not chunked encoded.*", You make it sound like you think getting an error message for bad data is a problem. I hope you meant "*I think the problem is that the response headers have Transfer-Encoding: chunked but the content in fact is not chunked encoded.*" For us to confirm that, you will need to provide us the output of `curl -s --raw -D - http://...`. (The output without the command is useless. The output without `--raw` is useless.) – ikegami Sep 22 '17 at 02:11
  • Why is `Transfer-Encoding: chunked` appearing twice? – ikegami Sep 22 '17 at 03:36
  • I tried to reproduce by placing your `curl` output in `response.html` (with `45c` replaced with `159`), then running `nc -l 8888 <response.html` in one terminal and `perl -MLWP::UserAgent -e'print LWP::UserAgent->new->get($ARGV[0])->as_string;' 'http://localhost:8888/'` in another. Not able to reproduce. Do you get the error if you do the same? – ikegami Sep 22 '17 at 03:37
  • @Athithyaa: can you provide a packet capture? It might be that the server is inserting some spaces somewhere where they don't belong or uses `\n` instead of `\r\n` - i.e. things which are not visible in the output of `--raw` you gave? – Steffen Ullrich Sep 22 '17 at 03:38
  • @Steffen Ullrich, I checked, but Net::HTTP doesn't care if `\r` is missing. In fact, the test I described above used just `\n` instead of `\r\n`. – ikegami Sep 22 '17 at 03:39
  • @ikegami: As you've already realized with `45c` instead of `192` - it looks like the OP gives an edited version of the actual output or that there is lots of white space which is not visible. To reproduce one would need to have exactly what is sent on the wire and not some approximation. And it is needed from the failed connection and not from the successful one. – Steffen Ullrich Sep 22 '17 at 03:47
  • @Steffen Ullrich, Re "*looks like the OP gives an edited version of the actual output*", They've said as much in the last paragraph. /// Re "*To reproduce one would need to have exactly what is sent on the wire and not some approximation.*" Yes and no. But an accurate view would be preferable. If there is an edit, it should not affect length at all. What they could do is use the `nc` setup I mentioned above, make their edits, and make sure the problem still happens after the edits, then post that. – ikegami Sep 22 '17 at 03:53
  • @ikegami I tried the method you suggested. It seems to work with the original as well as the one you changed. But as Steffen is suggesting, there might be issues with the original failed connection. – Athithyaa Sep 22 '17 at 03:55
  • @Athithyaa, That's too bad. There's not much we can do if you can't demonstrate the problem. – ikegami Sep 22 '17 at 03:57
  • @Athithyaa: I think you can remove the tcpdump to clean up the question. This is not of much help but the URL you've provided was really useful to reproduce the issue and to find the cause of the problem. – Steffen Ullrich Sep 22 '17 at 17:06

1 Answer

That's a bug in the server or (more likely) a bug in the application running on the server. If you send the following request:

GET /some-path HTTP/1.1
Host: some-host

The server responds with a correct chunked response. Interestingly, the Transfer-Encoding: chunked header is sent twice: once at the beginning of the HTTP header and once at the end:

HTTP/1.1 200 OK
Set-Cookie: ...
X-Frame-Options: DENY
Transfer-Encoding: chunked
...
Content-Type: application/json;charset=UTF-8
Transfer-Encoding: chunked

45c
{

Now, when sending a slightly changed request with an added Connection: close header, the response looks different:

GET /some-path HTTP/1.1
Host: some-host
Connection: close

----

HTTP/1.1 200 OK
Set-Cookie: ...
X-Frame-Options: DENY
Transfer-Encoding: chunked
...
Content-Type: application/json;charset=UTF-8

{

The leading Transfer-Encoding: chunked is still there but the trailing one is gone. And the response body is no longer chunked, even though there is still a Transfer-Encoding: chunked in the response header!

This is what happens with LWP, in contrast to curl: LWP sends a Connection: TE, close header while curl sends no Connection header. This means LWP gets the broken response and rightly complains, while curl does not get the broken response and thus has no reason to complain. But if you explicitly add a Connection: close header to curl, it runs into the same problem:

 $ curl -H 'Connection:close' https://...
 curl: (56) Illegal or missing hexadecimal sequence in chunked-encoding
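The difference between the two responses can also be checked mechanically on raw output such as that of curl -s --raw -D -. The following is a rough sketch only: the name inspect_response is hypothetical, and the body test is a heuristic, since a non-chunked body whose first line happens to look like a hexadecimal number would be misclassified.

```perl
use strict;
use warnings;

# Hypothetical diagnostic: given a raw HTTP response (headers + body),
# count the "Transfer-Encoding: chunked" headers and report whether the
# body starts with something that looks like a chunk-size line.
sub inspect_response {
    my ($raw) = @_;
    my ($head, $body) = split /\r?\n\r?\n/, $raw, 2;
    $body //= '';

    # Count header lines that declare chunked transfer encoding.
    my $te_count = () = $head =~ /^Transfer-Encoding:[ \t]*chunked[ \t]*\r?$/mig;

    # A chunked body begins with a hex size, optionally followed by extensions.
    my ($first_line) = $body =~ /\A([^\r\n]*)/;
    my $body_is_chunked =
        $first_line =~ /\A[0-9A-Fa-f]+[ \t]*(?:;.*)?\z/ ? 1 : 0;

    return { te_count => $te_count, body_is_chunked => $body_is_chunked };
}
```

For the two responses shown above, this would report two headers with a chunked body for the plain request, and one header with a non-chunked body for the request carrying Connection: close.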

Further tests show that the leading Transfer-Encoding: chunked header is also sent if the client makes an HTTP/1.0 request! This should not happen at all because chunked is only defined for HTTP/1.1.

This suggests that some part of the web application running on the server, and not the web server itself, is issuing the first Transfer-Encoding: chunked header. Thus, if you have access to the application or to its developer, it should be fixed there.
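To repeat the comparison, the three request variants described above (plain HTTP/1.1, HTTP/1.1 with Connection: close, and HTTP/1.0) can be generated as raw text and sent with whatever raw client is at hand. A small sketch; build_request and its named arguments are hypothetical, and /some-path and some-host are the same placeholders used above:

```perl
use strict;
use warnings;

# Hypothetical helper: build a raw GET request. The caller chooses the
# HTTP version and, optionally, a Connection header value.
sub build_request {
    my (%arg) = @_;
    my $version = $arg{version} // '1.1';
    my @lines = ("GET $arg{path} HTTP/$version", "Host: $arg{host}");
    push @lines, "Connection: $arg{connection}" if defined $arg{connection};
    # Header lines end with CRLF; an empty line terminates the header block.
    return join("\r\n", @lines, '', '');
}

# The three variants discussed above.
my %variant = (
    plain  => build_request(host => 'some-host', path => '/some-path'),
    close  => build_request(host => 'some-host', path => '/some-path',
                            connection => 'close'),
    http10 => build_request(host => 'some-host', path => '/some-path',
                            version => '1.0'),
);
```

Keeping the requests identical except for one header or the version makes it easier to attribute any change in the response to that single difference.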

Steffen Ullrich