5

When you use the Web.Page(Web.Contents('url')) function to read a table from a web page, some sites will cause an error due to inconsistent linefeeds.

DataSource.Error: The server committed a protocol violation. Section=ResponseHeader Detail=CR must be followed by LF

There doesn't appear to be any option you can pass to the Web functions to ignore those errors.

This method works for a short while, but doesn't survive a save/refresh:

let
   BufferedBinary = Binary.Buffer(Web.Contents("http://vote.sos.ca.gov/returns/president/party/democratic/county/all")),
   CleanedUp = Text.Replace(Text.Replace(Text.FromBinary(BufferedBinary), "#(cr,lf)", "#(lf)"), "#(lf)", "#(cr,lf)"),
   Table = Web.Page(CleanedUp)
in
   Table
James White
  • 2,062
  • 2
  • 24
  • 36

1 Answers1

4

It might look like the other code sample works, but that probably means it just hasn't been run yet.


The two library functions Web.Contents('url') and Web.Page(Web.Contents('url')) use different HTTP clients.

Web.Contents uses a pretty basic HTTP client, but Web.Page over a non-transformed Web.Contents will instead use the IE browser to load the URL directly. IE is very forgiving for any HTTP protocol errors, but the basic HTTP client is much more strict, which causes the error you see.

By calling Binary.Buffer or Text.Replace in "between," that will skip the optimization where Web.Page uses the browser directly, and it instead uses Web.Contents to make the web request and then loads those bytes into the browser.

So in this case if you invoke:

Web.Page(Web.Contents("http://vote.sos.ca.gov/returns/president/party/democratic/county/all"))

You'll get a good result table, but "From Web" will be broken because the Web.Contents('url') by itself will error.


Since you shared the URL, I was able to try http://vote.sos.ca.gov/returns/president/party/democratic/county/all and get the same problem.

The underlying problem is the fourth cookie in the HTTP response header has an invalid hex character 0x01 right in the middle:

Bad Hex

Maybe this is something that http://vote.sos.ca.gov/ can fix on their server? I tried submitting a "bug report" on their Contact Us page but I'm not sure that is the right channel...


BTW, our HTTP client library doesn't give very good error messages, there's nothing wrong with CR or LF characters. Even if there was, there's nothing that Text.Replace over the response body can fix in this case because the problem is in the HTTP response headers.

Carl Walsh
  • 6,100
  • 2
  • 46
  • 50
  • 2
    Thank you SO MUCH for digging into this, I really appreciate it. Yes, it seems like it is largely related to a CDN or API product "Incapsula" which is used on this and other sites. – James White Jun 18 '16 at 13:48