Why am I getting a 403 error when connecting to a URL that works?

Question

I am trying to pull the quarter-end dates for a company from the SEC website. For some reason I keep getting a connection error. The code works for my friend who is in the US, but not for me in Canada. I tried using a VPN but still got the same error. Here are the code and the error I am getting.

When I put the URL into Google, it brings me to the page with all the information, so I am not sure why I can't pull it into R.

library(derivmkts)
library(quantmod)
library(jsonlite)
library(tidyverse)

url <- "https://data.sec.gov/submissions/CIK0000320193.json"
df <- fromJSON(url, flatten = TRUE)

Error in open.connection(con, "rb") : 
  cannot open the connection to 'https://data.sec.gov/submissions/CIK0000320193.json'
In addition: Warning message:
In open.connection(con, "rb") :
  cannot open URL 'https://data.sec.gov/submissions/CIK0000320193.json': HTTP status was '403 Forbidden'

I did not expect a 403 error when connecting to this URL.

Some sites block non-browser requests because they do not want to be scraped. You could try downloading the content with the `httr` package so you can alter the User-Agent request header, which might be what they use to block access. It's hard to tell for sure; every website is different. Make sure to check the terms and conditions to see that what you are trying to do is allowed. — MrFlick, Jul 27 '23 at 18:50
I think you need to look further into your VPN implementation to make sure it's doing what you expect. I'm in the US and it works. I VPN'd to Halifax and it still works. I VPN'd to Montreal and it fails with 403 (as for you). So it's sensitive to your apparent IP address, afaict. Whenever I'm researching something that is VPN-sensitive, I tend to find my apparent address (e.g., whatismyipaddress.com) both before and after connecting the VPN to ensure it changes. (You may also want/need to confirm that your VPN that says it's connecting to the US is actually doing that ...) — r2evans, Jul 27 '23 at 18:52
See also: https://www.sec.gov/os/webmaster-faq#code-support. The site will block requests that do not provide a company name in the User-Agent. I don't think there's any way to change that using `fromJSON`. You'll want to use something like `httr::GET` instead. — MrFlick, Jul 27 '23 at 19:00
@r2evans It worked for you with `fromJSON`? I tried it (in the US) and got a 403. The page contents state: "Your Request Originates from an Undeclared Automated Tool" — MrFlick, Jul 27 '23 at 19:00
@MrFlick yes, using `fromJSON` it worked, worked, and didn't work (respectively). That suggests it might be either a CDN problem, a geoip filter problem, or something else. (To make sure it isn't just sporadic, I just repeated it 100 times, no errors.) — r2evans, Jul 27 '23 at 19:29

margusl · Answer 1 · 2023-07-27T19:16:53.197

They ask you to declare a user agent in the request headers: https://www.sec.gov/os/accessing-edgar-data

Apparently the one provided as an example is also accepted, though you really should provide your contact details there.

With httr2 (which still uses jsonlite to parse JSON responses):

library(httr2)

resp <- request("https://data.sec.gov/submissions/CIK0000320193.json") |>
  req_user_agent("Sample Company Name AdminContact@<sample company domain>.com") |>
  # set verbosity level for debugging, 1: show headers
  req_perform(verbosity = 1)
#> -> GET /submissions/CIK0000320193.json HTTP/1.1
#> -> Host: data.sec.gov
#> -> User-Agent: Sample Company Name AdminContact@<sample company domain>.com
#> -> Accept: */*
#> -> Accept-Encoding: deflate, gzip
#> -> 
#> <- HTTP/1.1 200 OK
#> <- Content-Type: application/json
#> <- x-amzn-RequestId: c634dcbe-68aa-4777-9f18-4edfae752eb4
#> <- Access-Control-Allow-Origin: *
#> <- x-amz-apigw-id: IvJu4HiHIAMFidw=
#> <- X-Amzn-Trace-Id: Root=1-64c2bcc5-5db9315369e664da512cb6b5
#> <- Vary: Accept-Encoding
#> <- Content-Encoding: gzip
#> <- Expires: Thu, 27 Jul 2023 18:51:49 GMT
#> <- Cache-Control: max-age=0, no-cache, no-store
#> <- Pragma: no-cache
#> <- Date: Thu, 27 Jul 2023 18:51:49 GMT
#> <- Content-Length: 28594
#> <- Connection: keep-alive
#> <- Strict-Transport-Security: max-age=31536000 ; preload
#> <- Set-Cookie: ak_bmsc=E9...

resp
#> <httr2_response>
#> GET https://data.sec.gov/submissions/CIK0000320193.json
#> Status: 200 OK
#> Content-Type: application/json
#> Body: In memory (157568 bytes)

# first few keys / values from JSON:
resp_body_json(resp, simplifyVector = TRUE, flatten = TRUE) |>
  head(n = 10) |>
  str()
#> List of 10
#>  $ cik                              : chr "320193"
#>  $ entityType                       : chr "operating"
#>  $ sic                              : chr "3571"
#>  $ sicDescription                   : chr "Electronic Computers"
#>  $ insiderTransactionForOwnerExists : int 0
#>  $ insiderTransactionForIssuerExists: int 1
#>  $ name                             : chr "Apple Inc."
#>  $ tickers                          : chr "AAPL"
#>  $ exchanges                        : chr "Nasdaq"
#>  $ ein                              : chr "942404110"

Created on 2023-07-27 with reprex v2.0.2
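
Since the original goal was the quarter-end dates, here is a rough sketch of how they might be pulled out of the parsed response. It rests on assumptions not shown in the output above: that the JSON has a `filings$recent` element holding parallel `form` and `reportDate` vectors, and that `reportDate` is the period-end date of each filing. To keep the sketch runnable without network access, it parses a small made-up sample with that layout rather than real SEC data.

```r
library(jsonlite)

# Made-up sample that mimics the assumed layout; these are not real SEC values.
sample_json <- '{
  "filings": {
    "recent": {
      "form":       ["10-Q", "8-K", "10-K", "10-Q"],
      "reportDate": ["2000-06-30", "2000-05-15", "2000-03-31", "1999-12-31"]
    }
  }
}'

parsed <- fromJSON(sample_json, simplifyVector = TRUE, flatten = TRUE)

# `recent` is a named list of equal-length vectors, so it converts to a data frame
recent <- as.data.frame(parsed$filings$recent, stringsAsFactors = FALSE)

# Keep quarterly (10-Q) and annual (10-K) reports and read their period-end dates
periodic <- recent[recent$form %in% c("10-Q", "10-K"), ]
period_ends <- as.Date(periodic$reportDate)
period_ends
```

With the real response, `parsed` would come from `resp_body_json(resp, simplifyVector = TRUE, flatten = TRUE)` instead of the sample string. Check the element names with `str()` first, as they may differ from what is assumed here.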

I'm in the EU. I can open that JSON URL in the browser without any issues, but the default jsonlite and httr2 user agents are blocked. Using my browser's user agent with httr2 works only when I also set Accept-Language. They seem to check for some odd pattern in the user agent when the request is not coming from a browser, e.g. "foo_bar" is rejected (NOK) while "foo.bar" is accepted (OK).
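
If you would rather follow the `httr::GET` suggestion from the comments, a minimal sketch could look like the following. The helper name `fetch_sec_json` is hypothetical and the contact string in the usage comment is a placeholder; I have not verified which user agent strings the SEC accepts beyond what is described above.

```r
library(httr)
library(jsonlite)

# Hypothetical helper: download a JSON document while declaring a User-Agent.
fetch_sec_json <- function(url, agent) {
  resp <- httr::GET(url, httr::user_agent(agent))
  # Turn HTTP errors such as 403 into R errors instead of parsing an error page
  httr::stop_for_status(resp)
  txt <- httr::content(resp, as = "text", encoding = "UTF-8")
  jsonlite::fromJSON(txt, flatten = TRUE)
}

# Usage (needs network access; replace the placeholder with your own details):
# df <- fetch_sec_json(
#   "https://data.sec.gov/submissions/CIK0000320193.json",
#   "Your Company Name your.name@example.com"
# )
```

Fetching the body as text and passing it to `fromJSON` keeps the parsing step the same as in the question, so the resulting `df` should have the same shape the original code expected.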

This suggests their CDN or load-balancing has inconsistent configurations, interesting. — r2evans, Jul 27 '23 at 19:42
