
I noticed that pandas's read_csv() fails to read a public CSV file hosted on GitLab:

import pandas as pd
df = pd.read_csv("https://gitlab.com/stragu/DSH/-/raw/master/Python/pandas/spi.csv")

The error I get (truncated):

HTTPError                                 Traceback (most recent call last)
<ipython-input-3-e1c0b52ee83c> in <module>
----> 1 df = pd.read_csv("https://gitlab.com/stragu/DSH/-/raw/master/Python/pandas/spi.csv")

[...]

~\Anaconda3\lib\urllib\request.py in http_error_default(self, req, fp, code, msg, hdrs)
    647 class HTTPDefaultErrorHandler(BaseHandler):
    648     def http_error_default(self, req, fp, code, msg, hdrs):
--> 649         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    650 
    651 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 403: Forbidden

However, using R, the base function read.csv() reads it happily:

df <- read.csv("https://gitlab.com/stragu/DSH/-/raw/master/Python/pandas/spi.csv")
head(df)
#>   country_code year   spi
#> 1          AFG 2020 42.29
#> 2          AFG 2019 42.34
#> 3          AFG 2018 40.61
#> 4          AFG 2017 38.94
#> 5          AFG 2016 39.65
#> 6          AFG 2015 38.62

Created on 2020-10-29 by the reprex package (v0.3.0)

Any idea why that is, and how R achieves it?

Versions used:

  • R 4.0.3
  • Python 3.7.9
  • pandas 1.1.3


If you're looking for a workaround, I recommend making the GET request via the requests library:

import pandas as pd
import requests
from io import StringIO

url = "https://gitlab.com/stragu/DSH/-/raw/master/Python/pandas/spi.csv"
# requests sends its own User-Agent header by default, so the request goes through
df = pd.read_csv(StringIO(requests.get(url).text))
df.head()
  country_code  year        spi
0          AFG  2020  42.290001
1          AFG  2019  42.340000
2          AFG  2018  40.610001
3          AFG  2017  38.939999
4          AFG  2016  39.650002
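
The StringIO step is what bridges the two libraries: read_csv accepts any file-like object, so the downloaded text can be parsed in memory without touching disk. Here is a minimal offline sketch of just that step (the CSV literal below is a hand-typed sample mirroring the first rows shown above, not the real file):

```python
import pandas as pd
from io import StringIO

# Hand-typed sample imitating the first rows of spi.csv (illustrative only)
csv_text = "country_code,year,spi\nAFG,2020,42.29\nAFG,2019,42.34\n"

# StringIO wraps the string in a file-like object that read_csv can consume,
# exactly as it consumes the body of the HTTP response in the workaround above
df = pd.read_csv(StringIO(csv_text))
print(df.shape)           # (2, 3)
print(list(df.columns))   # ['country_code', 'year', 'spi']
```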

As to the "why" part: read_csv internally uses urllib for standard URLs, and GitLab apparently blocks urllib's default User-Agent ("Python-urllib/x.y"), presumably because it mistakes the request for a crawler. If I repeat the same process but set a different "User-Agent" header, the request succeeds.

TL;DR: what pandas does, and why it fails:

from urllib.request import Request, urlopen

url = "https://gitlab.com/stragu/DSH/-/raw/master/Python/pandas/spi.csv"
req = Request(url)
urlopen(req).read()  # fails with HTTPError: HTTP Error 403: Forbidden

What pandas would have to do for this to work:

req = Request(url)
req.add_header('User-Agent', 'Mozilla/5.0')  # literally any value works
urlopen(req).read()  # succeeds
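
The header fix can be verified offline by inspecting the Request object itself, without hitting the network (the value 'my-script/0.1' is an arbitrary placeholder):

```python
from urllib.request import Request

url = "https://gitlab.com/stragu/DSH/-/raw/master/Python/pandas/spi.csv"

bare = Request(url)                              # what pandas effectively builds
fixed = Request(url)
fixed.add_header('User-Agent', 'my-script/0.1')  # arbitrary value, for illustration

# A bare Request carries no User-Agent of its own; urlopen() later fills in the
# default "Python-urllib/x.y", which is what GitLab appears to reject.
# Note that urllib normalises header names via str.capitalize(), hence 'User-agent'.
print(bare.has_header('User-agent'))   # False
print(fixed.has_header('User-agent'))  # True
print(fixed.get_header('User-agent'))  # my-script/0.1
```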
  • Thank you for your detailed answer. Do you have any insights on how R's `read.csv()` does that differently for it to work without having to do anything extra? That would make a perfect answer. I suspect it has to do with how `base::url()` works. For example, `url("https://gitlab.com/stragu/DSH/-/raw/master/Python/pandas/spi.csv")` returns a valid connection that can be fed to `read.csv()` without a problem. – stragu Oct 30 '20 at 00:30
  • Unfortunately I have 0 knowledge of R so your guess is as good (or possibly better) than mine. – cs95 Oct 30 '20 at 05:09