
I am trying to access a webpage to download some data like this:

import requests
from bs4 import BeautifulSoup

download_url = "ftp://nomads.ncdc.noaa.gov/NARR_monthly/"

s = requests.session()

page = BeautifulSoup(s.get(download_url).text, "lxml")

but this returns:

Traceback (most recent call last):

  File "<ipython-input-271-59c5b15a7e34>", line 1, in <module>
    r = requests.get(download_url)

  File "/anaconda3/lib/python3.6/site-packages/requests/api.py", line 72, in get
    return request('get', url, params=params, **kwargs)

  File "/anaconda3/lib/python3.6/site-packages/requests/api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)

  File "/anaconda3/lib/python3.6/site-packages/requests/sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)

  File "/anaconda3/lib/python3.6/site-packages/requests/sessions.py", line 612, in send
    adapter = self.get_adapter(url=request.url)

  File "/anaconda3/lib/python3.6/site-packages/requests/sessions.py", line 703, in get_adapter
    raise InvalidSchema("No connection adapters were found for '%s'" % url)

InvalidSchema: No connection adapters were found for 'ftp://nomads.ncdc.noaa.gov/NARR_monthly/'

even though the website is operational.

If that worked, I would normally loop through each link like so:

for a in page.find_all('a', href=True):
    file = a['href']
    print (file)

I also tried this:

import ftplib

ftp = ftplib.FTP(download_url)

but this returns:

  File "<ipython-input-284-60bd19e600fe>", line 1, in <module>
    ftp = ftplib.FTP(download_url)

  File "/anaconda3/lib/python3.6/ftplib.py", line 117, in __init__
    self.connect(host)

  File "/anaconda3/lib/python3.6/ftplib.py", line 152, in connect
    source_address=self.source_address)

  File "/anaconda3/lib/python3.6/socket.py", line 704, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):

  File "/anaconda3/lib/python3.6/socket.py", line 745, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):

gaierror: [Errno 8] nodename nor servname provided, or not known
1 Answer

Unfortunately, requests doesn't support FTP URLs, but you can use the built-in urllib.request module.

import urllib.request

download_url = "ftp://nomads.ncdc.noaa.gov/NARR_monthly/"
with urllib.request.urlopen(download_url) as r:
    data = r.read()

print(data)

The response is not HTML, so you can't parse it with BeautifulSoup, but you could use a regex or simple string manipulation.

links = [
    download_url + line.split()[-1] 
    for line in data.decode().splitlines()
]
for link in links:
    print(link)
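Once the links are assembled, each file could be fetched the same way. A minimal sketch, where `local_name` is a hypothetical helper that derives a file name from the link:

```python
import urllib.request

def local_name(link):
    """Derive a local file name from the last path segment of an FTP link."""
    return link.rstrip('/').rsplit('/', 1)[-1]

# Hypothetical usage (requires network access):
# for link in links:
#     urllib.request.urlretrieve(link, local_name(link))
```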

You can also use ftplib if you prefer, but you'll have to pass the host name only, not the full URL. You can then change into the 'NARR_monthly' directory and list the data.

from ftplib import FTP

with FTP('nomads.ncdc.noaa.gov') as ftp:
    ftp.login() 
    ftp.cwd('NARR_monthly')
    data = ftp.nlst()

path = "ftp://nomads.ncdc.noaa.gov/NARR_monthly/"
links = [path + i for i in data]

Sometimes the host will reject the connection because of too many clients, so you may want to use a try-except block.
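A minimal sketch of such a retry loop, where `connect_with_retry` is a hypothetical helper and transient failures are assumed to raise `ftplib.error_temp` or `OSError`:

```python
import time
from ftplib import FTP, error_temp

def connect_with_retry(host, attempts=3, delay=5):
    """Try to connect a few times before giving up."""
    for attempt in range(attempts):
        try:
            ftp = FTP(host)
            ftp.login()
            return ftp
        except (error_temp, OSError):
            if attempt == attempts - 1:
                raise  # out of attempts: re-raise the last error
            time.sleep(delay)  # back off before the next try
```

Calling `ftp = connect_with_retry('nomads.ncdc.noaa.gov')` would then retry up to `attempts` times, pausing `delay` seconds between tries.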

  • Thanks, I keep running into the host problem with R, it's quite annoying – Stefano Potter Aug 14 '18 at 00:01
  • 1
    I'm afraid there is not much we can do about the connection error. I can't help with R, but if you're using Python you could use a try-except block in a loop and break if the connection is successful. – t.m.adam Aug 14 '18 at 00:57