
Here are three very simple shell commands:

wget 'ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Dataset_Documentation/NHIS/2016/samadult_layout.pdf'

and

wget 'ftp://ftp.ibge.gov.br/Censos/Censo_Demografico_2010/Resultados_Gerais_da_Amostra/Microdados/1_Atualizacoes_20160311.txt'

and

wget ftp://ftp.cs.ru.nl/pub/robots.txt

These attempt to pull data from government FTP sites. They get to PASV and then hang (screenshots below). Do I need to change some setting? Thanks!

[Two screenshots of the wget output; images not reproduced here]

  • The answers to this question cover disabling passive mode, which may work for your use case (some rare FTP servers don't work with it). An alternative option may be to provision a free-tier VM instance - Google offer one VM with 600MB RAM and 30GB disk for free, per Google account. The catch is that outbound bandwidth is not free, *but* [the pricing page](https://cloud.google.com/compute/pricing) shows that outbound traffic to Google services is free, so you could copy data into Google Drive to export it. – i336_ May 09 '18 at 05:48

3 Answers


It looks like Google Cloud Shell only permits outgoing connections on ports 80 (HTTP), 443 (HTTPS), 8080 (sometimes used for HTTP proxies), 22 (SSH) and 21 (FTP control channel). There may be a few other ports too, but it is definitely not unrestricted outbound access.

Unfortunately that's not enough for a successful FTP connection - FTP transfers data on a separate TCP connection, either initiated by the client (passive mode) or by the server (active mode). Neither of these two methods seems to work.
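To see why port 21 alone is not enough, it can help to decode a PASV reply by hand. In passive mode the server answers with six numbers; the first four are the IP address and the last two encode the data port as p1*256 + p2, which is normally a high port outside the list above. The sketch below is only an illustration: the function name `pasv_endpoint` and the sample reply (documentation IP range, made-up port numbers) are assumptions, not output captured from the servers in the question.

```shell
# Decode the data-connection endpoint from an FTP "227" PASV reply.
# The reply carries h1,h2,h3,h4,p1,p2: IP = h1.h2.h3.h4, port = p1*256 + p2.
pasv_endpoint() {
    # Keep only the comma-separated numbers inside the parentheses.
    nums=$(printf '%s\n' "$1" | sed -n 's/.*(\([0-9,]*\)).*/\1/p')
    [ -n "$nums" ] || return 1

    # Split the numbers on commas into $1..$6.
    old_ifs=$IFS
    IFS=,
    set -- $nums
    IFS=$old_ifs
    [ "$#" -eq 6 ] || return 1

    echo "$1.$2.$3.$4:$(( $5 * 256 + $6 ))"
}

# Hypothetical reply, for illustration only.
pasv_endpoint '227 Entering Passive Mode (192,0,2,10,195,80)'
# prints 192.0.2.10:50000
```

With a reply like this, the client would have to open a second connection to port 50000, which would not get through if only the ports listed above are allowed out.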

One way around this is to download your files over HTTP or HTTPS, provided they are available over these protocols. For instance, the file in your last example can be retrieved as https://ftp.cs.ru.nl/robots.txt from Google Cloud Shell.

Another way is to set up an HTTP/FTP proxy on port 8080, e.g. install the squid package on a small compute instance, and use that proxy to download your files. Something like this:

export ftp_proxy=http://your-instance:8080/
wget ftp://ftp.cs.ru.nl/pub/robots.txt

A third option is to download the FTP files to your local machine and make them available through some file storage service over HTTPS.

Unfortunately it looks like FTP won't work from Cloud Shell in either active or passive mode. You'll have to work around that in one of the ways above.

Good luck with that :)

MLu

This is due to the awkward design of the FTP protocol and the way it uses TCP connections: http://slacksite.com/other/ftp.html

Try adding the --no-passive-ftp option to wget. If the servers are configured to work with active FTP, it might help.
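A minimal sketch of that suggestion, wrapped in a small helper so a blocked data connection should fail reasonably quickly instead of hanging. The function name `fetch_active_ftp` and the timeout and retry values are assumptions chosen for illustration; `--no-passive-ftp`, `--timeout` and `--tries` are standard wget options.

```shell
# Try an FTP download in active mode with a bounded wait.
fetch_active_ftp() {
    # --no-passive-ftp : use active mode (the server connects back to the client)
    # --timeout=15     : network timeout in seconds (assumed value)
    # --tries=1        : do not retry on failure (assumed value)
    wget --no-passive-ftp --timeout=15 --tries=1 "$1"
}

# Usage (needs network access, so it is left commented out):
# fetch_active_ftp 'ftp://ftp.cs.ru.nl/pub/robots.txt'
```

Active mode needs the server to be able to open a connection back to the client, so this can only help where inbound connections to the client are possible.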

I suspect that either some of these servers aren't configured to accept passive FTP, or routers in between track TCP connections but do not identify and track FTP connections as required. In fact, I was able to use passive FTP for these from my site, so the problem is somewhere between GC and those sites.

GioMac

EDIT: I didn't see the "Cloud Shell" in the question title, and a quick test shows that Cloud Shell indeed does not work. The answer below covers ordinary instances, which don't have any issues.


Actually, it works.

screenshot of success

The above is from a legitimate GC instance, in this case the unlimited free one Google provides per Google account.

Initially I wondered whether you'd edited the network settings. You probably haven't.

And then I realized... wait, if you haven't, your instance is probably still on a dynamic IP, which might be the reason! It makes sense that if your IP is dynamic Google might be doing a bit of CGNAT on it. Unsure though.

My instance is of course on a static IP, which are free. Go into VPC Network, make a new IP address, go back to Instances, click the instance (you will have to Stop it), and under Network interfaces set up the new IP. That is, IIRC, what I did.

WARNING. Google charge 10c/hr for unused static IPs. You'll want to associate it with the instance promptly.

IP address info: https://cloud.google.com/compute/docs/ip-addresses/

Pricing info: https://cloud.google.com/compute/pricing

i336_
    I believe the original question was about Google Cloud Shell (the web-based one), not about Google Cloud compute instance which of course should work. – MLu May 09 '18 at 04:37