
I am running a Hadoop cluster on Google Cloud Platform, using Google Cloud Storage as the backend for persistent data. I can ssh to the master node from a remote machine and run hadoop fs commands. However, when I try to execute the following code I get a timeout error.

Code

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

FileSystem hdfs = FileSystem.get(new URI("hdfs://mymasternodeip:8020"), new Configuration());

// Print the home directory
Path homeDir = hdfs.getHomeDirectory();
System.out.println("Home folder: " + homeDir);

// Build the path of the new directory
Path workingDir = hdfs.getWorkingDirectory();
Path newFolderPath = new Path("/DemoFolder");
newFolderPath = Path.mergePaths(workingDir, newFolderPath);

// Delete the directory if it already exists, then recreate it
if (hdfs.exists(newFolderPath)) {
    hdfs.delete(newFolderPath, true);
}
hdfs.mkdirs(newFolderPath);

When executing the hdfs.exists() call I get a timeout error.

Error

org.apache.hadoop.net.ConnectTimeoutException: Call From gl051-win7/192.xxx.1.xxx to 111.222.333.444.bc.googleusercontent.com:8020 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=111.222.333.444.bc.googleusercontent.com/111.222.333.444:8020]

Are you aware of any limitations when using the Java Hadoop APIs against Hadoop on Google Cloud Platform?

Thanks!

gl051

1 Answer


It looks like you're running that code on your local machine and trying to connect to the Google Compute Engine VM; by default, GCE uses strict firewall settings to avoid exposing your external IP addresses to arbitrary inbound connections. If you're using the defaults, your Hadoop cluster should be on the "default" GCE network. You'll need to follow the instructions for adding a firewall rule to allow incoming TCP connections on port 8020, and possibly on other Hadoop ports as well, from your local IP address for this to work. It'll look something like this:

gcloud compute firewall-rules create allow-http \
    --description "Inbound HDFS." \
    --allow tcp:8020 \
    --format json \
    --source-ranges your.ip.address.here/32

Note that you really want to avoid opening a 0.0.0.0/0 source range, since Hadoop isn't doing authentication or authorization on those incoming requests. You'll want to restrict it as much as possible to only the inbound IP addresses from which you plan to dial in. You may need to open a couple of other ports as well, depending on what functionality you use when connecting to Hadoop.

The more general recommendation is that, wherever possible, you should try to run your code on the Hadoop cluster itself; in that case, you'll use the master hostname as the HDFS authority rather than the external IP:

hdfs://<master hostname>/foo/bar

That way, you can limit the port exposure to just the SSH port 22, where incoming traffic is properly gated by the SSH daemon, and then your code doesn't have to worry about what ports are open or even about dealing with IP addresses at all.
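For illustration, here's a minimal sketch of the question's directory setup rewritten to run on the master node itself; hadoop-m is a placeholder for your actual master hostname, and the empty Configuration is assumed to pick up the cluster's core-site.xml from the classpath:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateDemoFolder {
    public static void main(String[] args) throws Exception {
        // On the cluster, core-site.xml already defines the HDFS endpoint, so the
        // internal master hostname resolves without exposing port 8020 externally.
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(new URI("hdfs://hadoop-m:8020"), conf);

        Path newFolderPath = Path.mergePaths(hdfs.getWorkingDirectory(), new Path("/DemoFolder"));
        if (hdfs.exists(newFolderPath)) {
            hdfs.delete(newFolderPath, true); // remove any directory left over from a previous run
        }
        hdfs.mkdirs(newFolderPath);
        System.out.println("Created: " + newFolderPath);
    }
}

Compiled and run on the master (for example via hadoop jar, or with the output of `hadoop classpath` on the java command line), this never needs port 8020 to be reachable from outside the network.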

Dennis Huo
  • Hi Dennis, opening the port worked for me, but as you already pointed out there may also be a need to open other ports, in particular if I want to load some data files into HDFS from my local machine, which is my final goal. I think you are right, I should have the Java code running on the master node to avoid exposing too many ports to inbound traffic, but what is the best practice for pushing the original data files to the master node (programmatically)? Thanks! – gl051 Jul 01 '15 at 20:13
  • Generally you can either use [gcloud compute copy-files](https://cloud.google.com/sdk/gcloud/reference/compute/copy-files) or you can first stage the files to Google Cloud Storage with `gsutil cp <local-file> gs://<bucket>/<path>`, and then SSH into the master and run `gsutil cp gs://<bucket>/<path> <local-dir>`. If you're talking about bulk data, you can also first upload to Google Cloud Storage, and then on your master node do `hadoop fs -cp gs://your-bucket/your-location/data hdfs:///`. – Dennis Huo Jul 01 '15 at 20:35
  • If it's a lot of data you can even use `hadoop distcp` to move it from GCS into HDFS. Alternatively, consider just reading the files from GCS directly in your Hadoop jobs; anywhere you would've used an hdfs:// path, just go ahead and use your gs://bucket/location instead (a sketch of this follows below). – Dennis Huo Jul 01 '15 at 20:36
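To make that last suggestion concrete, here's a rough sketch, assuming the GCS connector is installed and configured on the cluster (the Google-provided Hadoop deployments set this up) and using gs://your-bucket/your-location/data as a placeholder path; the same Hadoop FileSystem API from the question works unchanged against a gs:// URI:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListGcsFiles {
    public static void main(String[] args) throws Exception {
        // gs://your-bucket is a placeholder; this assumes the GCS connector jar is on the
        // classpath and the cluster's configuration maps the gs:// scheme to it.
        Configuration conf = new Configuration();
        FileSystem gcs = FileSystem.get(new URI("gs://your-bucket/"), conf);

        // List the staged data files exactly as you would with an hdfs:// path.
        for (FileStatus status : gcs.listStatus(new Path("gs://your-bucket/your-location/data"))) {
            System.out.println(status.getPath() + " (" + status.getLen() + " bytes)");
        }
    }
}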