
Our team is exploring options for fetching data from HDFS to a local filesystem. StreamSets was suggested to us, but no one on the team is familiar with it. Could anyone help me understand whether it fits our requirement, which is to fetch data from HDFS onto our local system?

Just an additional question:
I have set up StreamSets locally, for example on local IP xxx.xx.x.xx:18630, and it works fine on that machine. But when I try to access this URL from another machine on the network, it doesn't work, while my other applications, such as Shiny Server, work fine with the same mechanism.

metadaddy
Prakhar Jhudele

2 Answers


Yes - you can read data from HDFS to a local filesystem using StreamSets Data Collector's Hadoop FS Standalone origin. As cricket_007 mentions in his answer, though, you should carefully consider if this is what you really want to do, as a single Hadoop file can easily be larger than your local disk!

Answering your second question, Data Collector listens on all addresses by default. There is an http.bindHost setting in the sdc.properties config file that you can use to restrict the addresses that Data Collector listens on, but it is commented out by default.
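For reference, the relevant part of sdc.properties looks roughly like this (exact defaults and surrounding comments vary by Data Collector version, so treat this as a sketch):

```properties
# The port Data Collector's web UI and REST API listen on
http.port=18630

# The address Data Collector binds to. Commented out by default,
# which means it accepts connections on all addresses.
#http.bindHost=0.0.0.0
```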

You can use netstat to check - this is what I see on my Mac, with Data Collector listening on all addresses:

$ netstat -ant | grep 18630
tcp46      0      0  *.18630                *.*                    LISTEN    

The wildcard * in front of the 18630 in the output means that Data Collector will accept connections on any address.

If you are running Data Collector directly on your machine, then the most likely problem is a firewall setting. If you are running Data Collector in a VM or on Docker, you will need to look at your VM/Docker network config.
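If Docker is in the mix, make sure the port is published when you start the container, and then verify reachability from the other machine. A sketch, where the host address placeholder and container name are illustrative:

```shell
# Publish Data Collector's web UI port to the host so other
# machines on the network can reach it.
docker run -d -p 18630:18630 --name sdc streamsets/datacollector

# From the other machine, check that the port is reachable
# (nc is from the netcat package; substitute the host's address).
nc -vz <host-ip> 18630
```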

metadaddy

I believe by default StreamSets only exposes its services on localhost. You'll need to go through the config files to find where you can set it to listen on external addresses.

If you are using the CDH Quickstart VM, you'll need to externally forward that port.
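For example, if the Quickstart VM runs under VirtualBox with NAT networking, a port-forwarding rule along these lines would expose the UI to the host (the VM name and rule name here are illustrative):

```shell
# Forward host port 18630 to guest port 18630 on a NAT-attached VM.
VBoxManage modifyvm "cloudera-quickstart-vm" --natpf1 "sdc,tcp,,18630,,18630"
```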

Anyway, StreamSets is really designed to run as a cluster, on dedicated servers, for optimal performance. Its production deployments are comparable to Apache NiFi as offered in Hortonworks HDF.

So no, it wouldn't make sense to use the local FS destinations for anything other than testing/evaluation purposes.

If you want HDFS exposed as a local device, look into installing an NFS Gateway. Or you can use StreamSets to write to FTP / NFS, probably.

It's not clear what data you're trying to get, but many BI tools can perform CSV exports, and Hue can be used to download files from HDFS. At the very least, hdfs dfs -getmerge is a minimalist way to get data from HDFS to local storage. However, Hadoop typically stores many TB of data in the ideal case, and if you're working with anything smaller, dumping those results into a database is usually a better option than moving around flat files.
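A minimal sketch of the getmerge approach, run against a cluster (the HDFS paths here are illustrative):

```shell
# Concatenate all part files under an HDFS output directory
# into a single file on the local filesystem.
hdfs dfs -getmerge /user/me/output /tmp/output.csv

# Or copy a single file as-is instead of merging.
hdfs dfs -get /user/me/output/part-00000 /tmp/part-00000
```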

OneCricketeer
    By default, Data Collector actually listens on all addresses - you have to edit `sdc.properties` to restrict it. – metadaddy Jul 26 '18 at 16:26