
I want to fetch data daily from Yahoo/Google Finance, related to stocks' end-of-day (EOD) prices. These prices should be stored directly in a file in HDFS.

I can later create an external table on top of it (using Hive) and use it for further analysis.

So I am not looking for a basic MapReduce job, since I don't have an input file as such. Are there any connectors available in Python that can write data to HDFS?


1 Answer


Start with dumping your data in a local file. Then find a way to upload the file to HDFS.
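
For the first step, here is a minimal sketch in Python using only the standard library; the actual fetch from Yahoo/Google Finance is out of scope here, so the rows below are made-up placeholders:

    import csv

    # Placeholder EOD records standing in for whatever the finance API
    # returns: (date, symbol, open, high, low, close, volume)
    rows = [
        ("2015-08-07", "XYZ", 10.00, 10.50, 9.80, 10.20, 123456),
    ]

    # Tab-delimited text is convenient for the Hive external table later
    # (ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t')
    with open("data.txt", "w", newline="") as f:
        csv.writer(f, delimiter="\t").writerows(rows)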

  • If you are running your job on an "edge node" (i.e. a Linux box that is not part of the cluster but has all the Hadoop clients installed and configured), then you have the good old HDFS command-line interface (Python sketch after this list):

hdfs dfs -put data.txt /user/johndoe/some/hdfs/dir/

  • If you are running your job anywhere else, use an HTTP library (or the good old curl command line) to connect to the HDFS REST service -- it could be either WebHDFS or HttpFS, depending on how the cluster has been set up -- and upload the file with a PUT request (second sketch below)

http://namenode:port/webhdfs/v1/user/johndoe/some/hdfs/dir/data.txt?op=CREATE&overwrite=false

(and the content of "data.txt" as payload, of course)
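
From Python, the edge-node option is just a subprocess call to that same CLI -- a minimal sketch, assuming the Hadoop client binaries are on the PATH:

    import subprocess

    # Shell out to the HDFS CLI; requires the Hadoop clients to be
    # installed and configured on this machine (the "edge node" case)
    subprocess.check_call(
        ["hdfs", "dfs", "-put", "data.txt", "/user/johndoe/some/hdfs/dir/"]
    )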
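
And the REST option with the requests library -- the two-step webHDFS upload, where the host name, the default port 50070 and the user.name pseudo-authentication parameter are assumptions about your cluster:

    import requests

    url = ("http://namenode:50070/webhdfs/v1"
           "/user/johndoe/some/hdfs/dir/data.txt"
           "?op=CREATE&overwrite=false&user.name=johndoe")

    # Step 1: the NameNode takes no data itself; it answers with a 307
    # redirect whose Location header points at a DataNode
    r = requests.put(url, allow_redirects=False)
    assert r.status_code == 307, r.text

    # Step 2: PUT the actual file content to that DataNode (201 Created)
    with open("data.txt", "rb") as f:
        r = requests.put(r.headers["Location"], data=f)
    r.raise_for_status()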

  • BTW: when using a REST service against an HA cluster, you must call each NameNode until you find the active one (probing sketch below). – Samson Scharfrichter Aug 09 '15 at 08:30
  • BTW, when using a REST service against a secure cluster, you must set up Kerberos SPNEGO authentication - and optionally store the Hadoop *delegation token* for the duration of the session. – Samson Scharfrichter Aug 09 '15 at 08:32
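
A sketch of the probing loop from the first comment, assuming two hypothetical NameNodes nn1 and nn2 on the default webHDFS port:

    import requests

    # Hypothetical NameNode pair of an HA cluster
    namenodes = ["http://nn1:50070", "http://nn2:50070"]
    probe = "/webhdfs/v1/?op=GETFILESTATUS&user.name=johndoe"

    active = None
    for nn in namenodes:
        try:
            # The standby NameNode rejects the call (StandbyException);
            # only the active one answers 200
            if requests.get(nn + probe, timeout=10).status_code == 200:
                active = nn
                break
        except requests.ConnectionError:
            continue  # this NameNode host may be down entirely

And for the secure-cluster case, the third-party requests-kerberos package can take care of the SPNEGO handshake, assuming a valid Kerberos ticket (from kinit) sits in the local credential cache:

    import requests
    from requests_kerberos import HTTPKerberosAuth, OPTIONAL

    # SPNEGO negotiation happens transparently on each request
    auth = HTTPKerberosAuth(mutual_authentication=OPTIONAL)
    r = requests.put(
        "http://namenode:50070/webhdfs/v1/user/johndoe/some/hdfs/dir/data.txt"
        "?op=CREATE&overwrite=false",
        auth=auth,
        allow_redirects=False,
    )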