
I want to fetch data daily from Yahoo/Google Finance, related to stocks' end-of-day (EOD) prices. These prices should be stored directly in a file in HDFS.

I can later create an external table on top of it (using Hive) and use it for further analysis.

So I am not looking for a basic MapReduce job, since I don't have an input file as such. Are there any connectors available in Python that can write data to HDFS?


1 Answer


Start with dumping your data in a local file. Then find a way to upload the file to HDFS.
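
For the first step, here is a minimal sketch in Python using only the standard library; the actual fetch from Yahoo/Google Finance is out of scope here, so the rows below are made-up placeholders:

    import csv

    # Placeholder EOD records standing in for whatever the finance API
    # returns: (date, symbol, open, high, low, close, volume)
    rows = [
        ("2015-08-07", "XYZ", 10.00, 10.50, 9.80, 10.20, 123456),
    ]

    # Tab-delimited text is convenient for the Hive external table later
    # (ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t')
    with open("data.txt", "w", newline="") as f:
        csv.writer(f, delimiter="\t").writerows(rows)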

  • If you are running your job on an "edge node" (i.e. a Linux box that is not part of the cluster but has all the Hadoop clients installed and configured), then you have the good old HDFS command-line interface (Python sketch after this list):

hdfs dfs -put data.txt /user/johndoe/some/hdfs/dir/

  • If you are running your job anywhere else, use an HTTP library (or the good old curl command line) to connect to the HDFS REST service -- it could be either WebHDFS or HttpFS, depending on how the cluster has been set up -- and upload the file with a PUT request (second sketch below)

http://namenode:port/webhdfs/v1/user/johndoe/some/hdfs/dir/data.txt?op=CREATE&overwrite=false

(and the content of "data.txt" as payload, of course)
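
From Python, the edge-node option is just a subprocess call to that same CLI -- a minimal sketch, assuming the Hadoop client binaries are on the PATH:

    import subprocess

    # Shell out to the HDFS CLI; requires the Hadoop clients to be
    # installed and configured on this machine (the "edge node" case)
    subprocess.check_call(
        ["hdfs", "dfs", "-put", "data.txt", "/user/johndoe/some/hdfs/dir/"]
    )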
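
And the REST option with the requests library -- the two-step webHDFS upload, where the host name, the default port 50070 and the user.name pseudo-authentication parameter are assumptions about your cluster:

    import requests

    url = ("http://namenode:50070/webhdfs/v1"
           "/user/johndoe/some/hdfs/dir/data.txt"
           "?op=CREATE&overwrite=false&user.name=johndoe")

    # Step 1: the NameNode takes no data itself; it answers with a 307
    # redirect whose Location header points at a DataNode
    r = requests.put(url, allow_redirects=False)
    assert r.status_code == 307, r.text

    # Step 2: PUT the actual file content to that DataNode (201 Created)
    with open("data.txt", "rb") as f:
        r = requests.put(r.headers["Location"], data=f)
    r.raise_for_status()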

  • BTW: when using a REST service against an HA cluster, you must call each NameNode until you find the active one (probing sketch below). – Samson Scharfrichter Aug 09 '15 at 08:30
  • BTW, when using a REST service against a secure cluster, you must set up Kerberos SPNEGO authentication - and optionally store the Hadoop *delegation token* for the duration of the session. – Samson Scharfrichter Aug 09 '15 at 08:32
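
A sketch of the probing loop from the first comment, assuming two hypothetical NameNodes nn1 and nn2 on the default webHDFS port:

    import requests

    # Hypothetical NameNode pair of an HA cluster
    namenodes = ["http://nn1:50070", "http://nn2:50070"]
    probe = "/webhdfs/v1/?op=GETFILESTATUS&user.name=johndoe"

    active = None
    for nn in namenodes:
        try:
            # The standby NameNode rejects the call (StandbyException);
            # only the active one answers 200
            if requests.get(nn + probe, timeout=10).status_code == 200:
                active = nn
                break
        except requests.ConnectionError:
            continue  # this NameNode host may be down entirely

And for the secure-cluster case, the third-party requests-kerberos package can take care of the SPNEGO handshake, assuming a valid Kerberos ticket (from kinit) sits in the local credential cache:

    import requests
    from requests_kerberos import HTTPKerberosAuth, OPTIONAL

    # SPNEGO negotiation happens transparently on each request
    auth = HTTPKerberosAuth(mutual_authentication=OPTIONAL)
    r = requests.put(
        "http://namenode:50070/webhdfs/v1/user/johndoe/some/hdfs/dir/data.txt"
        "?op=CREATE&overwrite=false",
        auth=auth,
        allow_redirects=False,
    )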