
I want to know the different ways in which I can bring data into HDFS.

I am new to Hadoop and have been a Java web developer until now. If I have a web application that creates log files, how can I import those log files into HDFS?

Brian Tompsett - 汤莱恩
Gaurav

2 Answers


There are lots of ways to ingest data into HDFS; let me try to outline them here:

  1. hdfs dfs -put - a simple way to copy files from the local file system into HDFS (see the shell sketch after this list)
  2. HDFS Java API - programmatic access from a Java application (sketched after this list as well)
  3. Sqoop - for bringing data to/from relational databases
  4. Flume - for streaming files and logs
  5. Kafka - a distributed queue, mostly for near-real-time stream processing
  6. NiFi - an incubating project at Apache for moving data into HDFS without making lots of changes
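
For the first option, a minimal shell sketch (the paths and directory names are placeholders, not from the original question):

    hdfs dfs -mkdir -p /user/gaurav/logs
    hdfs dfs -put /var/log/webapp/app.log /user/gaurav/logs/

And for the Java API, a hedged sketch of the same copy done programmatically, assuming a NameNode at hdfs://namenode:8020 (adjust fs.defaultFS for your cluster; it is usually picked up from core-site.xml instead):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsLogUpload {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder NameNode address; in practice taken from core-site.xml
            conf.set("fs.defaultFS", "hdfs://namenode:8020");
            FileSystem fs = FileSystem.get(conf);
            // Copy a local log file into HDFS (both paths are placeholders)
            fs.copyFromLocalFile(new Path("/var/log/webapp/app.log"),
                                 new Path("/user/gaurav/logs/app.log"));
            fs.close();
        }
    }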

The best solution for bringing web application logs into HDFS is Flume; a minimal agent configuration is sketched below.
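
As a hedged illustration (the agent name, directories, and the HDFS URL are all placeholder assumptions), a Flume agent that watches a spool directory of rotated web logs and writes them into HDFS could be configured like this:

    # Source: watch a directory where the web app drops rotated log files
    agent.sources = src1
    agent.channels = ch1
    agent.sinks = sink1

    agent.sources.src1.type = spooldir
    agent.sources.src1.spoolDir = /var/log/webapp/spool
    agent.sources.src1.channels = ch1

    # File channel buffers events durably between source and sink
    agent.channels.ch1.type = file

    # Sink: write events into date-partitioned HDFS directories
    agent.sinks.sink1.type = hdfs
    agent.sinks.sink1.hdfs.path = hdfs://namenode:8020/user/gaurav/logs/%Y-%m-%d
    agent.sinks.sink1.hdfs.fileType = DataStream
    agent.sinks.sink1.hdfs.useLocalTimeStamp = true
    agent.sinks.sink1.channel = ch1

The agent would then be started with something like flume-ng agent --name agent --conf-file webapp-hdfs.properties.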

Ashrith
  • Thank you @Ashrith. Can you please tell me how a company offering big data services imports data into HDFS? Do they use the same methods you mentioned here? – Gaurav Sep 26 '15 at 08:40
  • Which tool is used depends on the type of data that you want to import into HDFS. That being said, you can also have a pipeline for importing data that uses more than one tool, e.g. Flume + Kafka. – Clyde D'Cruz Sep 26 '15 at 10:29
  • Thank you @Clyde D'Cruz. If my company uses CRM, ERP, and a server for data storage, what are the ways to import data from these systems into HDFS? – Gaurav Sep 26 '15 at 12:26
  • @Gaurav Yes, the tools mentioned above are the ones used by enterprises and companies implementing big data. For systems like CRM and ERP, the data is generally gathered/exported onto a different system, and you can either use Flume to stream those files or use typical HDFS put commands. Typically people use Hadoop when they have a big data problem (hundreds of terabytes) or for data warehouse offloads; if none of those is your concern, then don't go to the pain of implementing Hadoop. – Ashrith Sep 26 '15 at 16:26

We have three different kinds of data: structured (schema-based systems such as Oracle or MySQL), unstructured (images, web logs, etc.), and semi-structured (XML, etc.).

Structured data can be stored in a SQL database, in tables with rows and columns.

Semi-structured data is information that doesn't reside in a relational database but does have some organizational properties that make it easier to analyze. With some processing you can store it in a relational database (e.g. XML).

Unstructured data often includes text and multimedia content. Examples include e-mail messages, word processing documents, videos, photos, audio files, presentations, web pages, and many other kinds of business documents.

Depending on the type of your data, you will choose the tools to import it into HDFS.

Your company may use CRM or ERP tools, but we don't know exactly how that data is organized and structured.

Leaving aside simple HDFS commands like put and copyFromLocal for loading data that is already in an HDFS-compatible format, the main tools to load data into HDFS are described below.

Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. Data from MySQL, SQL Server & Oracle tables can be loaded into HDFS with this tool.
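
As a rough sketch (the host, database, credentials, table, and target directory are all placeholders), a typical Sqoop import of a MySQL table into HDFS looks like this:

    sqoop import \
      --connect jdbc:mysql://dbhost:3306/crm \
      --username dbuser -P \
      --table customers \
      --target-dir /user/gaurav/crm/customers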

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms.

Other tools include Chukwa, Storm, and Kafka.

Another important technology that is becoming very popular is Spark. It is both a friend and a foe to Hadoop.

Spark is emerging as a good alternative to Hadoop for real-time data processing, and it may or may not use HDFS as its data source.
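
For instance, a minimal Spark job in Java that reads logs already ingested into HDFS might look like this (the application name and HDFS path are placeholder assumptions):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class LogErrorCount {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("LogErrorCount");
            JavaSparkContext sc = new JavaSparkContext(conf);
            // Read the log files previously ingested into HDFS (path is a placeholder)
            JavaRDD<String> lines = sc.textFile("hdfs://namenode:8020/user/gaurav/logs/*");
            // Count the lines that contain "ERROR"
            long errors = lines.filter(line -> line.contains("ERROR")).count();
            System.out.println("Error lines: " + errors);
            sc.stop();
        }
    }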

Ravindra babu