Questions tagged [hadoop]

Hadoop is an Apache open-source project that provides software for reliable and scalable distributed computing. The core consists of a distributed file system (HDFS) and a resource manager (YARN). Various other open-source projects, such as Apache Hive use Apache Hadoop as persistence layer.

The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.

"Hadoop" typically refers to the software in the project that implements the map-reduce data analysis framework, plus the distributed file system (HDFS) that underlies it.

There is a Name Node, typically you have at least one Name Node but usually you have more than one for redundancy. And that Name Node will accept the requests coming in from client applications to do some processing and it will then use some Data Nodes, and typically we have lots of Data Nodes that will share the processing work across between them. And the way they do that is they all have access to a shared file system that typically is referred to as the Hadoop Distributed File System or HDFS.

Description of Hadoop cluster

Apache Hadoop also works with other filesystems, the platform specific "local" filesystem, Blobstores such as Amazon S3 and Azure storage, as well as alternative distributed filesystems. See: Filesystem Compatibility with Apache Hadoop.

Since version 0.23, Hadoop disposes of a standalone resource manager : yarn.

This resource manager makes it easier to use other modules alongside with the MapReduce engine, such as :

  • Accumulo, a sorted, distributed key/value store that provides robust, scalable data storage and retrieval.
  • Ambari, A web-based tool for provisioning, managing, and
    monitoring Apache Hadoop clusters which includes support for Hadoop
    HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster
    health such as heatmaps and ability to view MapReduce, Pig and Hive
    applications visually along with features to diagnose their
    performance characteristics in a user-friendly manner.
  • Avro, a data serialization system based on JSON schemas.
  • Cassandra, a replicated, fault-tolerant, decentralized and scalable database system.
  • Chukwa: A data collection system for managing large distributed systems.
  • Cascading: Cascading is a software abstraction layer for Apache Hadoop and it mainly targets Java developers. The framework has been developed to reduce the effort of writing boilerplate code by MapReduce programmers with Java skills.
  • Flink, a fast and reliable large-scale data processing engine.
  • Giraph is an iterative graph processing framework, built on top of Apache Hadoop
  • HBase, A scalable, distributed database that supports structured data storage for large tables.
  • Hive, A data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout, a library of machine learning algorithms compatible with M/R paradigm.
  • Oozie, a workflow scheduler system to manage Apache Hadoop jobs.
  • Pig, a platform/programming language for authoring parallelizable jobs
  • Spark, a fast and general engine for large-scale data processing.
  • Storm, a system for real-time and stream processing
  • Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN.
  • ZooKeeper, a system for coordinating distributed nodes, similar to Google's Chubby

References

Online Tutorials

Related Tags

Hadoop

Related Technology

Commercial support is available from a variety of companies.

44316 questions
90
votes
7 answers

hadoop copy a local file system folder to HDFS

I need to copy a folder from local file system to HDFS. I could not find any example of moving a folder(including its all subfolders) to HDFS $ hadoop fs -copyFromLocal /home/ubuntu/Source-Folder-To-Copy HDFS-URI
Tariq
  • 2,274
  • 4
  • 24
  • 40
87
votes
3 answers

Large scale data processing Hbase vs Cassandra

I am nearly landed at Cassandra after my research on large scale data storage solutions. But its generally said that Hbase is better solution for large scale data processing and analysis. While both are same key/value storage and both are/can run…
Gary Lindahl
  • 5,341
  • 2
  • 19
  • 18
85
votes
18 answers

How do I output the results of a HiveQL query to CSV?

we would like to put the results of a Hive query to a CSV file. I thought the command should look like this: insert overwrite directory '/home/output.csv' select books from table; When I run it, it says it completeld successfully but I can never…
AAA
  • 2,388
  • 9
  • 32
  • 47
85
votes
8 answers

When do reduce tasks start in Hadoop?

In Hadoop when do reduce tasks start? Do they start after a certain percentage (threshold) of mappers complete? If so, is this threshold fixed? What kind of threshold is typically used?
Slayer
  • 2,391
  • 4
  • 21
  • 18
81
votes
9 answers

How to Delete a directory from Hadoop cluster which is having comma(,) in its name?

I have uploaded a Directory to hadoop cluster that is having "," in its name like "MyDir, Name" when I am trying to delete this Directory by using rmr hadoop shell command as following hadoop dfs -rmr hdfs://host:port/Navi/MyDir, Name I'm getting…
java_dev
  • 992
  • 1
  • 8
  • 11
79
votes
2 answers

Hadoop truncated/inconsistent counter name

For now, I have a Hadoop job which creates counters with a pretty big name. For example, the following one:…
mr.nothing
  • 5,141
  • 10
  • 53
  • 77
79
votes
7 answers

Is there any way to get the column name along with the output while execute any query in Hive?

In Hive, when we do a query (like: select * from employee), we do not get any column names in the output (like name, age, salary that we would get in RDBMS SQL), we only get the values. Is there any way to get the column names to be displayed along…
Nithin
  • 9,661
  • 14
  • 44
  • 67
79
votes
11 answers

What is Hive: Return Code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask

I am getting: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask While trying to make a copy of a partitioned table using the commands in the hive console: CREATE TABLE copy_table_name LIKE table_name; INSERT…
nickponline
  • 25,354
  • 32
  • 99
  • 167
78
votes
16 answers

How to delete and update a record in Hive

I have installed Hadoop, Hive, Hive JDBC. which are running fine for me. But I still have a problem. How to delete or update a single record using Hive because delete or update command of MySQL is not working in Hive. Thanks hive> delete from…
Charnjeet Singh
  • 3,056
  • 6
  • 35
  • 65
77
votes
10 answers

merge output files after reduce phase

In mapreduce each reduce task write its output to a file named part-r-nnnnn where nnnnn is a partition ID associated with the reduce task. Does map/reduce merge these files? If yes, how?
Shahryar
  • 1,454
  • 2
  • 15
  • 32
77
votes
12 answers

Where does Hive store files in HDFS?

I'd like to know how to find the mapping between Hive tables and the actual HDFS files (or rather, directories) that they represent. I need to access the table files directly. Where does Hive store its files in HDFS?
Yuval
  • 7,987
  • 12
  • 40
  • 54
75
votes
5 answers

Hive: how to show all partitions of a table?

I have a table with 1000+ partitions. "Show partitions" command only lists a small number of partitions. How can i show all partitions? Update: I found "show partitions" command only lists exactly 500 partitions. "select ... where ..." only…
Kevin Leo
  • 850
  • 1
  • 7
  • 9
75
votes
12 answers

Buiding Hadoop with Eclipse / Maven - Missing artifact jdk.tools:jdk.tools:jar:1.6

I am trying to import cloudera's org.apache.hadoop:hadoop-client:2.0.0-cdh4.0.0 from cdh4 maven repo in a maven project in eclipse 3.81, m2e plugin, with oracle's jdk 1.7.0_05 on win7 using org.apache.hadoop
jvataman
  • 1,357
  • 1
  • 12
  • 13
74
votes
16 answers

Hive insert query like SQL

I am new to hive, and want to know if there is anyway to insert data into Hive table like we do in SQL. I want to insert my data into hive like INSERT INTO tablename VALUES (value1,value2..) I have read that you can load the data from a file to…
Y0gesh Gupta
  • 2,184
  • 5
  • 40
  • 56
72
votes
3 answers

Differences between Amazon S3 and S3n in Hadoop

When I connected my Hadoop cluster to Amazon storage and downloaded files to HDFS, I found s3:// did not work. When looking for some help on the Internet I found I can use S3n. When I used S3n it worked. I do not understand the differences between…
user1355361