Questions tagged [hadoop]

Hadoop is an Apache open-source project that provides software for reliable and scalable distributed computing. The core consists of a distributed file system (HDFS) and a resource manager (YARN). Various other open-source projects, such as Apache Hive, use Apache Hadoop as a persistence layer.

The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.

"Hadoop" typically refers to the software in the project that implements the map-reduce data analysis framework, plus the distributed file system (HDFS) that underlies it.
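As a rough illustration of the map-reduce model mentioned above, the sketch below counts words in plain Python. It does not use Hadoop at all: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and the in-memory grouping only imitates what the framework would do across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, imitating the step between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in sequence on an in-memory list of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["hadoop stores data", "hadoop processes data"]))
```

In a real cluster the map and reduce functions would run on different nodes and the grouping step would move data over the network; here everything runs in one process to keep the idea visible.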

A cluster has at least one Name Node, and usually more than one for redundancy. The Name Node accepts requests from client applications and hands the processing to Data Nodes. A cluster typically has many Data Nodes, which share the processing work among them. They can do this because they all have access to a shared file system, usually referred to as the Hadoop Distributed File System (HDFS).

Description of Hadoop cluster

Apache Hadoop also works with other filesystems: the platform-specific "local" filesystem, blob stores such as Amazon S3 and Azure storage, and alternative distributed filesystems. See: Filesystem Compatibility with Apache Hadoop.
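One way to picture this is that the filesystem is usually selected by the scheme of the path URI. The sketch below is plain Python and not part of Hadoop: `SCHEMES` and `describe_filesystem` are hypothetical names, and the table lists only a few commonly seen schemes with informal descriptions.

```python
from urllib.parse import urlparse

# Assumption: a small, incomplete table of URI schemes often seen with Hadoop.
# The descriptions are informal labels, not official names.
SCHEMES = {
    "hdfs": "Hadoop Distributed File System",
    "file": "platform-specific local filesystem",
    "s3a": "Amazon S3 object store",
    "abfs": "Azure storage (ABFS connector)",
}


def describe_filesystem(uri):
    """Return an informal description of the filesystem a URI points at."""
    scheme = urlparse(uri).scheme
    return SCHEMES.get(scheme, "unknown or unlisted filesystem")


if __name__ == "__main__":
    for uri in ("hdfs://namenode:8020/user/data", "s3a://bucket/key", "/tmp/data"):
        print(uri, "->", describe_filesystem(uri))
```

A path without a scheme, such as `/tmp/data`, falls through to the default label here; in an actual deployment the cluster configuration decides which filesystem such a path resolves to.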

Since version 0.23, Hadoop has included a standalone resource manager: YARN.

This resource manager makes it easier to use other modules alongside the MapReduce engine, such as:

  • Accumulo, a sorted, distributed key/value store that provides robust, scalable data storage and retrieval.
  • Ambari, a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro, a data serialization system based on JSON schemas.
  • Cassandra, a replicated, fault-tolerant, decentralized and scalable database system.
  • Chukwa, a data collection system for managing large distributed systems.
  • Cascading, a software abstraction layer for Apache Hadoop that mainly targets Java developers. The framework was developed to reduce the effort of writing boilerplate code for MapReduce programmers with Java skills.
  • Flink, a fast and reliable large-scale data processing engine.
  • Giraph, an iterative graph processing framework built on top of Apache Hadoop.
  • HBase, a scalable, distributed database that supports structured data storage for large tables.
  • Hive, a data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout, a library of machine learning algorithms compatible with the MapReduce paradigm.
  • Oozie, a workflow scheduler system to manage Apache Hadoop jobs.
  • Pig, a platform/programming language for authoring parallelizable jobs.
  • Spark, a fast and general engine for large-scale data processing.
  • Storm, a system for real-time and stream processing.
  • Tez, an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN.
  • ZooKeeper, a system for coordinating distributed nodes, similar to Google's Chubby.

Commercial support is available from a variety of companies.

44316 questions
31 votes, 1 answer

Accessing stream output from hdfs of MRjob

I'm trying to use a Python driver to run an iterative MRjob program. The exit criteria depend on a counter. The job itself seems to run. If I run a single iteration from the command line, I can then hadoop fs -cat /user/myname/myhdfsdir/part-00000…
tony_tiger (reputation 789; badges 1, 11, 25)

31 votes, 2 answers

What does msck stands for in Msck repair command

The Hive MSCK REPAIR command is used to repair partitions, but what is the full form of MSCK? I already tried to find it in the Hive docs, but had no luck.

31 votes, 6 answers

How to delete files from the HDFS?

I just downloaded the Hortonworks sandbox VM, which includes Hadoop version 2.7.1. I added some files using the hadoop fs -put /hw1/* /hw1 ... command. After that, I deleted the added files with the hadoop fs -rm /hw1/* ... command,…
serg (reputation 1,003; badges 3, 16, 26)

31 votes, 6 answers

Parquet without Hadoop?

I want to use Parquet in one of my projects as columnar storage, but I don't want to depend on the Hadoop/HDFS libraries. Is it possible to use Parquet outside of HDFS? Or what is the minimum dependency?
capacman (reputation 317; badges 1, 4, 7)

31 votes, 7 answers

Is there an equivalent to `pwd` in hdfs?

I tried to do hdfs dfs -pwd, but that command does not exist. So currently I am resorting to doing hdfs dfs -ls .. followed by hdfs dfs -ls ../... I also looked at the command listing for hdfs dfs but did not see anything that looked promising. Is…
merlin2011 (reputation 71,677; badges 44, 195, 329)

31 votes, 3 answers

What is the advantage of storing schema in avro?

We need to serialize some data for putting into solr as well as hadoop. I am evaluating serialization tools for the same. The top two in my list are Gson and Avro. As far as I understand, Avro = Gson + Schema-In-JSON If that is correct, I do not see…
user2250246 (reputation 3,807; badges 5, 43, 71)

31 votes, 1 answer

Hadoop speculative task execution

In Google's MapReduce paper, they have a backup task; I think it's the same thing as a speculative task in Hadoop. How is the speculative task implemented? When I start a speculative task, does the task start from the very beginning as the older and…
lil (reputation 2,527; badges 4, 22, 15)

31 votes, 12 answers

Working With Hadoop: localhost: Error: JAVA_HOME is not set

I'm working with Ubuntu 12.04 LTS. I'm going through the hadoop quickstart manual to make a pseudo-distributed operation. It seems simple and straightforward (easy!). However, when I try to run start-all.sh I get: localhost: Error: JAVA_HOME is…
Ali Ismail (reputation 311; badges 1, 3, 5)

31 votes, 6 answers

No such method exception Hadoop

When I am running a Hadoop .jar file from the command prompt, it throws an exception saying no such method StockKey method. StockKey is my custom class defined for my own type of key. Here is the exception: 12/07/12 00:18:47 INFO mapred.JobClient:…
London guy (reputation 27,522; badges 44, 121, 179)

30 votes, 5 answers

$HADOOP_HOME is deprecated

I started a hadoop cluster. I get this warning message: $HADOOP_HOME is deprecated. I already added export HADOOP_HOME_WARN_SUPPRESS="TRUE" to hadoop-env.sh. When I started the cluster, I no longer saw the warning message. However, when I run…
chnet (reputation 1,993; badges 9, 36, 51)

30 votes, 4 answers

Amazon Emr - What is the need of Task nodes when we have Core nodes?

I have been learning about Amazon EMR lately, and as far as I know, an EMR cluster lets us choose 3 node types. Master, which runs the primary Hadoop daemons like NameNode, Job Tracker and Resource manager. Core, which runs Datanode and Tasktracker…
Taher Koitawala (reputation 301; badges 1, 3, 6)

30 votes, 4 answers

How to restart yarn on AWS EMR

I am using Hadoop 2.6.0 (emr-4.2.0 image). I have made some changes in yarn-site.xml and want to restart yarn to bring the changes into effect. Is there a command using which I can do this?
nish (reputation 6,952; badges 18, 74, 128)

30 votes, 2 answers

Understand Spark: Cluster Manager, Master and Driver nodes

Having read this question, I would like to ask additional questions: The Cluster Manager is a long-running service; on which node is it running? Is it possible that the Master and the Driver nodes will be the same machine? I presume that there…
Rami (reputation 8,044; badges 18, 66, 108)

30 votes, 4 answers

What is the purpose of "uber mode" in hadoop?

Hi, I am a big data newbie. I searched all over the internet to find out what exactly uber mode is. The more I searched, the more confused I got. Can anybody please help me by answering my questions? What does uber mode do? Does it work differently in…
Mohammed Asad (reputation 979; badges 1, 8, 18)

30 votes, 4 answers

Display the SQL definition of a hive view

How can I display the definition of a Hive view in its SQL form? Most relational databases support commands like SHOW CREATE VIEW viewname;
rogue-one (reputation 11,259; badges 7, 53, 75)