Questions tagged [hadoop]

Hadoop is an Apache open-source project that provides software for reliable and scalable distributed computing. The core consists of a distributed file system (HDFS) and a resource manager (YARN). Various other open-source projects, such as Apache Hive, use Apache Hadoop as a persistence layer.

The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.

"Hadoop" typically refers to the software in the project that implements the map-reduce data analysis framework, plus the distributed file system (HDFS) that underlies it.
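As a rough illustration of the map-reduce model mentioned above, here is a minimal single-process sketch in plain Python. It does not use any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for this example. On a real cluster the framework runs the map and reduce steps on many nodes and performs the grouping step itself.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key (done by the framework between map and reduce)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["hadoop stores data", "hadoop processes data"])
print(counts)
```

The point of the split is that each `map_phase` call depends only on its own line and each `reduce_phase` call only on its own key, which is what allows the work to be distributed.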

A Hadoop cluster has a NameNode; there is at least one, and usually more than one for redundancy. The NameNode accepts requests from client applications and directs them to the DataNodes. A cluster typically has many DataNodes, which share the processing work between them. They can do this because they all have access to a shared file system, the Hadoop Distributed File System (HDFS).

Apache Hadoop also works with other filesystems: the platform-specific "local" filesystem, blob stores such as Amazon S3 and Azure Storage, and alternative distributed filesystems. See: Filesystem Compatibility with Apache Hadoop.

Since version 0.23, Hadoop has included a standalone resource manager: YARN.

This resource manager makes it easier to use other modules alongside the MapReduce engine, such as:

Accumulo, a sorted, distributed key/value store that provides robust, scalable data storage and retrieval.
Ambari, A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health such as heatmaps and ability to view MapReduce, Pig and Hive applications visually along with features to diagnose their performance characteristics in a user-friendly manner.
Avro, a data serialization system based on JSON schemas.
Cassandra, a replicated, fault-tolerant, decentralized and scalable database system.
Chukwa: A data collection system for managing large distributed systems.
Cascading: Cascading is a software abstraction layer for Apache Hadoop and it mainly targets Java developers. The framework has been developed to reduce the effort of writing boilerplate code by MapReduce programmers with Java skills.
Flink, a fast and reliable large-scale data processing engine.
Giraph is an iterative graph processing framework, built on top of Apache Hadoop
HBase, A scalable, distributed database that supports structured data storage for large tables.
Hive, A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout, a library of machine learning algorithms compatible with M/R paradigm.
Oozie, a workflow scheduler system to manage Apache Hadoop jobs.
Pig, a platform/programming language for authoring parallelizable jobs
Spark, a fast and general engine for large-scale data processing.
Storm, a system for real-time and stream processing
Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN.
ZooKeeper, a system for coordinating distributed nodes, similar to Google's Chubby

References

Online Tutorials

Related Tags

Hadoop

Related Technology

Commercial support is available from a variety of companies.

44316 questions


1 answer

Accessing stream output from hdfs of MRjob

I'm trying to use a Python driver to run an iterative MRjob program. The exit criteria depend on a counter. The job itself seems to run. If I run a single iteration from the command line, I can then hadoop fs -cat /user/myname/myhdfsdir/part-00000…

python hadoop mapreduce hdfs mrjob

asked Mar 25 '18 at 04:10

tony_tiger


2 answers

What does msck stands for in Msck repair command

The Hive MSCK REPAIR command is used to repair partitions, but what is the full form of MSCK? I already tried to find it in the Hive docs, but had no luck.

hadoop hive hiveql

asked Dec 30 '17 at 15:36

Kaustubh Deshpande


6 answers

How to delete files from the HDFS?

I just downloaded the Hortonworks sandbox VM, which comes with Hadoop version 2.7.1. I add some files using the hadoop fs -put /hw1/* /hw1 ... command. Afterwards I delete the added files with the hadoop fs -rm /hw1/* ... command,…

hadoop hdfs hortonworks-data-platform

asked Dec 07 '15 at 18:12

serg


6 answers

Parquet without Hadoop?

I want to use Parquet in one of my projects as columnar storage, but I don't want to depend on the Hadoop/HDFS libs. Is it possible to use Parquet outside of HDFS? Or what is the minimum dependency?

hadoop hdfs parquet

asked Mar 26 '15 at 13:35

capacman


7 answers

Is there an equivalent to `pwd` in hdfs?

I tried to do hdfs dfs -pwd, but that command does not exist. So currently I am resorting to doing hdfs dfs -ls .. followed by hdfs dfs -ls ../... I also looked at the command listing for hdfs dfs but did not see anything that looked promising. Is…

hadoop hdfs

asked Feb 03 '14 at 23:05

merlin2011


3 answers

What is the advantage of storing schema in avro?

We need to serialize some data for putting into Solr as well as Hadoop. I am evaluating serialization tools for this. The top two on my list are Gson and Avro. As far as I understand, Avro = Gson + Schema-In-JSON. If that is correct, I do not see…

java apache hadoop solr avro

asked Dec 12 '13 at 23:25

user2250246


1 answer

Hadoop speculative task execution

Google's MapReduce paper describes a backup task, which I think is the same thing as a speculative task in Hadoop. How is the speculative task implemented? When I start a speculative task, does the task start from the very beginning as the older and…

hadoop mapreduce

asked Mar 01 '13 at 18:56

lil


12 answers

Working With Hadoop: localhost: Error: JAVA_HOME is not set

I'm working with Ubuntu 12.04 LTS. I'm going through the hadoop quickstart manual to make a pseudo-distributed operation. It seems simple and straightforward (easy!). However, when I try to run start-all.sh I get: localhost: Error: JAVA_HOME is…

bash hadoop ubuntu-12.04 java-home

asked Jan 14 '13 at 19:52

Ali Ismail


6 answers

No such method exception Hadoop

When I am running a Hadoop .jar file from the command prompt, it throws an exception saying no such method StockKey method. StockKey is my custom class defined for my own type of key. Here is the exception: 12/07/12 00:18:47 INFO mapred.JobClient:…

java hadoop mapreduce

asked Jul 12 '12 at 07:08

London guy


5 answers

$HADOOP_HOME is deprecated

I started a Hadoop cluster and got this warning message: $HADOOP_HOME is deprecated. I already added export HADOOP_HOME_WARN_SUPPRESS="TRUE" to hadoop-env.sh. When I start the cluster, I no longer see the warning message. However, when I run…

hadoop warnings deprecated

asked Feb 15 '12 at 02:07

chnet


4 answers

Amazon Emr - What is the need of Task nodes when we have Core nodes?

I am learning about Amazon EMR lately, and according to my knowledge the EMR cluster lets us choose 3 nodes. Master which runs the Primary Hadoop daemons like NameNode,Job Tracker and Resource manager. Core which runs Datanode and Tasktracker…

hadoop hadoop2 amazon-emr

asked Jan 07 '17 at 08:23

Taher Koitawala


4 answers

How to restart yarn on AWS EMR

I am using Hadoop 2.6.0 (emr-4.2.0 image). I have made some changes in yarn-site.xml and want to restart yarn to bring the changes into effect. Is there a command using which I can do this?

hadoop hadoop-yarn emr

asked Jan 22 '16 at 18:11

nish


2 answers

Understand Spark: Cluster Manager, Master and Driver nodes

Having read this question, I would like to ask additional questions: The Cluster Manager is a long-running service; on which node is it running? Is it possible that the Master and the Driver nodes will be the same machine? I presume that there…

hadoop apache-spark hadoop-yarn failover apache-spark-standalone

asked Jan 11 '16 at 13:10

Rami


4 answers

What is the purpose of "uber mode" in hadoop?

Hi, I am a big data newbie. I searched all over the internet to find out what exactly uber mode is. The more I searched, the more confused I got. Can anybody please help me by answering my questions? What does uber mode do? Does it work differently in…

hadoop mapreduce

asked May 17 '15 at 06:58

Mohammed Asad


4 answers

Display the SQL definition of a hive view

How can I display the definition of a Hive view in its SQL form? Most relational databases support commands like SHOW CREATE VIEW viewname;

hadoop hive

asked Jul 04 '14 at 19:13

rogue-one

