Questions tagged [hadoop]

Hadoop is an Apache open-source project that provides software for reliable and scalable distributed computing. The core consists of a distributed file system (HDFS) and a resource manager (YARN). Various other open-source projects, such as Apache Hive, use Apache Hadoop as a persistence layer.


"Hadoop" typically refers to the software in the project that implements the map-reduce data analysis framework, plus the distributed file system (HDFS) that underlies it.

An HDFS cluster is built around a NameNode, which holds the filesystem namespace and block metadata; production clusters usually add a standby NameNode for redundancy. The actual file contents are stored as blocks on many DataNodes. A client application asks the NameNode which DataNodes hold the blocks of a file, then reads and writes those blocks directly from and to the DataNodes. Together these daemons form the Hadoop Distributed File System, or HDFS.

(Diagram: a Hadoop cluster, showing the NameNode and multiple DataNodes.)
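Client code usually goes through the org.apache.hadoop.fs.FileSystem abstraction rather than talking to these daemons directly. A minimal sketch, assuming core-site.xml on the classpath points fs.defaultFS at the cluster, and using a made-up path:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    // Assumption: fs.defaultFS (e.g. hdfs://namenode:8020) comes from
    // core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/tmp/hello.txt");  // hypothetical path

    // Write: the NameNode allocates blocks; the bytes stream to DataNodes.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello, hdfs\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read: block locations come from the NameNode, data from the DataNodes.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }
  }
}
```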

Apache Hadoop also works with other filesystems: the platform-specific "local" filesystem, blob stores such as Amazon S3 and Azure Storage, and alternative distributed filesystems. See: Filesystem Compatibility with Apache Hadoop.
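As an illustration, switching storage back ends is mostly a matter of the URI scheme. The sketch below lists objects in S3 through the s3a connector; the bucket name is hypothetical, and it assumes the hadoop-aws module is on the classpath and AWS credentials are configured:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListS3 {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Requires the hadoop-aws module plus credentials, e.g.
    // fs.s3a.access.key / fs.s3a.secret.key or an IAM role.
    FileSystem s3 = FileSystem.get(URI.create("s3a://my-bucket/"), conf);  // hypothetical bucket
    for (FileStatus status : s3.listStatus(new Path("s3a://my-bucket/data/"))) {
      System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
    }
  }
}
```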

Since version 0.23, Hadoop has shipped with a standalone resource manager: YARN.

This resource manager makes it easier to run other modules alongside the MapReduce engine, such as the projects below (a YarnClient sketch follows the list):

  • Accumulo, a sorted, distributed key/value store that provides robust, scalable data storage and retrieval.
  • Ambari, a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health (such as heatmaps) and for visually inspecting MapReduce, Pig, and Hive applications, with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro, a data serialization system based on JSON schemas.
  • Cassandra, a replicated, fault-tolerant, decentralized, and scalable database system.
  • Chukwa, a data collection system for managing large distributed systems.
  • Cascading, a software abstraction layer for Apache Hadoop aimed mainly at Java developers, built to reduce the boilerplate code that MapReduce programmers would otherwise write.
  • Flink, a fast and reliable large-scale data processing engine.
  • Giraph, an iterative graph processing framework built on top of Apache Hadoop.
  • HBase, a scalable, distributed database that supports structured data storage for large tables.
  • Hive, a data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout, a library of machine learning algorithms compatible with the MapReduce paradigm.
  • Oozie, a workflow scheduler system to manage Apache Hadoop jobs.
  • Pig, a platform and programming language for authoring parallelizable jobs.
  • Spark, a fast and general engine for large-scale data processing.
  • Storm, a system for real-time and stream processing.
  • Tez, an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN.
  • ZooKeeper, a system for coordinating distributed nodes, similar to Google's Chubby.
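As mentioned above, here is a small sketch of talking to that resource manager directly through the org.apache.hadoop.yarn.client.api.YarnClient API; it assumes yarn-site.xml on the classpath points at the ResourceManager. Every engine running on the cluster (MapReduce, Spark, Tez, and so on) shows up here as a YARN application:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApps {
  public static void main(String[] args) throws Exception {
    // Picks up the ResourceManager address from yarn-site.xml on the classpath.
    Configuration conf = new YarnConfiguration();
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(conf);
    yarn.start();
    try {
      // One ApplicationReport per application YARN knows about.
      for (ApplicationReport app : yarn.getApplications()) {
        System.out.printf("%s  %s  %s%n",
            app.getApplicationId(),
            app.getApplicationType(),
            app.getYarnApplicationState());
      }
    } finally {
      yarn.stop();
    }
  }
}
```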

Commercial support is available from a variety of companies.

44,316 questions
39 votes, 3 answers

Why do we need ZooKeeper in the Hadoop stack?

I am new to Hadoop/ZooKeeper. I cannot understand the purpose of using ZooKeeper with Hadoop. Is ZooKeeper writing data in Hadoop? If not, then why do we use ZooKeeper with Hadoop?
user1099871
38 votes, 6 answers

Why isn't Hadoop implemented using MPI?

Correct me if I'm wrong, but my understanding is that Hadoop does not use MPI for communication between different nodes. What are the technical reasons for this? I could hazard a few guesses, but I do not know enough of how MPI is implemented "under…
artif
38 votes, 2 answers

Is it better to have one large parquet file or lots of smaller parquet files?

I understand HDFS will split files into something like 64 MB chunks. We have data coming in streaming, and we can store it in large files or medium-sized files. What is the optimum size for columnar file storage? If I can store files to where the…
ForeverConfused
38 votes, 1 answer

What are HiveServer and Thrift Server?

I just started learning Hive. There are three terms I often see in Hive books and tutorials: Hive Server, Hive Service, and Thrift Server. What are these? How are they related? What is the difference? When is each of them used? Please…
38 votes, 14 answers

There are 0 datanode(s) running and no node(s) are excluded in this operation

I have set up a multi-node Hadoop cluster. The NameNode and Secondary NameNode run on the same machine, and the cluster has only one DataNode. All the nodes are configured on Amazon EC2 machines. Following are the configuration files on the master…
Learner
38 votes, 6 answers

Where does HDFS store files locally by default?

I am running Hadoop with the default configuration on a one-node cluster, and would like to find out where HDFS stores files locally. Any ideas? Thanks.
crypto5
38 votes, 7 answers

Download large data for Hadoop

I need a large data set (more than 10 GB) to run a Hadoop demo. Does anybody know where I can download one? Please let me know.
Nevis
37 votes, 3 answers

What is Google's Dremel? How is it different from MapReduce?

Google's Dremel is described here. What's the difference between Dremel and MapReduce?
Yktula
37 votes, 1 answer

Do exit codes and exit statuses mean anything in Spark?

I see exit codes and exit statuses all the time when running Spark on YARN. Here are a few: CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM ...failed 2 times due to AM Container for application_1431523563856_0001_000002 exited with…
makansij
37 votes, 1 answer

What is a keytab exactly?

I am trying to understand how Kerberos works and came across this file called a keytab, which, I believe, is used for authentication to the KDC server. Just like every user and service (say Hadoop) in a Kerberos realm has a service principal, does…
white-hawk-73
37 votes, 4 answers

How to know what is the reason for ClosedChannelExceptions with spark-shell in YARN client mode?

I have been trying to run spark-shell in YARN client mode, but I am getting a lot of ClosedChannelException errors. I am using the Spark 2.0.0 build for Hadoop 2.6. Here are the exceptions: $ spark-2.0.0-bin-hadoop2.6/bin/spark-shell --master yarn…
aks
37 votes, 7 answers

Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

I'm trying to run the Spark examples from Eclipse and getting this generic error: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources. The version I have is…
Eddy
37 votes, 6 answers

Pyspark: get list of files/directories on HDFS path

As per the title. I'm aware of textFile but, as the name suggests, it works only on text files. I would need to access files/directories inside a path on either HDFS or a local path. I'm using pyspark.
Federico Ponzi
37 votes, 10 answers

Cannot read a file from HDFS using Spark

I have installed Cloudera CDH 5 using Cloudera Manager. I can easily do hadoop fs -ls /input/war-and-peace.txt and hadoop fs -cat /input/war-and-peace.txt; the latter command prints the whole text file on the console. Now I start the spark shell…
Knows Not Much
37 votes, 5 answers

Loading data from a .txt file to a table stored as ORC in Hive

I have a data file in .txt format, which I am using to load data into Hive tables. When I load the file into a table like CREATE TABLE test_details_txt( visit_id INT, store_id SMALLINT) STORED AS TEXTFILE; the data is loaded correctly…
Neels