Questions tagged [hadoop]

Hadoop is an Apache open-source project that provides software for reliable and scalable distributed computing. The core consists of a distributed file system (HDFS) and a resource manager (YARN). Various other open-source projects, such as Apache Hive, use Apache Hadoop as a persistence layer.


"Hadoop" typically refers to the software in the project that implements the map-reduce data analysis framework, plus the distributed file system (HDFS) that underlies it.

An HDFS cluster is built around a NameNode, which holds the filesystem namespace and block metadata; production clusters usually add a standby NameNode for redundancy. The actual file contents are stored as blocks on many DataNodes. A client application asks the NameNode which DataNodes hold the blocks of a file, then reads and writes those blocks directly from and to the DataNodes. Together these daemons form the Hadoop Distributed File System, or HDFS.

(Diagram: a Hadoop cluster, showing the NameNode and multiple DataNodes.)
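Client code usually goes through the org.apache.hadoop.fs.FileSystem abstraction rather than talking to these daemons directly. A minimal sketch, assuming core-site.xml on the classpath points fs.defaultFS at the cluster, and using a made-up path:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    // Assumption: fs.defaultFS (e.g. hdfs://namenode:8020) comes from
    // core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/tmp/hello.txt");  // hypothetical path

    // Write: the NameNode allocates blocks; the bytes stream to DataNodes.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello, hdfs\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read: block locations come from the NameNode, data from the DataNodes.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }
  }
}
```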

Apache Hadoop also works with other filesystems: the platform-specific "local" filesystem, blob stores such as Amazon S3 and Azure Storage, and alternative distributed filesystems. See: Filesystem Compatibility with Apache Hadoop.
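As an illustration, switching storage back ends is mostly a matter of the URI scheme. The sketch below lists objects in S3 through the s3a connector; the bucket name is hypothetical, and it assumes the hadoop-aws module is on the classpath and AWS credentials are configured:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListS3 {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Requires the hadoop-aws module plus credentials, e.g.
    // fs.s3a.access.key / fs.s3a.secret.key or an IAM role.
    FileSystem s3 = FileSystem.get(URI.create("s3a://my-bucket/"), conf);  // hypothetical bucket
    for (FileStatus status : s3.listStatus(new Path("s3a://my-bucket/data/"))) {
      System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
    }
  }
}
```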

Since version 0.23, Hadoop has shipped with a standalone resource manager: YARN.

This resource manager makes it easier to run other modules alongside the MapReduce engine, such as the projects below (a YarnClient sketch follows the list):

  • Accumulo, a sorted, distributed key/value store that provides robust, scalable data storage and retrieval.
  • Ambari, a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health (such as heatmaps) and for visually inspecting MapReduce, Pig, and Hive applications, with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro, a data serialization system based on JSON schemas.
  • Cassandra, a replicated, fault-tolerant, decentralized, and scalable database system.
  • Chukwa, a data collection system for managing large distributed systems.
  • Cascading, a software abstraction layer for Apache Hadoop aimed mainly at Java developers, built to reduce the boilerplate code that MapReduce programmers would otherwise write.
  • Flink, a fast and reliable large-scale data processing engine.
  • Giraph, an iterative graph processing framework built on top of Apache Hadoop.
  • HBase, a scalable, distributed database that supports structured data storage for large tables.
  • Hive, a data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout, a library of machine learning algorithms compatible with the MapReduce paradigm.
  • Oozie, a workflow scheduler system to manage Apache Hadoop jobs.
  • Pig, a platform and programming language for authoring parallelizable jobs.
  • Spark, a fast and general engine for large-scale data processing.
  • Storm, a system for real-time and stream processing.
  • Tez, an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN.
  • ZooKeeper, a system for coordinating distributed nodes, similar to Google's Chubby.
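As mentioned above, here is a small sketch of talking to that resource manager directly through the org.apache.hadoop.yarn.client.api.YarnClient API; it assumes yarn-site.xml on the classpath points at the ResourceManager. Every engine running on the cluster (MapReduce, Spark, Tez, and so on) shows up here as a YARN application:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApps {
  public static void main(String[] args) throws Exception {
    // Picks up the ResourceManager address from yarn-site.xml on the classpath.
    Configuration conf = new YarnConfiguration();
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(conf);
    yarn.start();
    try {
      // One ApplicationReport per application YARN knows about.
      for (ApplicationReport app : yarn.getApplications()) {
        System.out.printf("%s  %s  %s%n",
            app.getApplicationId(),
            app.getApplicationType(),
            app.getYarnApplicationState());
      }
    } finally {
      yarn.stop();
    }
  }
}
```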

Commercial support is available from a variety of companies.

44,316 questions
39 votes, 3 answers

Why do we need ZooKeeper in the Hadoop stack?

I am new to Hadoop/ZooKeeper. I cannot understand the purpose of using ZooKeeper with Hadoop. Is ZooKeeper writing data in Hadoop? If not, then why do we use ZooKeeper with Hadoop?
user1099871
38 votes, 6 answers

Why isn't Hadoop implemented using MPI?

Correct me if I'm wrong, but my understanding is that Hadoop does not use MPI for communication between different nodes. What are the technical reasons for this? I could hazard a few guesses, but I do not know enough of how MPI is implemented "under…
artif
38 votes, 2 answers

Is it better to have one large parquet file or lots of smaller parquet files?

I understand HDFS will split files into something like 64 MB chunks. We have data coming in streaming, and we can store it in large files or medium-sized files. What is the optimum size for columnar file storage? If I can store files to where the…
ForeverConfused
38 votes, 1 answer

What are HiveServer and Thrift Server?

I just started learning Hive. There are three terms I often see in Hive books and tutorials: Hive Server, Hive Service, and Thrift Server. What are these? How are they related? What is the difference? When is each of them used? Please…
38 votes, 14 answers

There are 0 datanode(s) running and no node(s) are excluded in this operation

I have set up a multi-node Hadoop cluster. The NameNode and Secondary NameNode run on the same machine, and the cluster has only one DataNode. All the nodes are configured on Amazon EC2 machines. Following are the configuration files on the master…
Learner
38 votes, 6 answers

Where does HDFS store files locally by default?

I am running Hadoop with the default configuration on a one-node cluster, and would like to find out where HDFS stores files locally. Any ideas? Thanks.
crypto5
38 votes, 7 answers

Download large data for Hadoop

I need a large data set (more than 10 GB) to run a Hadoop demo. Does anybody know where I can download one? Please let me know.
Nevis
37 votes, 3 answers

What is Google's Dremel? How is it different from MapReduce?

Google's Dremel is described here. What's the difference between Dremel and MapReduce?
Yktula
37 votes, 1 answer

Do exit codes and exit statuses mean anything in Spark?

I see exit codes and exit statuses all the time when running Spark on YARN. Here are a few: CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM ...failed 2 times due to AM Container for application_1431523563856_0001_000002 exited with…
makansij
37 votes, 1 answer

What is a keytab exactly?

I am trying to understand how Kerberos works and came across this file called a keytab, which, I believe, is used for authentication to the KDC server. Just like every user and service (say Hadoop) in a Kerberos realm has a service principal, does…
white-hawk-73
37 votes, 4 answers

How to know what is the reason for ClosedChannelExceptions with spark-shell in YARN client mode?

I have been trying to run spark-shell in YARN client mode, but I am getting a lot of ClosedChannelException errors. I am using the Spark 2.0.0 build for Hadoop 2.6. Here are the exceptions: $ spark-2.0.0-bin-hadoop2.6/bin/spark-shell --master yarn…
aks
37 votes, 7 answers

Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

I'm trying to run the Spark examples from Eclipse and getting this generic error: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources. The version I have is…
Eddy
37 votes, 6 answers

Pyspark: get list of files/directories on HDFS path

As per the title. I'm aware of textFile but, as the name suggests, it works only on text files. I would need to access files/directories inside a path on either HDFS or a local path. I'm using pyspark.
Federico Ponzi
37 votes, 10 answers

Cannot read a file from HDFS using Spark

I have installed Cloudera CDH 5 using Cloudera Manager. I can easily do hadoop fs -ls /input/war-and-peace.txt and hadoop fs -cat /input/war-and-peace.txt; the latter command prints the whole text file on the console. Now I start the spark shell…
Knows Not Much
37 votes, 5 answers

Loading data from a .txt file to a table stored as ORC in Hive

I have a data file in .txt format, which I am using to load data into Hive tables. When I load the file into a table like CREATE TABLE test_details_txt( visit_id INT, store_id SMALLINT) STORED AS TEXTFILE; the data is loaded correctly…
Neels