Questions tagged [hadoop]

Hadoop is an Apache open-source project that provides software for reliable and scalable distributed computing. The core consists of a distributed file system (HDFS) and a resource manager (YARN). Various other open-source projects, such as Apache Hive, use Apache Hadoop as a persistence layer.

"Hadoop" typically refers to the software in the project that implements the MapReduce data-analysis framework, plus the distributed file system (HDFS) that underlies it.
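
As a rough sketch of the map-reduce model (plain Python standing in for the actual Hadoop Java API), a word count can be expressed as a map phase that emits key/value pairs, a shuffle that groups values by key, and a reduce phase that aggregates each group:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) for every word in every input split."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group intermediate values by key, as the framework does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

splits = ["hadoop stores data", "hadoop processes data"]
counts = reduce_phase(shuffle(map_phase(splits)))
# counts == {"hadoop": 2, "stores": 1, "data": 2, "processes": 1}
```

In real Hadoop the map and reduce functions run on different nodes and the shuffle moves data over the network; the dataflow, however, is exactly this three-step shape.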

There is a NameNode; typically at least one, and usually more than one for redundancy. The NameNode accepts requests coming in from client applications and hands the processing off to DataNodes; a cluster typically has many DataNodes, which share the processing work between them. They can do this because they all have access to a shared file system, typically referred to as the Hadoop Distributed File System, or HDFS.
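
HDFS stores files as fixed-size blocks (128 MB by default in recent versions) and replicates each block on several DataNodes (three by default). The toy model below is for illustration only: it ignores HDFS's rack-aware placement policy and simply assigns replicas round-robin, to show the bookkeeping the NameNode maintains:

```python
def plan_blocks(file_size, block_size, datanodes, replication=3):
    """Toy model: split a file into blocks and assign each block's
    replicas to distinct DataNodes round-robin. Real HDFS placement
    is rack-aware; this only illustrates the bookkeeping."""
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(min(replication, len(datanodes)))]
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4"]
plan = plan_blocks(file_size=300, block_size=128, datanodes=nodes)
# plan == {0: ["dn1", "dn2", "dn3"],
#          1: ["dn2", "dn3", "dn4"],
#          2: ["dn3", "dn4", "dn1"]}
```

The key property the sketch preserves is that each replica of a block lands on a distinct node, so the loss of any single DataNode never loses a block.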

Description of Hadoop cluster

Apache Hadoop also works with other filesystems: the platform-specific "local" filesystem, blob stores such as Amazon S3 and Azure Storage, and alternative distributed filesystems. See: Filesystem Compatibility with Apache Hadoop.
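
Which filesystem backs a path is determined by the path's URI scheme. As a rough illustration (the handler names below are placeholders for Hadoop's real FileSystem implementations, which are registered under configuration keys of the form fs.<scheme>.impl), scheme-based dispatch can be sketched as:

```python
from urllib.parse import urlparse

# Hypothetical handler names standing in for Hadoop's FileSystem
# implementations, selected by the scheme of the path URI.
HANDLERS = {
    "hdfs": "DistributedFileSystem",
    "s3a":  "S3AFileSystem",
    "file": "LocalFileSystem",
}

def resolve_filesystem(path):
    scheme = urlparse(path).scheme or "file"  # no scheme -> local filesystem
    try:
        return HANDLERS[scheme]
    except KeyError:
        raise ValueError(f"no filesystem registered for scheme {scheme!r}")

resolve_filesystem("hdfs://namenode:8020/user/data")  # -> "DistributedFileSystem"
resolve_filesystem("s3a://bucket/key")                # -> "S3AFileSystem"
```

This is why the same MapReduce or Spark job can read from HDFS, S3, or the local disk just by changing the path prefix.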

Since version 0.23, Hadoop ships with a standalone resource manager: YARN.

This resource manager makes it easier to use other modules alongside the MapReduce engine, such as:

  • Accumulo, a sorted, distributed key/value store that provides robust, scalable data storage and retrieval.
  • Ambari, a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health (e.g. heatmaps) and the ability to view MapReduce, Pig, and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro, a data serialization system based on JSON schemas.
  • Cassandra, a replicated, fault-tolerant, decentralized, and scalable database system.
  • Chukwa, a data collection system for managing large distributed systems.
  • Cascading, a software abstraction layer for Apache Hadoop aimed mainly at Java developers, designed to reduce the boilerplate code that MapReduce programmers with Java skills would otherwise write.
  • Flink, a fast and reliable large-scale data processing engine.
  • Giraph, an iterative graph processing framework built on top of Apache Hadoop.
  • HBase, a scalable, distributed database that supports structured data storage for large tables.
  • Hive, a data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout, a library of machine learning algorithms compatible with the MapReduce paradigm.
  • Oozie, a workflow scheduler system to manage Apache Hadoop jobs.
  • Pig, a platform and programming language for authoring parallelizable jobs.
  • Spark, a fast and general engine for large-scale data processing.
  • Storm, a system for real-time and stream processing.
  • Tez, an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN.
  • ZooKeeper, a system for coordinating distributed nodes, similar to Google's Chubby.

Commercial support is available from a variety of companies.

44316 questions
8 votes, 1 answer

How to run spark-shell with YARN in client mode?

I've installed spark-1.6.1-bin-hadoop2.6.tgz on a 15-node Hadoop cluster. All nodes run Java 1.8.0_72 and the latest version of Hadoop. The Hadoop cluster itself is functional, e.g. YARN can run various MapReduce jobs successfully. I can run Spark…
Emre Sevinç
8 votes, 1 answer

Can't find start-all.sh in Hadoop installation

I am trying to setup hadoop on my local machine and was following this. I have setup hadoop home also This is the command I am trying to run now hduser@ubuntu:~$ /usr/local/hadoop/bin/start-all.sh And this is the error I get -su:…
Legendary_Hunter
8 votes, 3 answers

Multi-node Hadoop cluster with Docker

I am in planning phase of a multi-node Hadoop cluster in a Docker based environment. So it should be based on a lightweight easy to use virtualized system. Current architecture (regarding to documentation) contains 1 master and 3 slave nodes. This…
user4725754
8 votes, 1 answer

Python package installation: pip vs yum, or both together?

I've just started administering a Hadoop cluster. We're using Bright Cluster Manager up to the O/S level (CentOS 7.1) and then Ambari together with Hortonworks HDP 2.3 for Hadoop. I'm constantly getting requests for new python modules to be…
ClusterAdmin
8 votes, 2 answers

Passing additional parameters to dbConnect function for JDBCDriver in R

I am trying to connect to HiveServer2 via JDBC drivers from R using RJDBC package. I have seen a broad explanation on passing additional arguments to dbConnect wrapper for various drivers(What arguments can I pass to dbConnect?), but there appear…
Marcin
8 votes, 2 answers

YarnApplicationState: ACCEPTED: waiting for AM container to be allocated, launched and register

I am new to Hadoop ecosystem. I recently tried Hadoop (2.7.1) on a single-node Cluster without any problems and decided to move on to a Multi-node cluster having 1 namenode and 2 datanodes. However I am facing a weird issue. Whatever Jobs that I try…
Ashesh
8 votes, 3 answers

Presto unnest json

Following this question: how to cross join unnest a json array in presto. I tried to run the example provided but I get an error while doing so. The SQL command: select x.n from unnest(cast(json_extract('{"payload":[{"type":"b","value":"9"},…
Lior Baber
8 votes, 2 answers

How does Hadoop's RunJar method distribute class/jar files across nodes?

I'm trying to use JIT compilation in clojure to generate mapper and reducer classes on the fly. However, these classes aren't being recognized by the JobClient (it's the usual ClassNotFoundException.) If I AOT compile the Mapper,Reducer and Tool,…
Jieren
8 votes, 2 answers

How many types of InputFormat are there in Hadoop?

I'm new to Hadoop and wondering how many types of InputFormat are there in Hadoop such as TextInputFormat? Is there a certain InputFormat that I can use to read files via http requests to remote data servers? Thanks :)
Trams
8 votes, 2 answers

How HBase partitions table across regionservers?

Please tell me how HBase partitions table across regionservers. For example, let's say my row keys are integers from 0 to 10M and I have 10 regionservers. Does this mean that first regionserver will store all rows with keys with values 0 - 10M,…
wlk
8 votes, 2 answers

Hadoop log4j not working as No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory)

I am working on developing mapreduce using eclipse , and trying to test it using hadoop 2.6.0 windows standalone mode. But getting the below error for log4j, How to fix the below appender problem, No appenders could be found for logger…
RBanerjee
8 votes, 2 answers

How to load a Kafka topic to HDFS?

I am using hortonworks sandbox. creating topic: ./kafka-topics.sh --create --zookeeper 10.25.3.207:2181 --replication-factor 1 --partitions 1 --topic lognew tailing the apache access log directory: tail -f /var/log/httpd/access_log…
Deepthy
8 votes, 2 answers

Making spark use /etc/hosts file for binding in YARN cluster mode

Have a spark cluster setup on a machine with two inets, one public another private. The /etc/hosts file in the cluster has the internal ip of all the other machines in the cluster, like so. internal_ip FQDN However when I request a SparkContext…
HackToHell
8 votes, 1 answer

Why does YARN job not transition to RUNNING state?

I've got a number of Samza jobs that I want to run. I can get the first to run ok. However, the second job seems to sit at the ACCEPTED state and never transitions into the RUNNING state until I kill the first job. Here is the view from the YARN…
John
8 votes, 2 answers

Persisting Spark Streaming output

I'm collecting the data from a messaging app, I'm currently using Flume, it sends approx 50 Million records per day I wish to use Kafka, consume from Kafka using Spark Streaming and persist it to hadoop and query with impala I'm having issues with…
ohallc