Questions tagged [hadoop]

Hadoop is an Apache open-source project that provides software for reliable and scalable distributed computing. The core consists of a distributed file system (HDFS) and a resource manager (YARN). Various other open-source projects, such as Apache Hive, use Apache Hadoop as a persistence layer.

"Hadoop" typically refers to the software in the project that implements the MapReduce data-analysis framework, plus the distributed file system (HDFS) that underlies it.

A cluster has at least one NameNode, and usually more than one for redundancy. The NameNode holds the filesystem metadata and fields requests coming in from client applications, directing them to the DataNodes; a cluster typically has many DataNodes, which share the storage and processing work between them. What makes this possible is that they all have access to a shared file system, typically referred to as the Hadoop Distributed File System, or HDFS.

[Diagram: description of a Hadoop cluster]

Apache Hadoop also works with other filesystems: the platform-specific "local" filesystem, blob stores such as Amazon S3 and Azure Storage, and alternative distributed filesystems. See: Filesystem Compatibility with Apache Hadoop.
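
For example, the same client code can target any of these filesystems just by changing the URI scheme. A minimal Java sketch, listing a directory over HDFS; the namenode address and paths are hypothetical, and file:/// or s3a://bucket/... would work the same way:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListDir {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // The scheme in the URI picks the FileSystem implementation;
            // "hdfs://namenode:8020" is a placeholder cluster address.
            FileSystem fs = FileSystem.get(new java.net.URI("hdfs://namenode:8020/"), conf);
            for (FileStatus status : fs.listStatus(new Path("/user/hadoop"))) {
                System.out.println(status.getPath() + "\t" + status.getLen());
            }
        }
    }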

Since version 0.23, Hadoop has included a standalone resource manager: YARN.

This resource manager makes it easier to run other modules alongside the MapReduce engine, such as:

  • Accumulo, a sorted, distributed key/value store that provides robust, scalable data storage and retrieval.
  • Ambari, a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health (such as heatmaps) and for viewing MapReduce, Pig, and Hive applications visually, with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro, a data serialization system based on JSON schemas.
  • Cassandra, a replicated, fault-tolerant, decentralized, and scalable database system.
  • Chukwa, a data collection system for managing large distributed systems.
  • Cascading, a software abstraction layer for Apache Hadoop aimed mainly at Java developers; it reduces the boilerplate code that MapReduce programmers with Java skills would otherwise write by hand.
  • Flink, a fast and reliable large-scale data processing engine.
  • Giraph, an iterative graph processing framework built on top of Apache Hadoop.
  • HBase, a scalable, distributed database that supports structured data storage for large tables.
  • Hive, a data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout, a library of machine learning algorithms compatible with the MapReduce paradigm.
  • Oozie, a workflow scheduler system to manage Apache Hadoop jobs.
  • Pig, a platform and programming language for authoring parallelizable jobs.
  • Spark, a fast and general engine for large-scale data processing.
  • Storm, a system for real-time and stream processing.
  • Tez, an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN.
  • ZooKeeper, a system for coordinating distributed nodes, similar to Google's Chubby.

Commercial support is available from a variety of companies.

44,316 questions. A selection of the top-voted ones:

Explode the Array of Struct in Hive (45 votes, 3 answers)

This is the below Hive Table CREATE EXTERNAL TABLE IF NOT EXISTS SampleTable ( USER_ID BIGINT, NEW_ITEM ARRAY> ) And this is the data in the above table- 1015826235 …
(asked by arsenal)
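
LATERAL VIEW explode() is the standard Hive construct for this: it turns each array element into a row of its own. A hedged sketch through the HiveServer2 JDBC driver; the server address and the t/item aliases are assumptions, and the element type of NEW_ITEM is left opaque because the DDL above is truncated:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ExplodeExample {
        public static void main(String[] args) throws Exception {
            // Requires the hive-jdbc driver on the classpath; the
            // HiveServer2 address below is a placeholder.
            try (Connection conn = DriverManager.getConnection("jdbc:hive2://hiveserver:10000/default");
                 Statement stmt = conn.createStatement()) {
                // explode() emits one row per array element; LATERAL VIEW
                // joins those rows back to the source row's other columns.
                ResultSet rs = stmt.executeQuery(
                    "SELECT user_id, item FROM SampleTable "
                    + "LATERAL VIEW explode(new_item) t AS item");
                while (rs.next()) {
                    // Struct values come back as strings over JDBC.
                    System.out.println(rs.getLong(1) + "\t" + rs.getString(2));
                }
            }
        }
    }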

Hadoop java.io.IOException: Mkdirs failed to create /some/path (45 votes, 8 answers)

When I try to run my Job I am getting the following exception: Exception in thread "main" java.io.IOException: Mkdirs failed to create /some/path at org.apache.hadoop.util.RunJar.ensureDirectory(RunJar.java:106) at…
(asked by alien01)
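
For context, RunJar.ensureDirectory() throws this when it cannot create the directory it unpacks the job jar into, which commonly means hadoop.tmp.dir points somewhere unwritable (one cause among several). A sketch mirroring the failing check, with a hypothetical path:

    import java.io.File;
    import java.io.IOException;

    public class EnsureDir {
        // mkdirs() returns false both on permission failures and when the
        // path already exists as a plain file, so both must be checked.
        static void ensureDirectory(File dir) throws IOException {
            if (!dir.mkdirs() && !dir.isDirectory()) {
                throw new IOException("Mkdirs failed to create " + dir);
            }
        }

        public static void main(String[] args) throws IOException {
            ensureDirectory(new File("/some/path")); // placeholder path
        }
    }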

Cleanest way in Gradle to get the path to a jar file in the Gradle dependency cache (44 votes, 5 answers)

I'm using Gradle to help automate Hadoop tasks. When calling Hadoop, I need to be able to pass it the path to some jars that my code depends on so that Hadoop can send that dependency on during the map/reduce phase. I've figured out something that…
(asked by Ted Naleid)

Save Spark dataframe as dynamic partitioned table in Hive (44 votes, 7 answers)

I have a sample application working to read from csv files into a dataframe. The dataframe can be stored to a Hive table in parquet format using the method df.saveAsTable(tablename, mode). The above code works fine, but I have so much data for each…
(asked by Chetandalal)
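
With the current DataFrameWriter API (rather than the older df.saveAsTable(tablename, mode) form quoted above), dynamic partitioning is expressed with partitionBy(). A sketch, assuming a hypothetical partition column and input path:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class PartitionedSave {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("partitioned-save")
                    .enableHiveSupport()   // needed for saveAsTable into Hive
                    .getOrCreate();
            // "date_col" and the input path are placeholders.
            Dataset<Row> df = spark.read()
                    .option("header", "true")
                    .csv("/data/input.csv");
            // One Hive partition is written per distinct value of date_col,
            // so later runs can append new partitions incrementally.
            df.write()
              .mode(SaveMode.Append)
              .partitionBy("date_col")
              .saveAsTable("sample_table");
        }
    }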

What is the difference between a partition and a replica of a topic in a Kafka cluster? (44 votes, 8 answers)

What is the difference between a partition and a replica of a topic in a Kafka cluster? Both seem to store copies of the messages in a topic, so what is the real difference?
(asked by Gaurav Khare)
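
In short: partitions split a topic's messages across brokers so consumers can work in parallel, while replicas are redundant copies of each partition kept on other brokers for fault tolerance; a message lands in exactly one partition but exists on every replica of that partition. A sketch with the Kafka AdminClient (broker address and topic name are placeholders):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // placeholder
            try (AdminClient admin = AdminClient.create(props)) {
                // 3 partitions give parallelism; replication factor 2 keeps
                // one extra copy of each partition on another broker.
                NewTopic topic = new NewTopic("events", 3, (short) 2);
                admin.createTopics(Collections.singletonList(topic)).all().get();
            }
        }
    }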

How to get the input file name in the mapper in a Hadoop program? (44 votes, 11 answers)

How can I get the name of the input file within a mapper? I have multiple input files stored in the input directory, each mapper may read a different file, and I need to know which file the mapper has read.
(asked by HHH)
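
For file-based input formats, the standard answer is to cast the mapper's input split to FileSplit. A minimal sketch; note the cast is an assumption about the job's input format and fails for, e.g., CombineFileInputFormat:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class FileNameMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each map task processes one split; for file-based formats
            // the split carries the path of the file being read.
            String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
            context.write(new Text(fileName), value);
        }
    }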

What are _SUCCESS and part-r-00000 files in Hadoop? (44 votes, 1 answer)

Although I use Hadoop frequently on my Ubuntu machine I have never thought about the _SUCCESS and part-r-00000 files. The output always resides in the part-r-00000 file, but what is the use of the _SUCCESS file? Why does the output file have the name part-r-0000?…
(asked by ravi)
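
For background: _SUCCESS is an empty marker file that FileOutputCommitter writes into the output directory once a job completes successfully, and part-r-00000 is the output of reducer number 0 (map-only jobs write part-m-* files instead). A sketch of how a downstream consumer might use the marker, with a hypothetical output path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CheckSuccess {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path output = new Path("/user/hadoop/output"); // placeholder
            // The marker's presence means all part-r-* files are complete,
            // so it is safe for a follow-up job to start reading them.
            if (fs.exists(new Path(output, "_SUCCESS"))) {
                System.out.println("Job output is complete and safe to read.");
            }
        }
    }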

How to write 'map only' Hadoop jobs? (43 votes, 4 answers)

I'm a novice at Hadoop, getting familiar with the MapReduce style of programming, but now I face a problem: sometimes I need only the map phase for a job, with the map result written directly as output, meaning no reduce phase is needed. How…
(asked by Breakinen)
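
The standard answer is job.setNumReduceTasks(0): with zero reducers the shuffle and sort phases are skipped entirely and mapper output is written straight to the output directory as part-m-* files. A minimal sketch using the identity Mapper (substitute your own mapper class):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MapOnlyJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "map-only");
            job.setJarByClass(MapOnlyJob.class);
            // The base Mapper passes (offset, line) pairs through unchanged;
            // replace it with your own map logic.
            job.setMapperClass(Mapper.class);
            // Zero reduce tasks: no shuffle, no sort, map output is final.
            job.setNumReduceTasks(0);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }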

COLLECT_SET() in Hive, keep duplicates? (43 votes, 9 answers)

Is there a way to keep the duplicates in a collected set in Hive, or simulate the sort of aggregate collection that Hive provides using some other method? I want to aggregate all of the items in a column that have the same key into an array, with…
(asked by batman)

When using --negotiate with curl, is a keytab file required? (43 votes, 3 answers)

The documentation describing how to connect to a Kerberos-secured endpoint shows the following: curl -i --negotiate -u : "http://:/webhdfs/v1/?op=..." The -u flag has to be provided but is ignored by curl. Does the --negotiate…
(asked by Chris Snow)

Spark on YARN concept understanding (43 votes, 4 answers)

I am trying to understand how Spark runs on a YARN cluster/client. I have the following question in my mind: is it necessary that Spark is installed on all the nodes in the YARN cluster? I think it should be, because worker nodes in the cluster execute a task…
(asked by Sporty)
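
On the installation question: in yarn-cluster mode the driver itself runs inside a YARN container, and executors receive the Spark jars through YARN's distributed cache, so Spark only needs to be installed on the machine you submit from. A sketch submitting programmatically with SparkLauncher; the jar path and main class are placeholders, and SPARK_HOME is assumed to be set on the gateway machine:

    import org.apache.spark.launcher.SparkLauncher;

    public class SubmitToYarn {
        public static void main(String[] args) throws Exception {
            Process submit = new SparkLauncher()
                    .setAppResource("/path/to/app.jar")   // placeholder
                    .setMainClass("com.example.MyApp")    // placeholder
                    .setMaster("yarn")
                    .setDeployMode("cluster") // driver runs in a YARN container
                    .launch();
            submit.waitFor();
        }
    }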

Spark: Unable to load native-hadoop library for your platform (42 votes, 2 answers)

I'm a beginner on Ubuntu 16.04, desperately attempting to make Spark work. I've tried to fix my problem using the answers found here on Stack Overflow but I couldn't resolve anything. Launching Spark with the command ./spark-shell from the bin folder I get…
(asked by cane_mastino)

Datanode does not start correctly (42 votes, 11 answers)

I am trying to install Hadoop 2.2.0 in pseudo-distributed mode. While I am trying to start the datanode services it is showing the following error, can anyone please tell me how to resolve this? 2014-03-11 08:48:15,916 INFO…
(asked by user2631600)

Why is HBase a better choice than Cassandra with Hadoop? (42 votes, 1 answer)

Why is using HBase a better choice than using Cassandra with Hadoop? Can anyone please give a detailed explanation of this? Thanks
(asked by Niladri Biswas)

View the contents of a file in HDFS (42 votes, 7 answers)

Probably a noob question, but is there a way to read the contents of a file in HDFS besides copying it to the local filesystem and reading it through Unix? Right now what I am doing is: bin/hadoop dfs -copyToLocal hdfs/path local/path nano local/path I am wondering…
(asked by frazman)
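
Yes: hadoop fs -cat /path prints a file to stdout without copying it locally. The same thing programmatically, as a minimal sketch:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class CatFile {
        public static void main(String[] args) throws Exception {
            // Uses the default filesystem from the cluster configuration.
            FileSystem fs = FileSystem.get(new Configuration());
            // Streams the file straight from HDFS to stdout, no local copy.
            try (java.io.InputStream in = fs.open(new Path(args[0]))) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }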