Questions tagged [hadoop]

Hadoop is an Apache open-source project that provides software for reliable and scalable distributed computing. The core consists of a distributed file system (HDFS) and a resource manager (YARN). Various other open-source projects, such as Apache Hive, use Apache Hadoop as a persistence layer.

"Hadoop" typically refers to the software in the project that implements the MapReduce data-analysis framework, plus the distributed file system (HDFS) that underlies it.

A cluster has at least one NameNode, and usually more than one for redundancy. The NameNode holds the filesystem metadata and fields requests coming in from client applications, directing them to the DataNodes; a cluster typically has many DataNodes, which share the storage and processing work between them. What makes this possible is that they all have access to a shared file system, typically referred to as the Hadoop Distributed File System, or HDFS.

[Diagram: description of a Hadoop cluster]

Apache Hadoop also works with other filesystems: the platform-specific "local" filesystem, blob stores such as Amazon S3 and Azure Storage, and alternative distributed filesystems. See: Filesystem Compatibility with Apache Hadoop.
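
For example, the same client code can target any of these filesystems just by changing the URI scheme. A minimal Java sketch, listing a directory over HDFS; the namenode address and paths are hypothetical, and file:/// or s3a://bucket/... would work the same way:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListDir {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // The scheme in the URI picks the FileSystem implementation;
            // "hdfs://namenode:8020" is a placeholder cluster address.
            FileSystem fs = FileSystem.get(new java.net.URI("hdfs://namenode:8020/"), conf);
            for (FileStatus status : fs.listStatus(new Path("/user/hadoop"))) {
                System.out.println(status.getPath() + "\t" + status.getLen());
            }
        }
    }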

Since version 0.23, Hadoop has included a standalone resource manager: YARN.

This resource manager makes it easier to run other modules alongside the MapReduce engine, such as:

  • Accumulo, a sorted, distributed key/value store that provides robust, scalable data storage and retrieval.
  • Ambari, a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health (such as heatmaps) and for viewing MapReduce, Pig, and Hive applications visually, with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro, a data serialization system based on JSON schemas.
  • Cassandra, a replicated, fault-tolerant, decentralized, and scalable database system.
  • Chukwa, a data collection system for managing large distributed systems.
  • Cascading, a software abstraction layer for Apache Hadoop aimed mainly at Java developers; it reduces the boilerplate code that MapReduce programmers with Java skills would otherwise write by hand.
  • Flink, a fast and reliable large-scale data processing engine.
  • Giraph, an iterative graph processing framework built on top of Apache Hadoop.
  • HBase, a scalable, distributed database that supports structured data storage for large tables.
  • Hive, a data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout, a library of machine learning algorithms compatible with the MapReduce paradigm.
  • Oozie, a workflow scheduler system to manage Apache Hadoop jobs.
  • Pig, a platform and programming language for authoring parallelizable jobs.
  • Spark, a fast and general engine for large-scale data processing.
  • Storm, a system for real-time and stream processing.
  • Tez, an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN.
  • ZooKeeper, a system for coordinating distributed nodes, similar to Google's Chubby.

Commercial support is available from a variety of companies.

44,316 questions. A selection of the top-voted ones:

Explode the Array of Struct in Hive (45 votes, 3 answers)

This is the below Hive Table CREATE EXTERNAL TABLE IF NOT EXISTS SampleTable ( USER_ID BIGINT, NEW_ITEM ARRAY> ) And this is the data in the above table- 1015826235 …
(asked by arsenal)
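
LATERAL VIEW explode() is the standard Hive construct for this: it turns each array element into a row of its own. A hedged sketch through the HiveServer2 JDBC driver; the server address and the t/item aliases are assumptions, and the element type of NEW_ITEM is left opaque because the DDL above is truncated:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ExplodeExample {
        public static void main(String[] args) throws Exception {
            // Requires the hive-jdbc driver on the classpath; the
            // HiveServer2 address below is a placeholder.
            try (Connection conn = DriverManager.getConnection("jdbc:hive2://hiveserver:10000/default");
                 Statement stmt = conn.createStatement()) {
                // explode() emits one row per array element; LATERAL VIEW
                // joins those rows back to the source row's other columns.
                ResultSet rs = stmt.executeQuery(
                    "SELECT user_id, item FROM SampleTable "
                    + "LATERAL VIEW explode(new_item) t AS item");
                while (rs.next()) {
                    // Struct values come back as strings over JDBC.
                    System.out.println(rs.getLong(1) + "\t" + rs.getString(2));
                }
            }
        }
    }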

Hadoop java.io.IOException: Mkdirs failed to create /some/path (45 votes, 8 answers)

When I try to run my Job I am getting the following exception: Exception in thread "main" java.io.IOException: Mkdirs failed to create /some/path at org.apache.hadoop.util.RunJar.ensureDirectory(RunJar.java:106) at…
(asked by alien01)
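
For context, RunJar.ensureDirectory() throws this when it cannot create the directory it unpacks the job jar into, which commonly means hadoop.tmp.dir points somewhere unwritable (one cause among several). A sketch mirroring the failing check, with a hypothetical path:

    import java.io.File;
    import java.io.IOException;

    public class EnsureDir {
        // mkdirs() returns false both on permission failures and when the
        // path already exists as a plain file, so both must be checked.
        static void ensureDirectory(File dir) throws IOException {
            if (!dir.mkdirs() && !dir.isDirectory()) {
                throw new IOException("Mkdirs failed to create " + dir);
            }
        }

        public static void main(String[] args) throws IOException {
            ensureDirectory(new File("/some/path")); // placeholder path
        }
    }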

Cleanest way in Gradle to get the path to a jar file in the Gradle dependency cache (44 votes, 5 answers)

I'm using Gradle to help automate Hadoop tasks. When calling Hadoop, I need to be able to pass it the path to some jars that my code depends on so that Hadoop can send that dependency on during the map/reduce phase. I've figured out something that…
(asked by Ted Naleid)

Save Spark dataframe as dynamic partitioned table in Hive (44 votes, 7 answers)

I have a sample application working to read from csv files into a dataframe. The dataframe can be stored to a Hive table in parquet format using the method df.saveAsTable(tablename, mode). The above code works fine, but I have so much data for each…
(asked by Chetandalal)
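
With the current DataFrameWriter API (rather than the older df.saveAsTable(tablename, mode) form quoted above), dynamic partitioning is expressed with partitionBy(). A sketch, assuming a hypothetical partition column and input path:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class PartitionedSave {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("partitioned-save")
                    .enableHiveSupport()   // needed for saveAsTable into Hive
                    .getOrCreate();
            // "date_col" and the input path are placeholders.
            Dataset<Row> df = spark.read()
                    .option("header", "true")
                    .csv("/data/input.csv");
            // One Hive partition is written per distinct value of date_col,
            // so later runs can append new partitions incrementally.
            df.write()
              .mode(SaveMode.Append)
              .partitionBy("date_col")
              .saveAsTable("sample_table");
        }
    }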

What is the difference between a partition and a replica of a topic in a Kafka cluster? (44 votes, 8 answers)

What is the difference between a partition and a replica of a topic in a Kafka cluster? Both seem to store copies of the messages in a topic, so what is the real difference?
(asked by Gaurav Khare)
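
In short: partitions split a topic's messages across brokers so consumers can work in parallel, while replicas are redundant copies of each partition kept on other brokers for fault tolerance; a message lands in exactly one partition but exists on every replica of that partition. A sketch with the Kafka AdminClient (broker address and topic name are placeholders):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // placeholder
            try (AdminClient admin = AdminClient.create(props)) {
                // 3 partitions give parallelism; replication factor 2 keeps
                // one extra copy of each partition on another broker.
                NewTopic topic = new NewTopic("events", 3, (short) 2);
                admin.createTopics(Collections.singletonList(topic)).all().get();
            }
        }
    }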

How to get the input file name in the mapper in a Hadoop program? (44 votes, 11 answers)

How can I get the name of the input file within a mapper? I have multiple input files stored in the input directory, each mapper may read a different file, and I need to know which file the mapper has read.
(asked by HHH)
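
For file-based input formats, the standard answer is to cast the mapper's input split to FileSplit. A minimal sketch; note the cast is an assumption about the job's input format and fails for, e.g., CombineFileInputFormat:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class FileNameMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each map task processes one split; for file-based formats
            // the split carries the path of the file being read.
            String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
            context.write(new Text(fileName), value);
        }
    }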

What are _SUCCESS and part-r-00000 files in Hadoop? (44 votes, 1 answer)

Although I use Hadoop frequently on my Ubuntu machine I have never thought about the _SUCCESS and part-r-00000 files. The output always resides in the part-r-00000 file, but what is the use of the _SUCCESS file? Why does the output file have the name part-r-0000?…
(asked by ravi)
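
For background: _SUCCESS is an empty marker file that FileOutputCommitter writes into the output directory once a job completes successfully, and part-r-00000 is the output of reducer number 0 (map-only jobs write part-m-* files instead). A sketch of how a downstream consumer might use the marker, with a hypothetical output path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CheckSuccess {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path output = new Path("/user/hadoop/output"); // placeholder
            // The marker's presence means all part-r-* files are complete,
            // so it is safe for a follow-up job to start reading them.
            if (fs.exists(new Path(output, "_SUCCESS"))) {
                System.out.println("Job output is complete and safe to read.");
            }
        }
    }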

How to write 'map only' Hadoop jobs? (43 votes, 4 answers)

I'm a novice at Hadoop, getting familiar with the MapReduce style of programming, but now I face a problem: sometimes I need only the map phase for a job, with the map result written directly as output, meaning no reduce phase is needed. How…
(asked by Breakinen)
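
The standard answer is job.setNumReduceTasks(0): with zero reducers the shuffle and sort phases are skipped entirely and mapper output is written straight to the output directory as part-m-* files. A minimal sketch using the identity Mapper (substitute your own mapper class):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MapOnlyJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "map-only");
            job.setJarByClass(MapOnlyJob.class);
            // The base Mapper passes (offset, line) pairs through unchanged;
            // replace it with your own map logic.
            job.setMapperClass(Mapper.class);
            // Zero reduce tasks: no shuffle, no sort, map output is final.
            job.setNumReduceTasks(0);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }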

COLLECT_SET() in Hive, keep duplicates? (43 votes, 9 answers)

Is there a way to keep the duplicates in a collected set in Hive, or simulate the sort of aggregate collection that Hive provides using some other method? I want to aggregate all of the items in a column that have the same key into an array, with…
(asked by batman)

When using --negotiate with curl, is a keytab file required? (43 votes, 3 answers)

The documentation describing how to connect to a Kerberos-secured endpoint shows the following: curl -i --negotiate -u : "http://:/webhdfs/v1/?op=..." The -u flag has to be provided but is ignored by curl. Does the --negotiate…
(asked by Chris Snow)

Spark on YARN concept understanding (43 votes, 4 answers)

I am trying to understand how Spark runs on a YARN cluster/client. I have the following question in my mind: is it necessary that Spark is installed on all the nodes in the YARN cluster? I think it should be, because worker nodes in the cluster execute a task…
(asked by Sporty)
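
On the installation question: in yarn-cluster mode the driver itself runs inside a YARN container, and executors receive the Spark jars through YARN's distributed cache, so Spark only needs to be installed on the machine you submit from. A sketch submitting programmatically with SparkLauncher; the jar path and main class are placeholders, and SPARK_HOME is assumed to be set on the gateway machine:

    import org.apache.spark.launcher.SparkLauncher;

    public class SubmitToYarn {
        public static void main(String[] args) throws Exception {
            Process submit = new SparkLauncher()
                    .setAppResource("/path/to/app.jar")   // placeholder
                    .setMainClass("com.example.MyApp")    // placeholder
                    .setMaster("yarn")
                    .setDeployMode("cluster") // driver runs in a YARN container
                    .launch();
            submit.waitFor();
        }
    }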

Spark: Unable to load native-hadoop library for your platform (42 votes, 2 answers)

I'm a beginner on Ubuntu 16.04, desperately attempting to make Spark work. I've tried to fix my problem using the answers found here on Stack Overflow but I couldn't resolve anything. Launching Spark with the command ./spark-shell from the bin folder I get…
(asked by cane_mastino)

Datanode does not start correctly (42 votes, 11 answers)

I am trying to install Hadoop 2.2.0 in pseudo-distributed mode. While I am trying to start the datanode services it is showing the following error, can anyone please tell me how to resolve this? 2014-03-11 08:48:15,916 INFO…
(asked by user2631600)

Why is HBase a better choice than Cassandra with Hadoop? (42 votes, 1 answer)

Why is using HBase a better choice than using Cassandra with Hadoop? Can anyone please give a detailed explanation of this? Thanks
(asked by Niladri Biswas)

View the contents of a file in HDFS (42 votes, 7 answers)

Probably a noob question, but is there a way to read the contents of a file in HDFS besides copying it to the local filesystem and reading it through Unix? Right now what I am doing is: bin/hadoop dfs -copyToLocal hdfs/path local/path nano local/path I am wondering…
(asked by frazman)
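
Yes: hadoop fs -cat /path prints a file to stdout without copying it locally. The same thing programmatically, as a minimal sketch:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class CatFile {
        public static void main(String[] args) throws Exception {
            // Uses the default filesystem from the cluster configuration.
            FileSystem fs = FileSystem.get(new Configuration());
            // Streams the file straight from HDFS to stdout, no local copy.
            try (java.io.InputStream in = fs.open(new Path(args[0]))) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }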