Questions tagged [hadoop]

Hadoop is an Apache open-source project that provides software for reliable and scalable distributed computing. The core consists of a distributed file system (HDFS) and a resource manager (YARN). Various other open-source projects, such as Apache Hive, use Apache Hadoop as a persistence layer.

The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.

"Hadoop" typically refers to the software in the project that implements the map-reduce data analysis framework, plus the distributed file system (HDFS) that underlies it.

A cluster has at least one Name Node, and usually more than one for redundancy. The Name Node accepts requests from client applications and farms the processing out to Data Nodes; a cluster typically has many Data Nodes, which share the processing work between them. They can do this because they all have access to a shared file system, typically referred to as the Hadoop Distributed File System (HDFS).

[Diagram: description of a Hadoop cluster]

Apache Hadoop also works with other filesystems: the platform-specific "local" filesystem, blob stores such as Amazon S3 and Azure Storage, and alternative distributed filesystems. See: Filesystem Compatibility with Apache Hadoop.
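
For illustration, here is a minimal Java sketch of how a client selects a filesystem by URI scheme; the host names and paths are invented for the example:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FsDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // The URI scheme picks the implementation: hdfs:// for HDFS,
            // file:// for the local filesystem, s3a:// for Amazon S3
            // (with the hadoop-aws module on the classpath), and so on.
            FileSystem hdfs  = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
            FileSystem local = FileSystem.get(URI.create("file:///"), conf);

            // The same API works regardless of the backing store.
            System.out.println(hdfs.exists(new Path("/user/demo/input")));
            System.out.println(local.exists(new Path("/tmp")));
        }
    }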

Since version 0.23, Hadoop ships with a standalone resource manager: YARN.

This resource manager makes it easier to run other modules alongside the MapReduce engine, such as:

  • Accumulo, a sorted, distributed key/value store that provides robust, scalable data storage and retrieval.
  • Ambari, a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health (such as heatmaps) and for viewing MapReduce, Pig, and Hive applications visually, with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro, a data serialization system based on JSON schemas.
  • Cassandra, a replicated, fault-tolerant, decentralized, and scalable database system.
  • Chukwa, a data collection system for managing large distributed systems.
  • Cascading, a software abstraction layer for Apache Hadoop aimed mainly at Java developers; it reduces the boilerplate code that MapReduce programmers with Java skills would otherwise write by hand.
  • Flink, a fast and reliable large-scale data processing engine.
  • Giraph, an iterative graph processing framework built on top of Apache Hadoop.
  • HBase, a scalable, distributed database that supports structured data storage for large tables.
  • Hive, a data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout, a library of machine learning algorithms compatible with the MapReduce paradigm.
  • Oozie, a workflow scheduler system to manage Apache Hadoop jobs.
  • Pig, a platform and programming language for authoring parallelizable jobs.
  • Spark, a fast and general engine for large-scale data processing.
  • Storm, a system for real-time and stream processing.
  • Tez, an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN.
  • ZooKeeper, a system for coordinating distributed nodes, similar to Google's Chubby.

Commercial support is available from a variety of companies.

44316 questions

30 votes · 3 answers

How to define a custom partitioner for Spark RDDs with equally sized partitions, where each partition has an equal number of elements?

I am new to Spark. I have a large dataset of elements[RDD] and I want to divide it into two exactly equal sized partitions maintaining order of elements. I tried using RangePartitioner like var data = partitionedFile.partitionBy(new…
yh18190 (399)
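
One commonly suggested approach: key each element by its zipWithIndex() position and route the first half of the indices to partition 0. A hedged Java sketch (the class name and the two-way split are illustrative, not the asker's code):

    import org.apache.spark.Partitioner;

    // Splits an RDD keyed by its zipWithIndex() position into two halves,
    // preserving the original order within each half.
    public class HalfSplitPartitioner extends Partitioner {
        private final long mid; // first index that belongs to partition 1

        public HalfSplitPartitioner(long totalCount) {
            this.mid = (totalCount + 1) / 2;
        }

        @Override
        public int numPartitions() {
            return 2;
        }

        @Override
        public int getPartition(Object key) {
            // Keys are the Long indices produced by zipWithIndex().
            return ((Long) key) < mid ? 0 : 1;
        }
    }

Pairing this with rdd.zipWithIndex(), swapping the index into the key slot, and calling partitionBy(new HalfSplitPartitioner(rdd.count())) yields two ordered, equally sized partitions.
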
30 votes · 3 answers

How to load a text file into a Hive table stored as sequence files

I have a hive table stored as a sequencefile. I need to load a text file into this table. How do I load the data into this table?
cldo (1,735)
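
LOAD DATA cannot convert formats, so the usual answer is a plain-text staging table plus an INSERT ... SELECT into the SequenceFile table. A sketch via Hive JDBC, assuming a HiveServer2 endpoint; the host, table, and path names are invented:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class LoadIntoSequenceFileTable {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "hive", "");
                 Statement stmt = conn.createStatement()) {

                // 1. Stage the raw text in a TEXTFILE table; LOAD DATA is a
                //    file move and cannot rewrite rows into SequenceFile format.
                stmt.execute("CREATE TABLE IF NOT EXISTS staging_txt (line STRING) "
                           + "STORED AS TEXTFILE");
                stmt.execute("LOAD DATA LOCAL INPATH '/tmp/input.txt' "
                           + "OVERWRITE INTO TABLE staging_txt");

                // 2. Copy into the SequenceFile-backed table; Hive rewrites
                //    the rows in the target table's storage format.
                stmt.execute("INSERT OVERWRITE TABLE target_seq "
                           + "SELECT line FROM staging_txt");
            }
        }
    }
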
30 votes · 4 answers

Python read file as stream from HDFS

Here is my problem: I have a file in HDFS which can potentially be huge (i.e., too big to fit entirely in memory). What I would like to do is avoid having to cache this file in memory, and only process it line by line like I would do with a regular…
Charles Menguy (40,830)
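
The usual pattern, in any client language, is to open the HDFS stream and consume it incrementally rather than reading the whole file. The question asks about Python, but to keep one language across this page's examples, here is the equivalent Java sketch (host and path invented):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class StreamHdfsFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);

            // FSDataInputStream is a plain InputStream, so the file is consumed
            // line by line without ever being fully buffered in memory.
            try (BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/data/huge.log"))))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // process(line);
                }
            }
        }
    }
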
29 votes · 4 answers

Change File Split size in Hadoop

I have a bunch of small files in an HDFS directory. Although the volume of the files is relatively small, the amount of processing time per file is huge. That is, a 64 MB file, which is the default split size for TextInputFormat, would take even…
Ahmadov (1,567)
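
For the tuning direction described here (more splits per file, hence more map tasks), the knob is the maximum split size. A hedged Java sketch using the standard FileInputFormat API; the 16 MB figure is just an example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SmallerSplits {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "smaller-splits");

            // Cap each input split at 16 MB, so a single 64 MB file fans out
            // to roughly four map tasks instead of one.
            FileInputFormat.setMaxInputSplitSize(job, 16L * 1024 * 1024);

            // Equivalent property form (Hadoop 2.x name):
            // conf.setLong("mapreduce.input.fileinputformat.split.maxsize",
            //              16L * 1024 * 1024);
        }
    }
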
29 votes · 6 answers

LeaseExpiredException: No lease error on HDFS

I am trying to load a large amount of data into HDFS, and I sometimes get the error below. Any idea why? The error: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on…
zohar (2,298)
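
A frequent culprit behind LeaseExpiredException is two writers racing on the same path, often from speculative execution launching duplicate task attempts. A minimal sketch of the usual first mitigation (Hadoop 2.x property names):

    import org.apache.hadoop.conf.Configuration;

    public class NoSpeculation {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Two speculative attempts of one task writing the same HDFS path
            // race for the file lease; the loser dies with LeaseExpiredException.
            // Disabling speculation is a common first mitigation.
            conf.setBoolean("mapreduce.map.speculative", false);
            conf.setBoolean("mapreduce.reduce.speculative", false);
        }
    }
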
29 votes · 6 answers

Merge Spark output CSV files with a single header

I want to create a data processing pipeline in AWS to eventually use the processed data for Machine Learning. I have a Scala script that takes raw data from S3, processes it and writes it to HDFS or even S3 with Spark-CSV. I think I can use multiple…
V. Samma (2,558)
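
A common answer is to coalesce to a single partition before writing, so Spark emits one part file with one header. A hedged Java sketch using the Spark 2.x Dataset API (paths and app name invented):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class SingleHeaderCsv {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("single-csv").getOrCreate();
            Dataset<Row> df = spark.read().option("header", "true").csv("s3a://bucket/raw/");

            // coalesce(1) funnels all partitions through one task, producing
            // exactly one part file with exactly one header. Fine for modest
            // output sizes; for big data, merge part files after the job instead.
            df.coalesce(1)
              .write()
              .option("header", "true")
              .csv("hdfs:///out/merged");

            spark.stop();
        }
    }
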
29 votes · 11 answers

Pig Latin: Load multiple files from a date range (part of the directory structure)

I have the following scenario- Pig version used 0.70 Sample HDFS directory structure: /user/training/test/20100810/ /user/training/test/20100811/ /user/training/test/20100812/ /user/training/test/20100813/
Arnkrishn (29,828)
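
Pig's LOAD accepts the same glob syntax as HDFS paths, e.g. LOAD '/user/training/test/2010081{0,1,2,3}'. A Java sketch showing the equivalent glob expansion against HDFS (host invented, directories from the question):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DateRangeGlob {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(
                URI.create("hdfs://namenode:8020/"), new Configuration());

            // The same pattern works directly in Pig:
            //   data = LOAD '/user/training/test/2010081{0,1,2,3}' ...;
            FileStatus[] matches =
                fs.globStatus(new Path("/user/training/test/2010081{0,1,2,3}"));
            if (matches != null) {
                for (FileStatus status : matches) {
                    System.out.println(status.getPath());
                }
            }
        }
    }
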
29 votes · 2 answers

distinct vs. group by: which is better?

For the simplest case we all refer to: select id from mytbl group by id and select distinct id from mytbl. As we know, they generate the same query plan, which has been repeatedly mentioned in posts like Which is better: Distinct or Group By In…
Chiron (974)
29 votes · 4 answers

Best splittable compression for Hadoop input = bz2?

We've realized a bit too late that archiving our files in GZip format for Hadoop processing isn't such a great idea. GZip isn't splittable, and for reference, here are the problems which I won't repeat: Very basic question about Hadoop and…
Suman (9,221)
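
Configuring a job to emit bzip2, the codec bundled with Hadoop whose compressed output remains splittable, is a two-line job setup. A hedged Java sketch:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.BZip2Codec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class Bzip2Output {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "bzip2-output");

            // Unlike gzip, bzip2 output is itself splittable, so downstream
            // jobs can parallelize over a single large compressed file.
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
        }
    }
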
29 votes · 6 answers

Apache Storm compared to Hadoop

How does Storm compare to Hadoop? Hadoop seems to be the de facto standard for open-source large-scale batch processing. Does Storm have any advantages over Hadoop, or are they completely different?
18bytes (5,951)
29 votes · 3 answers

Set hadoop system user for client embedded in Java webapp

I would like to submit MapReduce jobs from a java web application to a remote Hadoop cluster but am unable to specify which user the job should be submitted for. I would like to configure and use a system user which should be used for all MapReduce…
Christoffer Soop (1,458)
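
On clusters without Kerberos, the usual answer is to wrap client calls in a proxy UserGroupInformation for a fixed system user. A hedged Java sketch (user, host, and path are invented):

    import java.security.PrivilegedExceptionAction;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.security.UserGroupInformation;

    public class SubmitAsSystemUser {
        public static void main(String[] args) throws Exception {
            // Act as a fixed system user for every Hadoop call made
            // inside the doAs block (no Kerberos assumed).
            UserGroupInformation ugi = UserGroupInformation.createRemoteUser("appuser");
            ugi.doAs((PrivilegedExceptionAction<Void>) () -> {
                Configuration conf = new Configuration();
                conf.set("fs.defaultFS", "hdfs://namenode:8020");
                FileSystem fs = FileSystem.get(conf);
                fs.mkdirs(new Path("/user/appuser/jobs"));
                return null;
            });
        }
    }
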
28 votes · 2 answers

Hadoop safemode recovery - taking too long!

I have a Hadoop cluster with 18 data nodes. I restarted the name node over two hours ago and the name node is still in safe mode. I have been searching for why this might be taking so long and I cannot find a good answer. The posting here: Hadoop…
senile_genius (527)
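
If the block reports will never catch up (for example, because files are known to be lost), operators force the issue with hdfs dfsadmin -safemode leave. The programmatic equivalent, sketched against the Hadoop 2.x API (host invented):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.HdfsConstants;

    public class LeaveSafeMode {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(
                URI.create("hdfs://namenode:8020"), new Configuration());

            // Only sensible once you know the missing-block report is benign;
            // otherwise let the NameNode finish replaying block reports.
            ((DistributedFileSystem) fs)
                .setSafeMode(HdfsConstants.SafeModeAction.SAFEMODE_LEAVE);
        }
    }
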
28 votes · 3 answers

How to append data to an existing parquet file

I'm using the following code to create ParquetWriter and to write records to it. ParquetWriter parquetWriter = new ParquetWriter(path, writeSupport, CompressionCodecName.SNAPPY, BLOCK_SIZE, PAGE_SIZE); final GenericRecord record =…
Devas (1,544)
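
Parquet files are write-once: a closed file cannot be appended to, so the usual workaround is one new part file per batch in a shared directory. A hedged Java sketch reusing the question's writer arguments (the writeSupport parameter mirrors the question's; the path and size constants are invented):

    import java.io.IOException;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.api.WriteSupport;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;

    public class BatchedParquetSink {
        private static final int BLOCK_SIZE = 256 * 1024 * 1024; // illustrative
        private static final int PAGE_SIZE  = 64 * 1024;         // illustrative

        // writeSupport is whatever the question already constructs.
        public static ParquetWriter<GenericRecord> newBatchFile(
                WriteSupport<GenericRecord> writeSupport) throws IOException {
            // One fresh, uniquely named file per batch: "appending" means
            // adding part files to a directory that readers treat as a
            // single logical dataset.
            Path path = new Path("/data/events/part-"
                    + System.currentTimeMillis() + ".parquet");
            return new ParquetWriter<>(path, writeSupport,
                    CompressionCodecName.SNAPPY, BLOCK_SIZE, PAGE_SIZE);
        }
    }
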
28 votes · 10 answers

IllegalAccessError to guava's StopWatch from org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus

I'm trying to run a small Spark application and am getting the following exception: Exception in thread "main" java.lang.IllegalAccessError: tried to access method com.google.common.base.Stopwatch.()V from class…
Lika (1,043)
28 votes · 1 answer

Should I call ugi.checkTGTAndReloginFromKeytab() before every action on hadoop?

In my server application I'm connecting to a Kerberos-secured Hadoop cluster from my Java application. I'm using various components like the HDFS file system, Oozie, Hive, etc. On the application startup I do…
Jan Zyka (17,460)
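
The commonly cited pattern: log in from the keytab once at startup, then call checkTGTAndReloginFromKeytab() before long-lived actions; it only renews when the TGT is near expiry, so invoking it per action is cheap. A hedged Java sketch (principal and keytab path invented):

    import org.apache.hadoop.security.UserGroupInformation;

    public class KerberosKeepAlive {
        public static void main(String[] args) throws Exception {
            // Log in once from the keytab at application startup.
            UserGroupInformation.loginUserFromKeytab(
                "app/host.example.com@EXAMPLE.COM",
                "/etc/security/keytabs/app.keytab");

            // Before each long-lived Hadoop call: a no-op while the ticket
            // is still fresh, a relogin when it approaches expiry.
            UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab();
        }
    }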