Questions tagged [hadoop]

Hadoop is an Apache open-source project that provides software for reliable and scalable distributed computing. The core consists of a distributed file system (HDFS) and a resource manager (YARN). Various other open-source projects, such as Apache Hive, use Apache Hadoop as a persistence layer.

The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.

"Hadoop" typically refers to the software in the project that implements the map-reduce data analysis framework, plus the distributed file system (HDFS) that underlies it.

A cluster has at least one Name Node, and usually more than one for redundancy. The Name Node accepts requests from client applications and farms the processing out to Data Nodes; a cluster typically has many Data Nodes, which share the processing work between them. They can do this because they all have access to a shared file system, typically referred to as the Hadoop Distributed File System (HDFS).

[Diagram: description of a Hadoop cluster]

Apache Hadoop also works with other filesystems: the platform-specific "local" filesystem, blob stores such as Amazon S3 and Azure Storage, and alternative distributed filesystems. See: Filesystem Compatibility with Apache Hadoop.
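
For illustration, here is a minimal Java sketch of how a client selects a filesystem by URI scheme; the host names and paths are invented for the example:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FsDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // The URI scheme picks the implementation: hdfs:// for HDFS,
            // file:// for the local filesystem, s3a:// for Amazon S3
            // (with the hadoop-aws module on the classpath), and so on.
            FileSystem hdfs  = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
            FileSystem local = FileSystem.get(URI.create("file:///"), conf);

            // The same API works regardless of the backing store.
            System.out.println(hdfs.exists(new Path("/user/demo/input")));
            System.out.println(local.exists(new Path("/tmp")));
        }
    }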

Since version 0.23, Hadoop ships with a standalone resource manager: YARN.

This resource manager makes it easier to run other modules alongside the MapReduce engine, such as:

  • Accumulo, a sorted, distributed key/value store that provides robust, scalable data storage and retrieval.
  • Ambari, a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health (such as heatmaps) and for viewing MapReduce, Pig, and Hive applications visually, with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro, a data serialization system based on JSON schemas.
  • Cassandra, a replicated, fault-tolerant, decentralized, and scalable database system.
  • Chukwa, a data collection system for managing large distributed systems.
  • Cascading, a software abstraction layer for Apache Hadoop aimed mainly at Java developers; it reduces the boilerplate code that MapReduce programmers with Java skills would otherwise write by hand.
  • Flink, a fast and reliable large-scale data processing engine.
  • Giraph, an iterative graph processing framework built on top of Apache Hadoop.
  • HBase, a scalable, distributed database that supports structured data storage for large tables.
  • Hive, a data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout, a library of machine learning algorithms compatible with the MapReduce paradigm.
  • Oozie, a workflow scheduler system to manage Apache Hadoop jobs.
  • Pig, a platform and programming language for authoring parallelizable jobs.
  • Spark, a fast and general engine for large-scale data processing.
  • Storm, a system for real-time and stream processing.
  • Tez, an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN.
  • ZooKeeper, a system for coordinating distributed nodes, similar to Google's Chubby.

Commercial support is available from a variety of companies.

44316 questions

30 votes · 3 answers

How to define a custom partitioner for Spark RDDs with equally sized partitions, where each partition has an equal number of elements?

I am new to Spark. I have a large dataset of elements[RDD] and I want to divide it into two exactly equal sized partitions maintaining order of elements. I tried using RangePartitioner like var data = partitionedFile.partitionBy(new…
yh18190 (399)
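
One commonly suggested approach: key each element by its zipWithIndex() position and route the first half of the indices to partition 0. A hedged Java sketch (the class name and the two-way split are illustrative, not the asker's code):

    import org.apache.spark.Partitioner;

    // Splits an RDD keyed by its zipWithIndex() position into two halves,
    // preserving the original order within each half.
    public class HalfSplitPartitioner extends Partitioner {
        private final long mid; // first index that belongs to partition 1

        public HalfSplitPartitioner(long totalCount) {
            this.mid = (totalCount + 1) / 2;
        }

        @Override
        public int numPartitions() {
            return 2;
        }

        @Override
        public int getPartition(Object key) {
            // Keys are the Long indices produced by zipWithIndex().
            return ((Long) key) < mid ? 0 : 1;
        }
    }

Pairing this with rdd.zipWithIndex(), swapping the index into the key slot, and calling partitionBy(new HalfSplitPartitioner(rdd.count())) yields two ordered, equally sized partitions.
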
30 votes · 3 answers

How to load a text file into a Hive table stored as sequence files

I have a hive table stored as a sequencefile. I need to load a text file into this table. How do I load the data into this table?
cldo (1,735)
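
LOAD DATA cannot convert formats, so the usual answer is a plain-text staging table plus an INSERT ... SELECT into the SequenceFile table. A sketch via Hive JDBC, assuming a HiveServer2 endpoint; the host, table, and path names are invented:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class LoadIntoSequenceFileTable {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "hive", "");
                 Statement stmt = conn.createStatement()) {

                // 1. Stage the raw text in a TEXTFILE table; LOAD DATA is a
                //    file move and cannot rewrite rows into SequenceFile format.
                stmt.execute("CREATE TABLE IF NOT EXISTS staging_txt (line STRING) "
                           + "STORED AS TEXTFILE");
                stmt.execute("LOAD DATA LOCAL INPATH '/tmp/input.txt' "
                           + "OVERWRITE INTO TABLE staging_txt");

                // 2. Copy into the SequenceFile-backed table; Hive rewrites
                //    the rows in the target table's storage format.
                stmt.execute("INSERT OVERWRITE TABLE target_seq "
                           + "SELECT line FROM staging_txt");
            }
        }
    }
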
30 votes · 4 answers

Python read file as stream from HDFS

Here is my problem: I have a file in HDFS which can potentially be huge (i.e., too big to fit entirely in memory). What I would like to do is avoid having to cache this file in memory, and only process it line by line like I would do with a regular…
Charles Menguy (40,830)
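
The usual pattern, in any client language, is to open the HDFS stream and consume it incrementally rather than reading the whole file. The question asks about Python, but to keep one language across this page's examples, here is the equivalent Java sketch (host and path invented):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class StreamHdfsFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);

            // FSDataInputStream is a plain InputStream, so the file is consumed
            // line by line without ever being fully buffered in memory.
            try (BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/data/huge.log"))))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // process(line);
                }
            }
        }
    }
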
29 votes · 4 answers

Change File Split size in Hadoop

I have a bunch of small files in an HDFS directory. Although the volume of the files is relatively small, the amount of processing time per file is huge. That is, a 64 MB file, which is the default split size for TextInputFormat, would take even…
Ahmadov (1,567)
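
For the tuning direction described here (more splits per file, hence more map tasks), the knob is the maximum split size. A hedged Java sketch using the standard FileInputFormat API; the 16 MB figure is just an example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SmallerSplits {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "smaller-splits");

            // Cap each input split at 16 MB, so a single 64 MB file fans out
            // to roughly four map tasks instead of one.
            FileInputFormat.setMaxInputSplitSize(job, 16L * 1024 * 1024);

            // Equivalent property form (Hadoop 2.x name):
            // conf.setLong("mapreduce.input.fileinputformat.split.maxsize",
            //              16L * 1024 * 1024);
        }
    }
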
29 votes · 6 answers

LeaseExpiredException: No lease error on HDFS

I am trying to load a large amount of data into HDFS, and I sometimes get the error below. Any idea why? The error: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on…
zohar (2,298)
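
A frequent culprit behind LeaseExpiredException is two writers racing on the same path, often from speculative execution launching duplicate task attempts. A minimal sketch of the usual first mitigation (Hadoop 2.x property names):

    import org.apache.hadoop.conf.Configuration;

    public class NoSpeculation {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Two speculative attempts of one task writing the same HDFS path
            // race for the file lease; the loser dies with LeaseExpiredException.
            // Disabling speculation is a common first mitigation.
            conf.setBoolean("mapreduce.map.speculative", false);
            conf.setBoolean("mapreduce.reduce.speculative", false);
        }
    }
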
29 votes · 6 answers

Merge Spark output CSV files with a single header

I want to create a data processing pipeline in AWS to eventually use the processed data for Machine Learning. I have a Scala script that takes raw data from S3, processes it and writes it to HDFS or even S3 with Spark-CSV. I think I can use multiple…
V. Samma (2,558)
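
A common answer is to coalesce to a single partition before writing, so Spark emits one part file with one header. A hedged Java sketch using the Spark 2.x Dataset API (paths and app name invented):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class SingleHeaderCsv {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("single-csv").getOrCreate();
            Dataset<Row> df = spark.read().option("header", "true").csv("s3a://bucket/raw/");

            // coalesce(1) funnels all partitions through one task, producing
            // exactly one part file with exactly one header. Fine for modest
            // output sizes; for big data, merge part files after the job instead.
            df.coalesce(1)
              .write()
              .option("header", "true")
              .csv("hdfs:///out/merged");

            spark.stop();
        }
    }
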
29 votes · 11 answers

Pig Latin: Load multiple files from a date range (part of the directory structure)

I have the following scenario- Pig version used 0.70 Sample HDFS directory structure: /user/training/test/20100810/ /user/training/test/20100811/ /user/training/test/20100812/ /user/training/test/20100813/
Arnkrishn (29,828)
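
Pig's LOAD accepts the same glob syntax as HDFS paths, e.g. LOAD '/user/training/test/2010081{0,1,2,3}'. A Java sketch showing the equivalent glob expansion against HDFS (host invented, directories from the question):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DateRangeGlob {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(
                URI.create("hdfs://namenode:8020/"), new Configuration());

            // The same pattern works directly in Pig:
            //   data = LOAD '/user/training/test/2010081{0,1,2,3}' ...;
            FileStatus[] matches =
                fs.globStatus(new Path("/user/training/test/2010081{0,1,2,3}"));
            if (matches != null) {
                for (FileStatus status : matches) {
                    System.out.println(status.getPath());
                }
            }
        }
    }
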
29 votes · 2 answers

distinct vs. group by: which is better?

For the simplest case we all refer to: select id from mytbl group by id and select distinct id from mytbl. As we know, they generate the same query plan, which has been repeatedly mentioned in posts like Which is better: Distinct or Group By In…
Chiron (974)
29 votes · 4 answers

Best splittable compression for Hadoop input = bz2?

We've realized a bit too late that archiving our files in GZip format for Hadoop processing isn't such a great idea. GZip isn't splittable, and for reference, here are the problems which I won't repeat: Very basic question about Hadoop and…
Suman (9,221)
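
Configuring a job to emit bzip2, the codec bundled with Hadoop whose compressed output remains splittable, is a two-line job setup. A hedged Java sketch:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.BZip2Codec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class Bzip2Output {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "bzip2-output");

            // Unlike gzip, bzip2 output is itself splittable, so downstream
            // jobs can parallelize over a single large compressed file.
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
        }
    }
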
29 votes · 6 answers

Apache Storm compared to Hadoop

How does Storm compare to Hadoop? Hadoop seems to be the de facto standard for open-source large-scale batch processing. Does Storm have any advantages over Hadoop, or are they completely different?
18bytes (5,951)
29 votes · 3 answers

Set hadoop system user for client embedded in Java webapp

I would like to submit MapReduce jobs from a java web application to a remote Hadoop cluster but am unable to specify which user the job should be submitted for. I would like to configure and use a system user which should be used for all MapReduce…
Christoffer Soop (1,458)
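
On clusters without Kerberos, the usual answer is to wrap client calls in a proxy UserGroupInformation for a fixed system user. A hedged Java sketch (user, host, and path are invented):

    import java.security.PrivilegedExceptionAction;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.security.UserGroupInformation;

    public class SubmitAsSystemUser {
        public static void main(String[] args) throws Exception {
            // Act as a fixed system user for every Hadoop call made
            // inside the doAs block (no Kerberos assumed).
            UserGroupInformation ugi = UserGroupInformation.createRemoteUser("appuser");
            ugi.doAs((PrivilegedExceptionAction<Void>) () -> {
                Configuration conf = new Configuration();
                conf.set("fs.defaultFS", "hdfs://namenode:8020");
                FileSystem fs = FileSystem.get(conf);
                fs.mkdirs(new Path("/user/appuser/jobs"));
                return null;
            });
        }
    }
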
28 votes · 2 answers

Hadoop safemode recovery - taking too long!

I have a Hadoop cluster with 18 data nodes. I restarted the name node over two hours ago and the name node is still in safe mode. I have been searching for why this might be taking so long and I cannot find a good answer. The posting here: Hadoop…
senile_genius (527)
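
If the block reports will never catch up (for example, because files are known to be lost), operators force the issue with hdfs dfsadmin -safemode leave. The programmatic equivalent, sketched against the Hadoop 2.x API (host invented):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.HdfsConstants;

    public class LeaveSafeMode {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(
                URI.create("hdfs://namenode:8020"), new Configuration());

            // Only sensible once you know the missing-block report is benign;
            // otherwise let the NameNode finish replaying block reports.
            ((DistributedFileSystem) fs)
                .setSafeMode(HdfsConstants.SafeModeAction.SAFEMODE_LEAVE);
        }
    }
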
28 votes · 3 answers

How to append data to an existing parquet file

I'm using the following code to create ParquetWriter and to write records to it. ParquetWriter parquetWriter = new ParquetWriter(path, writeSupport, CompressionCodecName.SNAPPY, BLOCK_SIZE, PAGE_SIZE); final GenericRecord record =…
Devas (1,544)
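
Parquet files are write-once: a closed file cannot be appended to, so the usual workaround is one new part file per batch in a shared directory. A hedged Java sketch reusing the question's writer arguments (the writeSupport parameter mirrors the question's; the path and size constants are invented):

    import java.io.IOException;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.api.WriteSupport;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;

    public class BatchedParquetSink {
        private static final int BLOCK_SIZE = 256 * 1024 * 1024; // illustrative
        private static final int PAGE_SIZE  = 64 * 1024;         // illustrative

        // writeSupport is whatever the question already constructs.
        public static ParquetWriter<GenericRecord> newBatchFile(
                WriteSupport<GenericRecord> writeSupport) throws IOException {
            // One fresh, uniquely named file per batch: "appending" means
            // adding part files to a directory that readers treat as a
            // single logical dataset.
            Path path = new Path("/data/events/part-"
                    + System.currentTimeMillis() + ".parquet");
            return new ParquetWriter<>(path, writeSupport,
                    CompressionCodecName.SNAPPY, BLOCK_SIZE, PAGE_SIZE);
        }
    }
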
28 votes · 10 answers

IllegalAccessError to guava's StopWatch from org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus

I'm trying to run a small Spark application and am getting the following exception: Exception in thread "main" java.lang.IllegalAccessError: tried to access method com.google.common.base.Stopwatch.()V from class…
Lika (1,043)
28 votes · 1 answer

Should I call ugi.checkTGTAndReloginFromKeytab() before every action on hadoop?

In my server application I'm connecting to a Kerberos-secured Hadoop cluster from my Java application. I'm using various components like the HDFS file system, Oozie, Hive, etc. On the application startup I do…
Jan Zyka (17,460)
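
The commonly cited pattern: log in from the keytab once at startup, then call checkTGTAndReloginFromKeytab() before long-lived actions; it only renews when the TGT is near expiry, so invoking it per action is cheap. A hedged Java sketch (principal and keytab path invented):

    import org.apache.hadoop.security.UserGroupInformation;

    public class KerberosKeepAlive {
        public static void main(String[] args) throws Exception {
            // Log in once from the keytab at application startup.
            UserGroupInformation.loginUserFromKeytab(
                "app/host.example.com@EXAMPLE.COM",
                "/etc/security/keytabs/app.keytab");

            // Before each long-lived Hadoop call: a no-op while the ticket
            // is still fresh, a relogin when it approaches expiry.
            UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab();
        }
    }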