Questions tagged [hadoop]

Hadoop is an Apache open-source project that provides software for reliable and scalable distributed computing. The core consists of a distributed file system (HDFS) and a resource manager (YARN). Various other open-source projects, such as Apache Hive, use Apache Hadoop as a persistence layer.

The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.

"Hadoop" typically refers to the software in the project that implements the map-reduce data analysis framework, plus the distributed file system (HDFS) that underlies it.

A cluster has at least one NameNode, and usually more than one for redundancy. The NameNode accepts requests coming in from client applications and distributes the processing work across the DataNodes, of which there are typically many. The DataNodes can share that work because they all have access to a common file system, the Hadoop Distributed File System (HDFS).
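
To make this concrete, here is a minimal sketch of a client session against HDFS; the paths are placeholders, and the point is that the NameNode serves the metadata while the file's blocks are written to and read from the DataNodes:

  ~$ hadoop fs -mkdir -p /user/alice/input        # NameNode records the new directory
  ~$ hadoop fs -put data.txt /user/alice/input/   # file blocks are written to DataNodes
  ~$ hadoop fs -ls /user/alice/input              # listing comes from NameNode metadata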

[Diagram: description of a Hadoop cluster]

Apache Hadoop also works with other filesystems: the platform-specific "local" filesystem, blobstores such as Amazon S3 and Azure Storage, and alternative distributed filesystems. See: Filesystem Compatibility with Apache Hadoop.
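
As a sketch (assuming the hadoop-aws module is on the classpath; the bucket name and credentials are placeholders, and in practice the keys belong in core-site.xml rather than on the command line), the same shell commands work against S3 via the s3a scheme:

  ~$ hadoop fs -Dfs.s3a.access.key=YOUR_KEY -Dfs.s3a.secret.key=YOUR_SECRET \
       -ls s3a://my-bucket/data/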

Since version 0.23, Hadoop has shipped with a standalone resource manager: YARN.

This resource manager makes it easier to run other modules alongside the MapReduce engine, such as the following (a short YARN CLI sketch appears after the list):

  • Accumulo, a sorted, distributed key/value store that provides robust, scalable data storage and retrieval.
  • Ambari, a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health (such as heatmaps) and for inspecting MapReduce, Pig and Hive applications visually, with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro, a data serialization system based on JSON schemas.
  • Cassandra, a replicated, fault-tolerant, decentralized and scalable database system.
  • Chukwa, a data collection system for managing large distributed systems.
  • Cascading, a software abstraction layer for Apache Hadoop aimed mainly at Java developers, designed to reduce the boilerplate code that MapReduce programmers would otherwise write.
  • Flink, a fast and reliable large-scale data processing engine.
  • Giraph, an iterative graph processing framework built on top of Apache Hadoop.
  • HBase, a scalable, distributed database that supports structured data storage for large tables.
  • Hive, a data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout, a library of machine learning algorithms compatible with the MapReduce paradigm.
  • Oozie, a workflow scheduler system to manage Apache Hadoop jobs.
  • Pig, a platform and programming language for authoring parallelizable jobs.
  • Spark, a fast and general engine for large-scale data processing.
  • Storm, a system for real-time and stream processing.
  • Tez, an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN.
  • ZooKeeper, a system for coordinating distributed nodes, similar to Google's Chubby.
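
Whichever of these engines runs on top, YARN provides a single view of the cluster's applications. A minimal sketch of the YARN CLI (the application ID is a placeholder):

  ~$ yarn node -list                                         # NodeManagers and their state
  ~$ yarn application -list                                  # applications submitted or running
  ~$ yarn application -kill application_1234567890123_0001   # stop a runaway application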

Commercial support is available from a variety of companies.

44316 questions
71 votes, 6 answers

Integration testing Hive jobs

I'm trying to write a non-trivial Hive job using the Hive Thrift and JDBC interfaces, and I'm having trouble setting up a decent JUnit test. By non-trivial, I mean that the job results in at least one MapReduce stage, as opposed to only dealing with…
asked by yoni
70 votes, 14 answers

HDFS error: could only be replicated to 0 nodes, instead of 1

I've created an Ubuntu single-node Hadoop cluster in EC2. Testing a simple file upload to HDFS works from the EC2 machine, but doesn't work from a machine outside of EC2. I can browse the filesystem through the web interface from the remote…
asked by Steve
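
A frequent cause on EC2 is that the NameNode hands the client the DataNodes' private addresses, which a machine outside EC2 cannot reach. A hedged first diagnostic, run on the cluster itself, is to confirm that DataNodes are alive and have capacity:

  ~$ hdfs dfsadmin -report   # live/dead DataNodes and remaining capacity per node
  ~$ hdfs dfs -df -h /       # free space as HDFS sees it
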
70 votes, 6 answers

How to stop/kill Airflow tasks from the UI

How can I stop/kill a running task in the Airflow UI? I am using LocalExecutor. Even if I use CeleryExecutor, how can I kill/stop the running task?
asked by Chetan J
70 votes, 5 answers

Why is there no 'hadoop fs -head' shell command?

A fast method for inspecting files on HDFS is to use tail: ~$ hadoop fs -tail /path/to/file This displays the last kilobyte of data in the file, which is extremely helpful. However, the opposite command head does not appear to be part of the shell…
asked by bbengfort
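
Until a -head option exists, a common workaround is to pipe -cat through the Unix head, which closes the stream once enough bytes have arrived:

  ~$ hadoop fs -cat /path/to/file | head -c 1024   # first kilobyte, the mirror image of -tail
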
69 votes, 10 answers

Write to multiple outputs by key Spark - one Spark job

How can you write to multiple outputs, dependent on the key, using Spark in a single job? Related: Write to multiple outputs by key Scalding Hadoop, one MapReduce Job. E.g. sc.makeRDD(Seq((1, "a"), (1, "b"), (2, "c"))) .writeAsMultiple(prefix,…
asked by samthebest
67 votes, 4 answers

How to fix corrupt HDFS Files

How does someone fix an HDFS that's corrupt? I looked on the Apache/Hadoop website and it suggested its fsck command, which doesn't fix it. Hopefully someone who has run into this problem before can tell me how to fix this. Unlike a traditional fsck…
asked by Classified
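
For context, hdfs fsck reports corruption but never repairs data, so remediation means locating the affected files and then moving or discarding them (the -delete flag throws data away, so use it with care):

  ~$ hdfs fsck / -files -blocks -locations   # identify files with missing or corrupt blocks
  ~$ hdfs fsck / -move                       # move corrupt files to /lost+found
  ~$ hdfs fsck / -delete                     # or delete them outright
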
67 votes, 8 answers

Hive Data Retrieval Queries: Difference between CLUSTER BY, ORDER BY, and SORT BY

On Hive, for data retrieval queries (e.g. SELECT ...), NOT data definition (e.g. CREATE TABLE ...), as far as I understand: SORT BY only sorts within the reducer; ORDER BY orders things globally but shoves everything into one reducer; CLUSTER BY…
asked by cashmere
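
A sketch of the variants, using a hypothetical table t(k, v) and the hive CLI's -e flag; CLUSTER BY k is shorthand for DISTRIBUTE BY k plus SORT BY k:

  ~$ hive -e "SELECT k, v FROM t ORDER BY k"                 # total order, single reducer
  ~$ hive -e "SELECT k, v FROM t SORT BY k"                  # sorted within each reducer only
  ~$ hive -e "SELECT k, v FROM t DISTRIBUTE BY k SORT BY k"  # same k to same reducer, then sorted
  ~$ hive -e "SELECT k, v FROM t CLUSTER BY k"               # shorthand for the previous line
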
65 votes, 4 answers

HDFS free space available command

Is there an HDFS command to see the available free space in HDFS? We can see it through the browser at master:hdfsport, but for some reason I can't access this and I need a command. I can see my disk usage through the command ./bin/hadoop fs -du…
asked by Animesh Raj Jha
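
For reference, both the filesystem shell and the admin tool report capacity; a minimal sketch:

  ~$ hdfs dfs -df -h         # total, used, and available space, human-readable
  ~$ hdfs dfsadmin -report   # per-DataNode breakdown (needs admin privileges)
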
64 votes, 3 answers

How to check Spark Version

I want to check the Spark version in CDH 5.7.0. I have searched on the internet but have not been able to figure it out. Please help.
asked by Ironman
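
Two distribution-independent checks; both launchers print the version banner:

  ~$ spark-submit --version   # prints the Spark build version and exits
  ~$ spark-shell --version    # same banner via the interactive shell's launcher
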
63 votes, 16 answers

Hadoop cluster setup - java.net.ConnectException: Connection refused

I want to set up a Hadoop cluster in pseudo-distributed mode. I managed to perform all the setup steps, including starting up a NameNode, DataNode, JobTracker and a TaskTracker on my machine. Then I tried to run some example programs and faced the…
asked by Marta Karas
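
"Connection refused" at this stage usually means a daemon is not listening where the client expects. Two hedged first checks before touching the configuration:

  ~$ jps                         # lists the Hadoop daemons (NameNode, DataNode, ...) actually running
  ~$ netstat -tlnp | grep java   # shows which ports those daemons are listening on
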
63 votes, 6 answers

How to kill Hadoop jobs

I want to kill all my Hadoop jobs automatically when my code encounters an unhandled exception. I am wondering what the best practice is for doing this. Thanks
asked by Frank
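
From the command line, running jobs can be enumerated and killed by ID; under YARN the same is done at the application level. A sketch (the IDs are placeholders):

  ~$ mapred job -list                                        # enumerate running MapReduce jobs
  ~$ mapred job -kill job_1234567890123_0001                 # kill one by its job ID
  ~$ yarn application -kill application_1234567890123_0001   # YARN-level equivalent
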
61 votes, 16 answers

Out of memory error in Hadoop

I tried installing Hadoop following this http://hadoop.apache.org/common/docs/stable/single_node_setup.html document. When I tried executing bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+', I got the following…
asked by Anuj
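
One commonly suggested mitigation for this tutorial's example is raising the client-side heap through the environment before rerunning; the 1 GB figure is an illustrative choice, not a recommendation:

  ~$ export HADOOP_CLIENT_OPTS="-Xmx1g"   # heap for client commands such as `hadoop jar`
  ~$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
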
61 votes, 6 answers

Difference between hadoop fs -put and hadoop fs -copyFromLocal

-put and -copyFromLocal are documented as identical, while most examples use the verbose variant -copyFromLocal. Why? The same goes for -get and -copyToLocal.
asked by snappy
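
A sketch showing the two side by side; for a local source they behave the same, which is why the documentation describes them as identical:

  ~$ hadoop fs -put local.txt /user/alice/            # copy a local file into HDFS
  ~$ hadoop fs -copyFromLocal local.txt /user/alice/  # documented as the same operation
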
61 votes, 18 answers

java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

I have Hadoop 2.7.1 and apache-hive-1.2.1 installed on Ubuntu 14.0. Why is this error occurring? Is a metastore installation required? When we type the hive command in the terminal, how are the XMLs internally called, and what is the flow of those…
asked by Arti Nalawade
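
One commonly cited fix for this error on Hive 1.2.x with the embedded Derby database is initializing the metastore schema; a hedged sketch (back up any existing metastore_db directory first):

  ~$ schematool -initSchema -dbType derby   # creates the metastore tables Hive expects
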
61 votes, 11 answers

How to access s3a:// files from Apache Spark?

Hadoop 2.6 doesn't support s3a out of the box, so I've tried a series of solutions and fixes, including: deploy with hadoop-aws and aws-java-sdk => cannot read environment variables for credentials; add hadoop-aws into Maven => various transitive…
asked by tribbloid
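
One frequently used approach is to pull a hadoop-aws build matching the cluster's Hadoop version at launch time and pass the credentials through Spark's Hadoop configuration; the version number and keys below are placeholders:

  ~$ spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.7 \
       --conf spark.hadoop.fs.s3a.access.key=YOUR_KEY \
       --conf spark.hadoop.fs.s3a.secret.key=YOUR_SECRET
  # inside the shell: sc.textFile("s3a://my-bucket/file.txt").count()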