Questions tagged [hadoop]

Hadoop is an Apache open-source project that provides software for reliable and scalable distributed computing. The core consists of a distributed file system (HDFS) and a resource manager (YARN). Various other open-source projects, such as Apache Hive, use Apache Hadoop as a persistence layer.

The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.

"Hadoop" typically refers to the software in the project that implements the map-reduce data analysis framework, plus the distributed file system (HDFS) that underlies it.

A cluster has at least one NameNode, and usually more than one for redundancy. The NameNode accepts requests coming in from client applications and distributes the processing work across the DataNodes, of which there are typically many. The DataNodes can share that work because they all have access to a common file system, the Hadoop Distributed File System (HDFS).
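
To make this concrete, here is a minimal sketch of a client session against HDFS; the paths are placeholders, and the point is that the NameNode serves the metadata while the file's blocks are written to and read from the DataNodes:

  ~$ hadoop fs -mkdir -p /user/alice/input        # NameNode records the new directory
  ~$ hadoop fs -put data.txt /user/alice/input/   # file blocks are written to DataNodes
  ~$ hadoop fs -ls /user/alice/input              # listing comes from NameNode metadata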

[Diagram: description of a Hadoop cluster]

Apache Hadoop also works with other filesystems: the platform-specific "local" filesystem, blobstores such as Amazon S3 and Azure Storage, and alternative distributed filesystems. See: Filesystem Compatibility with Apache Hadoop.
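
As a sketch (assuming the hadoop-aws module is on the classpath; the bucket name and credentials are placeholders, and in practice the keys belong in core-site.xml rather than on the command line), the same shell commands work against S3 via the s3a scheme:

  ~$ hadoop fs -Dfs.s3a.access.key=YOUR_KEY -Dfs.s3a.secret.key=YOUR_SECRET \
       -ls s3a://my-bucket/data/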

Since version 0.23, Hadoop has shipped with a standalone resource manager: YARN.

This resource manager makes it easier to run other modules alongside the MapReduce engine, such as the following (a short YARN CLI sketch appears after the list):

  • Accumulo, a sorted, distributed key/value store that provides robust, scalable data storage and retrieval.
  • Ambari, a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health (such as heatmaps) and for inspecting MapReduce, Pig and Hive applications visually, with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro, a data serialization system based on JSON schemas.
  • Cassandra, a replicated, fault-tolerant, decentralized and scalable database system.
  • Chukwa, a data collection system for managing large distributed systems.
  • Cascading, a software abstraction layer for Apache Hadoop aimed mainly at Java developers, designed to reduce the boilerplate code that MapReduce programmers would otherwise write.
  • Flink, a fast and reliable large-scale data processing engine.
  • Giraph, an iterative graph processing framework built on top of Apache Hadoop.
  • HBase, a scalable, distributed database that supports structured data storage for large tables.
  • Hive, a data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout, a library of machine learning algorithms compatible with the MapReduce paradigm.
  • Oozie, a workflow scheduler system to manage Apache Hadoop jobs.
  • Pig, a platform and programming language for authoring parallelizable jobs.
  • Spark, a fast and general engine for large-scale data processing.
  • Storm, a system for real-time and stream processing.
  • Tez, an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN.
  • ZooKeeper, a system for coordinating distributed nodes, similar to Google's Chubby.
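
Whichever of these engines runs on top, YARN provides a single view of the cluster's applications. A minimal sketch of the YARN CLI (the application ID is a placeholder):

  ~$ yarn node -list                                         # NodeManagers and their state
  ~$ yarn application -list                                  # applications submitted or running
  ~$ yarn application -kill application_1234567890123_0001   # stop a runaway application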

Commercial support is available from a variety of companies.

44316 questions
71 votes, 6 answers

Integration testing Hive jobs

I'm trying to write a non-trivial Hive job using the Hive Thrift and JDBC interfaces, and I'm having trouble setting up a decent JUnit test. By non-trivial, I mean that the job results in at least one MapReduce stage, as opposed to only dealing with…
asked by yoni
70 votes, 14 answers

HDFS error: could only be replicated to 0 nodes, instead of 1

I've created an Ubuntu single-node Hadoop cluster in EC2. Testing a simple file upload to HDFS works from the EC2 machine, but doesn't work from a machine outside of EC2. I can browse the filesystem through the web interface from the remote…
asked by Steve
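
A frequent cause on EC2 is that the NameNode hands the client the DataNodes' private addresses, which a machine outside EC2 cannot reach. A hedged first diagnostic, run on the cluster itself, is to confirm that DataNodes are alive and have capacity:

  ~$ hdfs dfsadmin -report   # live/dead DataNodes and remaining capacity per node
  ~$ hdfs dfs -df -h /       # free space as HDFS sees it
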
70 votes, 6 answers

How to stop/kill Airflow tasks from the UI

How can I stop/kill a running task in the Airflow UI? I am using LocalExecutor. Even if I use CeleryExecutor, how can I kill/stop the running task?
asked by Chetan J
70 votes, 5 answers

Why is there no 'hadoop fs -head' shell command?

A fast method for inspecting files on HDFS is to use tail: ~$ hadoop fs -tail /path/to/file This displays the last kilobyte of data in the file, which is extremely helpful. However, the opposite command head does not appear to be part of the shell…
asked by bbengfort
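
Until a -head option exists, a common workaround is to pipe -cat through the Unix head, which closes the stream once enough bytes have arrived:

  ~$ hadoop fs -cat /path/to/file | head -c 1024   # first kilobyte, the mirror image of -tail
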
69 votes, 10 answers

Write to multiple outputs by key Spark - one Spark job

How can you write to multiple outputs, dependent on the key, using Spark in a single job? Related: Write to multiple outputs by key Scalding Hadoop, one MapReduce Job. E.g. sc.makeRDD(Seq((1, "a"), (1, "b"), (2, "c"))) .writeAsMultiple(prefix,…
asked by samthebest
67 votes, 4 answers

How to fix corrupt HDFS Files

How does someone fix an HDFS that's corrupt? I looked on the Apache/Hadoop website and it suggested its fsck command, which doesn't fix it. Hopefully someone who has run into this problem before can tell me how to fix this. Unlike a traditional fsck…
asked by Classified
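
For context, hdfs fsck reports corruption but never repairs data, so remediation means locating the affected files and then moving or discarding them (the -delete flag throws data away, so use it with care):

  ~$ hdfs fsck / -files -blocks -locations   # identify files with missing or corrupt blocks
  ~$ hdfs fsck / -move                       # move corrupt files to /lost+found
  ~$ hdfs fsck / -delete                     # or delete them outright
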
67 votes, 8 answers

Hive Data Retrieval Queries: Difference between CLUSTER BY, ORDER BY, and SORT BY

On Hive, for data retrieval queries (e.g. SELECT ...), NOT data definition (e.g. CREATE TABLE ...), as far as I understand: SORT BY only sorts within the reducer; ORDER BY orders things globally but shoves everything into one reducer; CLUSTER BY…
asked by cashmere
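
A sketch of the variants, using a hypothetical table t(k, v) and the hive CLI's -e flag; CLUSTER BY k is shorthand for DISTRIBUTE BY k plus SORT BY k:

  ~$ hive -e "SELECT k, v FROM t ORDER BY k"                 # total order, single reducer
  ~$ hive -e "SELECT k, v FROM t SORT BY k"                  # sorted within each reducer only
  ~$ hive -e "SELECT k, v FROM t DISTRIBUTE BY k SORT BY k"  # same k to same reducer, then sorted
  ~$ hive -e "SELECT k, v FROM t CLUSTER BY k"               # shorthand for the previous line
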
65 votes, 4 answers

HDFS free space available command

Is there an HDFS command to see the available free space in HDFS? We can see it through the browser at master:hdfsport, but for some reason I can't access this and I need a command. I can see my disk usage through the command ./bin/hadoop fs -du…
asked by Animesh Raj Jha
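
For reference, both the filesystem shell and the admin tool report capacity; a minimal sketch:

  ~$ hdfs dfs -df -h         # total, used, and available space, human-readable
  ~$ hdfs dfsadmin -report   # per-DataNode breakdown (needs admin privileges)
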
64 votes, 3 answers

How to check Spark Version

I want to check the Spark version in CDH 5.7.0. I have searched on the internet but have not been able to figure it out. Please help.
asked by Ironman
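
Two distribution-independent checks; both launchers print the version banner:

  ~$ spark-submit --version   # prints the Spark build version and exits
  ~$ spark-shell --version    # same banner via the interactive shell's launcher
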
63 votes, 16 answers

Hadoop cluster setup - java.net.ConnectException: Connection refused

I want to set up a Hadoop cluster in pseudo-distributed mode. I managed to perform all the setup steps, including starting up a NameNode, DataNode, JobTracker and a TaskTracker on my machine. Then I tried to run some example programs and faced the…
asked by Marta Karas
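
"Connection refused" at this stage usually means a daemon is not listening where the client expects. Two hedged first checks before touching the configuration:

  ~$ jps                         # lists the Hadoop daemons (NameNode, DataNode, ...) actually running
  ~$ netstat -tlnp | grep java   # shows which ports those daemons are listening on
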
63 votes, 6 answers

How to kill Hadoop jobs

I want to kill all my Hadoop jobs automatically when my code encounters an unhandled exception. I am wondering what the best practice is for doing this. Thanks
asked by Frank
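
From the command line, running jobs can be enumerated and killed by ID; under YARN the same is done at the application level. A sketch (the IDs are placeholders):

  ~$ mapred job -list                                        # enumerate running MapReduce jobs
  ~$ mapred job -kill job_1234567890123_0001                 # kill one by its job ID
  ~$ yarn application -kill application_1234567890123_0001   # YARN-level equivalent
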
61 votes, 16 answers

Out of memory error in Hadoop

I tried installing Hadoop following this http://hadoop.apache.org/common/docs/stable/single_node_setup.html document. When I tried executing bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+', I got the following…
asked by Anuj
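
One commonly suggested mitigation for this tutorial's example is raising the client-side heap through the environment before rerunning; the 1 GB figure is an illustrative choice, not a recommendation:

  ~$ export HADOOP_CLIENT_OPTS="-Xmx1g"   # heap for client commands such as `hadoop jar`
  ~$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
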
61 votes, 6 answers

Difference between hadoop fs -put and hadoop fs -copyFromLocal

-put and -copyFromLocal are documented as identical, while most examples use the verbose variant -copyFromLocal. Why? The same goes for -get and -copyToLocal.
asked by snappy
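
A sketch showing the two side by side; for a local source they behave the same, which is why the documentation describes them as identical:

  ~$ hadoop fs -put local.txt /user/alice/            # copy a local file into HDFS
  ~$ hadoop fs -copyFromLocal local.txt /user/alice/  # documented as the same operation
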
61 votes, 18 answers

java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

I have Hadoop 2.7.1 and apache-hive-1.2.1 installed on Ubuntu 14.0. Why is this error occurring? Is a metastore installation required? When we type the hive command in the terminal, how are the XMLs internally called, and what is the flow of those…
asked by Arti Nalawade
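
One commonly cited fix for this error on Hive 1.2.x with the embedded Derby database is initializing the metastore schema; a hedged sketch (back up any existing metastore_db directory first):

  ~$ schematool -initSchema -dbType derby   # creates the metastore tables Hive expects
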
61 votes, 11 answers

How to access s3a:// files from Apache Spark?

Hadoop 2.6 doesn't support s3a out of the box, so I've tried a series of solutions and fixes, including: deploy with hadoop-aws and aws-java-sdk => cannot read environment variables for credentials; add hadoop-aws into Maven => various transitive…
asked by tribbloid
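
One frequently used approach is to pull a hadoop-aws build matching the cluster's Hadoop version at launch time and pass the credentials through Spark's Hadoop configuration; the version number and keys below are placeholders:

  ~$ spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.7 \
       --conf spark.hadoop.fs.s3a.access.key=YOUR_KEY \
       --conf spark.hadoop.fs.s3a.secret.key=YOUR_SECRET
  # inside the shell: sc.textFile("s3a://my-bucket/file.txt").count()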