Questions tagged [hadoop]

Hadoop is an Apache open-source project that provides software for reliable and scalable distributed computing. The core consists of a distributed file system (HDFS) and a resource manager (YARN). Various other open-source projects, such as Apache Hive, use Apache Hadoop as a persistence layer.

The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.

"Hadoop" typically refers to the software in the project that implements the map-reduce data analysis framework, plus the distributed file system (HDFS) that underlies it.
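
As a rough illustration of the map-reduce model mentioned above, the following is a minimal in-memory sketch in plain Python. It does not use the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for this example. A real Hadoop job would distribute each phase across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["hadoop stores data", "hadoop processes data"]))
```

The mapper and reducer never share state; everything they exchange passes through the grouping step, which is what lets a framework run them on different machines.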

A cluster has at least one NameNode, and usually more than one for redundancy. The NameNode accepts requests from client applications and coordinates a typically large number of DataNodes, which share the work between them. They can do this because they all have access to a shared file system, the Hadoop Distributed File System (HDFS).
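
To make the division of labour concrete: in HDFS the NameNode keeps the metadata that records which DataNodes hold each block of a file, while the DataNodes hold the block replicas themselves. The toy model below sketches that bookkeeping in plain Python. The `place_blocks` function and the round-robin placement are assumptions made for illustration; real HDFS uses a rack-aware placement policy. The 128 MB block size and replication factor of 3 are the usual defaults in Hadoop 2.x and later.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the usual default for dfs.blocksize
REPLICATION = 3                 # the usual default for dfs.replication


def place_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return {block_index: [datanode, ...]} as toy NameNode metadata.

    Round-robin placement is a simplification for illustration only;
    it is not the policy HDFS actually uses.
    """
    # A block cannot have more replicas than there are DataNodes.
    replication = min(replication, len(datanodes))
    n_blocks = math.ceil(file_size / block_size)
    placement = {}
    for block in range(n_blocks):
        placement[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


if __name__ == "__main__":
    # A 300 MB file on a hypothetical four-node cluster.
    for block, nodes in place_blocks(300 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"]).items():
        print(block, nodes)
```

Because every block lives on several DataNodes, losing one node does not lose data, and work on a block can be scheduled on any node that holds a replica.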

Apache Hadoop also works with other filesystems: the platform-specific "local" filesystem, blob stores such as Amazon S3 and Azure storage, and alternative distributed filesystems. See: Filesystem Compatibility with Apache Hadoop.

Since version 0.23, Hadoop has included a standalone resource manager: YARN.

This resource manager makes it easier to use other modules alongside the MapReduce engine, such as:

Accumulo, a sorted, distributed key/value store that provides robust, scalable data storage and retrieval.
Ambari, a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
Avro, a data serialization system based on JSON schemas.
Cassandra, a replicated, fault-tolerant, decentralized and scalable database system.
Chukwa, a data collection system for managing large distributed systems.
Cascading, a software abstraction layer for Apache Hadoop that mainly targets Java developers. The framework was developed to reduce the effort of writing boilerplate code by MapReduce programmers with Java skills.
Flink, a fast and reliable large-scale data processing engine.
Giraph, an iterative graph processing framework built on top of Apache Hadoop.
HBase, a scalable, distributed database that supports structured data storage for large tables.
Hive, a data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout, a library of machine learning algorithms compatible with the MapReduce paradigm.
Oozie, a workflow scheduler system to manage Apache Hadoop jobs.
Pig, a platform and programming language for authoring parallelizable jobs.
Spark, a fast and general engine for large-scale data processing.
Storm, a system for real-time and stream processing.
Tez, an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN.
ZooKeeper, a system for coordinating distributed nodes, similar to Google's Chubby.

Commercial support is available from a variety of companies.

44316 questions

7 answers

Hadoop 2.2 Installation `.' no such file or directory

I have installed Hadoop and HDFS using this tutorial http://codesfusion.blogspot.com/2013/10/setup-hadoop-2x-220-on-ubuntu.html Everything is fine. I am also able to create directories and use them using hadoop fs -mkdir /tmp hadoop fs -mkdir…

hadoop hdfs

asked Dec 29 '13 at 02:39

Knows Not Much

3 answers

Skip first line of csv while loading in hive table

Hello friends, I created a table in Hive with the following command: CREATE TABLE db.test ( fname STRING, lname STRING, age STRING, mob BIGINT ) row format delimited fields terminated BY '\t' stored AS textfile;…

hadoop hive hiveql

asked Dec 28 '13 at 10:14

Pankaj

10 answers

Hadoop: Connecting to ResourceManager failed

After installing Hadoop 2.2 and trying to launch the pipes example, I've got the following error (the same error shows up after trying to launch hadoop jar hadoop-mapreduce-examples-2.2.0.jar wordcount someFile.txt /out): /usr/local/hadoop$ hadoop pipes…

hadoop hadoop-yarn

asked Dec 14 '13 at 18:49

user3102852

4 answers

Apache Pig: FLATTEN and parallel execution of reducers

I have implemented an Apache Pig script. When I execute the script it results in many mappers for a specific step, but has only one reducer for that step. Because of this condition (many mappers, one reducer) the Hadoop cluster is almost idle while…

hadoop apache-pig

asked Nov 07 '13 at 12:00

user2964640

13 answers

Running Apache Hadoop 2.1.0 on Windows

I am new to Hadoop and have run into problems trying to run it on my Windows 7 machine. Particularly I am interested in running Hadoop 2.1.0 as its release notes mention that running on Windows is supported. I know that I can try to run 1.x versions…

windows hadoop

asked Sep 05 '13 at 07:14

Hatter

5 answers

Permission Denied error while running start-dfs.sh

I am getting this error while performing start-dfs.sh Starting namenodes on [localhost] pdsh@Gaurav: localhost: rcmd: socket: Permission denied Starting datanodes pdsh@Gaurav: localhost: rcmd: socket: Permission denied Starting secondary namenodes…

sockets hadoop hdfs hadoop-yarn hadoop2

asked Mar 13 '17 at 04:18

Gaurav A Dubey

6 answers

Sorting large data using MapReduce/Hadoop

I am reading about MapReduce and the following thing is confusing me. Suppose we have a file with 1 million entries (integers) and we want to sort them using MapReduce. The way I understood to go about it is as follows: Write a mapper function that…

java hadoop mapreduce

asked Sep 02 '10 at 06:46

Chander Shivdasani

8 answers

Alter hive table add or drop column

I have an ORC table in Hive and I want to drop a column from it: ALTER TABLE table_name drop col_name; but I am getting the following exception: Error occurred executing hive query: OK FAILED: ParseException line 1:35 mismatched input 'user_id1'…

hadoop hive

asked Dec 10 '15 at 09:31

Aryan Singh

4 answers

How to calculate Date difference in Hive

I'm a novice. I have an employee table with a column specifying the joining date and I want to retrieve the list of employees who have joined in the last 3 months. I understand we can get the current date using from_unixtime(unix_timestamp()). How do…

hadoop hive hiveql

asked May 29 '15 at 05:21

Holmes

5 answers

How can I access S3/S3n from a local Hadoop 2.6 installation?

I am trying to reproduce an Amazon EMR cluster on my local machine. For that purpose, I have installed the latest stable version of Hadoop as of now - 2.6.0. Now I would like to access an S3 bucket, as I do inside the EMR cluster. I have added the…

hadoop amazon-web-services amazon-s3 hadoop-yarn hadoop2

asked Jan 19 '15 at 16:23

doublebyte

3 answers

How do I run graphx with Python / pyspark?

I am attempting to run Spark graphx with Python using pyspark. My installation appears correct, as I am able to run the pyspark tutorials and the (Java) GraphX tutorials just fine. Presumably since GraphX is part of Spark, pyspark should be able…

python hadoop graph-theory apache-spark

asked Apr 25 '14 at 20:18

Glenn Strycker

6 answers

Hive query to quickly find table size (number of rows)

Is there a Hive query to quickly find table size (i.e. number of rows) without launching a time-consuming MapReduce job? (Which is why I want to avoid COUNT(*).) I tried DESCRIBE EXTENDED, but that yielded numRows=0 which is obviously not…

hadoop hive

asked Jan 18 '14 at 19:04

xenocyon

3 answers

What exactly is hadoop namenode formatting?

What exactly is involved in namenode formatting? If I type the following command into my terminal within my Hadoop installation folder: bin/hadoop namenode -format What exactly does it accomplish? I am looking to understand principles of…

hadoop formatting principles

asked Sep 18 '13 at 02:22

Ace

5 answers

how to replace characters in hive?

I have a string column description in a Hive table which may contain tab characters '\t'; these characters are, however, messing up some views when connecting Hive to an external application. Is there a simple way to get rid of all tab characters in that…

hadoop hive

asked Aug 06 '13 at 21:05

user1745713

4 answers

How to rename a hive table without changing location?

Based on the Hive doc below: Rename Table ALTER TABLE table_name RENAME TO new_table_name; This statement lets you change the name of a table to a different name. As of version 0.6, a rename on a managed table moves its HDFS location as well.…

hadoop hive hiveql

asked Mar 12 '16 at 21:24

Osiris
