Questions tagged [hadoop]

Hadoop is an Apache open-source project that provides software for reliable and scalable distributed computing. The core consists of a distributed file system (HDFS) and a resource manager (YARN). Various other open-source projects, such as Apache Hive, use Apache Hadoop as a persistence layer.

The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.

"Hadoop" typically refers to the software in the project that implements the map-reduce data analysis framework, plus the distributed file system (HDFS) that underlies it.
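To make the map-reduce model concrete, here is a minimal single-process sketch in plain Python. It does not use any Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and a real job would spread each phase across the machines of a cluster rather than run it in one process.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Calling `word_count(["Hadoop stores data", "hadoop processes data"])` counts each word across both lines. The point of the split into phases is that the map and reduce steps only see one line or one key at a time, which is what allows them to run in parallel.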

A cluster has at least one NameNode, and usually more than one for redundancy. The NameNode accepts requests from client applications and keeps track of where data lives, while the DataNodes, of which there are typically many, store the data and share the processing work between them. They do so through a shared file system, the Hadoop Distributed File System (HDFS).
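As a rough illustration of the bookkeeping behind this arrangement, the toy model below shows a NameNode-like object recording which DataNodes hold the replicas of each block of a file. Everything in it is hypothetical: the class name `ToyNameNode`, its methods, and the round-robin placement are assumptions made for the sketch, and real HDFS placement is rack-aware and considerably more involved.

```python
class ToyNameNode:
    """Toy model: tracks which DataNodes hold each block of each file."""

    def __init__(self, data_nodes, replication=3):
        if replication > len(data_nodes):
            raise ValueError("replication exceeds number of DataNodes")
        self.data_nodes = list(data_nodes)
        self.replication = replication
        self.block_map = {}  # file path -> one replica list per block
        self._next = 0       # round-robin cursor (illustrative placement only)

    def add_file(self, path, num_blocks):
        """Assign each block of a file to `replication` distinct DataNodes."""
        blocks = []
        for _ in range(num_blocks):
            replicas = [
                self.data_nodes[(self._next + i) % len(self.data_nodes)]
                for i in range(self.replication)
            ]
            self._next = (self._next + 1) % len(self.data_nodes)
            blocks.append(replicas)
        self.block_map[path] = blocks
        return blocks

    def locate(self, path):
        """Answer a client lookup: where are the blocks of this file?"""
        return self.block_map[path]
```

A client would ask the name node where a file's blocks are (`locate`) and then read the data from the listed data nodes directly. Because every block lives on several nodes, losing one node does not lose the block.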

Description of Hadoop cluster

Apache Hadoop also works with other filesystems: the platform-specific "local" filesystem, blob stores such as Amazon S3 and Azure Storage, and alternative distributed filesystems. See: Filesystem Compatibility with Apache Hadoop.

Since version 0.23, Hadoop has included a standalone resource manager: YARN.

This resource manager makes it easier to use other modules alongside the MapReduce engine, such as:

  • Accumulo, a sorted, distributed key/value store that provides robust, scalable data storage and retrieval.
  • Ambari, a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro, a data serialization system based on JSON schemas.
  • Cassandra, a replicated, fault-tolerant, decentralized and scalable database system.
  • Chukwa, a data collection system for managing large distributed systems.
  • Cascading, a software abstraction layer for Apache Hadoop that mainly targets Java developers. It was developed to reduce the boilerplate code that MapReduce programmers have to write in Java.
  • Flink, a fast and reliable large-scale data processing engine.
  • Giraph, an iterative graph processing framework built on top of Apache Hadoop.
  • HBase, a scalable, distributed database that supports structured data storage for large tables.
  • Hive, a data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout, a library of machine learning algorithms compatible with the MapReduce paradigm.
  • Oozie, a workflow scheduler system to manage Apache Hadoop jobs.
  • Pig, a platform and programming language for authoring parallelizable jobs.
  • Spark, a fast and general engine for large-scale data processing.
  • Storm, a system for real-time and stream processing.
  • Tez, an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN.
  • ZooKeeper, a system for coordinating distributed nodes, similar to Google's Chubby.

Commercial support is available from a variety of companies.

44316 questions
35
votes
7 answers

Hadoop 2.2 Installation `.' no such file or directory

I have installed Hadoop and HDFS using this tutorial http://codesfusion.blogspot.com/2013/10/setup-hadoop-2x-220-on-ubuntu.html Everything is fine. I am also able to create directories and use them using hadoop fs -mkdir /tmp hadoop fs -mkdir…
Knows Not Much
  • 30,395
  • 60
  • 197
  • 373
35
votes
3 answers

Skip first line of csv while loading in hive table

Hello Friends, I created a table in Hive with the help of the following command: CREATE TABLE db.test ( fname STRING, lname STRING, age STRING, mob BIGINT ) row format delimited fields terminated BY '\t' stored AS textfile;…
Pankaj
  • 369
  • 1
  • 4
  • 7
35
votes
10 answers

Hadoop: Connecting to ResourceManager failed

After installing Hadoop 2.2 and trying to launch the pipes example, I got the following error (the same error shows up after trying to launch hadoop jar hadoop-mapreduce-examples-2.2.0.jar wordcount someFile.txt /out): /usr/local/hadoop$ hadoop pipes…
user3102852
  • 629
  • 1
  • 6
  • 7
35
votes
4 answers

Apache Pig: FLATTEN and parallel execution of reducers

I have implemented an Apache Pig script. When I execute the script it results in many mappers for a specific step, but has only one reducer for that step. Because of this condition (many mappers, one reducer) the Hadoop cluster is almost idle while…
user2964640
  • 351
  • 3
  • 5
35
votes
13 answers

Running Apache Hadoop 2.1.0 on Windows

I am new to Hadoop and have run into problems trying to run it on my Windows 7 machine. Particularly I am interested in running Hadoop 2.1.0 as its release notes mention that running on Windows is supported. I know that I can try to run 1.x versions…
Hatter
  • 773
  • 1
  • 6
  • 12
34
votes
5 answers

Permission Denied error while running start-dfs.sh

I am getting this error while performing start-dfs.sh Starting namenodes on [localhost] pdsh@Gaurav: localhost: rcmd: socket: Permission denied Starting datanodes pdsh@Gaurav: localhost: rcmd: socket: Permission denied Starting secondary namenodes…
Gaurav A Dubey
  • 641
  • 1
  • 6
  • 19
34
votes
6 answers

Sorting large data using MapReduce/Hadoop

I am reading about MapReduce and the following thing is confusing me. Suppose we have a file with 1 million entries (integers) and we want to sort them using MapReduce. The way I understood to go about it is as follows: Write a mapper function that…
Chander Shivdasani
  • 9,878
  • 20
  • 76
  • 107
34
votes
8 answers

Alter hive table add or drop column

I have an ORC table in Hive and I want to drop a column from it: ALTER TABLE table_name drop col_name; but I am getting the following exception: Error occurred executing hive query: OK FAILED: ParseException line 1:35 mismatched input 'user_id1'…
Aryan Singh
  • 602
  • 1
  • 8
  • 17
34
votes
4 answers

How to calculate Date difference in Hive

I'm a novice. I have an employee table with a column specifying the joining date and I want to retrieve the list of employees who have joined in the last 3 months. I understand we can get the current date using from_unixtime(unix_timestamp()). How do…
Holmes
  • 1,059
  • 2
  • 17
  • 25
34
votes
5 answers

How can I access S3/S3n from a local Hadoop 2.6 installation?

I am trying to reproduce an Amazon EMR cluster on my local machine. For that purpose, I have installed the latest stable version of Hadoop as of now - 2.6.0. Now I would like to access an S3 bucket, as I do inside the EMR cluster. I have added the…
doublebyte
  • 1,225
  • 3
  • 13
  • 22
34
votes
3 answers

How do I run graphx with Python / pyspark?

I am attempting to run Spark graphx with Python using pyspark. My installation appears correct, as I am able to run the pyspark tutorials and the (Java) GraphX tutorials just fine. Presumably since GraphX is part of Spark, pyspark should be able…
Glenn Strycker
  • 4,816
  • 6
  • 31
  • 51
34
votes
6 answers

Hive query to quickly find table size (number of rows)

Is there a Hive query to quickly find table size (i.e. number of rows) without launching a time-consuming MapReduce job? (Which is why I want to avoid COUNT(*).) I tried DESCRIBE EXTENDED, but that yielded numRows=0 which is obviously not…
xenocyon
  • 2,409
  • 3
  • 20
  • 22
34
votes
3 answers

What exactly is hadoop namenode formatting?

What exactly is involved in namenode formatting? If I type the following command into my terminal within my Hadoop installation folder: bin/hadoop namenode -format What exactly does it accomplish? I am looking to understand principles of…
Ace
  • 1,501
  • 4
  • 30
  • 49
34
votes
5 answers

how to replace characters in hive?

I have a string column description in a Hive table which may contain tab characters '\t'. These characters are, however, messing up some views when connecting Hive to an external application. Is there a simple way to get rid of all tab characters in that…
user1745713
  • 781
  • 4
  • 10
  • 16
33
votes
4 answers

How to rename a hive table without changing location?

Based on the Hive doc below: Rename Table ALTER TABLE table_name RENAME TO new_table_name; This statement lets you change the name of a table to a different name. As of version 0.6, a rename on a managed table moves its HDFS location as well.…
Osiris
  • 1,007
  • 4
  • 17
  • 30