Questions tagged [hadoop]

Hadoop is an Apache open-source project that provides software for reliable and scalable distributed computing. The core consists of a distributed file system (HDFS) and a resource manager (YARN). Various other open-source projects, such as Apache Hive, use Apache Hadoop as a persistence layer.

The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.

"Hadoop" typically refers to the software in the project that implements the map-reduce data analysis framework, plus the distributed file system (HDFS) that underlies it.

A cluster has at least one NameNode, and usually more than one for redundancy. The NameNode accepts requests from client applications and directs them to the DataNodes; a cluster typically has many DataNodes that share the processing work among them. They do this through a shared file system that all of them can access, the Hadoop Distributed File System (HDFS).
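The map-reduce model mentioned above can be illustrated without a cluster. The following is a minimal in-process sketch in plain Python, not Hadoop's API: the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical and chosen only for illustration. On a real cluster the map calls would run in parallel on the nodes that hold the input data, and the framework would do the grouping step.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key (done by the framework between map and reduce)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce sequentially over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Calling `word_count(["Hadoop stores data", "hadoop processes data"])` counts each word across both lines.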

Apache Hadoop also works with other filesystems: the platform-specific "local" filesystem, blob stores such as Amazon S3 and Azure storage, and alternative distributed filesystems. See: Filesystem Compatibility with Apache Hadoop.

Since version 0.23, Hadoop has included a standalone resource manager: YARN.

This resource manager makes it easier to use other modules alongside the MapReduce engine, such as:

Accumulo, a sorted, distributed key/value store that provides robust, scalable data storage and retrieval.
Ambari, a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
Avro, a data serialization system based on JSON schemas.
Cassandra, a replicated, fault-tolerant, decentralized and scalable database system.
Chukwa, a data collection system for managing large distributed systems.
Cascading, a software abstraction layer for Apache Hadoop that mainly targets Java developers. The framework was developed to reduce the effort of writing boilerplate code for MapReduce programmers with Java skills.
Flink, a fast and reliable large-scale data processing engine.
Giraph, an iterative graph processing framework built on top of Apache Hadoop.
HBase, a scalable, distributed database that supports structured data storage for large tables.
Hive, a data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout, a library of machine learning algorithms compatible with the MapReduce paradigm.
Oozie, a workflow scheduler system to manage Apache Hadoop jobs.
Pig, a platform and programming language for authoring parallelizable jobs.
Spark, a fast and general engine for large-scale data processing.
Storm, a system for real-time and stream processing.
Tez, an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN.
ZooKeeper, a system for coordinating distributed nodes, similar to Google's Chubby.

Commercial support is available from a variety of companies.

44316 questions

7 answers

hadoop copy a local file system folder to HDFS

I need to copy a folder from the local file system to HDFS. I could not find any example of copying a folder (including all its subfolders) to HDFS: $ hadoop fs -copyFromLocal /home/ubuntu/Source-Folder-To-Copy HDFS-URI

hadoop hdfs

asked Jan 29 '15 at 11:05

Tariq

3 answers

Large scale data processing Hbase vs Cassandra

I have nearly settled on Cassandra after my research on large-scale data storage solutions. But it's generally said that HBase is a better solution for large-scale data processing and analysis. While both are key/value stores and both are/can run…

nosql hadoop cassandra hbase data-processing

asked Aug 29 '11 at 23:46

Gary Lindahl

18 answers

How do I output the results of a HiveQL query to CSV?

We would like to put the results of a Hive query to a CSV file. I thought the command should look like this: insert overwrite directory '/home/output.csv' select books from table; When I run it, it says it completed successfully but I can never…
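One point worth illustrating for this question: INSERT OVERWRITE DIRECTORY treats its target as a directory of output files, and Hive's default field separator for text output is Ctrl-A (`\x01`), not a comma. The sketch below is a hedged post-processing step in Python, assuming the output files have already been copied to a place where their lines can be read; the helper name `hive_text_to_csv` is hypothetical.

```python
import csv
import io

# Ctrl-A is Hive's default field separator for text output.
HIVE_DEFAULT_DELIMITER = "\x01"


def hive_text_to_csv(lines, delimiter=HIVE_DEFAULT_DELIMITER):
    """Convert delimiter-separated Hive output lines into CSV text.

    Fields that contain commas or quotes are quoted by the csv module.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for line in lines:
        writer.writerow(line.rstrip("\n").split(delimiter))
    return buffer.getvalue()
```

The function takes any iterable of lines, so it can be fed an open file object as well as a list.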

database hadoop hive hiveql

asked Aug 08 '13 at 15:07

AAA

8 answers

When do reduce tasks start in Hadoop?

In Hadoop when do reduce tasks start? Do they start after a certain percentage (threshold) of mappers complete? If so, is this threshold fixed? What kind of threshold is typically used?
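As background for this question: the threshold is configurable through the property `mapreduce.job.reduce.slowstart.completedmaps` (named `mapred.reduce.slowstart.completed.maps` in older releases), whose default is 0.05. Reducers launched early only fetch map output; the reduce function itself runs after all map tasks have finished. The following is a simplified model of the threshold check in Python, not Hadoop's scheduler code; the function name `reducers_may_start` is hypothetical.

```python
def reducers_may_start(completed_maps, total_maps, slowstart=0.05):
    """Return True once the completed fraction of map tasks reaches the threshold.

    slowstart mirrors the configured fraction: 0.05 means reducers may be
    scheduled after 5% of the maps are done, 1.0 means wait for all maps.
    """
    if total_maps <= 0:
        # A job without map tasks has nothing to wait for.
        return True
    return completed_maps / total_maps >= slowstart
```

With the default value, `reducers_may_start(4, 100)` is false and `reducers_may_start(50, 100)` is true; setting `slowstart=1.0` delays reducers until every map has completed.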

hadoop mapreduce reduce

asked Jul 26 '12 at 15:25

Slayer

9 answers

How to delete a directory with a comma (,) in its name from a Hadoop cluster?

I have uploaded a directory to a Hadoop cluster that has a "," in its name, like "MyDir, Name". When I try to delete this directory using the rmr Hadoop shell command as follows: hadoop dfs -rmr hdfs://host:port/Navi/MyDir, Name I'm getting…

file hadoop

asked Nov 23 '12 at 12:24

java_dev

2 answers

Hadoop truncated/inconsistent counter name

For now, I have a Hadoop job which creates counters with a pretty big name. For example, the following one:…

java hadoop mapreduce hadoop-yarn

asked Jan 17 '17 at 15:32

mr.nothing

7 answers

Is there any way to get the column names along with the output when executing a query in Hive?

In Hive, when we do a query (like: select * from employee), we do not get any column names in the output (like name, age, salary that we would get in RDBMS SQL), we only get the values. Is there any way to get the column names to be displayed along…

hadoop hive rdbms

asked Aug 01 '13 at 05:27

Nithin

11 answers

What is Hive: Return Code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask

I am getting: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask While trying to make a copy of a partitioned table using the commands in the hive console: CREATE TABLE copy_table_name LIKE table_name; INSERT…

hadoop mapreduce hive

asked Jun 25 '12 at 08:04

nickponline

16 answers

How to delete and update a record in Hive

I have installed Hadoop, Hive, and Hive JDBC, which are running fine for me. But I still have a problem: how do I delete or update a single record using Hive? The delete and update commands of MySQL do not work in Hive. Thanks hive> delete from…

hadoop hive sql-delete

asked Jul 23 '13 at 12:44

Charnjeet Singh

10 answers

merge output files after reduce phase

In MapReduce, each reduce task writes its output to a file named part-r-nnnnn, where nnnnn is a partition ID associated with the reduce task. Does map/reduce merge these files? If yes, how?
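For context, the files stay separate unless something concatenates them; the shell command `hadoop fs -getmerge` does this for an HDFS directory, writing one local file. The sketch below shows the same idea in Python for a local copy of a job's output directory. The helper name `merge_part_files` and the local-directory setup are assumptions for illustration.

```python
from pathlib import Path


def merge_part_files(output_dir, merged_path):
    """Concatenate the part-r-* files of output_dir, in name order, into merged_path.

    Marker files such as _SUCCESS are skipped because they do not match
    the part-r-* pattern. Returns the names of the files that were merged.
    """
    parts = sorted(Path(output_dir).glob("part-r-*"))
    with open(merged_path, "wb") as merged:
        for part in parts:
            merged.write(part.read_bytes())
    return [part.name for part in parts]
```

Concatenating in name order keeps each reducer's output contiguous; whether the merged file is globally sorted depends on how the job partitioned its keys.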

hadoop mapreduce

asked Apr 18 '11 at 08:01

Shahryar

12 answers

Where does Hive store files in HDFS?

I'd like to know how to find the mapping between Hive tables and the actual HDFS files (or rather, directories) that they represent. I need to access the table files directly. Where does Hive store its files in HDFS?
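As a rough guide to this question: managed tables live under the directory set by `hive.metastore.warehouse.dir`, which defaults to `/user/hive/warehouse`. Tables in the `default` database sit directly under it, and tables in other databases sit under a `<database>.db` subdirectory. The Python sketch below encodes only that default layout; the function name `managed_table_path` is hypothetical, and external tables or tables created with an explicit LOCATION can live elsewhere, so `DESCRIBE FORMATTED table_name` remains the reliable way to see the actual location.

```python
import posixpath

# Default value of hive.metastore.warehouse.dir; a cluster may override it.
DEFAULT_WAREHOUSE_DIR = "/user/hive/warehouse"


def managed_table_path(table, database="default", warehouse_dir=DEFAULT_WAREHOUSE_DIR):
    """Return the HDFS directory of a managed table under the default layout."""
    if database == "default":
        return posixpath.join(warehouse_dir, table)
    return posixpath.join(warehouse_dir, database + ".db", table)
```

For example, a table `employee` in a database named `sales` would map to `/user/hive/warehouse/sales.db/employee` under these assumptions.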

hadoop hive hdfs

asked Feb 20 '11 at 16:43

Yuval

5 answers

Hive: how to show all partitions of a table?

I have a table with 1000+ partitions. The "Show partitions" command only lists a small number of partitions. How can I show all partitions? Update: I found that the "show partitions" command lists exactly 500 partitions. "select ... where ..." only…

hadoop hive

asked Mar 25 '13 at 13:34

Kevin Leo

12 answers

Building Hadoop with Eclipse / Maven - Missing artifact jdk.tools:jdk.tools:jar:1.6

I am trying to import Cloudera's org.apache.hadoop:hadoop-client:2.0.0-cdh4.0.0 from the cdh4 Maven repo into a Maven project in Eclipse 3.81, with the m2e plugin and Oracle's JDK 1.7.0_05 on Win7, using org.apache.hadoop …

java maven maven-2 hadoop cloudera

asked Jun 20 '12 at 10:57

jvataman

16 answers

Hive insert query like SQL

I am new to Hive and want to know if there is any way to insert data into a Hive table like we do in SQL. I want to insert my data into Hive like INSERT INTO tablename VALUES (value1,value2..) I have read that you can load the data from a file to…

sql hadoop hive hiveql

asked Jul 02 '13 at 12:20

Y0gesh Gupta

3 answers

Differences between Amazon S3 and S3n in Hadoop

When I connected my Hadoop cluster to Amazon storage and downloaded files to HDFS, I found s3:// did not work. When looking for some help on the Internet I found I can use S3n. When I used S3n it worked. I do not understand the differences between…

hadoop amazon-s3 hdfs

asked May 13 '12 at 05:04

user1355361
