Questions tagged [hadoop]

Hadoop is an Apache open-source project that provides software for reliable and scalable distributed computing. The core consists of a distributed file system (HDFS) and a resource manager (YARN). Various other open-source projects, such as Apache Hive, use Apache Hadoop as a persistence layer.

The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.

"Hadoop" typically refers to the software in the project that implements the map-reduce data analysis framework, plus the distributed file system (HDFS) that underlies it.
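The map-reduce idea itself can be sketched in a few lines of plain Java. This is a toy, single-process illustration and does not use the Hadoop API; the class name `MapReduceSketch` and its methods are hypothetical, chosen only to show the map, shuffle and reduce phases on a word count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy, single-process illustration of the map-reduce idea.
// It does NOT use the Hadoop API; all names here are hypothetical.
class MapReduceSketch {

    // Map phase: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by their key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs map, shuffle and reduce in sequence over all input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            counts.put(e.getKey(), reduce(e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints {data=2, hadoop=2, processes=1, stores=1}
        System.out.println(wordCount(List.of("hadoop stores data", "hadoop processes data")));
    }
}
```

In a real cluster the map and reduce calls would run on different machines and the shuffle would move data over the network; here everything runs in one JVM to keep the phases visible.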

A cluster has at least one NameNode, and usually more than one for redundancy. The NameNode accepts requests coming in from client applications and keeps track of where the data lives. The work itself is shared across the DataNodes, of which there are typically many. They can share it because they all have access to a common file system, the Hadoop Distributed File System (HDFS).
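As a rough illustration of the bookkeeping involved, the sketch below models how a file might be cut into fixed-size blocks and how each block might be assigned to several DataNodes. It is a toy model under stated assumptions: the class `BlockPlacementSketch`, the round-robin assignment, the node names, and the 128 MB / 3-replica figures are all chosen for the example. Real HDFS placement is more involved and is not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: names and the round-robin policy are hypothetical,
// and real HDFS block placement works differently.
class BlockPlacementSketch {

    // Number of blocks needed to hold a file of the given size.
    static long blockCount(long fileSize, long blockSize) {
        return (fileSize + blockSize - 1) / blockSize; // ceiling division
    }

    // For each block, pick `replication` distinct DataNodes in round-robin order.
    static List<List<String>> place(long fileSize, long blockSize,
                                    List<String> dataNodes, int replication) {
        if (replication > dataNodes.size()) {
            throw new IllegalArgumentException("not enough DataNodes for the requested replication");
        }
        List<List<String>> placement = new ArrayList<>();
        long blocks = blockCount(fileSize, blockSize);
        for (long b = 0; b < blocks; b++) {
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                replicas.add(dataNodes.get((int) ((b + r) % dataNodes.size())));
            }
            placement.add(replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("dn1", "dn2", "dn3", "dn4");
        // A 300 MB file, 128 MB blocks, 3 replicas per block (example values).
        // Prints [[dn1, dn2, dn3], [dn2, dn3, dn4], [dn3, dn4, dn1]]
        System.out.println(place(300L << 20, 128L << 20, nodes, 3));
    }
}
```

The point of the sketch is only that the metadata (which node holds which block) is small and centralised, while the blocks themselves are spread out and replicated.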

Description of Hadoop cluster

Apache Hadoop also works with other filesystems: the platform-specific "local" filesystem, blob stores such as Amazon S3 and Azure Storage, and alternative distributed filesystems. See: Filesystem Compatibility with Apache Hadoop.

Since version 0.23, Hadoop has included a standalone resource manager: YARN.

This resource manager makes it easier to use other modules alongside the MapReduce engine, such as:

  • Accumulo, a sorted, distributed key/value store that provides robust, scalable data storage and retrieval.
  • Ambari, a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro, a data serialization system based on JSON schemas.
  • Cassandra, a replicated, fault-tolerant, decentralized and scalable database system.
  • Chukwa, a data collection system for managing large distributed systems.
  • Cascading, a software abstraction layer for Apache Hadoop that mainly targets Java developers. The framework was developed to reduce the effort MapReduce programmers with Java skills spend writing boilerplate code.
  • Flink, a fast and reliable large-scale data processing engine.
  • Giraph, an iterative graph processing framework built on top of Apache Hadoop.
  • HBase, a scalable, distributed database that supports structured data storage for large tables.
  • Hive, a data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout, a library of machine learning algorithms compatible with the MapReduce (M/R) paradigm.
  • Oozie, a workflow scheduler system to manage Apache Hadoop jobs.
  • Pig, a platform/programming language for authoring parallelizable jobs.
  • Spark, a fast and general engine for large-scale data processing.
  • Storm, a system for real-time and stream processing.
  • Tez, an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN.
  • ZooKeeper, a system for coordinating distributed nodes, similar to Google's Chubby.

Related Technology

Commercial support is available from a variety of companies.

44316 questions
61
votes
5 answers

Just get column names from hive table

I know that you can get column names from a table via the following trick in hive: hive> set hive.cli.print.header=true; hive> select * from tablename; Is it also possible to just get the column names from the table? I dislike having to change a…
cantdutchthis
  • 31,949
  • 17
  • 74
  • 114
60
votes
19 answers

Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z

I am trying to run an MR program (version 2.7) on 64-bit Windows 7 in Eclipse, and the above exception occurs while running it. I verified that I am using a 64-bit Java 1.8 and observed that all the Hadoop daemons are running. Any suggestions highly…
sukumar konduru
  • 599
  • 1
  • 4
  • 4
60
votes
7 answers

Life without JOINs... understanding, and common practices

Lots of "BAW"s (big ass-websites) are using data storage and retrieval techniques that rely on huge tables with indexes, and using queries that won't/can't use JOINs in their queries (BigTable, HQL, etc) to deal with scalability and sharding…
snicker
  • 6,080
  • 6
  • 43
  • 49
59
votes
3 answers

How to restart a failed task on Airflow

I am using a LocalExecutor and my DAG has 3 tasks, where task(C) is dependent on task(A). Task(B) and task(A) can run in parallel, something like below: A-->C B So task(A) has failed but task(B) ran fine. Task(C) is yet to run as task(A) has…
Chetan J
  • 1,847
  • 5
  • 16
  • 21
59
votes
4 answers

Stop Java Coffee Cup icon from appearing in the Dock on Mac OSX

After upgrading to OSX 10.8.4, background Java processes started placing a Java Cup icon in the Dock. It causes the currently active window to lose focus, which is very annoying when running some script that forks many short-running Java processes…
Eugene
  • 826
  • 1
  • 7
  • 7
59
votes
4 answers

Write a file in hdfs with Java

I want to create a file in HDFS and write data in that. I used this code: Configuration config = new Configuration(); FileSystem fs = FileSystem.get(config); Path filenamePath = new Path("input.txt"); try { if (fs.exists(filenamePath))…
csperson
  • 901
  • 3
  • 12
  • 17
58
votes
5 answers

How does impala provide faster query response compared to hive

I have recently started looking into querying large sets of CSV data lying on HDFS using Hive and Impala. As I was expecting, I get better response time with Impala compared to Hive for the queries I have used so far. I am wondering if there are…
techuser soma
  • 4,766
  • 5
  • 23
  • 43
57
votes
3 answers

How to load data to hive from HDFS without removing the source file?

When loading data from HDFS to Hive using the LOAD DATA INPATH 'hdfs_file' INTO TABLE tablename; command, it looks like it is moving the hdfs_file to the hive/warehouse dir. Is it possible (and how?) to copy it instead of moving it, in order for the file to…
Suge
  • 2,808
  • 3
  • 48
  • 79
57
votes
7 answers

Hadoop on OSX "Unable to load realm info from SCDynamicStore"

I am getting this error on startup of Hadoop on OSX 10.7: Unable to load realm info from SCDynamicStore put: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create directory /user/travis/input/conf. Name node is in safe mode. It…
Travis Nelson
  • 2,590
  • 5
  • 28
  • 34
57
votes
7 answers

How does Hive compare to HBase?

I'm interested in finding out how the recently-released (http://mirror.facebook.com/facebook/hive/hadoop-0.17/) Hive compares to HBase in terms of performance. The SQL-like interface used by Hive is very much preferable to the HBase API we have…
mrhahn
  • 607
  • 1
  • 7
  • 7
57
votes
11 answers

Scalable Image Storage

I'm currently designing an architecture for a web-based application that should also provide some kind of image storage. Users will be able to upload photos as one of the key features of the service. Also viewing these images will be one of the…
b_erb
  • 20,932
  • 8
  • 55
  • 64
56
votes
7 answers

PIG how to count a number of rows in alias

I did something like this to count the number of rows in an alias in PIG: logs = LOAD 'log' logs_w_one = foreach logs generate 1 as one; logs_group = group logs_w_one all; logs_count = foreach logs_group generate SUM(logs_w_one.one); dump…
kee
  • 10,969
  • 24
  • 107
  • 168
56
votes
3 answers

Does Hive have a String split function?

I am looking for an in-built String split function in Hive. E.g. if the String is: A|B|C|D|E Then I want to have a function like: array split(string input, char delimiter) So that I get back: [A,B,C,D,E] Does such an in-built split function…
user855
  • 19,048
  • 38
  • 98
  • 162
56
votes
6 answers

Permission denied at hdfs

I am new to the Hadoop distributed file system. I have done a complete installation of a Hadoop single node on my machine, but after that, when I try to upload data to HDFS, it gives the error message Permission Denied. Message from terminal with…
Vignesh Prajapati
  • 2,320
  • 3
  • 28
  • 38
56
votes
12 answers

Hbase quickly count number of rows

Right now I implement row count over ResultScanner like this: for (Result rs = scanner.next(); rs != null; rs = scanner.next()) { number++; } If the data reaches millions of rows, the computing time is large. I want to compute in real time that I don't want to…
cldo
  • 1,735
  • 6
  • 21
  • 26