Questions tagged [hadoop]

Hadoop is an Apache open-source project that provides software for reliable and scalable distributed computing. The core consists of a distributed file system (HDFS) and a resource manager (YARN). Various other open-source projects, such as Apache Hive, use Apache Hadoop as a persistence layer.


"Hadoop" typically refers to the software in the project that implements the MapReduce data-analysis framework, plus the distributed file system (HDFS) that underlies it.

A cluster has at least one NameNode, and usually more than one for redundancy. The NameNode accepts requests from client applications and distributes the processing work across the cluster's DataNodes, of which there are typically many. The DataNodes can share this work because they all have access to a common file system, the Hadoop Distributed File System (HDFS).

[Diagram: description of a Hadoop cluster]

Apache Hadoop also works with other filesystems: the platform-specific "local" filesystem, blob stores such as Amazon S3 and Azure Storage, and alternative distributed filesystems. See: Filesystem Compatibility with Apache Hadoop.

Since version 0.23, Hadoop includes a standalone resource manager: YARN.
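For example, running MapReduce jobs on the YARN resource manager is typically enabled with a setting like the following in mapred-site.xml (a minimal sketch; file locations and additional required properties vary by distribution):

```xml
<!-- mapred-site.xml: submit MapReduce jobs to the YARN resource manager -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```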

This resource manager makes it easier to use other modules alongside the MapReduce engine, such as:

  • Accumulo, a sorted, distributed key/value store that provides robust, scalable data storage and retrieval.
  • Ambari, a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health (such as heatmaps) and for inspecting MapReduce, Pig, and Hive applications visually, with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro, a data serialization system based on JSON schemas.
  • Cassandra, a replicated, fault-tolerant, decentralized, and scalable database system.
  • Chukwa, a data collection system for managing large distributed systems.
  • Cascading, a software abstraction layer for Apache Hadoop aimed mainly at Java developers; it reduces the boilerplate code that MapReduce programmers would otherwise write by hand.
  • Flink, a fast and reliable large-scale data processing engine.
  • Giraph, an iterative graph processing framework built on top of Apache Hadoop.
  • HBase, a scalable, distributed database that supports structured data storage for large tables.
  • Hive, a data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout, a library of machine learning algorithms compatible with the MapReduce paradigm.
  • Oozie, a workflow scheduler system to manage Apache Hadoop jobs.
  • Pig, a platform and programming language for authoring parallelizable jobs.
  • Spark, a fast and general engine for large-scale data processing.
  • Storm, a system for real-time and stream processing.
  • Tez, an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN.
  • ZooKeeper, a system for coordinating distributed nodes, similar to Google's Chubby.

Commercial support is available from a variety of companies.

44316 questions
8 votes, 2 answers

Checking if directory in HDFS is empty or not

Is there any command in HDFS to check whether a directory is empty or not? (asked by VSP)
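For the empty-directory question above, one common approach is to parse the output of `hdfs dfs -count` (a hedged sketch: it assumes the `hdfs` client is on the PATH, and the helper names are illustrative):

```python
import subprocess

def count_summary_is_empty(count_line: str) -> bool:
    # 'hdfs dfs -count <path>' prints: DIR_COUNT FILE_COUNT CONTENT_SIZE PATH
    dir_count, file_count, _rest = count_line.split(None, 2)
    # the directory itself is counted, so an empty dir shows 1 dir and 0 files
    return int(dir_count) == 1 and int(file_count) == 0

def hdfs_dir_is_empty(path: str) -> bool:
    # requires the 'hdfs' CLI; raises CalledProcessError if the path is missing
    out = subprocess.check_output(["hdfs", "dfs", "-count", path], text=True)
    return count_summary_is_empty(out.strip())
```

Purely from the shell, `hdfs dfs -count <path>` can be inspected the same way, or `hdfs dfs -test -d <path>` can first confirm the directory exists.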
8 votes, 2 answers

Hive impersonation not working with custom authenticator provider

I've developed a custom authenticator provider and everything seems OK with regards to authentication: HiveServer2 starts well and authenticated connections are properly validated. Even simple Hive queries work, such as show tables. The problem is… (asked by frb)
8 votes, 2 answers

Call From quickstart.cloudera/172.17.0.2 to quickstart.cloudera:8020 failed on connection exception: java.net.ConnectException: Connection refused

I am very new to Docker and the Hadoop system. I installed Docker on Ubuntu 16.04 and ran the Hadoop image from Cloudera inside a new Docker container, but when I try to run any command in HDFS the error message shown is: Call From… (asked by gd1)
8 votes, 1 answer

Error connecting Hortonworks Hive ODBC in Excel 2013

I am trying to query Hortonworks Hive via the ODBC driver in Excel 2013. I downloaded the driver here (32-bit): http://hortonworks.com/downloads/ Hortonworks 2.5 Hive 2.5.0.0-1245. Then I add the config in ODBC Data Source Administrator… (asked by HP.)
8 votes, 1 answer

Not able to install Hadoop using Cloudera Manager

I am trying to set up a Hadoop cluster in a single VM (for simplicity) using Cloudera Manager 5.9. The details of my environment: host OS: Windows 10; virtualization software: VirtualBox 5.1.10; guest OS: CentOS 6.8. I installed the… (asked by CuriousMind)
8 votes, 2 answers

Hadoop Streaming: Mapper 'wrapping' a binary executable

I have a pipeline that I currently run on a large university computer cluster. For publication purposes I'd like to convert it into MapReduce format such that it could be run by anyone using a Hadoop cluster such as Amazon Web Services (AWS)… (asked by Nick Crawford)
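The streaming question above touches a common pattern: a Hadoop Streaming mapper is simply a program that reads lines on stdin and emits tab-separated key/value pairs on stdout, so it can be written in any language. A minimal word-count mapper in Python as a sketch (a real job wrapping a binary would exec the executable instead of tokenizing itself):

```python
import sys

def map_line(line):
    # emit (word, 1) for every whitespace-separated token
    return [(word, 1) for word in line.split()]

def main(stdin=sys.stdin, stdout=sys.stdout):
    # Hadoop Streaming feeds input splits on stdin and collects
    # tab-separated key/value pairs from stdout
    for line in stdin:
        for word, count in map_line(line):
            stdout.write(f"{word}\t{count}\n")

if __name__ == "__main__":
    main()
```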
8 votes, 1 answer

CASE statements in Hive

OK, I have the following code to mark records that have the highest month_cd in tabl with a binary flag: Select t1.month_cd, t2.max_month_cd, CASE WHEN t2.max_month_cd != null then 0 else 1 end test_1, CASE WHEN t2.max_month_cd = null then 0 else 1 end… (asked by JagdCrab)
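The CASE question above trips over SQL's NULL semantics: in Hive, both `x = null` and `x != null` evaluate to NULL (never true), so the query needs `IS NULL` / `IS NOT NULL`. A toy Python model of SQL's three-valued comparison illustrates why (the helper names are invented for illustration, with None standing in for NULL):

```python
def sql_equals(a, b):
    # SQL comparison: any comparison involving NULL yields NULL, not True/False
    if a is None or b is None:
        return None
    return a == b

def case_when(condition, then, otherwise):
    # a CASE WHEN branch fires only when its condition is literally true,
    # so a NULL condition falls through to the ELSE branch
    return then if condition is True else otherwise
```

Because `sql_equals(5, None)` returns None rather than False, a `CASE WHEN max_month_cd = null` branch never fires; the Hive query should instead read `CASE WHEN t2.max_month_cd IS NOT NULL THEN 0 ELSE 1 END`.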
8 votes, 2 answers

Cannot load main class from JAR file

I have a Spark/Scala application. I tried to display a simple message, "Hello my App". When I compile it with sbt compile and run it with sbt run, it works: my message is displayed, but then an error appears, like this: Hello my… (asked by sirine)
8 votes, 2 answers

What is the principle of "code moving to data" rather than data to code?

In a recent discussion about distributed processing and streaming I came across the concept of 'code moving to data'. Can someone please help explain it? The reference for this phrase is MapReduceWay. In terms of Hadoop, it's stated in a…
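The principle asked about above can be sketched with a toy model: each node applies a (small) function to its own (large) local partition, and only the small partial results travel over the network (node names and data here are invented for illustration):

```python
# "code moving to data": ship the mapper to each node-local partition,
# rather than shipping the partitions to a central machine
partitions = {
    "node-1": [1, 2, 3, 4],
    "node-2": [5, 6, 7],
    "node-3": [8, 9],
}

def local_sum(chunk):
    # runs where the data lives; only the single number it returns is shipped
    return sum(chunk)

partial_results = {node: local_sum(chunk) for node, chunk in partitions.items()}
total = sum(partial_results.values())  # tiny aggregation step over small values
```

The function (a few bytes of code) moves to the data; the gigabytes of data never move, which is the bandwidth-saving insight behind MapReduce scheduling map tasks on the DataNodes that hold the relevant HDFS blocks.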
8 votes, 4 answers

Will Spark SQL completely replace Apache Impala or Apache Hive?

I need to deploy a big data cluster on our servers, but I only have knowledge of Apache Spark. I need to know whether Spark SQL can completely replace Apache Impala or Apache Hive. I need your help. Thanks. (asked by Tim Koo)
8 votes, 1 answer

Hadoop: NullPointerException when redirecting to job history server

I have a Hadoop cluster (HDP 2.1). Everything had been working for a long time, but suddenly jobs started to return the following recurrent error: 16/10/13 16:21:11 INFO Configuration.deprecation: fs.default.name is deprecated. Instead, use… (asked by frb)
8 votes, 1 answer

Difference between MapReduce split and Spark partition

I wanted to ask whether there is any significant difference in data partitioning when working with Hadoop/MapReduce versus Spark. Both work on HDFS (TextInputFormat), so it should be the same in theory. Are there any cases where the procedure of data… (asked by shujaat)
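For the split/partition question above, both systems derive one input split per block-sized range of an HDFS file when reading with TextInputFormat. A toy computation of split boundaries, simplified for illustration (the real logic also honors record boundaries and configurable min/max split sizes):

```python
def input_splits(file_length, block_size):
    # one (offset, length) split per block-sized chunk; the last split is shorter
    splits = []
    offset = 0
    while offset < file_length:
        length = min(block_size, file_length - offset)
        splits.append((offset, length))
        offset += length
    return splits
```

A 300-byte file with a 128-byte block size would yield three splits, and hence three map tasks in MapReduce or three initial partitions in Spark.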
8 votes, 5 answers

How do you insert data into complex data type "Struct" in Hive

I'm completely new to Hive and Stack Overflow. I'm trying to create a table with the complex data type "STRUCT" and then populate it using INSERT INTO TABLE in Hive. I'm using the following code: CREATE TABLE struct_test ( address STRUCT< … (asked by data101)
8 votes, 2 answers

Writing to a file in Apache Spark

I am writing Scala code that requires me to write to a file in HDFS. When I use Filewriter.write locally, it works; the same thing does not work on HDFS. Upon checking, I found the following options for writing in Apache Spark… (asked by kruparulz14)
8 votes, 3 answers

Spark - How many executors and cores are allocated to my Spark job

Spark's architecture revolves entirely around the concept of executors and cores. I would like to see, in practice, how many executors and cores are running for my Spark application in a cluster. I was trying to use the below snippet in my… (asked by Krishna Reddy)