Questions tagged [hadoop]

Hadoop is an Apache open-source project that provides software for reliable and scalable distributed computing. The core consists of a distributed file system (HDFS) and a resource manager (YARN). Various other open-source projects, such as Apache Hive, use Apache Hadoop as a persistence layer.


"Hadoop" typically refers to the software in the project that implements the map-reduce data analysis framework, plus the distributed file system (HDFS) that underlies it.

A Hadoop cluster has at least one NameNode, and usually more for redundancy. The NameNode accepts requests from client applications and distributes the processing work across many DataNodes. The DataNodes can share that work because they all have access to a common file system, the Hadoop Distributed File System (HDFS).

(Diagram: description of a Hadoop cluster)

Apache Hadoop also works with other filesystems: the platform-specific "local" filesystem, blob stores such as Amazon S3 and Azure Storage, and alternative distributed filesystems. See: Filesystem Compatibility with Apache Hadoop.
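Hadoop picks the filesystem implementation from the scheme of the path URI (configured via `fs.<scheme>.impl` entries; an empty scheme falls back to `fs.defaultFS`). A plain-Python sketch of that dispatch, with an illustrative mapping only:

```python
from urllib.parse import urlparse

# Illustrative mapping only -- the real table lives in Hadoop's
# core-default.xml / core-site.xml (fs.<scheme>.impl entries).
SCHEME_TO_FS = {
    "hdfs": "Hadoop Distributed File System",
    "s3a":  "Amazon S3 (S3A connector)",
    "wasb": "Azure Blob storage",
    "file": "platform-specific local filesystem",
}

def filesystem_for(uri, default_scheme="file"):
    # Hadoop resolves the scheme of a path URI to pick the filesystem
    # implementation; a scheme-less path falls back to the default.
    scheme = urlparse(uri).scheme or default_scheme
    return SCHEME_TO_FS.get(scheme, "unknown scheme: " + scheme)

print(filesystem_for("hdfs://namenode:8020/user/data"))  # → Hadoop Distributed File System
print(filesystem_for("s3a://bucket/logs/2016/"))         # → Amazon S3 (S3A connector)
print(filesystem_for("/tmp/local.txt"))                  # → platform-specific local filesystem
```

This is why the same `FileSystem` API call in a Hadoop program can transparently read from HDFS, S3, or local disk depending only on the path it is given.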

Since version 0.23, Hadoop has shipped with a standalone resource manager: YARN.

This resource manager makes it easier to use other modules alongside the MapReduce engine, such as:

  • Accumulo, a sorted, distributed key/value store that provides robust, scalable data storage and retrieval.
  • Ambari, a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health (such as heatmaps) and lets you view MapReduce, Pig, and Hive applications visually, with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro, a data serialization system based on JSON schemas.
  • Cassandra, a replicated, fault-tolerant, decentralized, and scalable database system.
  • Chukwa, a data collection system for managing large distributed systems.
  • Cascading, a software abstraction layer for Apache Hadoop aimed mainly at Java developers, built to spare MapReduce programmers the effort of writing boilerplate code.
  • Flink, a fast and reliable large-scale data processing engine.
  • Giraph, an iterative graph processing framework built on top of Apache Hadoop.
  • HBase, a scalable, distributed database that supports structured data storage for large tables.
  • Hive, a data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout, a library of machine learning algorithms compatible with the MapReduce paradigm.
  • Oozie, a workflow scheduler system to manage Apache Hadoop jobs.
  • Pig, a platform and programming language for authoring parallelizable jobs.
  • Spark, a fast and general engine for large-scale data processing.
  • Storm, a system for real-time and stream processing.
  • Tez, an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN.
  • ZooKeeper, a system for coordinating distributed nodes, similar to Google's Chubby.

Commercial support is available from a variety of companies.

44316 questions
8 votes, 2 answers

Different ways to import files into HDFS

I want to know what are the different ways through which I can bring data into HDFS. I am a newbie to Hadoop and was a Java web developer until now. I want to know, if I have a web application that is creating log files, how can I import the log…
Gaurav (81)
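For one-off loads, the usual route is the `hadoop fs -put` (or `-copyFromLocal`) shell command; for continuous log ingestion, tools such as Flume are commonly used. A minimal sketch of driving `-put` from Python; the paths are made up for the example, and actually running the command requires a configured Hadoop client:

```python
def hdfs_put_command(local_paths, hdfs_dir):
    # Build the `hadoop fs -put` invocation that copies local files
    # (e.g. rotated web-app log files) into an HDFS directory.
    return ["hadoop", "fs", "-put"] + list(local_paths) + [hdfs_dir]

# Hypothetical paths, chosen for the example:
cmd = hdfs_put_command(["/var/log/app/access.log"], "/user/gaurav/logs/")
print(" ".join(cmd))
# → hadoop fs -put /var/log/app/access.log /user/gaurav/logs/

# On a machine with a Hadoop client on the PATH, this would run it:
# import subprocess; subprocess.run(cmd, check=True)
```

A cron job around such a command is often enough for periodic log import; higher-volume pipelines usually graduate to Flume or a similar collector.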
8 votes, 4 answers

Hive Internal Error: java.lang.ClassNotFoundException(org.apache.atlas.hive.hook.HiveHook)

I am running a Hive query through Oozie using Hue. I am creating a table through a Hue-Oozie workflow. My job is failing, but when I check in Hive, the table is created. The log shows the error below: 16157 [main] INFO …
Amaresh (3,231)
8 votes, 2 answers

how to find JAR: /home/hadoop/contrib/streaming/hadoop-streaming.jar

I'm practicing a video tutorial from Pluralsight about Amazon EMR. I am stuck and cannot proceed because I am getting this error: Not a valid JAR: /home/hadoop/contrib/streaming/hadoop-streaming.jar. Please note that the tutorial is old and it is using a…
harshil bhatt (152)
8 votes, 1 answer

Could not find uri with key dfs.encryption.key.provider.uri to create a keyProvider in HDFS encryption for CDH 5.4

CDH Version: CDH 5.4.5. Issue: when HDFS encryption is enabled using the KMS available in Hadoop CDH 5.4, I get an error while putting a file into an encryption zone. Steps for encryption of Hadoop are as follows: Creating a key [SUCCESS] [tester@master…
Jack Sparrow (81)
8 votes, 0 answers

Hadoop Counters vs Spark Accumulators (or what's a best way to gather statistics from hadoop mr and spark applications)

I'd like to understand what the best practices are to gather statistics of job execution in standard Hadoop map-reduce and Spark. Given 1. A number of files in HDFS (each directory, i.e. dataset1, dataset2, etc. is the name of the dataset from the…
szhem (4,672)
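Both mechanisms solve the same problem: write-only named tallies that individual tasks increment as a side effect and the driver aggregates and reads at the end. A plain-Python sketch of the pattern, with made-up record handling for illustration:

```python
from collections import Counter

def process_partition(records, stats):
    # Each task increments named tallies as a side effect of the real
    # work -- the same pattern as context.getCounter(...).increment(1)
    # in MapReduce or accumulator.add(1) in Spark.
    out = []
    for rec in records:
        if not rec.strip():
            stats["malformed"] += 1
            continue
        stats["processed"] += 1
        out.append(rec.strip())
    return out

stats = Counter()
for partition in [["a", " ", "b"], ["", "c"]]:
    process_partition(partition, stats)
print(dict(stats))  # → {'processed': 3, 'malformed': 2}
```

The distributed versions add what this sketch lacks: the framework ships each task's tallies back and merges them, and (in Spark's case) guards against double-counting when tasks are retried only for accumulators used inside actions.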
8 votes, 1 answer

"LOST" node in EMR Cluster

How do I troubleshoot and recover a Lost Node in my long running EMR cluster? The node stopped reporting a few days ago. The host seems to be fine and HDFS too. I noticed the issue only from the Hadoop Applications UI.
Marsellus Wallace (17,991)
8 votes, 3 answers

pyspark : how to check if a file exists in hdfs

I want to check if several files exist in HDFS before loading them with SparkContext. I use pyspark. I tried os.system("hadoop fs -test -e %s" %path) but as I have a lot of paths to check, the job crashed. I tried also sc.wholeTextFiles(parent_path) and…
A7med (451)
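One way to avoid spawning a process per path is to list the parent directory once (a single `hadoop fs -ls` call via subprocess) and check membership locally. A sketch of the parsing side in plain Python; the sample listing below is fabricated for the example, not captured from a real cluster:

```python
def paths_in_listing(listing):
    # Extract the path column (last field) from `hadoop fs -ls` output,
    # skipping the "Found N items" header line.
    paths = set()
    for line in listing.splitlines():
        if not line or line.startswith("Found"):
            continue
        paths.add(line.split()[-1])
    return paths

# Fabricated sample in the shape `hadoop fs -ls` prints:
sample = """Found 2 items
-rw-r--r--   3 user group       1024 2016-01-01 12:00 /data/part-00000
-rw-r--r--   3 user group       2048 2016-01-01 12:00 /data/part-00001"""

existing = paths_in_listing(sample)
print("/data/part-00000" in existing)  # → True
print("/data/part-00042" in existing)  # → False
```

With the listing in hand, thousands of existence checks become set lookups instead of thousands of JVM-spawning shell invocations.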
8 votes, 1 answer

Unable to connect to Spark UI on EMR

I have set up my SSH tunnel as per the instructions on the EMR console using ssh -i ~/SparkTest.pem -ND 8157 hadoop@ec2-52-1-245-67.compute-1.amazonaws.com. I have also set up FoxyProxy as per the instructions. I can access the Hadoop…
Rory Byrne (923)
8 votes, 2 answers

How to use the ResourceManager web interface as a user

Every time I try to use the Hadoop ResourceManager web interface (http://resource-manger.host:8088/cluster/) I show up logged in as dr.who. My question: how can I log in as another user? In this case I want to log in as myself and have a higher lever…
SQL.injection (2,607)
8 votes, 2 answers

Pyspark: shuffle RDD

I'm trying to randomise the order of elements in an RDD. My current approach is to zip the elements with an RDD of shuffled integers, then later join by those integers. However, pyspark falls over with only 100000000 integers. I'm using the code…
Marcin (48,559)
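An alternative to the zip-and-join approach is to key each element by an independent random number and sort by that key (in PySpark, `rdd.sortBy(lambda _: random.random())`), which avoids building and joining a second RDD of integers. A plain-Python sketch of the idea:

```python
import random

def shuffle_by_random_key(elements, seed=None):
    # Attach an independent random key to every element and sort by it --
    # the same idea as rdd.sortBy(lambda _: random.random()) in PySpark.
    rng = random.Random(seed)
    keyed = [(rng.random(), x) for x in elements]
    keyed.sort(key=lambda kv: kv[0])
    return [x for _, x in keyed]

data = list(range(10))
print(shuffle_by_random_key(data, seed=42))
```

In the distributed setting the sort is what triggers the shuffle, so the random keys end up spreading elements across partitions as well as reordering them within each one.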
8 votes, 3 answers

"Wrong FS... expected: file:///" when trying to read file from HDFS in Java

I am unable to read a file from HDFS using Java: String hdfsUrl = "hdfs://:"; Configuration configuration = new Configuration(); configuration.set("fs.defaultFS", hdfsUrl); FileSystem fs = FileSystem.get(configuration); Path filePath = new…
jds (7,910)
8 votes, 2 answers

Hive collect_list() does not collect NULL values

I am trying to collect a column with NULLs along with some values in that column... but collect_list ignores the NULLs and collects only the ones with values. Is there a way to retrieve the NULLs along with the other values? SELECT col1, col2,…
lalith kkvn (310)
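This matches Hive's documented behavior: `collect_list` drops NULLs. A common workaround is to wrap the column in `coalesce(col, 'NULL')` (or a CASE expression) so a sentinel value survives collection. A plain-Python sketch of both behaviors, modeling NULL as `None`:

```python
def collect_list(values):
    # Mimics Hive's collect_list(): NULL (None) values are dropped.
    return [v for v in values if v is not None]

def collect_list_keep_nulls(values, sentinel="NULL"):
    # Analogue of collect_list(coalesce(col, 'NULL')): replace NULLs
    # with a sentinel so they survive collection.
    return [sentinel if v is None else v for v in values]

col = ["a", None, "b", None]
print(collect_list(col))             # → ['a', 'b']
print(collect_list_keep_nulls(col))  # → ['a', 'NULL', 'b', 'NULL']
```

The sentinel approach only works when the sentinel cannot collide with real data; otherwise a struct wrapper around the value is the safer variant.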
8 votes, 1 answer

Hive: Is there a better way to percentile rank a column?

Currently, to percentile rank a column in Hive, I am using something like the following. I am trying to rank items in a column by what percentile they fall under, assigning a value from 0 to 1 to each item. The code below assigns a value from 0 to…
Charlie Haley (4,152)
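Hive 0.11 and later provide the windowing function `PERCENT_RANK() OVER (ORDER BY col)`, which assigns (rank - 1) / (n - 1) to each row. A plain-Python sketch of that formula (assumes more than one row; ties share the rank of their first occurrence, as in Hive):

```python
def percent_rank(values):
    # Hive's PERCENT_RANK() over ORDER BY: (rank - 1) / (n - 1),
    # where rank is the 1-based position of the first equal value.
    ordered = sorted(values)
    n = len(values)
    ranks = {}
    for i, v in enumerate(ordered):
        if v not in ranks:        # ties share the first rank seen
            ranks[v] = i          # stored 0-based, i.e. rank - 1
    return [ranks[v] / (n - 1) for v in values]

print(percent_rank([10, 20, 30, 40, 50]))
# → [0.0, 0.25, 0.5, 0.75, 1.0]
```

Using the built-in window function pushes this computation into Hive itself and avoids the self-join or subquery gymnastics that pre-0.11 percentile ranking required.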
8 votes, 1 answer

How to set up Hadoop in Docker Swarm?

I would like to be able to start a Hadoop cluster in Docker, distributing the Hadoop nodes to the different physical nodes, using swarm. I have found the sequenceiq image that lets me run hadoop in a docker container, but this doesn't allow me to…
SGer (544)
8 votes, 2 answers

Apache hive MSCK REPAIR TABLE new partition not added

I am new to Apache Hive. While working on external table partitions, if I add a new partition directly to HDFS, the new partition is not added after running MSCK REPAIR TABLE. Below are the codes I tried, -- creating external table hive> create…
Green (111)