Questions tagged [hadoop]

Hadoop is an Apache open-source project that provides software for reliable, scalable, distributed computing. The core consists of a distributed file system (HDFS) and a resource manager (YARN). Various other open-source projects, such as Apache Hive, use Apache Hadoop as a persistence layer.

"Hadoop" typically refers to the software in the project that implements the map-reduce data analysis framework, plus the distributed file system (HDFS) that underlies it.

A cluster has at least one NameNode, and usually more than one for redundancy. The NameNode accepts requests coming in from client applications and distributes the processing across many DataNodes. The DataNodes can share that work because they all have access to a common storage layer, typically referred to as the Hadoop Distributed File System, or HDFS.

(Diagram: description of a Hadoop cluster)

Apache Hadoop also works with other filesystems: the platform-specific "local" filesystem, blob stores such as Amazon S3 and Azure Storage, and alternative distributed filesystems. See: Filesystem Compatibility with Apache Hadoop.
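As a minimal sketch of what that compatibility means in practice, the same Hadoop FileSystem API serves HDFS, the local filesystem, and blob stores; the NameNode address and paths below are hypothetical:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object ListDemo {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()
        // Hypothetical NameNode address; an s3a:// or file:// URI works
        // here too, which is what filesystem compatibility buys you.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020")
        val fs = FileSystem.get(conf)
        fs.listStatus(new Path("/user")).foreach(status => println(status.getPath))
      }
    }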

Since version 0.23, Hadoop ships with a standalone resource manager: YARN.

This resource manager makes it easier to use other modules alongside the MapReduce engine, such as:

  • Accumulo, a sorted, distributed key/value store that provides robust, scalable data storage and retrieval.
  • Ambari, a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health (such as heatmaps) and the ability to view MapReduce, Pig, and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro, a data serialization system based on JSON schemas.
  • Cassandra, a replicated, fault-tolerant, decentralized, and scalable database system.
  • Chukwa, a data collection system for managing large distributed systems.
  • Cascading, a software abstraction layer for Apache Hadoop aimed mainly at Java developers; it reduces the boilerplate code that MapReduce programmers would otherwise write.
  • Flink, a fast and reliable large-scale data processing engine.
  • Giraph, an iterative graph processing framework built on top of Apache Hadoop.
  • HBase, a scalable, distributed database that supports structured data storage for large tables.
  • Hive, a data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout, a library of machine learning algorithms compatible with the MapReduce paradigm.
  • Oozie, a workflow scheduler system to manage Apache Hadoop jobs.
  • Pig, a platform and programming language for authoring parallelizable jobs.
  • Spark, a fast and general engine for large-scale data processing.
  • Storm, a system for real-time and stream processing.
  • Tez, an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN.
  • ZooKeeper, a system for coordinating distributed nodes, similar to Google's Chubby.

Related Technology

Commercial support is available from a variety of companies.

44,316 questions

33 votes, 2 answers
Would Spark unpersist the RDD itself when it realizes it won't be used anymore?
We can persist an RDD into memory and/or disk when we want to use it more than once. However, do we have to unpersist it ourselves later on, or does Spark do some kind of garbage collection and unpersist the RDD when it is no longer needed? I…
(asked by MetallicPriest)

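For context, a minimal sketch of the persist/unpersist API being asked about (input path hypothetical); whether Spark will eventually evict the cached data on its own is exactly what the question is after:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    object PersistDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("PersistDemo").getOrCreate()
        val lengths = spark.sparkContext
          .textFile("hdfs:///data/input.txt") // hypothetical path
          .map(_.length)
          .persist(StorageLevel.MEMORY_AND_DISK) // keep it around for reuse

        println(lengths.sum()) // first use: computes and caches
        println(lengths.max()) // second use: served from cache

        lengths.unpersist() // explicit release, rather than waiting for eviction
        spark.stop()
      }
    }
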
33 votes, 3 answers
Select top 2 rows in Hive
I'm trying to retrieve the top 2 rows from my employee list based on salary in Hive (version 0.11). Since it doesn't support a TOP function, are there any alternatives? Or do we have to define a UDF?
(asked by Holmes)

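The usual substitute for a TOP function in HiveQL is ORDER BY ... LIMIT; a sketch running that query through Spark SQL with Hive support (table and column names are hypothetical):

    import org.apache.spark.sql.SparkSession

    object TopSalaries {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder
          .appName("TopSalaries")
          .enableHiveSupport() // assumes a configured Hive metastore
          .getOrCreate()

        // ORDER BY plus LIMIT does what TOP would: the two highest salaries.
        spark.sql("SELECT * FROM employee ORDER BY salary DESC LIMIT 2").show()
        spark.stop()
      }
    }
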
33 votes, 5 answers
Spark-submit not working when application jar is in hdfs
I'm trying to run a Spark application using bin/spark-submit. When I reference my application jar inside my local filesystem, it works. However, when I copy my application jar to a directory in HDFS, I get the following exception: Warning: Skip…
(asked by dilm)

33 votes, 5 answers
How to write to CSV in Spark
I'm trying to find an effective way of saving the result of my Spark job as a CSV file. I'm using Spark with Hadoop and so far all my files are saved as part-00000. Any ideas how to make Spark save to a file with a specified file name?
(asked by Karusmeister)

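One common approach, sketched below with hypothetical paths: coalesce to a single partition so only one part file is written, then rename it through the FileSystem API, since Spark itself always names its output part-*:

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.SparkSession

    object CsvOut {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("CsvOut").getOrCreate()
        val df = spark.read.json("hdfs:///data/input.json") // hypothetical input

        // One partition => one part file. Fine for small results,
        // a bottleneck for large ones.
        df.coalesce(1).write.option("header", "true").csv("hdfs:///data/out_tmp")

        // Rename the single part file to the name we actually want.
        val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
        val part = fs.globStatus(new Path("/data/out_tmp/part-*"))(0).getPath
        fs.rename(part, new Path("/data/result.csv"))
        spark.stop()
      }
    }
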
33 votes, 1 answer
Add a column in a table in HIVE QL
I'm writing code in Hive QL to create a table consisting of 1300 rows and 6 columns: create table test1 as SELECT cd_screen_function, SUM(access_count) AS max_count, MIN(response_time_min) as response_time_min, AVG(response_time_avg)…
(asked by user2532312)

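If the goal is to add a column to an already-created table, HiveQL's ALTER TABLE ... ADD COLUMNS does that without rebuilding the table; a sketch via Spark SQL (table and column names hypothetical):

    import org.apache.spark.sql.SparkSession

    object AddColumn {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder
          .appName("AddColumn")
          .enableHiveSupport() // assumes a configured Hive metastore
          .getOrCreate()

        // Appends a new column; existing rows read back NULL for it.
        spark.sql("ALTER TABLE test1 ADD COLUMNS (new_col STRING)")
        spark.stop()
      }
    }
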
33 votes, 5 answers
How to specify username when putting files on HDFS from a remote machine?
I have a Hadoop cluster set up and working under a common default username "user1". I want to put files into Hadoop from a remote machine which is not part of the Hadoop cluster. I configured the Hadoop files on the remote machine in a way that…
(asked by reza)

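With simple (non-Kerberos) authentication, the client simply declares its username. One way, sketched here with a hypothetical NameNode address, is UserGroupInformation.doAs; setting the HADOOP_USER_NAME environment variable achieves the same effect:

    import java.security.PrivilegedExceptionAction
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.security.UserGroupInformation

    object PutAsUser {
      def main(args: Array[String]): Unit = {
        // Act as "user1" regardless of the local OS account.
        val ugi = UserGroupInformation.createRemoteUser("user1")
        ugi.doAs(new PrivilegedExceptionAction[Unit] {
          override def run(): Unit = {
            val conf = new Configuration()
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020")
            val fs = FileSystem.get(conf)
            fs.copyFromLocalFile(new Path("/tmp/local.txt"),
                                 new Path("/user/user1/local.txt"))
          }
        })
      }
    }
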
32 votes, 8 answers
Merging multiple files into one within Hadoop
I get multiple small files into my input directory which I want to merge into a single file, without using the local file system or writing mapreds. Is there a way I could do it using hadoop fs commands or Pig? Thanks!
(asked by uHadoop)

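One way to do this entirely on HDFS, without a MapReduce job or a round-trip through the local filesystem, is FileUtil.copyMerge (present through Hadoop 2.x, removed in 3.0); paths are hypothetical:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

    object MergeDemo {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()
        val fs = FileSystem.get(conf)
        // Concatenates every file under /in into a single /out/merged.txt.
        FileUtil.copyMerge(fs, new Path("/in"),
                           fs, new Path("/out/merged.txt"),
                           false, // keep the source files
                           conf,
                           null)  // no separator string between files
      }
    }
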
32 votes, 2 answers
Apache Hadoop Yarn - Underutilization of cores
No matter how much I tinker with the settings in yarn-site.xml, i.e. using all of the below…
(asked by Abbas Gadhia)

32 votes, 5 answers
Find port number where HDFS is listening
I want to access HDFS with fully qualified names such as: hadoop fs -ls hdfs://machine-name:8020/user. I could also simply access HDFS with hadoop fs -ls /user. However, I am writing test cases that should work on different distributions (HDP,…
(asked by ernesto)

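Rather than hard-coding a distribution-specific port (commonly 8020 on some distributions, 9000 on others), tests can read the NameNode URI from whatever cluster configuration is on the classpath; a minimal sketch:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.FileSystem

    object FindNameNode {
      def main(args: Array[String]): Unit = {
        // Resolves core-site.xml from the classpath.
        val conf = new Configuration()
        println(conf.get("fs.defaultFS"))    // e.g. hdfs://host:8020
        println(FileSystem.get(conf).getUri) // same URI, via the live filesystem
      }
    }
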
32 votes, 5 answers
How to make shark/spark clear the cache?
When I run my Shark queries, the memory gets hoarded in main memory. This is my top command result: Mem: 74237344k total, 70080492k used, 4156852k free, 399544k buffers; Swap: 4194288k total, 480k used, 4193808k free, 65965904k…
(asked by venkat)

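Shark itself is long retired, but in present-day Spark the equivalent housekeeping looks like the sketch below, assuming a running SparkSession:

    import org.apache.spark.sql.SparkSession

    object ClearCaches {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("ClearCaches").getOrCreate()

        spark.catalog.clearCache() // drop every cached table/DataFrame

        // Unpersist any RDDs still pinned in memory or on disk.
        spark.sparkContext.getPersistentRDDs.values.foreach(_.unpersist())
        spark.stop()
      }
    }
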
32 votes, 1 answer
What is hive, Is it a database?
I just started exploring Hive. It has all the structures similar to an RDBMS, like tables, joins, partitions… What I understand is that Hive still uses HDFS for storage and is an SQL abstraction over HDFS. From this I am not sure whether Hive itself is…
(asked by Brainchild)

32 votes, 10 answers
No data nodes are started
I am trying to set up Hadoop version 0.20.203.0 in a pseudo-distributed configuration using the following guide: http://www.javacodegeeks.com/2012/01/hadoop-modes-explained-standalone.html. After running the start-all.sh script I run "jps". I get…
(asked by Aaron S)

31 votes, 1 answer
Difference between `hadoop dfs` and `hadoop fs`
I saw the dfs command, then went to the documentation, but I am unable to understand it. From my point of view, fs and dfs appear to work the same way. Can anyone give the exact difference?
(asked by Arun)

31 votes, 1 answer
Deploying Spark and HDFS on Docker Swarm doesn't enable data locality
I am trying to set up a Spark + HDFS deployment on a small cluster using Docker Swarm as a stack deployment. I have it generally working, but I ran into an issue that is preventing Spark from taking advantage of data locality. In an attempt to…
(asked by kamprath)

31 votes, 7 answers
How to convert .txt file to Hadoop's sequence file format
To effectively utilise map-reduce jobs in Hadoop, I need data to be stored in Hadoop's sequence file format. However, currently the data is only in flat .txt format. Can anyone suggest a way I can convert a .txt file to a sequence file?
(asked by Abhishek Pathak)
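
A minimal sketch of one way to do the conversion with the SequenceFile writer API (paths hypothetical): each line of the text file becomes a record keyed by its line number:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, SequenceFile, Text}
    import scala.io.Source

    object TxtToSeq {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()
        val writer = SequenceFile.createWriter(conf,
          SequenceFile.Writer.file(new Path("/data/out.seq")),
          SequenceFile.Writer.keyClass(classOf[LongWritable]),
          SequenceFile.Writer.valueClass(classOf[Text]))

        val key = new LongWritable()
        val value = new Text()
        var lineNo = 0L
        for (line <- Source.fromFile("/data/in.txt").getLines()) {
          key.set(lineNo)
          value.set(line)
          writer.append(key, value) // one (line number, line) record per line
          lineNo += 1
        }
        writer.close()
      }
    }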