Questions tagged [elastic-map-reduce]

Amazon Elastic MapReduce is a web service that enables the processing of large amounts of data.

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

http://aws.amazon.com/elasticmapreduce/


452 questions
0
votes
1 answer

How can I use SQL-style LIKE and AND clauses in Elasticsearch?

I read this document to understand the SQL equivalents in Elasticsearch (https://taohiko.wordpress.com/2014/07/18/query-dsl-elasticsearch-vs-sql/). I developed an Elasticsearch application that builds indexes from my data. If I call the POST query below…
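A minimal sketch of how a SQL `WHERE … = … AND … LIKE …` condition maps onto the Elasticsearch Query DSL: a `bool`/`must` clause plays the role of AND, and `wildcard` approximates LIKE. The field names `status` and `name` are hypothetical placeholders, not from the question.

```python
# SQL: SELECT * FROM t WHERE status = 'active' AND name LIKE 'jo%'
# Equivalent Elasticsearch Query DSL body (as a Python dict, ready to
# POST to /index/_search). Field names are illustrative only.
query = {
    "query": {
        "bool": {
            "must": [
                {"term": {"status": "active"}},  # equality (=)
                {"wildcard": {"name": "jo*"}},   # LIKE 'jo%' -> wildcard 'jo*'
            ]
        }
    }
}
```

`bool.must` requires every clause to match, which is what SQL's AND does; `wildcard` uses `*`/`?` instead of SQL's `%`/`_`.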
0
votes
4 answers

Read from HDFS and write to HBase

The Mapper reads files from two places: 1) articles visited by users (sorted by country), and 2) country-wise statistics. The output of both Mappers is Text, Text. I am running the program on an Amazon cluster. My aim is to read data from two different…
Ankush Singh
  • 560
  • 7
  • 17
0
votes
1 answer

How to free up resources on AWS EMR cluster?

I have a common problem where I start an AWS EMR cluster, log in via SSH, and then run spark-shell to test some Spark code. Sometimes I lose my internet connection and PuTTY throws an error that the connection was lost. But it seems the Spark…
V. Samma
  • 2,558
  • 8
  • 30
  • 34
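One common workaround for this situation (a sketch, not taken from the question): when the SSH session drops, the spark-shell's YARN application can keep holding executors, and it can be killed from a new session with `yarn application -kill <appId>` or through the ResourceManager REST API (`PUT /ws/v1/cluster/apps/{appid}/state`). Below is a minimal Python helper that only builds that REST request; the host name and application id are hypothetical placeholders.

```python
def yarn_kill_request(rm_host, app_id):
    """Build the (url, json_body) pair for killing a YARN application
    via the ResourceManager REST API:
      PUT http://<rm>:8088/ws/v1/cluster/apps/<appid>/state
    with body {"state": "KILLED"}.
    """
    url = f"http://{rm_host}:8088/ws/v1/cluster/apps/{app_id}/state"
    return url, {"state": "KILLED"}

# Example (placeholder host and application id):
url, body = yarn_kill_request("master-node", "application_1234_0001")
```

The request would then be sent with any HTTP client from a machine that can reach the master node's port 8088.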
0
votes
1 answer

Which YARN configuration parameters are re-read on each application?

I've got one job that's much bigger than the other 50 or so that run in my daily workflow. I'd like the property yarn.app.mapreduce.am.resource.mb to be larger for just the big job. Am I in luck? How can I tell which properties require a complete…
Judge Mental
  • 5,209
  • 17
  • 22
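For context: `yarn.app.mapreduce.am.resource.mb` is a MapReduce job property read at submission time, so it can generally be overridden for a single job with a `-D` flag on the command line, whereas daemon-side settings (e.g. NodeManager memory limits) are only read when the daemon starts. A sketch of building such a per-job override; the jar name, class, and paths are hypothetical.

```python
# Sketch: a per-application override via -D at submit time.
# Only job-level properties (read by the client/ApplicationMaster when
# the job is submitted) respond to this; cluster daemon settings do not.
def hadoop_cmd(jar, main_class, overrides, args):
    """Build a `hadoop jar` command line with per-job -D overrides."""
    d_flags = [f"-D{k}={v}" for k, v in overrides.items()]
    return ["hadoop", "jar", jar, main_class, *d_flags, *args]

cmd = hadoop_cmd(
    "bigjob.jar", "com.example.BigJob",                  # hypothetical
    {"yarn.app.mapreduce.am.resource.mb": "4096"},       # this job only
    ["s3://in/", "s3://out/"],                           # hypothetical
)
```

The other ~50 jobs in the workflow would simply be submitted without the override and keep the cluster default.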
0
votes
0 answers

Hadoop MapReduce MultipleOutputs - one in Mapper, one in Reducer

I want to use multiple outputs in a Hadoop job in Elastic MapReduce. So, I set up MultipleOutputs in the main() method like so: MultipleOutputs.addNamedOutput(hadoopJob, "One", TextOutputFormat.class, NullWritable.class,…
John Chrysostom
  • 3,973
  • 1
  • 34
  • 50
0
votes
1 answer

Hive query throwing exception - Error while compiling statement: FAILED: ArrayIndexOutOfBoundsException null

I just upgraded the Hive version to 2.1.0 for both hive-exec and hive-jdbc. But because of this, some queries that previously worked fine started failing. Exception - Exception in thread "main" org.apache.hive.service.cli.HiveSQLException: Error while…
devsda
  • 4,112
  • 9
  • 50
  • 87
0
votes
0 answers

Dealing with LARGE data in MongoDB

This is going to be a "general-ish" question, but I have a reason for that: I am not sure what kind of approach to take to make things faster. I have a MongoDB server running on a big AWS instance (r3.4xlarge, 16 vCPU…
SRC
  • 2,123
  • 3
  • 31
  • 44
0
votes
1 answer

How can I write MapReduce code in Python to implement a matrix transpose?

Assume the input file is a .txt, and I am trying to run it on a cluster (like EMR on AWS) to test.
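A sketch of the transpose in Hadoop Streaming style, written as plain Python functions so it can run locally; on EMR, the mapper and reducer would each read stdin and write stdout, with the framework doing the sort between them. The mapper emits each cell keyed by its column index; the reducer reassembles each column as a row of the output.

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Emit (column_index, (row_index, value)) for every cell of a
    tab-separated matrix."""
    for row_idx, line in enumerate(lines):
        for col_idx, value in enumerate(line.split("\t")):
            yield col_idx, (row_idx, value)

def reducer(pairs):
    """Group pairs by column index; each group, ordered by row index,
    becomes one row of the transposed matrix."""
    for col_idx, group in groupby(sorted(pairs), key=itemgetter(0)):
        cells = sorted(rv for _, rv in group)  # order by original row
        yield "\t".join(value for _, value in cells)

rows = ["1\t2\t3", "4\t5\t6"]
print(list(reducer(mapper(rows))))  # ['1\t4', '2\t5', '3\t6']
```

In a real streaming job the shuffle replaces the explicit `sorted`/`groupby`, and very wide matrices may need a combiner or a secondary sort, which this sketch ignores.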
0
votes
1 answer

API to get count of task instance group instances in AWS EMR

I want to get the count of instances in the task instance groups of an AWS EMR cluster. For this, I used CloudWatch to check the heartbeat of each task instance group's instances. But I think, in the end, EMR is a framework that uses Hadoop, and Hadoop's master must have…
devsda
  • 4,112
  • 9
  • 50
  • 87
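Rather than polling per-instance CloudWatch heartbeats, the EMR API itself reports running counts per instance group (boto3's `list_instance_groups`). A sketch; the counting is factored into a pure function over the response, and the cluster id shown is a placeholder.

```python
def count_task_instances(instance_groups):
    """Sum RunningInstanceCount over TASK instance groups, given the
    'InstanceGroups' list from an emr.list_instance_groups() response."""
    return sum(
        g.get("RunningInstanceCount", 0)
        for g in instance_groups
        if g.get("InstanceGroupType") == "TASK"
    )

# With a live client it would be used roughly like this:
#   import boto3
#   emr = boto3.client("emr")
#   resp = emr.list_instance_groups(ClusterId="j-XXXXXXXXXXXXX")  # placeholder id
#   n = count_task_instances(resp["InstanceGroups"])
```

Each group in the response also carries `RequestedInstanceCount`, which can be compared against the running count to detect groups that have not fully scaled.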
0
votes
1 answer

Hadoop Access Control Exception: Permissions

Job setup failed : org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=WRITE,…
0
votes
1 answer

Is it possible to access the underlying org.apache.hadoop.mapreduce.Job from a Scalding job?

In my Scalding job, I have code like this: import org.apache.hadoop.mapreduce.lib.input.FileInputFormat class MyJob(args: Args) extends Job(args) { FileInputFormat.setInputPathFilter(???, classOf[MyFilter]) // ... rest of job ... } class…
0
votes
1 answer

How can I make my Scalding job operate recursively on its input bucket?

I have a Scalding job which runs on EMR. It runs over an S3 bucket containing several files. The source looks like this: MultipleTextLineFiles("s3://path/to/input/").read /* ... some data processing ... */ .write(Tsv("s3://paths/to/output/")) I…
0
votes
0 answers

Can a Scalding source select a subset of the files in an S3 bucket to process?

I have a Scalding job which operates on all the files in a particular timestamped S3 bucket. It looks like this: JsonLine("s3://path/to/timestampedbuckets/2016-02-03/", ('key1, 'key2)).read I want to alter the job to operate on the files in several…
0
votes
2 answers

Project file name field using '-tagFile' option, LOAD USING PigStorage '-tagFile', Pig 0.14

Amazon EMR 4.5, Hadoop 2.7.2, Pig 0.14. I would like to project the file name field, together with selected fields, into a new relation after loading with the -tagFile option. The results do not seem to make sense. Examples: tagfile-test.txt (tab-delimited) AAA …
chillvibes
  • 39
  • 2
0
votes
0 answers

Getting error when invoking Elasticsearch from Spark

I have a use case where I need to read messages from Kafka and, for each message, extract data and invoke the Elasticsearch index. The response will then be used for further processing. I am getting the below error when invoking…