Questions tagged [input-split]

35 questions
1 vote · 1 answer

Efficiency of NLineInputFormat's InputSplit calculations

I looked into the getSplitsForFile() function of NLineInputFormat and found that an InputStream is created for the input file, which is then iterated, creating a split every N lines. Is this efficient? Particularly when this read operation is happening on 1 node…
S Kr • 1,831 • 2 • 25 • 50
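
As background for the question above: NLineInputFormat is normally wired up as below, and getSplitsForFile() is what runs on the client at job submission time. A minimal sketch; the job name and paths are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class NLineJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "nline-example"); // hypothetical job name

        // Each mapper receives exactly N input lines; getSplitsForFile()
        // opens the file and scans it line by line to find the byte
        // offsets of these split boundaries.
        NLineInputFormat.setNumLinesPerSplit(job, 1000);
        job.setInputFormatClass(NLineInputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        // ... mapper/reducer/output setup elided ...
    }
}
```

Note that this scan happens once, on the submitting client, which is exactly the single-node read the question is asking about.
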
1 vote · 0 answers

Splits in MapReduce jobs

I have an input file for which I need to customize the RecordReader. The problem here is that the data may get distributed across different input splits, and a different mapper may get data that should be consumed by the first mapper. For example, A…
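
For reference, the usual way to keep records intact across split boundaries (the same convention LineRecordReader follows) is: every reader except the first skips the partial record at the start of its split, and every reader may read past its split's end to finish the last record it started. A condensed sketch under those assumptions; the two record-parsing helpers are hypothetical placeholders that depend on the record format:

```java
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class BoundaryAwareRecordReader extends RecordReader<LongWritable, Text> {
    private FSDataInputStream in;
    private long start, end, pos;
    private final LongWritable key = new LongWritable();
    private final Text value = new Text();

    @Override
    public void initialize(InputSplit genericSplit, TaskAttemptContext ctx)
            throws IOException {
        FileSplit split = (FileSplit) genericSplit;
        start = split.getStart();
        end = start + split.getLength();
        FileSystem fs = split.getPath().getFileSystem(ctx.getConfiguration());
        in = fs.open(split.getPath());
        in.seek(start);
        pos = start;
        if (start != 0) {
            // Not the first split: discard the (possibly partial) first
            // record; the previous split's reader will have consumed it.
            pos += skipToNextRecordBoundary(); // hypothetical helper
        }
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        // Keep reading while the record *starts* inside this split; the
        // record itself may extend past 'end' into the next split.
        if (pos >= end) return false;
        key.set(pos);
        pos += readOneRecord(value); // hypothetical helper: fills 'value'
        return true;
    }

    // Both helpers depend on how a record boundary is recognized in the
    // byte stream and are left as assumptions here.
    private long skipToNextRecordBoundary() throws IOException { return 0; }
    private long readOneRecord(Text v) throws IOException { return 0; }

    @Override public LongWritable getCurrentKey() { return key; }
    @Override public Text getCurrentValue() { return value; }
    @Override public float getProgress() {
        return end == start ? 1f : (pos - start) / (float) (end - start);
    }
    @Override public void close() throws IOException { in.close(); }
}
```
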
1 vote · 1 answer

Hadoop FileSplit reading

Assume a client application that uses a FileSplit object in order to read the actual bytes from the corresponding file. To do so, an InputStream object has to be created from the FileSplit, via code like: FileSplit split = ... // The FileSplit…
PNS • 19,295 • 32 • 96 • 143
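
A minimal sketch of the usual pattern for this: open the file through its FileSystem, seek to the split's start offset, and stop after getLength() bytes. A Configuration is assumed to be available:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public final class FileSplitReader {
    /** Opens a stream positioned at the first byte of the given split. */
    public static FSDataInputStream openSplit(FileSplit split, Configuration conf)
            throws IOException {
        Path path = split.getPath();
        FileSystem fs = path.getFileSystem(conf);
        FSDataInputStream in = fs.open(path);
        in.seek(split.getStart());  // jump to the split's start offset
        return in;                  // caller reads at most split.getLength() bytes
    }
}
```
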
1 vote · 0 answers

Hadoop map.input.start not a line boundary?

It seems that the map.input.start property isn't giving me the position of the start of a line (except, of course, the first map.input.start which is 0). Sometimes, map.input.start is somewhere in the middle of the first line of the mapper's input,…
Vyassa Baratham • 1,457 • 12 • 18
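
This is expected: map.input.start is a byte offset, and splits are computed purely by size, with no awareness of line boundaries. LineRecordReader realigns to the next full line at read time; condensed from the Hadoop source:

```java
// Inside LineRecordReader.initialize(), roughly:
if (start != 0) {
    // Not at the file start, so the first (partial) line belongs to the
    // previous split's reader; skip ahead to the next newline.
    start += in.readLine(new Text(), 0, maxBytesToConsume(start));
}
this.pos = start; // first whole line owned by this split
```
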
0 votes · 0 answers

Apache Crunch MapReduce job: setting input split size not working

I have the following scenario: multiple MapReduce jobs using Apache Crunch, scheduled using Oozie. Let's consider only one job for simplicity. What I want to achieve is to reduce the number of mappers of that job. The number of mappers…
Stefan Ss • 45 • 5
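
For plain MapReduce, the mapper count of a FileInputFormat-based job is steered through the split size properties below; whether Crunch passes these through unchanged depends on the pipeline setup, so treat this as the generic mechanism rather than a confirmed Crunch fix:

```java
import org.apache.hadoop.conf.Configuration;

public class SplitSizeConfig {
    public static Configuration withLargerSplits() {
        Configuration conf = new Configuration();
        // Values are in bytes. Raising the minimum split size above the
        // block size packs several HDFS blocks into one split, which
        // reduces the number of map tasks.
        conf.setLong("mapreduce.input.fileinputformat.split.minsize", 512L * 1024 * 1024);
        conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 1024L * 1024 * 1024);
        return conf;
    }
}
```

The effective split size is max(minSize, min(maxSize, blockSize)), so it is the minimum, not the maximum, that has to grow to get fewer mappers.
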
0 votes · 1 answer

AttributeError: 'builtin_function_or_method' object has no attribute 'split' (3)

My code takes two inputs in one string inside a for loop, and I want to split that input to fill two variables. Here's my code: P = int(input()) #Principal amt T = int(input()) #Total tenure N1 = int(input()) #Number of slabs of interest rates by…
0 votes · 1 answer

MapReduce basics

I have a text file of 300 MB with a block size of 128 MB, so a total of 3 blocks of 128 + 128 + 44 MB would be created. Correct me if I'm wrong: for MapReduce, the default input split size is the same as the block size, i.e. 128 MB, and it can be configured. Now the record reader will read…
Boron • 1 • 1
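
For reference, FileInputFormat derives the split size from the block size plus the min/max split settings; this is the actual formula in FileInputFormat.computeSplitSize():

```java
// From org.apache.hadoop.mapreduce.lib.input.FileInputFormat:
protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
    // With the defaults (minSize = 1, maxSize = Long.MAX_VALUE) this
    // collapses to blockSize, which is why split size == block size
    // unless explicitly configured otherwise.
    return Math.max(minSize, Math.min(maxSize, blockSize));
}
```
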
0 votes · 1 answer

InputSplits in MapReduce

I have just started learning MapReduce and have some queries I want answers to. Here goes: 1) Case 1: FileInputFormat as the input format, and a directory with multiple files to be processed as the input path. If I have n files, all of the files lesser…
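
One detail relevant to case 1: FileInputFormat never combines data from different files into one split, so n small files produce at least n splits and n map tasks. When that is the concern, CombineTextInputFormat packs many small files into each split; a minimal sketch (job name hypothetical):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class SmallFilesJob {
    public static Job configure(Configuration conf) throws IOException {
        Job job = Job.getInstance(conf, "small-files"); // hypothetical job name
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Cap each combined split at 128 MB so one mapper processes many
        // small files instead of one mapper per file.
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
        return job;
    }
}
```
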
0 votes · 0 answers

How and where is input split size mentioned or passed to an MR program?

I understand what input split size and block size mean. But what I am trying to understand is where and how the input split size is specified for an MR program… is it passed as a parameter while starting an MR job using (Hadoop jar MRPROGRAM…
samshers • 1 • 6 • 37 • 84
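
The usual route is Hadoop's generic options: a driver that runs through ToolRunner picks up -D properties from the command line, e.g. hadoop jar MyJob.jar MyDriver -D mapreduce.input.fileinputformat.split.maxsize=67108864 <in> <out> (the jar and class names here are hypothetical). A minimal driver sketch:

```java
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() already contains any -D properties from the command
        // line, including the split size settings.
        Job job = Job.getInstance(getConf(), "split-size-demo"); // hypothetical name
        // ... input/output/mapper/reducer setup elided ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new MyDriver(), args));
    }
}
```

Alternatively, the same properties can be set programmatically on the Configuration before creating the Job.
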
0 votes · 1 answer

Hadoop: how would input splits form if a file has only one record and the file size is more than the block size?

An example to explain the question: I have a file of size 500 MB (input.csv), and the file contains only one line (record). So how will the file be stored in HDFS blocks, and how will the input splits be computed?
Ankush Rathi • 622 • 1 • 6 • 26
0 votes · 1 answer

Input Splits in Hadoop

If the input file size is 200 MB, there will be 4 blocks/input splits, and each data node will have a mapper running on it. If all 4 input splits are on the same data node, will only one map task be executed? Or how does the number of map…
Harshi • 189 • 1 • 4 • 20
0 votes · 1 answer

Mapper not executing on the hostname returned from getLocations() of InputSplit in Hadoop

I have extended the InputSplit class of Hadoop to calculate my custom input split; however, while I am returning a particular host IP (i.e. a datanode IP) as a string from the overridden getLocations(), the map task for it is not being executed on that host IP…
Sushil Ks • 403 • 2 • 10 • 18
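
Two things commonly cause this: getLocations() is only a scheduling hint (YARN will place the task elsewhere if no container is free on the preferred node), and the returned strings are matched against node hostnames, so a raw IP string often fails the match. A sketch of such a split; the hostname is an assumed example:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;

public class CustomSplit extends InputSplit implements Writable {
    private long length;
    private String preferredHost; // e.g. "datanode-03.cluster.local" (hypothetical)

    @Override
    public long getLength() { return length; }

    @Override
    public String[] getLocations() throws IOException {
        // Return hostnames as known to the ResourceManager, not IPs;
        // this is a locality *hint*, not a placement guarantee.
        return new String[] { preferredHost };
    }

    // Custom InputSplits must be serializable so the framework can ship
    // them to the task; Writable is the standard mechanism.
    @Override public void write(DataOutput out) throws IOException {
        out.writeLong(length);
        out.writeUTF(preferredHost);
    }
    @Override public void readFields(DataInput in) throws IOException {
        length = in.readLong();
        preferredHost = in.readUTF();
    }
}
```
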
0 votes · 1 answer

Location of HadoopPartition

I have a dataset in a CSV file that occupies two blocks in HDFS and is replicated on two nodes, A and B. Each node has a copy of the dataset. When Spark starts processing the data, I have seen two ways in which Spark loads the dataset as input. It either…
0 votes · 1 answer

Jackson JsonParser: restart parsing in broken JSON

I am using Jackson to process JSON that comes in chunks in Hadoop. That means they are big files that are cut up into blocks (in my problem it's 128 MB, but it doesn't really matter). For efficiency reasons, I need it to be streaming (not possible to…
xmar • 1,729 • 20 • 48
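
One workable approach, analogous to how Hadoop's LineRecordReader realigns on newlines: if the documents are newline-delimited JSON, skip the raw stream to the first '\n' past the split start, then hand the rest to a streaming JsonParser. A sketch assuming NDJSON input:

```java
import java.io.IOException;
import java.io.InputStream;
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import com.fasterxml.jackson.databind.ObjectMapper;

public class MidStreamJsonReader {
    /** Skips the partial document at the stream start, then parses whole ones. */
    public static void readFrom(InputStream raw, long splitStart) throws IOException {
        if (splitStart != 0) {
            // Resync: discard bytes up to and including the next '\n',
            // which for NDJSON is the start of the next whole document.
            int b;
            while ((b = raw.read()) != -1 && b != '\n') { /* skip */ }
        }
        ObjectMapper mapper = new ObjectMapper();
        JsonParser parser = new JsonFactory().createParser(raw);
        // Stream one top-level document at a time.
        while (parser.nextToken() == JsonToken.START_OBJECT) {
            // readValue consumes exactly one document from the stream.
            Object doc = mapper.readValue(parser, Object.class);
            // ... process doc ...
        }
        parser.close();
    }
}
```

If the JSON is not newline-delimited, resynchronization needs a format-specific marker to scan for; a bare '{' is ambiguous inside nested objects.
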
0 votes · 1 answer

Does the Hadoop job submitter take record boundaries into account while calculating splits?

This question is NOT a duplicate of: How does Hadoop process records split across block boundaries? I have one question regarding the input split calculation. As per the Hadoop guide: 1) the InputSplits respect record boundaries; 2) at the same time it…
user3105943 • 13 • 1 • 5