Questions tagged [input-split]
35 questions
1 vote, 1 answer
Efficiency of NLineInputFormat's InputSplit calculations
I looked into the getSplitsForFile() function of NLineInputFormat. I found that an InputStream is created for the input file, which is then iterated over, with a split created every n lines.
Is this efficient? Particularly when this read operation is happening on one node…
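For context, a simplified sketch of the loop in question, modeled on the Hadoop 2.x source (the class name here is illustrative; details vary by version). The whole file is read line by line on the submitting client just to record byte offsets, which is exactly the single-node cost being asked about:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

public class NLineSplitSketch {
  // Scans the whole file once on the client, emitting one split per n lines.
  static List<FileSplit> splitsForFile(Path file, Configuration conf, int n)
      throws IOException {
    List<FileSplit> splits = new ArrayList<>();
    FileSystem fs = file.getFileSystem(conf);
    try (FSDataInputStream in = fs.open(file)) {
      LineReader reader = new LineReader(in, conf);
      Text line = new Text();
      long begin = 0, length = 0;
      int numLines = 0, bytesRead;
      while ((bytesRead = reader.readLine(line)) > 0) {
        numLines++;
        length += bytesRead;
        if (numLines == n) {
          splits.add(new FileSplit(file, begin, length, new String[0]));
          begin += length;
          length = 0;
          numLines = 0;
        }
      }
      if (numLines != 0) { // leftover lines form the final, shorter split
        splits.add(new FileSplit(file, begin, length, new String[0]));
      }
    }
    return splits;
  }
}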

S Kr
- 1,831
- 2
- 25
- 50
1 vote, 0 answers
Splits in MapReduce jobs
I have an input file for which I need to customize the RecordReader. But the problem here is that the data may get distributed across different input splits, and a different mapper may get data that should be consumed by the first mapper.
For example:
A…
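A common way out, for reference: the pattern LineRecordReader itself uses, sketched here for newline-delimited records (the class name is illustrative; adapt the boundary test to your record format). Every reader except the first skips its first, possibly partial, record, and every reader finishes the record it has started even when that runs past its split's end:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

public class BoundaryAwareReader extends RecordReader<LongWritable, Text> {
  private LineReader in;
  private long start, end, pos;
  private final LongWritable key = new LongWritable();
  private final Text value = new Text();

  @Override
  public void initialize(InputSplit s, TaskAttemptContext ctx)
      throws IOException {
    FileSplit split = (FileSplit) s;
    Configuration conf = ctx.getConfiguration();
    start = split.getStart();
    end = start + split.getLength();
    Path file = split.getPath();
    FSDataInputStream stream = file.getFileSystem(conf).open(file);
    stream.seek(start);
    in = new LineReader(stream, conf);
    pos = start;
    if (start != 0) {
      // Not the first split: the previous reader owns the record we may be
      // in the middle of, so skip ahead to the next record boundary.
      pos += in.readLine(new Text());
    }
  }

  @Override
  public boolean nextKeyValue() throws IOException {
    // Only *start* records before the split end; the last record may run
    // past `end`, into bytes that physically belong to the next split.
    if (pos >= end) return false;
    key.set(pos);
    int bytes = in.readLine(value);
    if (bytes == 0) return false;
    pos += bytes;
    return true;
  }

  @Override public LongWritable getCurrentKey() { return key; }
  @Override public Text getCurrentValue() { return value; }
  @Override public float getProgress() {
    return end == start ? 1f
        : Math.min(1f, (pos - start) / (float) (end - start));
  }
  @Override public void close() throws IOException { in.close(); }
}

With this convention the mappers never share a record: the bytes of a record straddling a boundary are always consumed by the reader of the split in which that record starts.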

user3065762
- 11
- 1
1 vote, 1 answer
Hadoop FileSplit reading
Assume a client application that uses a FileSplit object in order to read the actual bytes from the corresponding file.
To do so, an InputStream object has to be created from the FileSplit, via code like:
FileSplit split = ... // The FileSplit…
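A minimal sketch of that code, assuming the mapreduce-API FileSplit (the wrapper class and method names are illustrative): open the file through its FileSystem and seek to the split's offset. The stream carries no end marker, so the caller must stop after getLength() bytes, or at the first record boundary past that point:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SplitReader {
  // Returns a stream positioned at the split's first byte.
  static FSDataInputStream openSplit(FileSplit split, Configuration conf)
      throws IOException {
    Path path = split.getPath();
    FileSystem fs = path.getFileSystem(conf);
    FSDataInputStream in = fs.open(path);
    in.seek(split.getStart()); // jump to the split's offset within the file
    return in;                 // read at most split.getLength() bytes from here
  }
}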

PNS
- 19,295
- 32
- 96
- 143
1 vote, 0 answers
Hadoop map.input.start not a line boundary?
It seems that the map.input.start property isn't giving me the position of the start of a line (except, of course, the first map.input.start, which is 0). Sometimes map.input.start is somewhere in the middle of the first line of the mapper's input,…
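That is expected behavior: splits are cut by byte count, so map.input.start is a raw offset, and only the first split begins on a line boundary; LineRecordReader compensates at read time. A small old-API sketch showing where the property is visible (the class name is illustrative):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SplitInfoMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  private long splitStart;

  @Override
  public void configure(JobConf conf) {
    // A raw byte offset into the input file, not a line boundary.
    splitStart = Long.parseLong(conf.get("map.input.start", "0"));
    // When splitStart != 0, LineRecordReader silently discards the partial
    // first line (the previous mapper already consumed it), so map() still
    // only ever sees whole lines.
  }

  @Override
  public void map(LongWritable key, Text line,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    // `key` is the line's own byte offset; for the first line of a
    // non-first split it is generally larger than splitStart.
  }
}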

Vyassa Baratham
- 1,457
- 12
- 18
0 votes, 0 answers
Apache Crunch MapReduce job: setting input split size not working
I have the following scenario:
Multiple MapReduce jobs using Apache Crunch, scheduled using Oozie. Let's consider only one job for simplicity. What I want to achieve is reducing the number of mappers of that job. The number of mappers…
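One knob worth trying, hedged: whether it takes effect depends on the InputFormat behind the Crunch source, and CombineFileInputFormat-based sources have their own sizing properties. Raising the minimum split size on the pipeline's Configuration makes each mapper responsible for more bytes, hence fewer mappers (the class name and the 512 MB value are illustrative):

import org.apache.hadoop.conf.Configuration;

public class FewerMappers {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Fewer mappers = larger splits: with minsize above the block size,
    // FileInputFormat packs more than one block into each split.
    conf.setLong("mapreduce.input.fileinputformat.split.minsize",
        512L * 1024 * 1024); // 512 MB, an illustrative value
    // Hand this conf to Crunch, e.g. new MRPipeline(Driver.class, conf);
    // under Oozie the same property can go in the action's <configuration>.
  }
}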

Stefan Ss
- 45
- 5
0 votes, 1 answer
AttributeError: 'builtin_function_or_method' object has no attribute 'split' (3)
My code takes two inputs in one string inside the for loop, and I want to split that input to fill two variables. Here's my code:
P = int(input()) #Principal amt
T = int(input()) #Total tenure
N1 = int(input()) #Number of slabs of interest rates by…

Ayush Verma
- 1
- 1
0 votes, 1 answer
MapReduce basics
I have a text file of 300 MB with a block size of 128 MB.
So in total 3 blocks of 128 + 128 + 44 MB would be created.
Correct me if I'm wrong: for MapReduce, the default input split size is the same as the block size, i.e. 128 MB, and it can be configured.
Now the record reader will read…
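That reading matches the defaults. FileInputFormat (Hadoop 2.x) sizes splits as max(minSize, min(maxSize, blockSize)), so with min and max left alone the splits track the blocks. A toy calculation (the class name is illustrative):

public class SplitMath {
  // FileInputFormat's rule: max(minSize, min(maxSize, blockSize)).
  static long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }

  public static void main(String[] args) {
    long block = 128L << 20;                                  // 128 MB blocks
    long split = computeSplitSize(block, 1L, Long.MAX_VALUE); // default min/max
    long file = 300L << 20;                                   // the 300 MB file
    for (long off = 0; off < file; off += split) {
      // prints 128 MB, 128 MB, 44 MB -- one split per block
      System.out.printf("offset=%dMB length=%dMB%n",
          off >> 20, Math.min(split, file - off) >> 20);
    }
  }
}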

Boron
- 1
- 1
0 votes, 1 answer
InputSplits in MapReduce
I have just started learning MapReduce and have some queries I want answers to. Here goes:
1) Case 1: FileInputFormat as the input format. A directory containing multiple files to be processed is the input path. If I have n files, all of the files lesser…
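One empirical way into questions like these is to list the splits the framework would actually create. Splits never span files, so n input files yield at least n splits, and therefore at least n map tasks, however small each file is. A sketch assuming TextInputFormat (the class name is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class ListSplits {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration());
    FileInputFormat.addInputPath(job, new Path(args[0])); // input directory
    // Each file is split independently; a file smaller than the split size
    // still produces its own split, hence its own mapper.
    for (InputSplit split : new TextInputFormat().getSplits(job)) {
      System.out.println(split);
    }
  }
}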

user1808266
- 61
- 4
0 votes, 0 answers
How and where is input split size mentioned or passed to an MR program?
I understand what input split size and block size mean. But what I am trying to understand is where and how the input split size is specified for an MR program... is it passed as a parameter when starting an MR job using (Hadoop jar MRPROGRAM…
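Split size is not a dedicated argument; it is ordinary Hadoop configuration, so it can come from the command line (when the driver uses ToolRunner/GenericOptionsParser), from the *-site.xml files, or from code. A sketch of both routes (the class name is illustrative):

import org.apache.hadoop.conf.Configuration;

public class SplitSizeKnobs {
  public static void main(String[] args) {
    // From the command line, for ToolRunner-based drivers:
    //   hadoop jar MyJob.jar MyDriver \
    //       -D mapreduce.input.fileinputformat.split.maxsize=67108864 in out
    // Or programmatically, before submitting the job:
    Configuration conf = new Configuration();
    conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 64L << 20);
    conf.setLong("mapreduce.input.fileinputformat.split.minsize", 1L);
    // FileInputFormat then uses max(minsize, min(maxsize, blockSize)).
  }
}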

samshers
- 1
- 6
- 37
- 84
0 votes, 1 answer
Hadoop - how would input splits form if a file has only one record and the file size is more than the block size?
An example to explain the question:
I have a file of size 500 MB (input.csv).
The file contains only one line (record).
So how will the file be stored in HDFS blocks, and how will the input splits be computed?
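Both storage and splitting are size-based, so a 500 MB file occupies four blocks at a 128 MB block size (the last only partly filled) and, by default, four splits, even though it holds a single record. Only the first mapper's LineRecordReader actually finds a line, reading across block boundaries until the record ends; the other three skip what they see as a partial first line and find nothing left. One way to inspect the physical layout, using the standard FileSystem API (the class name is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLayout {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path(args[0])); // e.g. input.csv
    // Blocks are cut purely by size; record boundaries play no part here.
    for (BlockLocation b :
         fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
    }
  }
}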

Ankush Rathi
- 622
- 1
- 6
- 26
0 votes, 1 answer
Input Splits in Hadoop
If the input file size is 200 MB, there will be 4 blocks / input splits, but each data node will have a mapper running on it. If all 4 input splits are on the same data node, will only one map task be executed?
Or how does the number of map…

Harshi
- 189
- 1
- 4
- 20
0 votes, 1 answer
Mapper not executing on the hostname returned from getLocations() of InputSplit in Hadoop
I have extended Hadoop's InputSplit class to calculate my custom input splits. However, while I am returning a particular host IP (i.e., a datanode IP) as a string from the overridden getLocations(), the map task for it is not being executed on that host…
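Two things commonly bite here, both hedged since exact behavior depends on the scheduler and version: getLocations() is only a placement hint that YARN may ignore when the preferred node has no free capacity, and the returned strings are matched against the hostnames the NodeManagers register with, so a raw IP string often matches nothing. A sketch of a custom split returning hostnames (the class name is illustrative):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;

public class HostHintSplit extends InputSplit implements Writable {
  private long length;
  private String[] hosts; // e.g. {"datanode03.cluster.local"}, not "10.0.0.3"

  public HostHintSplit() {} // required for deserialization
  public HostHintSplit(long length, String[] hosts) {
    this.length = length;
    this.hosts = hosts;
  }

  @Override public long getLength() { return length; }
  @Override public String[] getLocations() { return hosts; } // a hint only

  // Locations are deliberately NOT serialized: the framework reads them on
  // the client side for scheduling, before the split is shipped to a task.
  @Override public void write(DataOutput out) throws IOException {
    out.writeLong(length);
  }
  @Override public void readFields(DataInput in) throws IOException {
    length = in.readLong();
  }
}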

Sushil Ks
- 403
- 2
- 10
- 18
0 votes, 1 answer
Location of HadoopPartition
I have a dataset in a CSV file that occupies two blocks in HDFS and is replicated on two nodes, A and B. Each node has a copy of the dataset.
When Spark starts processing the data, I have seen two ways in which Spark loads the dataset as input. It either…

Freddie Feng
- 11
- 1
0 votes, 1 answer
Jackson JsonParser: restart parsing in broken JSON
I am using Jackson to process JSON that comes in chunks in Hadoop. That means they are big files that are cut up into blocks (in my problem 128 MB, but it doesn't really matter).
For efficiency reasons, I need it to be streaming (not possible to…
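For the streaming half, Jackson's token-level API (2.x) runs over whatever InputStream the record reader supplies. The genuinely hard part, restarting a parse mid-file on a broken chunk, is not something Jackson supports out of the box; the usual workaround is to scan bytes forward for a plausible record start and only then attach the parser. A minimal token-streaming sketch (class and method names are illustrative):

import java.io.IOException;
import java.io.InputStream;

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

public class StreamingJson {
  // Streams top-level JSON objects without loading the input into memory.
  static void scan(InputStream in) throws IOException {
    JsonParser parser = new JsonFactory().createParser(in);
    JsonToken token;
    int depth = 0;
    while ((token = parser.nextToken()) != null) {
      if (token == JsonToken.START_OBJECT) {
        depth++;
      } else if (token == JsonToken.END_OBJECT && --depth == 0) {
        // One complete top-level record parsed; hand it off here.
      }
    }
  }
}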

xmar
- 1,729
- 20
- 48
0 votes, 1 answer
Does the Hadoop job submitter take record boundaries into account while calculating splits?
This question is NOT a duplicate of:
How does Hadoop process records split across block boundaries?
I have one question regarding the input split calculation. As per the Hadoop guide:
1) the InputSplits respect record boundaries;
2) at the same time it…
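Both statements hold because they describe different layers: the submitter's getSplits() is pure byte arithmetic and never consults records, while the RecordReader repairs the boundaries afterwards by skipping the partial first record and over-reading the last. Simplified from the Hadoop 2.x FileInputFormat source (the class name is illustrative):

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SizeBasedSplits {
  // Offsets come from arithmetic on sizes alone; no record boundary is
  // ever inspected at submission time.
  static List<FileSplit> split(Path path, long fileLen, long splitSize) {
    List<FileSplit> splits = new ArrayList<>();
    final double SPLIT_SLOP = 1.1; // allow the last split to run 10% over
    long remaining = fileLen;
    while ((double) remaining / splitSize > SPLIT_SLOP) {
      splits.add(new FileSplit(path, fileLen - remaining, splitSize, null));
      remaining -= splitSize;
    }
    if (remaining != 0) {
      splits.add(new FileSplit(path, fileLen - remaining, remaining, null));
    }
    return splits;
  }
}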

user3105943
- 13
- 1
- 5