
I have an 8.8G file on the Hadoop cluster from which I'm trying to extract certain lines for testing purposes.

Seeing that Apache Hadoop 2.6.0 has no split command, how can I do this without having to download the file?

If the file were on a Linux server, I would have used:

$ csplit filename %2015-07-17%

That command works as desired. Is something similar possible on Hadoop?

Leb

1 Answer


You could use a combination of Unix and HDFS commands:

hadoop fs -cat filename.dat | head -250 > /redirect/filename

Or, if the last kilobyte of the file suffices, you could use this:

hadoop fs -tail filename.dat > /redirect/filename
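
Since the goal in the question was to pull out the lines containing a particular date (what the csplit pattern matched on), adding grep to the pipeline gets closer. A sketch, assuming the date string 2015-07-17 appears verbatim on each wanted line, and reusing the hypothetical paths above:

# stream the file from HDFS and keep only lines containing the date
hadoop fs -cat filename.dat | grep '2015-07-17' > /redirect/filename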
Vignesh I
  • Not even realistic: there are about 54M lines in that file, and the data I'm extracting is nowhere near `head -250`. – Leb Sep 24 '15 at 17:08
  • HDFS doesn't give you many command-line features to work with. If you want to explore the file this way, that's about the only option short of bringing the file to local. Otherwise a Pig script (with LIMIT) or an MR job would do the job for you. – Vignesh I Sep 24 '15 at 17:12
  • I'll just transfer it to local since it's only one file. If I had multiple files, then MR would be worth it. Thank you. – Leb Sep 24 '15 at 17:16
  • But the 8.8G file still needs to be transferred. You could try writing a simple Pig script: A = LOAD 'file' USING PigStorage() AS (line:chararray); B = LIMIT A your_number; STORE B INTO 'filename'; – Vignesh I Sep 24 '15 at 17:18
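
For reference, a runnable version of the Pig sketch from that last comment (a minimal example; 'file', 'sample_out', and the limit of 250 are placeholder names and values, and it assumes plain-text input with one record per line):

-- load each line of the text file as a single chararray field
A = LOAD 'file' USING PigStorage() AS (line:chararray);
-- keep only the first 250 records
B = LIMIT A 250;
-- write the sample back to HDFS
STORE B INTO 'sample_out';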