To effectively utilise map-reduce jobs in Hadoop, I need data to be stored in Hadoop's sequence file format. However, currently the data is only in flat .txt format. Can anyone suggest a way I can convert a .txt file to a sequence file?
7 Answers
So the simplest answer is just an "identity" job that has a SequenceFile output.
It looks like this in Java:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// wrapper class name is arbitrary
public class TextToSequenceFile {
    public static void main(String[] args) throws IOException,
            InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        Job job = new Job(conf);
        job.setJobName("Convert Text");
        job.setJarByClass(TextToSequenceFile.class);
        // identity Mapper and Reducer: records pass through unchanged
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        // increase if you need sorting or a special number of files
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.addInputPath(job, new Path("/lol"));
        SequenceFileOutputFormat.setOutputPath(job, new Path("/lolz"));
        // submit and wait for completion
        job.waitForCompletion(true);
    }
}

- So, if I have 100 .txt files this will give me 100 .seq files, right? What if I want 1 big .seq file? – dranxo Aug 03 '12 at 23:00
- I'm guessing: job.setNumReduceTasks(1); – dranxo Aug 03 '12 at 23:07
- I get an error trying to write this: addInputPaths arguments must be of type Conf or JobConf, not Job. If I change Job to JobConf, the methods setMapperClass and setReducerClass are not available. – Vale Jun 21 '16 at 08:57
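Following up on the comments: with zero reduce tasks each mapper writes its own output file, so the guess above is right. Setting a single reduce task (while keeping the identity Reducer) funnels everything into one sequence file, at the cost of a single-reducer bottleneck:

// one reducer => one output file
job.setNumReduceTasks(1);
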
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// White, Tom (2012-05-10). Hadoop: The Definitive Guide (Kindle Locations 5375-5384). O'Reilly Media. Kindle Edition.
public class SequenceFileWriteDemo {
    private static final String[] DATA = {
        "One, two, buckle my shoe",
        "Three, four, shut the door",
        "Five, six, pick up sticks",
        "Seven, eight, lay them straight",
        "Nine, ten, a big fat hen"
    };

    public static void main(String[] args) throws IOException {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);
        IntWritable key = new IntWritable();
        Text value = new Text();
        SequenceFile.Writer writer = null;
        try {
            writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass());
            for (int i = 0; i < 100; i++) {
                key.set(100 - i);
                value.set(DATA[i % DATA.length]);
                // getLength() reports the current position in the output file
                System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
                writer.append(key, value);
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}
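As a quick way to verify the output, here is a minimal read-back sketch under the same assumptions (the old-style SequenceFile.Reader constructor; key and value classes are taken from the file header):

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SequenceFileReadDemo {
    public static void main(String[] args) throws IOException {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);
        SequenceFile.Reader reader = null;
        try {
            reader = new SequenceFile.Reader(fs, path, conf);
            // instantiate the key/value classes recorded in the file header
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            while (reader.next(key, value)) {
                System.out.printf("%s\t%s\n", key, value);
            }
        } finally {
            IOUtils.closeStream(reader);
        }
    }
}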

It depends on what the format of the TXT file is. Is it one line per record? If so, you can simply use TextInputFormat which creates one record for each line. In your mapper you can parse that line and use it whichever way you choose.
If it isn't one line per record, you might need to write your own InputFormat implementation. Take a look at this tutorial for more info.
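For the one-line-per-record case, a minimal sketch of such a mapper (the class name LineParseMapper is hypothetical; here the line is passed through unchanged, but your parsing would go in map()):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// With TextInputFormat, the key is the byte offset of the line and the value is the line itself
public class LineParseMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // parse the line however you need; here we emit it unchanged
        context.write(offset, line);
    }
}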

You can also just create an intermediate table, LOAD DATA the csv contents straight into it, then create a second table as sequencefile (partitioned, clustered, etc..) and insert into select from the intermediate table. You can also set options for compression, e.g.,
set hive.exec.compress.output = true;
set io.seqfile.compression.type = BLOCK;
set mapred.output.compression.codec = org.apache.hadoop.io.compress.SnappyCodec;
create table... stored as sequencefile;
insert overwrite table ... select * from ...;
The MR framework will then take care of the heavy lifting for you, saving you the trouble of writing Java code.
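Putting the pieces together, a sketch with hypothetical table names (txt_staging, seq_table), hypothetical columns, and a placeholder input path, assuming a tab-delimited text file:

-- staging table over the raw text file
CREATE TABLE txt_staging (id INT, payload STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE;

LOAD DATA INPATH '/path/to/input.txt' INTO TABLE txt_staging;

-- final table stored as a SequenceFile, with block compression enabled
SET hive.exec.compress.output = true;
SET io.seqfile.compression.type = BLOCK;
SET mapred.output.compression.codec = org.apache.hadoop.io.compress.SnappyCodec;

CREATE TABLE seq_table (id INT, payload STRING) STORED AS SEQUENCEFILE;
INSERT OVERWRITE TABLE seq_table SELECT * FROM txt_staging;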

Be watchful with the format specifier in the code as printed in the book. For example (note the space between % and s),

System.out.printf("[% s]\t% s\t% s\n", writer.getLength(), key, value);

will give us java.util.FormatFlagsConversionMismatchException: Conversion = s, Flags =

Instead, we should use:

System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);

If you have Mahout installed, it has a tool called seqdirectory which can do it.
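For example, something like the following (paths are placeholders); seqdirectory reads a directory of text files and writes SequenceFiles with the file name as key and the file contents as value:

mahout seqdirectory -i /path/to/text/dir -o /path/to/seq/output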

If your data is not on HDFS, you need to upload it to HDFS. Two options:
i) Use hdfs dfs -put on your .txt file, and once you get it on HDFS, you can convert it to a seq file.
ii) Take the text file as input on your HDFS client box and convert it to a SeqFile using the Sequence File APIs, by creating a SequenceFile.Writer and appending (key, value) pairs to it.
If you don't care about the key, you can use the line number as the key and the complete text as the value, as in the sketch below.
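A minimal sketch of option ii), assuming a local text file as args[0] and a target HDFS path as args[1] (the class name TxtToSeqFile is hypothetical; it reuses the SequenceFile.Writer API shown above):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class TxtToSeqFile {
    public static void main(String[] args) throws IOException {
        // args[0] = local .txt file, args[1] = target path on HDFS
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path outPath = new Path(args[1]);
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        SequenceFile.Writer writer = null;
        try {
            writer = SequenceFile.createWriter(fs, conf, outPath,
                    LongWritable.class, Text.class);
            String line;
            long lineNo = 0;
            while ((line = in.readLine()) != null) {
                // line number as key, whole line as value
                writer.append(new LongWritable(lineNo++), new Text(line));
            }
        } finally {
            IOUtils.closeStream(writer);
            in.close();
        }
    }
}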
