To effectively utilise map-reduce jobs in Hadoop, I need data to be stored in Hadoop's sequence file format. However, currently the data is only in flat .txt format. Can anyone suggest a way I can convert a .txt file to a sequence file?
7 Answers
So the simplest answer is just an "identity" job that has a SequenceFile output.
It looks like this in Java:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// wrapper class name is arbitrary
public class TextToSequenceFile {
    public static void main(String[] args) throws IOException,
            InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        Job job = new Job(conf);
        job.setJobName("Convert Text");
        job.setJarByClass(TextToSequenceFile.class);
        // identity Mapper and Reducer: records pass through unchanged
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        // increase if you need sorting or a special number of files
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.addInputPath(job, new Path("/lol"));
        SequenceFileOutputFormat.setOutputPath(job, new Path("/lolz"));
        // submit and wait for completion
        job.waitForCompletion(true);
    }
}

- So, if I have 100 .txt files this will give me 100 .seq files, right? What if I want 1 big .seq file? – dranxo Aug 03 '12 at 23:00
- I'm guessing: job.setNumReduceTasks(1); – dranxo Aug 03 '12 at 23:07
- I get an error trying to write this: addInputPaths arguments must be of type Conf or JobConf, not Job. If I change Job to JobConf, the methods setMapperClass and setReducerClass are not available. – Vale Jun 21 '16 at 08:57
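Following up on the comments: with zero reduce tasks each mapper writes its own output file, so the guess above is right. Setting a single reduce task (while keeping the identity Reducer) funnels everything into one sequence file, at the cost of a single-reducer bottleneck:

// one reducer => one output file
job.setNumReduceTasks(1);
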
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// White, Tom (2012-05-10). Hadoop: The Definitive Guide (Kindle Locations 5375-5384). O'Reilly Media. Kindle Edition.
public class SequenceFileWriteDemo {
    private static final String[] DATA = {
        "One, two, buckle my shoe",
        "Three, four, shut the door",
        "Five, six, pick up sticks",
        "Seven, eight, lay them straight",
        "Nine, ten, a big fat hen"
    };

    public static void main(String[] args) throws IOException {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);
        IntWritable key = new IntWritable();
        Text value = new Text();
        SequenceFile.Writer writer = null;
        try {
            writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass());
            for (int i = 0; i < 100; i++) {
                key.set(100 - i);
                value.set(DATA[i % DATA.length]);
                // getLength() reports the current position in the output file
                System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
                writer.append(key, value);
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}
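As a quick way to verify the output, here is a minimal read-back sketch under the same assumptions (the old-style SequenceFile.Reader constructor; key and value classes are taken from the file header):

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SequenceFileReadDemo {
    public static void main(String[] args) throws IOException {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);
        SequenceFile.Reader reader = null;
        try {
            reader = new SequenceFile.Reader(fs, path, conf);
            // instantiate the key/value classes recorded in the file header
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            while (reader.next(key, value)) {
                System.out.printf("%s\t%s\n", key, value);
            }
        } finally {
            IOUtils.closeStream(reader);
        }
    }
}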

It depends on what the format of the TXT file is. Is it one line per record? If so, you can simply use TextInputFormat which creates one record for each line. In your mapper you can parse that line and use it whichever way you choose.
If it isn't one line per record, you might need to write your own InputFormat implementation. Take a look at this tutorial for more info.
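For the one-line-per-record case, a minimal sketch of such a mapper (the class name LineParseMapper is hypothetical; here the line is passed through unchanged, but your parsing would go in map()):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// With TextInputFormat, the key is the byte offset of the line and the value is the line itself
public class LineParseMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // parse the line however you need; here we emit it unchanged
        context.write(offset, line);
    }
}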

You can also just create an intermediate table, LOAD DATA the csv contents straight into it, then create a second table as sequencefile (partitioned, clustered, etc..) and insert into select from the intermediate table. You can also set options for compression, e.g.,
set hive.exec.compress.output = true;
set io.seqfile.compression.type = BLOCK;
set mapred.output.compression.codec = org.apache.hadoop.io.compress.SnappyCodec;
create table... stored as sequencefile;
insert overwrite table ... select * from ...;
The MR framework will then take care of the heavy lifting for you, saving you the trouble of writing Java code.
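Putting the pieces together, a sketch with hypothetical table names (txt_staging, seq_table), hypothetical columns, and a placeholder input path, assuming a tab-delimited text file:

-- staging table over the raw text file
CREATE TABLE txt_staging (id INT, payload STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE;

LOAD DATA INPATH '/path/to/input.txt' INTO TABLE txt_staging;

-- final table stored as a SequenceFile, with block compression enabled
SET hive.exec.compress.output = true;
SET io.seqfile.compression.type = BLOCK;
SET mapred.output.compression.codec = org.apache.hadoop.io.compress.SnappyCodec;

CREATE TABLE seq_table (id INT, payload STRING) STORED AS SEQUENCEFILE;
INSERT OVERWRITE TABLE seq_table SELECT * FROM txt_staging;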

Be watchful with the format specifier in the code as printed in the book. For example (note the space between % and s),

System.out.printf("[% s]\t% s\t% s\n", writer.getLength(), key, value);

will give us java.util.FormatFlagsConversionMismatchException: Conversion = s, Flags =

Instead, we should use:

System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);

If you have Mahout installed, it has a tool called seqdirectory which can do it.
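For example, something like the following (paths are placeholders); seqdirectory reads a directory of text files and writes SequenceFiles with the file name as key and the file contents as value:

mahout seqdirectory -i /path/to/text/dir -o /path/to/seq/output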

If your data is not on HDFS, you need to upload it to HDFS. Two options:
i) Use hdfs dfs -put on your .txt file, and once you get it on HDFS, you can convert it to a seq file.
ii) Take the text file as input on your HDFS client box and convert it to a SeqFile using the Sequence File APIs, by creating a SequenceFile.Writer and appending (key, value) pairs to it.
If you don't care about the key, you can use the line number as the key and the complete text as the value, as in the sketch below.
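A minimal sketch of option ii), assuming a local text file as args[0] and a target HDFS path as args[1] (the class name TxtToSeqFile is hypothetical; it reuses the SequenceFile.Writer API shown above):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class TxtToSeqFile {
    public static void main(String[] args) throws IOException {
        // args[0] = local .txt file, args[1] = target path on HDFS
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path outPath = new Path(args[1]);
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        SequenceFile.Writer writer = null;
        try {
            writer = SequenceFile.createWriter(fs, conf, outPath,
                    LongWritable.class, Text.class);
            String line;
            long lineNo = 0;
            while ((line = in.readLine()) != null) {
                // line number as key, whole line as value
                writer.append(new LongWritable(lineNo++), new Text(line));
            }
        } finally {
            IOUtils.closeStream(writer);
            in.close();
        }
    }
}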
