
I have a huge data file (~8 GB / ~80 million records). Every record has 6-8 attributes separated by single tabs. To start, I would like to copy some given attributes into another file. I'd like more elegant code than the following, for example if I want only the second and the last token out of a total of 4:

StringTokenizer st = new StringTokenizer(line, "\t");
st.nextToken(); //get rid of the first token
System.out.println(st.nextToken()); //show me the second token
st.nextToken(); //get rid of the third token
System.out.println(st.nextToken()); //show me the fourth token

As a reminder, it's a huge file, so I have to avoid any redundant if checks.

Michael
  • If you're just asking for "more elegant code", perhaps you should check out http://codereview.stackexchange.com/ – Kache Oct 13 '12 at 20:43
  • I would *not* use Java for this. I'd probably use `cut`. – Paul Tomblin Oct 13 '12 at 20:44
  • Do you want more elegant code, or faster code? More elegant, in this case, will probably mean slower. Lower-level, less elegant code will probably be marginally faster. – JB Nizet Oct 13 '12 at 20:59
  • @JBNizet You're right, I need speed. Currently I'm working on a very small subset of my data (around 200 records), and beyond my "more elegant code" issue I'm also getting an "Error: null" whose source I can't figure out, if that rings any bell. – Michael Oct 13 '12 at 21:24

5 Answers


Your question got me wondering about performance. Lately I've been using Guava's Splitter where possible, just because I dig the syntax, but I had never measured its performance, so I put together a quick test of four parsing styles. I wrote them very quickly, so pardon mistakes in style and edge-case correctness. They're based on the understanding that we're only interested in the second and fourth items.

What I found interesting is that the "homeGrown" solution (really crude code) is the fastest when parsing a 350MB tab-delimited text file with four columns, for example:

head test.txt 
0   0   0   0
1   2   3   4
2   4   6   8
3   6   9   12

When operating over 350MB of data on my laptop, I got the following results:

  • homegrown: 2271ms
  • guavaSplit: 3367ms
  • regex: 7302ms
  • tokenize: 3466ms

Given that, I think I'll stick with Guava's splitter for most work and consider custom code for larger data sets.

import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.google.common.base.Splitter;
import com.google.common.collect.Lists;

  public static List<String> tokenize(String line){
    List<String> result = Lists.newArrayList();
    StringTokenizer st = new StringTokenizer(line, "\t");
    st.nextToken(); //get rid of the first token
    result.add(st.nextToken()); //show me the second token
    st.nextToken(); //get rid of the third token
    result.add(st.nextToken()); //show me the fourth token
    return result;
  }

  static final Splitter splitter = Splitter.on('\t');
  public static List<String> guavaSplit(String line){
    List<String> result = Lists.newArrayList();
    int i=0;
    for(String str : splitter.split(line)){
      if(i==1 || i==3){
        result.add(str);
      }
      i++;
    }
    return result;
  }

  static final Pattern p = Pattern.compile("^(.*?)\\t(.*?)\\t(.*?)\\t(.*)$");
  public static List<String> regex(String line){
    List<String> result = null;
    Matcher m = p.matcher(line);
    if(m.find()){
      if(m.groupCount()>=4){
        result= Lists.newArrayList(m.group(2),m.group(4));
      }
    }
    return result;
  }

  public static List<String> homeGrown(String line){
    List<String> result = Lists.newArrayList();
    String subStr = line;
    int cnt = -1;
    int indx = subStr.indexOf('\t');
    while(++cnt < 4 && indx != -1){
      if(cnt==1||cnt==3){
        result.add(subStr.substring(0,indx));
      }
      subStr = subStr.substring(indx+1);
      indx = subStr.indexOf('\t');
    }
    if(cnt==1||cnt==3){
      result.add(subStr);
    }
    return result;
  }

Note that all of these would likely be slower with proper bounds checking and a more elegant implementation.
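For reference, a minimal harness for this kind of comparison might look like the sketch below; the file name and the exact measurement approach are assumptions on my part, not the code the numbers above were produced with.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public static void main(String[] args) throws IOException {
    long start = System.currentTimeMillis();
    try (BufferedReader reader = new BufferedReader(new FileReader("test.txt"))) {
        String line;
        while ((line = reader.readLine()) != null) {
            homeGrown(line); // swap in tokenize / guavaSplit / regex to compare the other styles
        }
    }
    System.out.println("homegrown: " + (System.currentTimeMillis() - start) + "ms");
}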

DSK

You should probably use the Unix cut utility, as Paul Tomblin says.
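For example (a rough sketch with placeholder file names; cut splits on tabs by default), extracting the second and fourth fields would be something like:

cut -f2,4 input.tsv > output.tsv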

However, in Java you could also try:

String[] fields = line.split("\t");
System.out.println(fields[1]+" "+fields[3]);

Whether this is more 'elegant' is a matter of opinion. Whether it's faster on large files, I don't know - you would need to benchmark it on your system.

Relative performance will also depend on how many fields there are per line, and which fields you want; split() will process the whole line at once, but StringTokenizer will work through the line incrementally (good if you only want fields 2 and 4 out of 20, for example).

DNA

Although your data file is huge, it sounds like your question is more about how to conveniently access items in a line of text, where the items are separated by tabs. I think StringTokenizer is overkill for a format this simple.

I would use some type of "split" to convert the line into an array of tokens. I prefer the StringUtils split in commons-lang over String.split, especially when a regular expression is not needed. Since a tab is "whitespace", you can use the default split method without specifying the delimiter:

String[] items = StringUtils.split(line);
if (items != null && items.length >= 4) // need at least 4 fields before reading items[3]
{
    System.out.println("Second: " + items[1] + "; Fourth: " + items[3]);
}
Guido Simone
  • Just a note: in Java 7, String.split has been improved to not use a regexp when the delimiter is a single char. – JB Nizet Oct 13 '12 at 21:05

One point: if you are doing readLines, you are actually scanning the file twice: 1) you search the file one character at a time for the end-of-line character, and 2) you then scan each line for tabs.

You could look at one of the CSV libraries. From memory, flatpack does just the one scan. Such libraries may deliver better performance (I have never tested them, though).

A couple of Java libraries:
  • Java CSV library
  • flatpack
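As a rough illustration of the single-scan idea only (this is not how flatpack or the Java CSV library are actually implemented), one pass can detect both the tab and end-of-line boundaries and emit the wanted columns as it goes:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

// Single scan: every character is examined exactly once; field and record
// boundaries are found in the same pass that collects the field contents.
static void printSecondAndFourth(Reader in) throws IOException {
    BufferedReader reader = new BufferedReader(in);
    StringBuilder field = new StringBuilder();
    int column = 0;
    int ch;
    while ((ch = reader.read()) != -1) {
        if (ch == '\t' || ch == '\n') {
            if (column == 1 || column == 3) {
                System.out.println(field);
            }
            field.setLength(0);
            column = (ch == '\n') ? 0 : column + 1;
        } else if (ch != '\r') {
            field.append((char) ch);
        }
    }
    if (field.length() > 0 && (column == 1 || column == 3)) {
        System.out.println(field); // last field when the file has no trailing newline
    }
}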

Bruce Martin

If your file is huge, then besides speed you will also face problems with memory consumption, because you have to load the file into memory to manipulate it.

I have an idea, but please note that it is platform-specific and sacrifices Java's portability.

You can run a Unix command from Java to gain a lot of speed and reduce memory consumption. For example:

    public static void main(final String[] args) throws Exception {
        // Pipes and redirection need a shell, so run the command through "sh -c":
        Process p = Runtime.getRuntime().exec(
                new String[] { "sh", "-c", "cat <file> | awk '{print $1}' >> myNewFile.txt" });
        p.waitFor();
    }
mspapant