
I have a huge data file (~8 GB / ~80 million records). Every record has 6-8 attributes separated by single tabs. To start, I would like to copy some given attributes into another file. I'd like more elegant code than the following, for example if I want only the second and the last token out of a total of 4:

StringTokenizer st = new StringTokenizer(line, "\t");
st.nextToken(); //get rid of the first token
System.out.println(st.nextToken()); //show me the second token
st.nextToken(); //get rid of the third token
System.out.println(st.nextToken()); //show me the fourth token

As a reminder, it's a huge file, so I have to avoid any redundant if checks.

Michael
  • If you're just asking for "more elegant code", perhaps you should check out http://codereview.stackexchange.com/ – Kache Oct 13 '12 at 20:43
  • I would *not* use Java for this. I'd probably use `cut`. – Paul Tomblin Oct 13 '12 at 20:44
  • Do you want more elegant code, or faster code? More elegant, in this case, will probably mean slower. Lower-level, less elegant code will probably be marginally faster. – JB Nizet Oct 13 '12 at 20:59
  • @JBNizet You're right, I need speed. Currently I'm working on a very small subset of my data (around 200 records), and beyond my "more elegant code" issue I'm also getting an "Error: null" whose source I can't figure out, if that rings any bell. – Michael Oct 13 '12 at 21:24

5 Answers


Your question got me wondering about performance. Lately I've been using Guava's Splitter where possible, just because I dig the syntax, but I had never measured its performance, so I put together a quick test of four parsing styles. I wrote them very quickly, so pardon mistakes in style and edge-case correctness. They're based on the understanding that we're only interested in the second and fourth items.

What I found interesting is that the "homeGrown" solution (really crude code) is the fastest when parsing a 350MB tab-delimited text file with four columns, for example:

head test.txt 
0   0   0   0
1   2   3   4
2   4   6   8
3   6   9   12

When operating over 350MB of data on my laptop, I got the following results:

  • homegrown: 2271ms
  • guavaSplit: 3367ms
  • regex: 7302ms
  • tokenize: 3466ms

Given that, I think I'll stick with Guava's splitter for most work and consider custom code for larger data sets.

import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.google.common.base.Splitter;
import com.google.common.collect.Lists;

  public static List<String> tokenize(String line){
    List<String> result = Lists.newArrayList();
    StringTokenizer st = new StringTokenizer(line, "\t");
    st.nextToken(); //get rid of the first token
    result.add(st.nextToken()); //show me the second token
    st.nextToken(); //get rid of the third token
    result.add(st.nextToken()); //show me the fourth token
    return result;
  }

  static final Splitter splitter = Splitter.on('\t');
  public static List<String> guavaSplit(String line){
    List<String> result = Lists.newArrayList();
    int i=0;
    for(String str : splitter.split(line)){
      if(i==1 || i==3){
        result.add(str);
      }
      i++;
    }
    return result;
  }

  static final Pattern p = Pattern.compile("^(.*?)\\t(.*?)\\t(.*?)\\t(.*)$");
  public static List<String> regex(String line){
    List<String> result = null;
    Matcher m = p.matcher(line);
    if(m.find()){
      if(m.groupCount()>=4){
        result= Lists.newArrayList(m.group(2),m.group(4));
      }
    }
    return result;
  }

  public static List<String> homeGrown(String line){
    List<String> result = Lists.newArrayList();
    String subStr = line;
    int cnt = -1;
    int indx = subStr.indexOf('\t');
    while(++cnt < 4 && indx != -1){
      if(cnt==1||cnt==3){
        result.add(subStr.substring(0,indx));
      }
      subStr = subStr.substring(indx+1);
      indx = subStr.indexOf('\t');
    }
    if(cnt==1||cnt==3){
      result.add(subStr);
    }
    return result;
  }

Note that all of these would likely be slower with proper bounds checking and a more elegant implementation.
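For reference, a minimal harness for this kind of comparison might look like the sketch below; the file name and the exact measurement approach are assumptions on my part, not the code the numbers above were produced with.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public static void main(String[] args) throws IOException {
    long start = System.currentTimeMillis();
    try (BufferedReader reader = new BufferedReader(new FileReader("test.txt"))) {
        String line;
        while ((line = reader.readLine()) != null) {
            homeGrown(line); // swap in tokenize / guavaSplit / regex to compare the other styles
        }
    }
    System.out.println("homegrown: " + (System.currentTimeMillis() - start) + "ms");
}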

DSK

You should probably use the Unix cut utility, as Paul Tomblin says.
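For example (a rough sketch with placeholder file names; cut splits on tabs by default), extracting the second and fourth fields would be something like:

cut -f2,4 input.tsv > output.tsv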

However, in Java you could also try:

String[] fields = line.split("\t");
System.out.println(fields[1]+" "+fields[3]);

Whether this is more 'elegant' is a matter of opinion. Whether it's faster on large files, I don't know - you would need to benchmark it on your system.

Relative performance will also depend on how many fields there are per line, and which fields you want; split() will process the whole line at once, but StringTokenizer will work through the line incrementally (good if you only want fields 2 and 4 out of 20, for example).

DNA

Although your data file is huge, it sounds like your question is more about how to conveniently access items in a line of text, where the items are separated by tabs. I think StringTokenizer is overkill for a format this simple.

I would use some type of "split" to convert the line into an array of tokens. I prefer the StringUtils split in commons-lang over String.split, especially when a regular expression is not needed. Since a tab is "whitespace", you can use the default split method without specifying the delimiter:

String[] items = StringUtils.split(line);
if (items != null && items.length >= 4) // need at least 4 fields before reading items[3]
{
    System.out.println("Second: " + items[1] + "; Fourth: " + items[3]);
}
Guido Simone
  • Just a note: in Java 7, String.split has been improved to not use a regexp when the delimiter is a single char. – JB Nizet Oct 13 '12 at 21:05

One point: if you are doing readLines, you are actually scanning the file twice: 1) you search the file one character at a time for the end-of-line character, and 2) you then scan each line for tabs.

You could look at one of the CSV libraries. From memory, flatpack does just the one scan. Such libraries may deliver better performance (I have never tested them, though).

A couple of Java libraries:
  • Java CSV library
  • flatpack
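As a rough illustration of the single-scan idea only (this is not how flatpack or the Java CSV library are actually implemented), one pass can detect both the tab and end-of-line boundaries and emit the wanted columns as it goes:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

// Single scan: every character is examined exactly once; field and record
// boundaries are found in the same pass that collects the field contents.
static void printSecondAndFourth(Reader in) throws IOException {
    BufferedReader reader = new BufferedReader(in);
    StringBuilder field = new StringBuilder();
    int column = 0;
    int ch;
    while ((ch = reader.read()) != -1) {
        if (ch == '\t' || ch == '\n') {
            if (column == 1 || column == 3) {
                System.out.println(field);
            }
            field.setLength(0);
            column = (ch == '\n') ? 0 : column + 1;
        } else if (ch != '\r') {
            field.append((char) ch);
        }
    }
    if (field.length() > 0 && (column == 1 || column == 3)) {
        System.out.println(field); // last field when the file has no trailing newline
    }
}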

Bruce Martin

If your file is huge, then besides speed you will also face problems with memory consumption, because you have to load the file into memory to manipulate it.

I have an idea, but please note that it is platform-specific and sacrifices Java's portability.

You can run a Unix command from Java to gain a lot of speed and reduce memory consumption. For example:

    public static void main(final String[] args) throws Exception {
        // Pipes and redirection need a shell, so run the command through "sh -c":
        Process p = Runtime.getRuntime().exec(
                new String[] { "sh", "-c", "cat <file> | awk '{print $1}' >> myNewFile.txt" });
        p.waitFor();
    }
mspapant