Here is my solution.
First, an input generator:
#!/usr/bin/env ruby
#
def usage(e)
  puts "Usage #{__FILE__} <n_rows> <n_cols>"
  exit e
end

usage 1 unless ARGV.size == 2

rows, cols = ARGV.map { |e| e.to_i }
(1..rows).each do |l|
  (1..cols).each { |c| printf "%s ", c }
  puts ""
end
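For reference, a minimal sketch in plain Ruby of what the generator emits (trailing spaces trimmed for readability): every row simply repeats the column indices 1..cols.

```ruby
# Reproduce gen.data.rb's output for the equivalent of `./gen.data.rb 2 3`.
rows, cols = 2, 3
lines = (1..rows).map { (1..cols).map(&:to_s).join(" ") }
puts lines
# Prints:
# 1 2 3
# 1 2 3
```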
The split tool:
#!/usr/bin/env ruby
#

def usage(e)
  puts "Usage #{__FILE__} <column_start> <column_end>"
  exit e
end

usage 1 unless ARGV.size == 2

c_start, c_end = ARGV.map { |e| e.to_i }
i = 0
buffer = []
$stdin.each_line do |l|
  i += 1
  buffer << l.split[c_start..c_end].join(" ")
  $stderr.printf "\r%d", i if i % 100000 == 0
end
$stderr.puts ""
buffer.each { |l| puts l }
Notice that the split tool prints the number of the line it is currently
processing to stderr (every 100,000 lines) so you can get an idea of how fast it is going.
Also, I am assuming that the separator is a space.
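If your data uses a different separator, only the line that builds the buffer needs to change. A hypothetical comma-separated variant (not part of the scripts above):

```ruby
# split.rb's buffer line, adapted for comma-separated input:
# buffer << l.split(",")[c_start..c_end].join(",")
line = "a,b,c,d,e"
c_start, c_end = 1, 3
puts line.split(",")[c_start..c_end].join(",")
# Prints: b,c,d
```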
Example of how to run it:
$ time ./gen.data.rb 1000 10 | ./split.rb 0 4 > ./out
This generates 1000 lines with 10 columns each and splits out the first 5 columns (0 through 4). I use time(1)
to measure the running time.
We can use a little one-liner to do the splitting you requested (sequentially). It is very
easy to process it in parallel on a single node (check the bash builtin command wait) or to
send the jobs to a cluster.
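As a sketch of the parallel single-node version (assuming the same input.txt and split.rb as below): generate the same 25 commands but append "&" to each so they run in the background, then emit a final "wait" so bash blocks until they all finish.

```ruby
# Generate the 25 split commands, backgrounded with "&"; piping this
# output to bash runs them concurrently, and "wait" blocks until all exit.
cmds = (4..100).step(4).map do |i|
  "cat input.txt | ./split.rb #{i - 4} #{i} > out.#{i / 4} &"
end
puts cmds
puts "wait"
```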
$ ruby -e '(0..103).each {|i| puts "cat input.txt | ./split.rb #{i-4} #{i} > out.#{i/4}" if i % 4 == 0 && i > 0}' | /bin/bash
Which basically generates:
cat input.txt | ./split.rb 0 4 > out.1
cat input.txt | ./split.rb 4 8 > out.2
cat input.txt | ./split.rb 8 12 > out.3
cat input.txt | ./split.rb 12 16 > out.4
cat input.txt | ./split.rb 16 20 > out.5
cat input.txt | ./split.rb 20 24 > out.6
cat input.txt | ./split.rb 24 28 > out.7
cat input.txt | ./split.rb 28 32 > out.8
cat input.txt | ./split.rb 32 36 > out.9
cat input.txt | ./split.rb 36 40 > out.10
cat input.txt | ./split.rb 40 44 > out.11
cat input.txt | ./split.rb 44 48 > out.12
cat input.txt | ./split.rb 48 52 > out.13
cat input.txt | ./split.rb 52 56 > out.14
cat input.txt | ./split.rb 56 60 > out.15
cat input.txt | ./split.rb 60 64 > out.16
cat input.txt | ./split.rb 64 68 > out.17
cat input.txt | ./split.rb 68 72 > out.18
cat input.txt | ./split.rb 72 76 > out.19
cat input.txt | ./split.rb 76 80 > out.20
cat input.txt | ./split.rb 80 84 > out.21
cat input.txt | ./split.rb 84 88 > out.22
cat input.txt | ./split.rb 88 92 > out.23
cat input.txt | ./split.rb 92 96 > out.24
cat input.txt | ./split.rb 96 100 > out.25
And gets piped to bash.
Be careful with the number of processes (or jobs) you run in parallel, because too many
concurrent readers and writers will flood your storage (unless you have independent storage volumes).
Hope that helps. Let us know how fast it runs for you.
-drd