Here is my solution.
First, an input generator:
#!/usr/bin/env ruby
#
def usage(e)
  puts "Usage #{__FILE__} <n_rows> <n_cols>"
  exit e
end

usage 1 unless ARGV.size == 2

rows, cols = ARGV.map { |e| e.to_i }
(1..rows).each do |l|
  (1..cols).each { |c| printf "%s ", c }
  puts ""
end
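For reference, a minimal sketch in plain Ruby of what the generator emits (trailing spaces trimmed for readability): every row simply repeats the column indices 1..cols.

```ruby
# Reproduce gen.data.rb's output for the equivalent of `./gen.data.rb 2 3`.
rows, cols = 2, 3
lines = (1..rows).map { (1..cols).map(&:to_s).join(" ") }
puts lines
# Prints:
# 1 2 3
# 1 2 3
```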
The split tool:
#!/usr/bin/env ruby
#

def usage(e)
  puts "Usage #{__FILE__} <column_start> <column_end>"
  exit e
end

usage 1 unless ARGV.size == 2

c_start, c_end = ARGV.map { |e| e.to_i }
i = 0
buffer = []
$stdin.each_line do |l|
  i += 1
  buffer << l.split[c_start..c_end].join(" ")
  $stderr.printf "\r%d", i if i % 100000 == 0
end
$stderr.puts ""
buffer.each { |l| puts l }
Notice that the split tool prints the number of the line it is currently
processing to stderr (every 100,000 lines) so you can get an idea of how fast it is going.
Also, I am assuming that the separator is a space.
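If your data uses a different separator, only the line that builds the buffer needs to change. A hypothetical comma-separated variant (not part of the scripts above):

```ruby
# split.rb's buffer line, adapted for comma-separated input:
# buffer << l.split(",")[c_start..c_end].join(",")
line = "a,b,c,d,e"
c_start, c_end = 1, 3
puts line.split(",")[c_start..c_end].join(",")
# Prints: b,c,d
```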
Example of how to run it:
$ time ./gen.data.rb 1000 10 | ./split.rb 0 4 > ./out
This generates 1000 lines with 10 columns each and splits out the first 5 columns (0 through 4). I use time(1)
to measure the running time.
We can use a little one-liner to do the splitting you requested (sequentially). It is very
easy to process it in parallel on a single node (check the bash builtin command wait) or to
send the jobs to a cluster.
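As a sketch of the parallel single-node version (assuming the same input.txt and split.rb as below): generate the same 25 commands but append "&" to each so they run in the background, then emit a final "wait" so bash blocks until they all finish.

```ruby
# Generate the 25 split commands, backgrounded with "&"; piping this
# output to bash runs them concurrently, and "wait" blocks until all exit.
cmds = (4..100).step(4).map do |i|
  "cat input.txt | ./split.rb #{i - 4} #{i} > out.#{i / 4} &"
end
puts cmds
puts "wait"
```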
$ ruby -e '(0..103).each {|i| puts "cat input.txt | ./split.rb #{i-4} #{i} > out.#{i/4}" if i % 4 == 0 && i > 0}' | /bin/bash
Which basically generates:
cat input.txt | ./split.rb 0 4 > out.1
cat input.txt | ./split.rb 4 8 > out.2
cat input.txt | ./split.rb 8 12 > out.3
cat input.txt | ./split.rb 12 16 > out.4
cat input.txt | ./split.rb 16 20 > out.5
cat input.txt | ./split.rb 20 24 > out.6
cat input.txt | ./split.rb 24 28 > out.7
cat input.txt | ./split.rb 28 32 > out.8
cat input.txt | ./split.rb 32 36 > out.9
cat input.txt | ./split.rb 36 40 > out.10
cat input.txt | ./split.rb 40 44 > out.11
cat input.txt | ./split.rb 44 48 > out.12
cat input.txt | ./split.rb 48 52 > out.13
cat input.txt | ./split.rb 52 56 > out.14
cat input.txt | ./split.rb 56 60 > out.15
cat input.txt | ./split.rb 60 64 > out.16
cat input.txt | ./split.rb 64 68 > out.17
cat input.txt | ./split.rb 68 72 > out.18
cat input.txt | ./split.rb 72 76 > out.19
cat input.txt | ./split.rb 76 80 > out.20
cat input.txt | ./split.rb 80 84 > out.21
cat input.txt | ./split.rb 84 88 > out.22
cat input.txt | ./split.rb 88 92 > out.23
cat input.txt | ./split.rb 92 96 > out.24
cat input.txt | ./split.rb 96 100 > out.25
And gets piped to bash.
Be careful with the number of processes (or jobs) you run in parallel, because too many
concurrent readers and writers will flood your storage (unless you have independent storage volumes).
Hope that helps. Let us know how fast it runs for you.
-drd