
I would like your advice/help on how to subset a big file (millions of rows or columns).

For example,

(1) I have a big file (millions of rows, tab-delimited). I want a subset of this file containing only rows 10000 to 100000.

(2) I have a big file (millions of columns, tab-delimited). I want a subset of this file containing only columns 10000 to 100000.

I know there are tools like head, tail, cut, split, awk, and sed, and I can use them for simple subsetting, but I do not know how to do this particular job.

Could you please give me some advice? Thanks in advance.

jianfeng.mao

4 Answers


Filtering rows is easy, for example with AWK:

cat largefile | awk 'NR >= 10000  && NR <= 100000 { print }'

Filtering columns is easier with cut. TAB is cut's default delimiter, so -d is not needed here; if you do pass it, note that cut accepts only a single-character delimiter, so use a literal tab (e.g. -d $'\t' in bash) rather than '\t':

cat largefile | cut -f 10000-100000

As Rahul Dravid mentioned in the comments, cat is not needed here, and as Zsolt Botykai added, you can improve performance by exiting as soon as the range has been printed:

awk 'NR > 100000 { exit } NR >= 10000 && NR <= 100000' largefile
cut -f 10000-100000 largefile
Drakosha
  • +1, this is the simplest answer I could think of. But cat is not required here; we can do this without it: `awk 'NR >= 10000 && NR <= 100000 { print }' largefile` and `cut -f 10000-100000 largefile` – Vijay Jun 27 '11 at 10:44
  • You can shorten the first one to `awk 'NR >= 10000 && NR <= 100000' largefile`: that's one less process, and awk's default action is to print when the condition is true. But that slurps the lines after 100000 for nothing, so better to write it as `awk 'NR > 100000 { exit } NR >= 10000 && NR <= 100000' largefile`, as this way processing stops on the 100001st line. – Zsolt Botykai Jun 27 '11 at 10:47
  • Thanks a lot, Drakosha and others. I learned a lot from you all; your scripts will greatly improve my file-handling skills. – jianfeng.mao Jun 27 '11 at 12:39
  • Since you are exiting at NR > 100000, there is no need to check for NR <= 100000: `awk 'NR>100000{exit}NR>=10000' largefile` – ghostdog74 Jun 28 '11 at 02:14

Some different solutions:

For row ranges, in sed:

sed -n 10000,100000p somefile.txt
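
If the file is huge, a small variant (a sketch; semicolon-separated commands as accepted by GNU and BSD sed) stops reading as soon as the range has been printed instead of scanning to the end of the file:

sed -n '10000,100000p;100000q' somefile.txt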

For column ranges in awk:

awk -v f=10000 -v t=100000 '{ for (i=f; i<=t;i++) printf("%s%s", $i,(i==t) ? "\n" : OFS) }' details.txt
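
Since the question's file is tab-delimited, you may want to set the input and output field separators explicitly so that empty fields are preserved and the output stays tab-delimited; a sketch of the same loop with that change:

awk -F'\t' -v OFS='\t' -v f=10000 -v t=100000 '{ for (i=f; i<=t; i++) printf("%s%s", $i, (i==t) ? "\n" : OFS) }' details.txt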
Vijay
  • Thanks a lot, Rahul. You have pointed me in a good direction with sed and awk; though I can only use them for simple operations, I learned a lot from you. – jianfeng.mao Jun 27 '11 at 12:36

For the first problem, selecting a range of rows from a large file, piping tail into head is very simple. You want lines 10000 through 100000 of largefile, which is 90001 lines: tail starts output at line 10000, and head then keeps only the first 90001 lines of that stream.

tail -n +10000 largefile | head -n 90001
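
If you need both the row subset and the column subset from the question at once, the same pipe can feed cut (a sketch, assuming the file is tab-delimited as described, since TAB is cut's default delimiter):

tail -n +10000 largefile | head -n 90001 | cut -f 10000-100000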
Warren

I was beaten to the sed solution, so I'll post a Perl equivalent instead. To print selected lines:

$ seq 100 | perl -ne 'print if $. >= 10 && $. <= 20' 
10
11
12
13
14
15
16
17
18
19
20
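
Applied to the question's row range, with an early exit so the rest of the file is never read (a sketch in the same spirit as the awk comment above):

perl -ne 'print if $. >= 10000; exit if $. == 100000' largefile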

To print selected columns, use an array slice of @F (which is zero-indexed, so indices 1..3 are the second through fourth columns):

perl -lane 'print "@F[1..3]"'

-F is used in conjunction with -a to choose the delimiter on which each line is split into @F (whitespace by default).
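
For the original tab-delimited case, a sketch along those lines (assuming columns are numbered from 1, so columns 10000 to 100000 map to @F indices 9999 to 99999):

perl -F'\t' -lane 'print join "\t", @F[9999..99999]' largefile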

To test, use seq and paste to generate some columns:

$ seq 50 | paste - - - - -
1   2   3   4   5
6   7   8   9   10
11  12  13  14  15
16  17  18  19  20
21  22  23  24  25
26  27  28  29  30
31  32  33  34  35
36  37  38  39  40
41  42  43  44  45
46  47  48  49  50

Let's print everything except the first and the last column:

$ seq 50 | paste - - - - - | perl -lane 'print join "   ", @F[1..3]'
2   3   4
7   8   9
12  13  14
17  18  19
22  23  24
27  28  29
32  33  34
37  38  39
42  43  44
47  48  49

In the join statement above, the separator is a literal tab character; you can type one at the shell with Ctrl-V followed by Tab.
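
If you prefer not to embed a literal tab, a "\t" escape inside Perl's double quotes works the same way (same example, otherwise unchanged):

$ seq 50 | paste - - - - - | perl -lane 'print join "\t", @F[1..3]'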

Fredrik Pihl