
I have a file with almost 5*(10^6) lines of integers, so it is fairly large.

The question is about extracting specific lines by filtering them on a condition. For example, I'd like to:

  1. Extract the first N lines without reading the entire file.
  2. Extract the lines whose number is less than or equal to X (or any of >=, <=, <, >).
  3. Extract the lines whose number satisfies some condition (a math predicate).

Is there a clever way to perform these tasks (using sed, awk, cat, or head)?

Thanks in advance.

Jonathan Prieto-Cubides

1 Answer


To extract the first $NUMBER lines,

head -n "$NUMBER" filename

Assuming every line contains just a number (although it will also work if only the first token is one), task 2 can be solved like this:

awk '$1 >= 1234 && $1 < 5678' filename
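When the bound lives in a shell variable, as the X in the question suggests, it can be handed to awk with `-v` rather than pasted into the program text. The following is only a sketch: the sample data and the names `limit` and `x` are made up for illustration.

```shell
# Hypothetical sample data: one integer per line.
tmpfile=$(mktemp)
printf '%s\n' 5 1200 1234 4000 5678 9000 > "$tmpfile"

# Hypothetical bound; -v copies it into the awk variable x.
limit=1234

# Keep lines whose first field is <= x; adding 0 forces a numeric comparison.
at_most=$(awk -v x="$limit" '$1 + 0 <= x + 0' "$tmpfile")
printf '%s\n' "$at_most"

rm -f "$tmpfile"
```

Swapping `<=` for `<`, `>`, or `>=` covers the other comparisons.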

In the same spirit, task 3 is just the generalization:

awk 'condition' filename

It would have helped if you had specified what the condition is supposed to be, though. As it stands, you'll have to read the awk documentation to find out how to code it. Again, the number is available as $1.
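Since the question does not say which predicate is meant, here are two guesses, purely as illustration: even numbers and perfect squares. The sample data is invented, and the square test assumes non-negative integers.

```shell
# Hypothetical sample data: one non-negative integer per line.
tmpfile=$(mktemp)
printf '%s\n' 3 4 9 10 16 21 > "$tmpfile"

# Even numbers: the remainder of division by 2 is zero.
evens=$(awk '$1 % 2 == 0' "$tmpfile")
printf '%s\n' "$evens"

# Perfect squares: round the square root, square it, compare with the number.
squares=$(awk 'int(sqrt($1) + 0.5) ^ 2 == $1' "$tmpfile")
printf '%s\n' "$squares"

rm -f "$tmpfile"
```

Any expression awk can evaluate to true or false can stand in the same position.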

I don't think I can explain anything about the head call; it really does just what it says on the tin. As for the awk lines: awk, like sed, works line by line. awk fetches lines in a loop and applies your code to each line. This code takes the form

condition1 { action1 }
condition2 { action2 }
# and so forth

For every line awk fetches, the conditions are checked in the order they appear, and the action associated with each condition is performed if that condition is true. It would, for example, have been possible to extract the first $NUMBER lines of a file with awk like this:

awk -v number="$NUMBER" '1 { print } NR == number { exit }' filename

where 1 is synonymous with true (like in C) and NR is the line number. The -v command line option initializes the awk variable number to $NUMBER. If no action is specified, the default action is { print }, which prints the whole line. So

awk 'condition' filename

is shorthand for

awk 'condition { print }' filename

...which prints every line where the condition holds.
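To make the condition/action form concrete, here is a small sketch with more than one rule. The sample data and the variable names are invented; `END` is a special awk pattern whose action runs once after the last line.

```shell
# Hypothetical sample data: one integer per line.
tmpfile=$(mktemp)
printf '%s\n' 7 150 42 900 13 > "$tmpfile"

# Two rules plus END. Every rule is tested against every line, in order.
report=$(awk '
    $1 < 100  { small++ }                        # count only, print nothing
    $1 >= 100 { print "big:", $1 }               # explicit action
    END       { print "small count:", small + 0 }
' "$tmpfile")
printf '%s\n' "$report"

# The awk loop from the answer should select the same lines as head.
with_head=$(head -n 2 "$tmpfile")
with_awk=$(awk -v number=2 '1 { print } NR == number { exit }' "$tmpfile")

rm -f "$tmpfile"
```

Because the first rule has an action of its own, the default `{ print }` does not apply to it, so lines below 100 are counted but not printed.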

Wintermute
  • Really helpful as I thought, Thanks! – Jonathan Prieto-Cubides Jan 02 '15 at 03:12
  • On OS X with zsh, the `awk` command only works if I put the condition in single quotes. – Jonathan Prieto-Cubides Jan 02 '15 at 03:18
  • I never worked much with zsh. Anyway, if you have a non-POSIX shell, you can transfer shell variables into awk with its `-v` option. The example in the explanation can be written as `awk -v number=$NUMBER '1 { print } NR == number { exit }' filename`. It is actually better style to do this, come to think of it, because substituting shell variables directly into code can lead to weirdness. if `$NUMBER` is, for example, `10 { print } 0`, the code will be completely changed. I think I'll change that in the answer so nobody ends up imitating it. – Wintermute Jan 02 '15 at 03:25
  • Oh, maybe the reason it broke for you with double quotes is that the shell will expand `$1` if it appears in a double-quoted string. You could write `\$1` everywhere, but it's better to use `-v`. – Wintermute Jan 02 '15 at 03:30
  • Just a couple of clarifications: 1) unlike sed, which always works on lines, awk works on records, which are lines by default but can be multi-line blocks of text depending on the `RS` value. 2) Never encase awk scripts in double quotes on UNIX; see http://cfajohnson.com/shell/cus-faq-2.html#Q24 for how to pass the value of shell variables to awk scripts. 3) Always quote shell variables, so it's `number="$NUMBER"`. 4) All-upper-case names are reserved for exported shell variables by convention. – Ed Morton Jan 03 '15 at 15:35