3

I have a txt file like this:

ID   row1   row2   row3   score
rs16 ...    ...    ...    0.23
rs52 ...    ...    ...    1.43
rs87 ...    ...    ...    0.45
rs89 ...    ...    ...    2.34
rs67 ...    ...    ...    1.89

Columns row1 through row3 do not matter.

I have about 8 million rows, and the scores range from 0 to 3. I would like to find the score that marks the cutoff for the top 1%. I was thinking of re-ordering the data by score and then printing the ~80,000th line. What do you think would be the best code for this?

Evan

2 Answers

2

With GNU coreutils you can do it like this:

sort -k5gr <(tail -n+2 infile) | head -n80KB

You can increase the speed of the above pipeline by first removing columns 2 through 4, which leaves the score in column 2, like this:

tr -s ' ' < infile | cut -d' ' -f1,5 > outfile

Or taken together:

sort -k2gr <(tail -n+2 <(tr -s ' ' < infile | cut -d' ' -f1,5)) | head -n80KB

Edit

I noticed that you are only interested in the 80000th line of the result. In that case, sed -n '80000{p;q}' instead of head, as you suggested, is the way to go.
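The line number 80000 assumes exactly 8 million rows. As a rough sketch, the cutoff line could instead be derived from the file itself. The function name top_percent_cutoff is hypothetical, and the sketch assumes a single header line and the score in column 5:

```shell
# Print the row whose score marks the top pct% of a whitespace-delimited file.
# Assumes one header line and the score in column 5; top_percent_cutoff is a
# hypothetical helper name, not part of any tool.
top_percent_cutoff() {
  file=$1
  pct=${2:-1}
  # Count data rows, excluding the header.
  rows=$(tail -n +2 "$file" | wc -l)
  # Line number of the cutoff: pct% of the rows, rounded up, at least 1.
  cutoff=$(( (rows * pct + 99) / 100 ))
  if [ "$cutoff" -lt 1 ]; then cutoff=1; fi
  # Sort descending by score and print only the cutoff line.
  tail -n +2 "$file" | LC_ALL=C sort -k5,5gr | sed -n "${cutoff}{p;q;}"
}
```

Calling top_percent_cutoff infile 1 would print the whole cutoff row; piping the result through awk '{ print $5 }' should leave just the score.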

Explanation

tail:

  • -n+2 - skip header.

sort:

  • k5 - sort on 5th column.
  • gr - flags that make sort choose reverse general-numeric-sort.

head:

  • n - number of lines to keep. KB is a 1000 multiplier, see info head for others.
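
To illustrate the difference between the g flag and the plain numeric n flag, here is a small sketch on made-up values. The general-numeric sort understands exponent notation, while the numeric sort stops reading at the "e":

```shell
# Scores in mixed notation, one per line (made-up values for illustration).
scores=$(printf '0.5\n1e-3\n2\n')

# -g parses 1e-3 as 0.001, so it sorts last in descending order.
general=$(printf '%s\n' "$scores" | LC_ALL=C sort -gr)

# -n stops parsing at the "e", reads 1e-3 as 1, and ranks it above 0.5.
numeric=$(printf '%s\n' "$scores" | LC_ALL=C sort -nr)

printf 'general-numeric (-g):\n%s\n\nnumeric (-n):\n%s\n' "$general" "$numeric"
```

For plain decimal scores such as 0.23 or 1.43 both flags should give the same order, so the choice mostly matters if the file may contain values in exponent notation.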
Thor
  • Would `sort -k5nr (infile) > sort.infile` , and then a `sed -80000p (sort.infile)` work? – Evan Jul 24 '15 at 21:07
  • @Evan: Sure, but without the parentheses around the file names. The `<(...)` construct executes the programs inside it and redirects their output to the "outer" command. – Thor Jul 24 '15 at 21:21
0

With GNU awk you can sort the values by setting PROCINFO["sorted_in"] to "@val_num_desc". For example:

parse.awk

# Set sorting method
BEGIN { PROCINFO["sorted_in"]="@val_num_desc" }

# Print header
NR == 1 { print $1, $5 }

# Save 1st and 5th columns in g and h hashes respectively
NR>1 { g[NR] = $1; h[NR] = $5 }

# Print values from g and h until ratio is reached
END {
  for(k in h) { 
    if(i++ >= int(0.5 + NR*ratio_to_keep)) 
      exit
    print g[k], h[k]
  }
}

Run it like this:

awk -f parse.awk OFS='\t' ratio_to_keep=.01 infile
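
The script above keeps every row in memory and depends on GNU awk. As an alternative sketch, not the method above, awk can be used only to pick the two columns while sort does the ordering. The function name top_fraction is hypothetical, and the sketch assumes one header line and the score in column 5. Unlike parse.awk, it rounds on the number of data rows, excluding the header:

```shell
# Print the ID and score of the top `ratio` fraction of rows, highest first.
# Assumes one header line and the score in column 5; top_fraction is a
# hypothetical helper name.
top_fraction() {
  file=$1
  ratio=${2:-0.01}
  tab=$(printf '\t')
  # Count data rows, excluding the header.
  rows=$(tail -n +2 "$file" | wc -l)
  # Round to the nearest whole number of rows to keep.
  keep=$(awk -v n="$rows" -v r="$ratio" 'BEGIN { print int(0.5 + n * r) }')
  # Header first: only the ID and score columns.
  awk -v OFS="$tab" 'NR == 1 { print $1, $5 }' "$file"
  if [ "$keep" -lt 1 ]; then return 0; fi
  # awk drops the unused columns; sort orders by score; head keeps the top rows.
  awk -v OFS="$tab" 'NR > 1 { print $1, $5 }' "$file" |
    LC_ALL=C sort -t "$tab" -k2,2gr | head -n "$keep"
}
```

It could be run as top_fraction infile .01, which should produce tab-separated output in the same shape as the awk script.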
Thor