How to awk every nth line starting from different lines each iteration

Question

I would like awk to print every nth line out of a file starting from line 0. Then, after awk has gone through the whole file, I would like it to print every nth line starting from line 1...then print every nth line starting from line 2...etc, up to printing every nth line starting from line n-1. My sad attempt thus far:

#!/bin/bash

rm *.sad *.sadd *.out

#Create loop index
for i in $(seq 20 1 36);
do
        listm+=($i)
done

#Create input file
for j in "${listm[@]}"
do
        if [ $j -eq 20 ];
        then
                awk 'NR % 20 == 0' vel_VMDout > atomvel.dat
                awk '{print $2,$3,$4}' atomvel.dat > velocity.dat
        else
                awk 'NR % 20 == 1' vel_VMDout  > $j.sad
                egrep -v "^[[:space:]]*$|^#" $j.sad > $j.sadd
                awk '{print $2, $3, $4}' $j.sadd > $j.out
                paste velocity.dat $j.out  > taste
        fi
done

Let me try to clarify this by providing the input and what the output should look like. Th input is an xyz file of an MD simulation consisting of frames of the atoms' xyz coordinates.

INPUT:

This image shows the 1st snapshot and part of the second snapshot. Because these are snapshot, the ordering of the atoms do not change. Thus, I am trying to print the xyz coordinates from each snapshot for each specific atom in their own columns as shown below. This would eventually make a file consisting of 3N columns, where N is the number of atoms.

OUTPUT:

As you can see, the each atoms' coordinates are in their own columns and the total file is a Nx3N array. My bash script was me trying to do this, but could only do the first two atoms. I wanted to print every nth line (coordinates of the nth atom) so they look like the output. I really appreciate your patience all.

It is very often useful to show sample input and desired output. It removes ambiguities. (To keep the sample a manageable size, you might the sample for a relatively small n but state that the final version will have n=20.) Your script removes blank and empty lines for `NR % 20 == 1` but not for `NR % 20 == 0`. Is that intentional? What happens when `NR % 20` has other values? — John1024, Jul 11 '15 at 04:22
When you say `awk` in statements like `I would like to awk every nth line` - do you mean `print` or something else? — Ed Morton, Jul 11 '15 at 04:35
How big is the data set? Is it too large to keep in memory? Also, it helps everyone help you if you don't post images of your data but you post an editable copy of the data. — Jonathan Leffler, Jul 11 '15 at 06:57
How are we supposed to test a potential solution against an **image** of the input and output? Edit your question to provide the textual sample input and output, not a picture of it, and make sure the output is exactly what you want given that input. And be very clear in your requirements about which rows and columns from the input are moving to which rows and columns of the output - don't assume that anyone reading this has a clue what `Th input is an xyz file of an MD simulation consisting of frames of the atoms' xyz coordinates`or any other terminology specific to your domain means. — Ed Morton, Jul 11 '15 at 11:32

Jonathan Leffler · Accepted Answer · 2015-07-12T06:49:26.783

Generating sample data

This is a step that should not be necessary; the question should have included usable sample data and the required output from that sample data.

At one level, it won't help much because you don't have my random number generator program, but the script below shows how I generated the data that follows, and it illustrates the lengths to which it might be necessary to go when the question doesn't supply readable data. I generated some data that looks similar to the data in the question (at least superficially):

18
 Generated by VMD in absentia
 C     0.979485   -6.665347   0.575383
 C     1.191999   -3.002386   2.859484
 C     3.151517   -5.610077   0.429413
 C     3.439828   -6.454984   1.319724
 C     3.726201   -0.123038   2.096854
 C     1.363325   -3.031238   0.016019
 C     6.090283   -3.915340   2.396358
 C     0.407755   -7.957784   -0.846842
 C     0.203074   -0.796428   2.659573
 O     2.600610   -2.259674   -0.260378
 O     4.773839   -6.765097   0.588508
 H     2.743424   -2.890016   2.906452
 H     2.810233   -6.641054   -0.797672
 H     6.854169   -3.191721   -0.925670
 O     2.914233   -1.060001   0.776983
 H     3.803923   -1.497032   2.908799
 H     5.669443   -7.227666   -0.647552
 H     0.092455   -5.850637   2.959987
18
 Generated by VMD in absentia
 C     6.042840   -7.254720   2.093573
 C     2.551942   -6.044322   2.061072
 C     3.523150   -6.167163   2.451689
 C     5.197316   -3.429866   -0.412062
 C     2.548777   -6.422851   1.282846
 C     3.775197   -2.012031   1.377440
 C     3.405112   -3.206415   -0.879886
 C     1.448359   -5.419629   0.467291
 C     3.661964   -2.789234   2.644294
 O     4.214854   -2.439574   -0.951704
 O     5.297609   -2.320418   2.709898
 H     2.653940   -4.431080   -0.511743
 H     5.040635   -0.676199   -0.590970
 H     1.546725   -1.294582   2.562937
 O     4.231461   -7.180908   1.629901
 H     3.297836   -1.557133   -0.133280
 H     3.442481   -4.489962   2.111930
 H     1.423611   -7.982655   0.715618
18
 Generated by VMD in absentia
 C     1.432495   -7.686243   2.525734
 C     5.038409   -4.976270   2.826846
 C     6.184137   -7.303094   2.711561
 C     3.208125   -0.606556   1.978725
 C     2.171859   -6.792060   0.678988
 C     6.521124   -5.622797   -0.773797
 C     1.725619   -5.768633   -0.223397
 C     3.602427   -2.325680   1.762008
 C     1.937521   -1.686895   1.743159
 O     0.745526   -0.114246   -0.949490
 O     4.754360   -6.531145   1.998913
 H     1.114732   -1.158810   1.486939
 H     6.410490   -5.411647   0.062737
 H     4.164330   -6.743763   1.802804
 O     2.587841   -3.979700   2.609748
 H     2.192073   -2.815376   -0.809569
 H     5.501795   -2.326438   1.325829
 H     3.285032   -1.212541   1.284453
18
 Generated by VMD in absentia
 C     3.564424   -3.117406   -0.032879
 C     2.894745   -0.632591   0.532311
 C     3.384916   -5.383135   1.179585
 C     0.793488   -0.894539   -0.886891
 C     1.348785   -6.501867   1.648604
 C     2.189941   -2.438067   0.616090
 C     2.043378   -4.966472   0.691603
 C     3.124161   -5.792896   0.545362
 C     5.741472   -0.640590   2.825374
 O     0.300550   -7.149663   0.942726
 O     1.344387   -0.121382   2.169401
 H     4.963296   -0.964665   -0.230523
 H     6.651423   -4.905053   2.509626
 H     5.059694   -6.166516   0.102255
 O     5.046864   -3.288883   0.853948
 H     2.389007   -3.057664   1.806301
 H     2.365876   -0.956860   1.458959
 H     2.892502   -0.097422   -0.531714

The script I used to do it was:

random -n $((4 * 18)) -T '%8:6[0:7]F   %8:6[-8:0]F   %8:6[-1:3]F' |
awk 'BEGIN { n = split("CCCCCCCCCOOHHHOHHH", atoms, ""); atoms[0] = atoms[n] }
     NR % n == 1 { print n; print " Generated by VMD in absentia" }
     { print "", atoms[NR%18], "   ", $0 }'

The -n option to random says how many rows to generate; I chose 72. The -T option is a template, and the notation %8:6[0:7]F means use %8.6F format to print uniformly distributed random numbers between 0 and 7. The awk script takes the data that is so generated and interpolates the noise (the number of atoms and a variant on the 'generated by VMD' line), as well as tagging the lines with the appropriate atomic symbol.

Processing the sample data

Given some data, you then need to munge it to get the required output. This script more or less does the job. There are endless ways it should be improved, of course, such as taking file names as command line arguments, using temporary file names instead of fixed names, cleaning up the intermediate files, different compounds, different atoms (nitrogen, phosphorous, etc), and so on. However, it should adapt reasonably easily.

input="data"
output="output"
n=$(sed 1q "$input")
n2=$(($n+2))

for ((i = 3; i <= n2; i++))
do
    colno=$(printf "%.2d" $(($i-2)))
    awk -v N=$n2 -v R=$i \
        '   BEGIN { name["C"] = "Carbon"; name["H"] = "Hydrogen"; name["O"] = "Oxygen";
                    R0 = R % N }
            NR > 2 && NR <= R { count[$1]++; }
            NR == R { printf "%-32.32s\n", name[$1] " " count[$1]; }
            NR % N == R0 { xyz = sprintf("%s %s %s", $2, $3, $4); printf "%-32.32s\n", xyz }
        ' "$input" > "column.$colno"
done

paste -d ' ' column.* > "$output"

The first four lines set up the control parameters, collecting the number of lines per unit of data from the input file, and adjusting things accordingly. The for loop iterates over offsets 3 to $n2 inclusive (skipping the two header lines), and runs the awk script. That encodes atom types (BEGIN), determines which atom it is processing this time (NR > 2 && NR <= R and NR == R), and then arranges to print the triplets of data for the relevant atom. The formatting is carefully organized so that the column headings and the actual xyz-triplets are uniformly spaced. These are written to a file column.$colno. When all's done, the column.* files are pasted to generate a single output file, which looks like this:

Carbon 1                         Carbon 2                         Carbon 3                         Carbon 4                         Carbon 5                         Carbon 6                         Carbon 7                         Carbon 8                         Carbon 9                         Oxygen 1                         Oxygen 2                         Hydrogen 1                       Hydrogen 2                       Hydrogen 3                       Oxygen 3                         Hydrogen 4                       Hydrogen 5                       Hydrogen 6                      
0.979485 -6.665347 0.575383      1.191999 -3.002386 2.859484      3.151517 -5.610077 0.429413      3.439828 -6.454984 1.319724      3.726201 -0.123038 2.096854      1.363325 -3.031238 0.016019      6.090283 -3.915340 2.396358      0.407755 -7.957784 -0.846842     0.203074 -0.796428 2.659573      2.600610 -2.259674 -0.260378     4.773839 -6.765097 0.588508      2.743424 -2.890016 2.906452      2.810233 -6.641054 -0.797672     6.854169 -3.191721 -0.925670     2.914233 -1.060001 0.776983      3.803923 -1.497032 2.908799      5.669443 -7.227666 -0.647552     0.092455 -5.850637 2.959987     
6.042840 -7.254720 2.093573      2.551942 -6.044322 2.061072      3.523150 -6.167163 2.451689      5.197316 -3.429866 -0.412062     2.548777 -6.422851 1.282846      3.775197 -2.012031 1.377440      3.405112 -3.206415 -0.879886     1.448359 -5.419629 0.467291      3.661964 -2.789234 2.644294      4.214854 -2.439574 -0.951704     5.297609 -2.320418 2.709898      2.653940 -4.431080 -0.511743     5.040635 -0.676199 -0.590970     1.546725 -1.294582 2.562937      4.231461 -7.180908 1.629901      3.297836 -1.557133 -0.133280     3.442481 -4.489962 2.111930      1.423611 -7.982655 0.715618     
1.432495 -7.686243 2.525734      5.038409 -4.976270 2.826846      6.184137 -7.303094 2.711561      3.208125 -0.606556 1.978725      2.171859 -6.792060 0.678988      6.521124 -5.622797 -0.773797     1.725619 -5.768633 -0.223397     3.602427 -2.325680 1.762008      1.937521 -1.686895 1.743159      0.745526 -0.114246 -0.949490     4.754360 -6.531145 1.998913      1.114732 -1.158810 1.486939      6.410490 -5.411647 0.062737      4.164330 -6.743763 1.802804      2.587841 -3.979700 2.609748      2.192073 -2.815376 -0.809569     5.501795 -2.326438 1.325829      3.285032 -1.212541 1.284453     
3.564424 -3.117406 -0.032879     2.894745 -0.632591 0.532311      3.384916 -5.383135 1.179585      0.793488 -0.894539 -0.886891     1.348785 -6.501867 1.648604      2.189941 -2.438067 0.616090      2.043378 -4.966472 0.691603      3.124161 -5.792896 0.545362      5.741472 -0.640590 2.825374      0.300550 -7.149663 0.942726      1.344387 -0.121382 2.169401      4.963296 -0.964665 -0.230523     6.651423 -4.905053 2.509626      5.059694 -6.166516 0.102255      5.046864 -3.288883 0.853948      2.389007 -3.057664 1.806301      2.365876 -0.956860 1.458959      2.892502 -0.097422 -0.531714

Your task is to understand why all the bits of the awk script are present. For example, why is R0 needed (hint, experiment without the R0 calculation, and use R in its place).

How to awk every nth line starting from different lines each iteration

1 Answers1

Generating sample data

Processing the sample data