Take every nth row from a file with groups, where n is given in a column

Question

I have seen here and here how to return every nth row, but my problem is different. A separate column in the file specifies which nth row to return, and that value differs by group. Here is a sample of the dataset, where the Nth column gives the step: every 3rd row for Id group a and every 4th row for Id group b. The actual data is quite large, with several Id groups.

Id  TagNo   Nth
a   A-A-3   3
a   A-A-1   3
a   A-A-5   3
a   A-A-2   3
a   AX-45   3
a   AX-33   3
b   B-B-5   4
b   B-B-4   4
b   B-B-3   4
b   BX-B2   4

Desired output:

Id  TagNo   Nth
 a  A-A-3   3
 a  A-A-2   3
 b  B-B-5   4

Thank you for your help.

Edit: Please note that I want to pick the first row of each group and then every nth row after it, where n is 3 for a and 4 for b. For group a that is the 1st, 4th, 7th, ... rows; for group b it is the 1st, 5th, 9th, ... rows. The output originally posted contained an error and has been corrected. My sincere apologies.
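To make the rule concrete, here is a small Python sketch that lists the 1-based positions kept within one group. The helper name `picked_positions` is made up for illustration and is not part of any library.

```python
def picked_positions(group_size, n):
    """Return the 1-based row positions kept within one group: 1, 1+n, 1+2n, ..."""
    return list(range(1, group_size + 1, n))


# Group a has 6 rows with Nth = 3; group b has 4 rows with Nth = 4.
print(picked_positions(6, 3))   # [1, 4]
print(picked_positions(4, 4))   # [1]
# A hypothetical larger group of 9 rows with Nth = 4.
print(picked_positions(9, 4))   # [1, 5, 9]
```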

Must be a typo, the idea is to return every `3rd` for `a` and every `4th` for `b`. Will do an edit. — deepseefan, Oct 17 '17 at 07:41
Yes, I want to start picking from the first and then every `nth`. — deepseefan, Oct 17 '17 at 07:43
Thank you and an edit is submitted. My sincere apologies; my head might need a recharge ;) — deepseefan, Oct 17 '17 at 07:55

anubhava · Answer 1 · 2017-10-17T08:02:46.110

6

This awk should work:

awk '!a[$1]++{print; if(NR>1) n=NR+$3} NR==n{print; n=NR+$3}' file

Id  TagNo   Nth
a   A-A-3   3
a   A-A-2   3
b   B-B-5   4
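For readers who find the one-liner terse, the following Python sketch mirrors its logic: remember the line number of the next row to print, and reset it whenever a new Id appears. It is an illustration only. The function name `every_nth_by_target` and the inline sample are assumptions made for the example, and it assumes whitespace-separated fields with Nth of at least 1.

```python
def every_nth_by_target(lines):
    """Yield the header, each group's first row, and every Nth row after it."""
    seen = set()      # Ids already encountered (the awk array a)
    target = None     # line number of the next row to print (the awk variable n)
    for nr, line in enumerate(lines, start=1):
        fields = line.split()
        if fields[0] not in seen:
            # First line with this Id: always print it.
            seen.add(fields[0])
            yield line
            if nr > 1:                        # the header has no numeric Nth
                target = nr + int(fields[2])
        elif nr == target:
            # Reached the scheduled line: print and schedule the next one.
            yield line
            target = nr + int(fields[2])


sample = """Id  TagNo   Nth
a   A-A-3   3
a   A-A-1   3
a   A-A-5   3
a   A-A-2   3
a   AX-45   3
a   AX-33   3
b   B-B-5   4
b   B-B-4   4
b   B-B-3   4
b   BX-B2   4""".splitlines()

for row in every_nth_by_target(sample):
    print(row)
```

Like the awk command, this relies on the rows of each Id being contiguous in the file.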

edited Oct 17 '17 at 08:02

answered Oct 17 '17 at 07:46


CWLiu · Answer 2 · 2017-10-17T08:13:39.217

3

For the awk solution,

$ cat awk-sc
{
  if(id==$1){
    nth--;
    if(nth==0){print; nth=$3}
  } else {
    id=$1;nth=$3;print
  }
}

$ awk -f awk-sc file
Id  TagNo   Nth
a   A-A-3   3
a   A-A-2   3
b   B-B-5   4

edited Oct 17 '17 at 08:13

answered Oct 17 '17 at 07:46


score 3 · Accepted Answer · answered Oct 17 '17 at 08:01

3

Base R solution:

do.call(rbind, lapply(split(df, df$Id), function(x) x[seq(from = 1, to = nrow(x), by = unique(x$Nth)), ]))

    Id TagNo Nth
a.1  a A-A-3   3
a.4  a A-A-2   3
b    b B-B-5   4
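The same split, subset, recombine idea can be sketched in Python with the standard library. This is an illustration under stated assumptions rather than a translation of the R call: the function name `every_nth_by_group` is hypothetical, rows are plain tuples, and unlike `split()` in R, `itertools.groupby` only groups consecutive rows, so each Id's rows must be contiguous.

```python
from itertools import groupby


def every_nth_by_group(rows):
    """rows: list of (Id, TagNo, Nth) tuples with each Id's rows contiguous."""
    picked = []
    for _, group in groupby(rows, key=lambda r: r[0]):
        group = list(group)
        step = int(group[0][2])        # Nth is constant within a group
        picked.extend(group[::step])   # rows 1, 1+n, 1+2n, ... of the group
    return picked


rows = [
    ("a", "A-A-3", 3), ("a", "A-A-1", 3), ("a", "A-A-5", 3),
    ("a", "A-A-2", 3), ("a", "AX-45", 3), ("a", "AX-33", 3),
    ("b", "B-B-5", 4), ("b", "B-B-4", 4), ("b", "B-B-3", 4),
    ("b", "BX-B2", 4),
]

for row in every_nth_by_group(rows):
    print(*row)
```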

answered Oct 17 '17 at 08:01

LAP


This is really a good solution, Could you also try something in data.table – Hardik Gupta Oct 17 '17 at 08:05
Sadly, I'm not well versed in data.table. But there are some here (akrun, Sotos and others), who should be able to translate this into data.table. – LAP Oct 17 '17 at 08:08
I found the `data.table` solution :) – Hardik Gupta Oct 17 '17 at 09:30
Thanks @LAP. An elegant `base R` solution. – deepseefan Oct 17 '17 at 11:42

pacholik · Answer 4 · 2017-10-17T09:16:11.477

2

Python solution.

from __future__ import print_function

with open('file.csv') as f:
    print(*next(f).split())    # header

    lastid = None
    lineno = 0
    for line in f:
        id_, tagno, nth = line.split()

        if lastid != id_:
            lineno = 0

        if lineno % int(nth) == 0:
            print(id_, tagno, nth)

        lastid = id_
        lineno += 1

edited Oct 17 '17 at 09:16

answered Oct 17 '17 at 07:55


Thank you but it gave me `ValueError: not enough values to unpack (expected 3, got 1)` error. Will check other solutions first and get back to trace the error. – deepseefan Oct 17 '17 at 08:32
@deepseefan Oh, so the file is really not *comma separated*. Edited. – pacholik Oct 17 '17 at 08:37
Had both versions; for the comma separated it gave me a syntax error on line `print(*next(reader))`. – deepseefan Oct 17 '17 at 08:44
@deepseefan You should use Python 3 or at very least the `__future__` module. – pacholik Oct 17 '17 at 08:53
Usually I ran the code with `python3`. Now it works; but could you please edit the line `if lineno % (int(nth) - 1) == 0:` to `if lineno % (int(nth)) == 0:` so the output will be as specified in the edit. You are correct it produces the output as specified before edit is made. – deepseefan Oct 17 '17 at 09:03

score 2 · Answer 5 · answered Oct 17 '17 at 08:09

Here is a base R solution.
First, the data. I assume you read it in with dat <- read.csv("file.csv").

dat <-
structure(list(Id = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L), .Label = c("a", "b"), class = "factor"), TagNo = structure(c(3L, 
1L, 4L, 2L, 6L, 5L, 9L, 8L, 7L, 10L), .Label = c("A-A-1", "A-A-2", 
"A-A-3", "A-A-5", "AX-33", "AX-45", "B-B-3", "B-B-4", "B-B-5", 
"BX-B2"), class = "factor"), Nth = c(3L, 3L, 3L, 3L, 3L, 3L, 
4L, 4L, 4L, 4L)), .Names = c("Id", "TagNo", "Nth"), class = "data.frame", row.names = c(NA, 
-10L))

Now the R code.

dat2 <- do.call(rbind, lapply(split(dat, dat$Id), function(x)
            x[seq(1, nrow(x), by = x[1, "Nth"]), ]))   # rows 1, 1+n, 1+2n, ... of each Id group
row.names(dat2) <- NULL
dat2
#  Id TagNo Nth
#1  a A-A-3   3
#2  a A-A-2   3
#3  b B-B-5   4

Rahul Verma · Answer 6 · 2017-10-17T08:48:25.473

awk one-liner

$ awk 'a!=$1{a=$1; n=$3; k=-1} FNR>1 && ++k%n!=0{next} 1' f1
Id  TagNo   Nth
a   A-A-3   3
a   A-A-2   3
b   B-B-5   4

a!=$1{a=$1; n=$3; k=-1}: a keeps track of the first field (column). If a is uninitialized or the first column differs from the previous line's, this condition is true and the block sets a and n and resets k to -1.

FNR>1 && ++k%n!=0{next}: for every line after the header, increment k; if the remainder of k divided by n is not zero, the line is not an nth record, so skip it. Otherwise the line falls through to the final 1 and is printed.

The version below may be easier to follow:

$ awk 'FNR==1{print; next;}  a!=$1{a=$1; n=$3; k=0; print; next} ++k%n==0{print}' f1
Id  TagNo   Nth
a   A-A-3   3
a   A-A-2   3
b   B-B-5   4

FNR==1{print; next;}: print the header and move on to the next line.

a!=$1{a=$1; n=$3; k=0; print; next}: a keeps track of the first field (column). If a is uninitialized or the first column differs from the previous line's, set a and n, reset k to 0, print the line (the first row of the group), and move on to the next line.

++k%n==0{print}: increment k with each further record of the group and print the record when the remainder of k divided by n is zero, that is, when it is an nth record.
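To see how k evolves, here is a small Python sketch that traces the counter of the second version row by row. It is only an illustration: the function name `trace_counter` is made up, and the input is reduced to (Id, Nth) pairs without the header.

```python
def trace_counter(rows):
    """rows: (Id, Nth) pairs without the header. Return (Id, k, printed) per row."""
    trace = []
    current, n, k = None, 0, 0
    for id_, nth in rows:
        if id_ != current:
            # First row of a group: reset the counter and print the row.
            current, n, k = id_, nth, 0
            trace.append((id_, k, True))
            continue
        k += 1                                  # awk: ++k
        trace.append((id_, k, k % n == 0))      # printed when k is a multiple of n
    return trace


# Six rows of group a (Nth = 3) followed by four rows of group b (Nth = 4).
rows = [("a", 3)] * 6 + [("b", 4)] * 4
for entry in trace_counter(rows):
    print(entry)
```

The rows flagged True are the 1st and 4th of group a and the 1st of group b, which matches the awk output above.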

score 2 · Answer 7 · answered Oct 17 '17 at 09:25

Using data.table

library(data.table)

df <- data.table(read.table(text = "Id  TagNo   Nth
a   A-A-3   3
a   A-A-1   3
a   A-A-5   3
a   A-A-2   3
a   AX-45   3
a   AX-33   3
b   B-B-5   4
b   B-B-4   4
b   B-B-3   4
b   BX-B2   4", header = T))

df[, id := seq_len(.N), by = Id]              # row number within each Id group
df[(id - 1) %% Nth == 0, .(Id, TagNo, Nth)]   # keep rows 1, 1+n, 1+2n, ... of each group

  Id TagNo Nth
1:  a A-A-3   3
2:  a A-A-2   3
3:  b B-B-5   4

Diyi Wang · Answer 8 · 2017-10-17T16:02:50.267

0

Python solution:

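with open('YOURFILENAME', 'r') as f:
    print('Id  TagNo   Nth')        # the header is fixed, so print it directly
    next(f)                         # skip the header line in the file
    last_id, i = None, 0
    for line in f:
        fields = line.split()
        if fields[0] != last_id:    # new Id group: restart the countdown
            last_id, i = fields[0], 0
        if not i:
            print(line, end='')
            i = int(fields[-1])     # print again after Nth rows
        i -= 1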

You can change the print() to write() or any other functions you want. Since the header is fixed, I didn't include it in my code.

Update: Print the header separately.

edited Oct 17 '17 at 16:02

answered Oct 17 '17 at 08:19


Thank you; but your solution is incomplete; and it has an error `ValueError: invalid literal for int() with base 10: 'Id,TagNo,Nth'`. What it means is discussed [here](https://stackoverflow.com/questions/30903967/invalid-literal-for-int-with-base-10-what-does-this-actually-mean). In other words, `int` doesn't know how to convert the argument passed via `line.split()[-1]` – deepseefan Oct 17 '17 at 11:24
@deepseefan Good to know. But I think the problem is you didn’t skip the first line since the last element in the header is “Nth” instead of a number. Cuz the header is fixed, I didn’t handle that in my code. – Diyi Wang Oct 17 '17 at 15:58
