5

I have seen here and here how to return every nth row, but my problem is different. A separate column in the file specifies which nth row to return, and that value differs by group. Here is a sample of the dataset, where the Nth column gives the step: every 3rd row for Id group a and every 4th row for Id group b. The data is quite sizable, with several Id groups.

Id  TagNo   Nth
a   A-A-3   3
a   A-A-1   3
a   A-A-5   3
a   A-A-2   3
a   AX-45   3
a   AX-33   3
b   B-B-5   4
b   B-B-4   4
b   B-B-3   4
b   BX-B2   4 

Desired output:

Id  TagNo   Nth
 a  A-A-3   3
 a  A-A-2   3
 b  B-B-5   4

Thank you for your help.

Edit: Please note that I want to pick the first row of each group and then every nth row after it, that is, every 3rd for a and every 4th for b. For group a that means the 1st, 4th, 7th... rows; for group b, the 1st, 5th, 9th... rows. The original desired output had an error and has been corrected. My sincere apologies.
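
The rule in the edit can be restated as "within each Id group, take positions 1, 1+n, 1+2n, ...". A minimal Python sketch of that rule, assuming the rows of each Id are contiguous and Nth is constant within a group (the function name `every_nth_per_group` and the tuple layout are only illustrative):

```python
from itertools import groupby

def every_nth_per_group(rows):
    """rows: iterable of (id, tag, nth) tuples, with each Id's rows contiguous."""
    selected = []
    for _, group in groupby(rows, key=lambda r: r[0]):
        group = list(group)
        step = int(group[0][2])          # Nth is assumed constant within a group
        selected.extend(group[::step])   # 1st, (1+n)th, (1+2n)th, ... row of the group
    return selected

rows = [
    ("a", "A-A-3", 3), ("a", "A-A-1", 3), ("a", "A-A-5", 3),
    ("a", "A-A-2", 3), ("a", "AX-45", 3), ("a", "AX-33", 3),
    ("b", "B-B-5", 4), ("b", "B-B-4", 4), ("b", "B-B-3", 4),
    ("b", "BX-B2", 4),
]
for row in every_nth_per_group(rows):
    print(*row)
# a A-A-3 3
# a A-A-2 3
# b B-B-5 4
```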

pacholik
deepseefan

8 Answers

6

This awk should work:

awk '!a[$1]++{print; if(NR>1) n=NR+$3} NR==n{print; n=NR+$3}' file

Id  TagNo   Nth
a   A-A-3   3
a   A-A-2   3
b   B-B-5   4
anubhava
3

An awk solution, written as a script file:

$ cat awk-sc
{
  if(id==$1){
    nth--;
    if(nth==0){print; nth=$3}
  } else {
    id=$1;nth=$3;print
  }
}

$ awk -f awk-sc file
Id  TagNo   Nth
a   A-A-3   3
a   A-A-2   3
b   B-B-5   4
CWLiu
3

Base R solution:

do.call(rbind, lapply(split(df, df$Id), function(x) x[seq(from = 1, to = nrow(x), by = unique(x$Nth)), ]))

    Id TagNo Nth
a.1  a A-A-3   3
a.4  a A-A-2   3
b    b B-B-5   4
LAP
2

Python solution.

from __future__ import print_function

with open('file.csv') as f:
    print(*next(f).split())    # header

    lastid = None
    lineno = 0
    for line in f:
        id_, tagno, nth = line.split()

        if lastid != id_:
            lineno = 0

        if lineno % int(nth) == 0:
            print(id_, tagno, nth)

        lastid = id_
        lineno += 1
pacholik
  • Thank you but it gave me `ValueError: not enough values to unpack (expected 3, got 1)` error. Will check other solutions first and get back to trace the error. – deepseefan Oct 17 '17 at 08:32
  • @deepseefan Oh, so the file is really not *comma separated*. Edited. – pacholik Oct 17 '17 at 08:37
  • Had both versions; for the comma separated it gave me a syntax error on line `print(*next(reader))`. – deepseefan Oct 17 '17 at 08:44
  • @deepseefan You should use Python 3 or at very least the `__future__` module. – pacholik Oct 17 '17 at 08:53
  • Usually I ran the code with `python3`. Now it works; but could you please edit the line `if lineno % (int(nth) - 1) == 0:` to `if lineno % (int(nth)) == 0:` so the output will be as specified in the edit. You are correct it produces the output as specified before edit is made. – deepseefan Oct 17 '17 at 09:03
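
The comments above turn on whether the file is comma-separated or whitespace-separated. As a rough sketch only, a small helper can accept either form; `split_fields` is a hypothetical name, and the approach assumes no field contains spaces or quoted commas:

```python
import re

def split_fields(line):
    """Split one row on commas and/or runs of whitespace, dropping empty fields."""
    return [field for field in re.split(r"[,\s]+", line.strip()) if field]

print(split_fields("a,A-A-3,3"))        # ['a', 'A-A-3', '3']
print(split_fields("a   A-A-3   3\n"))  # ['a', 'A-A-3', '3']
```

For real CSV input with quoting, the standard `csv` module is the safer choice; this helper only covers simple data like the sample.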
2

Here is a base R solution.
First, the data. I assume you read it in with dat <- read.csv("file.csv").

dat <-
structure(list(Id = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L), .Label = c("a", "b"), class = "factor"), TagNo = structure(c(3L, 
1L, 4L, 2L, 6L, 5L, 9L, 8L, 7L, 10L), .Label = c("A-A-1", "A-A-2", 
"A-A-3", "A-A-5", "AX-33", "AX-45", "B-B-3", "B-B-4", "B-B-5", 
"BX-B2"), class = "factor"), Nth = c(3L, 3L, 3L, 3L, 3L, 3L, 
4L, 4L, 4L, 4L)), .Names = c("Id", "TagNo", "Nth"), class = "data.frame", row.names = c(NA, 
-10L))

Now the R code.

dat2 <- do.call(rbind, lapply(split(dat, dat$Id), function(x)
            x[seq(1, nrow(x), by = x[1, "Nth"]), ]))
row.names(dat2) <- NULL
dat2
#  Id TagNo Nth
#1  a A-A-3   3
#2  a A-A-2   3
#3  b B-B-5   4
Rui Barradas
2

awk one-liner

$ awk 'a!=$1{a=$1; n=$3; k=-1} FNR>1 && ++k%n!=0{next} 1' f1
Id  TagNo   Nth
a   A-A-3   3
a   A-A-2   3
b   B-B-5   4

a!=$1{a=$1; n=$3; k=-1}: a keeps track of the first field/column. If a is uninitialized, or the first column differs from the previous line's, the condition is true and the block sets a, n, and k=-1.

FNR>1 && ++k%n!=0{next}: for every line after the header, increment k; if the remainder of k divided by n is not zero, the line is not an nth record, so skip it. Otherwise it is an nth record, and the trailing 1 prints it.

A longer version that may be easier to follow:

$ awk 'FNR==1{print; next;}  a!=$1{a=$1; n=$3; k=0; print; next} ++k%n==0{print}' f1
Id  TagNo   Nth
a   A-A-3   3
a   A-A-2   3
b   B-B-5   4

FNR==1{print; next;}: print the header and move on to the next line.

a!=$1{a=$1; n=$3; k=0; print; next}: a keeps track of the first field/column. If a is uninitialized, or the first column differs from the previous line's, set a, n, and k=0, print the line (the first row of the group), and move on.

++k%n==0{print}: increment k with each new record; if the remainder of k divided by n is zero, it is an nth record, so print it.
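
The counter logic above relies on all rows of an Id being contiguous. As a sketch of the same idea in Python, keeping one counter per Id in a dictionary means interleaved groups are handled as well (the name `pick_rows` and the interleaved sample are assumptions for illustration, not part of the question):

```python
from collections import defaultdict

def pick_rows(lines):
    """Yield the header, then every Nth row per Id, starting at each Id's first row."""
    lines = iter(lines)
    header = next(lines, None)
    if header is None:                  # empty input: nothing to yield
        return
    yield header.rstrip("\n")           # header passes through unchanged
    seen = defaultdict(int)             # number of rows seen so far for each Id
    for line in lines:
        fields = line.split()
        if not fields:                  # ignore blank lines
            continue
        id_, nth = fields[0], int(fields[-1])
        if seen[id_] % nth == 0:        # rows 0, n, 2n, ... of this Id (0-based)
            yield line.rstrip("\n")
        seen[id_] += 1

sample = [
    "Id TagNo Nth",
    "a A-A-3 3",
    "b B-B-5 4",
    "a A-A-1 3",
    "a A-A-5 3",
    "b B-B-4 4",
    "a A-A-2 3",
]
for row in pick_rows(sample):
    print(row)
# Id TagNo Nth
# a A-A-3 3
# b B-B-5 4
# a A-A-2 3
```

The trade-off is memory: one counter per distinct Id, which is small compared with the data itself.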

Rahul Verma
2

Using data.table

library(data.table)

df <- data.table(read.table(text = "Id  TagNo   Nth
a   A-A-3   3
a   A-A-1   3
a   A-A-5   3
a   A-A-2   3
a   AX-45   3
a   AX-33   3
b   B-B-5   4
b   B-B-4   4
b   B-B-3   4
b   BX-B2   4", header = T))

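# Number the rows within each Id group, starting at 1
df[, id := seq_len(.N), by = Id]
# Keep rows 1, 1 + Nth, 1 + 2*Nth, ... of each group (also works when Nth is 1)
df[(id - 1) %% Nth == 0, .(Id, TagNo, Nth)]

   Id TagNo Nth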
1:  a A-A-3   3
2:  a A-A-2   3
3:  b B-B-5   4
Hardik Gupta
0

Python solution:

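with open('YOURFILENAME', 'r') as f:
    next(f, None)                 # skip the header line in the file
    print('Id  TagNo   Nth')      # the header is fixed, so print it directly
    last_id, i = None, 0
    for line in f:
        fields = line.split()
        if not fields:            # ignore blank lines
            continue
        if fields[0] != last_id:  # new Id group: restart the countdown
            last_id, i = fields[0], 0
        if not i:
            print(line, end='')
            i = int(fields[-1])   # wait Nth rows before printing again
        i -= 1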

You can change print() to write() or any other function you want. Since the header is fixed, the code prints it separately instead of parsing it from the file.

Diyi Wang
  • Thank you; but your solution is incomplete; and it has an error `ValueError: invalid literal for int() with base 10: 'Id,TagNo,Nth'`. What it means is discussed [here](https://stackoverflow.com/questions/30903967/invalid-literal-for-int-with-base-10-what-does-this-actually-mean). In other words, `int` doesn't know how to convert the argument passed via `line.split()[-1]` – deepseefan Oct 17 '17 at 11:24
  • @deepseefan Good to know. But I think the problem is you didn’t skip the first line since the last element in the header is “Nth” instead of a number. Cuz the header is fixed, I didn’t handle that in my code. – Diyi Wang Oct 17 '17 at 15:58