1

I want to repeat a particular query on a large dataset and I am sure the answer to my question is quite basic, but after reading various sources on 'for' loops, repeat and replicate functions for about 2 hours, I still can't find any examples which appear to do what I need to do.

The dataset contains survey data from particular sites which are split into plots and each plot contains multiple species entries so the data looks like this:

SITE    PLOT    SPECIES
1          1    a
1          1    b
1          2    a
1          2    c
1          3    b
1          3    c
1          3    d
1          4    a
1          5    a
1          5    b
2          1    b
2          1    c
2          3    a
2          3    b
2          4    b
2          4    c
2          4    d
2          5    e

The actual data is over 6500 rows as there are hundreds of sites and each should contain 20 plots - the issue is some plots are missing from some sites, so what I need to do is establish how many plots are missing in total. I can use the following code to query how many unique plots are on each site so in the example below I query how many unique plots are in site number 7:

NROW(unique(df$PLOT[df$SITE=="7"]))

   [20]

But I have hundreds of sites, so is there a function that will allow me to query each site automatically without manually changing the site number each time?

user197410
  • 31
  • 4
  • Can you please provide data with `NA`'s (missings)? Please use `dput(df[10:25, ])`, for the index 10:25 choose rows where at least one missing value occurs. – Pax May 26 '22 at 07:38
  • `with(df, tapply(PLOT, SITE, \(x) length(unique(x))))`. – Rui Barradas May 26 '22 at 07:52
  • Hi - the missing plots do not appear as NA's, they are simply not in the data at all - there are no NA's in the data. – user197410 May 26 '22 at 08:02
  • Hi Rui Barradas - brilliant, thank you - of the three solutions given, this is defnitely the most efficient, many thanks! – user197410 May 26 '22 at 08:32

2 Answers2

1

Here is a base R way with tapply.

x <- '
SITE    PLOT    SPECIES
1          1    a
1          1    b
1          2    a
1          2    c
1          3    b
1          3    c
1          3    d
1          4    a
1          5    a
1          5    b
2          1    b
2          1    c
2          3    a
2          3    b
2          4    b
2          4    c
2          4    d
2          5    e'
df1 <- read.table(textConnection(x), header = TRUE)

num_plots <- with(df1, tapply(PLOT, SITE, \(x) length(unique(x))))
which(num_plots != max(num_plots))
#> 2 
#> 2

Created on 2022-05-26 by the reprex package (v2.0.1)

Rui Barradas
  • 70,273
  • 8
  • 34
  • 66
  • Thank you - slightly diffferent to the above, giving the site numbers which are missing plots, very helpful thanks. – user197410 May 26 '22 at 08:31
0

Not quite sure what you're going for but does this help?

Using data.table:

df <- read.table(text='SITE    PLOT    SPECIES
1          1    a
1          1    b
1          2    a
1          2    c
1          3    b
1          3    c
1          3    d
1          4    a
1          5    a
1          5    b
2          1    b
2          1    c
2          3    a
2          3    b
2          4    b
2          4    c
2          4    d
2          5    e', header=TRUE)
library(data.table)
setDT(df)[, .(plots=uniqueN(PLOT)), by=.(SITE)]
##    SITE plots
## 1:    1     5
## 2:    2     4
jlhoward
  • 58,004
  • 7
  • 97
  • 140
  • Thanks so much - this produces a table with site number and number of plots they contain, excellent, thank you – user197410 May 26 '22 at 08:30