6

Does anybody know what is the best R alternative to SAS first. or last. operators? I did find none.

SAS has the FIRST. and LAST. automatic variables, which identify the first and last record amongst a group with the same value with a particular variable; so in the following dataset FIRST.model and LAST.model are defined:

Model,SaleID,First.Model,Last.Model
Explorer,1,1,0
Explorer,2,0,0
Explorer,3,0,0
Explorer,4,0,1
Civic,5,1,0
Civic,6,0,0
Civic,7,0,1
Joe
  • 62,789
  • 6
  • 49
  • 67
Giorgio Spedicato
  • 2,413
  • 3
  • 31
  • 45
  • I have no access to SAS - what is .first or .last doing? Can you add an example? – EDi Dec 07 '12 at 15:29
  • `FIRST.` and `LAST.` are not operators; they are automatic SAS data step variables defined to indicate column value changes during `BY` statement processing. – BellevueBob Dec 07 '12 at 15:36
  • 1
    I don't think. but this link seems to have the answer. https://stat.ethz.ch/pipermail/r-help/2010-November/260997.html – agstudy Dec 07 '12 at 15:45
  • Since not many of us know SAS, if you can explain what you would like to do, it might get an answer more quickly. – Ricardo Saporta Dec 07 '12 at 18:28
  • there's probably a simple solution with `diff()`, too ... – Ben Bolker Dec 07 '12 at 22:41

6 Answers6

9

It sounds like you're looking for !duplicated, with the fromLast argument being FALSE or TRUE.

d <- datasets::Puromycin

d$state
# [1] treated   treated   treated   treated   treated   treated   treated  
# [8] treated   treated   treated   treated   treated   untreated untreated
#[15] untreated untreated untreated untreated untreated untreated untreated
#[22] untreated untreated
#Levels: treated untreated
!duplicated(d$state)
# [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#[13]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
!duplicated(d$state,fromLast=TRUE)
# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
#[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

There are some caveats and edge-case behaviors to this function, which you can find out through the help files (?duplicated).

Blue Magister
  • 13,044
  • 5
  • 38
  • 56
4

Update (to read first)

If you really are interested only in the row indexes, perhaps some straightforward use of split and range would be of use. The following assumes that the rownames in your dataset are sequentially numbered, but adaptations would probably also be possible.

irisFirstLast <- sapply(split(iris, iris$Species), 
                        function(x) range(as.numeric(rownames(x))))
irisFirstLast              ## Just the indices
#      setosa versicolor virginica
# [1,]      1         51       101
# [2,]     50        100       150
iris[irisFirstLast[1, ], ] ## `1` would represent "first"
#     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
# 1            5.1         3.5          1.4         0.2     setosa
# 51           7.0         3.2          4.7         1.4 versicolor
# 101          6.3         3.3          6.0         2.5  virginica
iris[irisFirstLast, ]      ## nothing would represent both first and last
#     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
# 1            5.1         3.5          1.4         0.2     setosa
# 50           5.0         3.3          1.4         0.2     setosa
# 51           7.0         3.2          4.7         1.4 versicolor
# 100          5.7         2.8          4.1         1.3 versicolor
# 101          6.3         3.3          6.0         2.5  virginica
# 150          5.9         3.0          5.1         1.8  virginica

d <- datasets::Puromycin   
dFirstLast <- sapply(split(d, d$state), 
                     function(x) range(as.numeric(rownames(x))))
dFirstLast
#      treated untreated
# [1,]       1        13
# [2,]      12        23
d[dFirstLast[2, ], ]       ## `2` would represent `last`
#    conc rate     state
# 12  1.1  200   treated
# 23  1.1  160 untreated

If working with named rows, the general approach is the same, but you have to specify the range yourself. Here's the general pattern:

datasetFirstLast <- sapply(split(dataset, dataset$groupingvariable), 
                           function(x) c(rownames(x)[1], 
                                         rownames(x)[length(rownames(x))]))

Initial answer (edited)

If you're interested in extracting the rows rather than needing the row number for other purposes, you can also explore data.table. Here are some examples:

library(data.table)
DT <- data.table(iris, key="Species")
DT[J(unique(Species)), mult = "first"]
#       Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1:     setosa          5.1         3.5          1.4         0.2
# 2: versicolor          7.0         3.2          4.7         1.4
# 3:  virginica          6.3         3.3          6.0         2.5
DT[J(unique(Species)), mult = "last"]
#       Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1:     setosa          5.0         3.3          1.4         0.2
# 2: versicolor          5.7         2.8          4.1         1.3
# 3:  virginica          5.9         3.0          5.1         1.8
DT[, .SD[c(1,.N)], by=Species]
#       Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1:     setosa          5.1         3.5          1.4         0.2
# 2:     setosa          5.0         3.3          1.4         0.2
# 3: versicolor          7.0         3.2          4.7         1.4
# 4: versicolor          5.7         2.8          4.1         1.3
# 5:  virginica          6.3         3.3          6.0         2.5
# 6:  virginica          5.9         3.0          5.1         1.8

This last approach is pretty convenient. For instance, if you wanted the first three rows and last three rows of each group, you can use: DT[, .SD[c(1:3, (.N-2):.N)], by=Species] (Just for reference: .N represents the number of cases per group.

Other useful approaches include:

DT[, tail(.SD, 2), by = Species] ## last two rows of each group
DT[, head(.SD, 4), by = Species] ## first four rows of each group
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
  • 1
    I think the answer here is really just to use `by`. My understanding of how the `first.` and `last.` is used in practice is to set up procedures that work on the by group subsets. – mnel Dec 09 '12 at 12:10
  • @mnel, Never used SAS, and sort of hastily read the post. The `data.table` options were the first thing that came to mind because I've been exploring the package recently. Thanks for pointing this out to me. I've updated with something that might be more relevant, but still not sure exactly how `first.` and `last.` are used in practice. – A5C1D2H2I1M1N2O1R2T1 Dec 09 '12 at 16:57
4

The head and tail function with an n=1 option combined with by are a good way to go. See R for SAS and SPss Users** (Robert Muenchen) Make a data frame with by variables of interest i.e for last.

dfby<- data.frame(df$var1, df$var2)
mylastList<-by(df,dfby,tail, n=1)
#turn into a dataframe
mylastDF<-do.call(rbind,mylastList)
3

Here is a dplyr solution:

# input
dataset <- structure(list(Model = structure(c(2L, 2L, 2L, 2L, 1L, 1L, 1L
), .Label = c("Civic", "Explorer"), class = "factor"), SaleID = 1:7), .Names = c("Model", 
"SaleID"), class = "data.frame", row.names = c(NA, -7L))


# code 
library(dplyr)

dataset %>% 

  group_by(Model) %>%

  mutate(
          "First"        = row_number() == min( row_number() ),
          "Last"         = row_number() == max( row_number() )
  )

# output:

     Model SaleID First  Last
    <fctr>  <int> <lgl> <lgl>
1 Explorer      1  TRUE FALSE
2 Explorer      2 FALSE FALSE
3 Explorer      3 FALSE FALSE
4 Explorer      4 FALSE  TRUE
5    Civic      5  TRUE FALSE
6    Civic      6 FALSE FALSE
7    Civic      7 FALSE  TRUE

PS: If you don't have dplyr installed run:

install.packages("dplyr")
Rasmus Larsen
  • 5,721
  • 8
  • 47
  • 79
1

The function below is based on @Joe's description of First / Last.
The function returns a list of vectors.

Each list entry corresponds to the columns of the dataframe (ie the features or variables of the data set)
Then, within a given list entry, there is the index that pertains to the First (or last) element for every observation category.

EXAMPLE USAGE:

# Pass in your data frame, and indicate whether or not you want to find Last or find First. 
# Assign to the appropriate variable
first <- findFirstLast(myDF)
last  <- findFirstLast(myDF, findFirst=FALSE)

Example using data(iris)

data(iris)
first <- findFirstLast(iris)
last  <- findFirstLast(iris, findFirst=FALSE)

which observation for each Species:

 first$Species
 #    setosa versicolor  virginica 
 #        1         51        101 

 last$Species
 #    setosa versicolor  virginica 
 #        50        100        150 

Grab the whole row for each first observation of a sepcies

iris[first$Species, ]
#      Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#  1            5.1         3.5          1.4         0.2     setosa
#  51           7.0         3.2          4.7         1.4 versicolor
#  101          6.3         3.3          6.0         2.5  virginica




CODE FOR FUNCTION findFirstLast():

  findFirstLast <- function(myDF, findFirst=TRUE) {
  # myDF should be a data frame or matrix 

    # By default, this function finds the first occurence of each unique value in a column
    # If instead we want to find last, set findFirst to FALSE.  This will give `maxOrMin` a value of -1
    #    finding the min of the negative indecies is the same as finding the max of the positive indecies. 
    maxOrMin <- ifelse(findFirst, 1, -1) 


    # For each column in myDF, make a list of all unique values (`levs`) and iterate over that list, 
    #   finding the min (or max) of all the indicies of where that given value appears within the column  
    apply(myDF, 2, function(colm) {
        levs <- unique(colm)
        sapply(levs, function(lev) {
          inds <- which(colm==lev)
          ifelse(length(inds)==0, NA, maxOrMin*min(inds*maxOrMin) ) 
        })   
      })
  }
Ricardo Saporta
  • 54,400
  • 17
  • 144
  • 178
0

You can try this function, it can create the first and last flag and deal with NA like sas.

R dplyr::arrange: NA are always sorted to the end for local data, even when wrapped with dplyr::desc().

SAS PROC SORT: a missing value for a numeric variable is smaller than all numbers. Missing values of character variables are smaller than any printable character value.

The function use to sort data like SAS that the missing value is smallest, and create variable first and last for each sort variable.

library(dplyr, warn.conflicts = FALSE)

#' Sort data rows and create first and last like SAS
#'
#' @param data input data
#' @param ... variables for sort
#' @param first_last logical value, whether or not create the first and last variables
#' @param first_prefix character string of prefix for creating first variable
#' @param last_prefix character string of prefix for creating last variable
sas_sort <- function(data, ...,
                     first_last = TRUE,
                     first_prefix = ".first.",
                     last_prefix = ".last.") {
  stopifnot(!missing(data), is.data.frame(data))
  
  if (dplyr::is.grouped_df(data)) {
    message(
      "the input data is grouped by `",
      dplyr::group_vars(data),
      "`, and wil be ungrouped "
    )
    data <- data %>% dplyr::ungroup()
  }
  
  dots <- rlang::enexprs(...)
  if (length(dots) == 0) {
    stop("argumetn `...` is empty")
  }
  sort <- vector("list")
  for (i in seq_along(dots)) {
    dot <- dots[[i]]
    dot_str <- deparse(dot)
    if (stringr::str_detect(dot_str, "((dplyr::)?desc\\()(.+)(\\))")) {
      sort <- append(sort, dot)
    } else {
      sort <- append(sort, rlang::parse_expr(paste0("!is.na(", dot_str, ")")))
      sort <- append(sort, dot)
    }
  }
  
  data <- data %>% dplyr::arrange(!!!sort)
  
  if (first_last) {
    for (i in seq_along(dots)) {
      dot <- dots[[i]]
      dot_str <- deparse(dot)
      if (stringr::str_detect(dot_str, "((dplyr::)?desc\\()(.+)(\\))")) {
        dot_str <- stringr::str_extract(dot_str, "((dplyr::)?desc\\()(.+)(\\))", group = 3)
        dot <- rlang::sym(dot_str)
      }
      
      first <- paste0(first_prefix, dot_str)
      last <- paste0(last_prefix, dot_str)
      
      data <- data %>%
        dplyr::group_by(!!dot, .add = TRUE) %>%
        dplyr::arrange(!!!sort) %>%
        dplyr::mutate(
          !!first := dplyr::row_number() == 1L,
          !!last := dplyr::row_number() == dplyr::n()
        )
    }
  }
  
  data %>% dplyr::ungroup()
}


# this data is from SAS Programmer’s Guide: Essentials
# FIRST. and LAST. DATA Step Variables
# Example 1: Grouping Observations by State, City, and ZIP Code

zip <- tibble::tribble(
  ~State,      ~City, ~ZipCode,      ~Street,
  "AZ",   "Tucson",   85730L, "Domenic Ln",
  "AZ",   "Tucson",   85730L, "Gleeson Pl",
  "FL", "Lakeland",   33801L, "French Ave",
  "FL", "Lakeland",   33809L,   "Egret Dr",
  "FL",    "Miami",   33133L,    "Rice St",
  "FL",    "Miami",   33133L, "Thomas Ave",
  "FL",    "Miami",   33133L,  "Surrey Dr",
  "FL",    "Miami",   33133L,  "Trade Ave",
  "FL",    "Miami",   33146L,  "Nervia St",
  "FL",    "Miami",   33146L, "Corsica St"
)


zip_sort_r <- sas_sort(zip, State, City, ZipCode,
                           first_prefix = "first_",
                           last_prefix = "last_")

zip_sort_r
#> # A tibble: 10 × 10
#>    State City     ZipCode Street     first_State last_State first_City last_City
#>    <chr> <chr>      <int> <chr>      <lgl>       <lgl>      <lgl>      <lgl>    
#>  1 AZ    Tucson     85730 Domenic Ln TRUE        FALSE      TRUE       FALSE    
#>  2 AZ    Tucson     85730 Gleeson Pl FALSE       TRUE       FALSE      TRUE     
#>  3 FL    Lakeland   33801 French Ave TRUE        FALSE      TRUE       FALSE    
#>  4 FL    Lakeland   33809 Egret Dr   FALSE       FALSE      FALSE      TRUE     
#>  5 FL    Miami      33133 Rice St    FALSE       FALSE      TRUE       FALSE    
#>  6 FL    Miami      33133 Thomas Ave FALSE       FALSE      FALSE      FALSE    
#>  7 FL    Miami      33133 Surrey Dr  FALSE       FALSE      FALSE      FALSE    
#>  8 FL    Miami      33133 Trade Ave  FALSE       FALSE      FALSE      FALSE    
#>  9 FL    Miami      33146 Nervia St  FALSE       FALSE      FALSE      FALSE    
#> 10 FL    Miami      33146 Corsica St FALSE       TRUE       FALSE      TRUE     
#> # ℹ 2 more variables: first_ZipCode <lgl>, last_ZipCode <lgl>

df <- tibble::tribble(
  ~x, ~y, ~z,
  "b", 2L, NA,
  NA, 1L, NA,
  NA, 2L, NA,
  NA, NA, NA,
  NA, NA, "a",
  "a", NA, "a",
  "a", 1L, "a",
  "a", 2L, "b",
  "b", NA, NA,
  "b", 1L, "b",
  "a", NA, NA,
  NA, 1L, "b",
  "b", NA, "b",
  "a", 2L, "a",
  "b", 2L, "b",
  NA, 2L, "b",
  NA, 1L, "a",
  "b", 1L, NA,
  "a", NA, "b",
  "b", NA, "a",
  "a", 2L, NA,
  "a", 1L, "b",
  "a", 1L, NA,
  "b", 1L, "a",
  "b", 2L, "a",
  NA, NA, "b",
  NA, 2L, "a"
)

sort1 <- sas_sort(df,x,y,z)
sort1
#> # A tibble: 27 × 9
#>    x         y z     .first.x .last.x .first.y .last.y .first.z .last.z
#>    <chr> <int> <chr> <lgl>    <lgl>   <lgl>    <lgl>   <lgl>    <lgl>  
#>  1 <NA>     NA <NA>  TRUE     FALSE   TRUE     FALSE   TRUE     TRUE   
#>  2 <NA>     NA a     FALSE    FALSE   FALSE    FALSE   TRUE     TRUE   
#>  3 <NA>     NA b     FALSE    FALSE   FALSE    TRUE    TRUE     TRUE   
#>  4 <NA>      1 <NA>  FALSE    FALSE   TRUE     FALSE   TRUE     TRUE   
#>  5 <NA>      1 a     FALSE    FALSE   FALSE    FALSE   TRUE     TRUE   
#>  6 <NA>      1 b     FALSE    FALSE   FALSE    TRUE    TRUE     TRUE   
#>  7 <NA>      2 <NA>  FALSE    FALSE   TRUE     FALSE   TRUE     TRUE   
#>  8 <NA>      2 a     FALSE    FALSE   FALSE    FALSE   TRUE     TRUE   
#>  9 <NA>      2 b     FALSE    TRUE    FALSE    TRUE    TRUE     TRUE   
#> 10 a        NA <NA>  TRUE     FALSE   TRUE     FALSE   TRUE     TRUE   
#> # ℹ 17 more rows

sort2 <- sas_sort(df, x, dplyr::desc(y), z)
sort2
#> # A tibble: 27 × 9
#>    x         y z     .first.x .last.x .first.y .last.y .first.z .last.z
#>    <chr> <int> <chr> <lgl>    <lgl>   <lgl>    <lgl>   <lgl>    <lgl>  
#>  1 <NA>      2 <NA>  TRUE     FALSE   TRUE     FALSE   TRUE     TRUE   
#>  2 <NA>      2 a     FALSE    FALSE   FALSE    FALSE   TRUE     TRUE   
#>  3 <NA>      2 b     FALSE    FALSE   FALSE    TRUE    TRUE     TRUE   
#>  4 <NA>      1 <NA>  FALSE    FALSE   TRUE     FALSE   TRUE     TRUE   
#>  5 <NA>      1 a     FALSE    FALSE   FALSE    FALSE   TRUE     TRUE   
#>  6 <NA>      1 b     FALSE    FALSE   FALSE    TRUE    TRUE     TRUE   
#>  7 <NA>     NA <NA>  FALSE    FALSE   TRUE     FALSE   TRUE     TRUE   
#>  8 <NA>     NA a     FALSE    FALSE   FALSE    FALSE   TRUE     TRUE   
#>  9 <NA>     NA b     FALSE    TRUE    FALSE    TRUE    TRUE     TRUE   
#> 10 a         2 <NA>  TRUE     FALSE   TRUE     FALSE   TRUE     TRUE   
#> # ℹ 17 more rows

# delete the first and last
delete_first_last <- sas_sort(df, x, dplyr::desc(y), z) %>%
  dplyr::select(
    -dplyr::starts_with(".first."),
    -dplyr::starts_with(".last.")
  )

delete_first_last
#> # A tibble: 27 × 3
#>    x         y z    
#>    <chr> <int> <chr>
#>  1 <NA>      2 <NA> 
#>  2 <NA>      2 a    
#>  3 <NA>      2 b    
#>  4 <NA>      1 <NA> 
#>  5 <NA>      1 a    
#>  6 <NA>      1 b    
#>  7 <NA>     NA <NA> 
#>  8 <NA>     NA a    
#>  9 <NA>     NA b    
#> 10 a         2 <NA> 
#> # ℹ 17 more rows

Created on 2023-07-19 with reprex v2.0.2

bgxq
  • 31
  • 5