1

Objective

My solution below seems too messy and low performant. There must be a simpler approach, as this operation already exists in filters for image processing to enlarge/widen selection mask.

I started this code to manually inspect regions of a table without switching to an Spreadsheet which in many cases is not possible due to the large amount of data. My goal was to pass filtered rows to a function that would return one dataframe per row with the surrounding/neighbor rows.

It first played well for duplicated but it doesn't plays well with filter in a pipe.

For example, when printing a DF in Rstudio, having 1 row selected, with depth=2, I want to inspect the 5 contiguous rows surrounding the selected one.

row 
row
row <------- neighbor row
row <------- neighbor row
row <-- selected by filter condition
row <------- neighbor row
row <------- neighbor row
row 
row 

Image processing analogy : dilate/erode filter in image processing, widens light/dark areas of the active selection filter.

keywords : increase, expand, enlarge, enhance, widen, dilate, broaden, neighbours, context, surroundings, region

Current Approach

(Reproducible code)

df_inspect_context <- 
df_inspect_surroundings <- 
df_inspect_neighbors <- 
df_inspect_region <- 
helper_df_inspect_region <- 
  function( DF, logicals, depth=5, limit=4 ){

      looplst  <- logicals %>% which %>% head(limit) %>% na.omit

      filter
      regions  <- lapply( looplst ,    function(rnum){ 
                  from = max( 1, (rnum-depth) )
                  to   = min( (rnum+depth-1), nrow(DF) )
                  indexes     = from:to
                  highlightX  = c( rep('', rnum-from ), 'X', rep('', to-rnum ) ) 

                  return( list( idxs=indexes, X=highlightX ) )
                  } )
      lapply(    regions,   function(region) {  cbind(X=region$X, DF[ region$idxs, ])   } )
}

#TEST
helper_df_inspect_region( iris, duplicated(iris) )

See result, the X marks the inspected row

EXPECTATION

  • Standard R-ish method for the same.
  • I want this to play well with normal filter operations.
  • It should return either a list of dataframes as current, or one enlarged filtered dataframe.
  • It must respect the passed arranging.

Example calls:

df %>% arrange(..) %>% filter(..) %>% dilate(5)
df %>% arrange(..) %>% filter(..) %>% surrounding_rows(5)
df %>% arrange(..) %>% filter(..) %>% neighbor_rows(5)
Constraints of current solution

In order to check for CEROes or Outliers at any cell of all rows, I would filter with a dplyr context like next, which is not compatible with the which function that my function uses to calculate the regions.

dat %>% filter( if_any( everything(), ~.==0 ) )

The filter condition (~.==0,~.=='',is.na,is.empty) has to apply to entire rows, and return TRUE if any row is TRUE.

To work around this, I used apply to apply the condition row by row and return one logical per row. As apply coerces to chr I had to take care of filtering numeric columns as well.

The result looks messy and still doesn't plays well with filter.

numericcols = lapply(df, is.numeric) %>% unlist
logicals = apply( df[,numericcols], 1, function(x) sum(x==0)>0 )

See result for a CEROs lookup

Note: lag/lead: I found lag and lead as suggested in other questions, but doesn't do the same thing, or they return NA where there must be data.

DiegoJArg
  • 101
  • 10
  • 1
    Sorry but for in the end it is not clear what you want. Try to give a minimal reproducible example with the expected output. Many thanks. – TarJae Jul 27 '22 at 18:43
  • 1
    I am sorry. According to the many many rules of this site, I added as many detail as I have. The reproducible CODE is under the section `CODEs`, and is reproducible. I will edit the 2nd one to make it copy&paste reproducible as well I Both were prepared to post it here as reproducible codes. The question here is just to make it simpler and/or faster according to section *Expected approach*. – DiegoJArg Jul 27 '22 at 20:08
  • everything's ok. I just expressed this because others may felt the same and therefore you would not get appropriate help. I see that the question is good structured, but try to keep it as simple as possible. – TarJae Jul 27 '22 at 20:11
  • 1
    I just edited and added the neighbor/surrounding rows representation. I hope it helps. – DiegoJArg Jul 27 '22 at 20:20

2 Answers2

0

So here is a slightly modified code from here Returning above and below rows of specific rows in r dataframe: Credits to @flodel (2012) :-)

Here we identify the row by rownumber and select n above and n below:

extract.with.context <- function(x, rows, after = 0, before = 0) {
  
  match.idx  <- which(1:nrow(mtcars) %in% rows)
  span       <- seq(from = -before, to = after)
  extend.idx <- c(outer(match.idx, span, `+`))
  extend.idx <- Filter(function(i) i > 0 & i <= nrow(x), extend.idx)
  extend.idx <- sort(unique(extend.idx))
  
  return(x[extend.idx, , drop = FALSE])
}


extract.with.context(mtcars, 6, after=2, before = 2)
                  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
TarJae
  • 72,363
  • 6
  • 19
  • 66
  • 1
    Hi TarJae. Its actually the same procedure. That `match.idx` is my `places`, except that it just makes the logicals inside. `span` is the same as my `indexes` variable. Then 2 lines to strip outbounding values. Although I like the naming of extract.with.context – DiegoJArg Jul 27 '22 at 21:22
  • Ah. OK. Was just a try to help. Maybe one of the other solutions on that site. Good luck :-) – TarJae Jul 27 '22 at 21:26
  • 1
    Thank you!! I would give you a point, but I am restricted by the reputation system. – DiegoJArg Jul 28 '22 at 11:15
0

I happen to answer my own question, sorry for that, but I came with this solution. I am not sure how to name it yet. So far, it does what I need, and most of what the question posted asks.

I am still interested in more performant options.

dilate_filter <- 
  function( X, df_original, by, depth=5, limit=NA ){
    X <- if ( is.data.frame(X) )  list(X) 
    else if (is.list(X)) X
    else NULL
    
    result = list()
    limit = min(coalesce(limit,length(X)),length(X))
    
    for( k in 1:limit ){
    
        DF <- df_original %>% rowid_to_column( "idx"  )
        indexes <- left_join( X[[k]] , DF, by=by )$idx
    
        indexes <-  sapply( indexes, function(rnum) {
          from <-  max( 1, (rnum-depth) );
          to   <-  min( (rnum+depth-1), nrow(df_original  ) );
          X=c( rep('', rnum-from ), 'X', rep('', to-rnum ) ) 
          return( list( X = X, idx = from:to ) )
          } )
        
        highlightXs = indexes['X',] %>% unlist
        indexes = indexes['idx',] %>% unlist
        DF <- cbind( X=c(highlightXs), DF[ c(indexes), ])
        
        result[[ length(result)+1 ]] <- DF
    }
    if (length(result)==1) return(result[[1]])
    else return(result)
}

The primary use scenario is

by <- c('session_id', 'datetime')

db$minuteCaloriesNarrow[Calories==0] %>% 
  dilate_filter( db$minuteCaloriesNarrow, by ) 

But, it can be called from a piped sequence as:

db$minuteCaloriesNarrow[Calories==0] %>% 
  split( row.names(.) ) %>% 
  dilate_filter( db$minuteCaloriesNarrow, by )

And perform calculations in a pipe afterwards

db$minuteCaloriesNarrow[Calories==0] %>% 
  split( .[['.datetime']]  ) %>% 
  dilate_filter( db$minuteCaloriesNarrow, by )%>%
  map( ~.[['Calories']] %>% median ) %>%
  unlist

db$minuteCaloriesNarrow[Calories==0] %>% 
  dilate_filter( db$minuteCaloriesNarrow, by ) %>% 
  .$Calories %>% 
  median

or

median_list <- 
  db$minuteCaloriesNarrow[Calories==0] %>% 
    split( row.names(.) ) %>% 
    dilate_filter( db$minuteCaloriesNarrow, by )%>%
    map( ~.[['Calories']] %>% median ) %>%
    unlist

db$minuteCaloriesNarrow[Calories==0]$Calories <- median_list

db$minuteCaloriesNarrow[Calories==0]

The function gets either a DF or a list of DF (after using split).

Also requires the original DF, previous filtering, and the keys to perform left_join in order to get the row indexes.

Once indexes are encountered for each row on the input, it just adds more indexes to the list related to the neighbor rows or surroundings, by using depth and seq.

Additionally, it also adds a column with an X, marking the original row value that we are interested in.

This function is meant to avoid switching to Spreadsheet when trying to inspect the context of a certain row, but it can be used to perform calculations as well.

For last, a test with builtin dataset:

DF <- iris
keys = c('Species', 'Sepal.Length')
DF %>% filter(duplicated(.)) %>% dilate_filter( DF, keys, limit=10 )
DiegoJArg
  • 101
  • 10