Objective
My solution below seems messy and performs poorly. There must be a simpler approach, since this operation already exists in image-processing filters that enlarge/widen a selection mask.
I started this code to manually inspect regions of a table without switching to a spreadsheet, which in many cases is not possible due to the large amount of data. My goal was to pass filtered rows to a function that would return one dataframe per row, together with its surrounding/neighbor rows.
At first it played well with duplicated(), but it doesn't play well with filter() in a pipe.
For example, when printing a DF in RStudio with 1 row selected and depth=2, I want to inspect the 5 contiguous rows centered on the selected one: the row itself plus 2 neighbors on each side.
row
row
row <------- neighbor row
row <------- neighbor row
row <-- selected by filter condition
row <------- neighbor row
row <------- neighbor row
row
row
Image processing analogy: a dilate/erode filter widens the light/dark areas of the active selection.
keywords : increase, expand, enlarge, enhance, widen, dilate, broaden, neighbours, context, surroundings, region
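The dilation analogy can be sketched on a plain logical vector, one element per row. This is only a minimal illustration in base R; the name dilate_mask and its arguments are hypothetical, not an existing function.

```r
# Hypothetical helper: widen a logical row mask by `depth` rows on each side.
dilate_mask <- function(mask, depth = 1) {
  hits <- which(mask)                                 # positions of TRUE (NA is dropped)
  idx  <- as.vector(outer(hits, -depth:depth, "+"))   # each hit shifted by -depth..depth
  idx  <- idx[idx >= 1 & idx <= length(mask)]         # clip to valid positions
  out  <- logical(length(mask))                       # start with all FALSE
  out[idx] <- TRUE
  out
}

mask <- c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)
dilate_mask(mask, depth = 2)
# FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
```

The widened mask can then be used for ordinary row subsetting, e.g. DF[dilate_mask(logicals, 2), ].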
Current Approach
(Reproducible code)
library(dplyr)  # provides %>%
df_inspect_context <-
df_inspect_surroundings <-
df_inspect_neighbors <-
df_inspect_region <-
helper_df_inspect_region <-
function( DF, logicals, depth=5, limit=4 ){
  # row numbers of the first `limit` TRUE values (which() already drops NA; na.omit is a safeguard)
  looplst <- logicals %>% which %>% head(limit) %>% na.omit
  regions <- lapply( looplst , function(rnum){
    from <- max( 1, (rnum-depth) )
    to <- min( (rnum+depth), nrow(DF) )  # was rnum+depth-1, which gave one row fewer below than above
    indexes <- from:to
    # 'X' marks the selected row, '' marks its neighbors
    highlightX <- c( rep('', rnum-from ), 'X', rep('', to-rnum ) )
    return( list( idxs=indexes, X=highlightX ) )
  } )
  # one data frame per selected row
  lapply( regions, function(region) { cbind(X=region$X, DF[ region$idxs, ]) } )
}
#TEST
helper_df_inspect_region( iris, duplicated(iris) )
In the result, the X marks the inspected row.
EXPECTATION
- A standard, R-ish method for the same task.
- I want this to play well with normal filter() operations.
- It should return either a list of dataframes, as it does currently, or one enlarged filtered dataframe.
- It must respect the row order passed in (e.g. from arrange()).
Example calls:
df %>% arrange(..) %>% filter(..) %>% dilate(5)
df %>% arrange(..) %>% filter(..) %>% surrounding_rows(5)
df %>% arrange(..) %>% filter(..) %>% neighbor_rows(5)
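One caveat with the call shapes above: by the time a verb placed after filter() runs, the neighbor rows have already been dropped, so it cannot bring them back. A possible sketch, under that assumption, is a verb that takes the condition itself and stands in for filter(). The signature neighbor_rows(.data, cond, depth) is hypothetical, it assumes an ungrouped data frame, and using if_any() inside mutate() needs dplyr 1.0.4 or later.

```r
library(dplyr)

# Hypothetical verb: keep rows matching `cond` plus `depth` neighbors on each side.
# It receives the unfiltered (already arranged) data, so row order is preserved.
neighbor_rows <- function(.data, cond, depth = 1) {
  hit  <- .data %>% mutate(.hit = {{ cond }}) %>% pull(.hit)  # one logical per row
  rows <- which(hit)                                          # selected row numbers
  keep <- sort(unique(as.vector(outer(rows, -depth:depth, "+"))))
  keep <- keep[keep >= 1 & keep <= nrow(.data)]               # clip at table edges
  .data$X <- ifelse(seq_len(nrow(.data)) %in% rows, "X", "")  # mark the selected rows
  .data[keep, , drop = FALSE]
}

dat <- data.frame(id = 1:9, value = c(3, 1, 4, 1, 0, 9, 2, 6, 5))

# rows with id 3 to 7, X on id 5
dat %>% neighbor_rows(value == 0, depth = 2)

# same idea with a whole-row condition: rows with id 4 to 6
dat %>% neighbor_rows(if_any(everything(), ~ . == 0), depth = 1)
```

This returns one enlarged data frame; overlapping regions are merged rather than returned as separate list elements.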
Constraints of current solution
To check for zeros or outliers in any cell of any row, I would filter with a dplyr expression like the one below, which is not compatible with the which() function that my function uses to compute the regions.
dat %>% filter( if_any( everything(), ~.==0 ) )
The filter condition (~.==0, ~.=='', is.na, is.empty) has to be applied to entire rows and return TRUE for a row if any cell in that row is TRUE.
To work around this, I used apply() to evaluate the condition row by row and return one logical per row.
Because apply() coerces the data to chr, I also had to restrict it to the numeric columns.
The result looks messy and still doesn't play well with filter().
numericcols = lapply(df, is.numeric) %>% unlist
logicals = apply( df[,numericcols], 1, function(x) sum(x==0)>0 )
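As a possible alternative to apply(), the condition can be evaluated column by column and the results combined with a logical OR, which yields one logical per row without coercing anything to character. The helper name row_any is hypothetical, and treating NA comparisons as FALSE is an assumption of this sketch.

```r
dat <- data.frame(
  a = c(1, 0, 3, 4),
  b = c("x", "y", "", "z"),
  c = c(5, 6, 7, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE for each row where any cell satisfies `pred`.
# Each column is tested in its own type, so no coercion to character happens.
row_any <- function(df, pred) {
  per_col <- lapply(df, function(col) {
    res <- pred(col)
    !is.na(res) & res            # assumption: an NA result counts as FALSE
  })
  Reduce("|", per_col)           # OR the per-column results together
}

row_any(dat, function(x) x == 0)           # FALSE TRUE FALSE FALSE
row_any(dat, is.na)                        # FALSE FALSE FALSE TRUE
which(row_any(dat, function(x) x == ""))   # 3
```

The result is a plain logical vector, so it works with which() and with ordinary row subsetting.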
Note on lag/lead: I found lag() and lead() suggested in other questions, but they don't do the same thing, or they return NA where there should be data.
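The NA values appear at the ends of the vector, where dplyr's lag() and lead() have nothing to shift in. A hedged sketch of how they could still be used on a logical mask: pass default = FALSE and OR the shifted copies with the original. The helper name widen_lag_lead is hypothetical.

```r
library(dplyr)

mask <- c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)

dplyr::lag(mask)   # NA FALSE FALSE TRUE FALSE FALSE (NA at the edge)

# Hypothetical helper: OR the mask with copies shifted by 1..depth in both directions.
widen_lag_lead <- function(mask, depth = 1) {
  out <- mask
  for (n in seq_len(depth)) {
    out <- out |
      dplyr::lag(mask, n, default = FALSE) |
      dplyr::lead(mask, n, default = FALSE)
  }
  out
}

widen_lag_lead(mask, depth = 1)   # FALSE TRUE TRUE TRUE FALSE FALSE
widen_lag_lead(mask, depth = 2)   # TRUE TRUE TRUE TRUE TRUE FALSE
```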