
I have a large multidimensional array (approx. 19 million elements) which contains joint probabilities across a number of different attributes.

The array is very sparse and I am only interested in the cells with non-zero probabilities.

However, when filtering the array for non-zero elements, I am unable to retrieve the dimension names (which correspond to various attribute values) of the filtered values.

Here is a toy example:

array_dim <- c(2,5,5,4)

array_fill <- runif(prod(array_dim))

array_dimnames <- list(
                    c('strawberry', 'blackberry'),
                    c('cranberry', 'banana', 'pineapple', 'apple', 'tangerine'), 
                    c('orange', 'blueberry', 'kiwi', 'grapes', 'guava'),
                    c('plum', 'fig', 'grapefruit', 'lemon')
                    )


fruits <- array(array_fill, dim=array_dim, dimnames=array_dimnames)

I can obtain the index values of cells matching a certain criterion (here, > 0.9) as follows:

> which(fruits %in% fruits[fruits>0.9], arr.ind = TRUE)
 [1]   8  23  25  32  33  35  37  76  77  85  90 101 117 121 123 135 154 197

However, I cannot use these index values to find out which combinations of fruits they correspond to, because the dimnames are dropped when a single cell is extracted by its linear index:

> fruits[8]
[1] 0.9590207
> fruits[8, drop=FALSE]
[1] 0.9590207
> dimnames(fruits[8])
NULL
> names(fruits[8])
NULL

I have tried converting the array into a data.frame and using the drop=FALSE argument:

> fruits.df <- as.data.frame(fruits)
> fruits.df[1,2,drop=FALSE]
           banana.orange.plum
strawberry          0.4003854

but adding the conditional filter fails: fruits.df[fruits.df > 0.9,,drop=FALSE] returns mostly NA values.

As a last resort, I could construct the array_index -> dimnames mapping myself in a separate data structure but it would be good to know if there is a more elegant/efficient solution.

I am also looking into the listarrays package.

Thanks in advance

AAA
  • See one possible answer below; also see https://stackoverflow.com/questions/21074240/methodology-of-high-dimensional-data-structuring-in-r-vs-matlab – user12728748 Feb 07 '20 at 22:50
  • can't believe I didn't find that thread..., thank you – AAA Feb 10 '20 at 11:25

1 Answer


I also did not find a simple way to get the dimnames directly from the array. An easy way to transform your data structure is to use as.tbl_cube from dplyr and convert the result into a data.frame (or data.table), which exposes the dimnames as columns:

set.seed(3)
array_dim <- c(2,5,5,4)
array_fill <- runif(prod(array_dim))
array_dimnames <- list(
    dim1=c('strawberry', 'blackberry'),
    dim2=c('cranberry', 'banana', 'pineapple', 'apple', 'tangerine'), 
    dim3=c('orange', 'blueberry', 'kiwi', 'grapes', 'guava'),
    dim4=c('plum', 'fig', 'grapefruit', 'lemon')
)
fruits <- array(array_fill, dim=array_dim, dimnames=array_dimnames)
which(fruits %in% fruits[fruits>0.9], arr.ind = TRUE)
#>  [1]  28  54  56  73  74  85  90 115 161 198
fruits[198]
#> [1] 0.9065314

library(dplyr)
arr.cube <- as.tbl_cube(fruits)
tail(as.data.frame(arr.cube))
#>           dim1      dim2  dim3  dim4    fruits
#> 195 strawberry pineapple guava lemon 0.7057146
#> 196 blackberry pineapple guava lemon 0.3907374
#> 197 strawberry     apple guava lemon 0.8242374
#> 198 blackberry     apple guava lemon 0.9065314
#> 199 strawberry tangerine guava lemon 0.4171170
#> 200 blackberry tangerine guava lemon 0.2791320

In this example, fruits[198] corresponds to the dimnames blackberry, apple, guava, and lemon.
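For comparison, here is a base R sketch of two other routes to the same result. It is only an illustration under stated assumptions, not a tested solution for the full 19-million-element array. The first route converts linear indices to per-dimension subscripts with arrayInd and looks the labels up in dimnames. The second reshapes the array into long format with as.data.frame(as.table(...)) and filters the rows. The names lin, sub, labels, long and hits, and the column name prob, are made up for this example. It may also matter that, as far as I know, as.tbl_cube was moved out of dplyr into the separate cubelyr package from dplyr 1.0.0 onwards.

```r
# Same toy array as above, with named dimensions dim1..dim4.
set.seed(3)
array_dim <- c(2, 5, 5, 4)
array_dimnames <- list(
    dim1 = c('strawberry', 'blackberry'),
    dim2 = c('cranberry', 'banana', 'pineapple', 'apple', 'tangerine'),
    dim3 = c('orange', 'blueberry', 'kiwi', 'grapes', 'guava'),
    dim4 = c('plum', 'fig', 'grapefruit', 'lemon')
)
fruits <- array(runif(prod(array_dim)), dim = array_dim,
                dimnames = array_dimnames)

# Route 1: linear indices -> subscripts -> labels.
# Note that `fruits %in% ...` returns a plain vector without dims, which is
# why arr.ind = TRUE had no visible effect in the call shown earlier.
lin <- which(fruits > 0.9)            # linear indices of matching cells
sub <- arrayInd(lin, dim(fruits))     # one row of subscripts per match

# Look up the label of each subscript, one dimension at a time.
labels <- as.data.frame(
    Map(function(nm, j) nm[j], dimnames(fruits), as.data.frame(sub)),
    stringsAsFactors = FALSE
)
labels$prob <- fruits[lin]

# Route 2: reshape the whole array to long format, then filter.
# Rows come out in linear-index order, so row names match the indices.
long <- as.data.frame(as.table(fruits), responseName = "prob",
                      stringsAsFactors = FALSE)
hits <- long[long$prob > 0.9, ]

print(labels)
print(hits)
```

Route 1 only touches the matching cells, so it is likely the better fit for a large sparse array. Route 2 materialises one row per cell before filtering, which is convenient for small arrays but uses considerably more memory.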

user12728748