I have a large multidimensional array (approx. 19 million elements) which contains joint probabilities across a number of different attributes.
The array is very sparse and I am only interested in the cells with non-zero probabilities.
However, when filtering the array for non-zero elements, I am unable to retrieve the dimension names (which correspond to various attribute values) of the filtered values.
Here is a toy example:
array_dim <- c(2,5,5,4)
array_fill <- runif(prod(array_dim))
array_dimnames <- list(
  c('strawberry', 'blackberry'),
  c('cranberry', 'banana', 'pineapple', 'apple', 'tangerine'),
  c('orange', 'blueberry', 'kiwi', 'grapes', 'guava'),
  c('plum', 'fig', 'grapefruit', 'lemon')
)
fruits <- array(array_fill, dim=array_dim, dimnames=array_dimnames)
I can obtain the index values of cells matching a certain criterion (here, > 0.9) as follows:
> which(fruits %in% fruits[fruits>0.9], arr.ind = TRUE)
[1] 8 23 25 32 33 35 37 76 77 85 90 101 117 121 123 135 154 197
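For comparison, here is a minimal sketch of why `arr.ind = TRUE` seems to have no effect in the call above. It assumes a small hypothetical array `x` with fixed values (a stand-in for `fruits`, which is random). `%in%` returns a plain logical vector without a `dim` attribute, whereas comparing the array directly keeps its dimensions, so `which()` can return one row of subscripts per matching cell.

```r
# Hypothetical toy array with fixed values (a stand-in for `fruits`).
x <- array(
  c(0.1, 0.95, 0.2, 0.3, 0.99, 0.4, 0.5, 0.6),
  dim = c(2, 2, 2),
  dimnames = list(c("a", "b"), c("c", "d"), c("e", "f"))
)

# `%in%` drops the dim attribute, so which() only sees a flat logical
# vector and returns linear indices even with arr.ind = TRUE.
flat <- which(x %in% x[x > 0.9], arr.ind = TRUE)

# Comparing the array itself keeps the dim attribute, so which()
# returns a matrix with one column per dimension.
idx <- which(x > 0.9, arr.ind = TRUE)
print(flat)
print(idx)
```

Each row of `idx` holds the subscripts of one matching cell, which can then be looked up in `dimnames(x)`.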
However, I cannot use the index values from my which() call above to find out which combination of fruits each one corresponds to, because the dimnames are dropped when I look up a specific cell value:
> fruits[8]
[1] 0.9590207
> fruits[8, drop=FALSE]
[1] 0.9590207
> dimnames(fruits[8])
NULL
> names(fruits[8])
NULL
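The behaviour above is expected: indexing an array with a single linear index returns a plain vector, and dimnames belong to the array's margins rather than to individual cells. One possible way to recover the labels, sketched here on a hypothetical array `x` with fixed values, is to convert the linear indices to subscripts with `arrayInd()` and look each subscript up in `dimnames()`. The column names `dim1`, `dim2`, ... and the name `cells` are illustrative choices, not anything taken from the question.

```r
# Hypothetical toy array with fixed values (a stand-in for `fruits`).
x <- array(
  c(0.1, 0.95, 0.2, 0.3, 0.99, 0.4, 0.5, 0.6),
  dim = c(2, 2, 2),
  dimnames = list(c("a", "b"), c("c", "d"), c("e", "f"))
)

lin <- which(x > 0.9)          # linear indices of the matching cells
sub <- arrayInd(lin, dim(x))   # one row of subscripts per matching cell

# For each dimension k, translate the k-th subscript into its label.
labels <- lapply(seq_along(dim(x)), function(k) dimnames(x)[[k]][sub[, k]])
names(labels) <- paste0("dim", seq_along(labels))

# One row per matching cell: its labels plus its value.
cells <- data.frame(labels, value = x[lin], stringsAsFactors = FALSE)
print(cells)
```

Only the matching cells are expanded into rows, so the full array is never copied into a long table.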
I have tried converting the array to a data.frame and using the drop=FALSE parameter:
> fruits.df <- as.data.frame(fruits)
>
> fruits.df[1,2,drop=FALSE]
banana.orange.plum
strawberry 0.4003854
but adding the conditional filter fails: fruits.df[fruits.df > 0.9,,drop=FALSE] returns a bunch of NA values.
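The NA rows most likely appear because `fruits.df > 0.9` is a logical matrix spanning every column, and it is being used as a row index into a data frame that has only two rows. A sketch of an alternative, assuming a long format (one row per cell) is acceptable, is to go through `as.table()` first, so that `as.data.frame()` yields one column per dimension plus a value column. The array `x` and the column name `prob` are hypothetical, and `Var1`, `Var2`, ... are the default names R assigns when the dimnames themselves are unnamed.

```r
# Hypothetical toy array with fixed values (a stand-in for `fruits`).
x <- array(
  c(0.1, 0.95, 0.2, 0.3, 0.99, 0.4, 0.5, 0.6),
  dim = c(2, 2, 2),
  dimnames = list(c("a", "b"), c("c", "d"), c("e", "f"))
)

# Long format: one row per cell, factor columns Var1..Var3 plus `prob`.
long <- as.data.frame(as.table(x), responseName = "prob")

# An ordinary row filter now works, and the labels stay attached.
hits <- long[long$prob > 0.9, ]
print(hits)
```

Note that `long` has one row for every cell of the array, so for a very large and sparse array it may be lighter to find the matching indices with `which()` first and look up labels only for those cells.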
As a last resort, I could construct the array_index -> dimnames mapping myself in a separate data structure, but it would be good to know whether there is a more elegant/efficient solution.
I am also looking into the listarrays package.
Thanks in advance