I was looking for code that computes the Information Gain Ratio (IGR) in R or Python. I found a handy R package, but it is no longer maintained and has been removed from CRAN. I did find an old version, though, and took the liberty of "borrowing" its critical functions. I made some changes and also added some new functions. The algorithm expects a 2x2 matrix holding the occurrence counts of two cues/features (on the diagonal) and their co-occurrence (off the diagonal), plus the total number of events. It returns two IGRs, one for each cue/feature.

However, I do not think it is well optimized, and I would like to learn a better way of implementing it. In particular, I think there must be a way to make the functions cueRE and getIGRs nicer. The functions and an example are below.

I would appreciate any advice or comments. Many thanks!

safelog2 <- function (x) {
    if (x <= 0) return(0)
    else return(log2(x))
}

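# Build the 2x2 contingency table of counts (1 = present, 0 = absent) from
# the cue counts m[1,1] and m[2,2], their co-occurrence m[1,2], and the total t.
binaryMatrix <- function(m, t) {
    return(matrix(c(m[1,2], m[1,1]-m[1,2], m[2,2]-m[1,2], t-(m[1,1]+m[2,2]-m[1,2])),
        nrow=2, byrow=TRUE, dimnames=list(c(1,0),c(1,0))))
}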

H <- function (p) {
    return(-(sum(p * sapply(p, safelog2))))
}

cueH <- function(m, t) {
    p1 = c(m[1,1]/t, (t-m[1,1])/t)
    p2 = c(m[2,2]/t, (t-m[2,2])/t)
    return(c(H(p1), H(p2)))
}

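# Conditional entropy: the entropy of each row of tbl, weighted by
# that row's share of all events.
cueRE <- function (tbl) {
    normalize <- function(v) {
        if (sum(v) == 0) v
        else v/sum(v)
    }
    nis <- apply(t(apply(tbl, 1, normalize)), 1, H)
    return(sum(tbl * nis) / sum(tbl))
}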

getIGRs <- function(m, t) {
    # IGR with the first cue as the splitting variable
    ent = cueH(m, t)
    rent = cueRE(binaryMatrix(m, t))
    igr1 = (ent[2] - rent) / ent[1]
    # swap the two cue counts on the diagonal and repeat for the second cue
    d = diag(m)
    m[1,1] = d[2]
    m[2,2] = d[1]
    ent = cueH(m, t)
    rent = cueRE(binaryMatrix(m, t))
    igr2 = (ent[2] - rent) / ent[1]
    return(c(igr1, igr2))
}
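
To make explicit what I understand getIGRs to compute, here it is in terms of entropies. The symbols A and B are my own notation (not from the package) for the binary present/absent indicators of the first and second cue; this is my reading of the code above, not a formula taken from the package documentation:

```latex
H(X) = -\sum_{x} p(x)\,\log_2 p(x), \qquad
\mathrm{IGR}_1 = \frac{H(B) - H(B \mid A)}{H(A)}, \qquad
\mathrm{IGR}_2 = \frac{H(A) - H(A \mid B)}{H(B)}
```

Here cueH supplies H(A) and H(B), and cueRE supplies the conditional entropy for the table it is given.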

It would be used as follows:

M <- matrix(c(20,15,15,40), nrow=2, byrow=TRUE,
    dimnames=list(c('a','b'),c('a','b')))
total <- 120

getIGRs(M, total)
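
For comparison, here is a rough sketch of the direction I have in mind for tightening this up. The names entropy, contingency, condEntropy and igrPair are hypothetical (mine, not from the package), and it assumes the same input layout as above. The idea is to build the contingency table once and use its row and column sums instead of swapping the diagonal:

```r
# Shannon entropy in bits; zero-probability terms are dropped (contribute 0).
entropy <- function(p) {
    p <- p[p > 0]
    -sum(p * log2(p))
}

# 2x2 table of counts: rows = cue 1 present/absent,
# columns = cue 2 present/absent (same layout as binaryMatrix).
contingency <- function(m, total) {
    both <- m[1, 2]
    matrix(c(both, m[1, 1] - both,
             m[2, 2] - both, total - (m[1, 1] + m[2, 2] - both)),
           nrow = 2, byrow = TRUE)
}

# Entropy of the column variable given the row variable.
condEntropy <- function(tbl) {
    rowH <- apply(tbl, 1, function(r) if (sum(r) > 0) entropy(r / sum(r)) else 0)
    sum(rowSums(tbl) * rowH) / sum(tbl)
}

# Both IGRs from one table; transposing swaps the roles of the cues.
igrPair <- function(m, total) {
    tbl <- contingency(m, total)
    hRow <- entropy(rowSums(tbl) / total)  # entropy of cue 1
    hCol <- entropy(colSums(tbl) / total)  # entropy of cue 2
    c((hCol - condEntropy(tbl)) / hRow,
      (hRow - condEntropy(t(tbl))) / hCol)
}

igrPair(M <- matrix(c(20, 15, 15, 40), nrow = 2, byrow = TRUE), 120)
```

As far as I can tell this should give the same two values as getIGRs(M, total). Because the information gain in the numerator is symmetric in the two cues, the two ratios differ only in the denominator. Like the original, it does not guard against a cue with zero entropy, which would lead to a division by zero.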
Aziz Shaikh
striatum
I'm wondering if the reason this package has not been maintained is that the statistical premises underlying it were criticized as either questionable or invalid? Some of the measures that have been proposed in the machine learning community have not met with full acceptance in the statistical community. – IRTFM Jul 25 '13 at 01:42

That may be completely true. However, I am not using this for a machine learning problem. What I am interested in, then, is whether those procedures are formally sound. Many thanks! – striatum Jul 25 '13 at 07:38
