
I was wondering if there is a way to convert decimal numbers to ternary, given that there is a function intToBits for converting to binary.

I actually need to convert a character string like

> S0 <- c("Hello Stac")

to base 3. My idea is to first convert it to integer code points with

> S01 <- utf8ToInt(S0)
> S01
## [1]  72 101 108 108 111  32  83 116  97  99

then convert the result to base 3. I want to obtain something like this:

> S1
## [1]  2200 10202 11000 11000 11010  1012 10002 11022 10121 10200
Maël

3 Answers


For practice, you can write your own converter function, for example with a loop:

f <- function(x, base = 3) {
  q <- c()
  while (x) {
    q <- c(x %% base, q)
    x <- x %/% base
  }
  # as.numeric(paste0(q, collapse = ""))
  sum(q * 10^(rev(seq_along(q) - 1)))
}

or with recursion

f <- function(x, base = 3) {
  # pass `base` along, otherwise the recursive calls fall back to base 3
  ifelse(x < base, x, f(x %/% base, base) * 10 + x %% base)
}

then you can run

> sapply(utf8ToInt(S0),f)
 [1]  2200 10202 11000 11000 11010  1012 10002 11022 10121 10200
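
Both versions above encode the digits as a decimal number, which only makes sense for bases up to 10 and is limited by double precision for long digit strings. As a minimal sketch, assuming a character result is acceptable, the same loop can collect the digits as a string instead. The names `to_base_chr` and `symbols` are hypothetical and introduced only for this illustration:

```r
# Hypothetical helper (not part of the answer above):
# returns the base-`base` digits of a single non-negative integer as a string.
to_base_chr <- function(x, base = 3L) {
  stopifnot(length(x) == 1L, x >= 0, x %% 1 == 0, base >= 2, base <= 36)
  symbols <- c(0:9, letters)  # "0".."9", "a".."z": digit symbols up to base 36
  if (x == 0) return("0")
  out <- character(0L)
  while (x > 0) {
    out <- c(symbols[x %% base + 1], out)  # prepend the least significant digit
    x <- x %/% base
  }
  paste(out, collapse = "")
}

out <- vapply(utf8ToInt("Hello Stac"), to_base_chr, character(1L))
out
```

Because the result is a string, it can be read back with `strtoi(out, base = 3L)`.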
ThomasIsCoding

Nice programming exercise. I have vectorized @ThomasIsCoding's answer to avoid expensive loops over strings and characters within strings. The idea is to loop over digits instead, since Unicode code points do not exceed 21 digits in any base, whereas the total number of characters in a character vector can be orders of magnitude greater.

The function below takes as arguments a character vector x, a base b (from 2 to 10), and a logical flag double. It returns a list res such that res[[i]] is an nchar(x[i])-length vector giving the base-b representation of x[i]. The list elements are double vectors or character vectors depending on double.

utf8ToBase <- function(x, b = 10, double = TRUE) {
    ## Do some basic checks
    stopifnot(is.character(x), !anyNA(x), 
              is.numeric(b), length(b) == 1L, 
              b %% 1 == 0, b >= 2, b <= 10)
    
    ## Require UTF-8 encoding
    x <- enc2utf8(x)
    
    ## Operate on concatenation to avoid loop over strings
    xx <- paste(x, collapse = "")
    ixx <- utf8ToInt(xx)
    
    ## Handle trivial case early
    if (length(ixx) == 0L) {
        el <- if (double) base::double(0L) else character(0L)
        res <- rep.int(list(el), length(x))
        names(res) <- names(x)
        return(res)
    }
    
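    ## Use common field width determined from greatest integer
    ## (counted by integer division, since floor(log()) may be off by
    ## one at exact powers of b because of floating-point rounding)
    width <- 1L
    top <- max(ixx) %/% b
    while (top > 0) {
        width <- width + 1L
        top <- top %/% b
    }
    res <- rep.int(strrep("0", width), length(ixx))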
    
    ## Loop over digits
    pos <- 1L
    pow <- b^(width - 1L)
    while (pos <= width) {
        quo <- ixx %/% pow
        substr(res, pos, pos) <- as.character(quo)
        ixx <- ixx - pow * quo
        pos <- pos + 1L
        pow <- pow %/% b
    }
    
    ## Discard leading zeros
    if (double) {
        res <- as.double(res)
        if (b == 2 && any(res > 0x1p+53)) {
            warning("binary result not guaranteed due to loss of precision")
        }
    } else {
        res <- sub("^0+", "", res)
    }
    
    ## Return list
    res <- split(res, rep.int(gl(length(x), 1L), nchar(x)))
    names(res) <- names(x)
    res
}

x <- c(foo = "Hello Stack Overflow!", bar = "Hello world!")
utf8ToBase(x, 2)
$foo
 [1] 1001000 1100101 1101100 1101100 1101111  100000
 [7] 1010011 1110100 1100001 1100011 1101011  100000
[13] 1001111 1110110 1100101 1110010 1100110 1101100
[19] 1101111 1110111  100001

$bar
 [1] 1001000 1100101 1101100 1101100 1101111  100000
 [7] 1110111 1101111 1110010 1101100 1100100  100001

utf8ToBase(x, 3)
$foo
 [1]  2200 10202 11000 11000 11010  1012 10002 11022 10121 10200
[11] 10222  1012  2221 11101 10202 11020 10210 11000 11010 11102
[21]  1020

$bar
 [1]  2200 10202 11000 11000 11010  1012 11102 11010 11020 11000
[11] 10201  1020

utf8ToBase(x, 10)
$foo
 [1]  72 101 108 108 111  32  83 116  97  99 107  32  79 118 101
[16] 114 102 108 111 119  33

$bar
 [1]  72 101 108 108 111  32 119 111 114 108 100  33
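
As a quick sanity check, digit strings like these can be converted back to text with base R's `strtoi()` and `intToUtf8()`. This sketch hard-codes a few base-3 digit strings taken from the output above rather than calling `utf8ToBase()`, so it runs on its own:

```r
# Base-3 digits for "Hello", copied from the base-3 output above
digits <- c("2200", "10202", "11000", "11000", "11010")

# strtoi() parses strings in any base from 2 to 36
codes <- strtoi(digits, base = 3L)
codes
## [1]  72 101 108 108 111

# intToUtf8() turns the code points back into a string
txt <- intToUtf8(codes)
txt
## [1] "Hello"
```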

Some caveats:

  • For efficiency, the function concatenates the strings in x rather than looping over them. It throws an error if the concatenation would exceed 2^31-1 bytes, which is the maximum string size allowed by R.

    x <- strrep(letters[1:2], 0x1p+30)
    log2(sum(nchar(x))) # 31
    utf8ToBase(x, 3)
    
    Error in paste(x, collapse = "") : result would exceed 2^31-1 bytes
    
  • The largest Unicode code point is 0x10FFFF. The binary representation of this number exceeds 2^53 when interpreted as decimal, so it cannot be stored in a double vector without loss of precision:

    x <- sub("^0+", "", paste(rev(as.integer(intToBits(0x10FFFF))), collapse = ""))
    x
    ## [1] "100001111111111111111"
    sprintf("%.0f", as.double(x))
    ## [1] "100001111111111114752"
    

    As a defensive measure, the function warns if 2^53 is exceeded when b = 2 and double = TRUE.

    utf8ToBase("\U10FFFF", b = 2, double = TRUE)
    
    [[1]]
    [1] 1.000011e+20
    
    Warning message:
    In utf8ToBase("\U{10ffff}", b = 2, double = TRUE) :
      binary result not guaranteed due to loss of precision
    
    utf8ToBase("\U10FFFF", b = 2, double = FALSE)
    
    [[1]]
    [1] "100001111111111111111"
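
The reason only b = 2 needs the warning can be checked with a little arithmetic. The sketch below (variable names are illustrative, not from the function above) counts the digits of the largest code point in each supported base. A d-digit string read as a decimal number is below 10^d, so the double result is certainly exact whenever 10^d <= 2^53:

```r
max_cp <- 0x10FFFF  # largest Unicode code point
bases <- 2:10

# Number of base-b digits of max_cp, counted by repeated integer division
n_digits <- vapply(bases, function(b) {
    n <- 1L
    q <- max_cp %/% b
    while (q > 0) {
        n <- n + 1L
        q <- q %/% b
    }
    n
}, integer(1L))

# TRUE where every possible result is below 2^53 when read as decimal
safe <- 10^n_digits <= 2^53

data.frame(base = bases, digits = n_digits, safe = safe)
```

Under this bound, base 2 (21 digits) is the only base from 2 to 10 that is not guaranteed to be exact, which matches the `b == 2` check in the function.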
    
Mikael Jagan

You can use cwhmisc::int2B:

library(cwhmisc)
int2B(utf8ToInt(S0), 3)[[1]] |> as.numeric()
# [1]  2200 10202 11000 11000 11010  1012 10002 11022 10121 10200
Maël