I am working with Twitter data that I then feed into an HTML document. The text often contains special characters, such as emojis, that aren't properly encoded for HTML. For example, the tweet:

If both #AvengersEndgame and #Joker are nominated for Best Picture, it will be Marvel vs DC for the first time in a Best Picture race. I think both films deserve the nod, but the Twitter discourse leading up to the ceremony will be

would become:

If both #AvengersEndgame and #Joker are nominated for Best Picture, it will be Marvel vs DC for the first time in a Best Picture race. I think both films deserve the nod, but the Twitter discourse leading up to the ceremony will be 🔥 🔥 🔥

when fed into an HTML document.

Working manually, I could use a tool like https://www.textfixer.com/html/html-character-encoding.php to encode the tweet so that it looks like:

If both #AvengersEndgame and #Joker are nominated for Best Picture, it will be Marvel vs DC for the first time in a Best Picture race. I think both films deserve the nod, but the Twitter discourse leading up to the ceremony will be "&#55357";"&#56613"; "&#55357";"&#56613"; "&#55357";"&#56613";

which I could then feed to an HTML document and have the emojis show up. Is there a package or function in R that can take text and HTML-encode it like the web tool above does?

Noah Olsen

1 Answer

Here's a function that encodes the non-ASCII characters of a single string as numeric HTML entities.

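entity_encode <- function(x) {
  # x must be a single string: utf8ToInt() is not vectorised
  cp <- utf8ToInt(x)                     # Unicode code points of x
  rr <- vector("character", length(cp))
  ucp <- cp > 127                        # TRUE for non-ASCII (ASCII is 0-127)
  # non-ASCII: numeric character reference such as &#128293;
  rr[ucp] <- paste0("&#", cp[ucp], ";")
  # ASCII: keep the character as it is
  rr[!ucp] <- intToUtf8(cp[!ucp], multiple = TRUE)
  paste0(rr, collapse = "")
}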

This returns

[1] "If both #AvengersEndgame and #Joker are nominated for Best Picture, it will be Marvel vs DC for the first time in a Best Picture race. I think both films deserve the nod, but the Twitter discourse leading up to the ceremony will be &#128293; &#128293; &#128293;"

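for your input. The web tool writes each emoji as its two UTF-16 surrogate halves (55357 and 56613), while this function writes the single Unicode code point 128293 (U+1F525). Both refer to the same character, and the single code point is the form that HTML numeric character references are meant to use.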
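How the two outputs relate can be checked directly, and the same function can be applied to a whole vector of tweets. The following is a minimal sketch, assuming UTF-8 input with no `NA` values. `entity_encode_vec` and `to_surrogates` are hypothetical helper names, not functions from any package. `entity_encode` is restated (with the ASCII cut-off at 127) only so that the block runs on its own.

```r
# Restated so the block is self-contained: encode non-ASCII code points
# as numeric character references and leave ASCII untouched.
entity_encode <- function(x) {
  cp <- utf8ToInt(x)
  rr <- character(length(cp))
  ucp <- cp > 127
  rr[ucp] <- paste0("&#", cp[ucp], ";")
  rr[!ucp] <- intToUtf8(cp[!ucp], multiple = TRUE)
  paste0(rr, collapse = "")
}

# Hypothetical helper: utf8ToInt() takes one string at a time,
# so apply entity_encode() to each element of a character vector.
entity_encode_vec <- function(x) {
  vapply(enc2utf8(x), entity_encode, character(1), USE.NAMES = FALSE)
}

# Hypothetical helper: split a code point above U+FFFF into its
# UTF-16 surrogate pair (the two numbers the web tool prints).
to_surrogates <- function(cp) {
  v <- cp - 0x10000
  c(high = 0xD800 + v %/% 0x400, low = 0xDC00 + v %% 0x400)
}

tweets <- c("will be \U0001F525", "plain ASCII")
entity_encode_vec(tweets)
# [1] "will be &#128293;" "plain ASCII"

to_surrogates(128293)
#  high   low
# 55357 56613
```

With a data frame, the same helper could be used on a text column, for example `df$text_html <- entity_encode_vec(df$text)`, where `df` and `text` are placeholder names.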

MrFlick
  • Referenced in https://stackoverflow.com/questions/74020467/string-replacing-utf-special-characters-like-f-in-a-dataframe/74021191#74021191 . There still doesn't seem to be a package that can substitute all the human-readable HTML entities, like changing `&` to `&amp;`, even though this seems like a very easy and obvious thing to create. Or is it more niche than I imagine? – Allan Cameron Oct 10 '22 at 22:31
  • @AllanCameron Just today I stumbled upon the [textutils package](https://github.com/enricoschumann/textutils) with the function `HTMLencode()`, which seems to do just this (though I have not investigated it thoroughly) – rove Oct 12 '22 at 06:21
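
For the few characters that are reserved in HTML, the named-entity substitution discussed in these comments can be approximated in base R. This is only a sketch covering `&`, `<`, `>` and `"`, not the full set of named entities, and `html_escape_basic` is a hypothetical name rather than a function from textutils or any other package.

```r
# Hypothetical helper: replace the HTML-reserved characters with their
# named entities. "&" is handled first so that the ampersands introduced
# by the later replacements are not escaped a second time.
html_escape_basic <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  x <- gsub(">", "&gt;", x, fixed = TRUE)
  gsub('"', "&quot;", x, fixed = TRUE)
}

html_escape_basic('Tom & Jerry <3 "cartoons"')
# [1] "Tom &amp; Jerry &lt;3 &quot;cartoons&quot;"
```

Because `gsub()` is vectorised, this works on a whole character vector. If combined with numeric encoding of non-ASCII characters, it would need to run first, since the numeric references themselves contain `&`.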