Handling count of characters with diacritics in R

Question

I'm trying to count the characters in strings that contain characters with diacritics, but I can't get the right result.

> x <- "n̥ala"
> nchar(x)
[1] 5

What I want to get is 4, since n̥ should be considered one character (i.e. diacritics shouldn't be counted as characters on their own, even when more than one diacritic is stacked on a base character).

How can I get this kind of result?

In which language are these diacritics used? Maybe you can find the right encoding and set it. – SabDeM, May 30 '15 at 19:22
It is the International Phonetic Alphabet, so no particular language, and virtually any combination is possible. – Stefano, May 30 '15 at 19:30
This might work, but I'm not experienced with encodings and I have no idea if it's suitable for other special characters... `nchar(gsub("", "", enc2native("n̥ala")))` — Molx, May 30 '15 at 19:33
Same comment as above, this looks better and doesn't read the stuff: `nchar(iconv("n̥ala", to="ASCII", sub=""))` — Molx, May 30 '15 at 19:41

SabDeM · Accepted Answer · 2015-05-30T20:32:50.183

Here is my solution. The idea is that phonetic alphabets have a Unicode representation, so:

Use the Unicode package; it provides the function Unicode_alphabetic_tokenizer, which its documentation describes as follows:

Tokenization first replaces the elements of x by their Unicode character sequences. Then, the non-alphabetic characters (i.e., the ones which do not have the Alphabetic property) are replaced by blanks, and the corresponding strings are split according to the blanks.

After this I applied nchar, but because the tokenizer splits the string into two substrings, I summed the results.

library(Unicode)
sum(nchar(Unicode_alphabetic_tokenizer(x)))
[1] 4

I believe this package can be very useful in such cases, but I am not an expert and I do not know whether my solution works for all problems that involve phonetic alphabets. More examples would help to check its validity.

It works well

Here is another example:

> x <- "e̯ ʊ̯"
> x
[1] "e̯ ʊ̯"
> nchar(x)
[1] 5
> sum(nchar(Unicode_alphabetic_tokenizer(x)))
[1] 2

p.s. there is only one " in the code but copying and pasting it, the second one appears. I do not know why this happens.

Thanks for this. Actually, this solution wouldn't fit my bigger problem, which is tabulating the occurrences of phones (a phone is here a character plus diacritics if present) in a string. With this method, [n] and [n̥] would be counted as two instances of the same phone, which is not desirable. I'll open a new question stating exactly the tabulating problem. — Stefano, May 30 '15 at 21:16
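
For the tabulating problem described in the comment above, one possible approach in base R is to split the string into phones with a Unicode-property regex, treating each non-mark character plus any combining marks that follow it as one unit. This is only a sketch, assuming the string is UTF-8 and the diacritics are encoded as combining marks (Unicode general category M); the helper name split_phones is made up for illustration.

```r
# Hypothetical helper (not from any package): split a string into "phones",
# where a phone is one non-mark character plus any combining marks after it.
split_phones <- function(s) {
  # \P{M}  = one character that is not a combining mark
  # \p{M}* = zero or more combining marks attached to it
  regmatches(s, gregexpr("\\P{M}\\p{M}*", s, perl = TRUE))[[1]]
}

x <- "n\u0325ana"        # "n̥ana", written with an escape for portability
phones <- split_phones(x)
length(phones)           # number of phones
table(phones)            # [n̥] and [n] are tabulated separately
```

Note that spaces and punctuation are also returned as units by this pattern, so they may need to be filtered out before tabulating.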

score 1 · Answer 2 · answered May 31 '15 at 03:18

Here's a solution using the qdap package that I maintain:

x <- "n̥ala"

library(qdap)
character_count(x)
## [1] 4

answered May 31 '15 at 03:18

Tyler Rinker

score 0 · Answer 3 · edited Jun 20 '20 at 09:12

You could do workarounds. Here's one:

dia.count <- function(string) {
  # split into individual code points, then keep only ASCII letters and digits
  y <- unlist(strsplit(string, ''))
  length(grep('[A-Za-z0-9]', y, value = TRUE))
}
dia.count(x)
[1] 4

Methods that deal directly with character encoding are preferable; this is, again, a workaround. In the general case, there may be packages or combinations of functions that address the issue comprehensively.

Update

Here is another workaround provided by comment:

nchar(gsub('[^A-Za-z]+', '', x))
[1] 4

The dia.count function counts the ASCII letters and digits in the string. The added one-liner does the opposite: it removes every character that is not an ASCII letter and counts what remains. Credit: @akrun

The best I could find in the stringi package is stri_enc_toascii, which gives:

library(stringi)
stri_enc_toascii(x)
[1] "n\032ala"

Given that output, removing everything but ASCII letters gives the desired result.

nchar(gsub('[^A-Za-z]', '', stri_enc_toascii(x)))
[1] 4

A nice balance between a general answer and a quick script is found in the comments:

nchar(iconv("n̥ala", to="ASCII", sub=""))
[1] 4

This uses the base R function iconv, which converts the string for you and drops whatever cannot be represented in ASCII. Credit: @Molx

answered May 30 '15 at 19:41

Pierre L

Though, I am not sure if this is a general workaround. I think `stringi` has some options; also, I felt the `iconv` in the comments may be more general. – akrun May 30 '15 at 19:52
I will add it, but the commenter can still add it as an answer if they choose. – Pierre L May 30 '15 at 20:03
This won't work if the base characters aren't ASCII (e.g. `ŋ̥ala`), which is quite common in IPA. – lenz May 30 '15 at 22:46
the examples are all based on non-ASCII cases @lenz – Pierre L May 31 '15 at 01:21
yes sure, @plafort, but `nchar(iconv("ŋ̥ala", to="ASCII", sub=""))` will give you 3 instead of 4 (note the small hook on the first character `ŋ`) – lenz May 31 '15 at 06:48
The output and code are right there. Maybe we're talking past each other. The script you mentioned above provides the desired output as posted. Why would I show `4` as the output if it is `3` in truth, as you say? Are we talking about the same thing? – Pierre L May 31 '15 at 09:57
Your code works fine for the example in the question, where all base characters are plain ASCII. However, a typical IPA string may very well contain non-ASCII base characters, as shown in my modified example in the comment. I'm trying to say that your solution is not generally applicable. – lenz Jun 02 '15 at 07:58
Just try this example: `nchar(iconv("bæd", to="ASCII", sub=""))`. It returns 2, not 3. – lenz Jun 02 '15 at 11:25
When I said 'addresses' I was not implying that it solves general cases. Please reread my answer in its entirety. @lenz – Pierre L Jun 02 '15 at 11:47
When you say a regex like `[^A-Za-z]` will sub out "everything but letters", then you are implying that characters like `ŋ`, `æ`, or `ɛ` are not letters, which is clearly wrong. I see that you mention the need of other packages for a more comprehensive solution; if that's what you mean by "addressing" general cases, then I understand your point. – lenz Jun 02 '15 at 12:22
Yes exactly @lenz, "Methods for dealing directly with character encoding is preferable. This is again, a workaround." I understand completely that there are many situations that a hack will not solve, especially when it is written to solve a specific problem. I find character encoding interesting. How would you get a character count with the special characters? – Pierre L Jun 02 '15 at 12:48
The answer by SabDeM seems to solve this. Or you can use [character categories](https://en.wikipedia.org/wiki/Unicode_character_property#General_Category), as in the answer to [this question](http://stackoverflow.com/questions/30551549/tabulating-characters-with-diacritics-in-r), which uses them in a regex. – lenz Jun 02 '15 at 13:11
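
The character-category idea from the last comment can be sketched in base R as follows: remove every combining mark (Unicode general category M) with a PCRE regex and count what remains. This assumes UTF-8 input with diacritics encoded as combining marks; count_base_chars is a made-up name for illustration, not a function from any package.

```r
# Hypothetical helper: count characters, ignoring combining diacritics.
count_base_chars <- function(s) {
  # \p{M} matches any combining mark; perl = TRUE enables Unicode properties
  nchar(gsub("\\p{M}", "", s, perl = TRUE), type = "chars")
}

count_base_chars("n\u0325ala")        # "n̥ala" -> 4
count_base_chars("\u014b\u0325ala")   # "ŋ̥ala", non-ASCII base character -> 4
count_base_chars("b\u00e6d")          # "bæd" -> 3
```

Unlike the ASCII-based workarounds, this keeps non-ASCII base characters such as ŋ and æ. Spaces and punctuation still count as characters here.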
