Retrieve broken Vietnamese string variables in R

Asked Aug 24 '20 at 12:07

Active Aug 24 '20 at 13:18

Viewed 86 times

I have a dataset from Vietnam, but when I read it into R, the string variables are imported incorrectly. I used stri_trans_general from the stringi package, but it works on only a few columns.
I checked the raw dataset, and it seems those few columns were broken when the dataset was exported. They look like this, for example:

"Du?c ch?t m?i"

I thus obtain strings with ? or > characters instead of actual words.

How can I repair these strings in order to obtain the correct Vietnamese words?
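One way to confirm what the raw file actually contains is to inspect the bytes of a damaged value. The sketch below uses only base R and the example string from the question; the variable names are made up for illustration. If every byte is plain ASCII and the question marks are literal `3f` bytes, the accented characters were most likely replaced at export time rather than merely displayed wrongly in R.

```r
# Example value copied from the question.
x <- "Du?c ch?t m?i"

# charToRaw() exposes the underlying bytes; a literal question mark is 0x3f (63).
byte_values <- as.integer(charToRaw(x))
n_literal_qm <- sum(byte_values == 63L)

# TRUE when every byte is in the 7-bit ASCII range, i.e. no accented
# character survives anywhere in the string.
ascii_only <- all(byte_values < 128L)

print(n_literal_qm)
print(ascii_only)
```

If the bytes in the raw file turn out not to be literal question marks, re-importing with an explicit encoding (for instance the `fileEncoding` argument of `read.csv`) may be worth trying before any repair.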

edited Aug 24 '20 at 13:18

Adriaan


asked Aug 24 '20 at 12:07

drhnis


If the data is lost on the source, there's nothing you can do. – Braiam Aug 24 '20 at 13:16

You could try to fix it by validating the words with missing characters against a dictionary and guesstimating the correct Vietnamese word. My hunch is that the characters that have dropped out are the "special" ones, i.e. those with Vietnamese diacritics. This is not foolproof, though, and requires a very extensive dictionary (to cover all forms of a verb, for instance). – Adriaan Aug 24 '20 at 13:19
@Braiam: Yes, I agree with you. – drhnis Aug 25 '20 at 13:26
@Adriaan: That is great advice and a good workaround. – drhnis Aug 25 '20 at 13:26
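
The dictionary idea from the comments could be sketched roughly as follows, in base R. Everything here is an assumption for illustration: the five-word dictionary, its hand-written unaccented `ascii` keys, the function name `candidates`, and the premise that each `?` stands for exactly one lost character while other letters may have lost their diacritics. In practice the key column might be derived with `stringi::stri_trans_general(word, "Latin-ASCII")`.

```r
# Hypothetical mini dictionary: Vietnamese words (written with \u escapes so
# the code does not depend on the file encoding) next to an unaccented key.
dictionary <- data.frame(
  word  = c("D\u01b0\u1ee3c", "\u0110\u01b0\u1ee3c", "ch\u1ea5t",
            "m\u1edbi", "m\u1ed9t"),
  ascii = c("Duoc", "Duoc", "chat", "moi", "mot"),
  stringsAsFactors = FALSE
)

# Return every dictionary word whose unaccented key fits the damaged token.
# Assumes the token contains only letters and "?"; each "?" becomes a
# single-character wildcard in an anchored regular expression.
candidates <- function(token, dict) {
  pattern <- paste0("^", gsub("?", ".", token, fixed = TRUE), "$")
  dict$word[grepl(pattern, dict$ascii)]
}

# Split the damaged string from the question into tokens and look each one up.
tokens <- strsplit("Du?c ch?t m?i", " ", fixed = TRUE)[[1]]
result <- lapply(tokens, candidates, dict = dictionary)
names(result) <- tokens

print(result)
```

With this toy dictionary, `Du?c` has two candidates, which illustrates why the approach is not foolproof: choosing between them needs context or manual review.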


0 Answers