How to convert non-English characters into a string that can be read and on which string similarity can be performed?

Question

This is a JavaScript/Node/Express problem. I am working on a backend that scores recognized speech for similarity to a reference sentence. With English there is no problem and I get a proper score. The front end, where the actual speech recognition is done, also works as it should: I tested it on its own, without the backend, and the algorithm gives the right score for all languages.

However, I run into problems when I pair this front end script with a backend script that does the scoring for languages that use non-Latin characters and reads the reference sentences from text files. (When testing in the front end, I did not use data from text files; I used data that was in the HTML on that page. I stored the reference string as a data attribute and pointed to that data attribute in the script.)

This is the backend script:

app.post('/k2', ensureAuthenticated, async function (req, res) {
  const client = await mongoose.connect(MONGO_URI, function (err, db) {
    if (err) throw new Error(err);
    var stringMain = req.body.transcript;
    var username = req.body.username;
    var field = req.body.fieldName;
    var task = req.body.question;
    var updateField = { $set: {} };
    updateField.$set[field + '.' + task] = { "question": task, "value": 100 };
    var textGet = task + "i.txt";
    var finalAns;
    fs.readFile(textGet, function (err, data) {
      if (err) {
        // Stop here: data is undefined when the read fails.
        console.log("error reading the text file");
        return res.status(500).end();
      }
      var secondString = data.toString(); //THIS MAY BE WHERE THE PROBLEMS START
      var thresh = Math.ceil((stringSimilarity.compareTwoStrings(stringMain, secondString)) * 10) * 10;
      var textGet2 = task + "ii.txt";
      fs.readFile(textGet2, function (err, data) {
        if (err) {
          console.log("error opening and reading second file");
          return res.status(500).end();
        }
        var secondString2 = data.toString();
        var thresh2 = Math.ceil((stringSimilarity.compareTwoStrings(stringMain, secondString2)) * 10) * 10;
        // Keep the better of the two scores.
        if (thresh >= thresh2) {
          finalAns = thresh;
        } else {
          finalAns = thresh2;
        }
        if (finalAns === 100) {
          db
            .collection("users")
            .findOne({ username: username }, function (errors, result) {
              if (errors) throw errors;
              if (result) {
                db.collection("users")
                  .updateOne({ username: username }, updateField);
                res.end(JSON.stringify({ "transcript": finalAns }));
              } else {
                console.log("Player not found");
              }
            });
        } else {
          res.end(JSON.stringify({ "transcript": finalAns }));
        }
      });
    });
  });
});

I have text files that contain the reference sentences to compare the transcript with; they have the ".txt" extension. The transcript comes from the front end and is compared to the data in two text files, and the best match is taken as the final answer (a score), which is then sent to the front end using AJAX. I made these files using Apple's TextEdit (I'm using a Mac), and as far as I know, the encoding is UTF-8.

For English, it works perfectly and I am able to get a score, as I have said. However, when I test it out with non-English languages that use a different character set, for example, Japanese or Korean, I keep getting scores of 0.

The problem is I don't know where to begin to solve the problem. I suspect it has to do with the system interpreting non-Latin characters as Latin characters, so when it does the string similarity, the result is 0.

I looked this up and came across some references:

JavaScript has a Unicode problem
JavaScript String fromCharCode() Method
What every JavaScript developer should know about Unicode
Pan Lab (an online tool for encoding Korean characters into UTF-8)

However, I am getting more and more confused, as the things I've tried are not working. I tried calling String.normalize() on the sentence read from the .txt files (the "data" in the script) just before the string similarity comparison. I also used the Pan Lab online encoder to turn the Korean letters into UTF-8 and stored the resulting characters in the text file.

(The Korean characters look like this after the encoding: %�%��%��)

I need some pointers about where to go if a neat solution that answers everything can't be provided.

I don't know where to start.

I vaguely suspect that the data.toString(); call is where the problem is. I also don't know which encoding ("utf-8", "EUC", etc.) I should be using for the non-English language files (Japanese and Korean for now, but other languages later).

Either I am storing them in the wrong format, or I have to encode the strings using some function before I can do the string similarity comparison, or both.

I am using two text files for a comparison of a single transcript. Both the text files are the same in most cases. I mention this because there is a "languagei.txt" and a "languageii.txt" in the script. I take the higher score as the final score ("finalAns").

By the way, I have set the correct language in the front end speech recognition script, so that is not the problem. Additionally the transcript I send to the backend appears to be in the right language, as the transcript (including non-English transcripts) shows up properly in the front end (I display the transcript by using innerHTML).

Honeybear65 · Answer 1 · 2021-03-28T04:09:16.147

I found the solution by accident. I put the HEX (UTF-8) encoding (from the Pan Lab site) for a Korean string into a text file to test whether this would work. It didn't. But when I was saving the text files containing the strange-looking characters, e.g., "%�%��%��", a message popped up which said,

This document can no longer be saved using its original Korean (Windows, DOS) encoding. Please choose another encoding (such as UTF-8).

I saved the text file as a UTF-8 encoded file, then put the normal Korean string into the file and saved it again.

This time it worked fine.

I tested with the text files that had not been changed, and the scoring function still didn't work, and the console.log showed output from the text files that looked like this: "�ȳ�".

So it looked very much like the encoding of the text files was the problem.
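A quick way to confirm this kind of suspicion from Node is to decode the raw file bytes strictly. The sketch below is not part of the original script: `isValidUtf8` is a hypothetical helper, and the byte values are an assumption about what a file holding 안녕 in a legacy Korean encoding (EUC-KR) would contain.

```javascript
// Hypothetical helper: reports whether a Buffer holds well-formed UTF-8.
function isValidUtf8(buffer) {
  try {
    // With fatal: true, decode() throws a TypeError on malformed input
    // instead of silently substituting U+FFFD replacement characters.
    new TextDecoder('utf-8', { fatal: true }).decode(buffer);
    return true;
  } catch (err) {
    return false;
  }
}

// Assumed bytes for "안녕" saved in EUC-KR rather than UTF-8.
const legacyBytes = Buffer.from([0xbe, 0xc8, 0xb3, 0xe7]);
const utf8Bytes = Buffer.from('안녕', 'utf8');

console.log(isValidUtf8(legacyBytes)); // false
console.log(isValidUtf8(utf8Bytes));   // true
console.log(legacyBytes.toString());   // lenient decoding yields mojibake
```

A check like this could run once at startup over every reference file, so a wrongly encoded file is reported before it produces a score of 0.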

This site (If characters aren’t displayed correctly in TextEdit on Mac) tells you (if you're a Mac user) how to change the encoding to the correct one. In my case, changing the encoding to utf-8 solved the problem.

In the TextEdit app on your Mac, choose File > Open, then select the file (don’t open it).

Click Options in the lower-left corner of the window.

Click the Plain Text Encoding pop-up menu and choose an encoding.

If you don’t see the encoding you want, choose Customize Encodings List, then select the encodings to include.

Click Open.
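For more than a handful of files, the same re-encoding could probably be scripted instead of done by hand in TextEdit. This is a minimal sketch, not something from my original setup: `transcodeToUtf8` and `convertFileToUtf8` are hypothetical helpers, the file name in the usage comment is made up, and the `'euc-kr'` label assumes a Node build with full ICU data (otherwise `TextDecoder` throws a `RangeError` for that label).

```javascript
const fs = require('fs');

// Hypothetical helper: decodes bytes from a known source encoding and
// returns the same text re-encoded as UTF-8.
function transcodeToUtf8(buffer, sourceEncoding) {
  // fatal: true throws on byte sequences that are invalid in the source
  // encoding instead of silently inserting replacement characters.
  const text = new TextDecoder(sourceEncoding, { fatal: true }).decode(buffer);
  return Buffer.from(text, 'utf8');
}

// Hypothetical wrapper: rewrites one file in place as UTF-8.
function convertFileToUtf8(path, sourceEncoding) {
  const original = fs.readFileSync(path);
  fs.writeFileSync(path, transcodeToUtf8(original, sourceEncoding));
}

// Example (assumed file name and source encoding):
// convertFileToUtf8('koreani.txt', 'euc-kr');
```

Because the wrapper overwrites the file, keeping a backup copy first is a sensible precaution; running it on a file that is already UTF-8 would either throw or corrupt the text.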

I don't remember how the files ended up in the Korean (Windows, DOS) encoding, but my system might use that by default when saving .txt files from TextEdit.

From the Apple site above:

By default, TextEdit uses its automatic text encoding to display documents. If characters aren’t appearing correctly, try choosing a different encoding when you open the file.

I don't have to do any conversion of the characters using Pan Lab, etc (although testing with conversion using Pan Lab was how I stumbled onto the solution), I just have to make sure the .txt files are encoded correctly (with utf-8 encoding) when I save the files.
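On the Node side, `data.toString()` on a `Buffer` already decodes as UTF-8 by default, so no extra conversion step is needed once the files are saved correctly. As a hedged sketch of how the reference text could be prepared before the comparison (the helper name `prepareReference` is hypothetical, and the BOM stripping and NFC normalization are extra precautions, not something the fix above required):

```javascript
// Hypothetical helper: turns the raw bytes of a UTF-8 reference file
// into a string ready for similarity scoring.
function prepareReference(buffer) {
  return buffer
    .toString('utf8')        // explicit, although 'utf8' is the default
    .replace(/^\uFEFF/, '')  // drop a leading byte order mark, if any
    .normalize('NFC')        // compose decomposed Hangul jamo or kana marks
    .trim();                 // drop a trailing newline added by the editor
}

const withBom = Buffer.from('\uFEFF안녕하세요\n', 'utf8');
console.log(prepareReference(withBom) === '안녕하세요'); // true
```

Applying the same `normalize('NFC')` and `trim()` to the transcript from the front end would keep both sides of the comparison in the same form.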

Tip: For Mac users, if you want TextEdit files to be saved with UTF-8 encoding by default, do this: TextEdit > Preferences > Open and Save > Plain Text Encoding > Opening/Saving

Inserting console.log checks of output in many steps of the algorithm helped isolate where the problem was. When the strings were encoded incorrectly in the TextEdit file, console.log showed this kind of output: "%�%��%��".

After correcting the encoding, console.log showed human-readable output for the text files, both before and after the script processed the text file data.
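Logged strings can look right or wrong depending on the terminal, so logging the underlying bytes and code points may be less ambiguous than logging the text itself. A small sketch of such a check (the helper `describeText` is hypothetical and only for illustration):

```javascript
// Hypothetical debugging helper: shows the raw bytes and the decoded
// code points of a Buffer side by side.
function describeText(buffer) {
  const text = buffer.toString('utf8');
  return {
    hex: buffer.toString('hex'),
    codePoints: Array.from(text, (ch) =>
      'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
    ),
    // U+FFFD in the decoded text usually means the bytes were not UTF-8.
    hasReplacementChar: text.includes('\uFFFD'),
  };
}

console.log(describeText(Buffer.from('안녕', 'utf8')));
```

Calling it on the `data` buffer inside the `fs.readFile` callback, before `toString()`, would show whether the file bytes themselves are the problem.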
