This is a JavaScript/Node.js/Express problem. I am working on a backend for speech recognition; specifically, it scores a speech transcript for similarity to a reference sentence. With English there is no problem and I get a proper score. The front end, where the actual speech recognition is done, also works as it should: I tested it on its own, without the backend, and the algorithm gives the right score for all languages.
However, I run into problems when I use this front-end script together with a backend script that does the scoring for languages that use non-Latin characters, with the reference sentences read from text files. (When testing in the front end only, I did not use text files; the reference string was stored in the page's HTML as a data attribute, and the script read that attribute.)
This is the backend script:
app.post('/k2', ensureAuthenticated, async function(req, res) {
    const client = await mongoose.connect(MONGO_URI, function (err, db){
        if (err) throw new Error(err);
        var stringMain = req.body.transcript;
        var username = req.body.username;
        var field = req.body.fieldName;
        var task = req.body.question;
        var updateField = {$set:{}};
        updateField.$set[field+'.'+task]= {"question":task, "value":100};
        var textGet = task+"i.txt";
        var finalAns;
        fs.readFile(textGet, function(err, data) {
            if (err) console.log("error reading the text file");
            var secondString = data.toString(); //THIS MAY BE WHERE THE PROBLEMS START
            var thresh = Math.ceil((stringSimilarity.compareTwoStrings(stringMain, secondString))*10)*10;
            var textGet2 = task+"ii.txt";
            fs.readFile(textGet2, function(err, data) {
                if (err) console.log("error opening and reading second file");
                var secondString2 = data.toString();
                var thresh2 = Math.ceil((stringSimilarity.compareTwoStrings(stringMain, secondString2))*10)*10;
                if (thresh >=thresh2){
                    finalAns = thresh;
                } else {
                    finalAns = thresh2
                };
                if (finalAns === 100) {
                    db
                        .collection("users")
                        .findOne({username: username}, function (errors, result){
                            if (errors) throw errors;
                            if (result){
                                db.collection("users")
                                    .updateOne({username:username}, updateField);
                                res.end(JSON.stringify({"transcript":finalAns}));
                            }
                            else {console.log("Player not found")};
                        });
                } else {
                    res.end(JSON.stringify({"transcript":finalAns}));
                }
            });
        });
    });
});
The reference sentences that the transcript is compared with are stored in text files with the ".txt" extension. The transcript comes from the front end and is compared to the contents of two text files; the better match is taken as the final answer (a score), which is then sent back to the front end using AJAX. I made these files using Apple's TextEdit (I'm on a Mac), and as far as I know, the encoding is UTF-8.
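One way to check the "as far as I know, the encoding is UTF-8" assumption is to look at the first raw bytes of a file before calling toString() on it. The following is only a minimal sketch: `sniffEncoding` is a hypothetical helper name, and it recognizes nothing beyond byte-order marks and an RTF header (TextEdit can also save rich text). A result of 'unknown' therefore does not prove the file is UTF-8.

```javascript
// Hypothetical helper: guess an encoding from the first bytes of a Buffer.
// It only knows byte-order marks (BOMs) and the RTF signature.
function sniffEncoding(buf) {
  if (buf.length >= 3 && buf[0] === 0xef && buf[1] === 0xbb && buf[2] === 0xbf) {
    return 'utf8-bom';
  }
  if (buf.length >= 2 && buf[0] === 0xff && buf[1] === 0xfe) return 'utf16le';
  if (buf.length >= 2 && buf[0] === 0xfe && buf[1] === 0xff) return 'utf16be';
  // Rich-text files begin with the ASCII characters {\rtf
  if (buf.subarray(0, 5).toString('latin1') === '{\\rtf') return 'rtf';
  return 'unknown'; // no marker found: could be UTF-8 without BOM, EUC-KR, ...
}

// In-memory examples; with a real file the Buffer would come from
// fs.readFileSync(path) called without an encoding argument.
console.log(sniffEncoding(Buffer.from('\uFEFF한국어', 'utf16le')));
console.log(sniffEncoding(Buffer.from('{\\rtf1\\ansi', 'latin1')));
console.log(sniffEncoding(Buffer.from('한국어', 'utf8')));
```

If the helper reports 'utf16le' or 'rtf' for one of the reference files, re-saving the file as plain UTF-8 text would be the first thing to try.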
For English, it works perfectly and I get a score, as I said. However, when I test it with languages that use a different character set, for example Japanese or Korean, I keep getting scores of 0.
The problem is that I don't know where to begin. I suspect the system is interpreting the non-Latin characters as Latin characters, so the string similarity comes out as 0.
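A way to test that suspicion is to log the Unicode code points of both strings on the backend, just before they are compared. This is a debugging sketch only; `codePoints` is a hypothetical helper name. If the reference string from the file shows values such as U+00ED where Hangul or kana code points are expected, the bytes were decoded with the wrong encoding.

```javascript
// Hypothetical debugging helper: list the Unicode code points in a string.
function codePoints(str) {
  // Array.from iterates by code point, so surrogate pairs stay together.
  return Array.from(str, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
}

// Correctly decoded Hangul: two code points in the Hangul Syllables block.
console.log(codePoints('한국'));

// The same UTF-8 bytes wrongly decoded as Latin-1: six unrelated code points.
const misdecoded = Buffer.from('한국', 'utf8').toString('latin1');
console.log(codePoints(misdecoded));
```

In the handler, this could be called on both the transcript and the file contents, for example `console.log(codePoints(stringMain), codePoints(secondString))`.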
I looked this up and came across some references:
- JavaScript has a Unicode problem
- JavaScript String fromCharCode() Method
- What every JavaScript developer should know about Unicode
- Pan Lab (an online tool for encoding Korean characters into UTF-8)
However, I am getting more and more confused, as nothing I've tried works. I tried calling String.prototype.normalize() on the sentence from the .txt files (the "data" in the script) just before the string similarity comparison. I also used the Pan Lab online encoder to turn the Korean letters into UTF-8 and stored the encoded characters in the text file.
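Normalization only helps if both sides of the comparison go through it, not just the file contents. Below is a minimal sketch under that assumption; `prepare` is a hypothetical helper name, and whether differing Unicode forms are the actual cause here is not established. It also strips a leading byte-order mark and surrounding whitespace, since editors often add a trailing newline.

```javascript
// Hypothetical helper: put a string into one canonical form before scoring.
function prepare(str) {
  return str
    .replace(/^\uFEFF/, '') // drop a leading byte-order mark, if any
    .normalize('NFC')       // compose sequences, e.g. Hangul jamo into syllables
    .trim();                // drop leading/trailing whitespace and newlines
}

const composed = '\uD55C';               // "한" as one precomposed syllable
const decomposed = '\u1112\u1161\u11AB'; // the same syllable as three jamo

console.log(composed === decomposed);                   // false
console.log(prepare(composed) === prepare(decomposed)); // true
```

In the handler this would mean comparing `prepare(stringMain)` with `prepare(secondString)` rather than the raw strings.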
(The Korean characters look like this after the encoding: %�%��%��)
If a neat solution that answers everything can't be provided, I would appreciate some pointers about where to look.
I vaguely think that the data.toString(); call is where the problem lies. I also don't know which encoding ("utf-8", "EUC", etc.) I should be using for the non-English language files (Japanese and Korean for now, but other languages later).
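For context, calling toString() on a Node.js Buffer without an argument decodes the bytes as UTF-8, and bytes that are not valid UTF-8 are replaced rather than reported. A stricter decode can make such a mismatch visible. The sketch below uses the global TextDecoder with `fatal: true`; `decodeUtf8Strict` is a hypothetical helper name, and the sketch assumes a Node.js build with ICU support, which as far as I know is the default for official builds.

```javascript
// Hypothetical helper: decode a Buffer as UTF-8 and fail loudly on bad bytes.
function decodeUtf8Strict(buf) {
  // fatal: true makes decode() throw a TypeError instead of silently
  // substituting U+FFFD; a leading UTF-8 BOM is stripped by default.
  return new TextDecoder('utf-8', { fatal: true }).decode(buf);
}

const good = Buffer.from('日本語', 'utf8');
const bad = Buffer.from('日本語', 'utf16le'); // same text, different encoding

console.log(decodeUtf8Strict(good)); // 日本語
try {
  decodeUtf8Strict(bad);
} catch (err) {
  console.log('not valid UTF-8:', err.message);
}
```

If the reference files decode cleanly this way, storing them as UTF-8 is probably fine and the cause lies elsewhere; if decoding throws, the files are in some other encoding and need to be re-saved or decoded accordingly.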
Either I am storing them in the wrong format, or I have to encode the strings using some function before I can do the string similarity comparison, or both.
I am using two text files for a comparison of a single transcript. Both the text files are the same in most cases. I mention this because there is a "languagei.txt" and a "languageii.txt" in the script. I take the higher score as the final score ("finalAns").
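The "take the higher of the two scores" step can be sketched separately from the file reading and database code. This is an illustration rather than the code from the script: `toScore` and `bestScore` are hypothetical names, and `compare` is a stand-in parameter for stringSimilarity.compareTwoStrings so that the example runs without the library installed.

```javascript
// Hypothetical helper: round a 0..1 similarity up to the next multiple of 10,
// mirroring Math.ceil(similarity * 10) * 10 from the script.
function toScore(similarity) {
  return Math.ceil(similarity * 10) * 10;
}

// Score the transcript against every reference and keep the highest score.
// `compare` stands in for stringSimilarity.compareTwoStrings.
function bestScore(transcript, references, compare) {
  return Math.max(...references.map((ref) => toScore(compare(transcript, ref))));
}

// Toy comparison for illustration only: 1 for an exact match, otherwise 0.
const exactMatch = (a, b) => (a === b ? 1 : 0);

console.log(bestScore('こんにちは', ['こんにちは', 'こんばんは'], exactMatch)); // 100
console.log(bestScore('さようなら', ['こんにちは', 'こんばんは'], exactMatch)); // 0
```

Keeping the scoring in a plain function like this also makes it possible to test it with Japanese or Korean strings typed directly into the source, which separates a scoring problem from a file-encoding problem.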
By the way, I have set the correct language in the front-end speech recognition script, so that is not the problem. Additionally, the transcript I send to the backend appears to be in the right language, as the transcript (including non-English transcripts) shows up properly in the front end (I display the transcript using innerHTML).