
I'm reading text from a big file and writing some parts of it into a new text file:

var ws = fs.createWriteStream('output.txt', {flags: 'w', encoding: 'utf8'});
for (var i = 0; i < words.length; i++) {
    ws.write(words[i][0].toString() + "\t" + words[i][1].toString() + "\n");
}
ws.close();

However, when I try to open the created file, the editor (EDIT: xed on Linux) refuses to open it. It says that something is wrong with the encoding. What can I do? Should I sanitize the string before writing? If so, how would I do that? Which symbols are problematic for a write stream?

user3776738
  • Can you post the error message? – Dennis Apr 12 '23 at 16:31
  • The opened file contains some invalid characters. If you continue editing, you may render the document unusable. You can also choose another character encoding and try again. (choosing another encoding does nothing) – user3776738 Apr 12 '23 at 16:32
  • (The above error message has been translated with deepl.com) If I log with console.log(words[i][0].toString() + "\t" + words[i][1].toString() + "\n") everything looks fine. So don't know where to look. – user3776738 Apr 12 '23 at 16:38
  • This seems like elemental debugging is needed to examine exactly what is in the data you are writing, even if you have to look at each individual byte to see what's going on. Clearly you have something other than simple characters in the content. And, calling `.toString()` isn't cleaning up the data for you. – jfriend00 Apr 12 '23 at 16:58
  • ok, how would I clean the data from all "forbidden" symbols? I need all printable ASCII symbols. everything else I don't care. – user3776738 Apr 12 '23 at 17:17
  • That depends upon what the data actually is. You need to know what the data is and how its encoded in order to know how to interpret it properly and convert it to what you need it to be. There is no reliable way to blindly convert unknown data to something meaningful. You should go back to the source and find out what exactly it is sending you and then that will make it clearer what the best and reliable way is to convert it to just text. – jfriend00 Apr 12 '23 at 18:48
  • For example, you would treat the data completely differently if it was 16-bit unicode vs. UTF-8 vs. true binary data that contains some text strings and contains other binary data and you just want to keep the text strings. – jfriend00 Apr 12 '23 at 18:51
  • The input is encoded in US-ASCII. It's The Pile in JSONL format, looking like this: {"text": "It is done, and submitted. You can play \u201cSurvival of the Tastiest\u201d on Android, and on the web. Playing on the web works, but you have to simulate multi-touch for table... "\u201c", for example, should just be written as it is, not as the symbol it represents. – user3776738 Apr 12 '23 at 19:05
  • Uhhh, `\u201c` signifies that the actual final content is NOT US-ASCII. That's a unicode escape sequence trying to represent something that is not US-ASCII. In fact `\u201c` is a left double quote character (which isn't something present in US-ASCII). US-ASCII contains only straight double quotes that aren't left or right leaning. – jfriend00 Apr 12 '23 at 21:04
  • Yes, but \, u, 2, 0, 1 and c are all ASCII symbols, so there should be no problem, right? I just want the text, not what it represents in UTF-8. – user3776738 Apr 12 '23 at 21:37

1 Answer


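By default, fs.createWriteStream() uses the utf8 encoding, which can represent every Unicode character, so the write stream itself does not reject or mangle valid text. What can go wrong is the content of the strings: if they were decoded from bytes with the wrong encoding, or if they contain control characters (for example NUL), the output file may contain bytes your editor does not expect, and it may show them as Chinese symbols or other unintelligible characters, or refuse to open the file at all.

Switching the output to a different encoding such as utf16le is unlikely to help, because utf8 already covers everything utf16le can express and the editor would then have to guess that encoding instead. It is usually more reliable to inspect the strings you are writing and, if you only want printable ASCII, to filter or escape everything else before calling write().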
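To find out what is actually in the data, it can help to list every character that falls outside printable ASCII together with its position. The following is a minimal sketch; findSuspiciousChars is a hypothetical helper name, and treating tab, line feed and carriage return as harmless is an assumption that may not match what your editor tolerates.

```javascript
// Hypothetical helper: report every UTF-16 code unit that is not
// printable ASCII (0x20-0x7E). Tab, LF and CR are assumed to be fine.
function findSuspiciousChars(str) {
  const hits = [];
  for (let i = 0; i < str.length; i++) {
    const code = str.charCodeAt(i);
    const isPrintable = code >= 0x20 && code <= 0x7e;
    const isWhitespace = code === 0x09 || code === 0x0a || code === 0x0d;
    if (!isPrintable && !isWhitespace) {
      // Record the position and the code unit in U+XXXX notation.
      const hex = code.toString(16).toUpperCase().padStart(4, '0');
      hits.push({ index: i, hex: 'U+' + hex });
    }
  }
  return hits;
}

// Example: a NUL character and a left double quotation mark are reported.
console.log(findSuspiciousChars('ok\u0000\u201c\n'));
```

Running this on each line before writing it should show whether the problem is non-ASCII text, stray control characters, or something else.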

Bonus: You can check whether your string contains non-ASCII characters with the code snippet below.

function hasNonASCIIChars(str) {
  for (let i = 0; i < str.length; i++) {
    const code = str.charCodeAt(i);
    if (code > 127) {
      return true;
    }
  }
  return false;
}
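If the goal is an output file that contains only printable ASCII, one option is to rewrite every other character as a literal \uXXXX escape instead of dropping it. This is a sketch under assumptions, not the only way to do it: toAsciiEscaped is a hypothetical helper name, and keeping tab and line feed unescaped is a choice made here because the question uses them as separators.

```javascript
// Hypothetical helper: keep printable ASCII, tab and line feed as-is and
// replace every other UTF-16 code unit with a literal \uXXXX escape.
// Characters outside the BMP become two escapes (a surrogate pair),
// which is the same convention JSON uses.
function toAsciiEscaped(str) {
  return str.replace(/[^\x20-\x7E\t\n]/g, (ch) =>
    '\\u' + ch.charCodeAt(0).toString(16).padStart(4, '0')
  );
}

// Example: curly quotes and a NUL character are turned into escapes.
console.log(toAsciiEscaped('\u201cHi\u201d\u0000'));
```

In the loop from the question, each line could be passed through this function before it is handed to ws.write(), so that control characters are escaped as well as non-ASCII text.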
Dennis
  • No, my text does not contain any Chinese symbols. The editor just shows them as such. It is just plain English text. – user3776738 Apr 12 '23 at 16:43
  • OK, it seems that there are non-ASCII characters. I used str.replace(/[^\x00-\x7F]/g, " "), but there is still a problem with xed now. – user3776738 Apr 12 '23 at 17:03