5

I want to read a user's file and give them back a modified version of it. I use an input of type file to get the text file, but how can I determine the charset of the loaded file? It can vary from case to case. The uploaded file is a .txt or something similar, not .html :)

var handler = document.getElementById('handler');
var reader = new FileReader();

handler.addEventListener('click', function() {
    reader.readAsText(firstSub.files[0], /* Here I need to pass the correct charset */);
});

reader.addEventListener("loadend", function() {
    console.dir(reader.result.split('\n'));
});

3 Answers

5

In my case (I made a small web app that accepts .srt subtitle files and removes time codes and line breaks to produce printable text), it was enough to handle two encodings: UTF-8 and CP1251. In every case I tried, with both Latin and Cyrillic letters, these two were sufficient. I first try to decode the file as UTF-8. If that fails, the invalid bytes come out as '�' (U+FFFD, the replacement character). So I check the result for these characters and, if any are found, repeat the read with CP1251. Here is my code:

function onFileInputChange(inputDomElement, utf8 = true) {
    const file = inputDomElement.files[0];
    const reader = new FileReader();
    // Attach the handler before starting the read.
    reader.onload = () => {
        const result = reader.result;
        // U+FFFD (the replacement character) means some bytes were not valid UTF-8.
        if (utf8 && result.includes('\uFFFD')) {
            console.log('The file encoding is not UTF-8! Trying CP1251...');
            onFileInputChange(inputDomElement, false);
        } else {
            // Build a title from the file name: drop the extension, turn underscores into spaces.
            const title = file.name.replace(/\.(srt|txt)$/, '').replace(/_+/g, ' ').toUpperCase();
            document.querySelector('#textarea1').value = title + '\n' + result;
        }
    };
    reader.readAsText(file, utf8 ? 'UTF-8' : 'CP1251');
}
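
A variation on the same try-UTF-8-then-fall-back idea, as a rough sketch rather than what the app above actually does: instead of scanning the decoded text for '�', you can decode the raw bytes with a strict TextDecoder, which throws on invalid input when created with fatal: true. The helper name decodeWithFallback and its default encoding list are my own assumptions for illustration.

```javascript
// Hypothetical helper (not from the answer above): tries each encoding in
// order and returns the first one that decodes the bytes without errors.
function decodeWithFallback(bytes, encodings = ['utf-8', 'windows-1251']) {
    for (const encoding of encodings) {
        try {
            // fatal: true makes decode() throw a TypeError on invalid byte
            // sequences instead of silently inserting U+FFFD.
            const text = new TextDecoder(encoding, { fatal: true }).decode(bytes);
            return { encoding, text };
        } catch (err) {
            // Bytes are not valid in this encoding (or the label is not
            // supported by this runtime); try the next candidate.
        }
    }
    return null; // none of the candidates decoded cleanly
}
```

With a file input this could be called as decodeWithFallback(new Uint8Array(await file.arrayBuffer())). Order matters: a single-byte encoding such as windows-1251 accepts almost any byte sequence, so it only makes sense as the last candidate, after UTF-8 has been ruled out.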
Roman Karagodin
2

You should check out this library: encoding.js.

They also have a working demo. I would suggest you first try it out with the files that you'll typically work with to see if it detects the encoding correctly and then use the library in your project.

Chetan Jadhav CD
0

The other solutions didn't work for what I was trying to do, so I decided to create my own module that can detect the charset and language of any file loaded via input[type='file'] / FileReader API.

You load it via the <script> tag and then use the languageEncoding function to retrieve the charset/encoding:

// index.html

<script src="https://unpkg.com/detect-file-encoding-and-language/umd/language-encoding.min.js"></script>
// app.js

languageEncoding(file).then(fileInfo => console.log(fileInfo));
// Possible result: { language: english, encoding: UTF-8, confidence: { language: 0.96, encoding: 1 } }

For a more complete example/instructions check out this part of the documentation!

gignu