
So here's my problem. I'm trying to read a file encoded in Windows-1252 that contains bytes that are not valid in that encoding. If we look at this:

https://en.wikipedia.org/wiki/Windows-1252

We can observe that codepoints 129, 141, 143, 144 and 157 are not valid, that is, they don't represent any character. But the characters (the bytes) are still there and I need to read them.

In VB.NET, if I read the file like so:

Dim str As String = File.ReadAllText(filePath,System.Text.Encoding.GetEncoding("Windows-1252"))

Then I get something like:

‘0*qYªI" & ChrW(141) & ChrW(141) & "#´xXVzAÍ" & ChrW(157) & "Ä’¾Ä" & ChrW(141) & "e5b2©wÔ¤x–&¥®-1­]¬ŠvVco‡|kC®i

Where you can see that characters that are not valid are represented by their real values (ChrW(141) and ChrW(157)) in the file, even if they are not printable. But if I do this in Java:

String str = FileUtils.readFileToString(new File(pathToFile), "Windows-1252");

The value that I obtain for those characters when reading the file is 63, which is the character "?". From what I understand of https://stackoverflow.com/a/2147968, Java notices that the byte is not valid for that encoding and puts a replacement character ("?") in its place.

My question is: how can I get the real values when reading the text, even if they are not valid? Is there a way to stop Java from inserting replacement characters when it reads invalid bytes? Am I missing something else?
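
For illustration, here is a minimal sketch of one way to keep the raw byte values: decode with ISO-8859-1 instead of Windows-1252. ISO-8859-1 maps every byte 0..255 to the code point with the same number, so nothing is replaced and the original bytes can be recovered. The class and method names (`Latin1RoundTrip`, `decodeLossless`, `encodeLossless`) are hypothetical, and the sample bytes are made up rather than taken from the real file.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {
    /** Decodes each byte to the char with the same numeric value (0..255). */
    public static String decodeLossless(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    /** Inverse of decodeLossless: recovers the original bytes. */
    public static byte[] encodeLossless(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Made-up sample: "k=" followed by two bytes that Windows-1252 leaves undefined.
        byte[] raw = {'k', '=', (byte) 141, (byte) 157};
        String text = decodeLossless(raw);
        System.out.println((int) text.charAt(2));                     // prints 141
        System.out.println(Arrays.equals(raw, encodeLossless(text))); // prints true
    }
}
```

For a real file, the bytes could come from `Files.readAllBytes(Paths.get(pathToFile))`. The trade-off is that bytes Windows-1252 does define in the 128..159 range (for example 145, the left single quote) also come out as their raw values instead of the Windows-1252 characters, so this suits the case where only the byte values matter.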

starkspc
  • Read bytes, and transform each byte to a char (or anything you want) by yourself. – JB Nizet Aug 30 '19 at 21:36
  • I thought of that, but the file is a .ini file with a lot of content that has nothing to do with me; I just need to read one key of the .ini file. Also, I can't modify the file, just read it. I was trying to avoid reading everything as a byte() if possible. – starkspc Aug 30 '19 at 21:48
  • Still, if the encoding is not correct, read bytes until you find your key. It's not slower. – Petter Friberg Aug 30 '19 at 21:53
  • It's not a matter of "noticing that the character is not valid". You want a String - this demands that the input be converted from its declared encoding into UTF-16, and **there is no conversion** defined for the -1252 codepoints you mention. E.g., byte 140 in the input is converted to the character U+0152; byte 142 is converted to U+017D. What should be done with byte 141? –  Aug 30 '19 at 22:10
  • The problem I'm having is that the information is lost, or rather replaced by "63": when I tell Java to put those bytes into a Windows-1252 string it replaces the byte 141, so I can't know it was originally a 141; now I only have 63. I expected something like what VB.NET does: it doesn't represent the character graphically, because that's not possible, but the original information is still there so I can operate with it. – starkspc Aug 30 '19 at 22:27
  • Are you **sure** the file is encoded in [Windows-1252](https://en.wikipedia.org/wiki/Windows-1252#Character_set)? If it was, it shouldn't have had those bytes, because, as you said, they are not valid characters. Perhaps the file is encoded in code page [**437**](https://en.wikipedia.org/wiki/Code_page_437#Character_set), or maybe [**850**](https://en.wikipedia.org/wiki/Code_page_850#Character_set), where those bytes *are* valid. In Java, specify charset `"IBM437"` or `"IBM850"` to load with one of those code pages. – Andreas Aug 30 '19 at 22:49
  • To my understanding, it is: if I open the file with Notepad or Notepad++ it says it's ANSI. Besides, as I said, if I read the file in VB.NET with "File.ReadAllText(filePath,System.Text.Encoding.GetEncoding("Windows-1252"))" the characters are recognized correctly. And you're right, the file shouldn't have those bytes; the characters are not even visible, but as I said they are still there, and I still need to read them. – starkspc Aug 30 '19 at 23:10
  • I could do as @JBNizet suggested and read the whole file as a byte() and parse the piece that I need, but I was wondering if there is a way to prevent Java from doing that replacement when reading text. – starkspc Aug 30 '19 at 23:12
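
The "read bytes and transform each byte yourself" suggestion from the comments could look roughly like the sketch below. It decodes as Windows-1252 but passes the five undefined bytes (129, 141, 143, 144, 157) through as chars with the same numeric value, which matches the `ChrW(141)` / `ChrW(157)` output shown for VB.NET. The class name `LenientCp1252` and its `decode` method are hypothetical, and the sketch assumes the JDK provides the `windows-1252` charset, as the `FileUtils` call in the question already does.

```java
import java.nio.charset.Charset;

public class LenientCp1252 {
    // One char per possible byte value, built once.
    private static final char[] TABLE = buildTable();

    private static char[] buildTable() {
        Charset cp1252 = Charset.forName("windows-1252");
        char[] table = new char[256];
        for (int i = 0; i < 256; i++) {
            // Decode each single byte with the regular Windows-1252 decoder.
            table[i] = new String(new byte[] {(byte) i}, cp1252).charAt(0);
        }
        // Override the five bytes Windows-1252 leaves undefined:
        // keep their numeric value instead of a replacement character.
        for (int b : new int[] {0x81, 0x8D, 0x8F, 0x90, 0x9D}) {
            table[b] = (char) b;
        }
        return table;
    }

    /** Decodes bytes as Windows-1252, passing undefined bytes through unchanged. */
    public static String decode(byte[] bytes) {
        char[] chars = new char[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            chars[i] = TABLE[bytes[i] & 0xFF]; // mask to get an unsigned index
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        // Made-up sample: 'A', undefined byte 141, then byte 140 (U+0152 in Windows-1252).
        String s = decode(new byte[] {'A', (byte) 141, (byte) 140});
        System.out.println((int) s.charAt(1)); // prints 141
        System.out.println((int) s.charAt(2)); // prints 338, i.e. U+0152
    }
}
```

The file contents would still have to be read as bytes first, for instance with `Files.readAllBytes`, but the result is an ordinary `String` that can be searched for the one .ini key that is needed.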

0 Answers