4

As far as I know, Excel use UTF-16 to represent string literals. I read from a console (Mac) / file (Windows), and in both cases the character encoding is messed up. I have to find a solution which works on both platforms, so ADO stream is not an option. I made some debugging and I see that actual bytes are:

Bytes     | Displayed as | Should be | Correct byte
258,129   | Ă           | Á         | 193
258,356   | ĂŤ           | Í         | 205
313,176   | Ű           | Ű         | 219
313,144   | Ĺ           | Ő         | 213
258,347   | Ăś           | Ü         | 220
258,8211  | Ă–           | Ö         | 214
258,353   | Ăš           | Ú         | 218
258,8220  | Ă“           | Ó         | 211
258,8240  | É           | É         | 201

(Comes from the good-old hungarian test-phrase, ÁRVÍZTŰRŐ TÜKÖRFÚRÓGÉP which contains all of our special characters). I am looking for an algorithm which results in the correct string both on Mac and Windows. Thanks!

Attila
  • 118
  • 1
  • 8
  • 1
    Mojibake: file is UTF-8, interpreted in `cp1250` (Central Europe). Proof in Python: `'ÁRVÍZTŰRŐ TÜKÖRFÚRÓGÉP'.encode('utf-8').decode('cp1250','replace')` yields `'Ă�RVĂŤZTŰRĹ� TĂśKĂ–RFĂšRĂ“GÉP'`. Reverse way is impossible, IMHO. Open file using proper encoding. – JosefZ Oct 28 '20 at 23:58
  • On Windows I could use ADODB.Stream which works well. On Mac I use popen/fread from libc.dylib which returns the string like this. Any idea how to read UTF-8 from the console? – Attila Oct 29 '20 at 04:22

4 Answers4

2

None of the answers posted so far will correctly transcode an input string containing codepoints from the full Unicode range like for example "‍‍‍‍‍‍‍‍‍UnicodeSupportest‍‍‍‍‍♀️‍♂️‍❤️‍‍♀️".

That's why I wrote the following function, using only VBA inbuilt functions/statements that are available on both, Windows and MacOS.

This function works cross-platform and cross-app, and for the entire Unicode range.
codePoints > 65535 are also supported, even though VBAs inbuilt ChrW() and AscW don't support them, because the transcoding is done entirely "manually", including surrogate pairs. Performance should also be relatively good since the function works on a single byte-array buffer. If someone finds a bug or improvement, please let me know!

This code was improved as a result of this answer on CodeReview, many thanks to Cristian Buse for that!

'Function transcoding an UTF-8 encoded string to the VBA-native UTF-16-LE
'Author: Guido Witt-Dörring, https://stackoverflow.com/a/75787820/12287457
'                            https://github.com/guwidoe/VBA-StringTools
Public Function DecodeUTF8(ByVal utf8str As String, _
                  Optional ByVal raiseErrors As Boolean = False) As String
    Const methodName As String = "DecodeUTF8"
    Dim i As Long, j As Long, k As Long, numBytesOfCodePoint As Byte
    Static numBytesOfCodePoints(0 To 255) As Byte
    Static mask(2 To 4) As Long, minCp(2 To 4) As Long
    If numBytesOfCodePoints(0) = 0 Then
        For i = &H0& To &H7F&: numBytesOfCodePoints(i) = 1: Next i '0xxxxxxx
        '110xxxxx - C0 and C1 are invalid (overlong encoding)
        For i = &HC2& To &HDF&: numBytesOfCodePoints(i) = 2: Next i
        For i = &HE0& To &HEF&: numBytesOfCodePoints(i) = 3: Next i '1110xxxx
       '11110xxx - 11110100, 11110101+ (= &HF5+) outside of valid Unicode range
        For i = &HF0& To &HF4&: numBytesOfCodePoints(i) = 4: Next i
        For i = 2 To 4: mask(i) = (2 ^ (7 - i) - 1): Next i
        minCp(2) = &H80&: minCp(3) = &H800&: minCp(4) = &H10000
    End If
    
    Dim utf8() As Byte, utf16() As Byte, codePoint As Long, currByte As Byte
    utf8 = utf8str
    ReDim utf16(0 To (UBound(utf8) - LBound(utf8) + 1) * 2)
    
    i = LBound(utf8): j = 0
    Do While i <= UBound(utf8)
        codePoint = utf8(i)
        numBytesOfCodePoint = numBytesOfCodePoints(codePoint)
        
        If numBytesOfCodePoint = 0 Then
            If raiseErrors Then Err.Raise 5, methodName, "Invalid byte"
            GoTo insertErrChar
        ElseIf numBytesOfCodePoint = 1 Then
            utf16(j) = codePoint
            j = j + 2
        ElseIf i + numBytesOfCodePoint - 1 > UBound(utf8) Then
            If raiseErrors Then _
                Err.Raise 5, methodName, _
                    "Incomplete UTF-8 codepoint at end of string."
            GoTo insertErrChar
        Else
            codePoint = utf8(i) And mask(numBytesOfCodePoint)
            For k = 1 To numBytesOfCodePoint - 1
                currByte = utf8(i + k)
                If (currByte And &HC0&) = &H80& Then
                    codePoint = (codePoint * &H40&) + (currByte And &H3F)
                Else
                    If raiseErrors Then _
                        Err.Raise 5, methodName, "Invalid continuation byte"
                    GoTo insertErrChar
                End If
            Next k
            'Convert the Unicode codepoint to UTF-16LE bytes
            If codePoint < minCp(numBytesOfCodePoint) Then
                If raiseErrors Then _
                    Err.Raise 5, methodName, "Overlong encoding"
                GoTo insertErrChar
            ElseIf codePoint < &HD800& Then
                utf16(j) = CByte(codePoint And &HFF&)
                utf16(j + 1) = CByte(codePoint \ &H100&)
                j = j + 2
            ElseIf codePoint < &HE000& Then
                If raiseErrors Then _
                    Err.Raise 5, methodName, _
                "Invalid Unicode codepoint.(Range reserved for surrogate pairs)"
                GoTo insertErrChar
            ElseIf codePoint < &H10000 Then
                If codePoint = &HFEFF& Then GoTo nextCp '(BOM - will be ignored)
                utf16(j) = codePoint And &HFF&
                utf16(j + 1) = codePoint \ &H100&
                j = j + 2
            ElseIf codePoint < &H110000 Then 'Calculate surrogate pair
                Dim m As Long, lowSurrogate As Long, highSurrogate As Long
                m = codePoint - &H10000 '(m \ &H400&) =most sign. 10 bits of m
                highSurrogate = &HD800& Or (m \ &H400&)
                lowSurrogate = &HDC00& Or (m And &H3FF) 'least sig. 10 bits of m
                utf16(j) = highSurrogate And &HFF&
                utf16(j + 1) = highSurrogate \ &H100&
                utf16(j + 2) = lowSurrogate And &HFF&
                utf16(j + 3) = lowSurrogate \ &H100&
                j = j + 4
            Else
                If raiseErrors Then _
                    Err.Raise 5, methodName, _
                        "Codepoint outside of valid Unicode range"
insertErrChar:  utf16(j) = &HFD: utf16(j + 1) = &HFF: j = j + 2
                If numBytesOfCodePoint = 0 Then numBytesOfCodePoint = 1
            End If
        End If
nextCp: i = i + numBytesOfCodePoint 'Move to the next UTF-8 codepoint
    Loop
    DecodeUTF8 = MidB$(utf16, 1, j)
End Function
GWD
  • 3,081
  • 14
  • 30
0

As I had to solve this, I came up with the following function which can successfully transform the characters which are between 128 and 255

Private Function utf8ToUTF16(ByVal strText As String) As String
    Dim i&, l1%, l2%, l3%
    For i = 1 To Len(strText)
        l1 = Asc(Mid(strText, i, 1))
        If i <> Len(strText) Then l2 = Asc(Mid(strText, i + 1, 1))
        Select Case l1
        Case 194
            utf8ToUTF16 = utf8ToUTF16 & WorksheetFunction.Unichar(l2): i = i + 1
        Case 195
            utf8ToUTF16 = utf8ToUTF16 & WorksheetFunction.Unichar(l2 + &H40): i = i + 1
        Case 197
            utf8ToUTF16 = utf8ToUTF16 & WorksheetFunction.Unichar(l2 + &HC0): i = i + 1
        Case 203
            utf8ToUTF16 = utf8ToUTF16 & WorksheetFunction.Unichar(l2 + &H240): i = i + 1
        Case 226
            If l2 = 128 Then
                l3 = Asc(Mid(strText, i + 2, 1))
                utf8ToUTF16 = utf8ToUTF16 & WorksheetFunction.Unichar(l3 + &H1F80)
                i = i + 2
            ElseIf l2 = 130 Then
                l3 = Asc(Mid(strText, i + 2, 1))
                utf8ToUTF16 = utf8ToUTF16 & WorksheetFunction.Unichar(l3 + &H2000)
                i = i + 2
            End If
        Case Else
            utf8ToUTF16 = utf8ToUTF16 & Chr(l1)
        End Select
    Next i
End Function

Now passing "ĂRVĂŤZTŰRĹ TĂśKĂ–RFĂšRĂ“GÉP" to this function (which is read from a standard UTF-8 encoded file) will return "ÁRVÍZTŰRŐ TÜKÖRFÚRÓGÉP".

  • Note: this is not be the most efficient code, for sure. Whenever I call it I always use it on the shortest string possible. Currently I use it to decode a result from cURL, and passing the whole HTML froze it.

EDIT

Now I had some time to clean up this.

Private Function utf8ToUTF16(ByVal strText As String) As String
    Dim i&, l1&, l2&, l3&, l4&, l&
    For i = 1 To Len(strText)
        l1 = Asc(Mid(strText, i, 1))
        If i + 1 <= Len(strText) Then l2 = Asc(Mid(strText, i + 1, 1))
        If i + 2 <= Len(strText) Then l3 = Asc(Mid(strText, i + 2, 1))
        If i + 3 <= Len(strText) Then l4 = Asc(Mid(strText, i + 3, 1))
        Select Case l1
        Case 1 To 127
            l = l1
        Case 194 To 223
            l = ((l1 And &H1F) * 2 ^ 6) Or (l2 And &H3F)
            i = i + 1
        Case 224 To 239
            l = ((l1 And &HF) * 2 ^ 12) Or ((l2 And &H3F) * 2 ^ 6) Or (l3 And &H3F)
            i = i + 2
        Case 240 To 255
            l = ((l1 And &H7) * 2 ^ 18) Or ((l2 And &H3F) * 2 ^ 12) Or ((l3 And &H3F) * 2 ^ 6) Or (l4 And &H3F)
            i = i + 4
        Case Else
            l = 63 ' question mark
        End Select
        utf8ToUTF16 = utf8ToUTF16 & IIf(l < 55296, WorksheetFunction.Unichar(l), "?")
    Next i
End Function

As I realized, characters above 55295 (D7FF) will not appear, so it will output a question mark instead as a placeholder.

Attila
  • 118
  • 1
  • 8
0

I found this simple VBA code worked well for Spanish accented characters in a text file. Given a string with the UTF-8 double-characters it returns a string with accented characters:

Function UTF8to16(str As String) As String
Dim position As Long, strConvert As String, codeReplace As Integer, strOut As String

strOut = str
position = InStr(strOut, Chr(195))

If position > 0 Then
    Do Until position = 0
        strConvert = Mid(strOut, position, 2)
        codeReplace = Asc(Right(strConvert, 1))
        If codeReplace < 255 Then
            strOut = Replace(strOut, strConvert, Chr(codeReplace + 64))
        Else
            strOut = Replace(strOut, strConvert, Chr(34))
        End If
        position = InStr(strOut, Chr(195))
    Loop
End If

UTF8to16 = strOut

End Function
Tim_A
  • 9
  • 1
-1

This works for me

Function utf8ToUTF16(ByVal strText As String) As String
    Dim binText() As Byte, i As Long
    'size the binary buffer
    ReDim binText(Len(strText) - 1 + 3)
    'insert BOM in 0,1,2 positions
    binText(0) = &HEF
    binText(1) = &HBB
    binText(2) = &HBF
    'append text characters
    For i = 1 To Len(strText)
        binText(i + 2) = Asc(Mid(strText, i, 1))
    Next
    'write to a binary stream
    With CreateObject("ADODB.Stream")
        .Type = 1
        .Open
        .write binText
        'convert to the text stream
        .Flush
        .Position = 0
        .Type = 2
        .Charset = "utf-8"
     'read the result skipping BOM
        .Position = 3
        utf8ToUTF16 = .ReadText
        .Close
    End With
End Function
em pe
  • 1
  • 1
  • Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Aug 16 '22 at 11:06
  • 1
    As I stated in the original post: "I have to find a solution which works on both platforms, so ADO stream is not an option.". It works on Windows indeed, but this is not available on Mac. – Attila Aug 21 '22 at 12:02