Why is Haskell/unpack messing with my bytes?

Question

I've built a tiny UDP/protobuf transmitter and receiver. I've spent the morning trying to track down why the protobuf decoding was producing errors, only to find that it was the transmitter (Spoke.hs) which was sending incorrect data.

The code used unpack to turn Lazy.ByteStrings into Strings that the Network package will send. I found unpack in Hoogle. It may not be the function I'm looking for, but its description looks suitable: "O(n) Converts a ByteString to a String."

Spoke.hs produces the following output:

chris@gigabyte:~/Dropbox/haskell-workspace/hub/dist/build/spoke$ ./spoke
45
45
["a","8","4a","6f","68","6e","20","44","6f","65","10","d2","9","1a","10","6a","64","6f","65","40","65","78","61","6d","70","6c","65","2e","63","6f","6d","22","c","a","8","35","35","35","2d","34","33","32","31","10","1"]

While wireshark shows me that the data in the packet is:

0a:08:4a:6f:68:6e:20:44:6f:65:10:c3:92:09:1a:10:6a:64:6f:65:40:65:78:61:6d:70:6c:65:2e:63:6f:6d:22:0c:0a:08:35:35:35:2d:34:33:32:31:10

The length (45) is the same from Spoke.hs and Wireshark.

Wireshark is missing the last byte (value 0x01), and a run of bytes in the middle is different (and one byte longer in Wireshark).

"65","10","d2","9" in Spoke.hs vs 65:10:c3:92:09 in Wireshark.

As 0x10 is DLE, it struck me that there's probably some escaping going on, but I don't know why.

I have many years of trust in Wireshark and only a few tens of hours of Haskell experience, so I've assumed that it's the code that's at fault.

Any suggestions appreciated.

-- Spoke.hs:

module Main where

import Data.Bits
import Network.Socket -- hiding (send, sendTo, recv, recvFrom)
-- import Network.Socket.ByteString
import Network.BSD
import Data.List
import qualified Data.ByteString.Lazy.Char8 as B
import Text.ProtocolBuffers.Header (defaultValue, uFromString)
import Text.ProtocolBuffers.WireMessage (messageGet, messagePut)
import Data.Char (ord, intToDigit)
import Numeric

import Data.Sequence ((><), fromList)

import AddressBookProtos.AddressBook
import AddressBookProtos.Person
import AddressBookProtos.Person.PhoneNumber
import AddressBookProtos.Person.PhoneType

data UDPHandle = 
     UDPHandle {udpSocket  :: Socket,
                udpAddress :: SockAddr}
opensocket :: HostName             -- ^ Remote hostname, or localhost
           -> String               -- ^ Port number or name
           -> IO UDPHandle         -- ^ Handle to use for logging
opensocket hostname port =
    do -- Look up the hostname and port.  Either raises an exception
       -- or returns a nonempty list.  First element in that list
       -- is supposed to be the best option.
       addrinfos <- getAddrInfo Nothing (Just hostname) (Just port)
       let serveraddr = head addrinfos

       -- Establish a socket for communication
       sock <- socket (addrFamily serveraddr) Datagram defaultProtocol

       -- Save off the socket, and server address in a handle
       return $ UDPHandle sock (addrAddress serveraddr)

john = Person {
  AddressBookProtos.Person.id = 1234,
  name = uFromString "John Doe",
  email = Just $ uFromString "jdoe@example.com",
  phone = fromList [
    PhoneNumber {
      number = uFromString "555-4321",
      type' = Just HOME
    }
  ]
}

johnStr = B.unpack (messagePut john)

charToHex x = showIntAtBase 16 intToDigit (ord x) ""

main::IO()
main = 
    do udpHandle <- opensocket "localhost" "4567"
       sent <- sendTo (udpSocket udpHandle) johnStr (udpAddress udpHandle)
       putStrLn $ show $ length johnStr
       putStrLn $ show sent
       putStrLn $ show $ map charToHex johnStr
       return ()

The documentation I see for the bytestring package lists `unpack` as converting a `ByteString` to `[Word8]`, which is not the same as a `String`. I would expect some byte difference between `ByteString` and `String` because `String` is Unicode data while `ByteString` is just an efficient array of bytes, but `unpack` shouldn't be able to produce a `String` in the first place. — Matthew Walton, Jun 15 '12 at 12:29
Can you use network-bytestring, to avoid the redundant data conversions? — Don Stewart, Jun 15 '12 at 12:41
@MatthewWalton: `unpack` from `Data.ByteString.Char8`, or the lazy variant, output `String`s. They aren't Unicode-aware though. — John L, Jun 15 '12 at 12:54
The Network package seems to be taking the characters you give it and utf-8-encoding them into a stream of bytes (and then truncating). — dave4420, Jun 15 '12 at 12:58
@Matthew Walton - Thanks, can you re-post your comment as an answer so that I can accept it? — fadedbee, Jun 15 '12 at 13:31
@Don Stewart - I just realised that I'm talking to the creator of the ByteString code and an author of RWH. Many thanks for taking the time to help me. — fadedbee, Jun 15 '12 at 14:19
@chrisdew I can and have, I just never expected it to be the actual answer really. More of an idle theory. — Matthew Walton, Jun 15 '12 at 15:11
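
Following up on Don Stewart's suggestion in the comments above, here is a minimal sketch of sending the bytes directly instead of going through a String. It assumes the `Network.Socket.ByteString` module from the network package and bytestring 0.10 or later (for `toStrict`). The payload is a stand-in for `messagePut john`, and `toDatagram` is a hypothetical helper name, not part of any library.

```haskell
module Main where

import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import Network.Socket
  ( AddrInfo (..), SocketType (Datagram), defaultHints, defaultProtocol
  , getAddrInfo, socket )
import qualified Network.Socket.ByteString as NSB

-- Hypothetical helper: flatten a lazy ByteString (what messagePut returns
-- in the question) into the strict ByteString that NSB.sendTo expects.
toDatagram :: BL.ByteString -> BS.ByteString
toDatagram = BL.toStrict

main :: IO ()
main = do
  -- Ask only for datagram addresses; host and port are taken from the question.
  let hints = defaultHints { addrSocketType = Datagram }
  addr : _ <- getAddrInfo (Just hints) (Just "localhost") (Just "4567")
  sock <- socket (addrFamily addr) Datagram defaultProtocol
  -- Stand-in payload; in Spoke.hs this would be (messagePut john).
  let payload = toDatagram (BL.pack [0x65, 0x10, 0xd2, 0x09, 0x01])
  -- The bytes go out as-is: no Char conversion, so no re-encoding.
  sent <- NSB.sendTo sock payload (addrAddress addr)
  print sent
```

Because the payload never becomes a `String`, there is nothing for a text encoding step to act on, so a byte such as 0xd2 should arrive unchanged.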

score 3 · Accepted Answer · answered Jun 15 '12 at 15:11

The documentation I see for the bytestring package lists unpack as converting a ByteString to [Word8], which is not the same as a String. I would expect some byte difference between ByteString and String because String is Unicode data while ByteString is just an efficient array of bytes, but unpack shouldn't be able to produce a String in the first place.

So you're probably falling foul of Unicode conversion here, or at least something's interpreting it as Unicode when the underlying data really isn't and that seldom ends well.
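
To illustrate the point about bytes being reinterpreted as Unicode, here is a small sketch using only the bytestring package. It shows that the `Char8` flavour of `unpack` turns each byte into the `Char` with the same code point, so byte 0xd2 becomes U+00D2. `charCodes` is a hypothetical helper name chosen for this example.

```haskell
module Main where

import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BLC
import Data.Char (ord)

-- Hypothetical helper: unpack via Char8 and report each Char's code point.
charCodes :: BL.ByteString -> [Int]
charCodes = map ord . BLC.unpack

main :: IO ()
main = do
  let bytes = BL.pack [0x10, 0xd2, 0x09]
  print (BL.unpack bytes)   -- the raw Word8 values: [16,210,9]
  print (charCodes bytes)   -- the same numbers, now as Char code points: [16,210,9]
```

The numbers are identical, but the second list describes characters rather than bytes. Anything downstream that encodes those characters as text is free to emit more than one byte for a code point above 0x7F.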

Matthew Walton

No, it also says `unpack :: ByteString -> [Char]` (I think String is an alias for [Char]). http://hackage.haskell.org/packages/archive/bytestring/latest/doc/html/Data-ByteString-Char8.html#v:unpack – fadedbee Jun 15 '12 at 15:13
That's `Data.ByteString.Char8` - I was looking in `Data.ByteString.Lazy`. Nonetheless, as John L pointed out in the comments on the question that's still not Unicode-aware. – Matthew Walton Jun 15 '12 at 15:24
It's definitely Unicode conversion: e.g. [code point D8 is C3 98 in UTF-8](http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=d8&mode=hex). That's why any value under 0x7F gets through unscathed. – rxg Jun 16 '12 at 13:38
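
To make the UTF-8 explanation concrete, here is a sketch using only base that encodes a `Char` by hand and applies it to the bytes from the question. `utf8Encode` is a hypothetical helper written for illustration; it is not taken from the network package, whose internals are not shown here.

```haskell
module Main where

import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)
import Numeric (showHex)

-- Hypothetical helper: encode a single Char as its UTF-8 byte sequence.
utf8Encode :: Char -> [Word8]
utf8Encode c
  | n < 0x80    = [fromIntegral n]                           -- ASCII: one byte, unchanged
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]                    -- two bytes
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]           -- three bytes
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]  -- four bytes
  where
    n = ord c
    -- Leading bits of the code point, placed in the first byte.
    hi s = fromIntegral (n `shiftR` s)
    -- A continuation byte: 10xxxxxx carrying six bits of the code point.
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

main :: IO ()
main = do
  -- The four values around the difference in the question: 65 10 d2 09.
  let chars = map chr [0x65, 0x10, 0xd2, 0x09]
  print (map (`showHex` "") (concatMap utf8Encode chars))
  -- prints ["65","10","c3","92","9"]
```

The output matches the 65:10:c3:92:09 seen in Wireshark. It is also consistent with the missing final byte: 45 characters encode to 46 bytes, and if only 45 of them are sent, the trailing 0x01 is the one that falls off.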

score 1 · Answer 2 · answered Sep 09 '15 at 11:04

I think you'll want toString and fromString from utf8-string instead of unpack and pack. This blog post was very helpful for me.

Beerend Lauwers
