Simplest way to remove BOM from Haskell ByteString

Question

I have a lazy `ByteString` that may start with a BOM. What is the easiest, and preferably efficient, way to remove the BOM from this `ByteString`?

Isn't the BOM just a special character? Does [`tail`](https://hackage.haskell.org/package/bytestring-0.10.8.2/docs/Data-ByteString-Lazy.html#v:tail) not work for this? – Alec Nov 18 '17 at 16:31
@Alec Well, first I need to check whether my string starts with a BOM or not. The BOM is 3 bytes (i.e. a size-3 list of `Word8`) and `head` has type `head :: ByteString -> Word8`. It's really strange that `head` returns only one byte while `tail` can remove several bytes. So I guess just `tail` won't work. Also, `tail` throws an exception from pure code if the given `ByteString` is empty, which is not what I want :) – Shersh Nov 18 '17 at 16:47
@Shersh So you know exactly what to do. Why not try that before asking this question? – AJF Nov 18 '17 at 16:55
Oops. Yeah, I see your problem. Check out the `utf8-string` package. You can check whether the bytestring is empty and, if it isn't, `uncons` it. Based on the first character you get back, you either return the tail (which you also get from `uncons`) or the initial bytestring. – Alec Nov 18 '17 at 16:56

Thomas M. DuBuisson · Accepted Answer · 2017-11-18T18:10:09.730

6

I feel like I must be misunderstanding the problem. Doesn't this boil down to checking the first three bytes of a bytestring and conditionally dropping those bytes?

To get the first 3 bytes, use `take`.
To check bytestring equality, use `(==)`.
To drop the first 3 bytes, use `drop`.

Putting these together we get:

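    import qualified Data.ByteString.Lazy as BS
    -- Drop a leading UTF-8 BOM (EF BB BF) if present.
    dropBOM :: BS.ByteString -> BS.ByteString
    dropBOM bs | BS.take 3 bs == BS.pack [0xEF,0xBB,0xBF] = BS.drop 3 bs
               | otherwise = bs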
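
The same check can also be written with `stripPrefix`, which combines the comparison and the drop. This is only a sketch: it assumes a `bytestring` version that exports `Data.ByteString.Lazy.stripPrefix` (0.10.8.0 or later), and the names `utf8BOM` and `dropUtf8BOM` are made up for illustration.

```haskell
import qualified Data.ByteString.Lazy as BL
import Data.Maybe (fromMaybe)

-- UTF-8 encoding of U+FEFF (the byte order mark).
utf8BOM :: BL.ByteString
utf8BOM = BL.pack [0xEF, 0xBB, 0xBF]

-- Hypothetical helper: drop a leading UTF-8 BOM if present,
-- otherwise return the input unchanged (including empty input).
dropUtf8BOM :: BL.ByteString -> BL.ByteString
dropUtf8BOM bs = fromMaybe bs (BL.stripPrefix utf8BOM bs)
```

Like the `take`/`drop` version, this is total: an empty or too-short input simply comes back unchanged.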

However, even after dealing with lots of UTF-8, I never felt I needed to handle the BOM explicitly, thanks to packages like `text` that provide most of the desired operations. Perhaps you can solve your problem some other way than by manually munging the bytestring.
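
If the data is headed for `Text` anyway, one option is to decode first and then drop a leading U+FEFF character. A minimal sketch, assuming the input really is UTF-8 and assuming `decodeUtf8` leaves the BOM in place as U+FEFF (the code is harmless if it does not); the name `decodeUtf8NoBOM` is hypothetical.

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE
import Data.Maybe (fromMaybe)

-- Hypothetical helper: decode UTF-8 and drop a leading U+FEFF, if any.
-- Note: decodeUtf8 throws on invalid UTF-8; decodeUtf8' returns an Either instead.
decodeUtf8NoBOM :: BL.ByteString -> TL.Text
decodeUtf8NoBOM bytes =
  let txt = TLE.decodeUtf8 bytes
  in fromMaybe txt (TL.stripPrefix (TL.singleton '\xFEFF') txt)
```

This does not help when a library such as `cassava` consumes the raw `ByteString` directly; in that case the bytes have to be stripped before parsing.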

edited Nov 18 '17 at 18:10

answered Nov 18 '17 at 17:44


Thanks for your answer! Your solution is really simple and efficient (because `take` and `drop` don't copy the underlying bytes). I personally came up with a worse solution... I ran into this problem with the `cassava` package when I needed to parse CSV files. Unfortunately, this library cannot handle a BOM :( https://github.com/hvr/cassava/issues/106 – Shersh Nov 18 '17 at 17:53
Though, your solution doesn't quite work because `unpack "BOM" = [66, 79, 77]` while the byte order mark is `[239,187,191]`. – Shersh Nov 18 '17 at 17:55
Heh, yeah that was a guess. Use whatever constants you desire. EDIT: Fixed (?) – Thomas M. DuBuisson Nov 18 '17 at 18:06
@Shersh doesn't say what encoding they are expecting. This scheme is correct for UTF-8 (it might be nice to call that out explicitly in this answer), but not e.g. for UTF-16 where the magic bytes are different and crucially serve to indicate endianness (so even if you just wanted to strip them you couldn't do a single equality check). – jberryman Nov 18 '17 at 19:11
@jberryman Yes, you're right. It's easy to mess up encodings... Fortunately, my data is `utf-8` encoded :) – Shersh Nov 18 '17 at 20:11
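
To illustrate the point about UTF-16 raised in the comments: the BOM there is two bytes and its byte order tells you the endianness, so stripping it usually goes together with detecting it. A rough sketch under those assumptions; the `BOM` type and `splitBOM` are hypothetical names, and `stripPrefix` again requires `bytestring` 0.10.8.0 or later.

```haskell
import qualified Data.ByteString.Lazy as BL

-- Hypothetical type naming the encodings this sketch recognizes.
data BOM = UTF8 | UTF16BE | UTF16LE
  deriving (Eq, Show)

-- Hypothetical helper: detect a leading BOM and return it with the rest of the input.
-- UTF-32 is deliberately not handled; its little-endian BOM (FF FE 00 00)
-- begins with the UTF-16LE one, so it would need to be checked first.
splitBOM :: BL.ByteString -> (Maybe BOM, BL.ByteString)
splitBOM bs
  | Just rest <- BL.stripPrefix (BL.pack [0xEF, 0xBB, 0xBF]) bs = (Just UTF8, rest)
  | Just rest <- BL.stripPrefix (BL.pack [0xFE, 0xFF]) bs = (Just UTF16BE, rest)
  | Just rest <- BL.stripPrefix (BL.pack [0xFF, 0xFE]) bs = (Just UTF16LE, rest)
  | otherwise = (Nothing, bs)
```

The caller can then pick a decoder based on the first component, falling back to a default encoding when it is `Nothing`.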
