I have a `LazyByteString` which possibly starts with a BOM. What is the easiest and preferably efficient way to remove the BOM from this `ByteString`?

Shersh
-
Isn't the BOM just a special character? Does [`tail`](https://hackage.haskell.org/package/bytestring-0.10.8.2/docs/Data-ByteString-Lazy.html#v:tail) not work for this? – Alec Nov 18 '17 at 16:31
-
@Alec Well, first I need to check whether my string starts with a BOM or not. The BOM is 3 bytes (i.e. a size-3 list of `Word8`) and `head` has type `head :: ByteString -> Word8`. It's really strange that `head` returns only one byte while `tail` can remove several bytes. So I guess just `tail` won't work. Also, `tail` throws a pure exception if the given `ByteString` is empty, which is not what I want :) – Shersh Nov 18 '17 at 16:47
-
@Shersh So you know exactly what to do. Why not try that before asking this question? – AJF Nov 18 '17 at 16:55
-
Oops. Yeah - I see your problem. Check out the `utf8-string` package. You can check if the bytestring is empty and, if it isn't, `uncons` it. Based on the first character you get back, you either return the tail (which you also get from `uncons`) or the initial bytestring. – Alec Nov 18 '17 at 16:56
1 Answer
I feel like I must be misunderstanding the problem. Doesn't this boil down to checking the first three bytes of a bytestring and conditionally dropping those bytes?
- To get the first 3 bytes, use `take`.
- To check bytestring equality, use `(==)`.
- To drop the first 3 bytes, use `drop`.
Putting these together we get:
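    import qualified Data.ByteString.Lazy as BS

    -- Drop a leading UTF-8 BOM (bytes EF BB BF) if present.
    dropBOM :: BS.ByteString -> BS.ByteString
    dropBOM bs | BS.take 3 bs == BS.pack [0xEF,0xBB,0xBF] = BS.drop 3 bs
               | otherwise = bs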
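The same check-and-drop can also be sketched with `stripPrefix`, assuming a `bytestring` version whose `Data.ByteString.Lazy` exports it (0.10.8 or later, as far as I know). The names `utf8Bom` and `stripUtf8Bom` are made up for illustration.

```haskell
import qualified Data.ByteString.Lazy as BL
import Data.Maybe (fromMaybe)

-- The three bytes of the UTF-8 encoded byte order mark (U+FEFF).
utf8Bom :: BL.ByteString
utf8Bom = BL.pack [0xEF, 0xBB, 0xBF]

-- Hypothetical helper: return the input without a leading UTF-8 BOM,
-- or the input unchanged when it does not start with one.
stripUtf8Bom :: BL.ByteString -> BL.ByteString
stripUtf8Bom bs = fromMaybe bs (BL.stripPrefix utf8Bom bs)
```

An empty input, or one shorter than three bytes, comes back unchanged because `stripPrefix` returns `Nothing` in those cases, so no partial function like `tail` is involved.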
However, even after dealing with lots of UTF-8 I never felt as though I needed to explicitly deal with the BOM, thanks to packages like Text that provide most of the desired operations. Perhaps you can solve your problem in another way than manually munging the bytestring.
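As a rough sketch of that alternative, assuming the `text` package is available and the input really is UTF-8: decode first, then drop a leading U+FEFF, which is what the BOM bytes decode to. To my knowledge `decodeUtf8` does not strip it on its own. The name `decodeWithoutBom` is hypothetical.

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE
import Data.Maybe (fromMaybe)

-- Hypothetical helper: decode UTF-8 bytes to lazy Text and drop a
-- leading U+FEFF character if there is one.
-- Note: decodeUtf8 throws an exception on invalid UTF-8 input.
decodeWithoutBom :: BL.ByteString -> TL.Text
decodeWithoutBom bs =
  let txt = TLE.decodeUtf8 bs
  in fromMaybe txt (TL.stripPrefix (TL.singleton '\xFEFF') txt)
```

This only helps when the consumer accepts `Text`; a library that wants raw bytes still needs the byte-level check above.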

Thomas M. DuBuisson
-
Thanks for your answer! Your solution is really simple and efficient (because `take` and `drop` don't allocate memory). I personally came up with a worse solution... I ran into a problem with the `cassava` package while parsing CSV files. Unfortunately, this library cannot handle a BOM :( https://github.com/hvr/cassava/issues/106 – Shersh Nov 18 '17 at 17:53
-
Though, your solution doesn't quite work because `unpack "BOM" = [66, 79, 77]` while the byte order mark is `[239,187,191]`. – Shersh Nov 18 '17 at 17:55
-
Heh, yeah that was a guess. Use whatever constants you desire. EDIT: Fixed (?) – Thomas M. DuBuisson Nov 18 '17 at 18:06
-
@Shersh doesn't say what encoding they are expecting. This scheme is correct for UTF-8 (it might be nice to call that out explicitly in this answer), but not e.g. for UTF-16 where the magic bytes are different and crucially serve to indicate endianness (so even if you just wanted to strip them you couldn't do a single equality check). – jberryman Nov 18 '17 at 19:11
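To illustrate the point about different magic bytes, here is a minimal sketch that reports which encoding a leading BOM signals instead of only stripping it. The type `BomEncoding` and the function `splitBom` are hypothetical names, and only UTF-8 and UTF-16 are covered.

```haskell
import qualified Data.ByteString.Lazy as BL

-- Encodings this sketch recognises from a leading BOM.
data BomEncoding = Utf8 | Utf16BE | Utf16LE
  deriving (Eq, Show)

-- Hypothetical helper: if the input starts with a known BOM, return the
-- encoding it signals together with the remaining bytes.
-- Caveat: the UTF-32 LE BOM (FF FE 00 00) also begins with FF FE, so it
-- would be reported as Utf16LE here; UTF-32 is not handled.
splitBom :: BL.ByteString -> Maybe (BomEncoding, BL.ByteString)
splitBom bs
  | BL.pack [0xEF, 0xBB, 0xBF] `BL.isPrefixOf` bs = Just (Utf8, BL.drop 3 bs)
  | BL.pack [0xFE, 0xFF] `BL.isPrefixOf` bs = Just (Utf16BE, BL.drop 2 bs)
  | BL.pack [0xFF, 0xFE] `BL.isPrefixOf` bs = Just (Utf16LE, BL.drop 2 bs)
  | otherwise = Nothing
```

A `Nothing` result means no recognised BOM was found, so the caller has to fall back to whatever encoding it expects.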
-
@jberryman Yes, you're right. It's easy to mess up with encodings... Hopefully, I have `utf-8` encoding :) – Shersh Nov 18 '17 at 20:11