
I'm writing a program that takes a list of text files as arguments and writes an output file in which each row consists of the corresponding rows of the input files, joined by tabs.

Assume all characters are ASCII-encoded.

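import System.IO
import System.Environment (getArgs)
import Data.List (transpose)
import qualified Data.ByteString.Lazy.Char8 as B

main :: IO ()
main = do
    -- First argument is the output file; the rest are the input files.
    (out:files) <- getArgs
    hs <- mapM (`openFile` ReadMode) files
    txts <- mapM B.hGetContents hs
    -- Drop existing tabs, split each file into rows,
    -- then join row i of every file with tabs.
    let final = map (B.intercalate (B.singleton '\t')) . transpose
                . map (B.lines . B.filter (/= '\t')) $ txts
    withFile out WriteMode $ \h ->
        B.hPutStr h (B.unlines final)
    putStrLn "Completed successfully"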

The problem is that it outputs:

file1row1
    file2row1
file1row2
    file2row2
file1row3
    file2row3

instead of:

file1row1    file2row1
file1row2    file2row2
file1row3    file2row3

The same logic works correctly when I define the functions manually in GHCi, and the same code works correctly when I use Data.Text.Lazy instead of lazy ByteStrings.

What's wrong with my approach?

haskelline

2 Answers


When I tested Data.ByteString.Lazy.UTF8.lines on a sample string, it did not remove the '\r':

ghci -XOverloadedStrings

> import Data.ByteString.Lazy.UTF8 as B

> B.lines "ab\n\rcd"
  ["ab","\rcd"]

> B.lines "ab\r\ncd"
  ["ab\r","cd"]

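I am guessing this is your problem: if your input files use "\r\n" line endings, each row keeps its trailing '\r', which then ends up right before the tab in the output.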

To verify, look at the output with xxd or any other hex editor and check whether the extra character is in fact a '\r'.
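
If a hex editor is not at hand, the same check can be sketched in Haskell. This is only a rough diagnostic, assuming the lazy Char8 ByteString module; `endsWithCR` and `countCRLines` are hypothetical helper names, not library functions:

```haskell
import qualified Data.ByteString.Lazy.Char8 as B

-- True when a line still carries a trailing carriage return.
endsWithCR :: B.ByteString -> Bool
endsWithCR s = not (B.null s) && B.last s == '\r'

-- Number of lines (as split by B.lines) that end in '\r'.
countCRLines :: B.ByteString -> Int
countCRLines = length . filter endsWithCR . B.lines
```

In GHCi, something like `countCRLines <$> B.readFile "file1.txt"` (with `file1.txt` standing in for one of your inputs) should return a non-zero count if the file has "\r\n" line endings.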

jamshidh

There is a known bug in Data.ByteString.Lazy.UTF8 where newline conversion doesn't take place properly, even though the documentation says that it should. (See the question "Data.ByteString.Lazy.Char8 newline conversion on Windows---is the documentation misleading?") This could be the cause of your problem.
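
One possible way around it, assuming the culprit really is a trailing '\r' left on each row, is to strip it yourself after splitting. The sketch below is not the workaround from the linked question; `stripCR`, `linesCRLF` and `joinRows` are hypothetical names made up for illustration:

```haskell
import qualified Data.ByteString.Lazy.Char8 as B
import Data.List (transpose)

-- Drop a single trailing '\r', if there is one.
stripCR :: B.ByteString -> B.ByteString
stripCR s
  | not (B.null s) && B.last s == '\r' = B.init s
  | otherwise                          = s

-- Split into lines, tolerating both "\n" and "\r\n" endings.
linesCRLF :: B.ByteString -> [B.ByteString]
linesCRLF = map stripCR . B.lines

-- Same pipeline as in the question, with linesCRLF in place of B.lines:
-- remove existing tabs, split into rows, join row i of every input with tabs.
joinRows :: [B.ByteString] -> [B.ByteString]
joinRows = map (B.intercalate (B.singleton '\t'))
         . transpose
         . map (linesCRLF . B.filter (/= '\t'))
```

In the question's `main`, this would amount to replacing the `let final = ...` line with `let final = joinRows txts`. Note that the output is still written with plain "\n" line endings by `B.unlines`.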

circular-ruin
  • I'm using `Data.ByteString.Lazy.Char8`, not `UTF8`. Could you elaborate on the problem? I don't seem to understand what's going on. – haskelline Mar 04 '14 at 12:34
  • Newline characters are all ASCII characters and should work just fine when read as single bytes. – haskelline Mar 04 '14 at 12:35
  • Ok, I took a look at your workaround in the other question and I kind of got an idea of what's happening. Isn't it very weird that there's no solution till now for this problem? – haskelline Mar 04 '14 at 13:22
  • Yes, it really has been a while since the maintainer promised a fix 'real soon now' (see the answer in the other question). I guess the workaround is not too tricky though. The biggest issue (at least for me) is that this behavior is such a surprise... if there was a warning in the docs, it would be better! – circular-ruin Mar 04 '14 at 13:26