
While learning Pipes, I've run into problems when dealing with non-UTF-8 files. That is why I've taken a detour into the Turtle library, to try to understand how to solve the problem there, at a higher level of abstraction.

The exercise I want to do is quite simple: count the lines of all the regular files reachable from a given directory. This is readily implemented by the following shell command:

find $FPATH -type f -print | xargs cat | wc -l

I've come up with the following solution:

import qualified Control.Foldl as F
import qualified Turtle        as T

-- | Returns true iff the file path is not a symlink. Note that `lstat` is
-- used instead of `stat`: `stat` follows symlinks, so it would never report
-- that the path itself is one.
noSymLink :: T.FilePath -> IO Bool
noSymLink fPath = (not . T.isSymbolicLink) <$> T.lstat fPath

-- | Shell that streams the regular files reachable from the given directory,
-- without following symbolic links.
regularFilesIn :: T.FilePath -> T.Shell T.FilePath
regularFilesIn fPath = do
  fInFPath <- T.lsif noSymLink fPath
  st <- T.stat fInFPath
  if T.isRegularFile st
    then return fInFPath
    else T.empty -- discard anything that is not a regular file

-- | Read lines of `Text` from all the regular files under the given directory
-- path.
inputDir :: T.FilePath -> T.Shell T.Line
inputDir fPath = do
  file <- regularFilesIn fPath
  T.input file

-- | Print the number of lines in all the files in a directory.
printLinesCountIn :: T.FilePath -> IO ()
printLinesCountIn fPath = do
  count <- T.fold (inputDir fPath) F.length
  print count
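In case it's useful, this is roughly how I call it; `options` and `argPath` are Turtle's command-line parsing helpers, and the program description and argument name here are just placeholders:

{-# LANGUAGE OverloadedStrings #-}

main :: IO ()
main = do
  -- Parse the directory to scan from the command line.
  dir <- T.options "Count lines in all regular files under a directory"
                   (T.argPath "dir" "Root directory to scan")
  printLinesCountIn dir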

This solution gives the correct result as long as there are no non-UTF-8 files in the directory. If there are, the program raises an exception like the following one:

*** Exception: test/resources/php_ext_syslog.h: hGetLine: invalid argument (invalid byte sequence)

This is to be expected, since:

$ file -I test/resources/php_ext_syslog.h
test/resources/php_ext_syslog.h: text/x-c; charset=iso-8859-1

I was wondering how to solve the problem of reading files in different encodings into `Text`, so that the program can deal with cases like this one. For the problem at hand I guess I could avoid the conversion to `Text`, but I'd rather know how to do this, since you could imagine a situation in which, for instance, I would like to build a set of all the words under a certain directory (sketched below).
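To make that concrete, here is a sketch of that word-set variant (the name `wordsIn` is mine; it suffers from exactly the same decoding exception):

import qualified Data.Set  as Set
import qualified Data.Text as Text

-- | The set of all words occurring in regular files under the given
-- directory. Throws on non-UTF-8 input, just like `printLinesCountIn`.
wordsIn :: T.FilePath -> IO (Set.Set Text.Text)
wordsIn fPath = T.fold allWords F.set
  where allWords = do
          line <- inputDir fPath
          T.select (Text.words (T.lineToText line))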

EDIT

For what it's worth, so far the only solution I could come up with is the following:

import           Data.ByteString          (ByteString)
import qualified Data.ByteString          as BS
import qualified Data.List.NonEmpty       as NE
import           Data.Text.Encoding       (Decoding (Some), streamDecodeUtf8With)
import           Data.Text.Encoding.Error (lenientDecode)
import qualified Turtle.Bytes             as TB

-- | Decode a stream of `ByteString`s leniently: invalid bytes are replaced
-- by the Unicode replacement character instead of raising an exception.
mDecodeByteString :: T.Shell ByteString -> T.Shell T.Text
mDecodeByteString = gMDecodeByteString (streamDecodeUtf8With lenientDecode)
  where gMDecodeByteString :: (ByteString -> Decoding)
                             -> T.Shell ByteString
                             -> T.Shell T.Text
        gMDecodeByteString f bss = do
          bs <- bss
          let Some res bs' g = f bs
          if BS.null bs'
            then return res
            else gMDecodeByteString g bss
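(For reference, a `Decoding` value `Some decoded leftover continuation` carries the text decoded so far, the trailing bytes of any incomplete code point, and a continuation that resumes decoding from those leftover bytes when fed the next chunk.)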

inputDir' :: T.FilePath -> T.Shell T.Line
inputDir' fPath = do
  file <- regularFilesIn fPath
  text <- mDecodeByteString (TB.input file)
  T.select (NE.toList $ T.textToLines text)

-- | Print the number of lines in all the files in a directory. Using a more
-- robust version of `inputDir`.
printLinesCountIn' :: T.FilePath -> IO ()
printLinesCountIn' fPath = do
  count <- T.fold (inputDir' fPath) T.countLines
  print count

The problem is that this counts one extra line per file (I suspect because `textToLines` behaves like splitting on newlines, so a final chunk that ends in a newline yields a trailing empty `Line`), but at least it can decode non-UTF-8 `ByteString`s.
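For comparison, counting raw newline bytes the way `wc -l` does (as Cubic points out in the comments) sidesteps decoding entirely. A sketch, where `countNewlineBytes` is my name for it:

-- | Count newline bytes (10) in all regular files under the given directory,
-- mimicking `wc -l`. No decoding takes place, so the encoding is irrelevant.
countNewlineBytes :: T.FilePath -> IO Int
countNewlineBytes fPath = T.fold bytes (F.premap (BS.count 10) F.sum)
  where bytes = regularFilesIn fPath >>= TB.input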

  • Counter question: What's a line? Sounds obvious, but you need to know what a line is to count them. You could treat the entire file as bytes and say any occurrence of the byte "10" is a new line, so you count those. If that's your goal, you're better off reading with `ByteString` instead of `Text`. If you want something more sophisticated, you can't really get around someone _telling_ you some information about what's in the files, because you can't, in general, guess the meaning of bytes just from seeing them. – Cubic Jun 15 '17 at 12:10
  • Indeed, in this case I could probably look around for bytes that correspond to line terminators, and work with `ByteString`s directly. However I would like to cover the case in which I want to try the files as character files (if I'm gathering words for instance), and then I need a robust solution for decoding different encodings into text. – Damian Nadales Jun 15 '17 at 12:13
  • You need to know what encoding a file uses before decoding. You can guess, but that's brittle. If you're lucky, there's a [BOM](http://unicode.org/faq/utf_bom.html#BOM) at the start of the file and then it's _probably_ Unicode, and if it is you even know which unicode encoding. But BOMs are not required, and even if you see one that doesn't _necessarily_ mean the file is encoded in some unicode encoding. There's no truly robust way to get the encoding of a file without having metadata about the file available. – Cubic Jun 15 '17 at 12:18
  • I'm wondering how `find $FPATH -type f -print | xargs cat | wc -l` deals with this... – Damian Nadales Jun 15 '17 at 12:22
  • FYI if you already know the encoding and want to decode it now, that's what the `Data.Text.Encoding` module is for. – Cubic Jun 15 '17 at 12:24
  • Actually if I'm scanning an arbitrary directory I can find any encoding, so I won't know the encoding in advance. I'm looking at `Data.Text.Encoding` indeed, and see if it is possible to use `streamDecodeUtf8With` ignoring the errors. – Damian Nadales Jun 15 '17 at 12:28
  • 1
    `wc l` _doesn't_ deal with this. It just looks for the byte '10' in the file. " so I won't know the encoding in advance" see, and that's the problem. There's no magic way to know the encoding of a file. You can try some the most common ones and see if any of them give reasonable-seeming results, but you can't, in general, figure out if some arbitrary blob is encoded with some arbitrary encoding. That's why the content-type header in HTML exists. – Cubic Jun 15 '17 at 12:31
