While learning Pipes, I ran into problems when dealing with non-UTF-8 files. That is why I took a detour into the Turtle library, to try to understand how to solve the problem there, at a higher level of abstraction.
The exercise I want to do is quite simple: count the total number of lines in all regular files reachable from a given directory. This is readily done with the following shell command:
find $FPATH -type f -print | xargs cat | wc -l
I've come up with the following solution:
import qualified Control.Foldl as F
import qualified Turtle as T
-- | Returns true iff the file path is not a symlink.
noSymLink :: T.FilePath -> IO Bool
-- Uses `lstat`, because `stat` follows symlinks and would never report one.
noSymLink fPath = (not . T.isSymbolicLink) <$> T.lstat fPath
-- | Shell that outputs the regular files in the given directory.
regularFilesIn :: T.FilePath -> T.Shell T.FilePath
regularFilesIn fPath = do
  fInFPath <- T.lsif noSymLink fPath
  st <- T.stat fInFPath
  if T.isRegularFile st
    then return fInFPath
    else T.empty
-- | Read lines of `Text` from all the regular files under the given directory
-- path.
inputDir :: T.FilePath -> T.Shell T.Line
inputDir fPath = do
  file <- regularFilesIn fPath
  T.input file
-- | Print the number of lines in all the files in a directory.
printLinesCountIn :: T.FilePath -> IO ()
printLinesCountIn fPath = do
  count <- T.fold (inputDir fPath) F.length
  print count
This solution gives the correct result as long as every file in the directory is valid UTF-8. Otherwise, the program raises an exception like the following one:
*** Exception: test/resources/php_ext_syslog.h: hGetLine: invalid argument (invalid byte sequence)
This is to be expected, since:
$ file -I test/resources/php_ext_syslog.h
test/resources/php_ext_syslog.h: text/x-c; charset=iso-8859-1
I was wondering how to solve the problem of reading different encodings into Text, so that the program can deal with this. For the problem at hand I guess I could avoid the conversion to Text, but I'd rather know how to do this, since you could imagine a situation in which, for instance, I would like to make a set with all the words under a certain directory.
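To make the question concrete, here is a minimal sketch of the kind of decoding I have in mind, outside of Turtle and using only the `text` and `bytestring` packages. The helper names (`decodeBestEffort`, `decodeLenient`, `countLinesBestEffort`, `readFileBestEffort`) are hypothetical, and the Latin-1 fallback is an assumption: it never fails, but it will silently mis-decode files that use some other encoding.

```haskell
import           Data.ByteString          (ByteString)
import qualified Data.ByteString          as BS
import           Data.Text                (Text)
import qualified Data.Text                as Text
import qualified Data.Text.Encoding       as TE
import qualified Data.Text.Encoding.Error as TEE

-- | Hypothetical helper: decode strictly as UTF-8 and, if that fails,
-- fall back to Latin-1, which maps every byte to a code point.
decodeBestEffort :: ByteString -> Text
decodeBestEffort bs = either (const (TE.decodeLatin1 bs)) id (TE.decodeUtf8' bs)

-- | Alternative: always decode as UTF-8, replacing invalid input
-- with the Unicode replacement character instead of throwing.
decodeLenient :: ByteString -> Text
decodeLenient = TE.decodeUtf8With TEE.lenientDecode

-- | Count the lines of some raw bytes after best-effort decoding.
countLinesBestEffort :: ByteString -> Int
countLinesBestEffort = length . Text.lines . decodeBestEffort

-- | Read a whole file strictly as bytes and decode it best-effort.
readFileBestEffort :: FilePath -> IO Text
readFileBestEffort path = decodeBestEffort <$> BS.readFile path
```

This reads each file strictly into memory, so it is not a streaming solution; what I am after is the equivalent of this inside a `Shell`.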
EDIT
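For what it's worth, so far the only solution I could come up with is the following:
-- Imports assumed by the snippet below (inferred from the qualified names used).
import           Data.ByteString          (ByteString)
import qualified Data.ByteString          as BS
import qualified Data.List.NonEmpty       as NE
import           Data.Text.Encoding       (Decoding (Some), streamDecodeUtf8With)
import           Data.Text.Encoding.Error (lenientDecode)
import qualified Turtle.Bytes             as TB
-- | Decode a stream of `ByteString` chunks as UTF-8, replacing invalid bytes.
mDecodeByteString :: T.Shell ByteString -> T.Shell T.Text
mDecodeByteString = gMDecodeByteString (streamDecodeUtf8With lenientDecode)
  where
    gMDecodeByteString :: (ByteString -> Decoding)
                       -> T.Shell ByteString
                       -> T.Shell T.Text
    gMDecodeByteString f bss = do
      bs <- bss
      let Some res bs' g = f bs
      if BS.null bs'
        then return res
        else gMDecodeByteString g bss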
inputDir' :: T.FilePath -> T.Shell T.Line
inputDir' fPath = do
  file <- regularFilesIn fPath
  text <- mDecodeByteString (TB.input file)
  T.select (NE.toList $ T.textToLines text)
-- | Print the number of lines in all the files in a directory. Using a more
-- robust version of `inputDir`.
printLinesCountIn' :: T.FilePath -> IO ()
printLinesCountIn' fPath = do
  count <- T.fold (inputDir' fPath) T.countLines
  print count
The problem is that this counts one extra line per file, but at least it can read non-UTF-8 ByteStrings.
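One likely cause of the extra line, assuming `T.textToLines` splits on every newline character (which I believe it does), is that text ending in a newline yields a trailing empty piece. The sketch below reproduces the difference with plain `Data.Text`; the names `splitOnNewline` and `terminatedLines` are hypothetical and only there for illustration.

```haskell
import           Data.Text (Text)
import qualified Data.Text as Text

-- | Treats "\n" as a separator, so a final newline leaves an empty
-- piece at the end of the result.
splitOnNewline :: Text -> [Text]
splitOnNewline = Text.splitOn (Text.pack "\n")

-- | Treats "\n" as a terminator, so a final newline adds nothing.
terminatedLines :: Text -> [Text]
terminatedLines = Text.lines
```

For the input `"a\nb\n"`, `splitOnNewline` returns three pieces (`"a"`, `"b"`, `""`), while `terminatedLines` returns two, which matches what `wc -l` reports. Since the decoding above works chunk by chunk, lines that straddle a chunk boundary may inflate the count as well.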