Hello, I am trying to write a ~1 GB file in a timely manner. Is there a recommended method? So far the process takes on the order of tens of minutes. Am I wrong to use Text; should I use ByteString instead? (I have also tried String.)

    pt="d:\\data2.csv"
    cnt=400000000

    main::IO()
    main=do
        let payload=dat
        writeWithHandle pt dat


    dat::Text
    dat=Data.Text.pack "0744442339"


    writeWithHandle::FilePath->Text->IO()
    writeWithHandle path tx=do
        handle<-openFile path WriteMode
        writeTimes cnt handle dat

    writeTimes::Int->Handle->Text->IO()
    writeTimes cnt handle payload= forM_ ([0..cnt])  (\x->Data.Text.IO.hPutStrLn handle payload)

I do not understand why it takes so long, on the order of tens of minutes. Initially I was using writeFile, but I thought that would mean continuously opening and closing the file for each row, so I switched to appendFile, to no avail.

Bercovici Adrian
  • Is it faster if you batch the writes? Something like `T.replicate 10 dat` or even 1000x. – bergey Mar 12 '19 at 11:59
  • Did you compile with optimisations turned on? Writing a 14GB file with more or less what you're doing here takes me about 5 mins on my machine. (Also, you forgot to close your file after writing to it, you should just use `withFile` rather than `openFile` tbh) – Cubic Mar 12 '19 at 12:14
  • I have not compiled with optimizations turned on. (I'm still a beginner.) – Bercovici Adrian Mar 12 '19 at 12:26
  • Using bytestrings and with optimizations on, your code created a 4GB file in 43s. A main issue here is that `Text` is stored as UTF16 while file output is usually UTF8, so conversion will require some time. By building a bytestring up front, I converted `dat` only once for the whole program. In a real-world scenario, however, you might have a lot of different lines. If skipping `Text` is an option, I'd try to go for that. – chi Mar 12 '19 at 12:32
  • Apparently, after I closed the `handle` it worked a lot faster, as **@Cubic** pointed out. – Bercovici Adrian Mar 12 '19 at 12:33
  • General advice: text files bigger than a few MB are a bad idea. Use binary formats for all significant amounts of data. – leftaroundabout Mar 13 '19 at 15:18
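
The suggestions in the comments above (batching the writes, using `withFile` so the handle gets closed, and writing bytes instead of `Text`) can be combined in a small sketch. The names `row`, `batchSize`, and `writeRows`, and the batch size of 1000, are assumptions for illustration and not part of the original code. It also assumes the program is compiled with optimizations (for example `ghc -O2`); timings will vary by machine.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Monad (replicateM_)
import qualified Data.ByteString.Char8 as B
import System.IO (IOMode (WriteMode), withFile)

-- One CSV row as raw bytes, including its trailing newline.
row :: B.ByteString
row = "0744442339\n"

-- Hypothetical batch size: how many rows are joined into one write.
batchSize :: Int
batchSize = 1000

-- Write `batches` chunks of `batchSize` rows each.
-- withFile closes the handle when the action finishes, even on exceptions.
writeRows :: FilePath -> Int -> IO ()
writeRows path batches =
  withFile path WriteMode $ \h ->
    replicateM_ batches (B.hPut h chunk)
  where
    -- The chunk is built once and reused for every write.
    chunk = B.concat (replicate batchSize row)

main :: IO ()
main = writeRows "data2.csv" 400
```

As written, `main` produces 400 × 1000 rows; raising the number of batches scales the output without changing the size of each write.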

1 Answer

I would recommend using a Builder for this, which is an efficient way to fill up buffers and can be written directly to a Handle.

    #!/usr/bin/env stack
    -- stack --resolver ghc-8.6.4 script
    {-# LANGUAGE OverloadedStrings #-}
    import Data.ByteString.Builder (Builder, hPutBuilder)
    import Data.Foldable (fold)
    import System.IO (IOMode (WriteMode), withBinaryFile)

    pt :: FilePath
    pt = "data2.csv"

    cnt :: Int
    cnt = 400000000

    main :: IO ()
    main = writeWithHandle pt dat

    dat :: Builder
    dat = "0744442339"

    writeWithHandle :: FilePath -> Builder -> IO ()
    writeWithHandle path tx =
      withBinaryFile path WriteMode $ \h ->
      hPutBuilder h $ makeBuilder cnt tx

    makeBuilder :: Int -> Builder -> Builder
    makeBuilder cnt payload = fold $ replicate cnt $ payload <> "\n"

You can keep payload as a Text value instead if you'd like, and convert to a Builder using encodeUtf8Builder.
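
A minimal sketch of that variant, assuming the `text` package is available. The helper name `lineBuilder` and the row count of 1000 are made up for illustration; `encodeUtf8Builder` comes from `Data.Text.Encoding`.

```haskell
import Data.ByteString.Builder (Builder, char7, hPutBuilder)
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8Builder)
import System.IO (IOMode (WriteMode), withBinaryFile)

-- The payload stays a Text value, as in the question.
dat :: T.Text
dat = T.pack "0744442339"

-- Turn one Text line into a UTF-8 Builder followed by a newline.
lineBuilder :: T.Text -> Builder
lineBuilder t = encodeUtf8Builder t <> char7 '\n'

-- Repeat the line n times as a single Builder.
makeBuilder :: Int -> T.Text -> Builder
makeBuilder n t = mconcat (replicate n (lineBuilder t))

main :: IO ()
main =
  withBinaryFile "data2.csv" WriteMode $ \h ->
    hPutBuilder h (makeBuilder 1000 dat)
```

This keeps the `Text` API for building each row and only switches to a `Builder` at the point of output.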

Michael Snoyman
  • How does `dat :: Builder; dat = "0744442339"` work? In GHCi, the same is met with the error `Couldn't match expected type ‘Builder’ with actual type ‘[Char]’` – cobra Oct 08 '21 at 05:48
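
On the comment above: the literal typechecks as a `Builder` in the answer only because of the `OverloadedStrings` pragma at the top of the script. In GHCi the extension is off unless it is enabled, for example with `:set -XOverloadedStrings`. As a sketch that avoids the extension entirely, the same value can be built with `stringUtf8` from `Data.ByteString.Builder`; the `main` below is only there to show the result and is not part of the answer.

```haskell
import Data.ByteString.Builder (Builder, stringUtf8, toLazyByteString)
import qualified Data.ByteString.Lazy.Char8 as BL

-- Same payload as in the answer, built from a plain String
-- so that no language extension is required.
dat :: Builder
dat = stringUtf8 "0744442339"

main :: IO ()
main = BL.putStrLn (toLazyByteString dat)
```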