
Consider the following problem:

  • split a file by lines
  • write the lines to a result file
  • if the result file exceeds some size, create a new result file

For example, if I have a file that weighs 4 GB and the split size is 1 GB, the result is four files of 1 GB each.
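
To make the requirement concrete, the core rollover logic without any stream library could look roughly like this (plain Node in CoffeeScript; the writeChunk helper, the split size, and the file names are illustrative assumptions, not part of the question):

fs = require('fs')

MAX_SIZE = 10 ** 9  # assumed split size: 1 GB
counter = 0
written = 0
out = fs.createWriteStream("result/data#{counter}.txt")

# hypothetical helper: write a chunk, rolling over to a new result
# file once the current one would exceed MAX_SIZE
writeChunk = (chunk) ->
  if written + chunk.length > MAX_SIZE
    out.end()
    counter += 1
    written = 0
    out = fs.createWriteStream("result/data#{counter}.txt")
  written += chunk.length
  out.write(chunk)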

I'm looking for a solution with something like Rx*/Bacon or any other similar library in any language.

kharandziuk
  • Bacon doesn't support back-pressure, so it's a bad fit for I/O work like this. Without back-pressure, you can end up reading faster than you can write and buffering without limit. – Macil Jun 02 '15 at 17:20
  • You can take a look at this: http://xgrommx.github.io/rx-book/content/guidelines/when/index.html – xgrommx Jun 02 '15 at 22:01
  • Thanks. I looked at the article earlier. It glosses over the hardest part: changing the resulting stream (`appendAsync`, which is a pseudocode implementation of `fs.readStream`) and the feedback from a file when it exceeds the size limit. – kharandziuk Jun 02 '15 at 22:29
  • Why? There is no point in using an (F)RP library for this task; there is no async problem involved. – André Staltz Jun 03 '15 at 07:29
  • I use the problem just to illustrate the concern. The question is: how do I dynamically change the place where I write? – kharandziuk Jun 03 '15 at 07:39
  • Hi, I added a possible solution as an answer. @AndréStaltz Isn't it a good way to use FRP? – kharandziuk Jun 03 '15 at 13:29

1 Answer


My solution, in CoffeeScript with Highland.js:

H = require('highland')
fs = require('fs')
debug = require('debug')
log = debug('main')

# wrap the source file in a Highland stream; each raw chunk (a Buffer,
# not a line) is boxed into a { buffer } record so more fields can be
# attached along the way
readS = H(fs.createReadStream('walmart.dump')).map((buffer) ->
  { buffer: buffer }
)

MAX_SIZE = 10 ** 7  # split size in bytes (10 MB here)
counter = 0

# open the next result file; returns a function that takes a buffer and
# returns a stream which emits once the underlying write has completed
nextStream = () ->
  stream = fs.createWriteStream("result/data#{counter}.txt")
  wrapper = H.wrapCallback(stream.write.bind(stream))
  counter += 1
  return wrapper


debug('profile')('start')

# fold over the chunks, tracking how many bytes have gone into the
# current result file; emit a fresh record per chunk so the (possibly
# concurrent) writes downstream never see a mutated accumulator
s = readS.scan({
  size: 0
  stream: nextStream()
}, (acc, {buffer}) ->
  debug('scan')(acc, buffer)
  size = acc.size + buffer.length
  stream = acc.stream
  if size > MAX_SIZE
    debug('notify')(counter - 1, size)
    # this chunk starts a fresh result file, so count from its length
    size = buffer.length
    stream = nextStream()
  next = { size, stream, buffer }
  log(next)
  next
).filter((x) -> x.buffer?)  # drop the seed accumulator, which has no buffer

# map each chunk to its write (a stream that emits once the underlying
# fs write has completed) and keep up to four writes in flight at a time
s.map((x) ->
  debug('write')(x)
  x.stream(x.buffer)
)
.parallel(4)
.done -> debug('profile')('finish')
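
A note on why this has back-pressure: `H.wrapCallback(stream.write.bind(stream))` turns each write into a stream that emits only after the write's callback has fired, and `.parallel(4)` consumes at most four of those streams at a time, so the source file is read no faster than the writes complete. This is the back-pressure that, as Macil notes in the comments, Bacon lacks.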

walmart.dump is a text file containing 6 GB of text. Splitting it into 649 files takes:

  profile start +0ms
  profile finish +53s
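
For reference, the profile output above comes from the debug module, which only prints when the DEBUG environment variable matches, so a run looks something like this (split.coffee is an assumed file name):

  DEBUG=profile coffee split.coffee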
kharandziuk