5

Summary

Is functional programming in Node.js general enough? Can it be used for a real-world problem: handling small bulks of DB records without loading all records into memory using toArray (and thus going out of memory)? You can read this criticism for background. We want to demonstrate the Mux/DeMux and fork/tee/join capabilities of such Node.js libraries with async generators.

Context

I'm questioning the validity and generality of functional programming in Node.js using any functional programming tool (like ramda, lodash, and imlazy), or even custom ones.

Given

Millions of records from a MongoDB cursor that can be iterated using await cursor.next()

You might want to read more about async generators and for-await-of.

For fake data one can use (on Node 10):

function sleep(ms) {
    return new Promise((resolve) => setTimeout(resolve, ms));
}
async function* getDocs(n) {
  for(let i=0;i<n;++i) {
     await sleep(1);
     yield {i: i, t: Date.now()};
  }
}
let docs=getDocs(1000000);

Wanted

We need

  • first document
  • last document
  • number of documents
  • split into batches/bulks of n documents and emit a socket.io event for that bulk

Make sure that first and last documents are included in the batches and not consumed.

Constraints

The millions of records should not be loaded into RAM; one should iterate over them and hold at most one batch of them at a time.

The requirement can be done using usual Node.js code (see the plain-loop sketch after the snippet below), but can it be done using something like applySpec as in here?

R.applySpec({
  first: R.head(),
  last: R.last(),
  _: 
    R.pipe(
      R.splitEvery(n),
      R.map( (i)=> {return "emit "+JSON.stringify(i);})
    ) 
})(input)
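
For comparison, here is a minimal plain Node.js sketch of the requirement (one pass, holding at most one batch; the emitBatch callback stands in for the socket.io emit and is only illustrative):

// Plain (non-FP) sketch: iterate once, track first/last/count,
// and hand off each full batch to `emitBatch` (assumed to do the socket.io emit).
async function processDocs(docs, n, emitBatch) {
  let first = null, last = null, count = 0, batch = [];
  for await (const doc of docs) {
    if (count === 0) first = doc;
    last = doc;
    count += 1;
    batch.push(doc);
    if (batch.length === n) {
      await emitBatch(batch);
      batch = [];
    }
  }
  if (batch.length > 0) await emitBatch(batch); // flush the remainder
  return { first, last, count };
}

Usage might look like processDocs(getDocs(1000000), 100, async b => io.emit('bulk', b)), assuming an existing socket.io io instance.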
Muayyad Alsadi
  • 2
    Could you clarify the actual question? Is it more about "can this be done functionally" or "how to do this functionally" or "is it bad to do this functionally"? – Sami Hult Jan 16 '19 at 12:24
  • sure, I'll edit the question. – Muayyad Alsadi Jan 16 '19 at 12:26
  • 1
    That said, the problem referred to does suggest words like streams and lazy evaluation (I'm deliberately vague here), and there's nothing holding you back from doing that in a functional manner. – Sami Hult Jan 16 '19 at 12:26
  • the assumed database cursor has millions of records; you should not resolve it at once, but you can `for-await-of` it one record at a time (repeat `await db.next()`). The other part of the problem is that the pipe that fetches the first document should not consume it from the batch pipe. – Muayyad Alsadi Jan 16 '19 at 12:34
  • 3
    Is functional programming in JavaScript as much fun as in Clojure or Haskell? No. It is more laborious, because you can't utilize a rich toolset. If you don't like this, use JS as a compile target. –  Jan 16 '19 at 13:21
  • exactly! In Python, for example, you can feed anything list-like to anything! There is no need for imlazy for generators, etc. – Muayyad Alsadi Jan 16 '19 at 14:30
  • 2
    How are you proposing to do this, regardless of whether your paradigm is functional: "Make sure that first and last documents are included in the batches and not consumed"? Getting the first would be reasonable; you simply need to hold the reference throughout. But the last is substantially harder. If you can show how you do that in some other paradigm, perhaps we can suggest a functional alternative. – Scott Sauyet Jan 16 '19 at 22:12
  • 2
    Even hardcore functional programmers will offer sacrifices at the altar of performance when the situation warrants. Bending the degenerate case into a contrived functional solution might be a fun puzzle, but it's not what we do in production. – Jared Smith Jan 17 '19 at 11:58
  • on the contrary, it's based on a real-world problem where we sync records between a server database and a client database by sending bulks of database records, and at the end we fire an event with start and end ids/times. – Muayyad Alsadi Jan 18 '19 at 11:36

5 Answers

4

To show how this could be modeled with vanilla JS, we can introduce the idea of folding over an async generator that produces things that can be combined together.

const foldAsyncGen = (of, concat, empty) => (step, fin) => async asyncGen => {
  let acc = empty
  for await (const x of asyncGen) {
    acc = await step(concat(acc, of(x)))
  }
  return await fin(acc)
}

Here the arguments are broken up into three parts:

  • (of, concat, empty) expects a function to produce a "combinable" thing, a function that will combine two "combinable" things, and an empty/initial instance of a "combinable" thing
  • (step, fin) expects a function that will take a "combinable" thing at each step and produce a Promise of a "combinable" thing to be used for the next step, and a function that will take the final "combinable" thing after the generator has been exhausted and produce a Promise of the final result
  • async asyncGen is the async generator to process

In FP, the idea of a "combinable" thing is known as a Monoid, which defines some laws that detail the expected behaviour of combining two of them together.
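
For instance, plain arrays form a monoid under concatenation, with [] as the empty value (illustrative sketch, not part of the answer's code):

// Arrays as a monoid: `concat` combines two values, `empty` is the identity.
const concat = (a, b) => a.concat(b);
const empty = [];

// Associativity: concat(concat(a, b), c) gives the same result as concat(a, concat(b, c))
// Identity:      concat(empty, a) and concat(a, empty) both give back a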

We can then create a Monoid that will be used to carry through the first, last and batch of values when stepping through the generator.

const Accum = (first, last, batch) => ({
  first,
  last,
  batch,
})

Accum.empty = Accum(null, null, []) // an initial instance of `Accum`

Accum.of = x => Accum(x, x, [x])    // an `Accum` instance of a single value

Accum.concat = (a, b) =>            // how to combine two `Accum` instances together
  Accum(a.first == null ? b.first : a.first, b.last, a.batch.concat(b.batch))

To capture the idea of flushing the accumulating batches, we can create another function that takes an onFlush function (which will perform some action, in a returned Promise, with the values being flushed) and a size n at which to flush the batch.

Accum.flush = onFlush => n => acc =>
  acc.batch.length < n ? Promise.resolve(acc)
                       : onFlush(acc.batch.slice(0, n))
                           .then(_ => Accum(acc.first, acc.last, acc.batch.slice(n)))

We can also now define how we can fold over the Accum instances.

Accum.foldAsyncGen = foldAsyncGen(Accum.of, Accum.concat, Accum.empty)

With the above utilities defined, we can now use them to model your specific problem.

const emit = batch => // This is an analog of where you would emit your batches
  new Promise((resolve) => resolve(console.log(batch)))

const flushEmit = Accum.flush(emit)

// flush and emit every 10 items, and also the remaining batch when finished
const fold = Accum.foldAsyncGen(flushEmit(10), flushEmit(0))

And finally run with your example.

fold(getDocs(100))
  .then(({ first, last })=> console.log('done', first, last))
Scott Christopher
  • thank you for your proposal, I was asking about the completeness and generality of FP packages, not about building our own package. – Muayyad Alsadi Jan 24 '19 at 13:28
  • Perhaps you should update your question to remove the request for "custom" tooling under the Context section. – Scott Christopher Jan 24 '19 at 22:57
  • You have an outstanding answer. I meant a custom npm package other than the named ones, but I want it to be generic so that we can question how generic it is. My question to you: if you publish the above code in the form of an npm package, do you think it would be generic? – Muayyad Alsadi Jan 25 '19 at 21:41
  • Most of it is entirely generic, to the extent that the Monoid instance of the `Accum` object could be derived from the three independent Monoid instances of First, Last and Batch. The only logic that would be specific to this example is defining the `step` function for `foldAsyncGen` that calls the `flushEmit` function. – Scott Christopher Jan 27 '19 at 12:38
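
A minimal sketch of that derivation might look like the following (illustrative only, mirroring the null-based conventions of `Accum` above):

// Three independent monoids, one per field of `Accum`
const First = { empty: null, concat: (a, b) => (a == null ? b : a) };
const Last  = { empty: null, concat: (_a, b) => b };
const Batch = { empty: [],   concat: (a, b) => a.concat(b) };

// `Accum.concat` is then just the field-wise combination of the three
const concatAccum = (a, b) => ({
  first: First.concat(a.first, b.first),
  last:  Last.concat(a.last, b.last),
  batch: Batch.concat(a.batch, b.batch),
});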
2

I'm not sure it's fair to imply that functional programming is going to offer any advantages over imperative programming in terms of performance when dealing with huge amounts of data.

I think you need to add another tool in your toolkit and that may be RxJS.

RxJS is a library for composing asynchronous and event-based programs by using observable sequences.

If you're not familiar with RxJS or reactive programming in general, my examples will definitely look weird, but I think it would be a good investment to get familiar with these concepts.

In your case, the observable sequence is your MongoDB instance that emits records over time.

I'm gonna fake your db:

var db = range(1, 5);

The range function is an RxJS creation function that will emit each value in the provided range.
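
The snippets in this answer assume RxJS 6; the creation functions and operators they use would be imported roughly like this (the imports are assumed, the original snippets omit them):

const { range, merge } = require("rxjs");
const { first, last, bufferCount } = require("rxjs/operators");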

db.subscribe(n => {
  console.log(`record ${n}`);
});

//=> record 1
//=> record 2
//=> record 3
//=> record 4
//=> record 5

Now I'm only interested in the first and last record.

I can create an observable that will only emit the first record, and create another one that will emit only the last one:

var db = range(1, 5);
var firstRecord = db.pipe(first());
var lastRecord = db.pipe(last());

merge(firstRecord, lastRecord).subscribe(n => {
  console.log(`record ${n}`);
});
//=> record 1
//=> record 5

However, I also need to process all records in batches (in this example I'm gonna create batches of 10 records each):

var db = range(1, 100);
var batches = db.pipe(bufferCount(10))
var firstRecord = db.pipe(first());
var lastRecord = db.pipe(last());

merge(firstRecord, batches, lastRecord).subscribe(n => {
  console.log(`record ${n}`);
});

//=> record 1
//=> record 1,2,3,4,5,6,7,8,9,10
//=> record 11,12,13,14,15,16,17,18,19,20
//=> record 21,22,23,24,25,26,27,28,29,30
//=> record 31,32,33,34,35,36,37,38,39,40
//=> record 41,42,43,44,45,46,47,48,49,50
//=> record 51,52,53,54,55,56,57,58,59,60
//=> record 61,62,63,64,65,66,67,68,69,70
//=> record 71,72,73,74,75,76,77,78,79,80
//=> record 81,82,83,84,85,86,87,88,89,90
//=> record 91,92,93,94,95,96,97,98,99,100
//=> record 100

As you can see in the output, it has emitted:

  1. The first record
  2. Ten batches of 10 records each
  3. The last record

I'm not gonna try to solve your exercise for you, and I'm not familiar enough with RxJS to expand too much on this.

I just wanted to show you another way and let you know that it is possible to combine this with functional programming.

Hope it helps

customcommander
  • It's not homework. I'm questioning the generality because someone might suggest resolving everything at once with toArray() on those millions of records while one only needs a small bulk (multiply that by many requests). And no, your answer is not valid, because if one pipe consumes a document the other won't see it. [read this](https://hackernoon.com/functional-programming-in-javascript-is-an-antipattern-58526819f21e) – Muayyad Alsadi Jan 16 '19 at 11:32
  • I'll edit the question to provide a valid fake data. – Muayyad Alsadi Jan 16 '19 at 11:32
  • 1
    Sorry I know it wasn't a homework. Processing millions of records in memory isn't going to work hence why I wanted to let you know about RxJS. Please note that in RxJS `pipe` will not execute anything until one subscribes to an observable. I don't know how much you know about RxJS so I deliberately kept my answer as light as possible. Also I'm not an expert in RxJS myself. I take that the answer may not be what you're looking for but I think it does provide some insights. – customcommander Jan 16 '19 at 11:53
  • I've added fake data. And thank you for suggesting `RxJS`; I need to look into it. – Muayyad Alsadi Jan 16 '19 at 12:15
  • as you can [see here](https://imgur.com/a/RBT1aVA), `first()` did consume the document from the batch – Muayyad Alsadi Jan 16 '19 at 15:19
  • As well as the suggestion of Scramjet in another answer and RSJS here, you might also look at [Bacon](https://baconjs.github.io/) and [Flyd](https://github.com/paldepind/flyd). – Scott Sauyet Jan 16 '19 at 22:16
2

I think I may have developed an answer for you some time ago and it's called scramjet. It's lightweight (no thousands of dependencies in node_modules), it's easy to use and it does make your code very easy to understand and read.

Let's start with your case:

DataStream
    .from(getDocs(10000))
    .use(stream => {
        let counter = 0;

        const items = new DataStream();
        const out = new DataStream();

        stream
            .peek(1, async ([first]) => out.whenWrote(first))
            .batch(100)
            .reduce(async (acc, result) => {
                await items.whenWrote(result);

                return result[result.length - 1];
            }, null)
            .then((last) => out.whenWrote(last))
            .then(() => items.end());

        items
            .setOptions({ maxParallel: 1 })
            .do(arr => counter += arr.length)
            .each(batch => writeDataToSocketIo(batch))
            .run()
            .then(() => (out.end(counter)))
        ;

        return out;
    })
    .toArray()
    .then(([first, last, count]) => ({ first, count, last }))
    .then(console.log)
;

So I don't really agree that JavaScript FRP is an antipattern, and I don't think I have the only answer to that, but while developing the first commits I found that ES6 arrow syntax and async/await written in a chained fashion make the code easy to understand and read.

Here's another example of scramjet code from OpenAQ, specifically this line in their fetch process:

return DataStream.fromArray(Object.values(sources))
  // flatten the sources
  .flatten()
  // set parallel limits
  .setOptions({maxParallel: maxParallelAdapters})
  // filter sources - if env is set then choose only matching source,
  //   otherwise filter out inactive sources.
  // * inactive sources will be run if called by name in env.
  .use(chooseSourcesBasedOnEnv, env, runningSources)
  // mark sources as started
  .do(markSourceAs('started', runningSources))
  // get measurements object from given source
  // all error handling should happen inside this call
  .use(fetchCorrectedMeasurementsFromSourceStream, env)
  // perform streamed save to DB and S3 on each source.
  .use(streamMeasurementsToDBAndStorage, env)
  // mark sources as finished
  .do(markSourceAs('finished', runningSources))
  // convert to measurement report format for storage
  .use(prepareCompleteResultsMessage, fetchReport, env)
  // aggregate to Array
  .toArray()
  // save fetch log to DB and send a webhook if necessary.
  .then(
    reportAndRecordFetch(fetchReport, sources, env, apiURL, webhookKey)
  );

It describes everything that happens with every source of data. So here's my proposal up for questioning. :)

Michał Karpacki
  • How does that produce the first and last documents without consuming them from the batches stream? In Ramda I used 'applySpec' – Muayyad Alsadi Jan 16 '19 at 17:38
  • BTW for me multi-threading is a no-go for this problem. We are just looping over documents, not inventing a rocket. – Muayyad Alsadi Jan 16 '19 at 17:40
  • @MuayyadAlsadi hmm... it doesn't as I didn't read the **Wanted** section clearly enough. Let me rewrite a bit. – Michał Karpacki Jan 16 '19 at 17:48
  • Interesting - @MuayyadAlsadi did you read in the docs that scramjet is multithreaded by default? As this is not the case, I'd need to rewrite the docs and the intro a little. – Michał Karpacki Jan 16 '19 at 17:52
  • @MuayyadAlsadi seems `scramjet` is missing a method to slice an item off the end of the stream, hence the strange code which should be in the library. It does what you want, but if you'd have a couple minutes to spare, maybe we could use the chat - I'd happily hear some constructive criticism... – Michał Karpacki Jan 16 '19 at 18:55
  • thank you for clarifying that the default is async, not multi-threading, but the code seems to consume first and last, which is not what I want. We want to send small batches of documents via socket.io and at the end send the id/time of the first and last document. Your code seems to send `first=d0, batch0=[d1,d2,d3]` but it should be `first=d0, batch0=[d0,d1,d2]` (that is, `d0` is not consumed) – Muayyad Alsadi Jan 16 '19 at 20:54
  • Ah I see... So you want the count to be 10k in the end... I'm afk now and don't want to edit this doc from my mobile, but that will be easier. You could replace the shift with peek, but then most of the code above is excessive. I'll rewrite tomorrow. – Michał Karpacki Jan 16 '19 at 22:24
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/186839/discussion-between-michal-kapracki-and-muayyad-alsadi). – Michał Karpacki Jan 17 '19 at 09:21
1

Here are two solutions, using RxJS and scramjet.

Here is an RxJS solution.

The trick was to use share() so that first() and last() won't consume from the iterator; forkJoin was used to combine them to emit the done event with those values.

// Assumed imports (RxJS 6 style); the original snippet omits them
const Rx = require("rxjs");
const { share, first, last, bufferCount, count } = require("rxjs/operators");
// simple logging helper assumed by the snippet
const log = prefix => x => console.log(prefix, x);

function ObservableFromAsyncGen(asyncGen) {
  return Rx.Observable.create(async function (observer) {
    for await (let i of asyncGen) {
      observer.next(i);
    }
    observer.complete();
  });
}
async function main() {
  let o = ObservableFromAsyncGen(getDocs(100));
  let s = o.pipe(share());
  let f = s.pipe(first());
  let e = s.pipe(last());
  let b = s.pipe(bufferCount(13));
  let c = s.pipe(count());
  b.subscribe(log("batch: "));
  Rx.forkJoin(c, f, e, b).subscribe(function(a){console.log(
    "emit done with count", a[0], "first", a[1], "last", a[2]);})
}

Here is a scramjet version, but it is not pure (the functions have side effects).

// Assumed import; the original snippet omits it
const Sj = require("scramjet");

async function main() {
  let docs = getDocs(100);
  let first, last, counter;
  let s0 = Sj.DataStream
    .from(docs)
    .setOptions({ maxParallel: 1 })
    .peek(1, (item) => first = item[0])
    .tee((s) => {
        s.reduce((acc, item) => acc + 1, 0)
        .then((item) => counter = item);
    })
    .tee((s) => {
        s.reduce((acc, item) => item)
        .then((item) => last = item);
    })
    .batch(13)
    .map((batch) => console.log("emit batch" + JSON.stringify(batch)));
  await s0.run();
  console.log("emit done " + JSON.stringify({first: first, last: last, counter: counter}));
}

I'll work with @michał-kapracki to develop a pure version of it.

Muayyad Alsadi
0

For this exact kind of problem I made this library: ramda-generators

Hopefully it's what you are looking for: lazy evaluation of streams in functional JavaScript

The only problem is that I have no idea how to take the last element and the number of elements from a stream without re-running the generators.
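
One possible workaround, sketched in plain JS rather than ramda-generators (the tapStats helper below is illustrative, not part of the library), is to record first/last/count while the stream is consumed exactly once:

// Wrap an async iterable so that first/last/count are recorded
// as values flow through to the downstream consumer.
async function* tapStats(asyncIterable, stats) {
  for await (const x of asyncIterable) {
    if (stats.count === 0) stats.first = x;
    stats.last = x;
    stats.count += 1;
    yield x;
  }
}

// e.g. const stats = { first: null, last: null, count: 0 };
//      for await (const doc of tapStats(getDocs(100), stats)) { /* batch and emit */ }
//      console.log(stats);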

A possible implementation that computes the result without parsing the whole DB in memory could be this:

Try it on repl.it

const RG = require("ramda-generators");
const R  = require("ramda");

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

const getDocs = amount => RG.generateAsync(async (i) => {
    await sleep(1);
    return { i, t: Date.now() };
}, amount);

const amount = 1000000000;

(async (chunkSize) => {
    const first = await RG.headAsync(getDocs(amount).start());
    const last  = await RG.lastAsync(getDocs(amount).start()); // Without this line the print of the results would start immediately 

    const DbIterator = R.pipe(
        getDocs(amount).start,
        RG.splitEveryAsync(chunkSize),
        RG.mapAsync(i => "emit " + JSON.stringify(i)),
        RG.mapAsync(res => ({ first, last, res })),
    );

    for await (const el of DbIterator()) 
        console.log(el);

})(100);