Say I have an async generator, like this:

// This could be records from an expensive db call, for example...
// Too big to buffer in memory
const events = (async function* () {
    await new Promise(r => setTimeout(r, 0));
    yield {type:'bar', ts:'2021-01-01 00:00:00', data:{bar:"bob"}};
    yield {type:'foo', ts:'2021-01-02 00:00:00', data:{num:2}};
    yield {type:'foo', ts:'2021-01-03 00:00:00', data:{num:3}};
})();

How can I copy it to achieve something like:

function process(events) {

    async function* filterEventsByName(events, name) {
        for await (const event of events) {
            if (event.type !== name) continue; // keep only events matching the given type
            yield event;
        }
    }

    async function* processFooEvent(events) {
        for await (const event of events) {
            yield event.data.num;
        }
    }

    // How to implement this fork function?
    const [copy1, copy2] = fork(events);

    const foos = processFooEvent(filterEventsByName(copy1, 'foo'));
    const bars = filterEventsByName(copy2, 'bar');

    return {foos, bars};
}

const {foos, bars} = process(events);

for await (const event of foos) console.log(event);
// 2
// 3

for await (const event of bars) console.log(event);
// {type:'bar', ts:'2021-01-01 00:00:00', data:{bar:"bob"}};
surj
  • Being a generator, it makes very little sense to instantiate it only once. Just leave the generator function as is, and pass it to `filterByName`. – Mario Vernari Sep 18 '21 at 04:47
  • @MarioVernari I'm just giving a simple generator for the sake of this example; in reality the generator is created from an expensive API call that I only want to make once. – surj Sep 18 '21 at 16:41
  • Then maybe only the actual API call should be made once and cached, and the generator should _still_ be instantiated twice? – CherryDT Sep 18 '21 at 16:42
  • @CherryDT The API result set is too big to fit in memory. – surj Sep 18 '21 at 16:48
  • But that's not logical. Let's assume there _was_ a way to copy the state of a generator (there isn't). Then what would that mean? If it's all one request, then where would the response be stored, if not in memory? If it's multiple requests done as needed in paginated fashion, then what should happen if one of the generator copies got `next` called 100 times already and the other one 500? Thinking this through, you'll arrive at the conclusion that the only thing that makes sense would be having two separate generators. – CherryDT Sep 18 '21 at 17:11
  • This seems like [an XY problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). If you have a generator and want to split the results into two, you don't need to split the generator. You take each result and process it via different criteria. So, the original generator is consumed once, the elements end up in two different places. I do not see how cloning the generator helps much. Surely that means you still need the expensive call to happen twice. – VLAZ Sep 18 '21 at 19:27
  • Perhaps a better title would've been how to have multiple consumers of a generator. I can't buffer the generator in memory (otherwise, why use a generator instead of an array?). I can, however, copy a single event to forward it to multiple downstream consumers. That's what I'm trying to achieve: there should be a way to do a next() call on the initial generator and forward the result to 2+ new generators (the output of the fork). – surj Sep 18 '21 at 22:20
  • @CherryDT I get what you're saying about the 100 vs 500. I don't see why you'd absolutely need to re-create the upstream generator, though. Assuming each fork unrolls fully (e.g. no downstream errors), you could block until every fork has called next. Or you could have a fixed-size buffer and only block when it fills up. You could set a timeout on the block to unblock the other consumers in case of a downstream failure. – surj Sep 18 '21 at 22:41
  • @surjikal "*you could block until every fork has called next*" - no, you cannot block. Especially since the generator is synchronous. You could achieve this if it were an asynchronous generator or an event stream. – Bergi Sep 18 '21 at 23:18
  • @Bergi ok fair point, I am dealing w/ async generators so I will update the example accordingly. – surj Sep 18 '21 at 23:20
  • @surjikal Ah, thanks! I hope you don't mind me closing this as a duplicate. You might want to post your highland solution as an answer there as well. – Bergi Sep 18 '21 at 23:35
  • @Bergi no worries, cheers – surj Sep 18 '21 at 23:46

1 Answer


I have a solution using Highland as an intermediary.

Note that (from the docs):

> A stream forked to multiple consumers will only pull values, one at a time, from its source as fast as the slowest consumer can handle them.

import _ from 'lodash'
import H from 'highland'

export function fork<T>(generator: AsyncGenerator<T>): [
    AsyncGenerator<T>,
    AsyncGenerator<T>
] {
    // Clone each value so the two forks can't observe each other's mutations.
    const source = asyncGeneratorToHighlandStream(generator).map(x => _.cloneDeep(x));
    return [
        highlandStreamToAsyncGenerator<T>(source.fork()),
        highlandStreamToAsyncGenerator<T>(source.fork()),
    ];
}

// Drains a Highland stream back into an async generator via its Node stream adapter.
async function* highlandStreamToAsyncGenerator<T>(
    stream: Highland.Stream<T>
): AsyncGenerator<T> {
    for await (const row of stream.toNodeStream({ objectMode: true })) {
        yield row as unknown as T;
    }
}

// Wraps an async generator as a Highland stream, pulling one value at a time
// so back-pressure propagates to the source.
function asyncGeneratorToHighlandStream<T>(
    generator: AsyncGenerator<T>
): Highland.Stream<T> {
    return H(async (push, next) => {
        try {
            const result = await generator.next();
            if (result.done) return push(null, H.nil);
            push(null, result.value);
            next();
        } catch (error) {
            return push(error);
        }
    });
}
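
One caveat when consuming the forks: per the docs quote above, a forked stream only advances as fast as its slowest consumer, so draining one fork to completion before starting the other can stall the pipeline. Here's a sketch of concurrent consumption (inside an async context), reusing the question's `process` function:

const { foos, bars } = process(events);

// Drain both forks concurrently; consuming them strictly one after the
// other would block, because the source waits on the slower fork.
await Promise.all([
    (async () => { for await (const num of foos) console.log(num); })(),
    (async () => { for await (const event of bars) console.log(event); })(),
]);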

Would love to see alternative solutions without a library, or with another library.
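
For comparison, here's a rough dependency-free sketch of the same idea (`forkSimple` is a hypothetical name, and this is untested): each fork reads from its own queue, and a shared pull step advances the source one value at a time, pushing a clone to every queue. Unlike Highland's fork, there is no back-pressure, so a lagging fork's queue can grow without bound.

// Hypothetical sketch: fork an async generator via per-consumer queues.
// Caveat: no back-pressure — if one fork lags, its queue grows unboundedly.
export function forkSimple<T>(source: AsyncGenerator<T>): [
    AsyncGenerator<T>,
    AsyncGenerator<T>
] {
    const queues: T[][] = [[], []];
    let done = false;

    // Advance the shared source once and push a clone of the value to every
    // queue. Concurrent next() calls are serialized by the generator itself.
    async function pull(): Promise<void> {
        const result = await source.next();
        if (result.done) { done = true; return; }
        for (const queue of queues) queue.push(structuredClone(result.value)); // Node 17+
    }

    async function* consume(queue: T[]): AsyncGenerator<T> {
        while (true) {
            if (queue.length > 0) { yield queue.shift()!; continue; }
            if (done) return;
            await pull();
        }
    }

    return [consume(queues[0]), consume(queues[1])];
}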

surj