File consumption in multinode hazelcast setup

Question

I have seen examples where CSV files are consumed using Jet, e.g.:

BatchSource<SalesRecordLine> source = Sources.filesBuilder(sourceDir)
             .glob("*.csv")
             .build(path -> Files.lines(path).skip(1).map(SalesRecordLine::parse));

In a multinode setup, will all the nodes start picking up the file (on, say, a shared NFS), or does Jet employ some smart locking (like Apache Camel's idempotent file consumer)? How does Jet know the file has been completely flushed to disk before reading it?

thanks

score 0 · Answer 1 · edited Mar 11 '20 at 08:29

If you are using NFS, set the sharedFileSystem property to true:

BatchSource<SalesRecordLine> source = Sources.filesBuilder(sourceDir)
    .glob("*.csv")
    .sharedFileSystem(true)
    .build(path -> Files.lines(path).skip(1).map(SalesRecordLine::parse));

From the method javadoc:

Sets if files are in a shared storage visible to all members. Default value is false. If sharedFileSystem is true, Jet will assume all members see the same files. They will split the work so that each member will read a part of the files. If sharedFileSystem is false, each member will read all files in the directory, assuming they are local.
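To make the difference between the two settings concrete, here is a minimal, JDK-only sketch of the behaviour the javadoc describes. It is not Jet's code: the class name, the method names and the hash-based assignment rule are all assumptions made for illustration, and Jet's actual way of dividing the files may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a simplified model of the javadoc text, not Jet's implementation.
final class SharedFileSplitSketch {

    // Hypothetical rule: a file belongs to the member whose index matches
    // the hash of the file name modulo the number of members.
    static int ownerOf(String fileName, int memberCount) {
        return Math.floorMod(fileName.hashCode(), memberCount);
    }

    // Files a given member would read under each setting.
    static List<String> filesFor(int memberIndex, int memberCount,
                                 List<String> allFiles, boolean sharedFileSystem) {
        List<String> mine = new ArrayList<>();
        for (String file : allFiles) {
            // sharedFileSystem == false: every member reads every file it sees.
            // sharedFileSystem == true: each file is read by exactly one member.
            if (!sharedFileSystem || ownerOf(file, memberCount) == memberIndex) {
                mine.add(file);
            }
        }
        return mine;
    }
}
```

In this model, setting the flag to false on a shared directory would make every member read every file, so each record would be emitted once per member.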

For the batch source, Jet assumes the files are not modified while they are read. If they are, the result is undefined.

If you want to monitor files as they are written to, use FileSourceBuilder.buildWatcher() instead of build(); this creates a streaming job. Note that the watcher processes only lines appended after the job started. Again, if the files are modified in any way other than appending at the end, the result is undefined. For example, many text editors delete and rewrite the entire file even when you only append a line at the end. For testing, it's easiest to use:

echo "text" >> your_file
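If you would rather generate test input from Java than from the shell, a rough equivalent of the append above could look like the sketch below. It uses only the JDK; the class and method names are made up for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper for tests: appends one line to the end of a file
// without rewriting the existing content.
final class AppendLine {

    static void appendLine(Path file, String line) throws IOException {
        byte[] bytes = (line + System.lineSeparator()).getBytes(StandardCharsets.UTF_8);
        // CREATE makes the file if it is missing; APPEND writes at the end only.
        Files.write(file, bytes, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}
```

The important part is StandardOpenOption.APPEND: unlike a save from a text editor, the existing bytes are left untouched and only the new line is added at the end.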

Thank you. I saw that option, however I don't want to split up the file amongst nodes. One file one node. Is that possible? A solution I have in mind takes care of all my problems but outside of the hazelcast ecosystem. — gurpal2000, Mar 11 '20 at 08:03

score 0 · Answer 2 · answered Mar 12 '20 at 15:05

You can place the file on just one node and have Jet distribute the data to all members. Jet currently lacks first-class support for stream rebalancing, but you can achieve it in this somewhat roundabout way:

pipeline.readFrom(source)
        .groupingKey(x -> x)
        .mapStateful(() -> null, (state, key, item) -> item)
        .restOfYourPipeline();

groupingKey(x -> x) specifies the partitioning function. I used a plain identity function, but you can put anything else that makes sense for your data.
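To show what choosing a key function does, here is a simplified JDK-only model of key-based partitioning. The class, the method and the hash-modulo rule are assumptions for illustration; Jet's real partitioner is internal and may compute the target differently. The point is only that items with equal keys always end up in the same place.

```java
import java.util.function.Function;

// Illustration only: a toy model of key-based routing, not Jet's partitioner.
final class PartitionSketch {

    // Extracts the key from the item and maps it to one of partitionCount
    // buckets. Equal keys always map to the same bucket.
    static <T, K> int partitionOf(T item, Function<? super T, ? extends K> keyFn,
                                  int partitionCount) {
        return Math.floorMod(keyFn.apply(item).hashCode(), partitionCount);
    }
}
```

With the identity function every distinct item is its own key, so the items spread out according to their hash codes. With a coarser key (for example a hypothetical region field), all items sharing that key stay together, which can skew the load if a few keys dominate.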
