
I am processing a stream of messages from a vendor and want to persist msgSeqNum in a local file. The reason:

The vendor sends a msgSeqNum to uniquely identify each message, and provides a 'sync-and-stream' function that replays messages from a given sequence number on reconnect. Say msgSeqNum starts at 1, my connection goes down at msgSeqNum 50, and I miss the next 100 messages (the vendor server's current msgSeqNum is now 150). When I reconnect, I need to call 'sync-and-stream' with msgSeqNum=50 to get the 100 missed messages.
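
The bookkeeping in this scenario can be sketched as follows. This is only an illustration, not the vendor's API: `SeqTracker`, `accept` and `resumeFrom` are hypothetical names, and the sketch assumes sequence numbers only ever increase.

```java
/** Hypothetical in-memory tracker for the last processed msgSeqNum. */
final class SeqTracker {
    private long lastProcessed;

    SeqTracker(long lastProcessed) {
        this.lastProcessed = lastProcessed;
    }

    /** Returns true if the message is new and was recorded, false if it is a duplicate. */
    boolean accept(long msgSeqNum) {
        if (msgSeqNum <= lastProcessed) {
            return false; // already processed, so ignore it
        }
        lastProcessed = msgSeqNum;
        return true;
    }

    /** The value to pass to 'sync-and-stream' after a reconnect. */
    long resumeFrom() {
        return lastProcessed;
    }

    public static void main(String[] args) {
        SeqTracker tracker = new SeqTracker(0);
        for (long seq = 1; seq <= 50; seq++) {
            tracker.accept(seq);
        }
        // The connection drops here; on reconnect, ask for everything after 50.
        System.out.println(tracker.resumeFrom()); // prints 50
    }
}
```

The value returned by `resumeFrom()` is what would have to survive a crash, which is why it needs to be persisted somewhere.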

So I want to understand how I can persist the msgSeqNum locally for fast access. My assumptions and questions:

1) Since reads and writes happen on every message (a read to ignore duplicates, a write to update msgSeqNum after processing a message), is Java NIO's 'MappedByteBuffer' the best choice?

2) Could someone confirm whether the code below is a good way to do this? I expose the MappedByteBuffer so it can be reused for reads and writes, and leave the FileChannel open for the lifetime of the process.

I know this could be done with ordinary Java file operations, but I need something fast, close to the cost of non-I/O work, because I am using a single-writer pattern and want to process these messages quickly in a non-blocking manner. Sample JUnit code below:

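import java.io.IOException;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.EnumSet;

import org.junit.After;
import org.junit.Before;
import org.junit.Test;
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.fail;

// Fields of the test class: channel and mapping stay open for the whole test
private FileChannel fileChannel = null;
private MappedByteBuffer mappedByteBuffer = null;
private Charset utf8Charset = null;
private CharBuffer charBuffer = null;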

@Before
public void setup() {
    try {
        charBuffer = CharBuffer.allocate( 24 ); // Long.MIN_VALUE needs 20 characters, so 24 is enough
        System.out.println( "charBuffer length: " + charBuffer.length() );

        Path pathToWrite = getFileURIFromResources();
        // Assign the field; declaring a local variable here would shadow it
        // and leave the field null when destroy() tries to close it
        fileChannel = (FileChannel) Files
                .newByteChannel( pathToWrite, EnumSet.of(
                        StandardOpenOption.READ,
                        StandardOpenOption.WRITE,
                        StandardOpenOption.TRUNCATE_EXISTING ));

        // Map the first 24 bytes of the file for both reading and writing
        mappedByteBuffer = fileChannel
                .map( FileChannel.MapMode.READ_WRITE, 0, charBuffer.length() );

        utf8Charset = StandardCharsets.UTF_8;

    } catch ( Exception e ) {
        // handle it
    }
}


@After
public void destroy() {
    try {
        fileChannel.close();
    } catch ( IOException e ) {
        // handle it
    }
}

@Test
public void testWriteAndReadUsingSharedMappedByteBuffer() {
    if ( mappedByteBuffer != null ) {
        // wrap() is a static factory, so call it on CharBuffer rather than on the instance
        mappedByteBuffer.put( utf8Charset.encode( CharBuffer.wrap( "101" ) )); // TODO improve this and try reusing the same buffer instead of creating a new one
    } else {
        System.out.println( "mappedByteBuffer null" );
        fail();
    }

    // flip() sets the limit to the 3 bytes just written and rewinds to position 0 for reading
    mappedByteBuffer.flip();
    assertEquals( "101", utf8Charset.decode(mappedByteBuffer).toString() );
}
Yash R
saurabh.in
  • Do you need random access, or why did you think `MappedByteBuffer` would be best here? – Kayaman Oct 02 '19 at 12:42
  • @Kayaman - sorry, what do you mean by random access? I need to persist msgSeqNum for any given message and then overwrite that entry with the next message’s sequence number. I am using this file as a persistence store for a Java long variable called ‘lastProcessedMsgSeqNum’, so that if my application crashes, I can read the last message sequence number from this file and request the rest from the vendor’s server. Hope that makes sense. – saurabh.in Oct 02 '19 at 12:52
  • Well, memory mapping is useful when random access to the file is needed. If you are planning to use it because "it's fast", then your assumptions are wrong from the beginning, and you should probably use standard IO instead. – Kayaman Oct 02 '19 at 12:55
  • Kayaman - thanks! I thought that since the file is mapped to an in-memory buffer, it’s going to be fast. For my use case, what do you reckon is best? A cache like ehcache backed by a file? – saurabh.in Oct 02 '19 at 13:00
  • I'd use a database. – Kayaman Oct 02 '19 at 13:04
  • What I have seen in previous projects is people using a file to store the most recent sequence number of a processed message when dealing with a message bus. I guess they didn’t use a database because, with multiple instances, we would need a table with a column for the instance name alongside msgSeqNum. This would cause contention when multiple instances want to update their corresponding record with the new sequence number after processing every single message. – saurabh.in Oct 02 '19 at 13:07
  • Maybe they didn't use a database because they're not very familiar with them. Databases aren't slow, but if you're inexperienced you can sure make them slow. I mean databases are **designed** to work on large amounts of data in parallel, and do it efficiently. Your claim about contention is not a reason not to use a database. – Kayaman Oct 02 '19 at 13:13
  • There is no reason to allocate 24 bytes; allocate a multiple of the page size. Memory mapping the file would save you an additional memory copy compared with "standard" I/O. The flush operation (`fsync`) is required, and in any case it is equally slow. I would not bother with such a low-level optimization unless you are designing some low-latency critical system, and would follow @Kayaman's advice. – Some Name Oct 02 '19 at 13:15
  • Kayaman - I doubt that the guys who chose a local file for persisting the message sequence number didn’t know about databases. These are equity teams of the biggest American banks, where latency has a direct impact on money. I could consider a DB with a level 2 cache (hibernate etc.), but again there will be inherent contention in this approach, as multiple cloned instances (in an elastic env) will be writing to the same table. I reckon a cache backed by a file store could be a good option if I want to store the state of the Java variable lastProcessedMsgSeqNum? – saurabh.in Oct 02 '19 at 13:31
  • @Kayaman - I am looking for the best solution, of course. There are multiple options, and I would like to hear from the experts in this area. I didn't know about memory-mapped buffers, and that's why I posted the question here, to understand whether that's the right thing. You suggested using a DB as an alternative, in which case I would prefer some level of caching for faster access. Anyway, thanks, and let's see if others have anything further to add. Based on the comments, a write-through cache is my preference so far. – saurabh.in Oct 02 '19 at 13:54
  • @Kayaman IMHO databases *are* slow when compared to memory mapped files. In general, they need inter-process communication, parsing of the commands, etc. All these operations are heavily optimized, but it's not as good as writing directly. Moreover, they provide transactions and ACID, which have quite some cost, too. – maaartinus Oct 02 '19 at 14:41
  • I guess, that's about what https://github.com/OpenHFT/Chronicle-Map was invented for. – maaartinus Oct 02 '19 at 14:43
  • You need to define whether you require `msgSeqNum` to be reliably stored after each processed message, or whether you want to store it in a non-blocking manner. In either case, `MappedByteBuffer` is a poor choice, as it is neither non-blocking nor reliable. Furthermore, `ByteBuffer.put(byte[])` is not atomic - this means there is a chance of seeing an invalid value written if a fatal error happens in the middle of writing. – apangin Oct 02 '19 at 15:34
  • @apangin I do want msgSeqNum to be stored reliably and in a non-blocking manner. I am leaning towards using MapDB or another cache backed by the file system. – saurabh.in Oct 02 '19 at 15:51
  • There is no non-blocking file I/O in Java, and `MappedByteBuffer` is no exception. – user207421 Oct 02 '19 at 23:19
  • @maaartinus that's apples and oranges. The issue is not between memory mapping and databases, but in the proper solution. Since they're doing streaming, maybe Kafka would be the best solution. We don't know their environment, so this is an exercise in futility. You give a suggestion, and the OP starts talking about how they used files at a bank (which apparently means files are a "Bank Grade" solution). You suggest "database" and you get "hibernate and ehcache". – Kayaman Oct 03 '19 at 06:04
  • @saurabh.in In any banking application correctness is infinitely more important than latency, and latency does not cost money. Errors cost money. – user207421 Oct 03 '19 at 11:56
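
Several comments above touch on how the value is written (text bytes versus a fixed-width value) and on the cost of flushing. As a rough illustration only, the sketch below stores the sequence number as a single 8-byte long at offset 0 of a mapped file, and makes the blocking `force()` call optional. `MappedSeqStore` and its methods are hypothetical names, not something from the question. The sketch does not claim that the write is atomic with respect to a crash, and it omits `TRUNCATE_EXISTING` so that a value written earlier is still there after a restart.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical store that keeps one long at offset 0 of a memory-mapped file. */
final class MappedSeqStore implements AutoCloseable {
    private final FileChannel channel;
    private final MappedByteBuffer buffer;

    MappedSeqStore(Path file) throws IOException {
        // No TRUNCATE_EXISTING: the previously stored value must survive a restart.
        channel = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE);
        if (channel.size() < Long.BYTES) {
            // New or empty file: initialise the slot to 0 explicitly.
            channel.write(ByteBuffer.allocate(Long.BYTES), 0);
        }
        buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, Long.BYTES);
    }

    /** Reads the stored sequence number (0 if nothing was written yet). */
    long read() {
        return buffer.getLong(0);
    }

    /** Writes the sequence number; with durable=true this blocks in force(). */
    void write(long msgSeqNum, boolean durable) {
        buffer.putLong(0, msgSeqNum);
        if (durable) {
            buffer.force(); // asks the OS to flush the mapped region to the device
        }
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("msgSeqNum", ".dat");
        try (MappedSeqStore store = new MappedSeqStore(file)) {
            store.write(101L, true);
        }
        // Reopen the file, as a restarted process would.
        try (MappedSeqStore store = new MappedSeqStore(file)) {
            System.out.println(store.read()); // prints 101
        }
    }
}
```

Calling `write(seq, false)` skips the flush and leaves it to the operating system to write the page back later, which is faster but gives no guarantee that the value is on disk if the machine goes down. That is the trade-off between reliable and non-blocking storage raised in the comments.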

0 Answers