It isn't clear how much memory is assigned to each node - is it 256GB or 128GB? Either way, as I understand it, setting a max heap size smaller than the amount of memory assigned to a node will usually mean the application stays confined to a single node. This is true under Windows, Solaris and Linux, as far as I'm aware.
Even if you allocate a JVM max heap size greater than the memory assigned to a node, as long as your heap doesn't actually grow beyond that size the process won't spill onto other nodes, because the JVM object allocator will always try to create a new object in the same memory pool as the creating thread - and that includes new thread objects.
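A minimal sketch of what that means in practice (the sizes are arbitrary, and it assumes the OS uses a first-touch style policy so pages end up on the node of the thread that allocates and initialises them):

```java
public class LocalAllocDemo {
    public static void main(String[] args) throws InterruptedException {
        Runnable worker = () -> {
            // Allocated and first written by this thread, so the heap pages
            // backing it are expected to come from this thread's node.
            long[] local = new long[64 * 1024 * 1024]; // ~512MB, arbitrary size
            long sum = 0;
            for (int i = 0; i < local.length; i++) {
                local[i] = i;
                sum += local[i];
            }
            System.out.println(Thread.currentThread().getName() + " sum=" + sum);
        };
        Thread t1 = new Thread(worker, "worker-1");
        Thread t2 = new Thread(worker, "worker-2");
        t1.start();
        t2.start();
        t1.join();
        t2.join();
    }
}
```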
The primary design goal of the NUMA architecture is to enable different processes to operate on different CPUs with each CPU having localised memory access, rather than having all CPUs contend for the same global shared memory. Having the same process running across multiple nodes is not necessarily that efficient, unless you can arrange for a particular thread to always use the local memory associated with a specific node (thread affinity). Otherwise, remote memory access will slow you down.
I suspect that to use more than one node in your example you will need to either assign different tasks to different nodes, or parallelise the same task across multiple nodes. In the latter case you'll need to ensure that each node has its own copy of the data in local memory. There are libraries available to manage thread affinity from your Java code, for example:
https://github.com/peter-lawrey/Java-Thread-Affinity
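As an illustration, here's a minimal sketch of pinning a worker thread with that library (this assumes the net.openhft.affinity.AffinityLock API of recent releases - check the project README for the exact package and artifact names; the task itself is a placeholder):

```java
import net.openhft.affinity.AffinityLock;

public class PinnedWorker {
    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            // Reserve a free CPU and pin this thread to it, so the thread
            // (and the objects it allocates) stay on that CPU's node.
            AffinityLock lock = AffinityLock.acquireLock();
            try {
                // ... do the node-local work here ...
                System.out.println(Thread.currentThread().getName() + " is pinned");
            } finally {
                lock.release(); // free the CPU for other threads
            }
        };
        Thread t = new Thread(task, "pinned-worker");
        t.start();
        t.join();
    }
}
```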