Garbage-collected languages with efficient numeric data types

Question

I am searching for a language/library (preferably JVM-based) that handles numeric values (integers and floating-point numbers) in a manner that is both convenient and efficient.

Convenient: supported by the collection framework and generics.
Efficient: incurs no noticeable overhead when primitives are the building blocks of data-heavy processing software (specifically, processing multiple GB of text with >100,000,000 items).

Deficiencies of the current languages:

Plain Java: auto-boxing is quite convenient, but it has substantial overhead.
Scala and Kotlin: also seem to rely on Java's boxed primitives, so no efficiency advantage here.
Python: again, seems to box all numeric values, and we ran into prohibitive performance problems with vanilla Python. NumPy, which provides a different implementation, does not support the needed features.
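
To make the Plain Java point concrete, here is a minimal sketch of where the boxing happens. The class and method names are made up for illustration, and the sketch only shows the mechanism; it makes no claim about the size of the overhead on any particular JVM.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: the same sum computed with boxed and primitive storage.
class BoxingSketch {
    // Stores 0..n-1 in a List<Long>; the list holds references to Long objects.
    static long sumBoxed(int n) {
        List<Long> values = new ArrayList<>(n);
        for (long i = 0; i < n; i++) {
            values.add(i);          // auto-boxing: long -> Long
        }
        long total = 0;
        for (Long v : values) {
            total += v;             // auto-unboxing: Long -> long
        }
        return total;
    }

    // Stores 0..n-1 in a long[]; the array holds the values themselves.
    static long sumPrimitive(int n) {
        long[] values = new long[n];
        for (int i = 0; i < n; i++) {
            values[i] = i;
        }
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        // Both variants compute the same result; only the storage differs.
        System.out.println(sumBoxed(n) == sumPrimitive(n));
    }
}
```

The convenient variant is the one that works with the collection framework and generics, and it is also the one that allocates wrapper objects.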

Is there a language that handles primitives with the same convenience but efficiently (compared to that language's general performance)?

Thanks for the clarification - from the docs it seems that the implementation uses JVM primitives when possible, and boxed primitives when needed. Therefore, it provides no implementation benefit over the JVM. See The Scala Standard Library, chapter 12 https://scala-lang.org/files/archive/spec/2.12/12-the-scala-standard-library.html — Tom Korach, Jan 27 '19 at 22:30
That depends totally on the implementation. Neither Scala-native nor Scala.js use JVM primitives, and Scala-JVM gets better and better at eliding boxing. — Jörg W Mittag, Jan 27 '19 at 22:31
Edited the question to emphasize the need for numeric types (regardless of being primitive or objects). — Tom Korach, Jan 27 '19 at 22:41

Matt Timmermans · Accepted Answer · 2019-01-28T02:26:57.137

C# fits the criteria, depending on what you mean by the efficiency requirement. It doesn't run on the JVM, of course.

Unlike Java, which implements generics with type erasure, C# reifies generics, much as C++ specializes templates. That means that when you make a List<int>, the underlying array is an array of int, not an array of objects. The code that implements the List methods is also compiled specifically for List<int>, and can take advantage of int-specific optimizations.

For this reason, data processing with primitive types is generally faster in C# than it is in Java when you're using all the convenient language features. It can still be far from what you can get with C++, however, because the runtime checks that prevent buffer overrun, etc., are not free.
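
The Java side of this comparison can be checked directly. The following is a small sketch (the class and method names are invented for illustration) showing that, under type erasure, differently parameterized lists share one runtime class and that an `int` added to a `List<Integer>` is stored as an `Integer` object. It does not demonstrate the C# behaviour, only the Java behaviour it is being contrasted with.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class illustrating type erasure in Java generics.
class ErasureSketch {
    // List<Integer> and List<String> are the same class at runtime.
    static boolean sameRuntimeClass() {
        List<Integer> ints = new ArrayList<>();
        List<String> strings = new ArrayList<>();
        return ints.getClass() == strings.getClass();
    }

    // The element stored in a List<Integer> is an object, not a primitive int.
    static boolean elementIsBoxed() {
        List<Integer> ints = new ArrayList<>();
        ints.add(42);                   // auto-boxed to Integer
        Object element = ints.get(0);   // comes back as a reference
        return element instanceof Integer;
    }

    public static void main(String[] args) {
        System.out.println(sameRuntimeClass());
        System.out.println(elementIsBoxed());
    }
}
```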

edited Jan 28 '19 at 02:26

answered Jan 28 '19 at 02:15

Matt Timmermans


Efficiency in this context means 1) speed and 2) RAM requirements. The JVM's way of fitting numeric types into the collection framework is auto-boxing, which has substantial overhead (see https://docs.oracle.com/javase/8/docs/technotes/guides/language/autoboxing.html) and was found to be a bottleneck in our practical use cases. – Tom Korach Jan 28 '19 at 03:38
@TomKorach you may also want to consider a library of specialized primitive collections like GNU Trove: https://bitbucket.org/trove4j/trove – Matt Timmermans Jan 28 '19 at 03:46
Primitive collections (e.g. Trove and Eclipse Collections) provide alternatives to the Collections classes, but they solve only this specific issue. Anywhere else an Object is expected, the problem remains. For example, Java arrays are compared by reference rather than by content, and therefore HashSet cannot detect two identical instances of Object[]. The solution is to use List (which does override equals() to compare by content), but for int[] this will require auto-boxing. – Tom Korach Jan 28 '19 at 04:11
@TomKorach You can simply wrap your `int[]` array via [`IntBuffer.wrap(…)`](https://docs.oracle.com/javase/8/docs/api/java/nio/IntBuffer.html#wrap-int:A-) and get an object with a proper `hashCode` and `equals`. Further, you can perform efficient processing via `IntStream`, e.g. `Arrays.stream(largeIntArray).parallel().sum()`. There are also a lot of useful convenience methods, including parallel processing, in [`Arrays`](https://docs.oracle.com/javase/8/docs/api/?java/util/Arrays.html), which have no counterpart in the Collection API at all. A `List` still doesn’t provide numeric methods. – Holger Jan 29 '19 at 11:11
In other words, there is no sense in judging the efficiency of data handling by looking at the performance of an API not suitable for handling that data. If you want to do numerics, use numeric APIs, if you want to do text processing, use the text processing APIs. The Collection API is neither, hence doesn’t even fit the convenience requirement, as it provides no methods for the tasks described in the question. – Holger Jan 29 '19 at 11:17
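
A minimal sketch of the `IntBuffer.wrap` and `IntStream` suggestions from the comments above. The class and method names are hypothetical; only `IntBuffer.wrap`, `HashSet`, and `Arrays.stream(...).parallel().sum()` come from the discussion.

```java
import java.nio.IntBuffer;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class: content-based equality for int[] without boxing.
class IntBufferSketch {
    // int[] uses identity-based equals/hashCode, so equal contents are not merged.
    static int distinctByReference(int[] a, int[] b) {
        Set<int[]> set = new HashSet<>();
        set.add(a);
        set.add(b);
        return set.size();
    }

    // IntBuffer compares its remaining elements, so equal contents collapse to one entry.
    static int distinctByContent(int[] a, int[] b) {
        Set<IntBuffer> set = new HashSet<>();
        set.add(IntBuffer.wrap(a));
        set.add(IntBuffer.wrap(b));
        return set.size();
    }

    // Sums a primitive array through IntStream, with no Integer objects involved.
    static int parallelSum(int[] values) {
        return Arrays.stream(values).parallel().sum();
    }

    public static void main(String[] args) {
        int[] first = {1, 2, 3};
        int[] second = {1, 2, 3};
        System.out.println(distinctByReference(first, second));
        System.out.println(distinctByContent(first, second));
        System.out.println(parallelSum(new int[] {1, 2, 3, 4}));
    }
}
```

One caveat with this approach: an `IntBuffer`'s hash code depends on its current contents and position, so the wrapped array should not be modified while the buffer is used as a set element or map key.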
The mentioned issues stem from the language and thus appear in the libraries we use. For example, we need to perform range calculations. Guava's range implementation requires a Comparable object, and thus numeric ranges (a natural and common use case for range data) require auto-boxing, affecting performance. This is a commonly used library that suits our specific needs quite well, except for the implementation details of auto-boxing. The same goes for dedicated text-processing and NLP libraries that resort to auto-boxing. – Tom Korach Jan 31 '19 at 04:05
The point is that the way the language handles numeric values creates limitations that either result in a noticeable performance overhead, or require specific solutions that are partial and harbor their own issues (e.g. Eclipse Collections primitive-object maps do not implement the Map interface, requiring writing custom code instead of reusing code that accepts Map instances). Thus, a language that handles numeric values in a different way could mitigate these issues. – Tom Korach Jan 31 '19 at 04:13
