I'm experimenting with the incubating Vector API in Java 17, and I wanted to see whether I could create zero-cost syntactic sugar for it. Here is a small snippet of what I wrote:

import jdk.incubator.vector._

object VectorOps {
  implicit final class FloatVectorOps @inline() (val _this: FloatVector) extends AnyVal {
    @inline def +(that: FloatVector): FloatVector = _this.add(that)
    @inline def apply(i: Int): Float = _this.lane(i)
  }
}

class Test {
  def test(x: Float, y: Float): Float = {
    import VectorOps._
    val SSE = FloatVector.SPECIES_128
    val xv = FloatVector.broadcast(SSE, x)
    val yv = FloatVector.broadcast(SSE, y)
    (xv + yv)(0) // sugar for xv.add(yv).lane(0)
  }
}

I'm using Scala 2.13.5 and Java 17.

The Scala compiler is run with -optimize -opt:inline -opt-warnings:at-inline-failed -Yopt-inline-heuristics:at-inline-annotated -opt:nullness-tracking -opt:box-unbox -opt:copy-propagation -opt:unreachable-code -language:implicitConversions -opt:closure-invocations

The JVM is run with --add-modules jdk.incubator.vector.

However, the Scala compiler compiles the final line of the test method to

    GETSTATIC VectorOps$FloatVectorOps$.MODULE$ : LVectorOps$FloatVectorOps$;
    POP
    GETSTATIC VectorOps$.MODULE$ : LVectorOps$;
    GETSTATIC VectorOps$FloatVectorOps$.MODULE$ : LVectorOps$FloatVectorOps$;
    POP
    GETSTATIC VectorOps$.MODULE$ : LVectorOps$;
    ALOAD 4
    INVOKEVIRTUAL VectorOps$.FloatVectorOps (Ljdk/incubator/vector/FloatVector;)Ljdk/incubator/vector/FloatVector;
    ALOAD 5
    INVOKEVIRTUAL jdk/incubator/vector/FloatVector.add (Ljdk/incubator/vector/Vector;)Ljdk/incubator/vector/FloatVector;
    INVOKEVIRTUAL VectorOps$.FloatVectorOps (Ljdk/incubator/vector/FloatVector;)Ljdk/incubator/vector/FloatVector;
    ICONST_0
    INVOKEVIRTUAL jdk/incubator/vector/FloatVector.lane (I)F
    FRETURN

Those calls to the implicit class constructor completely throw HotSpot off: it is unable to unbox the vector variables, which kills performance. Note that the implicit class constructor is, bytecode-wise, an identity function, so it is effectively a no-op. All the MODULE$ loads are also unnecessary. But HotSpot does not see that.

(Note that the method calls to + and apply were successfully inlined.)

Adding -Yopt-inline-heuristics:everything removes both the constructor calls and MODULE$, and fixes performance, but it's like using a sledgehammer to crack a nut. And like a sledgehammer, it doesn't feel safe.

Of course, writing the entire code in Java style also fixes the performance, but that's not the point.
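
For reference, the "Java style" version mentioned above could look like the following minimal sketch. The class name PlainTest is hypothetical; the Vector API calls are the same ones that the sugared expression is meant to reduce to.

```scala
import jdk.incubator.vector.FloatVector

// Baseline without any implicit sugar (hypothetical class name).
class PlainTest {
  def test(x: Float, y: Float): Float = {
    val species = FloatVector.SPECIES_128
    val xv = FloatVector.broadcast(species, x)
    val yv = FloatVector.broadcast(species, y)
    // Direct calls to the Vector API: no wrapper class, no MODULE$ loads.
    xv.add(yv).lane(0)
  }
}
```

As with the original snippet, this needs the JVM to be started with --add-modules jdk.incubator.vector.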

So my questions:

  1. Can the calls be eliminated without -Yopt-inline-heuristics:everything and without rewriting everything in the original Java syntax?

  2. Scala 3 has some new inlining features. Can this be done in Scala 3 without aggressive optimization options?

Karol S
  • I am pretty sure that you just need to write the `implicit class` without all those `@inline` annotations and without the compiler's optimizer options; the compiler will then transform those extension methods into function calls. Now, if you also want to inline the function call to its body, then I would recommend using `extension` and `inline` from **Scala 3** instead: https://scastie.scala-lang.org/BalmungSan/9NZdsiShRFqoqx4pkOOJ3Q – Luis Miguel Mejía Suárez Jan 10 '22 at 17:13
  • @LuisMiguelMejíaSuárez In Scala 2, without `@inline` and optimization options, the code is even more peppered with unnecessary method calls. I want _zero_ calls to the implicit classes; all calls must go directly to the Vector API. Scala 3 looks promising. I gave it a try and there were no calls to inline extension methods, just like I wanted. However, the vector performance is still awfully bad for some different, unexplained reason. – Karol S Jan 10 '22 at 18:05
  • In **Scala 2** if you do: `implicit class FooOps(private val foo: Foo) extends AnyVal { def bar(x: Int): Int = foo.baz * x }` then doing: `foo.bar(10)` should be expanded by the compiler into `someFooOpsGeneratedBarMethod(foo, 10)` if it does not, please submit a bug. - And, as I said, if you also want to avoid that extra method call and just have `foo.baz * 10` then you do need `inline` from **Scala 3**. – Luis Miguel Mejía Suárez Jan 10 '22 at 18:21
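
The Scala 3 route suggested in these comments might look roughly like the sketch below. It is only an illustration under assumptions: the names VectorSyntax and sum0 are hypothetical, it is not the code from the linked Scastie, and it makes no claim about the resulting performance (which, per the comment above, was still poor for an unexplained reason).

```scala
import jdk.incubator.vector.FloatVector

// Hypothetical holder object for the extension methods (Scala 3 syntax).
object VectorSyntax:
  extension (v: FloatVector)
    // `inline` asks the Scala 3 compiler to expand the body at each call site.
    inline def +(that: FloatVector): FloatVector = v.add(that)
    inline def apply(i: Int): Float = v.lane(i)

// Hypothetical helper mirroring the `test` method from the question.
def sum0(x: Float, y: Float): Float =
  import VectorSyntax.*
  val species = FloatVector.SPECIES_128
  val xv = FloatVector.broadcast(species, x)
  val yv = FloatVector.broadcast(species, y)
  // `apply` is called explicitly here to keep the sketch unambiguous.
  (xv + yv).apply(0)
```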

1 Answer

I've figured out a solution: macros.

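import scala.language.experimental.macros
import jdk.incubator.vector.FloatVector

object VectorOps {
  implicit final class FloatVectorOps @inline() (val _this: FloatVector) extends AnyVal {
    // Both methods are def macros, so each call site is rewritten at compile time.
    @inline def +(that: FloatVector): FloatVector = macro MacroVectorOps.add
    @inline def apply(i: Int): Float = macro MacroVectorOps.lane
  }
}

// Note: macro implementations must be compiled before the code that calls
// + and apply, e.g. in a separate module.
object MacroVectorOps {
  import scala.reflect.macros.blackbox
  def add(c: blackbox.Context)(that: c.Tree): c.Tree = {
    import c.universe._
    // deconstruct the implicit conversion: the prefix is FloatVectorOps(in)
    val q"$conv($in)" = c.prefix.tree
    // emit a direct call on the unwrapped vector
    q"$in.add($that)"
  }
  def lane(c: blackbox.Context)(i: c.Tree): c.Tree = {
    import c.universe._
    val q"$conv($in)" = c.prefix.tree
    q"$in.lane($i)"
  }
}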

Code that uses this compiles to bytecode as efficient as if I had written it with the original Java API; there are no traces of the FloatVectorOps class or the VectorOps object in the bytecode at all.
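
One practical detail: Scala 2 def macros cannot be expanded in the same compilation run that compiles their implementations, so the macro code has to be built before the code that uses it. Assuming an sbt build (the original post does not say which build tool is used), a minimal layout might look like the sketch below. The project names macros and core and their directories are hypothetical.

```scala
// build.sbt (hypothetical project and directory names)
ThisBuild / scalaVersion := "2.13.5"

// Holds VectorOps and MacroVectorOps; compiled first.
lazy val macros = (project in file("macros"))
  .settings(
    // Macro implementations use scala.reflect, which lives in scala-reflect.
    libraryDependencies += "org.scala-lang" % "scala-reflect" % scalaVersion.value
  )

// Holds the code that calls + and apply; compiled after `macros`.
lazy val core = (project in file("core"))
  .dependsOn(macros)
  .settings(
    // Fork so that the JVM flag from the question is applied when running.
    run / fork := true,
    run / javaOptions ++= Seq("--add-modules", "jdk.incubator.vector")
  )
```

Other build tools can achieve the same effect; the only requirement is that the macro module is compiled before its call sites.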

Karol S