I know very little about inline assembly. The code (see here for details) is as follows:

JNIEXPORT void JNICALL
Java_com_xingin_xarengine_RGBAToGrayRenderer_nCopy(JNIEnv *env, jclass clazz, jobject dstBuf,
                                                   jobject srcBuf, jint sz) {
    if(sz & 63){
        sz = (sz & -64) + 64;
    }

    auto dst = (uint8_t volatile*)env->GetDirectBufferAddress(dstBuf);
    auto src = (uint8_t volatile*)env->GetDirectBufferAddress(srcBuf);
    asm volatile (
    "NEONCopyPLD: \n"
    " VLDM %[src]!,{d0-d7} \n"
    " VSTM %[dst]!,{d0-d7} \n"
    " SUBS %[sz],%[sz],#0x40 \n"
    " BGT NEONCopyPLD \n"
    : [dst]"+r"(dst), [src]"+r"(src), [sz]"+r"(sz) : : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "cc", "memory");
    LOGD("Use neon registers for memory copy");
}

It is basically meant to copy memory through NEON registers. However, the compiler complains when I build my application:

Build command failed.
Error while executing process /Users/user/Library/Android/sdk/cmake/3.10.2.4988404/bin/ninja with arguments {-C /Users/user/Projects/XarEngine/android/arview/.cxx/Release/5s3f6f2r/arm64-v8a XarEngine}
ninja: Entering directory `/Users/user/Projects/XarEngine/android/arview/.cxx/Release/5s3f6f2r/arm64-v8a'
[1/2] Building CXX object CMakeFiles/XarEngine.dir/XarEngine/details.cpp.o
FAILED: CMakeFiles/XarEngine.dir/XarEngine/details.cpp.o 
/Users/user/Library/Android/sdk/ndk/21.1.6352462/toolchains/llvm/prebuilt/darwin-x86_64/bin/clang++ --target=aarch64-none-linux-android21 --gcc-toolchain=/Users/user/Library/Android/sdk/ndk/21.1.6352462/toolchains/llvm/prebuilt/darwin-x86_64 --sysroot=/Users/user/Library/Android/sdk/ndk/21.1.6352462/toolchains/llvm/prebuilt/darwin-x86_64/sysroot  -DXarEngine_EXPORTS -D__GIT_TAG__=\"1.3.3-7-g59b0706\" -I../../../../../../components/PlaneTracker/include -I../../../../../../thirdparty/rapidjson -I../../../../../../thirdparty/filament/include -I../../../../../../thirdparty/opencv_4.5.3/include -g -DANDROID -fdata-sections -ffunction-sections -funwind-tables -fstack-protector-strong -no-canonical-prefixes -D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security   -s -O2 -O2 -DNDEBUG  -fPIC -MD -MT CMakeFiles/XarEngine.dir/XarEngine/details.cpp.o -MF CMakeFiles/XarEngine.dir/XarEngine/details.cpp.o.d -o CMakeFiles/XarEngine.dir/XarEngine/details.cpp.o -c ../../../../../../XarEngine/details.cpp
clang++: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
../../../../../../XarEngine/details.cpp:175:48: warning: value size does not match register size specified by the constraint and modifier [-Wasm-operand-widths]
    : [dst]"+r"(dst), [src]"+r"(src), [sz]"+r"(sz) : : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "cc", "memory");
                                               ^
../../../../../../XarEngine/details.cpp:173:12: note: use constraint modifier "w"
    " SUBS %[sz],%[sz],#0x40 \n"
           ^~~~~
           %w[sz]
../../../../../../XarEngine/details.cpp:175:48: warning: value size does not match register size specified by the constraint and modifier [-Wasm-operand-widths]
    : [dst]"+r"(dst), [src]"+r"(src), [sz]"+r"(sz) : : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "cc", "memory");
                                               ^
../../../../../../XarEngine/details.cpp:173:18: note: use constraint modifier "w"
    " SUBS %[sz],%[sz],#0x40 \n"
                 ^~~~~
                 %w[sz]
../../../../../../XarEngine/details.cpp:171:6: error: vector register expected
    " VLDM %[src]!,{d0-d7} \n"
     ^
<inline asm>:2:12: note: instantiated into assembly here
 VLDM x0!,{d0-d7} 
           ^
../../../../../../XarEngine/details.cpp:172:6: error: vector register expected
    " VSTM %[dst]!,{d0-d7} \n"
     ^
<inline asm>:3:13: note: instantiated into assembly here
 VSTM x21!,{d0-d7} 
            ^
2 warnings and 2 errors generated.
ninja: build stopped: subcommand failed.

Can anyone help me make sense of the output above?

UPDATE
Is this related to the compiler? My compiler is clang, while the inline assembly above should be GCC-compatible.

Peter Cordes
Finley
  • I hope this is just an experiment in getting the syntax right, not that you're expecting a speedup from this vs. the JVM's own memcpy. I'd expect a JVM to use NEON regs for memcpy if available, without the overhead of marshalling for a JNI call. As for the actual errors, seems weird, I'd have expected a `jint` to be an integer type that could use `"+r"`. – Peter Cordes Nov 24 '22 at 07:20
  • If you want to manually vectorize pixel-format conversions, like averaging the RGB to a single gray level and packing 4 bytes down to 1, I'd suggest using intrinsics and see if the compiler does a decent job. If not, then maybe hand-optimize the asm and wrap it up in an inline `asm` statement. But hopefully you won't need to mess with asm directly, just read it while you tweak the C++ source, to get good performance. – Peter Cordes Nov 24 '22 at 07:23
  • @PeterCordes It's not an experiment; it will be adopted in my application if it does indeed speed up memcpy – Finley Nov 24 '22 at 07:24
  • The reason for using inline assembly for memcpy is somewhat complicated. Put simply, `dstBuf` is a DMA buffer mapped from GPU memory and is not cached on the CPU, so calling C++ memcpy directly may be very slow on some GPUs (e.g. Mali); using NEON registers is a way to work around that – Finley Nov 24 '22 at 07:34
  • @PeterCordes refer to this [page](https://community.arm.com/support-forums/f/graphics-gaming-and-vr-forum/6657/how-to-gain-performance-through-pbo-pixel-buffer-object-on-mali-t-880) for more details – Finley Nov 24 '22 at 07:37
  • Ok, that makes some sense. Yeah, a JVM memcpy might not be optimized to read whole 64-byte chunks with a single instruction, so yeah might be super bad on uncacheable memory. And I don't think intrinsics could let you tell the compiler you want a `vldm` like that. – Peter Cordes Nov 24 '22 at 07:41
  • If the `RGBAToGrayRenderer` part of your function name describes what this is part of, probably still best to do that on the fly as part of this loop, unless you also need a copy of the RGBA data. So you don't have to read back the 4x-larger RGBA data in another conversion pass, and you get to overlap the ALU shuffle/add work with waiting for loads from GPU memory. (Probably want to software-pipeline this so you load one chunk, process + store another, to hide more load latency by not using a load result right away.) Those are all things you can think about once you have the basics working. – Peter Cordes Nov 24 '22 at 07:41
  • Also, you can make rounding up the copy size unconditional with `sz = (sz+63) & -64` – Peter Cordes Nov 24 '22 at 07:42
  • @PeterCordes Yes, but it doesn't matter for now because the RGBA texture is always 720p – Finley Nov 24 '22 at 07:47
  • https://godbolt.org/z/YvhT58xf9 shows GCC also giving errors when you compile this for AArch64; `vldm` isn't a valid mnemonic. As I thought, it only exists in 32-bit ARM. But as your log shows, clang is building for `--target=aarch64-none-linux-android21`. – Peter Cordes Nov 24 '22 at 07:51
  • You also get that clang warning (not error) about integer size if you use `"+r"` with `unsigned int` without `%w[sz]` modifiers, so probably `jint` is a 32-bit integer type. https://godbolt.org/z/zqqx574o7 . So I guess do that, to save an instruction on having the compiler sign-extend it for you if you did `unsigned long sz = jsize;` – Peter Cordes Nov 24 '22 at 07:53
  • @PeterCordes Is a minimal reproducible Android project necessary? I can upload it to GitHub – Finley Nov 24 '22 at 07:55
  • No, the cause is already clear: you wrote some code that only works for 32-bit ARM (https://godbolt.org/z/1Pcs7GhjE), and are compiling it for 64-bit AArch64 as part of your android project. Use `#ifdef __aarch64__` to make sure you use the right inline asm. ([What predefined macro can I use to detect the target architecture in Clang?](https://stackoverflow.com/q/23934862) / [Get architecture type (ABI) to C preprocessor for Android NDK](https://stackoverflow.com/q/17067263)) – Peter Cordes Nov 24 '22 at 07:56
  • @PeterCordes Hey, I found a workaround in this [post](https://stackoverflow.com/questions/61210517/memcpy-for-arm-uncached-memory-for-arm64), and it works for me! – Finley Nov 24 '22 at 08:26
  • @PeterCordes Also, would you mind recommending some references for learning arm64-v8a/armeabi-v7a inline assembly? I googled it, but most of what I found is not what I want. I know little about this kind of code and cannot maintain it in my project otherwise – Finley Nov 24 '22 at 08:30
  • The GCC manual's inline asm docs (https://gcc.gnu.org/onlinedocs/gcc/Using-Assembly-Language-with-C.html) have a section on machine-specific constraints. https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html AArch64 is at the top, alphabetically first so it's easy to find. Unfortunately there aren't official docs specifying modifiers like `%w0` to print the 32-bit version of a register name (`w0` instead of `x0`). I think there are some SO questions about it. I don't do much with AArch64 besides Stack Overflow questions about it, so there's probably some big gaps in my knowledge. – Peter Cordes Nov 24 '22 at 09:15
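
The unconditional rounding suggested in the comments can be checked at compile time. This is only a small sketch: `roundUp64` and `roundUp64Branchy` are hypothetical names that do not appear in the question, and the second one just mirrors the `if (sz & 63)` form from the original code so the two can be compared.

```cpp
#include <cstdint>

// Hypothetical helper: rounds n up to the next multiple of 64 without a
// branch. Assumes n is non-negative and n + 63 does not overflow int32_t.
constexpr int32_t roundUp64(int32_t n) {
    return (n + 63) & -64;
}

// The conditional form used in the question, written as one expression.
constexpr int32_t roundUp64Branchy(int32_t n) {
    return (n & 63) ? (n & -64) + 64 : n;
}

static_assert(roundUp64(0) == 0, "zero stays zero");
static_assert(roundUp64(1) == 64, "1 rounds up to 64");
static_assert(roundUp64(64) == 64, "exact multiples are unchanged");
static_assert(roundUp64(65) == 128, "65 rounds up to 128");
static_assert(roundUp64(65) == roundUp64Branchy(65), "both forms agree");
// A 1280x720 RGBA frame is already a multiple of 64 bytes.
static_assert(roundUp64(1280 * 720 * 4) == 1280 * 720 * 4, "720p RGBA unchanged");
```

Since `jint` is a 32-bit type, the result could then be widened to a 64-bit variable before it is passed to AArch64 inline asm as a `"+r"` operand, or the operand could be printed with `%w[sz]` as the clang note suggests.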

1 Answer

Duplicate of this post. As @PeterCordes says, the inline assembly above can only be compiled for 32-bit ARM.
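
For illustration, here is a rough sketch of how the copy loop might be guarded per architecture, along the lines of the `#ifdef __aarch64__` suggestion in the comments. It is not the code from the linked post. The helper name `copyChunks64` is made up, the AArch64 branch assumes `ldp`/`stp` on `q` registers is an acceptable replacement for `VLDM`/`VSTM`, and the 32-bit branch assumes NEON is enabled for the build. Whether any of this actually helps on uncached GPU memory would need to be measured on the target device.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from the original post): copies `size` bytes in
// 64-byte chunks. `size` must be a non-zero multiple of 64.
static void copyChunks64(uint8_t *dst, const uint8_t *src, size_t size) {
    assert(size != 0 && size % 64 == 0);
#if defined(__aarch64__)
    // AArch64 has no VLDM/VSTM; load and store pairs of 128-bit q registers.
    // size_t is 64 bits here, so %[sz] names an x register and needs no %w.
    asm volatile(
        "1:\n"
        "  ldp q0, q1, [%[src]], #32\n"
        "  ldp q2, q3, [%[src]], #32\n"
        "  stp q0, q1, [%[dst]], #32\n"
        "  stp q2, q3, [%[dst]], #32\n"
        "  subs %[sz], %[sz], #64\n"
        "  b.gt 1b\n"
        : [dst] "+r"(dst), [src] "+r"(src), [sz] "+r"(size)
        :
        : "v0", "v1", "v2", "v3", "cc", "memory");
#elif defined(__arm__) && defined(__ARM_NEON)
    // 32-bit ARM: the loop from the question, with a numeric local label so
    // the asm can be inlined more than once without duplicate symbols.
    asm volatile(
        "1:\n"
        "  vldm %[src]!, {d0-d7}\n"
        "  vstm %[dst]!, {d0-d7}\n"
        "  subs %[sz], %[sz], #0x40\n"
        "  bgt 1b\n"
        : [dst] "+r"(dst), [src] "+r"(src), [sz] "+r"(size)
        :
        : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "cc", "memory");
#else
    // Any other target: plain memcpy keeps the function portable.
    std::memcpy(dst, src, size);
#endif
}
```

In the JNI function this would be called with the two addresses returned by `GetDirectBufferAddress` and the rounded-up size, after checking that neither address is null.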

Finley