
I am wondering whether getting GCC to emit the hardware divide instruction (udiv) is possible for a Krait 400 CPU. I followed some of the suggestions here

When I compile with -mcpu=cortex-a15, the code compiles and I do see udiv instructions in the assembly dump.

However, I would like to know:

  1. Is it possible to get it to work with -march=armv7-a? (without specifying a CPU, which is how I originally have it)
  2. I tried -mcpu=krait2, but since I am not using the Snapdragon LLVM (I don't know yet how much effort switching would be), my compiler does not recognize it. Is it possible to take the CPU definition from that LLVM and somehow make it available to my compiler?
  3. Any other method/patch/trick?

My compiler options are as follows:

 /development/android-ndk-r8e/toolchains/arm-linux-androideabi-4.7/prebuilt/linux-x86_64/bin/arm-linux-androideabi-gcc  -DANDROID -DNEON -fexceptions -Wno-psabi --sysroot=/development/android-ndk-r8e/platforms/android-14/arch-arm -fpic -funwind-tables -funswitch-loops -finline-limit=300 -fsigned-char -no-canonical-prefixes -march=armv7-a -mfloat-abi=softfp -mfpu=neon -fdata-sections -ffunction-sections -Wa,--noexecstack  -marm -fomit-frame-pointer -fstrict-aliasing -O3 -DNDEBUG

The error that I get is:

Error: selected processor does not support ARM mode `udiv r1,r1,r3'

As a side note, I am just beginning to understand the whole scheme, so I want to take small steps to understand what I am doing.

Thanks in advance.

EDIT 1:

I tried compiling a separate module containing only the udiv instruction. That module is compiled with the -mcpu=cortex-a15 parameter, while the rest of the application is compiled with the -march=armv7-a parameter. The result was (somewhat expectedly) that the function call overhead hurt the application's run-time performance. I could not inline the code, since trying to inline it resulted in the same error I originally had. I will switch to the Snapdragon LLVM to see whether it performs better before trying to reinvent the wheel. Thanks, everybody, for your answers and tips.

Paco
  • If you are using ndk you can't make such hacks and expect it to work with many targets. If you want to play with krait then just use a15. That would be easiest. – auselen Feb 25 '14 at 19:17
  • @auselen Thanks for noticing it was Android. I gave a *bare metal* answer for querying `idiv` support; if you have Linux, */proc/cpuinfo* is the best source of info. I will try to remove the **ISAR0** and merge all the comments to my answer when I have time... – artless noise Feb 25 '14 at 23:51
  • Thanks @auselen and artless noise. I understand that if this is achieved, then the code would (probably) only work with Armv7-a CPUs **with idiv support**. When you say: "If you want to play with krait then just use a15." you mean that it would be easiest to get it to compile? (i.e. switch the flags for gcc). I am a bit concerned though that the code generated is tailored for a CPU that I actually do not have. At this moment I cannot assess the impact of that change. I think I will also look for literature about that. – Paco Feb 26 '14 at 09:00

1 Answer


idiv (shorthand meaning that both sdiv and udiv are supported) is an optional Cortex-A instruction. Whether a given Cortex-A supports it can be queried via the cp15 ID_ISAR0 register, bits [27:24].

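  /* Get idiv support from ID_ISAR0 bits [27:24].
   * Reading ID_ISAR0 with MRC is privileged on ARMv7 (PL1 only): this works
   * on bare metal or in a kernel, but from Linux user space it typically
   * raises SIGILL, so use /proc/cpuinfo there instead. */
  unsigned int ISAR0;
  int idiv;
  __asm__ volatile ("mrc p15, 0, %0, c0, c2, 0" : "=r" (ISAR0));
#ifdef __thumb2__
  /* 0001 or 0010: SDIV/UDIV are available in Thumb-2. */
  idiv = (ISAR0 & 0x0f000000UL) ? 1 : 0;
#else
  /* ARM mode needs 0010: SDIV/UDIV available in both ARM and Thumb-2. */
  idiv = (ISAR0 & 0x0f000000UL) == 0x02000000UL ? 1 : 0;
#endif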

If bits [27:24] are 0001, only Thumb-2 supports the udiv and sdiv instructions. If bits [27:24] are 0010, both ARM and Thumb-2 modes support them.
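The field decoding can be split out of the privileged MRC read so that it can be unit-tested on any host. A minimal sketch, assuming the raw ID_ISAR0 value has already been obtained as above; isar0_div_field() and idiv_usable() are hypothetical helper names, and the logic simply mirrors the two checks in the snippet above:

```c
#include <stdint.h>

/* Extract ID_ISAR0 bits [27:24], the divide-instruction field. */
unsigned isar0_div_field(uint32_t isar0)
{
    return (unsigned)((isar0 >> 24) & 0xfu);
}

/* Decide whether SDIV/UDIV may be used, given a raw ID_ISAR0 value.
 * thumb2 != 0: code built as Thumb-2 (-mthumb); needs field 0001 or 0010.
 * thumb2 == 0: code built in ARM mode (-marm); needs field 0010. */
int idiv_usable(uint32_t isar0, int thumb2)
{
    unsigned field = isar0_div_field(isar0);

    if (thumb2)
        return field != 0;
    return field == 2;
}
```

For code built with -marm, as in the question, this would be called as idiv = idiv_usable(ISAR0, 0).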

Because gcc flags such as -march=armv7-a mean that the code must work on ALL CPUs of that architecture, and this instruction is optional, it would be an error for gcc to emit it.
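The flip side is that a source file can check at compile time whether the compiler was told that hardware divide exists. A sketch, assuming your GCC or Clang version predefines one of these macros: __ARM_FEATURE_IDIV is the ACLE name and __ARM_ARCH_EXT_IDIV__ an older GCC spelling, and the GCC 4.7 in NDK r8e may define neither, so check with -dM -E first. HAVE_HW_IDIV and udiv_u32() are hypothetical names; the #error guard is meant for a file built like fast_idiv.c with -D_USE_IDIV_=1.

```c
/* Check what your toolchain actually defines, e.g.:
 *   arm-linux-androideabi-gcc -mcpu=cortex-a15 -dM -E - < /dev/null | grep IDIV
 * (assumption: one of the two macros below is reported). */
#if defined(__ARM_FEATURE_IDIV) || defined(__ARM_ARCH_EXT_IDIV__)
#define HAVE_HW_IDIV 1
#else
#define HAVE_HW_IDIV 0
#endif

/* Fail the build if the idiv variant was compiled without hardware divide. */
#if defined(_USE_IDIV_) && !HAVE_HW_IDIV
#error "build this file with a CPU option that enables hardware divide"
#endif

unsigned udiv_u32(unsigned n, unsigned d)
{
    /* On ARM, GCC emits UDIV here when hardware divide is enabled and
     * otherwise calls the __aeabi_uidiv helper. */
    return n / d;
}
```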

You may compile different modules with different flags, such as:

gcc -march=armv7-a -o general.o -c general.c
gcc -mcpu=cortex-a15 -D_USE_IDIV_=1 -o fast_idiv.o -c fast_idiv.c

These modules can be linked together, and the detection code above can be used to select the appropriate routine at run time. For example, both files may contain:

  #include "fir_template.def"

and fir_template.def might contain:

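#ifdef _USE_IDIV_
  #define _FUNC(x) idiv_ ## x
#else
  #define _FUNC(x) x
#endif

/* Expands to fir8() in general.o and to idiv_fir8() in fast_idiv.o. */
int _FUNC(fir8)(FILTER8 *filter, SAMPLE *data)
{
   /* .... filter body; divisions become udiv/sdiv only in fast_idiv.o .... */
}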
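The run-time selection can be sketched with a function pointer that is set once. Everything here is hypothetical: scale()/idiv_scale() stand in for fir8()/idiv_fir8(), and both variants are defined in one file only so that the sketch compiles on its own. In the real setup they come from general.o and fast_idiv.o.

```c
#include <stddef.h>

/* Stand-in for the generic build (from general.o in the real setup). */
void scale(int *data, size_t n, int divisor)
{
    for (size_t i = 0; i < n; i++)
        data[i] /= divisor;   /* generic ARMv7-A build: GCC calls __aeabi_idiv */
}

/* Stand-in for the idiv build (from fast_idiv.o in the real setup). */
void idiv_scale(int *data, size_t n, int divisor)
{
    for (size_t i = 0; i < n; i++)
        data[i] /= divisor;   /* -mcpu=cortex-a15 build: SDIV instruction */
}

typedef void (*scale_fn)(int *data, size_t n, int divisor);

/* Choose an implementation from the CPU check (ID_ISAR0 on bare metal,
 * /proc/cpuinfo on Linux). Call once at start-up and cache the result. */
scale_fn select_scale(int cpu_has_idiv)
{
    return cpu_has_idiv ? idiv_scale : scale;
}
```

A caller would do something like do_scale = select_scale(idiv); once, then call do_scale(samples, count, gain) as needed. Dispatching at the level of a whole loop pays for one indirect call per buffer rather than per division. That is why this approach avoids the per-division call overhead described in EDIT 1 of the question.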

If you know your code will only run on a Cortex-A15, use the -mcpu option. If you want the code to stay generic (supporting all ARMv7-A CPUs) but run faster where possible, you must identify the CPU as outlined above and select the code dynamically.

Addendum: The files above (general.c and fast_idiv.c) could instead be built as separate shared libraries with the same API. Then interrogate /proc/cpuinfo to see whether idiv is supported, and set LD_LIBRARY_PATH (or call dlopen()) to load the appropriate version. Which approach to use will depend on how much code is involved.
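The /proc/cpuinfo check can be sketched as follows. It assumes a kernel that lists hardware-divide support as the idiva (ARM mode) and idivt (Thumb-2) tokens on the Features line. cpuinfo_has_flag() and linux_cpu_has_arm_idiv() are hypothetical names. Per the FUBAR CPU link in the comments, some Krait kernels may not report this correctly, so a negative result is not proof that idiv is absent.

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if `flag` appears as a whole token on a "Features" line of
 * cpuinfo-style text, 0 otherwise. */
static int cpuinfo_has_flag(const char *text, const char *flag)
{
    size_t flen = strlen(flag);
    const char *line = text;

    while (line != NULL && *line != '\0') {
        const char *nl = strchr(line, '\n');
        const char *end = nl ? nl : line + strlen(line);

        if (strncmp(line, "Features", 8) == 0) {
            const char *p = line + 8;
            while (p < end && *p != ':')          /* skip to the colon */
                p++;
            while (p < end) {
                /* skip separators, then measure one token */
                while (p < end && (*p == ' ' || *p == '\t' || *p == ':'))
                    p++;
                const char *tok = p;
                while (p < end && *p != ' ' && *p != '\t')
                    p++;
                if ((size_t)(p - tok) == flen && strncmp(tok, flag, flen) == 0)
                    return 1;
            }
        }
        line = nl ? nl + 1 : NULL;
    }
    return 0;
}

/* Linux/Android: 1 if the kernel reports ARM-mode SDIV/UDIV ("idiva").
 * Reads at most 8 KiB, assumed to be enough to reach the Features line. */
int linux_cpu_has_arm_idiv(void)
{
    char buf[8192];
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (f == NULL)
        return 0;
    size_t n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';
    return cpuinfo_has_flag(buf, "idiva");
}
```

The result could then decide which of the two shared libraries to pass to dlopen().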

artless noise
  • There are lots of other ways to do this. In the future `gcc` may support *tune* function aliases on ARM, but I think this is currently only available on *x86*. However, the idea always works: use two sets of functions and identify the CPU at run time. Probably *cortex-a7* is the best one to use, as it supports `idiv` and is a lowest common denominator (the *cortex-a5* does not implement `idiv`). There are other optional instructions that you need to make sure *fast_idiv.c* doesn't emit. See the other question for possible CPUs. – artless noise Feb 25 '14 at 18:39
  • Err. You have a [FUBAR CPU](https://lkml.org/lkml/2013/3/12/825); you will have to ID it another way as per the link. – artless noise Feb 25 '14 at 18:46
  • I wonder if you can trick gcc with your own implementation of '__aeabi_idiv' and expect link time optimization to fix it perfectly. – auselen Feb 25 '14 at 19:19
  • So what I am asking is: with the armv7-a switch, and supplying your own idiv that contains hardware division, can you get a final binary as if it had been compiled with hardware division support? – auselen Feb 25 '14 at 21:29
  • @auselen With *multi-lib*, you might be able to make `__aeabi_idiv` link to a routine like `sdiv r0, r0, r1\n bx lr\n`, but the advantage of having `idiv` directly is that there is no function call or register juggling to call a function. I guess it may be better, but I think what I proposed above will be fastest with the current state of the tools. – artless noise Feb 25 '14 at 21:29
  • Essentially I wonder if lto can remove the branch to a single instruction function – auselen Feb 25 '14 at 21:31
  • @auselen No, this is not a loader option at present; at the very least the code has to move values to `R0-R3` so that, when called on a non-idiv CPU, the function-call machinery is right. You also need to save a stack frame, whereas otherwise it may be a leaf function. The scheduling is also off, as I suggested in my answer to your original question. The best approach is to compile the function twice; you can use library paths and shared libraries to select the function via a script, or call via function pointers as I suggest above. – artless noise Feb 25 '14 at 22:06