Questions tagged [micro-optimization]

Micro-optimization is the process of meticulous tuning of small sections of code in order to address a perceived deficiency in some aspect of its operation (excessive memory usage, poor performance, etc.).

Micro-optimization (and optimization in general) tends to be interesting to programmers because they enjoy finding clever solutions to problems. However, micro-optimization carries the connotation of a disproportionate amount of effort being expended to extract relatively small improvements.

That's not to say that micro-optimization is bad practice in all circumstances. Sometimes a small improvement in a part of a code base that gets used frequently (such as the innermost part of a loop) can yield big overall gains in system performance, and building code for highly constrained systems such as microcontrollers will often require cleverness to eke out the most performance from such a small system.
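As a rough illustration of the inner-loop case, the sketch below sums a row-major matrix twice: once recomputing the index on every inner iteration, and once with the row base pointer hoisted out of the inner loop. The function names and the flat-array layout are assumptions made for this example, and an optimizing compiler will often perform this transformation on its own, so any gain should be measured rather than assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: sum every element of a rows x cols matrix stored
 * row-major in one flat array. The index arithmetic r * cols + c is
 * written out on every inner iteration. */
long sum_naive(const int *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    long total = 0;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            total += m[r * cols + c];
    return total;
}

/* Same result, with the row base pointer computed once per row so the
 * innermost loop only indexes from that pointer. */
long sum_hoisted(const int *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    long total = 0;
    for (size_t r = 0; r < rows; r++) {
        const int *row = m + r * cols;  /* hoisted out of the inner loop */
        for (size_t c = 0; c < cols; c++)
            total += row[c];
    }
    return total;
}
```

Both functions return the same value for the same input; the only difference is how much work is written into the innermost loop.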

However, it can be tempting to indulge in the practice where it is not necessary. This wastes time that could have been spent more productively, and it produces code that is difficult to follow: "clever" solutions are often harder to understand than simple ones, so a micro-optimization can hurt the maintainability of a piece of code.

Programmers are advised to avoid micro-optimization unless they can make a solid case that the performance gains outweigh the problems outlined above. If profiling the code in question identifies a hot spot that is causing a performance bottleneck, that can be sufficient justification for a micro-optimization.
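A minimal sketch of gathering that kind of evidence is shown below, using only the standard `clock()` function. The helper `seconds_per_call` and the workload `candidate` are hypothetical names invented for this example. A dedicated profiler gives far more detailed data, so treat this as a first rough measurement, not a substitute for profiling.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical helper: average CPU seconds per call of fn over reps runs. */
double seconds_per_call(void (*fn)(void), int reps)
{
    assert(fn != NULL && reps > 0);
    clock_t start = clock();
    for (int i = 0; i < reps; i++)
        fn();
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC / reps;
}

/* Hypothetical workload standing in for the suspected hot spot. */
static volatile unsigned long sink;

void candidate(void)
{
    unsigned long acc = 0;
    for (unsigned long i = 0; i < 100000; i++)
        acc += i * i;
    sink = acc;  /* volatile store keeps the result from being discarded */
}
```

Calling `seconds_per_call(candidate, 1000)` before and after a change gives a rough comparison. If the difference is lost in run-to-run noise, the micro-optimization is probably not worth its maintenance cost.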

900 questions
0
votes
2 answers

Fastest way to find 16bit match in a 4 element short array?

I may confirm by using nanobench. Today I don't feel clever and can't think of an easy way. I have an array, short arr[]={0x1234, 0x5432, 0x9090, 0xFEED};. I know I can use SIMD to compare all elements at once, using movemask+tzcnt to find the index…
user20746246
0
votes
0 answers

Causes of performance differences with different instruction orderings?

I ran into a performance issue with a function similar to the following: pub fn attacked(&self, sq: usize) -> bool { self.lut1[sq] || self.lut2[sq] || self.lut3[sq] || self.lut4[sq] || self.lut5[sq] } A number of look-up tables (i.e. arrays,…
user1002430
0
votes
1 answer

How to unroll a loop of a dot product in MIPS after re-ordering instructions?

I got this question about loop unrolling in MIPS, but I could not figure out how to continue once I got to the step I will show you below, and I am not even sure about these steps. I am new to computer architecture. I just had this code snippet, which is in assembly: Loop: ld…
0
votes
1 answer

Flex SQLite optimization: adding database name in front of tables

This article on Livedocs says you should add the database name in front of tables. http://help.adobe.com/en_US/AIR/1.5/devappsflex/WS5b3ccc516d4fbf351e63e3d118666ade46-7d47.html My question is, should I do that everywhere where a table name appears…
Francisc
  • 77,430
  • 63
  • 180
  • 276
0
votes
0 answers

What causes stalled-cycles-frontend to rise?

I'm optimizing code. My old code uses an if statement and a goto on true and false. My new code looks up data in an array (which I thought might raise stalled backend) then uses a goto on true and false to different labels. Branch misses dropped…
0
votes
1 answer

How can I see which i686 instructions are faster?

As the title says, I want to see which i686 instructions are faster, how can I see? Example: is adding to a register faster or moving a value to a reg faster?
Cyao
  • 727
  • 4
  • 18
0
votes
0 answers

Swap function potential missed optimization? (gcc)

I wrote this swap function in Linux x86-64 assembly. swap: ; written by me mov al, byte [rdi] xchg byte [rsi], al mov byte [rdi], al ret Out of curiosity, I also compiled the following C code with -O3 void swap(char *a, char *b) { …
avighnac
  • 376
  • 4
  • 12
0
votes
0 answers

C++ Inline asm idiv modulo

I am writing a modulo function and want to optimize the number of instructions called. Currently it looks like this #include constexpr long long mod = 1e9 + 7; static __attribute__((always_inline)) long long modulo(long long x) noexcept { …
0
votes
1 answer

Is it more efficient to return a string literal or a const string in each subtype in Java?

I'm newish to Java but having coded in similar languages before I feel I should know the answer to this but there are things in Java (e.g. generics) that are counterintuitive so hopefully someone can enlighten me in this case. If I have a base class…
Colm
  • 155
  • 11
0
votes
3 answers

Is there a tool to test the conciseness of a C program?

For example I want to check whether the following code can be more concise or not: for(i = 0; i < map->size; i++){ if(0 < map->bucket[i].n){ p = map->bucket[i].list; while(p){ h = hash(p->key) % n; …
Je Rog
  • 5,675
  • 8
  • 39
  • 47
0
votes
4 answers

Inlining assembly in C

I'm writing a chess engine in C, and speed is essential. The chess engine is based on unsigned long long, which I will denote as u64, and it relies heavily on a least significant bit scan. Up until now I have been using the gcc function…
0
votes
1 answer

Are there performance/storage differences between uint2 and uint64_t in CUDA 10+?

I'm trying to optimize a piece of code for A100 GPUs (Ampere generation). Right now we use uint64_t, but I am seeing uint2 datatypes being used instead in some CUDA code. Does uint2 offer advantages for register usage? I know there are a limited number…
0
votes
0 answers

AVX loop unrolling

I generate a high-performance loop at runtime which, for example, sums two arrays. I want to unroll my loop. Which sequence of operations inside the loop should I choose: a. Load as much data as possible (constrained by the number of ymm registers) b. Process…
Yuriy
  • 377
  • 1
  • 2
  • 10
0
votes
1 answer

How to speed up my program that prints all partitions of an n-element set into k unordered sets

How can I speed up my program? My task: 1<=k<=n<=10, time 1 sec. Print all partitions of an n-element set into k unordered sets. Partitions can be output in any order. Within a partition, sets can be displayed in any order. Within the set, numbers must…
taburetca
  • 31
  • 6
0
votes
0 answers

Do x86 and other architectures have a fused shift and add?

A number of architectures support fused multiply and add such as x86 with pmaddwd (and its SSE extensions), but I am unaware of any x86 fused shift and add which is effectively equivalent to FMA. This question is predominantly about x86, but knowing…
AMDG
  • 1,118
  • 1
  • 13
  • 29