Questions tagged [x86]

x86 is an architecture derived from the Intel 8086 CPU. The x86 family includes the 32-bit IA-32 and 64-bit x86-64 architectures, as well as legacy 16-bit architectures. Questions about the latter should be tagged [x86-16] and/or [emu8086]. Use the [x86-64] tag if your question is specific to 64-bit x86-64. For the x86 FPU, use the tag [x87]. For SSE1/2/3/4 / AVX* also use [sse], and any of [avx] / [avx2] / [avx512] that apply.

The x86 family of CPUs contains 16-, 32-, and 64-bit processors from several manufacturers, with backward-compatible instruction sets, going back to the Intel 8086 introduced in 1978.

There is an x86-64 tag for things specific to that architecture, but most of the info here applies to both. It makes more sense to collect everything here. Questions can be tagged with either or both. Questions specific to features only found in the x86-64 architecture, like RIP-relative addressing, clearly belong in x86-64. Questions like "how to speed up this code with vectors or any other tricks" are fine for x86, even if the intention is to compile for 64bit.

Related tags with tag wikis:

sse wiki (some good SIMD guides), and avx (not much there)
inline-assembly wiki for guides specific to interfacing with a compiler that way.
intel-syntax wiki and att wiki have more details about the differences between the two major x86 assembly syntaxes. And for Intel, how to spot which flavour of Intel syntax it is, like NASM vs. MASM/TASM.

Learning resources

Matt Godbolt's CppCon2017 talk “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” has a gentle introduction to x86 asm itself for asm beginners who know C or C++, as well as a very useful guide to looking at compiler output.

If you don't know how to do something in asm, write a simple C function that does it and see what an optimizing compiler does. e.g. int foo(char *p) { return *p; } shows you how to use movsx. See also How to remove "noise" from GCC/clang assembly output?
Short x86 Assembly Guide targeting 32-bit mode and the MASM assembler, but brief and target-agnostic enough to be used as a starting point for any "Intel" syntax dialect assembler (NASM, YASM, FASM, ...).
Suggestions on how to learn asm, with a recommendation against 16bit DOS. Questions should use the x86-16, emu8086, and/or dos tags if applicable, as well as x86 (which includes all platforms.)
To learn assembly - should I start with 32 bit or 64 bit?
OSdev.org: a great resource if you want to understand / modify OS internals or make your own toy OS. Not useful for writing / debugging normal programs that run under existing OSes.
General Tips for Bootloader Development. (Using legacy BIOS, not UEFI).
Working example of a legacy BIOS int 10h bootloader that loads a "kernel" and calls a C main function in it, in 32-bit protected mode. Includes instructions on how to build and link it with NASM, gcc -m32, and ld (with a linker script). And how to make a disk image and run it on QEMU.
the inline-assembly tag wiki. (But see also https://gcc.gnu.org/wiki/DontUseInlineAsm - inline asm is more complicated than writing stand-alone asm functions you call from C, so it's not good for learning asm.)
Using GNU C/C++ inline ASM. The bottom of that answer has a collection of links to info on how to write inline asm that's efficient and correct. The first part of the answer explains why it's not a good way to learn asm in the first place. Don't try to "get your feet wet" with asm by using inline asm. You have to understand everything to write correct input/output operand constraints and clobbers.
Understanding Carry vs. Overflow conditions/flags, normally relevant for unsigned vs. signed respectively.
Style guide: indenting columns for labels / instructions / operands / comments: a Code Review.SE answer: https://codereview.stackexchange.com/questions/204902/checking-if-a-number-is-prime-in-nasm-win64-assembly/204965#204965
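
As a minimal sketch of the "write a simple C function and see what the compiler does" advice above, a file like the following can be fed to an optimizing compiler. The function names load_signed and load_unsigned are made up for illustration. With gcc or clang targeting x86, something like -O2 -S -masm=intel will typically show movsx for the first function and movzx for the second, although the exact output depends on compiler version and options.

```c
#include <assert.h>
#include <stddef.h>

/* Sign-extending byte load: x86 compilers typically use movsx here.
   signed char is spelled out so the result doesn't depend on whether
   plain char is signed on the target. */
int load_signed(const signed char *p)
{
    assert(p != NULL);
    return *p;
}

/* Zero-extending byte load: typically movzx. */
unsigned load_unsigned(const unsigned char *p)
{
    assert(p != NULL);
    return *p;
}
```

Comparing the two functions in the asm output is an easy way to see the difference between sign extension and zero extension of a narrow load.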

Quick guide to what's different in x86-64. AT&T syntax. NASM and YASM behave differently (from each other) in choice of encoding for mov rax, 1, and don't use a separate movabs mnemonic for the 64bit-immediate form.
Introduction to x64 Assembly (PDF published by Intel). Uses MASM syntax. Spends a bit of time talking about the Windows calling convention and MSVC-specific toolchain issues (like no MSVC inline asm in 64-bit mode), as you might expect from using "x64" in the article title instead of x86-64. But it looks like it has some good generally-applicable stuff that isn't OS-specific. For some bizarre reason, it suggests using the slow LOOP instruction, so it's not perfect.
A NASM tutorial for x86-64 Linux (nasm -felf64) and MacOS (nasm -fmacho64). Includes some basic SIMD stuff, but forgets to use alignas(16) on the C arrays that require alignment, and uses movaps with integer, movdqa with float. (Which is not a correctness problem, and on most CPUs probably not a performance problem, but is backwards.) Otherwise mostly looks good.
Encoding Real x86 Instructions: a tutorial (course material) on how instructions are encoded into machine code. Lots of diagrams.
x86 on Wikipedia
x86 Assembly wikibook
Assembly Language for x86 Processors (website for Kip Irvine's book)
Programming from the Ground Up, a free (GFDL) book by Jonathan Bartlett. Errata for the book. Available as a small (1MB) PDF from the "download" link on that page, or as HTML chapters. It uses 32-bit x86 asm with AT&T syntax on Linux, and has some good stuff about how to "think like a computer" to figure out how to get things done in asm. It covers some essential operating-system topics like virtual memory that are necessary to understand what's going on, as well as assembly / machine language itself.
x86-64 Assembly Language Programming with Ubuntu, a free book using YASM (NASM syntax) for GNU/Linux. The PDF is CC-BY-NC-SA. Unfortunately no mention of default rel or [rel x] RIP-relative addressing so it's missing some stuff that's essential in practice. But does have some introductory stuff about basics like data representation, bits and bytes in memory vs. registers, and other background beyond just what each instruction does.
8086 assembler tutorial for beginners - emu8086 (MASM/TASM style) 16-bit only, but starts out with some nice intro stuff about hex vs. decimal, what assembly language is, what registers are and how memory is addressed, and how to look at memory in the debugger, before jumping into how specific instructions work.
Assembly tutorial - Dr. Paul Carter
Windows Assembly Programming Tutorial
Why do functions have to save some registers, but not others? See below for links to guides & docs for specific calling conventions.
How to trace what a function does: figure out the inputs and the outputs, then figure out what it does with them.
Linux x86 Program Start Up or - How the heck do we get to main()
A Whirlwind Tutorial on Creating Really Teensy ELF Executables for Linux
What do the register-names like esi mean, and what special purposes do they have. They're all acronyms, like Counter register, or Source Index.
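
For the carry vs. overflow distinction mentioned in the resources above, here is a small C model of what an 8-bit add reports in CF and OF. This is only a sketch: the helper names add8_carry and add8_overflow are hypothetical, and real code would just test the flags after the instruction.

```c
#include <stdbool.h>
#include <stdint.h>

/* CF after an 8-bit add: the unsigned result wrapped past 255. */
bool add8_carry(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b) < a;
}

/* OF after an 8-bit add: both inputs have the same sign bit and the
   result's sign bit differs, i.e. the signed result left -128..127. */
bool add8_overflow(uint8_t a, uint8_t b)
{
    uint8_t sum = (uint8_t)(a + b);
    return ((a ^ sum) & (b ^ sum) & 0x80) != 0;
}
```

For example, 0x7F + 0x01 sets OF but not CF (127 + 1 doesn't fit in a signed byte), while 0xFF + 0x01 sets CF but not OF (255 + 1 wraps as unsigned, but -1 + 1 = 0 is fine as signed).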

Guides for performance tuning / optimisation:

Agner Fog's optimization guides and resources. Includes latency/throughput tables for P5 onwards. Also much qualitative discussion of how to go about making your code faster. Also has a good guide to the different calling conventions across OSes, and covers linking / symbols / relocation.
Intel's Sandybridge microarchitecture family can't micro-fuse indexed addressing modes in the out-of-order core, only in the decoders and uop-cache. Also: Haswell's dedicated store-address unit on port7 only works with simple effective addresses. Complex effective addresses need the AGU on a load port.
Enhanced REP MOVSB for memcpy: single-threaded bandwidth vs. aggregate bandwidth on desktop vs. many-core CPUs, RFO vs. non-RFO stores. (Modern CPUs have more DRAM / L3 bandwidth than a single core can use; there are other bottlenecks especially in many-core chips).
What Every Programmer Should Know About Memory by Ulrich Drepper. (Originally posted as a series of LWN articles, Ulrich published the PDF later). How DRAM and caches work, their behaviour, and how to optimize software for cache locality. Includes some charts with real microbenchmark data to illustrate points, and a cache-blocked SSE2 matrix multiply example. See a 2017 review of what's outdated, e.g. the P4 software prefetch stuff is mostly obsolete.
Why xor same,same is better than mov reg, 0 for zeroing a register: there are several reasons, some simple and some subtle (e.g. avoiding partial-register stalls on P6/SnB family).
Serializing RDTSC with LFENCE vs. CPUID for benchmarking short sequences within a program.
How to get the CPU cycle count in x86_64 from C++? (including a bunch of info on what rdtsc measures, exactly, and caveats for using it, with links to even more details).
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?: intro to static performance analysis.
Intel's IACA (Intel Architecture Code Analyzer): analyze marked sections of code for throughput (e.g. cycles per iteration) or latency of the critical path. Assumes perfect cache, and other simplifications, and isn't always correct, but can be useful. Was stalled, but updated again for Skylake-X (AVX512). See What is IACA and how do I use it? for a tutorial.
uiCA (uops.info Code Analyzer) is like IACA but with an accurate model of the front-end fetch/pre-decode/decode (and uop cache or LSD if applicable, I assume) not just 4-wide or 5-wide issue that IACA assumes. See Do 32-bit and 64-bit registers cause differences in CPU micro architecture? for an example output graph.
Haswell microarchitecture, Bulldozer microarchitecture. David Kanter's analysis. He's also done writeups on earlier uarches, like Sandybridge and Nehalem.
Modern Microprocessors A 90-Minute Guide!: from in-order pipelined to super-scalar out-of-order. And brainiac (PPro) vs. speed demon (Pentium 4), and Pentium 4 hitting the "power wall" in CPU design.
A whirlwind introduction to dataflow graphs: how to analyze dependency chains for throughput and latency.
http://www.uops.info/ very detailed uop / execution port testing on Intel CPUs, finding some things that repeating a large block of the same instruction (like Agner Fog's testing) sometimes misses.
New CPUs will usually have AIDA64 InstLatx64 results before Agner Fog can test and publish updated tables. For example, Skylake-avx512, and see also https://github.com/InstLatx64/InstLatx64 for a mirror + a spreadsheet of Skylake-AVX512 port assignments (compiled from IACA-2.3 output). BDW vs. SKL points out some of the interesting changes in SKL (more throughput for more instructions, different FP latency).
2015 IDF slides from the Skylake power management talk. Unfortunately the main site (http://myeventagenda.com/sessions/0B9F4191-1C29-408A-8B61-65D7520025A8/7/5) which had video (of slides + audio) is offline now.
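
A minimal sketch of the RDTSC-with-LFENCE idea from the list above, assuming GCC or clang (which provide __rdtsc and _mm_lfence through x86intrin.h). The function name tsc_now is made up for illustration, and the non-x86 fallback exists only so the file still compiles elsewhere. See the linked Q&As for what the TSC actually counts; it's reference cycles, not core clock cycles.

```c
#include <stdint.h>

#if defined(__x86_64__) || (defined(__i386__) && defined(__SSE2__))
#include <x86intrin.h>   /* __rdtsc, _mm_lfence (GCC / clang) */

/* Read the time-stamp counter with an lfence on each side, so the
   read isn't reordered with the code being timed. */
uint64_t tsc_now(void)
{
    _mm_lfence();               /* wait for earlier instructions to complete */
    uint64_t t = __rdtsc();
    _mm_lfence();               /* don't let later instructions start early */
    return t;
}
#else
/* Placeholder on targets without rdtsc: always returns 0. */
uint64_t tsc_now(void)
{
    return 0;
}
#endif
```

Typical use is to call tsc_now() before and after a short sequence and subtract, repeating many times and looking at the minimum or median rather than trusting a single sample.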

Instruction set / asm syntax references:

Intel's vector intrinsics finder/search (very good): search by asm mnemonic or C intrinsic name
x86/x64 SIMD Instruction List (SSE to AVX512) Beta: A nice compact table listing instruction mnemonics and their intrinsics, broken down by type and element-size. Detailed pages with graphical data-movement diagrams for each instruction.
SIMD guides in the SSE tag wiki, focusing on how to actually make good use of SIMD in general, not just what the available instructions are.
Intel's manuals, including instruction set reference manual. Extremely detailed description of everything every instruction does to the architectural state. Big, but has a decent index / table of contents. Also on that page: Intel's optimization manual. Some of the same advice as Agner Fog's guides, but sometimes without explaining exactly why in terms of microarch execution ports and other under-the-hood reasons. Also sometimes obsolete, for example recommending against inc/dec long after P4 is irrelevant.
AMD's x86 manuals, including instruction-set reference and optimization manuals.
HTML version of Intel's insn set reference, auto-generated from the PDF. One page per instruction, great for linking in answers.
Another HTML extract, including AVX512, CLFLUSHOPT, etc.. This makes it more cluttered, and harder to find what you need, if you're not targeting AVX512. (But note that CLFLUSH has changed to being strongly-ordered, but felixcloutier.com's HTML extract still has the old documentation. There may be other inaccuracies in the old docs, even for old instructions.)
https://sandpile.org - CPUID maps, instruction encoding, register diagrams, opcode map, miscellaneous other technical details.
x86 Instruction Reference including when introduced (8086, 186, 586, etc) - NASM appendix B. Includes undocumented instructions, and Cyrix-only MMX instructions, and stuff like that.

A fork of an older version includes English descriptions. The original had some errors in which generation introduced each form of each insn but this version keeps the nice formatting while fixing those. Handy for people still developing for x86-16. The similar wikipedia page doesn't mention that 386 is required for the faster 2-operand form of imul r16, r/m16 that doesn't have to calculate the upper half of the result.
x86 Opcode reference guide, sorted by opcode or by mnemonic. 32, 64, or both in one table. The "geek" version includes non-standard / undocumented opcodes, the "coder" one includes columns showing which if any flags are read and written.
Original 8086 errata / anomalies, such as mov ss, src not properly disabling interrupts until the end of the next instruction. Also see the parent directory for some errata, undocumented instructions, and stuff for 186/286/386.
Simply FPU: x87 tutorial. Helpful for understanding old x87 code, esp. the early sections about how the register stack works. (Use SSE for new code.)
fsin's precision is far worse than 1ulp for inputs close to pi, contrary to Intel's previous documentation. The other FP articles in Bruce Dawson's series are also excellent (index in this one on FP comparisons).
GNU as manual, aka gas manual
The NASM manual
YASM manual: describes YASM syntax and macros. Excellent register diagram showing partial registers, with their machine-code encodings, and a reminder on zero-extending vs. unmodified upper parts. (Another simpler register-subset diagram for a single reg).

Possible canonical duplicates for register subsets: Assembly registers in 64-bit architecture includes some calling-convention / usage stuff. How do AX, AH, AL map onto EAX? is a good one for bugs where AL and RAX were used for different things, corrupting each other.
MASM Reference Documentation, and an old MASM 6.1 manual from 1996. Confusing brackets in MASM32 shows that MASM surprisingly ignores brackets around symbolic immediates.
MASM syntax as used by JWasm. JWasm is a portable assembler.
FASM manual
table of AT&T(GNU) vs. NASM syntax for addressing modes and indirect jmp/call
All the available addressing modes (32/64-bit) (Intel syntax, with a note about NASM vs. MASM for mov reg, symbol), with links to further guides.
AT&T addressing-mode syntax
16-bit addressing modes.
TODO: find a good link for AMD's XOP instruction set. (Not recommended for general use; even AMD is dropping XOP support in their Zen architecture.)
Cheat sheet PDF
Win32-specific cheat sheet

OS-specific stuff: ABIs and system-call tables:

x86 ABIs (wikipedia): calling conventions for functions, including x86-64 Windows and System V (Linux). See also Agner Fog's nice calling convention guide
32-bit absolute addresses no longer allowed in x86-64 Linux? (PIE executables are now the default on most distros, with gcc configured with --enable-default-pie.)
Mach-O 64-bit format does not support 32-bit absolute addresses. NASM Accessing Array (OS X's image base is above the low 32, unlike Linux position-dependent executables). Also mentions 2 known bugs in some NASM versions with macho64 and RIP-relative or 64-bit absolute addressing.

System V ABI summary on osdev: i386 and x86-64, with links to random copies of the per-architecture supplement for various architectures, and the generic gABI that all the processor-specific supplement (psABI) documents expand on.
System V psABI official standard current revisions for x86-64 and i386 (wiki page on github, kept up to date by H.J. Lu). Direct link to x86-64 revision 1.0. Also links to the official forum for ABI discussion by maintainers/contributors.
clang/gcc sign/zero extend narrow args to 32bit, even though the System V ABI as written doesn't (yet?) require it. Clang-generated code also depends on it.
System V 32bit (i386) psABI (official standard, rev 1.1 Dec2015), used by Linux and Unix. (Some OSes don't require 16-byte stack alignment for 32-bit code; GNU/Linux does)
(Historical: very old SCO version of the i386 SysV ABI, before 16B stack alignment was required).

OS X 32bit x86 calling convention, with links to the others. The 64bit calling convention is System V. Apple's site just links to a FreeBSD pdf for that.

Windows x86-64 __fastcall calling convention
Windows __vectorcall: documents the 32bit and 64bit versions
Windows 32bit __stdcall: used to call Win32 API functions. That page links to the other calling convention docs (e.g. __cdecl).
ABI cheat sheet: x86 vs. x64 vectorcall and non-vectorcall, vs. SysV. SysV section is incomplete.
Why does Windows64 use a different calling convention from all other OSes on x86-64?: some interesting history, esp. for the SysV ABI where the mailing list archives are public and go back before AMD's release of first silicon.
MSVC's 32bit CRT startup code sets the x87 FPU precision to 53 (double). That entire series of articles (table of contents in this one) is excellent, including asm output from MSVC in some examples.

The Definitive Guide to Linux System Calls (on x86). Examples of how to use int 0x80, 32-bit sysenter, and 64-bit syscall, and how to call through the vDSO for gettimeofday, and has some info about glibc's syscall wrappers. Lots of details, and also some background info / basics for beginners.
Linux system call tables. 64bit syscall numbers, with parameter->register mapping (derived from the kernel source code, and the standard rule for order of args).
FreeBSD system calls: question has FreeBSD syscalls, answer has Linux and others.
What are the calling conventions for UNIX & Linux system calls (and user-space functions) on i386 and x86-64: Note that 32bit int 0x80 restores all registers (including flags) except eax, while 64bit syscall also clobbers rcx and r11 as well as putting the return value in rax.
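
To illustrate the 64-bit syscall register convention described above, here is a hedged sketch of a write system call made without libc on x86-64 Linux. The name raw_write is hypothetical, and GNU C inline asm is used only to keep the sketch in one C file (stand-alone asm is usually the better way to learn this). The call number 1 is write on x86-64 Linux. On any other target the sketch falls back to the libc wrapper.

```c
#include <stddef.h>

#if defined(__x86_64__) && defined(__linux__)
/* write(2) via the syscall instruction.
   Call number goes in rax, args in rdi, rsi, rdx.
   The syscall instruction itself overwrites rcx and r11, so they are
   listed as clobbers; the return value comes back in rax. */
long raw_write(int fd, const void *buf, size_t len)
{
    long ret;
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(1L), "D"((long)fd), "S"(buf), "d"(len)
                      : "rcx", "r11", "memory");
    return ret;                 /* byte count, or -errno on failure */
}
#else
#include <unistd.h>
/* Fallback: the libc wrapper (returns -1 and sets errno on failure). */
long raw_write(int fd, const void *buf, size_t len)
{
    return (long)write(fd, buf, len);
}
#endif
```

Running a program that uses this under strace should show the same write(1, ...) line you would get from the libc wrapper.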

16bit interrupt list: PC BIOS system calls (int 10h / int 16h / etc, AH=callnumber), DOS system calls (int 21h/AH=callnumber), and more.

memory ordering:

Weak vs. Strong Memory Models: what it means when people say x86 has a "strongly ordered memory model". See also the c++ info page for many good links if you're using C11/C++11 atomics.
Memory Reordering Caught in the Act: A test case that demonstrates memory reordering in practice on a multicore x86 CPU.
A better x86 memory model: x86-TSO (extended version): a formal definition of the x86 memory model which hopefully matches how real hardware behaves.
Why isn't add dword [num], 1 atomic, even though it's a single instruction? Also asks about compiling num++ in C++. See also Atomicity on x86: What does it mean for a load or store to be atomic, and how is it implemented internally?
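
As a sketch of the num++ atomicity point above, using C11 atomics (the variable and function names are made up for illustration): the plain increment may well compile to a single add to memory, but without a lock prefix it isn't an atomic read-modify-write with respect to other cores. The atomic version typically compiles to a lock-prefixed instruction on x86. Actually observing lost updates needs multiple threads, which this sketch leaves out.

```c
#include <stdatomic.h>

int plain_counter;            /* racy if several threads call bump_plain() */
atomic_int atomic_counter;    /* safe to increment from several threads */

/* Often a single add to memory, but not atomic across cores. */
void bump_plain(void)
{
    plain_counter++;
}

/* Atomic read-modify-write; relaxed ordering is enough for a pure counter. */
void bump_atomic(void)
{
    atomic_fetch_add_explicit(&atomic_counter, 1, memory_order_relaxed);
}

int read_atomic(void)
{
    return atomic_load(&atomic_counter);
}
```

Comparing the compiler output for bump_plain and bump_atomic is a quick way to see the lock prefix appear.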

Specific behaviour of specific implementations

TLB and Pagewalk Coherence in x86 Processors. Many x86 microarchitectures, especially Intel's, provide stronger ordering guarantees than the ISA requires for modifying a page-table entry that's not already cached in the TLB. Win95 even depended on this. (Don't write new code that depends on this.)
Measuring Reorder Buffer Capacity: another experimental test that demonstrates the capabilities and limits of out-of-order execution in real hardware.
What are the exhaustion characteristics of RDRAND on Ivy Bridge? With an answer from David Johnston (Intel RNG HW designer and librdrand author).

Q&As with good links, or directly useful answers:

Using GNU C/C++ inline ASM. (Same link from the learning-resources section, but worth repeating here.)
What are the best instruction sequences to generate vector constants on the fly?
Parallel programming using Haswell architecture
Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs. Has a long answer including some introductory computer-architecture stuff as well as details of what can stall a Haswell pipeline.
INC instruction vs ADD 1: Does it matter?
How can I run this assembly code on OS X?: OS X getting-started guide. (Symbol names are prepended with _ on OS X, unlike for Linux ELF systems.)
add/sub/LEA can be used with garbage in high bits, so LEA eax, [rdi + rsi*2 - 15] to compute a + 2*b - 15 works fine, even if a and b are only supposed to be 8 or 16 bits.
TODO: find a question about how to use a profiler to measure uops and stuff. perf comes with most Linux distros, and ocperf.py is a wrapper for it that provides more symbolic names for stuff like micro-arch-specific uop counters.
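
The LEA point in the list above (garbage in the high bits doesn't affect the low bits of add/sub/LEA results) can be checked with a small C model. The function names are hypothetical; lea_like stands in for what lea eax, [rdi + rsi*2 - 15] computes, truncated to 32 bits.

```c
#include <stdint.h>

/* Full-width computation, with whatever happens to be in the high bits. */
uint32_t lea_like(uint32_t a, uint32_t b)
{
    return a + 2u * b - 15u;      /* wraps modulo 2^32, like the hardware */
}

/* Reference: the same formula using only the low 8 bits of each input. */
uint8_t narrow_result(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + 2u * b - 15u);
}
```

Carries only propagate from low bits to high bits, so the low 8 bits of lea_like(a, b) always equal narrow_result of the low 8 bits of a and b. That reasoning does not extend to right shifts, division, or anything else where high bits feed into low bits.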

FAQs / canonical answers:

If you have a problem involving one of these issues, don't ask a new question until you've read and understood the relevant Q&A.

(TODO: find better question links for these. Ideally questions that make a good duplicate target for new dups. Also, expand this.)

My program crashes / segfaults: You need to use a debugger to find what instruction is crashing (see the bottom of this tag wiki for GDB and Visual Studio tips). Most buggy asm programs crash, so without more info this is not useful. Reasons can include clobbering registers or stack memory you shouldn't have, leaving esp pointing to the wrong place before a ret, or many many other reasons besides the following other common problems.
external assembly file in visual studio - VS mixed-source x64 project, for asm files as part of a C/C++ program.
Also Assembly programming - WinAsm vs Visual Studio 2017 for a pure asm project.
Building 32bit code on a 64bit system (with the GNU toolchain). gcc example.s makes a binary that runs in 64bit mode, which will crash if the code was written for 32bit mode. Related: What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code?.
Building an executable from asm source that defines _start vs. source that defines main, with gcc/as/ld and/or NASM. With or without libc, and static vs. dynamic executable.
Wide load on narrow data loading or modifying extra bytes, e.g. mov eax, [var] from a db 0.
ret from _start segfaults without making a Linux _exit syscall. ret doesn't work because it's not a function. What happens if there is no exit system call in an assembly program? also covers the case of falling off the end with no ret.

Execution just keeps going if there's no jump or ret, falling through to what's next: What if there is no return statement in a CALLed block of code in assembly programs and Why is no value returned if a function does not explicity use 'ret'.
Code executes condition wrong? fall through from the if into the else body in an if/else. Nicely explains that labels aren't magic and execution falls through them.
Segmentation fault when using DB (define byte) inside a function: putting data where it's executed as code. (Assembly (x86): <label> db 'string',0 does not get executed unless there's a jump instruction for legacy BIOS bootloaders with data at the top.)
idiv / div problems: Zero edx first, or sign-extend eax into it. 32-bit div faults with #DE if the 64b/32b => 32b quotient doesn't actually fit in 32b. (On POSIX systems including Linux, this raises SIGFPE).
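
A hedged C model of the unsigned 32-bit div behaviour described above (the name div32_model is made up for illustration). The dividend is the 64-bit value EDX:EAX, so leftover garbage in edx makes the dividend huge and the quotient too big for 32 bits. Zeroing edx first corresponds to hi = 0 here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of "div r32": dividend is hi:lo (EDX:EAX), results go to
   *quotient (EAX) and *remainder (EDX).
   Returns false where the real instruction would raise #DE. */
bool div32_model(uint32_t hi, uint32_t lo, uint32_t divisor,
                 uint32_t *quotient, uint32_t *remainder)
{
    /* Divide by zero, or quotient >= 2^32 (which happens exactly when
       the high half is not smaller than the divisor). */
    if (divisor == 0 || hi >= divisor)
        return false;

    uint64_t dividend = ((uint64_t)hi << 32) | lo;
    *quotient  = (uint32_t)(dividend / divisor);
    *remainder = (uint32_t)(dividend % divisor);
    return true;
}
```

For example, div32_model(0, 100, 7, ...) gives quotient 14 and remainder 2, while div32_model(7, 0, 7, ...) reports a fault because the quotient would be 2^32.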

8-bit operand size like div dl is the special case where dx isn't involved, just AX and AH/AL. It still faults if the quotient overflows 8 bits.
No output from printf when I pipe the output, or print something without a newline? When you use the exit system call.
Calling printf in x86_64 using GNU assembler calling convention, stack alignment, and working example. Related NASM-syntax version Segfault while calling C function (printf) from Assembly

Canonical duplicate for scanf segfaulting on misaligned stack in modern Linux builds of glibc: glibc scanf Segmentation faults when called from a function that doesn't align RSP
Library functions modify registers / which registers do my functions need to save and restore? This is specified by the calling convention (part of the ABI) for the platform you're targeting. Search for those terms on this page. What registers must be preserved by an x86 function? is a decent canonical duplicate.
mismatched push/pop: if the stack pointer isn't pointing at the return address when you ret, you crash.
How do I handle multi-digit numbers? Linux, Windows, OS X, and DOS system calls for handling user input/output give you ASCII (or UTF-8) characters, or strings of characters. (Canonical Q&A for single-digit failure to do sub al, '0'). You normally need to convert between strings and binary integers to do math on them, like the C functions atoi or sprintf(buf, "%d", number). None of the common system-call APIs for major OSes that run on x86 provide these functions for you; only as libraries.

string-to-integer (32-bit NASM, algorithm works everywhere). (multiply by 10 for place value) Also includes an int-to-string loop.

Printing integers: 16-bit code to print 16 or 32-bit integers (in dx:ax) (1 digit at a time with MS-DOS int 21h, but could be adapted to store into a string or use a different output method.) Another example for unsigned 16b numbers in DOS that calculates digits and stores them into a string in memory.

2-digit decimal numbers (00-99), using BIOS int 10h for each digit: Displaying Time in Assembly. (Just a special case of the general algorithm, not looping.)

NASM x86-64 function to convert and print a 32-bit unsigned integer (using a single Linux write system call on a buffer). Other answers on the same question show printing one character at a time. AT&T version of the same function, also showing a 5x faster version that uses a multiplicative inverse instead of div to divide by the compile-time constant 10.
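
The conversions described above can be sketched in C before writing them in asm. This is a minimal version under stated assumptions: unsigned values only, no overflow checking, and the names parse_u32 and format_u32 are made up for illustration. The loops map fairly directly onto the asm versions linked above (multiply by 10 and add for parsing, repeated division by 10 for printing).

```c
#include <stdint.h>

/* ASCII decimal digits -> integer: total = total*10 + (c - '0').
   Stops at the first non-digit character. */
uint32_t parse_u32(const char *s)
{
    uint32_t total = 0;
    while (*s >= '0' && *s <= '9') {
        total = total * 10 + (uint32_t)(*s - '0');
        s++;
    }
    return total;
}

/* Integer -> ASCII: divide by 10 repeatedly. Digits come out
   least-significant first, so store them backwards from the end of
   the buffer. Returns a pointer to the first digit inside buf. */
char *format_u32(uint32_t value, char buf[11])
{
    char *p = buf + 10;           /* 10 digits max for 32 bits, plus NUL */
    *p = '\0';
    do {
        *--p = (char)('0' + value % 10);
        value /= 10;
    } while (value != 0);
    return p;
}
```

Storing backwards into a buffer and then making one write system call on the result is the same design the linked NASM answer uses.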

How to convert a binary integer number to a hex string? (32-bit NASM code. Scalar, SSE2, SSSE3, AVX512F, and AVX512VBMI versions.)
Loading pointers into registers vs. loading data into registers: Make sure you understand the difference between mov reg, symbol and mov reg, [symbol] (NASM syntax), or MASM syntax: mov reg, OFFSET symbol vs. mov reg, symbol. Many beginner questions are caused by mistakes in dereferencing addresses, or not dereferencing. This is the same as pointers in C.
Invalid combination of opcode and operands error on mov [msg], [ebp+8]? You can't use two memory operands to one instruction. (Why IA32 does not allow memory to memory mov?)
Bit-shifts and rotates need the count in cl, not any other register, or as an immediate constant. shl eax, ebx is impossible, shl eax, 2 is fine, and so is shl eax, cl
Call an absolute pointer in x86 machine code or jmp to an absolute address. With examples in NASM and AT&T syntax.
Why do most x86-64 instructions zero the upper part of a 32 bit register? In fact, all instructions that write a 32bit register zero the upper 32 of the full 64bit register, so mov eax, 1234 is more efficient than mov rax, 1234, but equivalent. This is not the case for writing to 8 and 16bit registers, like al/ah/ax, so you need movzx or movsx if the upper bits might hold garbage and you need to clear them (e.g. before using as part of a memory address).
Using LEA on values that aren't addresses / pointers? It's just a shift-and-add ALU instruction that uses memory-operand syntax and machine encoding.
How to tell the length of an x86 instruction? – with an overview over the x86 instruction encoding
Reversing a string? This well-commented answer uses 16-bit ms-dos system calls to read the string, but the actual loop over the string works the same for 32 or 64-bit code.
Indexing an array without scaling the index by the element width, resulting in overlapping loads or stores. Declaring and indexing an integer array of qwords in assembly (x86-64 AT&T syntax)
boot loader works in QEMU but not on real hardware – real computers sometimes expect the MBR to have a BPB (BIOS parameter block). If the BPB is missing or wrong, the BPB area in the MBR is overwritten with “correct” values, corrupting your boot loader.
How do I do X in assembly: usually the same way you would in another programming language, like C. Figure out what needs to happen to the data before you get bogged down in writing instructions to make it happen.
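
One of the FAQ items above, indexing an array without scaling the index by the element width, is easy to model in C with explicit byte offsets. The function names are hypothetical. load_scaled corresponds to something like mov eax, [arr + rdi*4], and load_unscaled to the buggy mov eax, [arr + rdi]. The caller must keep both accesses inside the array.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Correct: byte offset = index * element size. */
uint32_t load_scaled(const uint32_t *arr, size_t i)
{
    uint32_t v;
    memcpy(&v, (const unsigned char *)arr + i * sizeof arr[0], sizeof v);
    return v;
}

/* Bug: byte offset = index. The 4-byte load straddles two elements
   whenever i isn't a multiple of 4. */
uint32_t load_unscaled(const uint32_t *arr, size_t i)
{
    uint32_t v;
    memcpy(&v, (const unsigned char *)arr + i, sizeof v);
    return v;
}
```

With arr = {0x11111111, 0x22222222, ...}, load_scaled(arr, 1) returns 0x22222222, while load_unscaled(arr, 1) returns a mix of bytes from the first two elements.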

How to get started / Debugging tools + guides

Find a debugger that will let you single-step through your code, and display registers while that happens. This is essential. We get many questions on here that are something like "why doesn't this code work" that could have been solved with a debugger.

On Windows, Visual Studio has a built-in debugger. See Debugging ASM with Visual Studio - Register content will not display. And see Assembly programming - WinAsm vs Visual Studio 2017 for a walk-through of setting up a Visual Studio project for a MASM 32-bit or 64-bit Hello World console application.

On Linux: A widely-available debugger is gdb. See Debugging assembly for some basic stuff about using it on Linux. Also How can one see content of stack with GDB?

There are various GDB front-ends, including GDBgui. Also guides for vanilla GDB:

With layout asm and layout reg enabled, GDB will highlight which registers changed since the last stop. Use stepi to single-step by instructions. Use x to examine memory at a given address (useful when trying to figure out why your code crashed while trying to read or write at a given address). In a binary without symbols (or even sections), you can use starti instead of run to stop before the first instruction. (On older GDB without starti, you can use b *0 as a hack to get gdb to stop on an error.) Use help x or whatever for help on any command.

GNU tools have an Intel-syntax mode that's similar to MASM, which is nice to read but is rarely used for hand-written source (NASM/YASM is nice for that if you want to stick with open-source tools but avoid AT&T syntax):

clang or gcc -Wall -O3 -masm=intel foo.c -fverbose-asm -S -o- | less (affects inline-asm)
GDB: set disassembly-flavor intel (can go in your ~/.gdbinit)
objdump -drwC -Mintel
perf report -Mintel

Another key tool for debugging is tracing system calls. e.g. on a Unix system, strace ./a.out will show you the args and return values of all the system calls your code makes. It knows how to decode the args into symbolic values like O_RDWR, so it's much more convenient (and likely to catch brain-farts or wrong values for constants) than using a debugger to look at registers before/after an int or syscall instruction. Note that it doesn't work correctly on Linux int 0x80 32-bit ABI system calls in 64-bit processes: What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code?.

To debug boot or kernel code, boot it in Bochs, qemu, or maybe even DOSBox, or any other virtual machine / simulator / emulator. Use the debugging facilities of the VM to get way better information than the usual "it locks up" you will experience with buggy privileged code.

Bochs is generally recommended for debugging real-mode bootloaders, especially ones that switch to protected mode; Bochs's built-in debugger understands segmentation (unlike GDB), and can parse a GDT, IDT, and page tables to make sure you got the fields right.

For DOS programs, see the x86-16 tag wiki for debuggers that run inside the guest, and thus can debug a specific DOS program maybe more easily than Bochs for the whole system.

There are also REPL (Read Eval Print Loop) environments that let you type an instruction and see what it does to register values. They may only be useful for user-space code, perhaps not for osdev stuff.

16952 questions

101

votes

7 answers

Why does Intel hide internal RISC core in their processors?

Starting with Pentium Pro (P6 microarchitecture), Intel redesigned its microprocessors and used internal RISC core under the old CISC instructions. Since Pentium Pro all CISC instructions are divided into smaller parts (uops) and then executed by…

assembly x86 intel cpu-architecture

asked Apr 27 '11 at 15:27

Goofy

5,187
5
40
56

101

votes

8 answers

What are IN & OUT instructions in x86 used for?

I've encountered these two instructions IN & OUT while reading "Understanding Linux Kernel" book. I've looked up reference manual. 5.1.9 I/O Instructions These instructions move data between the processor’s I/O ports and a register or…

assembly x86 linux-kernel

asked Jul 09 '10 at 19:23

claws

52,236
58
146
195

101

votes

7 answers

How do I disassemble raw 16-bit x86 machine code?

I'd like to disassemble the MBR (first 512 bytes) of a bootable x86 disk that I have. I have copied the MBR to a file using dd if=/dev/my-device of=mbr bs=512 count=1 Any suggestions for a Linux utility that can disassemble the file mbr?

linux assembly x86 x86-16 mbr

asked Nov 15 '09 at 09:36

sigjuice

28,661
12
68
93

votes

6 answers

Enhanced REP MOVSB for memcpy

I would like to use enhanced REP MOVSB (ERMSB) to get a high bandwidth for a custom memcpy. ERMSB was introduced with the Ivy Bridge microarchitecture. See the section "Enhanced REP MOVSB and STOSB operation (ERMSB)" in the Intel optimization manual…

performance assembly x86 cpu-architecture memcpy

asked Apr 11 '17 at 10:22

Z boson

32,619
11
123
226

votes

7 answers

Limitations of Intel Assembly Syntax Compared to AT&T

To me, Intel syntax is much easier to read. If I go traipsing through assembly forest concentrating only on Intel syntax, will I miss anything? Is there any reason I would want to switch to AT&T (outside of being able to read others' AT&T assembly)?…

linux assembly x86 att intel-syntax

asked Jun 09 '09 at 21:28

oevna

1,246
1
11
10

votes

2 answers

What does "rep; nop;" mean in x86 assembly? Is it the same as the "pause" instruction?

What does rep; nop mean? Is it the same as pause instruction? Is it the same as rep nop (without the semi-colon)? What's the difference to the simple nop instruction? Does it behave differently on AMD and Intel processors? (bonus) Where is the…

assembly x86 x86-64 cpu machine-code

asked Aug 16 '11 at 23:12

Denilson Sá Maia

47,466
33
109
111

votes

3 answers

Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?

LOOP (Intel ref manual entry) decrements ecx / rcx, and then jumps if non-zero. It's slow, but couldn't Intel have cheaply made it fast? dec/jnz already macro-fuses into a single uop on Sandybridge-family; the only difference being that that sets…

performance assembly x86 intel cpu-architecture

asked Mar 02 '16 at 09:01

Peter Cordes

328,167
45
605
847

votes

7 answers

How much memory can a 32 bit process access on a 64 bit operating system?

On Windows, under normal circumstances a 32 bit process can only access 2GB of RAM (or 3GB with a special switch in the boot.ini file). When running a 32 bit process on a 64 bit operating system, how much memory is available? Are there any special…

windows memory x86 virtual-address-space wow64

asked Mar 12 '09 at 17:00

jjxtra

20,415
16
100
140

votes

4 answers

What does the "lock" instruction mean in x86 assembly?

I saw some x86 assembly in Qt's source: q_atomic_increment: movl 4(%esp), %ecx lock incl (%ecx) mov $0,%eax setne %al ret .align 4,0x90 .type q_atomic_increment,@function .size …

c++ qt assembly x86

asked Jan 17 '12 at 07:33

gemfield

3,228
7
27
28

votes

2 answers

Which variable size to use (db, dw, dd) with x86 assembly?

I don't know what all the db, dw, dd, things mean. I have tried to write this little script that does 1+1, stores it in a variable and then displays the result. Here is my code so far: .386 .model flat, stdcall option casemap :none include…

variables assembly x86

asked Apr 16 '12 at 04:32

Progrmr

1,575
4
26
44

votes

1 answer

memory bandwidth for many channels x86 systems

I'm testing the memory bandwidth on a desktop and a server. Skylake desktop 4 cores/8 hardware threads Skylake server Xeon 8168 dual socket 48 cores (24 per socket) / 96 hardware threads The peak bandwidth of the system is Peak bandwidth desktop =…

c x86 openmp avx512 memory-bandwidth

asked Jun 28 '19 at 09:05

Z boson

32,619
11
123
226

votes

3 answers

Where is the lock for a std::atomic?

If a data structure has multiple elements in it, the atomic version of it cannot (always) be lock-free. I was told that this is true for larger types because the CPU can not atomically change the data without using some sort of lock. for…

c++ c++11 x86 atomic stdatomic

asked May 11 '18 at 18:38

curiousguy12

1,741
1
10
15

votes

3 answers

Double cast to unsigned int on Win32 is truncating to 2,147,483,648

Compiling the following code: double getDouble() { double value = 2147483649.0; return value; } int main() { printf("INT_MAX: %u\n", INT_MAX); printf("UINT_MAX: %u\n", UINT_MAX); printf("Double value: %f\n", getDouble()); …

c visual-c++ casting x86 floating-point

asked Sep 20 '20 at 19:52

Matheus Rossi Saciotto

1,100
9
19

votes

2 answers

Can I use Intel syntax of x86 assembly with GCC?

I want to write a small low level program. For some parts of it I will need to use assembly language, but the rest of the code will be written in C/C++. So, if I will use GCC to mix C/C++ with assembly code, do I need to use AT&T syntax or can I…

c gcc assembly x86 inline-assembly

asked Feb 19 '12 at 08:54

Hlib

2,944
6
29
33

votes

3 answers

Why is Windows 32-bit called Windows x86 and not Windows x32?

The Windows operating system can be either 32 bit or 64 bit. The 64 bit version is called Windows x64 but the 32 bit version is called Windows x86. Why isn't it called Windows x32? What is the reason?

windows x86 operating-system 32bit-64bit 32-bit

asked Apr 30 '15 at 17:43

Bacteria

8,406
10
50
67
