Why does Windows require DLL data to be imported?

Question

On Windows, data can be loaded from DLLs, but it requires indirection through a pointer in the import address table. As a result, the compiler must know whether an object being accessed is imported from a DLL, which is indicated with the __declspec(dllimport) storage-class attribute.

This is unfortunate because it means that a header for a Windows library designed to be used as either a static library or a dynamic library needs to know which version of the library the program is linking to. This requirement does not apply to functions, which are transparently emulated for DLLs with a stub function that jumps to the real function, whose address is stored in the import address table.
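A rough C model of why functions need no annotation. It assumes the import library's stub does nothing but forward through the IAT slot (on x86 the real stub is an indirect `jmp` through the `__imp_` pointer); an ordinary C call stands in for that jump. The names `dll_add_one`, `iat_add_one`, `add_one` and `simulate_loader_fn` are hypothetical.

```c
/* Hypothetical model of an imported function; not real loader code. */

/* The real function, living in the DLL. */
static int dll_add_one(int x) { return x + 1; }

/* The importing module's IAT slot; the loader stores the real address here. */
static int (*iat_add_one)(int);

/* Stub supplied by the import library. Callers compiled without dllimport
   make a direct call to this stub, whose address is known at link time,
   and the stub forwards through the IAT slot. */
int add_one(int x) { return iat_add_one(x); }

/* Stands in for the loader resolving the import at load time. */
void simulate_loader_fn(void) { iat_add_one = dll_add_one; }
```

Because every caller reaches the stub with a plain direct call, the caller's code is the same whether the function ends up in a DLL or not. No equivalent stub can stand in for a variable, since a load or store has no code of its own to redirect.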

On Linux the dynamic linker (ld.so) copies the values of all linked data objects from a shared object into a private mapped region for each process. This doesn't require indirection because the address of the private mapped region is local to the module, so its address is decided when the program is linked (and in the case of position independent executables, relative addressing is used).

Why doesn't Windows do the same? Is there a situation where a DLL might be loaded more than once, and thus require multiple copies of linked data? Even if that were the case, it wouldn't apply to read-only data.

It seems that the MSVCRT handles this issue by defining the _DLL macro when targeting the dynamic C runtime library (with the /MD or /MDd flag), then using that in all standard headers to conditionally declare all exported symbols with __declspec(dllimport). I suppose you could reuse this macro if you only supported statically linking when using the static C runtime and dynamically linking when using the dynamic C runtime.
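The usual workaround is a per-library export macro. The sketch below assumes hypothetical macros `MYLIB_DLL` (defined whenever the library is built or consumed as a DLL) and `MYLIB_BUILD` (defined only while compiling the library itself); following the idea above, `MYLIB_DLL` could be replaced by MSVC's `_DLL` if the library's linkage always matches the CRT's. All `mylib_*` names are illustrative.

```c
/* mylib.h (sketch): one header serving both a static build and a DLL build. */
#if defined(_WIN32) && defined(MYLIB_DLL)
#  ifdef MYLIB_BUILD
#    define MYLIB_API __declspec(dllexport)   /* compiling the DLL itself */
#  else
#    define MYLIB_API __declspec(dllimport)   /* consuming the DLL */
#  endif
#else
#  define MYLIB_API                           /* static library or non-Windows */
#endif

/* For data, the annotation is required for correctness in the DLL case. */
extern MYLIB_API int mylib_counter;
/* For functions, dllimport on the consumer side is only an optimization. */
MYLIB_API int mylib_increment(void);

/* mylib.c (normally a separate file, compiled with MYLIB_BUILD defined). */
int mylib_counter = 0;
int mylib_increment(void) { return ++mylib_counter; }
```

The cost the question describes is visible here: a consumer must pass `MYLIB_DLL` (or rely on `_DLL`) consistently with how it links, otherwise the data declaration is wrong.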

References:

LNK4217 - Russ Keldorph's WebLog (emphasis mine)

__declspec(dllimport) can be used on both code and data, and its semantics are subtly different between the two. When applied to a routine call, it is purely a performance optimization. For data, it is required for correctness.

[...]

Importing data

If you export a data item from a DLL, you must declare it with __declspec(dllimport) in the code that accesses it. In this case, instead of generating a direct load from memory, the compiler generates a load through a pointer, resulting in one additional indirection. Unlike calls, where the linker will fix up the code correctly whether the routine was declared __declspec(dllimport) or not, accessing imported data requires __declspec(dllimport). If omitted, the code will wind up accessing the IAT entry instead of the data in the DLL, probably resulting in unexpected behavior.

Importing into an Application Using __declspec(dllimport)

Using __declspec(dllimport) is optional on function declarations, but the compiler produces more efficient code if you use this keyword. However, you must use __declspec(dllimport) for the importing executable to access the DLL's public data symbols and objects.

Importing Data Using __declspec(dllimport)

When you mark the data as __declspec(dllimport), the compiler automatically generates the indirection code for you.

Importing Using DEF Files (interesting historical notes about accessing the IAT directly)

How do I share data in my DLL with an application or with other DLLs?

By default, each process using a DLL has its own instance of all the DLLs global and static variables.

Linker Tools Warning LNK4217

What happens when you get dllimport wrong? (seems to be unaware of data semantics)

How do I export data from a DLL?

CRT Library Features (documents the _DLL macro)
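The failure mode in the Russ Keldorph quote can be modelled in portable C. This is a sketch, not real loader behaviour: `dll_value` stands for the exported variable, `imp_dll_value` for the importing module's IAT slot (an `__imp_` symbol in real code), and `simulate_loader_data` for the loader filling that slot. All names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical model of an imported variable; not real loader code. */

/* The exported variable, living in the DLL's data section. */
unsigned int dll_value = 7;

/* The importing module's IAT slot; the loader stores the variable's
   address here at load time. */
unsigned int *imp_dll_value;

void simulate_loader_data(void) { imp_dll_value = &dll_value; }

/* With dllimport: load the pointer from the IAT, then load the data. */
unsigned int read_with_dllimport(void) { return *imp_dll_value; }

/* Without dllimport: the access lands on the IAT entry itself, so the
   code sees the pointer's bits instead of the variable's value. */
uintptr_t read_without_dllimport(void) { return (uintptr_t)imp_dll_value; }
```

The second reader compiles and links without complaint, which is why the quoted text calls the result "unexpected behavior" rather than a build error.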

_Why doesn't Windows do the same? Is there a situation where a DLL might be loaded more than once, and thus require multiple copies of linked data?_ I think I read about "performance improvements" in Windows' early days. — movcmpret, Nov 09 '18 at 14:09
@movcmpret The Windows way is strictly less efficient. It's an indirect access (through the IAT) instead of a direct access to an address known at link time (whose contents are initialized at load time). — J Alan, Nov 09 '18 at 14:13
@JAlan The address inside an external module cannot be known at link time. Accessing data in an external module via a pointer to it is the only way. Windows is efficient. However, your question is unclear to me — RbMm, Nov 09 '18 at 14:33
The indirect jump through the IAT or dllimport pointer prevents modifying the code pages when the target DLL could not be loaded at its link-time base address. That mattered a lot back when it had to run in 16MB of RAM. It still doesn't hurt. — Hans Passant, Nov 09 '18 at 14:35
[Stackoverflow](https://stackoverflow.com/questions/16737347/shared-libraries-windows-vs-linux-method) This is an interesting thread explaining why Windows works like this (related to @HansPassant's comment). — movcmpret, Nov 09 '18 at 14:40
@RbMm The data is copied from the external module into the local module when the external module is loaded. See the quoted reference in the post: "By default, each process using a DLL has its own instance of all the DLLs global and static variables." — J Alan, Nov 09 '18 at 14:46
I'm on the fence here: is this question answerable? Is the "why?" of this subject mainly a matter of opinion, or is there a technical answer a specialist can give? Even if it is answerable as a technical question, is Stack Overflow the *best* Stack Exchange site for this type of question? My aim is for this question to be in front of the best audience for answering it; HTH — landru27, Nov 09 '18 at 14:47
@HansPassant It doesn't matter where the DLL is loaded, because the data is copied into the module loading it, at an address known to it at link time. See my previous comment. — J Alan, Nov 09 '18 at 14:50
@landru27 I am looking for a technical answer, but I would accept a historical answer in lieu of that. This is definitely the best site for it, because Stack Overflow extensively deals with C, the Win32 API, compilers, and operating systems. I also think there is value in leaving this question up, because I have found no references whatsoever discussing this particular limitation. — J Alan, Nov 09 '18 at 14:52
No, nothing gets copied, you are addressing the data in the DLL. But data is not relocatable so it has to be done through a pointer. That is why dllimport is a hard requirement for exported data, it tells the compiler to always use the pointer even if your code doesn't. — Hans Passant, Nov 09 '18 at 14:58
I think I finally understand. The data from a DLL is mapped to each process with a copy on write memory map, along with a read only executable text segment, and so on. The dynamic linker loads the entire DLL into a contiguous region of space, so it doesn't know its address at link time. Is this correct? It seems like it could create a second map at a known address covering just the data segments with careful address layout. Do you know how Linux works in this regard? Or where to find more information? — J Alan, Nov 09 '18 at 15:08
*The data is copied from the external module into the local module when the external module is loaded.* Of course not. Nothing is copied; that is simply impossible. Copied to what place?! — RbMm, Nov 09 '18 at 15:32
Note that even on ELF targets (such as Linux), the same table structure exists (called the GOT [global offset table]). It's used when shared objects refer to any external data objects. It's just that ELF has an optimisation for the common case that an executable refers to a data object declared in a library. — fuz, Nov 10 '18 at 12:05
I'm voting to close this question as off-topic because this is a software engineering question not a programming one and should be asked on https://softwareengineering.stackexchange.com/ — Rob, Nov 10 '18 at 12:58

J Alan · Accepted Answer · 2018-11-12T01:28:06.663

Linux and Windows use different strategies for accessing data stored in dynamic libraries.

On Linux, an undefined reference to an object is resolved to a library at link time. The linker finds the size of the object and reserves space for it in the .bss or the .rdata segment of the executable. When the program runs, the dynamic linker (ld.so) resolves the symbol to a dynamic library (again) and copies the object's initial contents from the dynamic library into the space reserved in the executable.
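This copying is what ELF calls a copy relocation (for example `R_X86_64_COPY` on x86-64). The toy model below assumes ld.so's work can be reduced to one `memcpy` plus one GOT update; it is not real loader code, and every name in it is hypothetical. The executable accesses its copy directly, while the library's own code goes through its GOT, which ld.so points at the executable's copy so that both see the same object.

```c
#include <string.h>

/* Toy model of an ELF copy relocation; not real ld.so code. */

/* Initial contents of the object as stored in the shared library. */
static const unsigned char lib_initial_dat[4] = {1, 2, 3, 4};

/* Space the static linker reserved in the executable's .bss (size only). */
unsigned char exe_dat[4];

/* The library's GOT entry for the object; library code always uses it. */
unsigned char *lib_got_dat;

void simulate_ld_so(void) {
    memcpy(exe_dat, lib_initial_dat, sizeof exe_dat); /* the copy relocation */
    lib_got_dat = exe_dat;  /* the library now uses the executable's copy */
}

/* Executable code: direct access at a link-time address. */
void exe_write(unsigned int i, unsigned char v) { exe_dat[i] = v; }

/* Library code: indirect access through its GOT. */
unsigned char lib_read(unsigned int i) { return lib_got_dat[i]; }
```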

On Windows, an undefined reference to an object is resolved to an import library at link time, and no space is reserved for it. When the module is executed, the dynamic linker resolves the symbol to a dynamic library, and creates a copy on write memory map in the process, backed by a shared data segment in the dynamic library.
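A matching toy model of the Windows arrangement, assuming two importing modules A and B: each has its own IAT slot for the same export, and both slots hold the address of the single variable in the DLL, so a write made through one slot is visible through the other. This is a sketch, not real loader code; all names are hypothetical.

```c
/* Toy model of one DLL variable imported by two modules; not real loader code. */

/* The variable, living in the DLL's data section. */
int dll_err;

/* Each importing module has its own IAT slot for the same export. */
int *a_imp_dll_err;
int *b_imp_dll_err;

void simulate_windows_loader(void) {
    a_imp_dll_err = &dll_err;  /* both slots resolve to the same address */
    b_imp_dll_err = &dll_err;
}

void a_set_err(int v) { *a_imp_dll_err = v; }   /* code in module A */
int b_get_err(void) { return *b_imp_dll_err; }  /* code in module B */
```

Nothing is copied into the importing modules; only the pointers in their IATs are filled in, which is why every access pays the extra indirection.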

The advantage of a copy on write memory map is that if the linked data is unchanged, then it can be shared with other processes. In practice this is a trifling benefit which greatly increases complexity, both for the toolchain and programs using dynamic libraries. For objects which are actually written this is always less efficient.

I suspect, although I have no evidence, that this decision was made for a particular and now outdated use case. Perhaps it was common practice to use large (for the time) read only objects in dynamic libraries on 16-bit Windows (in official Microsoft programs or otherwise). Either way, I doubt anyone at Microsoft has the expertise and time to change it now.

In order to investigate the issue I created a program which writes to an object from a dynamic library. It writes one byte per page (4096 bytes) in the object, then writes the entire object, then retries the initial one byte per page write. If the object is reserved for the process before main is called, the first and third loops should take approximately the same time, and the second loop should take longer than both. If the object is a copy on write map to a dynamic library, the first loop should take at least as long as the second, and the third should take less time than both.

The results are consistent with my hypothesis, and analyzing the disassembly confirms that Linux accesses the dynamic library data at a link-time address, relative to the program counter. Surprisingly, Windows not only accesses the data indirectly, it also reloads the pointer to the data and its length from the import address table on every loop iteration, with optimizations enabled. This was tested with Visual Studio 2010 on Windows XP, so maybe things have changed, although I doubt they have.

Here are the results for Linux:

$ dd bs=1M count=16 if=/dev/urandom of=libdat.dat
$ xxd -i libdat.dat libdat.c
$ gcc -O3 -g -shared -fPIC libdat.c -o libdat.so
$ gcc -O3 -g -no-pie dat.c -L. -ldat -o dat
$ LD_LIBRARY_PATH=. ./dat
local          =          0x1601060
libdat_dat     =           0x601040
libdat_dat_len =           0x601020
dirty=      461us write=    12184us retry=      456us
$ nm dat
[...]
0000000000601040 B libdat_dat
0000000000601020 B libdat_dat_len
0000000001601060 B local
[...]
$ objdump -d -j.text dat
[...]
  400693:   8b 35 87 09 20 00       mov    0x200987(%rip),%esi        # 601020 <libdat_dat_len>
[...]
  4006a3:   31 c0                   xor    %eax,%eax                  # zero loop counter
  4006a5:   48 8d 15 94 09 20 00    lea    0x200994(%rip),%rdx        # 601040 <libdat_dat>
  4006ac:   0f 1f 40 00             nopl   0x0(%rax)                  # align loop for efficiency
  4006b0:   89 c1                   mov    %eax,%ecx                  # store data offset in ecx
  4006b2:   05 00 10 00 00          add    $0x1000,%eax               # add PAGESIZE to data offset
  4006b7:   c6 04 0a 00             movb   $0x0,(%rdx,%rcx,1)         # write a zero byte to data
  4006bb:   39 f0                   cmp    %esi,%eax                  # test loop condition
  4006bd:   72 f1                   jb     4006b0 <main+0x30>         # continue loop if data is left
[...]

Here are the results for Windows:

$ cl /Ox /Zi /LD libdat.c /link /EXPORT:libdat_dat /EXPORT:libdat_dat_len
[...]
$ cl /Ox /Zi dat.c libdat.lib
[...]
$ dat.exe # note low resolution timer means retry is too small to measure
local          =           0041EEA0
libdat_dat     =           1000E000
libdat_dat_len =           1100E000
dirty=    20312us write=     3125us retry=        0us
$ dumpbin /symbols dat.exe
[...]
        9000 .data
        1000 .idata
        5000 .rdata
        1000 .reloc
       17000 .text
[...]
$ dumpbin /disasm dat.exe
[...]
  004010BA: 33 C0              xor         eax,eax # zero loop counter
[...]
  004010C0: 8B 15 8C 63 42 00  mov         edx,dword ptr [__imp__libdat_dat] # store data pointer in edx
  004010C6: C6 04 02 00        mov         byte ptr [edx+eax],0 # write a zero byte to data
  004010CA: 8B 0D 88 63 42 00  mov         ecx,dword ptr [__imp__libdat_dat_len] # store data length in ecx
  004010D0: 05 00 10 00 00     add         eax,1000h # add PAGESIZE to data offset
  004010D5: 3B 01              cmp         eax,dword ptr [ecx] # test loop condition
  004010D7: 72 E7              jb          004010C0 # continue loop if data is left
[...]

Here is the source code used for both tests:

#include <stdio.h>
#ifdef _WIN32
#include <windows.h>

typedef FILETIME time_l;

time_l time_get(void) {
    FILETIME ret; GetSystemTimeAsFileTime(&ret); return ret;
}

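long long int time_diff(time_l const *c1, time_l const *c2) {
    /* FILETIME counts 100 ns intervals split into two 32-bit halves:
       recombine each into 64 bits, subtract, then convert to microseconds. */
    ULARGE_INTEGER a, b;
    a.u.LowPart = c1->dwLowDateTime; a.u.HighPart = c1->dwHighDateTime;
    b.u.LowPart = c2->dwLowDateTime; b.u.HighPart = c2->dwHighDateTime;
    return (long long int)(b.QuadPart - a.QuadPart) / 10;
}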
#else
#include <unistd.h>
#include <time.h>
#include <stdlib.h>

typedef struct timespec time_l;

time_l time_get(void) {
    time_l ret; clock_gettime(CLOCK_MONOTONIC, &ret); return ret;
}

long long int time_diff(time_l const *c1, time_l const *c2) {
    return 1LL*c2->tv_nsec/1000-c1->tv_nsec/1000+c2->tv_sec*1000000-c1->tv_sec*1000000;
}
#endif

#ifndef PAGESIZE
#define PAGESIZE 4096
#endif

#ifdef _WIN32
#define DLLIMPORT __declspec(dllimport)
#else
#define DLLIMPORT
#endif

extern DLLIMPORT unsigned char volatile libdat_dat[];
extern DLLIMPORT unsigned int libdat_dat_len;
unsigned int local[4096];

int main(void) {
    unsigned int i;
    time_l t1, t2, t3, t4;
    long long int d1, d2, d3;

    t1 = time_get();

    for(i=0; i < libdat_dat_len; i+=PAGESIZE) {
        libdat_dat[i] = 0;
    }

    t2 = time_get();

    for(i=0; i < libdat_dat_len; i++) {
        libdat_dat[i] = 0xFF;
    }

    t3 = time_get();

    for(i=0; i < libdat_dat_len; i+=PAGESIZE) {
        libdat_dat[i] = 0;
    }

    t4 = time_get();

    d1 = time_diff(&t1, &t2);
    d2 = time_diff(&t2, &t3);
    d3 = time_diff(&t3, &t4);

    /* %p expects void *, so cast each address explicitly. */
    printf("%-15s= %18p\n%-15s= %18p\n%-15s= %18p\n", "local", (void *)local, "libdat_dat", (void *)libdat_dat, "libdat_dat_len", (void *)&libdat_dat_len);
    printf("dirty=%9lldus write=%9lldus retry=%9lldus\n", d1, d2, d3);

    return 0;
}

I sincerely hope someone else benefits from my research. Thanks for reading!

The Windows design means that if two modules in the process link to the same DLL, the DLL data is shared. For example, A.DLL and B.DLL both use LIBC.DLL. A sets errno = 3. B reads errno and gets 3. The Linux version gives A.DLL and B.DLL their own separate copies of errno. — Raymond Chen, Nov 10 '18 at 14:28
That is exactly the kind of difference I was looking for but didn't find. I actually looked for a way to contact you or give you this topic as a suggestion before posting this but I couldn't find a way. Thanks for responding! — J Alan, Nov 10 '18 at 14:37
If a function in a Linux dynamic library wants to access its own global variable, how does it do it? Suppose your library also has a function `reset_libdat` that sets `libdat_dat` back to zero. There are multiple copies of that `libdat_dat` (one in each client). How does it know which one to reset? Does it reset all of them? How can it find all of them? — Raymond Chen, Nov 10 '18 at 15:26
I checked the disassembly, and the function (`reset_libdat`) computes the address of the module local copy using program counter relative addressing. The disassembly is commented with ` `. I don't know how this works, but some specification is used to ensure that it does. I'll try looking into it more later, although this would make a good new question. — J Alan, Nov 10 '18 at 16:03
Does this also mean (on Linux) that if A and B both link to C, and then C links to D, then A and B each get a separate copy of C's data, but they share a copy of D's (via the shared copy of C's module). The Windows model is that a program and all its DLLs act like they had all been statically linked into one giant program. — Raymond Chen, Nov 10 '18 at 16:24
@RaymondChen That is not correct. Of course both get the same copy of `errno`! Shared libraries access global variables through the GOT table which is very much like the structure on Windows. — fuz, Nov 10 '18 at 21:22
But that's not what this answer says. This answer says that the data is copied into each client, and the asm supports it. The client gets its own private copy of `libdat_dat`. — Raymond Chen, Nov 10 '18 at 21:35
@fuz is correct. I tested your ABCD example, and all libraries share a single copy of all data, which is stored in the `.bss` segment of the executable. The executable accesses it directly. Every dynamic library accesses it indirectly through a GOT specific to each library. The dynamic linker initializes the GOT of every dynamic library by creating a private memory map in the process at the address the dynamic library GOT is loaded at, then writing the addresses of the actual data in the `.bss` segment to it. — J Alan, Nov 11 '18 at 03:53
@JAlan Aha, so the sentence "The linker finds the size of the object and reserves space for it in the `.bss` or the `.rdata` segment of the module." was incorrect. It is stored in the `.bss`/`.rdata` of the **executable**. The test didn't demonstrate this distinction because the importing module **was** the executable. — Raymond Chen, Nov 11 '18 at 15:52
@RaymondChen That is correct. I fixed the answer. When I originally wrote it I didn't understand that dynamic libraries shared data, and that this convention is important. — J Alan, Nov 12 '18 at 01:31
