What actually happens after calling read:

n = read(fd, buf, try_read_size);

Here, fd is a TCP socket descriptor, buf is the buffer, and try_read_size is the number of bytes the program tries to read.

I guess this eventually invokes a system call into the kernel. Could anyone provide some details, for example pointers to the implementation in glibc or in the kernel source?

ericzma
  • Kernel details are way too complex for a short SO answer. And BTW, why don't you check it yourself? – Karoly Horvath Apr 19 '12 at 10:38
  • @KarolyHorvath I tried but got totally lost. Any direction or suggestion for getting at the details? I would highly appreciate it. – ericzma Apr 19 '12 at 10:46
  • Read fs/read_write.c from the Linux kernel source and see what's happening. – strkol Apr 19 '12 at 10:50
  • That's just the generic read function, isn't it? The TCP-related stuff must be in net/ somewhere. – Karoly Horvath Apr 19 '12 at 10:54
  • @strkol Thanks! I found "SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)" ( http://lxr.linux.no/#linux+v3.3.2/fs/read_write.c#L460 ). Diving into the code from there. – ericzma Apr 19 '12 at 10:58

1 Answer

From a high-level perspective, this is what happens:

  • A wrapper function provided by glibc is called
  • The wrapper function moves the arguments, which were passed on the stack, into registers and puts the syscall number in the register dedicated to that purpose (e.g. EAX on x86)
  • The wrapper function executes a trap or equivalent instruction (e.g. SYSENTER)
  • The CPU switches to ring 0, and the trap handler is invoked
  • The trap handler checks the syscall number for validity and looks it up in a jump table of kernel functions
  • The respective kernel function checks whether the arguments are valid (e.g. that the range buf to buf+try_read_size refers to accessible memory pages, and that fd really is a file descriptor). If something is amiss, a negative error code (e.g. -EFAULT) is generated, the CPU is switched back to user mode and the call returns to the wrapper.
  • Another function is called depending on the file descriptor's type (in your case a socket, but one could read from a block device or a proc entry or something more exotic)
  • The socket's input buffer is checked:
    • If there is some data in the buffer, min(available, try_read_size) bytes are copied to buf, that count is written to the return code register (EAX on x86), the CPU is switched back to user mode and the call returns to the wrapper.
    • If the input buffer is empty:
      • If the connection has been closed, zero is written to the return code register, the CPU is switched back to user mode and the call returns to the wrapper.
      • If the connection has not been closed:
        • If the socket is nonblocking, a negative error code (-EAGAIN) is written to the return code register, the CPU is switched back to user mode and the call returns to the wrapper.
        • If the socket is blocking, the process is suspended until data arrives, the connection is closed or an error occurs on the socket; the input buffer is then checked again.
  • The wrapper function checks whether the return value is negative (an error).
    • If it is zero or positive, the wrapper returns the value unchanged.
    • If it is negative, the wrapper sets errno to the negated value (so a positive error code is reported) and returns -1.
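
The outcomes listed above can be observed from user space. What follows is only a minimal sketch, assuming Linux with glibc. The names raw_read and demo_read_outcomes are hypothetical, and an AF_UNIX stream socket pair stands in for a TCP connection (an assumption, made so that no network setup is needed; the three outcomes exercised here are the same for stream sockets). raw_read goes through glibc's generic syscall() function rather than the dedicated read wrapper; syscall() applies the same errno convention.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for the read wrapper: issue the system call by number.
 * syscall() turns a negative kernel result into -1 and sets errno. */
static ssize_t raw_read(int fd, void *buf, size_t count)
{
    return (ssize_t)syscall(SYS_read, fd, buf, count);
}

/* Walks through the cases from the list; each assert states the expected result. */
static void demo_read_outcomes(void)
{
    int sv[2];
    char buf[16];
    ssize_t n;

    int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    assert(rc == 0);
    rc = fcntl(sv[0], F_SETFL, O_NONBLOCK);   /* make the reading end nonblocking */
    assert(rc == 0);

    /* Argument check fails: -1 is not a file descriptor. */
    errno = 0;
    n = raw_read(-1, buf, sizeof buf);
    assert(n == -1 && errno == EBADF);

    /* Input buffer empty, peer still open, nonblocking socket. */
    errno = 0;
    n = raw_read(sv[0], buf, sizeof buf);
    assert(n == -1 && errno == EAGAIN);

    /* Data available: min(available, requested) bytes are copied. */
    ssize_t w = write(sv[1], "hello", 5);
    assert(w == 5);
    n = raw_read(sv[0], buf, 3);              /* 5 available, 3 requested */
    assert(n == 3 && memcmp(buf, "hel", 3) == 0);
    n = raw_read(sv[0], buf, sizeof buf);     /* 2 available, 16 requested */
    assert(n == 2 && memcmp(buf, "lo", 2) == 0);

    /* Input buffer empty and peer closed: end of stream. */
    close(sv[1]);
    n = raw_read(sv[0], buf, sizeof buf);
    assert(n == 0);

    close(sv[0]);
}
```

Calling demo_read_outcomes() from main runs all the checks; a failed assertion would mean the platform behaves differently from the description. The blocking case is left out on purpose, since it would need a second thread or process to wake the reader.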
Damon
  • +1, very good. If the process is suspended, it blocks until at least one byte of data or a FIN is received, or an error is posted on the socket; then the outermost bullet point containing this step restarts. – user207421 Apr 22 '12 at 23:14