2
/*
Low Level I/O - Read and Write
Chapter 8 - The C Programming Language - K&R
Header file in the original code is "syscalls.h"
Also BUFSIZ is supposed to be defined in the same header file   
*/

#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define BUFSIZ 1

int main()  /* copy input to output */
{
    char buf[BUFSIZ];
    int n;

    while ((n = read(0, buf, BUFSIZ)) > 0)
        write(1, buf, n);

    return 0;
}

When I feed "∂∑∑®†¥¥¥˚π∆˜˜∫∫√ç tu 886661~EOF" as input, the same text is copied to the output. How are so many non-ASCII characters stored at the same time?

BUFSIZ is the number of bytes to be transferred. How does BUFSIZ limit the transfer if, for any value, anything can be copied from input to output?

How does char buf[BUFSIZ] store non-ASCII characters?

A.H.
  • Non-ASCII characters are today usually encoded as UTF-8, so a single character could be encoded by *several* bytes whose upper bit is set. – Basile Starynkevitch Jul 29 '12 at 07:47
  • 1
    Normally, BUFSIZ is defined in `<stdio.h>` and is typically a power of two from 512 upwards. In this context it is legitimate (but unusual) to define it as 1. The code you show doesn't need `<sys/types.h>` or `<sys/uio.h>`; `<unistd.h>` is sufficient. – Jonathan Leffler Jul 29 '12 at 08:55

3 Answers

3

You read in small chunks until EOF:

while ((n = read(0, buf, BUFSIZ)) > 0)

That's why. You literally copy the input to the output byte by byte. Converting the bytes back to Unicode characters is the console's job, not yours. I guess it does not display anything until it can recognize the data as a complete symbol.
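
The loop described above can be sketched as a function that also counts how many read calls it takes. This is only an illustration: the name `copy_counting_reads` and its parameters are hypothetical and not part of the question's code, and it treats a short write as an error to stay brief.

```c
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy everything from in_fd to out_fd through a caller-supplied buffer.
 * Returns the total number of bytes copied, or -1 on error.
 * *reads receives the number of read() calls that returned data.
 * (Hypothetical helper, written for illustration only.) */
long copy_counting_reads(int in_fd, int out_fd, char *buf, size_t bufsize,
                         long *reads)
{
    long total = 0;
    ssize_t n;

    *reads = 0;
    while ((n = read(in_fd, buf, bufsize)) > 0) {
        /* read() never stores more than bufsize bytes, so n <= bufsize */
        if (write(out_fd, buf, (size_t)n) != n)
            return -1;      /* short or failed write: give up in this sketch */
        total += n;
        (*reads)++;
    }
    return n < 0 ? -1 : total;
}
```

With a 1-byte buffer, five bytes of input take five successful reads, and the same five bytes come out in the same order. A larger buffer only reduces the number of calls; it does not change what is copied.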

KAction
  • I'm still not very clear on how it reads so many characters. "∂∑∑®†¥¥¥˚π∆˜˜∫∫√ç tu 886661~EOF" has many characters and BUFSIZ specifies only a 1-byte transfer. Then how are they all accommodated in the array buf with 1 byte of storage? –  Jul 29 '12 at 07:58
  • 2
    Read carefully. You take only 1 byte at a time, but you do it as many times as needed. Imagine you have a box of apples. It is very heavy, so you take as many as you can carry, move them to the kitchen, and return to the box. Rinse and repeat until the box is empty. The number of apples you can carry is the buffer size. It may be small, but sooner or later you will move all the apples. – KAction Jul 29 '12 at 08:37
  • 1
    For example, the character `∂` (U+2202) is represented in UTF-8 as 3 octets (bytes): `0xe2 0x88 0x82`. Your program reads those bytes one at a time from standard input, then writes them one at a time to standard output. Your terminal emulator then re-assembles the three bytes into a single `∂` character and displays it. – Keith Thompson Jul 29 '12 at 09:09
  • @illusionoflife: So you are saying it's reading 1 byte from the string and writing it until EOF? –  Jul 29 '12 at 13:39
  • `writing it until`: not "it". On each iteration of the while loop, a new byte is read and written. Just an English wording correction. – KAction Jul 30 '12 at 05:20
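
The byte counts mentioned in the comments above can be checked with a small C sketch. It assumes valid UTF-8 input, and the helper name `utf8_code_points` is made up for this illustration. It relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx, so every other byte starts a new character.

```c
#include <stddef.h>

/* Count UTF-8 code points in a NUL-terminated string.
 * Every byte that is not a continuation byte (10xxxxxx) begins a new
 * code point. Assumes the input is valid UTF-8.
 * (Hypothetical helper, written for illustration only.) */
size_t utf8_code_points(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the string `"\xe2\x88\x82"` (the character `∂`), `strlen` reports 3 bytes while `utf8_code_points` reports 1 character. The copy loop in the question only ever sees the 3 bytes.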
0

Since you are calling read in a loop until end of file is reached or an error is encountered, you get precisely 1 byte in buf after each call to read. That byte is then written out via the write system call. The read system call is guaranteed to read no more than the count specified in its last argument. If you passed 10, for example, while buf still holds only 1 byte, read would go ahead and try to store the data beyond the array bounds.

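As for the characters you fed in: whether they are single-byte extended ASCII characters (codes 128-255) or, more likely on a modern system, multi-byte UTF-8 sequences, each one arrives as bytes with values in the range 128-255, and every such byte fits in a char, so there is no problem here.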

Maksim Skurydzin
0

When you call read on standard input, you are reading from a pipe that is bound to a terminal or to another program. Of course, there are one or more buffers between the writer (the terminal or other program) and your program. When the buffer is empty, the reader (your program) blocks in read. When the buffer is full, the writer (the terminal, etc.) blocks in write.

When you write to standard output, you are writing to a pipe that is bound to a terminal or to another program.

So if your program is run by the shell from a terminal, then your program's input and output are bound to the (pseudo)terminal. A (pseudo)terminal is a program that converts the user's key presses into characters and converts encoded strings (ISO 8859-1, UTF-8, etc.) into symbols on the screen.

  1. Characters are held in the terminal program until you press EOF or EOL. This is the terminal's canonical mode. After you press Enter, the bytes are written to the pipe bound to your program.
  2. BUFSIZ is the number of bytes that you try to read from the input in one operation. The return value n is the number of bytes actually read when the operation completes. So BUFSIZ is the maximum number of bytes your program can read from the pipe in a single read.
  3. char buf[BUFSIZ] is an array of bytes (not characters of some charset), so it can hold any values (including non-printable ones and even zero).
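
Points 2 and 3 apply on the output side as well: write also returns the number of bytes it actually transferred, and it treats the buffer as raw bytes. A minimal sketch of a loop that keeps writing until everything is out might look like this. The name `write_all` is hypothetical and not something the question's code uses.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Write exactly len bytes from buf to fd, retrying after short writes.
 * Returns 0 on success, -1 on error (errno is left as set by write()).
 * (Hypothetical helper, written for illustration only.) */
int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted before any byte was written */
            return -1;
        }
        buf += n;           /* skip the bytes already written */
        len -= (size_t)n;
    }
    return 0;
}
```

Because the length is passed explicitly, the buffer may contain zero bytes or bytes above 127. Nothing here interprets them as text.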
CrownedEagle
Dmitry Poroh