Is it possible to "force" UTF-8 in a C program?

Question

Usually when I want my program to use UTF-8 encoding, I write setlocale (LC_ALL, "");. But today I found that this just sets the locale to the environment's default locale, and I can't know whether the environment uses UTF-8 by default.

I wonder whether there is any way to force the character encoding to be UTF-8. Also, is there any way to check whether my program is using UTF-8?

You may change the locale, but the encoding in locale is just a hint for the user's terminal encoding. — Alastair McCormack, Mar 23 '16 at 19:35
Could you elaborate a bit and mention what kind of problems your application would encounter if given non-UTF-8 input or if output is not to a UTF-8 device or file? I could amend my answer to show how to fix such issues *properly*, without proverbially slapping us non-English-speaking-non-USians (`en_US.UTF-8` etc.) in the face. (The fact that some here are content to just offer advice on how "hard" you should slap, is disheartening.) — Nominal Animal, Mar 23 '16 at 20:32
@dan04 Both. In fact, I'm looking for a generic way to "standardise" character encoding. — nalzok, Mar 29 '16 at 01:28

Nominal Animal · Accepted Answer · 2016-03-24T10:23:49.760

It is possible, but it is the completely wrong thing to do.

First of all, the current locale is for the user to decide. It is not just the character set, but also the language, date and time formats, and so on. Your program has absolutely no "right" to mess with it.

If you cannot localize your program, just tell the user the environmental requirements your program has, and let them worry about it.

Really, you should not rely on UTF-8 being the current encoding at all, but use wide character support, including functions like wctype(), mbstowcs(), and so on. POSIXy systems also provide the iconv_open() and iconv() family of functions in their C libraries to convert between encodings (which should always include conversion to and from wchar_t); on Windows, you need a separate libiconv library. This is how, for example, the GCC compiler handles different character sets. (Internally, it uses Unicode/UTF-8, but if you ask it to, it can do the necessary conversions to work with other character sets.)
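As a minimal sketch of the wide-character route, the helper below converts a narrow (multibyte) string in the current locale's encoding into a newly allocated wide string with mbstowcs(). The name widen() and its error convention are assumptions made for illustration; it is not a library function. It relies on the POSIX behavior that mbstowcs() with a NULL destination only counts the required wide characters.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper (not a library function): convert a multibyte
 * string, encoded according to the current LC_CTYPE locale, into a
 * dynamically allocated wide string. Returns NULL if the input is NULL,
 * contains an invalid multibyte sequence, or memory runs out.
 * The caller must free() the result. */
wchar_t *widen(const char *src)
{
    size_t   len;
    wchar_t *dst;

    if (src == NULL)
        return NULL;

    /* With a NULL destination, mbstowcs() only counts the wide
     * characters needed (excluding the terminating L'\0'). */
    len = mbstowcs(NULL, src, 0);
    if (len == (size_t)-1)
        return NULL;            /* Invalid sequence for this locale. */

    dst = malloc((len + 1) * sizeof dst[0]);
    if (dst == NULL)
        return NULL;

    if (mbstowcs(dst, src, len + 1) == (size_t)-1) {
        free(dst);
        return NULL;
    }
    dst[len] = L'\0';
    return dst;
}
```

Call setlocale(LC_ALL, "") once at program start so the conversion follows the user's encoding; after that, something like widen(argv[1]) yields a wide string that the wide character functions can process regardless of whether the user's locale is UTF-8.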

I am personally a strong proponent of using UTF-8 everywhere, but overriding the user locale in a program is horrific. Abominable. Distasteful; like a desktop applet changing the display resolution because the programmer is particularly fond of a certain one.

I would be happy to write some example code to show how to correctly solve any character-set-sensitive situation, but there are so many that I don't know where to start.

If the OP amends their question to state exactly what problem overriding the character set is supposed to solve, I'm willing to show how to use the aforementioned utilities and POSIX facilities (or equivalent freely available libraries on Windows) to solve it correctly.

If this seems harsh to someone, it is, but only because taking the easy and simple route here (overriding the user's locale setting) is so ... wrong, purely on technical grounds. Even no action is better, and actually quite acceptable, as long as you just document that your application only handles UTF-8 input/output.

Example 1. Localized Happy New Year!

#include <stdlib.h>
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* We wish to use the user's current locale. */
    setlocale(LC_ALL, "");

    /* We intend to use wide functions on standard output. */
    fwide(stdout, 1);

    /* For Windows compatibility, print out a Byte Order Mark.
     * If you save the output to a file, this helps tell Windows
     * applications that the file is Unicode.
     * Other systems don't need it nor use it.
    */
    fputwc(L'\uFEFF', stdout);

    wprintf(L"Happy New Year!\n");
    wprintf(L"С новым годом!\n");
    wprintf(L"新年好！\n");
    wprintf(L"賀正！\n");
    wprintf(L"¡Feliz año nuevo!\n");
    wprintf(L"Hyvää uutta vuotta!\n");

    return EXIT_SUCCESS;
}

Note that wprintf() takes a wide string (wide string constants are of form L"", wide character constants L'', as opposed to normal/narrow counterparts "" and ''). Formats are still the same; %s prints a normal/narrow string, and %ls a wide string.
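To make the %s versus %ls distinction concrete, here is a small sketch that formats into a wide buffer with swprintf(), which uses the same conversion rules as wprintf(). The helper name format_greeting() is hypothetical and exists only for this illustration.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: build "<greeting>, <name>!" into a wide buffer.
 * 'greeting' is a wide string (%ls); 'name' is a narrow string (%s),
 * which the wide printf family converts using the current locale.
 * Returns the number of wide characters written, or a negative value
 * on error (including when the buffer is too small). */
int format_greeting(wchar_t *buf, size_t n,
                    const wchar_t *greeting, const char *name)
{
    return swprintf(buf, n, L"%ls, %s!", greeting, name);
}
```

Unlike snprintf(), swprintf() takes the buffer size in wide characters rather than bytes, and reports truncation as an error instead of returning the length that would have been needed.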

Example 2. Reading input lines from standard input, and optionally saving them to a file. The file name is supplied on the command line.

#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include <wctype.h>
#include <wchar.h>
#include <errno.h>
#include <stdio.h>

typedef enum {
    TRIM_LEFT     = 1,      /* Remove leading whitespace and control characters */
    TRIM_RIGHT    = 2,      /* Remove trailing whitespace and control characters */
    TRIM_NEWLINE  = 4,      /* Remove newline at end of line */
    TRIM          = 7,      /* Remove leading and trailing whitespace and control characters */
    OMIT_NUL      = 8,      /* Skip NUL characters (embedded binary zeros, L'\0') */
    OMIT_CONTROLS = 16,     /* Skip control characters */
    CLEANUP       = 31,     /* All of the above. */
    COMBINE_LWS   = 32,     /* Combine all whitespace into a single space */
} trim_opts;


/* Read an unlimited-length line from a wide input stream.
 *
 * This function takes a pointer to a wide string pointer,
 * pointer to the number of wide characters dynamically allocated for it,
 * the stream to read from, and a set of options on how to treat the line.
 *
 * If an error occurs, this will return 0 with errno set to nonzero error number.
 * Use strerror(errno) to obtain the error description (as a narrow string).
 *
 * If there is no more data to read from the stream,
 * this will return 0 with errno 0, and feof(stream) will return true.
 *
 * If an empty line is read,
 * this will return 0 with errno 0, but feof(stream) will return false.
 *
 * Typically, you initialize variables like
 *      wchar_t *line = NULL;
 *      size_t   size = 0;
 * before calling this function, so that subsequent calls reuse the same
 * dynamically allocated buffer for the line, and it is automatically grown if necessary.
 * There are no built-in limits to line lengths this way.
*/
size_t getwline(wchar_t **const lineptr,
                size_t   *const sizeptr,
                FILE     *const in,
                trim_opts const trimming)
{
    wchar_t *line;
    size_t   size;
    size_t   used = 0;
    wint_t   wc;
    fpos_t   startpos;
    int      seekable;

    if (lineptr == NULL || sizeptr == NULL || in == NULL) {
        errno = EINVAL;
        return 0;
    }

    if (*lineptr != NULL) {
        line = *lineptr;
        size = *sizeptr;
    } else {
        line = NULL;
        size = 0;
        *sizeptr = 0;
    }

    /* In error cases, we can try and get back to this position
     * in the input stream, as we cannot really return the data
     * read thus far. However, some streams like pipes are not seekable,
     * so in those cases we should not even try.
     * Use (seekable) as a flag to remember if we should try.
    */
    if (fgetpos(in, &startpos) == 0)
        seekable = 1;
    else
        seekable = 0;

    while (1) {

        /* When we read a wide character from a wide stream,
         * fgetwc() will return WEOF with errno set if an error occurs.
         * However, fgetwc() will return WEOF with errno *unchanged*
         * if there is no more input in the stream.
         * To detect which of the two happened, we need to clear errno
         * first.
        */
        errno = 0;
        wc = fgetwc(in);
        if (wc == WEOF) {
            if (errno) {
                const int saved_errno = errno;
                if (seekable)
                    fsetpos(in, &startpos);
                errno = saved_errno;
                return 0;
            }
            if (ferror(in)) {
                if (seekable)
                    fsetpos(in, &startpos);
                errno = EIO;
                return 0;
            }
            break;
        }

        /* Dynamically grow line buffer if necessary.
         * We need room for the current wide character,
         * plus at least the end-of-string mark, L'\0'.
        */
        if (used + 2 > size) {
            /* Size policy. This can be anything you see fit,
             * as long as it yields size >= used + 2.
             *
             * This one increments size to next multiple of
             * 1024 (minus 16). It works well in practice,
             * but do not think of it as the "best" way.
             * It is just a robust choice.
            */
            size = (used | 1023) + 1009;
            line = realloc(line, size * sizeof line[0]);
            if (!line) {
                /* Memory allocation failed. */
                if (seekable)
                    fsetpos(in, &startpos);
                errno = ENOMEM;
                return 0;
            }
            *lineptr = line;
            *sizeptr = size;
        }

        /* Append character to buffer. */
        if (!trimming)
            line[used++] = wc;
        else {
            /* Check if we have reasons to NOT add the character to buffer. */
            do {
                /* Omit NUL if asked to. */
                if (trimming & OMIT_NUL)
                    if (wc == L'\0')
                        break;

                /* Omit controls if asked to. */
                if (trimming & OMIT_CONTROLS)
                    if (iswcntrl(wc))
                        break;

                /* If we are at start of line, and we are left-trimming,
                 * only graphs (printable non-whitespace characters) are added. */
                if ((trimming & TRIM_LEFT) && used == 0)
                    if (wc == L'\0' || !iswgraph(wc))
                        break;

                /* Combine whitespaces if asked to. */
                if (trimming & COMBINE_LWS)
                    if (iswspace(wc)) {
                        if (used > 0 && line[used-1] == L' ')
                            break;
                        else
                            wc = L' ';
                    }

                /* Okay, add the character to buffer. */
                line[used++] = wc;

            } while (0);
        }

        /* End of the line? */
        if (wc == L'\n')
            break;
    }

    /* The above loop will only end (break out)
     * if end of line or end of input was found,
     * and no error occurred.
    */

    /* Trim right if asked to. */
    if (trimming & TRIM_RIGHT)
        while (used > 0 && iswspace(line[used-1]))
            --used;
    else
    if (trimming & TRIM_NEWLINE)
        while (used > 0 && (line[used-1] == L'\r' || line[used-1] == L'\n'))
            --used;

    /* Ensure we have room for end-of-string L'\0'. */
    if (used >= size) {
        size = used + 1;
        line = realloc(line, size * sizeof line[0]);
        if (!line) {
            if (seekable)
                fsetpos(in, &startpos);
            errno = ENOMEM;
            return 0;
        }
        *lineptr = line;
        *sizeptr = size;
    }

    /* Add end of string mark. */
    line[used] = L'\0';

    /* Successful return. */
    errno = 0;
    return used;
}

/* Counts the number of wide characters in 'alpha' class.
*/
size_t count_letters(const wchar_t *ws)
{
    size_t count = 0;
    if (ws)
        while (*ws != L'\0')
            if (iswalpha(*(ws++)))
                count++;
    return count;
}

int main(int argc, char *argv[])
{
    FILE    *out;

    wchar_t *line = NULL;
    size_t   size = 0;
    size_t   len;

    setlocale(LC_ALL, "");

    /* Standard input and output should use wide characters. */
    fwide(stdin, 1);
    fwide(stdout, 1);

    /* Check if the user asked for help. */
    if (argc < 2 || argc > 3 || strcmp(argv[1], "-h") == 0 || strcmp(argv[1], "--help") == 0 || strcmp(argv[1], "/?") == 0) {
        fprintf(stderr, "\n");
        fprintf(stderr, "Usage: %s [ -h | --help | /? ]\n", argv[0]);
        fprintf(stderr, "       %s FILENAME [ PROMPT ]\n", argv[0]);
        fprintf(stderr, "\n");
        fprintf(stderr, "The program will read input lines until an only '.' is supplied.\n");
        fprintf(stderr, "If you do not want to save the output to a file,\n");
        fprintf(stderr, "use '-' as the FILENAME.\n");
        fprintf(stderr, "\n");
        return EXIT_SUCCESS;
    }

    /* Open file for output, unless it is "-". */
    if (strcmp(argv[1], "-") == 0)
        out = NULL; /* No output to file */
    else {
        out = fopen(argv[1], "w");
        if (out == NULL) {
            fprintf(stderr, "%s: %s.\n", argv[1], strerror(errno));
            return EXIT_FAILURE;
        }

        /* The output file is used with wide strings. */
        fwide(out, 1);
    }

    while (1) {

        /* Prompt? Note: our prompt string is narrow, but stdout is wide. */
        if (argc > 2) {
            wprintf(L"%s\n", argv[2]);
            fflush(stdout);
        }

        len = getwline(&line, &size, stdin, CLEANUP);
        if (len == 0) {
            if (errno) {
                fprintf(stderr, "Error reading standard input: %s.\n", strerror(errno));
                break;
            }
            if (feof(stdin))
                break;
        }

        /* The user does not wish to supply more lines? */
        if (wcscmp(line, L".") == 0)
            break;

        /* Print the line to the file. */
        if (out != NULL) {
            fputws(line, out);
            fputwc(L'\n', out);
        }

        /* Tell the user what we read. */
        wprintf(L"Received %lu wide characters, %lu of which were letterlike.\n",
                (unsigned long)len, (unsigned long)count_letters(line));
        fflush(stdout);
    }

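    /* The line buffer is no longer needed, so we can discard it.
     * Note that free(NULL) is safe, so we do not need to check.
    */
    free(line);

    /* I personally also like to reset the variables.
     * It helps with debugging, and to avoid reuse-after-free() errors. */
    line = NULL;
    size = 0;

    /* Close the output file; a failure here means some data was not written. */
    if (out != NULL && fclose(out) != 0) {
        fprintf(stderr, "%s: %s.\n", argv[1], strerror(errno));
        return EXIT_FAILURE;
    }

    return EXIT_SUCCESS;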
}

The getwline() function above is pretty much at the most complicated end of functions you might need when dealing with localized wide character support. It allows you to read localized input lines without length restrictions, and optionally trims and cleans up (removing control codes and embedded binary zeros) the returned string. It also works fine with both LF and CR-LF (\n and \r\n) newline encodings.
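When a fixed maximum line length is acceptable, a much simpler sketch is possible with the standard fgetws(). The helpers strip_newline() and copy_lines() below are hypothetical names used only for this illustration; they assume both streams are already wide-oriented and that the locale was set with setlocale(LC_ALL, "").

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: remove trailing L'\n' and L'\r' characters,
 * so both LF and CR-LF line endings are handled.
 * Returns the new length of the string. */
size_t strip_newline(wchar_t *ws)
{
    size_t len = wcslen(ws);
    while (len > 0 && (ws[len-1] == L'\n' || ws[len-1] == L'\r'))
        ws[--len] = L'\0';
    return len;
}

/* Hypothetical helper: copy lines from 'in' to 'out' (both wide-oriented
 * streams) until end of input or a line consisting of only ".".
 * Lines longer than 1023 wide characters are split into several output
 * lines; this is the price of the fixed buffer.
 * Returns the number of chunks copied. */
size_t copy_lines(FILE *in, FILE *out)
{
    wchar_t buf[1024];
    size_t  count = 0;

    while (fgetws(buf, sizeof buf / sizeof buf[0], in) != NULL) {
        strip_newline(buf);
        if (wcscmp(buf, L".") == 0)
            break;
        fputws(buf, out);
        fputwc(L'\n', out);
        count++;
    }
    return count;
}
```

This trades the unlimited line length and cleanup options of getwline() for brevity; it does not distinguish read errors from end of input, so check ferror(in) afterwards if that matters.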

It'll be nice if you can add some examples! Can you give me the simplest example code that the program asks the user to input a string(maybe in Chinese or Japanese), and print it both on the screen and in a file? I know I can make use of functions in `wchar.h` to do this, but I want to know more about such character-set-sensible problems. — nalzok, Mar 23 '16 at 22:49

Myst · Answer 2 · 2016-03-23T17:02:35.743

Try:

setlocale(LC_ALL, "en_US.UTF-8");

You can run locale -a in the terminal to get a full list of locales supported by your system ("en_US.UTF-8" should be supported by most/all UTF-8 supporting systems).

EDIT 1 (alternate spelling)

In the comments, Lee points out that some systems have an alternate spelling, "en_US.utf8" (which surprised me, but we learn new stuff every day).

Since setlocale returns NULL when it fails, you can chain these calls:

if(!setlocale(LC_ALL, "en_US.UTF-8") && !setlocale(LC_ALL, "en_US.utf8"))
   printf("failed to set locale to UTF-8");
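The chaining idea generalizes to any number of spellings. The sketch below tries a caller-supplied list of locale names in order; first_working_locale() is a hypothetical helper name, and which names exist on a given system is something only locale -a can tell you.

```c
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: try each candidate locale name in order for the
 * given category, and return the first name that setlocale() accepts,
 * or NULL if none did. The candidate list is supplied by the caller. */
const char *first_working_locale(int category,
                                 const char *const names[], size_t count)
{
    size_t i;

    for (i = 0; i < count; i++)
        if (setlocale(category, names[i]) != NULL)
            return names[i];

    return NULL;
}
```

If you go this route, passing LC_CTYPE instead of LC_ALL changes only the character classification and encoding, leaving the user's language, number, and date formats alone. Candidates could be "en_US.UTF-8", "en_US.utf8", and, on systems that provide it (an assumption about your platform), "C.UTF-8".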

EDIT 2 (finding out if we're using UTF-8)

To find out if the locale is set to UTF-8 (after attempting to set it), you can either check the returned value (NULL means the call failed) or check the locale in use.

Option 1:

char * result;
if((result = setlocale (LC_ALL, "en_US.UTF-8")) == NULL)
   printf("failed to set locale to UTF-8");

Option 2:

setlocale (LC_ALL, "en_US.UTF-8"); // set
char * result = setlocale (LC_ALL, NULL); // review
if(!result || (!strstr(result, "UTF-8") && !strstr(result, "utf8"))) // accept both spellings
   printf("failed to set locale to UTF-8");
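Matching on the locale name is fragile, because a locale can use UTF-8 without saying so in its name. On POSIX systems, a hedged alternative is to ask for the codeset directly with nl_langinfo(CODESET). The helper names below are hypothetical, and the exact string returned for UTF-8 is assumed to be one of the two common spellings.

```c
#include <langinfo.h>
#include <stddef.h>
#include <strings.h>

/* Hypothetical helper: does this codeset name denote UTF-8?
 * Both "UTF-8" and "utf8" spellings are accepted, ignoring case. */
int codeset_is_utf8(const char *codeset)
{
    if (codeset == NULL)
        return 0;
    return strcasecmp(codeset, "UTF-8") == 0 ||
           strcasecmp(codeset, "UTF8") == 0;
}

/* Hypothetical helper: report whether the current LC_CTYPE locale
 * uses UTF-8. Call setlocale() before this; otherwise the program is
 * still in the default "C" locale. */
int current_locale_is_utf8(void)
{
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

This only inspects the locale; it works equally well after setlocale(LC_ALL, "") if all you want is to warn the user that their environment is not UTF-8.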

@LeeDanielCrocker - thanks. I edited my answer to reflect the alternate spelling. — Myst, Mar 23 '16 at 17:03

score 1 · Answer 3 · edited Oct 07 '21 at 11:32

This is not an answer, but a third, quite complex example, on how to use wide character I/O. This was too long to add to my actual answer to this question.

This example shows how to read and process CSV files (RFC-4180 format, optionally with limited backslash escape support) using wide strings.

The following code is CC0/public domain, so you are free to use it any way you like, even include in your own proprietary projects, but if it breaks anything, you get to keep all the bits and not complain to me. (I'll be happy to include any bug fixes if you find and report them in a comment below, though.)

The logic of the code is robust, however. In particular, it supports universal newlines, all four common newline types: Unix-like LF (\n), old CR LF (\r\n), old Mac CR (\r), and the occasionally encountered weird LF CR (\n\r). There are no built-in limitations wrt. the length of a field, the number of fields in a record, or the number of records in a file. It works very nicely if you need to convert CSV or process CSV input stream-like (field by field or record-by-record), without having to have more than one in memory at one point. If you want to construct structures to describe the records and fields in memory, you'll need to add some scaffolding code for that.
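The universal newline rule can be sketched on an in-memory wide string, separately from the stream handling. The helpers newline_length() and count_lines() are hypothetical names for this illustration and are not part of the CSV code; the assumed rule is that a CR and an LF next to each other, in either order, form a single newline.

```c
#include <wchar.h>

/* Hypothetical helper: return how many wide characters at the start of
 * 'ws' form one newline: 2 for CR LF or LF CR, 1 for a lone LF or CR,
 * and 0 if 'ws' does not start with a newline. */
size_t newline_length(const wchar_t *ws)
{
    if (ws[0] != L'\n' && ws[0] != L'\r')
        return 0;

    /* A second, *different* newline character belongs to the same
     * newline; an identical one starts a new (empty) line. */
    if ((ws[1] == L'\n' || ws[1] == L'\r') && ws[1] != ws[0])
        return 2;

    return 1;
}

/* Count lines in a wide string using the rule above.
 * A final line without a trailing newline is counted too. */
size_t count_lines(const wchar_t *ws)
{
    size_t lines = 0;

    while (*ws != L'\0') {
        size_t n = newline_length(ws);
        if (n > 0) {
            lines++;
            ws += n;
        } else {
            ws++;
            if (*ws == L'\0')
                lines++;    /* Unterminated last line. */
        }
    }
    return lines;
}
```

On a stream, the same decision needs one character of lookahead, which is why a reader has to fetch the character after a newline and push it back with ungetwc() when it is not the other half of a pair.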

Because of universal newline support, when reading input interactively, this program might require two consecutive end-of-inputs (Ctrl+Z in Windows and MS-DOS, Ctrl+D everywhere else), as the first one is usually "consumed" by the csv_next_field() or csv_skip_field() function, and the csv_next_record() function needs to read it again to actually detect it. However, you do not normally ask the user to input CSV data interactively, so this should be an acceptable quirk.

#include <stdlib.h>
#include <locale.h>
#include <string.h>
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>
#include <errno.h>

/* RFC-4180 -format CSV file processing using wide input streams.
 *
 * #define BACKSLASH_ESCAPES if you additionally wish to have
 * \\, \a, \b, \t, \n, \v, \f, \r, \", and \, de-escaped to their
 * C string equivalents when reading CSV fields.
*/

typedef enum {
    CSV_OK = 0,
    CSV_END = 1,
    CSV_INVALID_PARAMETERS = -1,
    CSV_FORMAT_ERROR = -2,
    CSV_CHARSET_ERROR = -3,
    CSV_READ_ERROR = -4,
    CSV_OUT_OF_MEMORY = -5,
} csv_status;

const char *csv_error(const csv_status code)
{
    switch (code) {
    case CSV_OK:                 return "No error";
    case CSV_END:                return "At end";
    case CSV_INVALID_PARAMETERS: return "Invalid parameters";
    case CSV_FORMAT_ERROR:       return "Bad CSV format";
    case CSV_CHARSET_ERROR:      return "Illegal character in CSV file (incorrect locale?)";
    case CSV_READ_ERROR:         return "Read error";
    case CSV_OUT_OF_MEMORY:      return "Out of memory";
    default:                     return "Unknown csv_status code"; 
    }
}

/* Start the next record. Automatically skips any remaining fields in current record.
 * Returns CSV_OK if successful, CSV_END if no more records, or a negative CSV_ error code. */
csv_status csv_next_record (FILE *const in);

/* Skip the next field. Returns CSV_OK if successful, CSV_END if no more fields in current record,
 * or a negative CSV_ error code. */
csv_status csv_skip_field  (FILE *const in);

/* Read the next field. Returns CSV_OK if successful, CSV_END if no more fields in current record,
 * or a negative CSV_ error code.
 * If this returns CSV_OK, then *dataptr is a dynamically allocated wide string to the field
 * contents, space allocated for *sizeptr wide characters; and if lengthptr is not NULL, then
 * *lengthptr is the number of wide characters in said wide string. */
csv_status csv_next_field  (FILE *const in, wchar_t **const dataptr,
                                            size_t   *const sizeptr,
                                            size_t   *const lengthptr);

static csv_status internal_skip_quoted(FILE *const in)
{
    while (1) {
        wint_t  wc;

        errno = 0;
        wc = fgetwc(in);

        if (wc == WEOF) {
            if (errno == EILSEQ)
                return CSV_CHARSET_ERROR;
            if (errno)
                return CSV_READ_ERROR;
            if (ferror(in)) {
                errno = EIO;
                return CSV_READ_ERROR;
            }
            errno = 0;
            return CSV_FORMAT_ERROR;
        }

        if (wc == L'"') {
            errno = 0;
            wc = fgetwc(in);            

            if (wc == L'"')
                continue;

            while (wc != WEOF && wc != L'\n' && wc != L'\r' && iswspace(wc)) {
                errno = 0;
                wc = fgetwc(in);
            }

            if (wc == WEOF) {
                if (errno == EILSEQ)
                    return CSV_CHARSET_ERROR;
                if (errno)
                    return CSV_READ_ERROR;
                if (ferror(in)) {
                    errno = EIO;
                    return CSV_READ_ERROR;
                }
                errno = 0;
                return CSV_END;
            }

            if (wc == L',') {
                errno = 0;
                return CSV_OK;
            }

            if (wc == L'\n' || wc == L'\r') {
                ungetwc(wc, in);
                errno = 0;
                return CSV_END;
            }

            ungetwc(wc, in);
            errno = 0;
            return CSV_FORMAT_ERROR;
        }

#ifdef BACKSLASH_ESCAPES
        if (wc == L'\\') {
            errno = 0;
            wc = fgetwc(in);

            if (wc == L'"')
                continue;

            if (wc == WEOF) {
                if (errno == EILSEQ)
                    return CSV_CHARSET_ERROR;
                if (errno)
                    return CSV_READ_ERROR;
                if (ferror(in)) {
                    errno = EIO;
                    return CSV_READ_ERROR;
                }
                errno = 0;
                return CSV_END;
            }
        }
#endif
    }
}

static csv_status internal_skip_unquoted(FILE *const in, wint_t wc)
{
    while (1) {

        if (wc == WEOF) {
            if (errno == EILSEQ)
                return CSV_CHARSET_ERROR;
            if (errno)
                return CSV_READ_ERROR;
            if (ferror(in)) {
                errno = EIO;
                return CSV_READ_ERROR;
            }
            errno = 0;
            return CSV_END;
        }

        if (wc == L',') {
            errno = 0;
            return CSV_OK;
        }

        if (wc == L'\n' || wc == L'\r') {
            ungetwc(wc, in);
            errno = 0;
            return CSV_END;
        }

#ifdef BACKSLASH_ESCAPES
        if (wc == L'\\') {
            errno = 0;
            wc = fgetwc(in);
            if (wc == WEOF) {
                if (errno == EILSEQ)
                    return CSV_CHARSET_ERROR;
                if (errno)
                    return CSV_READ_ERROR;
                if (ferror(in)) {
                    errno = EIO;
                    return CSV_READ_ERROR;
                }
                errno = 0;
                return CSV_END;
            }
        }
#endif

        errno = 0;
        wc = fgetwc(in);
    }
}

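csv_status csv_next_record(FILE *const in)
{
    /* Validate the stream, like csv_skip_field() and csv_next_field() do. */
    if (!in) {
        errno = EINVAL;
        return CSV_INVALID_PARAMETERS;
    } else
    if (ferror(in)) {
        errno = EIO;
        return CSV_READ_ERROR;
    }

    while (1) {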
        wint_t      wc;
        csv_status  status;

        do {
            errno = 0;
            wc = fgetwc(in);
        } while (wc != WEOF && wc != L'\n' && wc != L'\r' && iswspace(wc));

        if (wc == WEOF) {
            if (errno == EILSEQ)
                return CSV_CHARSET_ERROR;
            if (errno)
                return CSV_READ_ERROR;
            if (ferror(in)) {
                errno = EIO;
                return CSV_READ_ERROR;
            }
            errno = 0;
            return CSV_END;
        }

        if (wc == L'\n' || wc == L'\r') {
            wint_t next_wc;

            errno = 0;
            next_wc = fgetwc(in);

            if (next_wc == WEOF) {
                if (errno == EILSEQ)
                    return CSV_CHARSET_ERROR;
                if (errno)
                    return CSV_READ_ERROR;
                if (ferror(in)) {
                    errno = EIO;
                    return CSV_READ_ERROR;
                }
                errno = 0;
                return CSV_END;
            }

            if ((wc == L'\n' && next_wc == L'\r') ||
                (wc == L'\r' && next_wc == L'\n')) {
                errno = 0;
                return CSV_OK;
            }

            ungetwc(next_wc, in);
            errno = 0;
            return CSV_OK;
        }

        if (wc == L'"')
            status = internal_skip_quoted(in);
        else
            status = internal_skip_unquoted(in, wc);

        if (status < 0)
            return status;
    }
}

csv_status csv_skip_field(FILE *const in)
{
    wint_t  wc;

    if (!in) {
        errno = EINVAL;
        return CSV_INVALID_PARAMETERS;
    } else
    if (ferror(in)) {
        errno = EIO;
        return CSV_READ_ERROR;
    }

    /* Skip leading whitespace. */
    do {
        errno = 0;
        wc = fgetwc(in);
    } while (wc != WEOF && wc != L'\n' && wc != L'\r' && iswspace(wc));

    if (wc == L'"')
        return internal_skip_quoted(in);
    else
        return internal_skip_unquoted(in, wc);

}        

csv_status csv_next_field(FILE *const in, wchar_t **const dataptr,
                                          size_t   *const sizeptr,
                                          size_t   *const lengthptr)
{
    wchar_t *data;
    size_t   size;
    size_t   used = 0; /* length */
    wint_t   wc;

    if (lengthptr)
        *lengthptr = 0;

    if (!in || !dataptr || !sizeptr) {
        errno = EINVAL;
        return CSV_INVALID_PARAMETERS;
    } else
    if (ferror(in)) {
        errno = EIO;
        return CSV_READ_ERROR;
    }

    if (*dataptr) {
        data = *dataptr;
        size = *sizeptr;
    } else {
        data = NULL;
        size = 0;
        *sizeptr = 0;
    }

    /* Skip leading whitespace. */
    do {
        errno = 0;
        wc = fgetwc(in);
    } while (wc != WEOF && wc != L'\n' && wc != L'\r' && iswspace(wc));

    if (wc == WEOF) {
        if (errno == EILSEQ)
            return CSV_CHARSET_ERROR;
        if (errno)
            return CSV_READ_ERROR;
        if (ferror(in)) {
            errno = EIO;
            return CSV_READ_ERROR;
        }
        errno = 0;
        return CSV_END;
    }

    if (wc == L'\n' || wc == L'\r') {
        ungetwc(wc, in);
        errno = 0;
        return CSV_END;
    }

    if (wc == L'"')
        while (1) {

            errno = 0;
            wc = getwc(in);

            if (wc == WEOF) {
                if (errno == EILSEQ)
                    return CSV_CHARSET_ERROR;
                if (errno)
                    return CSV_READ_ERROR;
                if (ferror(in)) {
                    errno = EIO;
                    return CSV_READ_ERROR;
                }
                errno = 0;
                return CSV_FORMAT_ERROR;

            } else
            if (wc == L'"') {
                errno = 0;
                wc = getwc(in);

                if (wc != L'"') {
                    /* Not an escaped doublequote. */

                    while (wc != WEOF && wc != L'\n' && wc != L'\r' && iswspace(wc)) {
                        errno = 0;
                        wc = getwc(in);
                    }

                    if (wc == WEOF) {
                        if (errno == EILSEQ)
                            return CSV_CHARSET_ERROR;
                        if (errno)
                            return CSV_READ_ERROR;
                        if (ferror(in)) {
                            errno = EIO;
                            return CSV_READ_ERROR;
                        }
                    } else
                    if (wc == L'\n' || wc == L'\r') {
                        ungetwc(wc, in);
                    } else
                    if (wc != L',') {
                        errno = 0;
                        return CSV_FORMAT_ERROR;
                    }
                    break;
                }

#ifdef BACKSLASH_ESCAPES
            } else
            if (wc == L'\\') {
                errno = 0;
                wc = getwc(in);

                if (wc == L'\0')
                    continue;
                else
                if (wc == WEOF) {
                    if (errno == EILSEQ)
                        return CSV_CHARSET_ERROR;
                    if (errno)
                        return CSV_READ_ERROR;
                    if (ferror(in)) {
                        errno = EIO;
                        return CSV_READ_ERROR;
                    }
                    break;
                } else
                    switch (wc) {
                    case L'a':  wc = L'\a'; break;
                    case L'b':  wc = L'\b'; break;
                    case L't':  wc = L'\t'; break;
                    case L'n':  wc = L'\n'; break;
                    case L'v':  wc = L'\v'; break;
                    case L'f':  wc = L'\f'; break;
                    case L'r':  wc = L'\r'; break;
                    case L'\\': wc = L'\\'; break;
                    case L'"':  wc = L'"';  break;
                    case L',':  wc = L',';  break;
                    default:
                        ungetwc(wc, in);
                        wc = L'\\';
                    }
#endif
            }

            if (used + 2 > size) {
                /* Allocation policy.
                 * Anything that yields size >= used + 2 is acceptable.
                 * This one allocates in chunks of roughly 1024 wide characters,
                 * and is known to be robust (but not optimal) in practice. */
                size = (used | 1023) + 1009;
                data = realloc(data, size * sizeof data[0]);
                if (!data) {
                    errno = ENOMEM;
                    return CSV_OUT_OF_MEMORY;
                }
                *dataptr = data;
                *sizeptr = size;
            }

            data[used++] = wc;
        }
    else
        while (1) {

            if (wc == L',')
                break;

            if (wc == L'\n' || wc == L'\r') {
                ungetwc(wc, in);
                break;
            }

#ifdef BACKSLASH_ESCAPES
            if (wc == L'\\') {
                errno = 0;
                wc = fgetwc(in);
                if (wc == WEOF) {
                    if (errno == EILSEQ)
                        return CSV_CHARSET_ERROR;
                    if (errno)
                        return CSV_READ_ERROR;
                    if (ferror(in)) {
                        errno = EIO;
                        return CSV_READ_ERROR;
                    }
                    wc = L'\\';
                } else
                    switch (wc) {
                    case L'a':  wc = L'\a'; break;
                    case L'b':  wc = L'\b'; break;
                    case L't':  wc = L'\t'; break;
                    case L'n':  wc = L'\n'; break;
                    case L'v':  wc = L'\v'; break;
                    case L'f':  wc = L'\f'; break;
                    case L'r':  wc = L'\r'; break;
                    case L'"':  wc = L'"';  break;
                    case L',':  wc = L',';  break;
                    case L'\\': wc = L'\\'; break;
                    default:
                        ungetwc(wc, in);
                        wc = L'\\';
                    }
            }
#endif

            if (used + 2 > size) {
                /* Allocation policy.
                 * Anything that yields size >= used + 2 is acceptable.
                 * This one allocates in chunks of roughly 1024 wide characters,
                 * and is known to be robust (but not optimal) in practice. */
                size = (used | 1023) + 1009;
                data = realloc(data, size * sizeof data[0]);
                if (!data) {
                    errno = ENOMEM;
                    return CSV_OUT_OF_MEMORY;
                }
                *dataptr = data;
                *sizeptr = size;
            }

            data[used++] = wc;

            errno = 0;
            wc = getwc(in);

            if (wc == WEOF) {
                if (errno == EILSEQ)
                    return CSV_CHARSET_ERROR;
                if (errno)
                    return CSV_READ_ERROR;
                if (ferror(in)) {
                    errno = EIO;
                    return CSV_READ_ERROR;
                }
                break;
            }
        }

    /* Ensure there is room for the end-of-string mark. */
    if (used >= size) {
        size = used + 1;
        data = realloc(data, size * sizeof data[0]);
        if (!data) {
            errno = ENOMEM;
            return CSV_OUT_OF_MEMORY;
        }
        *dataptr = data;
        *sizeptr = size;
    }

    data[used] = L'\0';

    if (lengthptr)
        *lengthptr = used;

    errno = 0;
    return CSV_OK;
}

/* Helper function: print a wide string as if in quotes, but backslash-escape special characters.
*/
static void wquoted(FILE *const out, const wchar_t *ws, const size_t len)
{
    if (out) {
        size_t i;

        for (i = 0; i < len; i++)
            if (ws[i] == L'\0')
                fputws(L"\\0", out);
            else
            if (ws[i] == L'\a')
                fputws(L"\\a", out);
            else
            if (ws[i] == L'\b')
                fputws(L"\\b", out);
            else
            if (ws[i] == L'\t')
                fputws(L"\\t", out);
            else
            if (ws[i] == L'\n')
                fputws(L"\\n", out);
            else
            if (ws[i] == L'\v')
                fputws(L"\\v", out);
            else
            if (ws[i] == L'\f')
                fputws(L"\\f", out);
            else
            if (ws[i] == L'\r')
                fputws(L"\\r", out);
            else
            if (ws[i] == L'"')
                fputws(L"\\\"", out);
            else
            if (ws[i] == L'\\')
                fputws(L"\\\\", out);
            else
            if (iswprint(ws[i]))
                fputwc(ws[i], out);
            else
            if (ws[i] <= 65535)
                fwprintf(out, L"\\x%04x", (unsigned int)ws[i]);
            else
                fwprintf(out, L"\\x%08lx", (unsigned long)ws[i]);
    }
}


static int show_csv(FILE *const in, const char *const filename)
{
    wchar_t        *field_contents = NULL;
    size_t          field_allocated = 0;
    size_t          field_length = 0;
    unsigned long   record = 0UL;
    unsigned long   field;
    csv_status      status;

    while (1) {

        /* First field in this record. */
        field = 0UL;
        record++;

        while (1) {

            status = csv_next_field(in, &field_contents, &field_allocated, &field_length);

            if (status == CSV_END)
                break;

            if (status < 0) {
                fprintf(stderr, "%s: %s.\n", filename, csv_error(status));
                free(field_contents);
                return -1;
            }

            field++;

            wprintf(L"Record %lu, field %lu is \"", record, field);
            wquoted(stdout, field_contents, field_length);
            wprintf(L"\", %lu characters.\n", (unsigned long)field_length);
        }

        status = csv_next_record(in);

        if (status == CSV_END) {
            free(field_contents);
            return 0;
        }

        if (status < 0) {
            fprintf(stderr, "%s: %s.\n", filename, csv_error(status));
            free(field_contents);
            return -1;
        }
    }
}

static int usage(const char *argv0)
{
    fprintf(stderr, "\n");
    fprintf(stderr, "Usage: %s [ -h | --help | /? ]\n", argv0);
    fprintf(stderr, "       %s CSV-FILE [ ... ]\n", argv0);
    fprintf(stderr, "\n");
    fprintf(stderr, "Use special file name '-' to read from standard input.\n");
    fprintf(stderr, "\n");
    return EXIT_SUCCESS;
}

int main(int argc, char *argv[])
{
    FILE *in;
    int   arg;

    setlocale(LC_ALL, "");

    fwide(stdin, 1);
    fwide(stdout, 1);

    if (argc < 2)
        return usage(argv[0]);

    for (arg = 1; arg < argc; arg++) {

        if (!strcmp(argv[arg], "-h") || !strcmp(argv[arg], "--help") || !strcmp(argv[arg], "/?"))
            return usage(argv[0]);

        if (!strcmp(argv[arg], "-")) {

            if (show_csv(stdin, "(standard input)"))
                return EXIT_FAILURE;
                
        } else {

            in = fopen(argv[arg], "r");
            if (!in) {
                fprintf(stderr, "%s: %s.\n", argv[arg], strerror(errno));
                return EXIT_FAILURE;
            }

            if (show_csv(in, argv[arg])) {
                fclose(in);
                return EXIT_FAILURE;
            }
            if (ferror(in)) {
                fprintf(stderr, "%s: %s.\n", argv[arg], strerror(EIO));
                fclose(in);
                return EXIT_FAILURE;
            }
            if (fclose(in)) {
                fprintf(stderr, "%s: %s.\n", argv[arg], strerror(EIO));
                return EXIT_FAILURE;
            }
        }
    }

    return EXIT_SUCCESS;
}

The use of the above csv_next_field(), csv_skip_field(), and csv_next_record() is quite straightforward.

Open the CSV file normally, then call fwide(stream, 1) on it to tell the C library you intend to use the wide string variants instead of the standard narrow string I/O functions.
Create four variables, and initialize the first two:
```
wchar_t   *field = NULL;
size_t     allocated = 0;
size_t     length;
csv_status status;
```
field is a pointer to the dynamically allocated contents of each field you read. It is allocated automatically; essentially, you don't need to worry about it at all. allocated holds the currently allocated size (in wide characters, including terminating L'\0'), and we'll use length and status later.
At this point, you are ready to read or skip the first field in the first record.
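The setup described so far (open the stream, then select wide orientation before doing any I/O on it) could be wrapped up roughly like this. This is only a sketch: open_wide() is a hypothetical helper name, not part of the code above, and it assumes setlocale(LC_ALL, "") has already been called elsewhere in the program.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper (not part of the listing above): open a file for
 * reading and select wide orientation before any I/O is done on it.
 * Returns NULL if the file cannot be opened or cannot be made wide-oriented. */
static FILE *open_wide(const char *path)
{
    FILE *stream = fopen(path, "r");
    if (!stream)
        return NULL;

    /* fwide() returns a positive value if the stream is wide-oriented
     * after the call. */
    if (fwide(stream, 1) <= 0) {
        fclose(stream);
        return NULL;
    }

    return stream;
}
```

The returned stream can then be passed to csv_next_field(), csv_skip_field(), and csv_next_record(), and closed with fclose() as usual.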

Do not call csv_next_record() at this point unless you want to skip the very first record in the file entirely.
Call status = csv_skip_field(stream); to skip the next field, or status = csv_next_field(stream, &field, &allocated, &length); to read it.

If status == CSV_OK, you have the field contents in wide string field. It has length wide characters in it.

If status == CSV_END, there were no more fields in the current record. (The field is unchanged, and you should not examine it.)

Otherwise, status < 0, and it is an error code. You can use csv_error(status) to obtain a (narrow) string describing it.
At any point, you can move (skip) to the start of the next record by calling status = csv_next_record(stream);.

If it returns CSV_OK, there might be a new record available. (We only know for sure when you try to read or skip its first field. This is similar to how the standard C library function feof() only tells you whether you have tried to read past the end of input; it does not tell you whether there is more data available.)

If it returns CSV_END, you have already processed the last record, and there are no more records.

Otherwise, it returns a negative error code, status < 0. You can use csv_error(status) to obtain a (narrow) string describing it.
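The feof() behaviour mentioned in the comparison above can be checked with a few lines of standard C. In this sketch, probe_eof() is a hypothetical helper used only for illustration. On an empty stream, feof() reports nothing until a read has actually been attempted:

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: report what feof() says before and after the first
 * read attempt on a stream. Returns the result of that fgetwc() call. */
static wint_t probe_eof(FILE *in, int *before, int *after)
{
    wint_t wc;

    *before = feof(in) != 0;   /* no read has been attempted yet */
    wc = fgetwc(in);           /* the attempt itself detects end of input */
    *after = feof(in) != 0;

    return wc;
}
```

For an empty input stream, before is 0 and after is 1, and the function returns WEOF. In the same way, a CSV_OK from csv_next_record() does not promise a further record; only the next read or skip attempt tells.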
After you are done, discard the field buffer:
```
free(field);
field = NULL;
allocated = 0;
```
You do not actually need to reset the variables to NULL and zero, but I recommend it. In fact, you can do the above at any point (when you are no longer interested in the contents of the current field), because csv_next_field() will then automatically allocate a new buffer as necessary.

Note that free(NULL); is always safe and does nothing. You do not need to check if field is NULL or not before freeing it. This is also the reason why I recommend initializing the variables immediately when you declare them. It just makes everything so much easier to handle.
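To see why starting from NULL and zero works so smoothly, here is a minimal sketch of the same grow-on-demand pattern in isolation. append_wchar() is a hypothetical helper, not a function from the code above. It borrows the allocation policy from csv_next_field(), but, unlike the listing, it keeps the result of realloc() in a temporary pointer so the local copy is not overwritten on failure.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper mirroring the buffer handling in csv_next_field():
 * append one wide character, growing the buffer when needed, and keep the
 * string terminated. Returns 0 on success, or -1 if memory runs out
 * (in which case the old buffer is still valid and unchanged). */
static int append_wchar(wchar_t **dataptr, size_t *sizeptr,
                        size_t *usedptr, wchar_t wc)
{
    wchar_t *data = *dataptr;
    size_t   size = *sizeptr;
    size_t   used = *usedptr;

    if (used + 2 > size) {
        wchar_t *temp;

        /* Same policy as above: anything with size >= used + 2 works. */
        size = (used | 1023) + 1009;

        /* realloc(NULL, n) behaves like malloc(n), so a NULL buffer
         * with size zero needs no special case. */
        temp = realloc(data, size * sizeof data[0]);
        if (!temp)
            return -1;

        data = temp;
        *dataptr = data;
        *sizeptr = size;
    }

    data[used++] = wc;
    data[used] = L'\0';
    *usedptr = used;
    return 0;
}
```

With field = NULL, allocated = 0, and a length counter of 0, the first call allocates the buffer; after free(field) and resetting the three variables, the next call simply allocates a fresh one.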

The compiled example program takes one or more CSV file names as command-line parameters, then reads the files and reports the contents of each field in each file. If you have a particularly fiendishly complex CSV file, this is a good way to check whether this approach reads all the fields correctly.
