
I'm new to multithreaded programming. Below is my attempt at reading and processing a large text file line by line, with one thread reading and multiple threads processing. For now I've put in a trivial process function; in my real application the processing takes much longer and then outputs the result to a new file.

The code only works intermittently, so a race condition is presumably occurring somewhere. I tried debugging with gdb and with extensive logging, but couldn't figure out where the bug is. Please help; any suggestion is appreciated.

#include <stdio.h>
#include <pthread.h>
#include <stdlib.h>
#include <semaphore.h>
#include <string.h>
#include <zlib.h>

#define READER_SIZE 30

#define NUM_PROCESSORS 4

// shared data by the reader and processors
//------------------------------------------------
char reader_buffer[READER_SIZE][64]; 
int reader_in = 0;                      // next free position
int reader_out = 0;                     // first full position

sem_t reads_mutex; // semaphore to lock buffer
sem_t reads_full;  // counts the number of filled slots
sem_t reads_empty; // counts the number of empty slots
gzFile fileIn;
//------------------------------------------------

// shared data by the processors and writer
//------------------------------------------------
pthread_mutex_t writes_mutex; // mutex lock for writing
gzFile fileOut;
//------------------------------------------------

// Function reader
void *producer()
{
    while (!gzeof(fileIn))
    {
        printf("reader turn ");
        sem_wait(&reads_empty);
        sem_wait(&reads_mutex);

        gzgets(fileIn, reader_buffer[reader_in], 64);

        // printf("reader is working!\n");
        // printf("producer read in: %d, %s \n", reader_in, reader_buffer[reader_in]);
        int empty, full;
        sem_getvalue(&reads_empty, &empty);
        sem_getvalue(&reads_full, &full);

        reader_in = (reader_in + 1) % READER_SIZE;
        printf("reader_in: %d, empty is %d, full is %d\n", reader_in, empty, full + 1);

        sem_post(&reads_mutex);
        sem_post(&reads_full);
    }
}

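// Function processor: copies one line out of the shared buffer, then filters and writes it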
void *consumer(void *id)
{
    int consumer_id = *((int *)id);
    while (!gzeof(fileIn))
    {
        printf("consumer %d turn ", consumer_id);
        sem_wait(&reads_full);
        sem_wait(&reads_mutex);

        // copy the data
        char processor_buffer[64];
        char cell_barcode[17];

        strcpy(processor_buffer, reader_buffer[reader_out]);
        reader_out = (reader_out + 1) % READER_SIZE;

        int empty, full;
        sem_getvalue(&reads_empty, &empty);
        sem_getvalue(&reads_full, &full);

        printf("reader_out: %d, empty is %d, full is %d \n", reader_out, empty + 1, full);

        sem_post(&reads_mutex);
        sem_post(&reads_empty);

        // processing: kept trivial here (just keep lines starting with '@');
        // the real process function in my application is much slower
        if (processor_buffer[0] == '@')
        {
            pthread_mutex_lock(&writes_mutex);

            gzputs(fileOut, processor_buffer);

            pthread_mutex_unlock(&writes_mutex);
        }

    }

}

int main()
{
    fileIn = gzopen("temp.gz", "rb");
    fileOut = gzopen("test.gz", "wb");

    int *proc_id = malloc(NUM_PROCESSORS * sizeof(int));

    pthread_t reader_thread;
    pthread_t processor_thread[NUM_PROCESSORS];

    // initialize semaphores
    sem_init(&reads_mutex, 0, 1);
    sem_init(&reads_full, 0, 0);
    sem_init(&reads_empty, 0, READER_SIZE);

    pthread_mutex_init(&writes_mutex, NULL);

    pthread_create(&reader_thread, NULL, producer, NULL);

    for (int i = 0; i < NUM_PROCESSORS; i++)
    {
        proc_id[i] = i;
        pthread_create(&processor_thread[i], NULL, consumer, &proc_id[i]);
    }

    // wait for the reader and processor threads to finish
    pthread_join(reader_thread, NULL);

    for (int i = 0; i < NUM_PROCESSORS; i++)
    {

        pthread_join(processor_thread[i], NULL);
    }

    sem_destroy(&reads_mutex);
    sem_destroy(&reads_full);
    sem_destroy(&reads_empty);
    pthread_mutex_destroy(&writes_mutex);

    free(proc_id);

    gzclose(fileIn);
    gzclose(fileOut);

    return 0;
}
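
To make the failure reproducible, a small `temp.gz` can be generated with zlib along these lines. The FASTQ-like four-lines-per-record layout is just an assumption for illustration (the trivial filter above only cares about the `@` prefix), and `gen.c` is a placeholder file name; compile with `gcc gen.c -lz`.

#include <stdio.h>
#include <zlib.h>

// Hypothetical generator for a small test input file
int main(void)
{
    gzFile f = gzopen("temp.gz", "wb");
    if (f == NULL)
    {
        fprintf(stderr, "cannot create temp.gz\n");
        return 1;
    }
    for (int i = 0; i < 1000; i++)
    {
        gzprintf(f, "@read_%d\n", i);    // header line: kept by the filter
        gzprintf(f, "ACGTACGTACGT\n");   // sequence line: dropped
        gzprintf(f, "+\n");              // separator: dropped
        gzprintf(f, "IIIIIIIIIIII\n");   // quality line: dropped
    }
    gzclose(f);
    return 0;
}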
  • Try compiling with `-g -fsanitize=thread` and see if that can give you some more clues when you run the program. You can also try `-g -fsanitize=address,undefined` – but you can't use those two at the same time as `thread`. – Ted Lyngmo Jan 08 '23 at 22:22
  • Since `producer` reads the compressed file (via `gzgets`) and posts the _decompressed_ output to a shared buffer, I'm not sure why `consumer` uses any `gz*` functions (e.g. `gzeof`) for the _input_ stream. This seems wrong, and there is a race condition that might mess up the tail of the output. – Craig Estey Jan 08 '23 at 22:31
  • @CraigEstey, you are absolutely right about the messed-up tail of the output. The producer always reads one more line than the file has; I couldn't figure out why. As for the `while` condition in the consumer, I couldn't find a better stopping criterion that would sync with the reader being done with the file. – yuw444 Jan 08 '23 at 22:37 [a sketch combining the suggestions from this thread appears after the comments]
  • Also, let's pretend there is _no_ compression. If the file has "chunks" of 64: `c0, c1, c2, ..., cN`, the producer queues these in a round robin fashion to (assume 3 output threads): `t0: c0, c3, c6, ...`, `t1: c1, c4, c7, ...`, and `t2: c2, c5, c8, ...` But, even with locking on the output, the output ordering could be scrambled because of racing between the output threads. – Craig Estey Jan 08 '23 at 22:37
  • @TedLyngmo I just tried those flags without luck. Thanks for your input. – yuw444 Jan 08 '23 at 22:39
  • @CraigEstey, I agree with the order. The order is not critical in my application. – yuw444 Jan 08 '23 at 22:41
  • @CraigEstey I see your point now. Do you mean using NULL to evaluate the `while` loop in the consumer, i.e. after reading a slot out in the consumer, mark that slot with NULL? – yuw444 Jan 08 '23 at 22:45
  • I'm not 100% sure about `pthread`s, but shouldn't you `return NULL;` from the thread functions (or call `pthread_exit(NULL)`)? – Ted Lyngmo Jan 08 '23 at 22:49
  • Okay. Order not critical. Is chunk processing time _considerably_ more than read/decompress time and recompress/write time? There are a few ways to do this, depending on the relative processing times. If output ordering is _not_ important, I'd have the producer send data to a _single_ shared queue. Each consumer thread can dequeue from the top of the single queue. This allows all consumer threads to operate at maximum speed. Otherwise, if the processing time varies based on the chunk data, round robin can cause a bottleneck because one thread is given several "hard" chunks while others are idle. – Craig Estey Jan 08 '23 at 22:50
  • Also, the producer queues round robin, but does _not_ check if the _prior_ chunk sent to a given consumer has been completed (i.e. the consumer's area in `reader_buffer` is free/available for the new chunk). That is, for `t0`, `c0` was queued. Now, producer wants to queue `c3` to `t0` but, it has no way to tell whether `t0` has finished `c0`. It just _assumes_ it can (i.e. another race condition). – Craig Estey Jan 08 '23 at 22:54
  • @CraigEstey Thanks for the queue idea; I never thought about it. Way to go! – yuw444 Jan 08 '23 at 22:54
  • @CraigEstey Lol, there is always a but. I will think about it. – yuw444 Jan 08 '23 at 23:05
  • You're welcome! See my answer: [Need test to determine if asynchronous I/O is actually happening in C code](https://stackoverflow.com/a/65099395/5382650) It has some basic queue code. But, it doesn't have interthread locking (e.g. mutexes). So, you can add those. Or, if you put the enqueue/dequeue indexes in a struct: `struct qpair { int enq; int deq; };`, you can use `stdatomic.h` primitives to atomically change the index pairs instead of using mutexes – Craig Estey Jan 08 '23 at 23:07
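
Pulling the suggestions from this thread together, below is a minimal, untested sketch of one way the pipeline could shut down cleanly: the producer loops on the return value of `gzgets()` instead of calling `gzeof()` (which avoids the phantom extra line), consumers never touch the input stream, termination is signalled by one empty-string sentinel per consumer (safe because a successful `gzgets()` never yields an empty string; a blank line comes back as `"\n"`), and both thread functions `return NULL`. File names, buffer sizes, and the semaphore scheme mirror the question.

#include <stdio.h>
#include <string.h>
#include <pthread.h>
#include <semaphore.h>
#include <zlib.h>

#define READER_SIZE 30
#define NUM_PROCESSORS 4

static char reader_buffer[READER_SIZE][64];
static int reader_in = 0, reader_out = 0;
static sem_t reads_mutex, reads_full, reads_empty;
static pthread_mutex_t writes_mutex = PTHREAD_MUTEX_INITIALIZER;
static gzFile fileIn, fileOut;

// push one line (or an empty-string sentinel) into the ring buffer
static void enqueue(const char *line)
{
    sem_wait(&reads_empty);
    sem_wait(&reads_mutex);
    strcpy(reader_buffer[reader_in], line);   // always < 64 bytes here
    reader_in = (reader_in + 1) % READER_SIZE;
    sem_post(&reads_mutex);
    sem_post(&reads_full);
}

static void *producer(void *arg)
{
    char line[64];
    (void)arg;
    while (gzgets(fileIn, line, sizeof line) != NULL)  // NULL = EOF or error
        enqueue(line);
    for (int i = 0; i < NUM_PROCESSORS; i++)
        enqueue("");                                   // one sentinel per consumer
    return NULL;
}

static void *consumer(void *arg)
{
    char local[64];
    (void)arg;
    for (;;)
    {
        sem_wait(&reads_full);
        sem_wait(&reads_mutex);
        strcpy(local, reader_buffer[reader_out]);
        reader_out = (reader_out + 1) % READER_SIZE;
        sem_post(&reads_mutex);
        sem_post(&reads_empty);

        if (local[0] == '\0')                          // sentinel: no more work
            return NULL;

        if (local[0] == '@')                           // trivial "processing"
        {
            pthread_mutex_lock(&writes_mutex);
            gzputs(fileOut, local);
            pthread_mutex_unlock(&writes_mutex);
        }
    }
}

int main(void)
{
    pthread_t reader_thread, processor_thread[NUM_PROCESSORS];

    fileIn = gzopen("temp.gz", "rb");
    fileOut = gzopen("test.gz", "wb");
    if (fileIn == NULL || fileOut == NULL)
    {
        fprintf(stderr, "cannot open input or output file\n");
        return 1;
    }

    sem_init(&reads_mutex, 0, 1);
    sem_init(&reads_full, 0, 0);
    sem_init(&reads_empty, 0, READER_SIZE);

    pthread_create(&reader_thread, NULL, producer, NULL);
    for (int i = 0; i < NUM_PROCESSORS; i++)
        pthread_create(&processor_thread[i], NULL, consumer, NULL);

    pthread_join(reader_thread, NULL);
    for (int i = 0; i < NUM_PROCESSORS; i++)
        pthread_join(processor_thread[i], NULL);

    sem_destroy(&reads_mutex);
    sem_destroy(&reads_full);
    sem_destroy(&reads_empty);
    gzclose(fileIn);
    gzclose(fileOut);
    return 0;
}

Building with `gcc -g -fsanitize=thread sketch.c -lpthread -lz` (the file name `sketch.c` is just a placeholder) lets ThreadSanitizer double-check the locking, per the first comment above.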

0 Answers