
I have several separate files. I want to process every line of each file (sequentially and independently), and I want it to be fast.

So I wrote code that reads a big chunk of a file into a buffer in RAM, and then multiple threads compete to read lines from that buffer and process them. The pseudocode is as follows:

do{
  do{
    fread(buffer, 500MB, 1, file);
    // create threads
    // let the threads compete to read lines from the buffer and PROCESS them independently
    // join the threads
  }while( EOF not reached );
  file = nextfile;
}while( there is another file to read );

Or this one:

void mt_ReadAndProcess(){
  lock();
  fread(buffer,50MB,1,file);
  if(EOF reached)
    file = nextfile;
  unlock();
  process();
}
main(){
  // create multiple threads
  // each thread calls mt_ReadAndProcess()
}
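
(For concreteness, here is a compilable C sketch of this second idea, assuming POSIX threads; the block size, the file names, and process() are placeholders, and note that a block boundary may split a line, which real code would have to handle:)

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#define BLOCK_SIZE (50UL * 1024 * 1024)
#define NUM_THREADS 4

static pthread_mutex_t fileLock = PTHREAD_MUTEX_INITIALIZER;
static FILE *file;              // current input file
static const char **nextName;   // NULL-terminated list of remaining file names

static void process(char *buf, size_t len){
  // the expensive per-block work goes here
  (void)buf; (void)len;
}

static void *mt_ReadAndProcess(void *arg){
  (void)arg;
  char *buffer = malloc(BLOCK_SIZE);

  for(;;){
    pthread_mutex_lock(&fileLock);
    size_t n = 0;
    if(file){
      n = fread(buffer, 1, BLOCK_SIZE, file);
      if(feof(file)){                // EOF reached: advance to the next file
        fclose(file);
        file = *nextName ? fopen(*nextName++, "rb") : NULL;
      }
    }
    int done = (file == NULL);
    pthread_mutex_unlock(&fileLock);

    if(n > 0)
      process(buffer, n);            // PROCESS outside the lock
    else if(done)
      break;                         // no data left and no more files
  }

  free(buffer);
  return NULL;
}

int main(void){
  const char *names[] = { "file1.txt", "file2.txt", NULL };
  nextName = names;
  file = fopen(*nextName++, "rb");   // error checks omitted for brevity

  pthread_t tids[NUM_THREADS];
  for(int i = 0; i < NUM_THREADS; i++)
    pthread_create(&tids[i], NULL, mt_ReadAndProcess, NULL);
  for(int i = 0; i < NUM_THREADS; i++)
    pthread_join(tids[i], NULL);
  return 0;
}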

The PROCESS step is computationally expensive (it takes a long time).

Is there a better way to do this? A better way to read the files faster, or to process them with multiple threads?

Thanks all,

Ameer.


1 Answer


Why would you want to have threads "compete to read from buffer"? The data can easily be partitioned as it is read by the single thread doing the reading. Contending for data in a shared buffer gains nothing while likely wasting both CPU and wall-clock time.

Since you're processing line-by-line, just read lines from the file and pass the buffers by pointer to the worker threads.

Assuming you're running on a POSIX-compliant system, something like this:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>

#define MAX_LINE_LEN 1024
#define NUM_THREADS 8

// linePipe holds pointers to lines sent to
// worker threads
static int linePipe[ 2 ];

// bufferPipe holds pointers to buffers returned
// from worker threads and used to read data
static int bufferPipe[ 2 ];

// thread function that actually does the work
void *threadFunc( void *arg )
{
    const char *linePtr;

    for ( ;; )
    {
        // get a pointer to a line from the pipe
        read( linePipe[ 0 ], &linePtr, sizeof( linePtr ) );

        // end loop on NULL linePtr value
        if ( !linePtr )
        {
            break;
        }

        // process line

        // return the buffer
        write( bufferPipe[ 1 ], &linePtr, sizeof( linePtr ) );
    }

    return( NULL );
}

int main( int argc, char **argv )
{
    pipe( linePipe );
    pipe( bufferPipe );

    // create buffers and load them into the buffer pipe for reading
    for ( int ii = 0; ii < ( 2 * NUM_THREADS ); ii++ )
    {
        char *buffer = malloc( MAX_LINE_LEN );
        write( bufferPipe[ 1 ], &buffer, sizeof( buffer ) );
    }

    pthread_t tids[ NUM_THREADS ];
    for ( int ii = 0; ii < NUM_THREADS; ii++ )
    {
        pthread_create( &( tids[ ii ] ), NULL, threadFunc, NULL );
    }

    FILE *fp = ...

    for ( ;; )
    {
        char *linePtr;

        // get the pointer to a buffer from the buffer pipe 
        read( bufferPipe[ 0 ], &linePtr, sizeof( linePtr ) );

        // read a line from the current file into the buffer
        char *result = fgets( linePtr, MAX_LINE_LEN, fp );

        if ( result )
        {
            // send the line to the worker threads
            write( linePipe[ 1 ], &linePtr, sizeof( linePtr ) );
        }
        else
        {
            // return the unused buffer so it isn't lost
            write( bufferPipe[ 1 ], &linePtr, sizeof( linePtr ) );

            // either end loop, or open another file
            fclose( fp );
            fp = fopen( ... );
        }
    }

    // clean up and exit

    // send NULL to cause worker threads to stop
    char *nullPtr = NULL;
    for ( int ii = 0; ii < NUM_THREADS; ii++ )
    {
        write( linePipe[ 1 ], &nullPtr, sizeof( nullPtr ) );
    }

    // wait for worker threads to stop
    for ( int ii = 0; ii < NUM_THREADS; ii++ )
    {
        pthread_join( tids[ ii ], NULL );
    }

    return( 0 );
}
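
A note on why pipes work as the queues here: POSIX guarantees that writes of up to PIPE_BUF bytes (at least 512) are atomic, so the pointer-sized read() and write() calls above never interleave between threads. The buffer pipe also provides flow control for free: when all buffers are in flight, the reading thread blocks in read() until a worker returns one.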
– Andrew Henle
  • You're right, it's better to let the threads read for themselves, and my second example has the same idea: each thread reads a block of the file into its own buffer. In that case, could you tell me whether there is any problem with the speed, or is there a better idea? – ameerosein Aug 07 '17 at 12:47
  • *as you can check in the following post, reading a big block (or chunk) of a file at once using fread() is faster than reading that chunk line by line!* Really? Do you think you're going to be able to write code that's as fast and as reliable as the developers who wrote your operating system's libraries? Do you really think you can write better and faster code to split a text file into separate lines? Do you know how `fread()` actually reads data? How a call to `fread()` translates to one or more actual `read()` system calls? – Andrew Henle Aug 07 '17 at 12:58
  • So you can write a simple test for that: read an entire file at once, then read it line by line (see the sketch after these comments)! – ameerosein Aug 07 '17 at 13:05
  • Please check the top answer (by Adam) to this post and give me your comment, thanks: https://stackoverflow.com/questions/24851291/read-huge-text-file-line-by-line-in-c-with-buffering – ameerosein Aug 07 '17 at 13:37
  • Two lines of your code cause errors, could you check please? `char *buffer = malloc( MAX_LINE_LEN );` and `write( linePipe, &linePtr, sizeof( linePtr ) );` produced invalid-conversion errors. – ameerosein Aug 07 '17 at 13:59
  • @ameerosein Those lines shouldn't cause any problems in C code. Are you compiling as C or C++ code? – Andrew Henle Aug 07 '17 at 15:20
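
For what it's worth, the comparison suggested in these comments takes only a few lines to test. A minimal sketch, assuming a placeholder file name "input.txt" and a 500 MB cap; note that page-cache effects dominate, so each variant should be timed against an equally cold (or equally warm) cache:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define CHUNK_SIZE ( 500UL * 1024 * 1024 )
#define MAX_LINE_LEN 1024

static double seconds( struct timespec a, struct timespec b )
{
    return ( b.tv_sec - a.tv_sec ) + ( b.tv_nsec - a.tv_nsec ) / 1e9;
}

int main( void )
{
    struct timespec t0, t1;
    char line[ MAX_LINE_LEN ];
    char *chunk = malloc( CHUNK_SIZE );   // error checks omitted for brevity

    // variant 1: one large fread() into a single buffer
    FILE *fp = fopen( "input.txt", "rb" );
    clock_gettime( CLOCK_MONOTONIC, &t0 );
    size_t nread = fread( chunk, 1, CHUNK_SIZE, fp );
    clock_gettime( CLOCK_MONOTONIC, &t1 );
    printf( "fread of %zu bytes: %.3f s\n", nread, seconds( t0, t1 ) );
    fclose( fp );

    // variant 2: line-by-line fgets() through stdio's own buffering
    fp = fopen( "input.txt", "r" );
    clock_gettime( CLOCK_MONOTONIC, &t0 );
    while ( fgets( line, sizeof( line ), fp ) )
        ;   // read only; no processing
    clock_gettime( CLOCK_MONOTONIC, &t1 );
    printf( "fgets line by line: %.3f s\n", seconds( t0, t1 ) );
    fclose( fp );

    free( chunk );
    return( 0 );
}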