Saving data while doing multithreaded processing

Question

There is a downloader application which performs different kinds of processing on the download items in multiple threads. Some threads analyze input data, some perform downloading, extraction, save state etc. Thus each type of a thread operates on certain data members and some of these threads may run simultaneously. Download item could be described like this:

class File;

class Download
{
public:
    enum State
    {
        Parsing, Downloading, Extracting, Repairing, Finished
    };

    Download(const std::string &filePath): filePath(filePath) { }

    void save()
    {
        // TODO: save data consistently

        StateFile f;    // state file for this download

        // save general download parameters
        f << filePath << state << bytesWritten << totalFiles << processedFiles;

        // Now we are to save the parameters of the files which belong to this download,
        // (!) but assume the downloading thread kicks in, downloads some data and 
        // changes the state of a file. That causes "bytesWritten", "processedFiles" 
        // and "state" to be different from what we have just saved.

        // When we finally save the state of the files their parameters don't match 
        // the parameters of the download (state, bytesWritten, processedFiles).
        for (File *f : files)
        {
            // save the file...
        }
    }

private:
    std::string filePath;
    std::atomic<State> state = Parsing;
    std::atomic<int> bytesWritten = 0;
    int totalFiles = 0;
    std::atomic<int> processedFiles = 0;
    std::mutex fileMutex;
    std::vector<File*> files;
};

I wonder how to save these data consistently. For instance, the state and the number of processed files might have already been saved, and we are going to save the list of files. Meanwhile some other thread may alter the state of a file, and consequently the number of processed files or the state of the download, making saved data inconsistent.

The first idea that comes to mind is to add a single mutex for all data members and lock it every time any of them is accessed. But that would be, probably, inefficient as most time threads access different data members and saving takes place only once in a few minutes.

It seems to me such a task is rather common in multithreaded programming, so I hope experienced people could suggest a better way.

*"The first idea that comes to mind is to add a single mutex for all data members and lock it every time any of them is accessed."* - Why can you not use multiple mutexes and lock access to indidividual members? And why not split up the class into multiple different classes so that each thread can quietly work on its own piece(s) of data until it's finished and the partial results are assembled to the final result? — Christian Hackl, Oct 01 '16 at 07:44
Well, as I described above, locking individual members doesn't prevent the whole data set from being saved inconsistently. E.g. the saved state of the download and the number of processed files may not match the saved list of files. Well, threads may use the same data members. I just meant they may use not all of them. — mentalmushroom, Oct 01 '16 at 07:51

score 0 · Answer 1 · 2016-10-01T11:34:56.833

I would suggest you use the producer consumer pattern.

The downloader produces to parser and notify it to consume, the parser produces to extractor and notify it to consume and the extractor to the repairer. You will then have a queue for each task. The synchronization could be optimized using condition variables so the consumer pulls only when it is notified once something has been produced. You will end up using very less mutexes and much more readable and efficient design.

Here is a sample code for a queue and how to do if you have to simultaneously download, parse, extract and save:

#include <thread>
#include <condition_variable>
#include <mutex>
#include <queue>
template<typename T>
class synchronized_queu
{
public:
    T consume_one()
    {
        std::unique_lock<std::mutex> lock(lock_);
        while (queue_.size() == 0)
            condition_.wait(lock); //release and obtain again
        T available_data = queue_.front();
        queue_.pop();
        return available_data;
    }
    void produce_one(const T& data)
    {
        std::unique_lock<std::mutex> lock(lock_);
        queue_.push(data);
        condition_.notify_one();//notify only one or all as per your design...
    }
private:
    std::queue<T> queue_;
    std::mutex lock_;
    std::condition_variable condition_;
};
struct data
{
    //.....
};

void download(synchronized_queu<data>& q)
{
    //...
    data data_downloaded = ; //data downloaded;
    q.produce_one(data_downloaded);
}

void parse(synchronized_queu<data>& q1, synchronized_queu<data>& q2)
{
    //...
    data data_downloaded = q1.consume_one();
    //parse
    data data_parsed = ;//....
    q2.produce_one(data_parsed);
}

void extract(synchronized_queu<data>& q1, synchronized_queu<data>& q2)
{
    //...
    data data_parsed = q1.consume_one();
    //parse
    data data_extracted = ;//....
    q2.produce_one(data_extracted);
}
void save(synchronized_queu<data>& q)
{
    data data_extracted = q.consume_one();
    //save....
}

int main()
{
    synchronized_queu<data> dowlowded_queue;
    synchronized_queu<data> parsed_queue;
    synchronized_queu<data> extracted_queue;

    std::thread downloader(download, dowlowded_queue);
    std::thread parser(parse, dowlowded_queue, parsed_queue);
    std::thread extractor(extract, parsed_queue, extracted_queue);
    std::thread saver(save, extracted_queue);
    while (/*condition to stop program*/)
    {

    }
    downloader.join();
    parser.join();
    extractor.join();
    saver.join();
    return 0;
}

The thing is these tasks are intended to run simultaneously, so there is no queue. Saving is required to occur periodically regardless of what other threads are doing. — mentalmushroom, Oct 01 '16 at 09:10
Yes and that is what I wanted to suggest. For a simple explanation let's have only downloader and extractor; you start simultaneously two threads downloader and extractor sharing same queue Q1. The downloader once it downloads anything it pushes to the queue and as the extractor is pulling from the queue once something is there it processes it.... — , Oct 01 '16 at 09:37
My question was rather about saving data while it's processed by other threads. — mentalmushroom, Oct 01 '16 at 09:56
The saver could be a task working on a queue which contain final data to save. — , Oct 01 '16 at 10:00
Download state must be saved at, say, every 3 minutes so as to make it possible for the program to load it and continue downloading in case of crash or abnormal termination. Hence even intermediate results must be saved. — mentalmushroom, Oct 01 '16 at 10:52
I have added full sample code. I hope you see more clearly my idea and fid how check states... — , Oct 01 '16 at 11:36
Thanks for the code, but as far as I understand 1) It saves data only when it gets extracted. The requirement is to save periodically (even when extraction didn't take place). 2) It saves only extracted items, which is also wrong, because the task is to save all items and their state must be consistent. Consistency of all download parameters as a whole is the key problem. — mentalmushroom, Oct 01 '16 at 13:10
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/124711/discussion-between-farhat-latrach-and-mentalmushroom). — , Oct 01 '16 at 13:28

Saving data while doing multithreaded processing

1 Answers1