
I am writing a multi-threaded application, and this is my current idea. I have a FILE*[n], where n is determined at runtime. I open all n files for reading, and multiple threads then read from them. The computation on each file's data is equivalent, i.e. assuming serial execution, each file would stay in memory for the same amount of time.

Each file can be arbitrarily large, so one should not assume that it can be loaded into memory.

In this scenario I want to reduce the number of disk I/Os that occur. It would be great if someone could suggest a shared memory model for it (I don't know whether I am already using one, because I have very little idea of how these things are implemented). I am not sure how I should achieve this. In other words, I just want to know the most efficient model for implementing such a scenario. I am using C.

EDIT: A more detailed scenario.

The actual problem is that I have n Bloom filters for the data contained in n files, and once all the elements from a file have been inserted into the corresponding Bloom filter, I need to do membership testing. Since membership testing is a read-only operation on the data files, I can read a file from multiple threads, and the problem is easily parallelized. The number of data files is fairly large (around 20k; note that the number of files equals the number of Bloom filters), so I chose to spawn one thread per Bloom filter: each filter has its own thread, which reads every other file one by one and tests the membership of its data against that filter. I want to minimize disk I/O in this case.

Aman Deep Gautam
  • What platform are you talking about? If you're on Linux, the easiest approach is to open them as memory-mapped files, and let the OS deal with it. (I'm sure there's an equivalent for Windows.) – Oliver Charlesworth Jun 25 '12 at 23:36
  • I am on Linux. Can you explain a bit more, please? – Aman Deep Gautam Jun 25 '12 at 23:37
  • Not sure what you are trying to share via shared memory. If you are thinking of memory-mapped files, that doesn't necessarily reduce I/O (you still have to read all the stuff you have to read). Why do you think I/O is a problem? I don't think there is enough detail here to give meaningful suggestions... I notice your question title mentions writing a file, but there is no mention of how/where files are written in the body. – John3136 Jun 25 '12 at 23:38
  • With a memory-mapped file, you let the OS virtual-memory system deal with an efficient approach to paging the file in and out of physical memory, taking into account multiple accesses from different threads/processes. – Oliver Charlesworth Jun 25 '12 at 23:39

2 Answers


At the start, use the mmap() function to map the files into memory instead of opening and reading them through FILE*s. After that, spawn the threads that read the files. That way the OS serves the accesses from its page cache and only performs disk I/O when a page is not already resident, for example on first touch or after it has been evicted under memory pressure.
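A minimal sketch of that approach, assuming Linux/POSIX and pthreads. The names `mapped_file`, `map_file`, `scan_job`, `scan_worker` and `run_scans` are made up for illustration, and the per-thread work (counting one byte value) is only a stand-in for the real membership test:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define MAX_WORKERS 16  /* arbitrary limit chosen for this sketch */

/* Read-only view of one file, shared by every thread in the process. */
typedef struct {
    const unsigned char *data;
    size_t len;
} mapped_file;

/* Map `path` read-only. Returns 0 on success, -1 on failure. */
static int map_file(const char *path, mapped_file *out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {  /* mmap rejects length 0 */
        close(fd);
        return -1;
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return -1;

    /* Optional hint that the data will be read front to back. */
    posix_madvise(p, (size_t)st.st_size, POSIX_MADV_SEQUENTIAL);

    out->data = p;
    out->len = (size_t)st.st_size;
    return 0;
}

static void unmap_file(mapped_file *mf)
{
    munmap((void *)mf->data, mf->len);
}

/* Hypothetical per-thread job: count occurrences of `needle`. */
typedef struct {
    const mapped_file *file;
    unsigned char needle;
    size_t hits;
} scan_job;

static void *scan_worker(void *arg)
{
    scan_job *job = arg;
    size_t hits = 0;
    for (size_t i = 0; i < job->file->len; i++)
        if (job->file->data[i] == job->needle)
            hits++;
    job->hits = hits;  /* each thread writes only to its own job */
    return NULL;
}

/* Start one thread per job and wait for all of them. Returns 0 on success. */
static int run_scans(scan_job *jobs, size_t njobs)
{
    pthread_t tids[MAX_WORKERS];
    size_t started = 0;
    int rc = 0;

    assert(njobs <= MAX_WORKERS);
    for (; started < njobs; started++) {
        if (pthread_create(&tids[started], NULL, scan_worker, &jobs[started]) != 0) {
            rc = -1;
            break;
        }
    }
    for (size_t i = 0; i < started; i++)
        pthread_join(tids[i], NULL);
    return rc;
}
```

All threads read through the same mapping, so a page brought in by one thread is already resident for the others. With around 20k files it is probably wiser to map a file, let the workers scan it, and unmap it before moving on, rather than keeping every mapping open at once.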

timos

If your program is multi-threaded, all the threads share memory unless you take steps to create thread-local storage. You don't need OS-level shared memory directly. The way to minimize I/O is to ensure that each file is read only once if at all possible, and similarly that each result file is written only once.

How you do that depends on the processing you're doing.

If each thread is responsible for processing a file in its entirety, then the thread simply reads the file; you can't reduce the I/O any more than that. If a file must be read by several threads, then you should try to memory map the file so that it is available to all the relevant threads. If you're using a 32-bit program and the files are too big to all fit in the address space, you may not be able to do the memory mapping. Then you need to work out how the different threads will process each file, and try to minimize the number of times different threads have to reread the files. If you're using a 64-bit program, you may have enough virtual memory to handle all the files via memory-mapped I/O. You still want to keep the number of times that the data is accessed to a minimum. Similar concepts apply to the output files.
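For the case where mapping is not an option, one way to read each file only once is to make a single pass with a fixed-size buffer and hand every chunk to all interested consumers before reading the next one. The following is a rough, single-threaded sketch of that idea; `consumer`, `read_once`, `byte_counter` and `count_byte` are hypothetical names, and the byte counter merely stands in for something like a Bloom filter membership test:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define CHUNK_SIZE (1u << 16)  /* 64 KiB buffer; the size is an arbitrary choice */

/* A consumer is a callback plus its private state (hypothetical interface). */
typedef void (*chunk_fn)(void *ctx, const unsigned char *buf, size_t len);

typedef struct {
    chunk_fn fn;
    void *ctx;
} consumer;

/*
 * Read `path` exactly once and pass every chunk to all `n` consumers.
 * Returns 0 on success, -1 if the file cannot be opened or a read fails.
 */
static int read_once(const char *path, const consumer *consumers, size_t n)
{
    assert(consumers != NULL || n == 0);

    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    unsigned char buf[CHUNK_SIZE];
    size_t got;
    while ((got = fread(buf, 1, sizeof buf, fp)) > 0) {
        /* The chunk is in memory now: let every consumer see it. */
        for (size_t i = 0; i < n; i++)
            consumers[i].fn(consumers[i].ctx, buf, got);
    }

    int rc = ferror(fp) ? -1 : 0;
    fclose(fp);
    return rc;
}

/* Example consumer: counts occurrences of one byte value. */
typedef struct {
    unsigned char needle;
    size_t hits;
} byte_counter;

static void count_byte(void *ctx, const unsigned char *buf, size_t len)
{
    byte_counter *c = ctx;
    for (size_t i = 0; i < len; i++)
        if (buf[i] == c->needle)
            c->hits++;
}
```

In a threaded program the inner loop over consumers could be split across worker threads, as long as the buffer is not refilled until all of them have finished with the current chunk. Records that straddle a chunk boundary would also need handling, which this sketch leaves out.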

Jonathan Leffler