
I've written a Python script that uses OpenCV to read images off a drive, do a little bit of processing, and store them in a buffer. To speed it up, I created a multithreaded version using Python's `threading` package that spawns a number of workers to read the files in parallel.

(Keeping everything in memory would be the ideal solution, but there are just too many files and not enough memory.)

The implementation is super simple: each worker is given a list of filenames to read with cv2.imread(), process, and store in a buffer until it is asked for them.

I've tested this script multiple times on two machines, and observe the following:

  • Windows PC, SATA SSD, Single-Threaded: This is the baseline, it works fine.

  • Windows PC, SATA SSD, Multi-Threaded: Significant speedup that scales very well with worker count, and works fine.

  • Ubuntu PC, NVMe SSD, Single-Threaded: Again, this is just reading files in a loop - also works fine.

  • Ubuntu PC, NVMe SSD, Multi-Threaded: Does not seem to be any faster than the ST version, and is considerably slower than the MT version on the SATA SSD. It does its job until some of the image files on the drive start becoming corrupt. A handful become corrupt, and the script crashes because it can't open them.

Reading the affected files programmatically produces the error "libpng error: bad adaptive filter value". They can't be opened with a photo viewer or anything like that either. On cursory inspection they appear to have been truncated.

I initially wrote the script on my Windows PC where it worked fine. I've replaced the images and run a few trials to verify that the MT script is the cause of this file corruption that I'm seeing. It does seem to be the case.

My best guess at the issue is that two threads trying to read the same image at the same time is the culprit. I'm not sure of this, however, since it is a read operation, so naively I wouldn't expect it to change the data at all, and it doesn't cause any issues on the Windows machine using the SATA drive.

I was also expecting a similar speed boost on the Ubuntu machine; it's curious that this isn't the case.

The drive in question is a Samsung 970 Evo Plus.

    `My best guess at the issue is that two threads trying to read the same image at the same time is the culprit.` On Linux, reading the same file handle from two different threads can result in each getting a chunk of data, advancing the seek position, and making the other thread skip part of the file. Are these two threads sharing a file, or sharing a file *handle*? – Nick ODell Aug 12 '23 at 17:36
  • That's a good point, though it does sound like the issue you're describing should only result in faulty data being used by my script, rather than actually corrupting the file on disk. In terms of sharing a file handle, I'm honestly not sure. Each thread is calling cv2.imread() independently, but I don't know if that necessarily means distinct handles. – cytokinesis Aug 12 '23 at 21:15
  • Got it - while reading your question, I missed that this is corrupting files on disk, not just giving you invalid data. – Nick ODell Aug 12 '23 at 21:52
  • Does the number of files in the dataset matter to causing the corruption? Suppose you have one file, and you attempt to repeatedly read from it in multiple threads. Is that enough to cause the corruption? – Nick ODell Aug 12 '23 at 21:53
  • Be sure that each thread gets a file _name_, not an _open_ file - and they should open the file in the thread. Nonetheless, this is certainly an unusual outcome - to the point that it should be reported as a bug to OpenCV – jsbueno Aug 16 '23 at 16:04

0 Answers