I'm maintaining an archive of the heavily redacted documents coming out of the Foreign Intelligence Surveillance Court.
They come with big sections of text that look like this:
When the OCR engine processes a page like this, you get text like:
production of this data on a daily basis for a period of 90 days. The sole purpose of this
production is to obtain foreign intelligence information in support of
individual authorized investigations to protect against international terrorism and
So in the OCRed version, the blacked-out spots simply show up as missing words. Sometimes the remaining words form a grammatically correct sentence with a different or odd meaning (as above); other times the resulting sentences make no sense. Either way it's a problem. It would be much better if the OCR engine could return X's for these spots, or Unicode blocks like ▮▮▮▮, instead.
The result I'd like is something like:
production of this data on a daily basis for a period of 90 days. The sole purpose of this
production is to obtain foreign intelligence information in support of XXXXXXXXXXX
individual authorized investigations to protect against international terrorism and
My question is how to go about getting these X's. Is there a way to analyze the images to identify the blacked-out spots? Is there a way to replace them with X's or some better Unicode character? I'm open to any ideas to make this look right, but neither image editing nor hacking deep within the OCR engine is a strong suit for me.
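To make the image-analysis half of the question concrete, here is a minimal sketch of one possible approach, not a tested solution. It assumes the page has already been loaded as a grayscale grid (a list of rows of 0-255 values, which an imaging library could supply), and that redactions are solid dark rectangles much wider than any pen stroke in a letter. The function names, the thresholds, and the 10-pixel character width are all hypothetical values chosen for illustration.

```python
def dark_runs(row, threshold=60, min_len=15):
    """Return (start, end) column spans where pixels stay darker than threshold."""
    runs, start = [], None
    for x, value in enumerate(row):
        if value < threshold:
            if start is None:
                start = x
        elif start is not None:
            if x - start >= min_len:
                runs.append((start, x))
            start = None
    # Close a run that reaches the right edge of the row.
    if start is not None and len(row) - start >= min_len:
        runs.append((start, len(row)))
    return runs


def find_redaction_boxes(gray, threshold=60, min_len=15, min_height=5):
    """Group long dark runs on consecutive rows into (left, top, right, bottom) boxes."""
    finished = []
    open_boxes = []  # boxes still growing, stored as [left, top, right, bottom]
    for y, row in enumerate(gray):
        runs = dark_runs(row, threshold, min_len)
        still_open = []
        for box in open_boxes:
            # A run continues a box if their column spans overlap.
            match = next((r for r in runs if r[0] < box[2] and r[1] > box[0]), None)
            if match is not None:
                runs.remove(match)
                box[0] = min(box[0], match[0])
                box[2] = max(box[2], match[1])
                box[3] = y + 1
                still_open.append(box)
            else:
                finished.append(box)
        # Runs that matched nothing start new boxes.
        still_open.extend([r[0], y, r[1], y + 1] for r in runs)
        open_boxes = still_open
    finished.extend(open_boxes)
    # Drop boxes too short to be a redaction bar (e.g. underlines).
    return [tuple(b) for b in finished if b[3] - b[1] >= min_height]


def placeholder(box, char_width=10, fill="X"):
    """Estimate how many fill characters cover the box width (char_width is a guess)."""
    width = box[2] - box[0]
    return fill * max(1, round(width / char_width))
```

Each detected box could then either be painted over with drawn X's before the OCR pass, or matched against the OCR engine's word positions so that a placeholder string is spliced into the text at the right spot. Which of those is practical depends on what the OCR engine exposes, which this sketch does not address.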