I'm maintaining an archive of the heavily redacted documents coming out of the Foreign Intelligence Surveillance Court.

They come with big sections of text that look like this:

[screenshot of redacted text]

And when the OCR tries to work with this, you get text like:

production of this data on a daily basis for a period of 90 days. The sole purpose of this

production is to obtain foreign intelligence information in support of

individual authorized investigations to protect against international terrorism and

So in the OCRed version, the blacked-out spots simply become missing words. Sometimes the remaining words form a grammatically correct sentence with a different or odd meaning (as above); other times the result makes no sense. Either way, it's a problem. It would be much better if the OCR engine returned X's for these spots, or Unicode blocks like ▮▮▮▮.

The result I'd like is something like:

production of this data on a daily basis for a period of 90 days. The sole purpose of this

production is to obtain foreign intelligence information in support of XXXXXXXXXXX

individual authorized investigations to protect against international terrorism and

My question is how to get these X's. Is there a way to analyze the images to identify the black spots? Is there a way to replace them with X's or some better Unicode character? I'm open to any ideas to make this look right, but image editing is not my strong suit, nor is hacking deep within the OCR engine.

mlissner

1 Answer

You may want to train Tesseract to recognize those long blobs. Depending on the length of the blob, you would assign a different number of 'X' characters. Read TrainingTesseract3 for the training process.
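The length-to-X mapping can be illustrated without any training. The sketch below is not the Tesseract training approach itself but a simpler pre-processing idea: scan one row of grayscale pixels for long dark runs and turn each run's width into a string of X's. The function names, the darkness threshold, the minimum run width, and the 10-pixel average character width are all assumptions chosen for illustration; real scans would need values measured from the actual pages.

```python
from typing import List, Tuple


def find_dark_runs(row: List[int], dark_max: int = 50,
                   min_width: int = 20) -> List[Tuple[int, int]]:
    """Return (start, width) for each run of dark pixels at least min_width wide.

    `row` is one horizontal line of grayscale values (0 = black, 255 = white).
    `dark_max` and `min_width` are assumed thresholds, not measured ones.
    """
    runs = []
    start = None
    for x, value in enumerate(row):
        if value <= dark_max:
            if start is None:
                start = x  # a dark run begins here
        elif start is not None:
            if x - start >= min_width:
                runs.append((start, x - start))
            start = None
    # Handle a run that extends to the end of the row.
    if start is not None and len(row) - start >= min_width:
        runs.append((start, len(row) - start))
    return runs


def placeholder_for(width: int, char_width: int = 10, fill: str = "X") -> str:
    """Map a blob width in pixels to a run of fill characters.

    `char_width` is an assumed average glyph width; the count is rounded
    to the nearest whole character and is never less than one.
    """
    count = (width + char_width // 2) // char_width
    return fill * max(1, count)


# Synthetic row: white margin, a 110-pixel redaction bar, white gap,
# a 5-pixel dark stroke (ordinary text, too short to count), white margin.
row = [255] * 30 + [0] * 110 + [255] * 30 + [0] * 5 + [255] * 10
runs = find_dark_runs(row)
placeholders = [placeholder_for(width) for _, width in runs]
```

Passing `fill="▮"` would produce the Unicode blocks mentioned in the question instead of X's. Splicing the placeholder into the OCR output at the right position is a separate step that this sketch does not cover.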

nguyenq