3

I am referring to software-based OCR (image-to-text conversion tools). Stack Overflow has tons of posts on building OCR, but I am looking for the opposite: any guidance on how to protect my images from this kind of reverse engineering.

For example, I have images containing only text. How can I make it difficult for anyone to decode the data? Is there an image format that can do this, or can we obfuscate the images?

Can special fonts or distortion guarantee OCR protection? My requirements do not allow heavily distorted text to be served, though.

Any direction will be very helpful.

duckduckgo
  • Are you looking for a CAPTCHA to authenticate a login or to avoid spam? If so, you should use an existing component. Or are you trying to post a document and want to avoid it being scanned? If so, I'm sure that OCR engines are advanced enough that anything OCR-proof is going to be way too annoying for your audience to read. – Hank Feb 04 '12 at 04:44
  • @HenryJackson - you guessed right, I am posting long documents to be read by people. Why do you say OCR-proof approaches are irritating? If this demands serious research and low-level programming, I would like to give it a try. – duckduckgo Feb 04 '12 at 05:07
  • If you can read it, you can (theoretically) OCR it. – aldrin Feb 04 '12 at 06:41
  • @aldrin you are correct, this is why captcha.net and Google make their images so obscure that they are hard for humans to read too. – duckduckgo Feb 06 '12 at 05:05

4 Answers

4

As I understand it, you have a collection of copyrighted text that should be clearly readable by humans, but you don't want it to leak from your server in electronic form. I don't think it's a good idea to obfuscate the text to make it harder to OCR, since that will also make it hard for humans to read, especially if the texts are really long. Basically, whatever is easy for humans to read can be OCR-ed perfectly well, and whatever is difficult to OCR is difficult for people too. In the worst case, an attacker can hire an outsourcing company to retype the text manually, which is actually not that expensive.

I would suggest looking at other aspects of protection instead. What does your use case look like? How do users end up with your texts as images on their PCs? Do they download them as PDF or image files? If so, it would be much simpler to fight the ability to DOWNLOAD your files than to make them unreadable.

For example, you could avoid giving access to the whole file at once and instead show it page by page, requiring human interaction to get to the next page. You could even structure your web interface so that typical site-download utilities cannot fetch everything: each page should be displayed at the same URL, while the actual navigation talks to the server via AJAX or even some proprietary interface.
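
To make the page-by-page idea a bit more concrete, here is a minimal server-side sketch in Python. It is only an illustration under my own assumptions: the class name `PagedDocument` and the one-time token scheme are hypothetical, and the HTTP/AJAX wiring is left out. Each response carries a single-use token that unlocks the next page, so pages cannot be requested out of order or by guessing URLs.

```python
import secrets


class PagedDocument:
    """Serves one page at a time; each response carries a token for the next page."""

    def __init__(self, pages):
        self.pages = list(pages)
        self._tokens = {}  # token -> index of the page it unlocks

    def start(self):
        """Return the first page together with a one-time token for the second."""
        return self._serve(0)

    def next_page(self, token):
        """Exchange a one-time token for the page it unlocks."""
        index = self._tokens.pop(token, None)  # pop() makes the token single-use
        if index is None:
            raise PermissionError("unknown or already used token")
        return self._serve(index)

    def _serve(self, index):
        next_token = None
        if index + 1 < len(self.pages):
            next_token = secrets.token_urlsafe(16)
            self._tokens[next_token] = index + 1
        return {"text": self.pages[index], "next_token": next_token}
```

A script that follows the tokens in sequence can still collect everything, so this only raises the bar for generic download utilities; it would need to be combined with per-session limits to slow down a targeted crawler.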

Another way is to put many false links on every page that are invisible to humans but mislead download utilities, making them download tons of wrong content, or download it in the wrong order, so that the result is unusable.
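
A rough sketch of the false-link idea, again in Python and with made-up names (`TRAP_PREFIX`, `add_trap_links`, `TrapMonitor` are all hypothetical). The links are hidden from human readers with CSS. Flagging any client that follows one is my own addition on top of the idea above, not something the approach requires.

```python
import html

TRAP_PREFIX = "/trap/"  # hypothetical URL prefix that no real page uses


def add_trap_links(page_html, trap_ids):
    """Append links that are hidden from humans but visible to a naive crawler."""
    links = "".join(
        f'<a href="{TRAP_PREFIX}{html.escape(t, quote=True)}" '
        f'style="display:none" rel="nofollow">more</a>'
        for t in trap_ids
    )
    return page_html + links


class TrapMonitor:
    """Remembers which clients have requested a trap URL."""

    def __init__(self):
        self.flagged = set()

    def record_request(self, client_id, path):
        """Return True while the client is still trusted, False once it hit a trap."""
        if path.startswith(TRAP_PREFIX):
            self.flagged.add(client_id)
        return client_id not in self.flagged
```

Once a client is flagged, the server can feed it the wrong content or simply stop answering. Keep in mind that a crawler which honours `display:none` or `rel="nofollow"` will never touch the traps.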

And if you succeed in fighting automated download, you won't even have to provide your content as an image. It can be plain text, served one small piece at a time; any single piece is unusable on its own.

Hope that gives you some idea of which way to go.

Ben Walker
Tomato
  • Thanks for the detailed reply. My content on the server side is HTML, displayed in the browser just for reading. It is split into pages and cannot be downloaded in one shot, but sending plain text would not solve the problem, because a crawler that makes multiple requests can collect the whole content. Using images adds one more step, which makes the process tedious, although I have seen that today's OCR can crack almost any format, even at high levels of obscurity. – duckduckgo Feb 06 '12 at 02:21
1

I do not think you can do that. For CAPTCHAs, yes, and there is a ton of research on them, but you also know from personal experience how annoying they are to read. For longer text it is impossible. I would seriously question the use case or business model here though. You have some content that for some reason needs protection from OCR. That means somebody would be willing to spend resources to OCR your content. Why would you fight those people? Make them customers and offer the content in plain text for some fee. If that fee is less than their OCR cost, you have a win-win. What you are trying to implement sounds like a lose-lose.

Ben Walker
starmole
  • My need is to display copyrighted content just for reading. At this point it would be acceptable to have some of that annoying experience while using the content, because the content is offered in a restricted manner anyway. I have seen several free online tools/bots for sequential download and text conversion of whole content. To some extent developers reject DRM because it is never foolproof. – duckduckgo Feb 04 '12 at 06:54
  • You are trying to solve an unsolvable problem. I think one trait of a good engineer is that you point out those things to business people instead of nodding and trying to implement the impossible. That said, what you really want to google for are things like protected video (or audio) path that some OSes or HW implement. It is trying to disallow screen scraping at the OS level and might be closest to what you are looking for. Of course it also does not work against a dedicated attacker. – starmole Feb 04 '12 at 07:21
  • With an OS- or hardware-level implementation, versatility and cross-compatibility are a problem though. I found the following article: http://www.codeproject.com/Articles/3907/Creating-Optical-Character-Recognition-OCR-applica – duckduckgo Feb 04 '12 at 12:00
1

As I and others have said, making a large amount of text obscure enough that OCR can't read it will make it impractical for humans.

Is there a specific threat you're trying to beat? Simple web crawlers often don't execute JavaScript, so a dumb way to make your text harder to scrape would be to load it with an AJAX request and insert it into the DOM.
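
To illustrate why this trips up simple crawlers, here is a small Python sketch of what a scraper that does not run scripts gets from such a page. The shell page and the `/api/page/1` endpoint are invented for the example; the scraper is just the standard library's `html.parser`.

```python
from html.parser import HTMLParser

# Hypothetical shell page: the text is absent and is fetched later by the script
# from /api/page/1 (an assumed endpoint), then inserted into the DOM.
SHELL_PAGE = """<html><body>
<div id="content"></div>
<script>
fetch("/api/page/1").then(r => r.text()).then(t => {
  document.getElementById("content").textContent = t;
});
</script>
</body></html>"""


class VisibleText(HTMLParser):
    """Collects text like a simple non-JavaScript scraper, skipping <script> bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if not self._in_script and data.strip():
            self.chunks.append(data.strip())


def scrape_text(page_html):
    """Return the text a scraper without a JavaScript engine would extract."""
    parser = VisibleText()
    parser.feed(page_html)
    parser.close()
    return " ".join(parser.chunks)
```

Calling `scrape_text(SHELL_PAGE)` yields an empty string, because the content only exists after the script has run in a browser. A crawler driving a headless browser, or one that calls the endpoint directly, is not stopped by this.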

Or if you want to get more intense, you could have the text displayed in a Flash or Silverlight control -- still not OCR-proof, but that would make it non-trivial to automatically grab large amounts of text, particularly if you have a Flash scrollbar and/or pagination. (I should point out that using a Flash control for something as simple as text sounds annoying, won't be searchable or bookmarkable, and obviously won't work on the majority of mobile devices.)

Hank
  • I built a prototype and found that the only difficulty is that humans cannot select the text or copy it to search on Google etc.; otherwise the text looks exactly the same when it is not obscured (and so is prone to OCR). I would like to explore Flash etc.; I am curious whether it can decrypt encrypted text sent from the server. – duckduckgo Feb 06 '12 at 02:25
  • I don't really have any experience with Flash, but I'm certain there is a way to encrypt communication between the control and the server, e.g. over SSL. Like I said, I think the idea of reading text in a Flash control sounds a little annoying for the user, and most people would probably agree that Flash is a technology that's on its way out, particularly for non-multimedia stuff like this. But if you're determined to get your site as theft-proof as possible, I suppose it's an option. Obviously, anyone determined enough WILL be able to scrape your site. – Hank Feb 06 '12 at 02:31
0

I have seen some pages obfuscating text by using invisible letters and other "noise" in the text. This way you can still display it as text, while making it a lot harder to copy.
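
Roughly, the noise trick could look like the following Python sketch. This is my guess at how such pages work, not a known implementation: decoy letters are wrapped in spans of a class `x`, and the page's stylesheet is assumed to contain `.x { display: none }`. All function names are hypothetical.

```python
import html
import random
import re


def add_hidden_noise(text, rng, rate=0.3):
    """Interleave decoy letters, wrapped in CSS-hidden spans, with the real text."""
    out = []
    for ch in text:
        out.append(html.escape(ch))
        if rng.random() < rate:
            decoy = rng.choice("abcdefghijklmnopqrstuvwxyz")
            out.append(f'<span class="x">{decoy}</span>')
    return "".join(out)


def naive_strip_tags(markup):
    """What a simple scraper might do: drop every tag and keep all the text."""
    return html.unescape(re.sub(r"<[^>]+>", "", markup))


def strip_hidden(markup):
    """What a reader effectively sees: the hidden spans contribute nothing."""
    return html.unescape(re.sub(r'<span class="x">.*?</span>', "", markup))
```

With `rng = random.Random()` and the default rate, `naive_strip_tags` returns the text with stray letters mixed in, while `strip_hidden` returns the original. A scraper that applies the same CSS rule undoes the noise just as easily, so this is a speed bump and nothing more.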

Another idea might be to watermark the text in some way, so that you can recognize where a "stolen" copy came from. Whether this is useful depends on exactly what you want to be protected from. As has already been mentioned, if it is readable, someone could manually copy it.
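
One possible way to watermark plain text, sketched in Python under my own assumptions: the reader's numeric ID is hidden as zero-width Unicode characters placed after spaces. The 16-bit width and the function names are arbitrary choices for the example.

```python
ZERO = "\u200b"  # zero-width space, encodes bit 0
ONE = "\u200c"   # zero-width non-joiner, encodes bit 1
BITS = 16        # assumed width of the reader ID


def embed_watermark(text, reader_id):
    """Hide reader_id as BITS zero-width characters, one after each of the first BITS spaces."""
    if not 0 <= reader_id < 2 ** BITS:
        raise ValueError("reader_id does not fit in BITS bits")
    if text.count(" ") < BITS:
        raise ValueError("text has too few spaces to carry the watermark")
    bits = format(reader_id, f"0{BITS}b")
    out, used = [], 0
    for ch in text:
        out.append(ch)
        if ch == " " and used < BITS:
            out.append(ONE if bits[used] == "1" else ZERO)
            used += 1
    return "".join(out)


def extract_watermark(text):
    """Recover the reader ID from a copy, or None if no complete watermark is present."""
    bits = "".join("1" if ch == ONE else "0" for ch in text if ch in (ZERO, ONE))
    return int(bits, 2) if len(bits) == BITS else None
```

The mark survives copy and paste of the text but not retyping or OCR of a screenshot, and anyone who knows about it can strip the zero-width characters, so it only helps against careless copying.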

Hjulle