177

Having worked on Amazon Mechanical Turk for a long time, I find that some reCAPTCHA questions are very similar to the 1c tasks there: identify this, identify that.

A search shows that several web authors share the same suspicion. Most of them are high in the tinfoil scale, so they are not very reliable sources; they are included below for notability purposes.

claim source 1

claim source 2

claim source 3

WP/ReCaptcha#criticism

claim source 5

Given that Google is not a charity, and that recaptcha is free, the claim might hold water.

What is Google getting out of ReCaptcha? Free labour? Big data behavioural analysis? Something else?

Mindwin Remember Monica
  • 47
    this is the entire point of recaptcha. – user428517 Jun 01 '16 at 16:35
  • 8
    They also use it to identify business addresses ("click images that have storefronts"), house numbers, and waterways ("click images where there are rivers") for mapping –  Jun 01 '16 at 14:13
  • Not really free, considering that creating and maintaining captchas cannot be cheap – Jeroen Jun 02 '16 at 11:52
  • @Jeroen The OP means the service is free to the user, not the provider – JBentley Jun 03 '16 at 08:05
  • 41
    This is amazing. You went on conspiracy theory blogs to see whether it was true, even though google makes no secret of this and it's explicitly written on recaptcha's website... –  Jun 03 '16 at 10:07
  • @NajibIdrissi The help center requires notability and that the question be new to the site. There is no mention of whether the question/claim should be hard or easy to prove or debunk. And Wikipedia is hardly a conspiracy theory site. Take this as an entry-level question. Probably a thing to learn is to check what the very company targeted by the claim is saying; I admit a fault in that and will include the "official status" in future questions. Thanks for your criticism. – Mindwin Remember Monica Jun 03 '16 at 14:07
  • 1
    @Mindwin Have you not accepted because it wasn't the answer you were looking for? – Insane Jun 04 '16 at 09:27
  • I think this all has to do with inglip. – Raystafarian Jun 05 '16 at 08:03
  • 4
    "High in the tinfoil scale", that's rich. – Celeritas Jun 05 '16 at 10:06
  • Aah, I always knew this! They're smart!! – ABcDexter Jun 05 '16 at 15:42
  • 13
    The irony of this question is extreme. "Is Stack Exchange using Q&A sites as a free source of human-intelligent labour?" – Paul Draper Jun 05 '16 at 18:42
  • Relevant XKCD: xkcd.com/1897 – qazwsx Feb 15 '19 at 18:48

2 Answers

277

Yes, and ReCaptcha have always been open about it, before and after being acquired by Google.

From its formation, one of ReCaptcha's main selling points was that the data would be used. At first, it was used for fixing errors and ambiguities in the digitisation of books. Here's an example of this being praised back in 2007, 2 years before Google acquired it, when ReCaptcha was new:

reCaptcha makes captchas more useful than just preventing spam; by tapping into the reportedly 150.000 hours spent daily typing in captchas, reCaptcha has users proofread book text that OCR could not recognize and which would otherwise have to farmed out to a Mechanical Turk or other distributed proofreader. ...reCaptcha’s ingeniousness lies in making an otherwise cumbersome task worthwhile

Today (2016), it's expanded to include improving maps, machine learning/AI, and possibly other uses. Google are open about this, and even have a gallery of examples of how recaptcha data is used: https://www.google.com/recaptcha/intro/index.html#creation-of-value

Millions of CAPTCHAs are solved by people every day. reCAPTCHA makes positive use of this human effort by channeling the time spent solving CAPTCHAs into digitizing text, annotating images, and building machine learning datasets. This in turn helps preserve books, improve maps, and solve hard AI problems.

Google were also open about this back when they first acquired it. In fact, at the time, not only did their slogan reference how the data was used, but the tool itself also explicitly stated it:

reCAPTCHA. Stop spam, read books.

The words above come from scanned books. By typing them, you help to digitize old texts.

enter image description here

Screenshot from this blog

They promoted the product as a form of "crowdsourcing", using labour people were giving freely anyway on existing captcha systems to do something useful. For example, from the official blog post by Google from 2009 announcing the acquisition:

Since computers have trouble reading squiggly words like these, CAPTCHAs are designed to allow humans in but prevent malicious programs from scalping tickets or obtain millions of email accounts for spamming. But there’s a twist — the words in many of the CAPTCHAs provided by reCAPTCHA come from scanned archival newspapers and old books. Computers find it hard to recognize these words because the ink and paper have degraded over time, but by typing them in as a CAPTCHA, crowds teach computers to read the scanned text.

In this way, reCAPTCHA’s unique technology improves the process that converts scanned images into plain text, known as Optical Character Recognition (OCR). This technology also powers large scale text scanning projects like Google Books and Google News Archive Search.


Whether that's efficiency or exploitation, smart or immoral, is subjective, but there's never been any doubt or secrecy about the fact that it happens.

[edit] Of course, since some of the challenges come from "text" that software had failed to identify, there's always the possibility that a challenge is unreasonably difficult, or not even text at all. With thanks to Mateo's comment:

[image: http://i.stack.imgur.com/EbT01.png]

user56reinstatemonica8
  • 18
    If this helps make a better OCR software, I'm all for it. I've done a lot of OCR result proofing and the current apps leave a lot to be desired. – Joe L. May 30 '16 at 15:54
  • 3
    Doesn't that imply that reCAPTCHA will have to get trickier and trickier over time as existing solutions are improved by the growing database of user input? (Though one could say it already has, what with the movement from word recognition to visual classification.) – JAB May 30 '16 at 16:45
  • @JAB that's probably a question for Google, not me... – user56reinstatemonica8 May 30 '16 at 16:53
  • 1
    @JAB Is reCAPTCHA data *publicly* available? If not I don't really see the problem... only Google becomes better at reading, not other random bots. – Bakuriu May 30 '16 at 17:49
  • 3
    @JAB That could only be true if there were no more poorly-typed books being produced, which I doubt, and bear in mind that single captchas are shown to multiple users to make sure they don't get false positives. The fact that Google has moved on to visual challenges probably has more to do with the Google Maps and Google Images recognition projects being more important to them than Google Scholar at the moment. – MrLore May 30 '16 at 18:24
  • 5
    Also for a while they were improving house number recognition for google street maps. – PlasmaHH May 30 '16 at 19:24
  • 30
    What I don't understand is, doesn't it have to know what the answer is to verify your input? How then, are you helping by participating? – Carcigenicate May 30 '16 at 20:30
  • 36
    @Carcigenicate Follow the link below the image, and the comments: at least one person tried to sneak rude words into Google Books by figuring out which word was the "test" word and which was the "scan" word, writing the "test" one correctly to pass the test, while trolling the "scan" one. Presumably the devs got wise to this and only accepted input that matched from multiple unrelated people... – user56reinstatemonica8 May 30 '16 at 20:39
  • 36
    @Carcigenicate As I understand it, that is why they sent two words: one was a word they already had high confidence about and could therefore use as a test. The other was a word they wanted you to read, but you didn't know which was which. IIRC they also sent "suspicious" users two "test" words rather than one "test" and one "read". – Peter Green May 31 '16 at 01:02
  • 2
    @PeterGreen Exactly. And that's why their whole claim of "using the work done anyway for something useful" was bogus (at least in those days, I haven't looked into the more recent forms). Half the work was the CAPTCHA, the other half was used for something useful. In essence, they just made you do extra work. (That is not to say that the service doesn't have a number of positive sides to it, though.) – Jasper May 31 '16 at 16:44
  • 11
    The “unknown word”, if entered the same way by a few people, then becomes a test word. As this is a word that the best OCR software could not cope with, the system generated its own test images. Creating the test images for CAPTCHA is one of the hard problems to solve. – Ian Ringrose May 31 '16 at 17:12
  • 1
    @JAB yes, but that follows anyway — advances in ML mean that captchas have to get harder to serve their primary purpose of distinguishing humans from machines. – hobbs May 31 '16 at 17:27
  • 1
    @JAB That's exactly what already happened. There was a period where these captchas were all but unsolvable. It got better as the test set was vetted more smartly, but it still happens. – Konrad Rudolph May 31 '16 at 19:21
  • @KonradRudolph I've always hated captchas, but recently I've errored the old-school distorted text style ones that some big companies still use, SEVERAL times in a row. Not sure if small sample size error or if a result of what you said, but that does give me some context. – HC_ May 31 '16 at 22:33
  • 8
    had to try and "solve" this one once: http://i.stack.imgur.com/EbT01.png – Mateo Jun 01 '16 at 02:05
  • 2
    @mateo hahaha that's amazing, I'm stealing that and adding it to the answer, hope you don't mind! – user56reinstatemonica8 Jun 01 '16 at 06:22
  • 1
    @Carcigenicate It's also important to note that reCaptcha now evaluates most users by an overall score based on behavior, so many users won't have to solve captchas to authenticate themselves: https://security.googleblog.com/2014/12/are-you-robot-introducing-no-captcha.html In this case, each individual test is not so important, but a user who has a suspicious behavior pattern and fails multiple tests may be blocked from proceeding. – recognizer Jun 01 '16 at 19:58
  • @Mateo "牙 牙 牙 牙 牙 牙 牙 牙 foreInc", amirite? – mattdm Jun 01 '16 at 20:40
  • 2
    This is ridiculous - why is *this* my all time highest voted answer?!?! – user56reinstatemonica8 Jun 03 '16 at 12:16
  • 1
    @user568458 Because this is the answer that the most readers feel confident is correct, and therefore upvote. It's a known feature of the Stack Exchange system that "simple" answers get more upvotes. – Taemyr Jun 06 '16 at 04:22
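
The two-word scheme described in the comments above (one "test" word with a known answer, one "scan" word that OCR could not read, with a scan word promoted once several unrelated people agree on it) can be sketched roughly as follows. This is only an illustration of the idea, not Google's actual implementation: the class name, the method names, the image IDs, and the agreement threshold of three users are all assumptions made up for the example.

```python
from collections import Counter, defaultdict

# Assumed value for illustration; the real threshold is not stated in this thread.
AGREEMENT_THRESHOLD = 3


class TwoWordCaptcha:
    """Hypothetical sketch of a test-word / scan-word captcha."""

    def __init__(self, known_words):
        # image id -> trusted transcription (the "test" words)
        self.known = dict(known_words)
        # image id -> tally of transcriptions offered for "scan" words
        self.votes = defaultdict(Counter)
        # image id -> users whose vote was already counted
        self.voters = defaultdict(set)

    def submit(self, user_id, test_id, test_answer, scan_id, scan_answer):
        """Return True if the user passes; record their reading of the scan word."""
        # Only the test word decides whether the user passes.
        if test_answer.strip().lower() != self.known[test_id].lower():
            return False
        # Count at most one vote per user per scan word.
        if user_id not in self.voters[scan_id]:
            self.voters[scan_id].add(user_id)
            self.votes[scan_id][scan_answer.strip().lower()] += 1
            self._maybe_promote(scan_id)
        return True

    def _maybe_promote(self, scan_id):
        # Once enough distinct users agree, the scan word becomes a test word.
        answer, count = self.votes[scan_id].most_common(1)[0]
        if count >= AGREEMENT_THRESHOLD:
            self.known[scan_id] = answer
```

In this sketch a single user who types the test word correctly and a rude word for the scan word still passes, but their one vote never reaches the threshold, so it does not end up in the transcription.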
42

Yes!

Luis von Ahn, one of the original developers, talked at a TEDx conference about reCAPTCHA technology and his new project, Duolingo.

In this presentation, he talks about the history and problems of CAPTCHAs, and about how people were wasting about 500,000 hours every day solving them. He then thought about how to put this time to good use, such as helping to digitize books via OCR.

He wants to apply this idea of massive collaboration to Duolingo, a language-learning platform. The idea is to translate texts into languages other than English.

Oddthinking
Rodrigo Menezes
  • 3
    [Here's the original talk](https://www.youtube.com/watch?v=tx082gDwGcM) from 2006 at Google TechTalks. – Max Jun 03 '16 at 15:53