All of us have had at least one encounter with the infamous “captcha”, whether we realized it or not. Not sure what a captcha is?
It is the annoying distorted image that we have to decipher and type out while filling in a form on a website. For most of us, that part is the hardest, and some of us are unfortunate enough to fail the captcha test. But there is no escaping it: we are forced to squint hard and try to decipher the hard-to-read text presented in the image. Most modern browsers help the user with an autofill feature that fills in most of the required information, but when it comes to the captcha, the hard work has to be done by the user. Like all other technologies, captcha too has evolved over the years. Let us follow the journey of captcha from the beginning.
Captcha images look like this:
According to Wikipedia a CAPTCHA (an acronym for “Completely Automated Public Turing test to tell Computers and Humans Apart”) is a type of challenge-response test used in computing to determine whether or not the user is human.
Oops... too many technical terms, right? Let’s simplify this, but before simplifying, let’s understand what the “Turing test” is. The image below will help you understand it.
The standard Turing test consists of 3 participants: two players, A and B (one a computer, the other a human), and an interrogator, C.
So, as per the test, the interrogator (our genius dude C) is tasked with determining which player, A or B, is a computer and which is a human. The interrogator is limited to using the responses to written questions to make the determination.
In practice, our dude C is chatting with two girls simultaneously on some chat application. Both girls say they don’t have a webcam, so this guy has to chat with them in text only. He doesn’t know that one of the two is a computer program while the other is a real human. And the most interesting fact about this entire conversation is that he doesn’t know he is part of a Turing test. Just like this guy, we too have taken part in Turing tests many times while online (maybe while chatting on Yahoo Messenger) 😎
Back to captcha: a CAPTCHA is a program that protects websites against bots (automated programs) by generating and grading tests that humans can pass but computer programs cannot. For example, humans can read distorted text (like the one shown above), but computer programs can’t. So with captcha we are actually securing our website against certain kinds of automated abuse.
Let’s take an example of what could happen if no captcha is implemented on a web form.
Suppose I am running an online survey tool and am ready to pay $1 for each submission of the form (I know the rate is too high). Let’s assume the site is not secured with captcha, but I am trying to identify repeat submissions using cookies. Now an automated program can fill in and submit the survey any number of times, perhaps using a different browser, after clearing the cookies, or in an incognito window. At the end of the day I will have zero in my bank balance and junk data as the survey result.
But if I secure the form using captcha and check the cookie as well, the automated program will not be able to submit the form unless the correct captcha is entered. This protects the form and the survey against bot attacks.
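To make the survey example concrete, here is a minimal sketch in Python. Everything in it is hypothetical and only for illustration: the `SurveyForm` class, the `run_bot` helper and the `captcha_solved` flag are made-up names, and a real site would verify the captcha on the server rather than trust a boolean.

```python
import uuid


class SurveyForm:
    """Toy model of the paid survey form (hypothetical, for illustration)."""

    def __init__(self, require_captcha):
        self.require_captcha = require_captcha
        self.seen_cookies = set()
        self.paid_submissions = 0

    def submit(self, cookie_id, captcha_solved=False):
        # Cookie check: reject a visitor we have already paid.
        if cookie_id in self.seen_cookies:
            return False
        # Captcha check: reject anyone who did not solve the challenge.
        if self.require_captcha and not captcha_solved:
            return False
        self.seen_cookies.add(cookie_id)
        self.paid_submissions += 1
        return True


def run_bot(form, attempts):
    """A bot that clears its cookies before every attempt and cannot solve captchas."""
    for _ in range(attempts):
        form.submit(cookie_id=uuid.uuid4().hex, captcha_solved=False)
    return form.paid_submissions


unprotected = run_bot(SurveyForm(require_captcha=False), attempts=100)
protected = run_bot(SurveyForm(require_captcha=True), attempts=100)
print(unprotected, protected)
```

With only the cookie check, every one of the bot's submissions is paid, because each attempt arrives with a fresh cookie. With the captcha switched on, none of them gets through.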
Enough captcha wisdom. Let’s move on to the next topic: reCAPTCHA.
reCAPTCHA was originally developed by Luis von Ahn, Ben Maurer, Colin McMillen, David Abraham and Manuel Blum at Carnegie Mellon University’s main Pittsburgh campus and was acquired by Google in September 2009. Like the CAPTCHA interface, reCAPTCHA asks users to enter words seen in distorted text images onscreen. By presenting two words, or an image containing numbers (such as a door number or street number), it both protects websites from bots attempting to access restricted areas and helps digitize the text of books.
So if you have ever read the small text below the reCAPTCHA logo, which is also the reCAPTCHA slogan, you will have seen this double purpose spelled out.
By using this feature, Google is not only helping us protect websites from bots but also helping digitization projects digitize text that is very hard for optical character recognition (OCR) to read. The New York Times archive is one of the projects digitized using this service, and people like you and me helped achieve this task. (Don’t you think our names should be listed in the credits, or at least a thank-you mail should be sent? ;-)) Every day the service displays over 100 million captchas.
Now the question arises: if OCR is not able to read the text properly, how is Google able to tell that we are keying in a correct response?
The process works like this: two altogether different OCR programs read the text and produce their output, and then a standard string matching program aligns both outputs with each other and checks them against a standard dictionary. If the output is in the dictionary, it’s bingo; otherwise the word is marked as suspicious and later displayed as a captcha. This word is then shown together with a known word. That means that, of the 2 words in a captcha, one has already been successfully digitized and is there just to check whether the user is a human or a bot; on submission, the request is rejected if the first word (the known word) is wrong.
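The OCR comparison step can be sketched roughly as follows. This is a simplified illustration, not the actual reCAPTCHA code: `find_suspicious_words` is a made-up name, the sample strings and the tiny dictionary are invented, and the sketch assumes both OCR outputs contain the same number of words, whereas the real system uses a string matching program to align them. It also assumes a word counts as suspicious when the two readings disagree or the reading is not a dictionary word.

```python
def find_suspicious_words(ocr_a, ocr_b, dictionary):
    """Return (position, reading_a, reading_b) for every word that is suspicious.

    Assumption: a word is accepted only if both OCR programs agree on it
    AND the agreed reading is in the dictionary; everything else is flagged.
    """
    suspicious = []
    # Simplification: pair the words up by position instead of aligning them.
    for position, (a, b) in enumerate(zip(ocr_a.split(), ocr_b.split())):
        readings_agree = a.lower() == b.lower()
        in_dictionary = a.lower() in dictionary
        if not (readings_agree and in_dictionary):
            suspicious.append((position, a, b))
    return suspicious


# Invented sample data, for illustration only.
dictionary = {"the", "quick", "brown", "fox"}
first_ocr = "the quick brovvn fox"
second_ocr = "the quick brown f0x"

for position, a, b in find_suspicious_words(first_ocr, second_ocr, dictionary):
    print(position, a, b)
```

The flagged words are the ones that would later be cut out of the scan and shown to humans as captcha words.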
If the known word is entered correctly but the second word has no data in the database yet (i.e. you are the lucky early bird entering a response for that particular word), it grants access and keeps your response as a reference for later users. If the first 3 guesses for the second word (which is harder for OCR) are the same but don’t match the OCR output, the response is marked as valid and saved. This word can now be placed as a known word alongside a new second word in a new captcha.
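The grading of a two-word captcha described above might look something like this. It is a sketch under stated assumptions: `grade_response`, the `recorded` dictionary and the word identifier `"word-17"` are hypothetical names, and the case-insensitive comparison is my own simplification.

```python
def grade_response(known_word, answer_known, unknown_id, answer_unknown, recorded):
    """Grade one submission of a two-word captcha.

    Access depends only on the known word. The answer for the unknown word
    is stored as one more human guess, but only if the known word was right.
    """
    if answer_known.strip().lower() != known_word.lower():
        return False  # known word wrong: reject and discard the other answer
    recorded.setdefault(unknown_id, []).append(answer_unknown.strip().lower())
    return True


recorded = {}  # unknown word id -> list of human guesses collected so far
print(grade_response("morning", "Morning", "word-17", "overlooks", recorded))
print(grade_response("morning", "mourning", "word-17", "overbooks", recorded))
print(recorded)
```

Only the first call is accepted, so only its guess for the unknown word is kept.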
If six responses for a second word are all different and don’t match any of the OCR outputs either, the word is considered unreadable and probably never pops up again.
The point allocation for the entire process is like this: each OCR reading counts for 0.5 points and each human response counts for 1 point.
Once a given word hits 2.5 points, the word is considered valid and marked as a known word.
In the above case, where the first 3 responses did not match the OCR result but all 3 were the same, they did not get the benefit of the 0.5 points given for OCR identification, but got 3 full points from 3 different users, so the word was marked as a known word.
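The point system can be sketched as a small vote counter. The weights (0.5 per OCR reading, 1 per human response), the 2.5-point threshold and the six-different-guesses rule come from the description above; the function name `classify_word`, the status labels and the sample readings are made up for illustration.

```python
from collections import Counter

OCR_POINTS = 0.5        # each OCR reading of a spelling
HUMAN_POINTS = 1.0      # each human response with that spelling
KNOWN_THRESHOLD = 2.5   # points needed to become a known word
UNREADABLE_AFTER = 6    # this many all-different human guesses: give up


def classify_word(ocr_readings, human_answers):
    """Return ("known", spelling), ("unreadable", None) or ("pending", None)."""
    ocr = [r.lower() for r in ocr_readings]
    humans = [a.lower() for a in human_answers]

    # Add up the points collected by every candidate spelling.
    scores = Counter()
    for reading in ocr:
        scores[reading] += OCR_POINTS
    for answer in humans:
        scores[answer] += HUMAN_POINTS

    if scores:
        best, best_score = scores.most_common(1)[0]
        if best_score >= KNOWN_THRESHOLD:
            return ("known", best)

    # Six (or more) guesses, all different, none matching an OCR reading.
    all_different = len(set(humans)) == len(humans)
    matches_ocr = bool(set(humans) & set(ocr))
    if len(humans) >= UNREADABLE_AFTER and all_different and not matchess_ocr_guard(matches_ocr):
        return ("unreadable", None)
    return ("pending", None)


def matchess_ocr_guard(flag):
    """Tiny helper kept separate so the rule above reads as one condition."""
    return flag


# Three humans agree with each other but not with either OCR reading.
print(classify_word(["agcd", "aqed"], ["aged", "aged", "aged"]))
# One OCR reading plus two humans: 0.5 + 1 + 1 = 2.5 points.
print(classify_word(["from", "frow"], ["from", "from"]))
```

In the first call the word reaches 3 points from humans alone, which is the case discussed above. In the second, the OCR half-point is what pushes the word over the threshold after only two human responses.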
The pictorial representation below will help you understand this better:
The text “This aged portion of society were distinguished from” is first run through two OCR programs, whose output was:
From the first OCR: niis aged pntkm so society were distinguished frow
From the second OCR: This aged pntkn so society were distinguished fiow
After this, when the standard string matching program runs against the dictionary, it marks the words that are in the dictionary as known words, but two (aged and from) were not captured correctly by either OCR program, so they are marked as captcha words and appear in captchas for human interpretation.
If the first 3 humans enter the same input, i.e. aged, as their response for this captcha (shown in the image), then it straightforwardly gets 3 points and becomes a known word. And as it is now a known word, it may appear as the known word in a captcha like the one below.
When Google was using reCAPTCHA for the New York Times digitization, we used to get only text like this. Since 2012, reCAPTCHA has also used photographs of house numbers taken from Google’s Street View project, in addition to scanned words, which look like this.
There are currently two versions of reCAPTCHA: v1 and v2.
Until now we have been discussing v1 only. reCAPTCHA v1 can take many forms, but here are the most common ones:
The newer v2 reCAPTCHA looks like this:
With the new captcha, users can gain access to a website protected by it simply by clicking on a box to confirm that they are not robots. There is no need to read the distorted text and numbers that were part of reCAPTCHA v1.
Let’s see how to use the new reCAPTCHA.
The new captcha appears as shown in the above image (with the text “I’m not a robot”). As soon as the user ticks the check box, an animation starts, and if you are lucky enough that the basic check considers you human, it will allow access; otherwise it may ask for another check, which could be our old reCAPTCHA or an image check (shown in the image below).
The new No CAPTCHA also helps us out: we can listen to an audio hint or use the refresh button to get a new image (whichever is offered for the additional task).
That’s all on captcha, reCAPTCHA and No CAPTCHA. If I have not made any point clear or have missed anything, please mention it in the comments below. I will try to reply as soon as possible.
Thank you for your precious time. The very next post will be on the implementation of captcha, reCAPTCHA and No CAPTCHA in real life. Meanwhile, you can visit our home page at thoughts2Share.in to learn more about us.
Your comments and suggestions are always welcome.