Machine learning techniques applied to crack CAPTCHAs
F-Secure says it’s achieved 90% accuracy in cracking Microsoft Outlook’s text-based CAPTCHAs using its AI-based CAPTCHA-cracking server, CAPTCHA22.
For the last two years, the security firm has been using machine learning techniques to train unique models that solve a particular CAPTCHA, rather than trying to build a one-size-fits-all model.
And, recently, it decided to try the system out on a CAPTCHA used by an Outlook Web App (OWA) portal.
The initial attempt, according to F-Secure, was comparatively unsuccessful, with the team finding that after manually labelling around 200 CAPTCHAs, it could only identify the characters with an accuracy of 22%.
The first issue to emerge was noise, with the team determining that the greyscale value of noise and text was always within two distinct and constant ranges. Tweaks to the tool helped filter out the noise.
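To illustrate the idea, the following is a minimal sketch of range-based noise filtering, not F-Secure's actual code. The greyscale range and the `filter_noise` helper are assumptions made for this example; the article does not give the real values.

```python
# Hypothetical greyscale range (0-255); the real values are not published.
TEXT_RANGE = (0, 90)   # assumed: dark pixels belong to the characters
WHITE = 255            # background value used to erase everything else


def filter_noise(pixels, text_range=TEXT_RANGE):
    """Keep pixels whose grey value is inside text_range; paint the rest white."""
    low, high = text_range
    return [
        [value if low <= value <= high else WHITE for value in row]
        for row in pixels
    ]


# Tiny 3x4 "image": 30 and 60 stand in for text, 150 and 170 for noise.
sample = [
    [255, 30, 150, 255],
    [170, 60, 30, 255],
    [255, 150, 60, 170],
]
cleaned = filter_noise(sample)
```

Because the two ranges are described as distinct and constant, a fixed threshold like this is enough in principle; no per-image tuning is needed.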
The team also realized that some of the test CAPTCHAs had been labelled incorrectly, with confusion between, for example, ‘l’ and ‘I’ (lower case ‘L’ and upper case ‘i’). Fixing this shortcoming brought the accuracy up to 47%.
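One way to catch this kind of labelling error is to flag every label containing a look-alike character for a second manual check. The sketch below assumes labels are stored as a filename-to-text mapping; the `CONFUSABLE` set and `flag_ambiguous` helper are hypothetical, and only the ‘l’/‘I’ pair comes from F-Secure's account.

```python
# Assumed set of look-alike characters; only 'l' and 'I' are named in the article.
CONFUSABLE = set("lI1O0")


def flag_ambiguous(labels):
    """Return the filenames whose label contains a look-alike character."""
    return sorted(
        name for name, text in labels.items() if CONFUSABLE & set(text)
    )


# Hypothetical labelled samples for illustration only.
labels = {"a.png": "kXl9", "b.png": "mmwp", "c.png": "I7tq"}
to_review = flag_ambiguous(labels)
```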
Pyppeteer pulls the strings
More challenging, though, was handling the CAPTCHA submission to Outlook’s web portal.
“Instead of trying to replicate what occurred in JS, we decided to use Pyppeteer, a browsing simulation Python package, to simulate a user entering the CAPTCHA,” said Tinus Green, a senior information security consultant at F-Secure.
“Doing this, the JS would automatically take care of the submission for us.”
Green added: “We could use this simulation software to solve the CAPTCHA whenever it blocked entries and once solved, we could continue with our conventional attack, hence automating the process once again.
“We have now also refactored CAPTCHA22 for a public release.”
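The browser-driven submission Green describes might look roughly like the sketch below. It is not taken from CAPTCHA22: the `submit_captcha` function, the CSS selectors, and the `solve` callback are all assumptions, and it presumes that clicking the submit button triggers a page navigation.

```python
import asyncio


async def submit_captcha(url, solve, image_sel, input_sel, submit_sel):
    """Screenshot the CAPTCHA, solve it, and let the page's own JS submit it.

    `solve` is a hypothetical callable that maps PNG bytes to the answer text.
    The three selectors are placeholders for the target page's elements.
    """
    # Imported here so the module can be loaded without Pyppeteer installed.
    from pyppeteer import launch

    browser = await launch(headless=True)
    try:
        page = await browser.newPage()
        await page.goto(url)

        # Wait for the CAPTCHA image to appear, then capture it as PNG bytes.
        await page.waitForSelector(image_sel)
        image = await page.querySelector(image_sel)
        png_bytes = await image.screenshot()

        # Type the answer as a user would; the page's JS handles the rest.
        await page.type(input_sel, solve(png_bytes))
        # Assumption: the click causes a navigation; otherwise this times out.
        await asyncio.gather(page.waitForNavigation(), page.click(submit_sel))
        return page.url
    finally:
        await browser.close()
```

A caller would run it with something like `asyncio.run(submit_captcha(url, my_model.predict, "#captcha-img", "#captcha-input", "#submit"))`, where every argument is a stand-in for values specific to the portal being tested.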
CAPTCHA the flag
CAPTCHAs are challenge-response tests that many websites use to distinguish genuine requests by human users to sign up for or access web services from automated requests by bots.
Spammers, for example, attempt to circumvent CAPTCHAs in order to create accounts they can later abuse to distribute junk mail.
CAPTCHAs are something of a magnet for cybercriminals and security researchers, with web admins struggling to stay one step ahead.
Late last year, for example, PortSwigger Web Security uncovered a security weakness in Google’s reCAPTCHA that allowed it to be partially bypassed by using Turbo Intruder, a research-focused Burp Suite extension, to trigger a race condition.
Soon after, a team of academics from the University of Maryland was able to circumvent the anti-bot mechanism in Google’s reCAPTCHA v2 using a Python-based program called UnCaptcha, which could solve its audio challenges.
Green said: “There is a catch 22 between creating a CAPTCHA that is user friendly – grandma safe as we call it – and sufficiently complex to prevent solving through computers. At this point it seems as if the balance does not exist.”
Web admins shouldn’t, he says, “give away half the required information” through username enumeration, and users should be required to set strong passphrases conforming to NIST standards.
And, he adds: “Accept that accounts can be breached, and therefore implement MFA [multi-factor authentication] as an additional barrier.”