Machine learning techniques applied to crack CAPTCHAs

Source: portswigger.net

F-Secure says it’s achieved 90% accuracy in cracking Microsoft Outlook’s text-based CAPTCHAs using its AI-based CAPTCHA-cracking server, CAPTCHA22.

For the last two years, the security firm has been using machine learning techniques to train unique models that solve a particular CAPTCHA, rather than trying to build a one-size-fits-all model.

And, recently, it decided to try the system out on a CAPTCHA used by an Outlook Web App (OWA) portal.

The initial attempt, according to F-Secure, was comparatively unsuccessful, with the team finding that after manually labelling around 200 CAPTCHAs, it could only identify the characters with an accuracy of 22%.

The first issue to emerge was noise. The team determined that the greyscale values of noise pixels and text pixels always fell within two distinct, constant ranges, so tweaks to the tool were enough to filter the noise out.
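
As a rough illustration of that filtering step, the sketch below keeps only pixels whose greyscale value falls inside an assumed text range and paints everything else white. The range, the function name, and the toy image are assumptions made for illustration; the article does not give F-Secure's actual values or code.

```python
# Minimal sketch of range-based noise filtering on a greyscale image.
# TEXT_RANGE is a hypothetical value; the real ranges are not given in the article.
TEXT_RANGE = (0, 90)   # assumed: dark pixels belong to characters
WHITE = 255


def filter_noise(image, text_range=TEXT_RANGE, background=WHITE):
    """Return a copy of image (rows of 0-255 ints) in which every pixel
    outside text_range is replaced by background."""
    low, high = text_range
    return [
        [pixel if low <= pixel <= high else background for pixel in row]
        for row in image
    ]


# Toy 3x4 image: values near 40 stand in for text, values near 150 for noise.
sample = [
    [255, 40, 150, 255],
    [150, 42, 38, 255],
    [255, 255, 149, 41],
]
cleaned = filter_noise(sample)
```

Because the two ranges do not overlap, a fixed threshold like this needs no per-image tuning, which is presumably why a small tweak was enough.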

The team also realized that some of the test CAPTCHAs had been labelled incorrectly, with confusion between, for example, ‘l’ and ‘I’ (lower case ‘L’ and upper case ‘i’). Fixing this shortcoming brought the accuracy up to 47%.
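
A labelling audit of this kind can be sketched in a few lines. The example below flags labels that contain look-alike characters for a second manual check, and scores predictions against labels. The character set, the sample strings, and the choice to count a CAPTCHA as correct only when every character matches are all assumptions; the article does not say how F-Secure measured accuracy.

```python
# Hypothetical set of look-alike characters worth a second manual check.
CONFUSABLE = set("lI1O0")


def needs_review(label):
    """True if the label contains a character that is easy to mislabel."""
    return any(ch in CONFUSABLE for ch in label)


def captcha_accuracy(predictions, labels):
    """Fraction of CAPTCHAs whose predicted string matches the label exactly."""
    if not labels:
        return 0.0
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)


# Made-up labels and model outputs, for illustration only.
labels = ["Ilk9", "abcd", "xO0z", "qwer"]
predictions = ["llk9", "abcd", "xO0z", "qwex"]

flagged = [label for label in labels if needs_review(label)]
accuracy = captcha_accuracy(predictions, labels)
```

A wrong label penalises a model twice: it teaches the wrong character during training and marks a correct prediction as a miss during evaluation.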

Pyppeteer pulls the strings

More challenging, though, was handling the CAPTCHA submission to Outlook’s web portal.

There was no CAPTCHA POST request. Instead, the CAPTCHA answer was sent as a value appended to a cookie, and JavaScript logged the user’s keystrokes as the answer was typed.

“Instead of trying to replicate what occurred in JS, we decided to use Pyppeteer, a browsing simulation Python package, to simulate a user entering the CAPTCHA,” said Tinus Green, a senior information security consultant at F-Secure.

“Doing this, the JS would automatically take care of the submission for us.”
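
The approach Green describes might look roughly like the sketch below, which drives a headless browser so that the page's own JavaScript handles the submission. The selectors, the `answer_captcha` and `main` helpers, and the `solve` callable are hypothetical placeholders, not details of OWA or of CAPTCHA22; only the Pyppeteer calls (`launch`, `newPage`, `goto`, `querySelector`, `screenshot`, `type`, `click`) come from that library.

```python
import asyncio

# Hypothetical selectors; the real page structure is not given in the article.
CAPTCHA_IMAGE = "#captcha-image"
CAPTCHA_INPUT = "#captcha-input"
SUBMIT_BUTTON = "#submit"


async def answer_captcha(page, solve):
    """Screenshot the CAPTCHA, solve it, and type the answer as a user would.

    page is a pyppeteer Page; solve is any callable mapping PNG bytes to text.
    Returns the typed answer, or None if no CAPTCHA image is present.
    """
    image = await page.querySelector(CAPTCHA_IMAGE)
    if image is None:
        return None
    png_bytes = await image.screenshot()
    answer = solve(png_bytes)
    # Typing key by key lets the page's own JavaScript observe each keystroke.
    await page.type(CAPTCHA_INPUT, answer, {"delay": 50})
    await page.click(SUBMIT_BUTTON)
    return answer


async def main(url, solve):
    # Imported here so answer_captcha can be exercised without a browser installed.
    from pyppeteer import launch

    browser = await launch()
    try:
        page = await browser.newPage()
        await page.goto(url)
        return await answer_captcha(page, solve)
    finally:
        await browser.close()
```

Something like `asyncio.run(main(url, solve))` would run it end to end against a portal you are authorised to test. Passing `page` into the helper keeps the browser logic separate from the model, so the solver can be swapped without touching the automation.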

Green added: “We could use this simulation software to solve the CAPTCHA whenever it blocked entries and once solved, we could continue with our conventional attack, hence automating the process once again.

“We have now also refactored CAPTCHA22 for a public release.”

CAPTCHA the flag

CAPTCHAs are challenge-response tests that many websites use to distinguish genuine requests by human users to sign up to or access web services from automated requests by bots.

Spammers, for example, attempt to circumvent CAPTCHAs in order to create accounts they can later abuse to distribute junk mail.

CAPTCHAs are something of a magnet for cybercriminals and security researchers, with web admins struggling to stay one step ahead.

Late last year, for example, PortSwigger Web Security uncovered a security weakness in Google’s reCAPTCHA that allowed it to be partially bypassed by using Turbo Intruder, a research-focused Burp Suite extension, to trigger a race condition.

Soon after, a team of academics from the University of Maryland was able to circumvent Google’s reCAPTCHA v2’s anti-bot mechanism using a Python-based program called UnCaptcha, which could solve its audio challenges.

Green said: “There is a catch-22 between creating a CAPTCHA that is user friendly – grandma safe as we call it – and sufficiently complex to prevent solving through computers. At this point it seems as if the balance does not exist.”

Web admins shouldn’t, he says, “give away half the required information” through username enumeration, and users should be required to set strong passphrases that conform to NIST standards.
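
Both recommendations can be sketched briefly. The example below rejects passphrases that are shorter than the eight-character minimum in NIST SP 800-63B (revision 3) or that appear on a blocklist, and returns the same failure message whether the username is unknown or the password is wrong. The user store, the tiny blocklist, and the function names are hypothetical; a real deployment would use a large breached-password list and a proper credential store.

```python
import hashlib
import hmac
import os

MIN_LENGTH = 8  # minimum for user-chosen secrets in NIST SP 800-63B (rev. 3)
# Tiny stand-in for a real list of common or breached passwords.
BLOCKLIST = {"password", "12345678", "qwertyuiop"}
GENERIC_ERROR = "Invalid username or password."


def acceptable_passphrase(candidate):
    """Length and blocklist checks only; no composition rules."""
    return len(candidate) >= MIN_LENGTH and candidate.lower() not in BLOCKLIST


def _hash(password, salt):
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)


# Hypothetical in-memory user store: username -> (salt, password hash).
_salt = os.urandom(16)
USERS = {"alice": (_salt, _hash("correct horse battery staple", _salt))}
# Random salt and hash used for unknown usernames.
_DUMMY = (os.urandom(16), os.urandom(32))


def login(username, password):
    """Return (ok, message). Failures look identical for unknown users
    and for wrong passwords."""
    salt, stored = USERS.get(username, _DUMMY)
    # Hash even for unknown users so timing does not reveal which names exist.
    supplied = _hash(password, salt)
    ok = username in USERS and hmac.compare_digest(supplied, stored)
    return (True, "Welcome.") if ok else (False, GENERIC_ERROR)
```

With a single generic message, an attacker probing the form learns nothing about which usernames are valid and must guess both halves of the credential.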

And, he adds: “Accept that accounts can be breached, and therefore implement MFA [multi-factor authentication] as an additional barrier.”
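
One common second factor is a time-based one-time password (TOTP, RFC 6238), the scheme behind most authenticator apps. The sketch below is a minimal standard-library implementation offered as an illustration of how such a check works; the article does not say which MFA method Green has in mind, and the function names are assumptions.

```python
import hashlib
import hmac
import struct
import time


def hotp(secret, counter, digits=6):
    """HOTP value (RFC 4226) for a bytes secret and an integer counter."""
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation: low nibble of last byte
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret, at=None, step=30, digits=6):
    """TOTP value (RFC 6238): HOTP over the number of elapsed time steps."""
    now = time.time() if at is None else at
    return hotp(secret, int(now // step), digits)


def verify_totp(secret, submitted, at=None):
    """Constant-time comparison of the submitted code with the expected one."""
    return hmac.compare_digest(totp(secret, at), submitted)
```

A production check would usually also accept the adjacent time step to tolerate clock drift, and would rate-limit attempts so the six digits cannot simply be guessed.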
