CrowdFlower and Human in the Loop Machine Learning
Source – nanalyze.com
The Chinese are pretty much on a tear lately, having announced a few days ago that they plan to achieve global dominance in AI by 2030. The Chinese have traditionally been “not so very good” when it comes to translating English, but have recently committed to changing that, perhaps because they know it is an important step towards global domination in AI. The race to achieve technological dominance with AI will not be won by the country with the best algorithms but by the country with the best data. That’s why companies like Google have such a distinct advantage when it comes to developing their AI platform. It was back in 2013 that the old “98% of data was produced in the last 2 years” metric came out, and with the emergence of technology like IoT, which will use trillions of sensors, data is going to be produced even faster now. Much of this data is not structured in a consistent manner because there aren’t any widely adopted standards yet.
For all you original gangsters of data, you’ll know where we’re going when we start to talk about ETL, which stands for Extract, Transform, and Load. It’s something that comes up a lot when you’re working with structured data. Here is a very simple example:
In the above real-world example, we see that the ETL process didn’t just move the data, it extracted the location from the ID/NUM fields, changed the birthdays to a common format, standardized the gender field, and split out the first and last names from the “Location B” table. Nowadays, any basic ETL tool will automatically create simple mappings like these. For more advanced cases though, you can see how you would need a “human in the loop” in order to make decisions about how you might map these fields. A human needs to create the rules first, and then maybe AI can learn from the human over time. That human may not need to “stay in the loop” for too long before AI “frees them up to do more value added activities”. One leading “human in the loop” platform is CrowdFlower.
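The four transformations described above (pulling the location out of the ID/NUM field, putting birthdays into a common format, standardizing gender, and splitting a combined name) can be sketched in a few lines of Python. This is only a minimal illustration: the sample rows, field names, ID layout, and date formats are all hypothetical stand-ins, since the tables from the original example are not reproduced here.

```python
from datetime import datetime

# Hypothetical source rows; field names and values are assumptions for illustration.
LOCATION_A = [{"ID": "A-001", "first": "Ann", "last": "Lee",
               "dob": "03/14/1985", "gender": "F"}]
LOCATION_B = [{"NUM": "B-17", "name": "Bob Stone",
               "dob": "1990-07-02", "gender": "male"}]

# Rules a human would write first: known spellings of gender and known date layouts.
GENDER_MAP = {"f": "F", "female": "F", "m": "M", "male": "M"}
DATE_FORMATS = ("%m/%d/%Y", "%Y-%m-%d")


def normalize_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # unparseable value: a case to route to a human in the loop


def transform(row):
    """Map one source row (from either table) onto the common target layout."""
    key = row.get("ID") or row.get("NUM")
    location = key.split("-", 1)[0]  # assumes the location is the prefix before "-"
    if "name" in row:
        # Location B style: one combined name field, split on the first space
        first, _, last = row["name"].partition(" ")
    else:
        first, last = row["first"], row["last"]
    return {
        "location": location,
        "first_name": first,
        "last_name": last,
        "birthday": normalize_date(row["dob"]),
        "gender": GENDER_MAP.get(row["gender"].strip().lower()),
    }


def run_etl(*tables):
    """Extract rows from every table, transform them, and load them into one list."""
    return [transform(row) for table in tables for row in table]


if __name__ == "__main__":
    for record in run_etl(LOCATION_A, LOCATION_B):
        print(record)
```

Anything the rules cannot handle comes back as None rather than a guess, which is exactly the kind of record that would be handed to a person to resolve.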
Founded way back in 2007, San Francisco startup CrowdFlower has taken in $58 million in funding so far from a wide range of investors to develop their “essential human-in-the-loop AI platform for data science and machine learning teams“. Featured in the CB Insights list of 100 AI Companies to Watch, CrowdFlower’s “human in the loop” refers to a global crowdworking platform that is focused on cleansing big data for use in training AI algorithms. In order to understand how crowdWORKing works, it’s helpful to think about how crowdFUNDing works.
Crowdfunding is where a large number of people contribute a small amount of money to some project. The majority of these projects then fail to deliver on their promises (like 70%). Nobody raises too big of a stink about these failures because they contributed so little, and the platform is not that interested in fixing the problem because it gets paid either way. Crowdworking is a much better business model, where a large number of people each put in a small amount of work to solve a task, with a rigid structure of accountability in place to make sure everyone plays by the rules. You’ll recall our last article on Amazon’s Mechanical Turk, a platform that was perhaps the first example of crowdworking:
That platform was the first example of crowdworking but lacked the rigid structure needed to keep everyone in check (employee and employer). Advocates of Amazon’s crowdworking platform say that it provides employment for people who wouldn’t be employed otherwise. Detractors say that this is the ultimate slap in the face to the working man who is now expected to do a few minutes of work for pennies with not one benefit coming from the working relationship between employee and employer. The truth is probably somewhere in the middle.
Getting back to CrowdFlower, they started up just two years after Amazon’s Mechanical Turk came about. Their seed funding round in 2009 came from 13 different investors, one of whom was Travis Kalanick, the man who created the biggest startup in Silicon Valley, ever, Uber. The next interesting event took place in September of 2014, when Canvas Ventures led a Series C round. The following month, there were said to be massive layoffs, with up to 5 rounds of them over the years that followed (according to unverified Glassdoor reviews).
Fast forward to last month when CrowdFlower successfully closed their biggest round yet, $20 million, with participation from Microsoft (NASDAQ:MSFT) and Salesforce (NYSE:CRM). The startup is clearly establishing their focus in AI with the hiring of three AI experts announced just a few days ago, two of whom will be joining their Scientific Advisory Board. The other is their new “VP of Machine Learning”, a fellow by the name of Robert Munro who probably knows a thing or two about Mechanical Turk given that he was poached from Amazon. All this focus on AI can be explained by a recent statement from the CEO of CrowdFlower:
We’re at the beginning of a Cambrian explosion of AI applications within the enterprise and the bottleneck for the large-scale adoption of machine learning still remains the availability of high quality training data and human-in-the-loop workflows to handle the failure states of the algorithm.
So how does the CrowdFlower platform work? It’s that rigid structure which we mentioned earlier that makes sure everyone behaves. The data scientists who use the platform are required to provide a set of “test questions with answers” that CrowdFlower will then seed throughout a data project to check that workers are consistently honest. Not only that, but they can also handle more subjective questions by having multiple people answer the same question and then taking the opinion of the majority. Workers who at any time fail to meet the required quality are dropped. Workers on the platform are subjected to a form of “gamification” where they “level up” as you would in gaming. This creates an incentive for workers to always be on their best behavior, since one can assume that the higher-level workers get the better jobs.
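To make the two quality checks described above concrete, here is a rough Python sketch of seeded test questions plus majority voting. It is not CrowdFlower's actual implementation: the function names, the data layout, and the 70% accuracy threshold are all assumptions chosen for illustration.

```python
from collections import Counter


def worker_accuracy(answers, gold):
    """Share of seeded test questions a worker got right (None if they saw none)."""
    scored = [q for q in answers if q in gold]
    if not scored:
        return None
    correct = sum(1 for q in scored if answers[q] == gold[q])
    return correct / len(scored)


def trusted_workers(all_answers, gold, min_accuracy=0.7):
    """Keep only workers at or above the accuracy threshold (0.7 is an assumed value)."""
    trusted = set()
    for worker, answers in all_answers.items():
        accuracy = worker_accuracy(answers, gold)
        if accuracy is not None and accuracy >= min_accuracy:
            trusted.add(worker)
    return trusted


def majority_labels(all_answers, gold, min_accuracy=0.7):
    """Majority vote per real question, counting only trusted workers' answers."""
    votes = {}
    # sorted() keeps the iteration order, and so any tie-break, deterministic
    for worker in sorted(trusted_workers(all_answers, gold, min_accuracy)):
        for question, answer in all_answers[worker].items():
            if question in gold:
                continue  # test questions only grade workers; they are not output
            votes.setdefault(question, Counter())[answer] += 1
    # Ties go to whichever tied answer was counted first; real jobs might ask more workers.
    return {q: counts.most_common(1)[0][0] for q, counts in votes.items()}


if __name__ == "__main__":
    gold = {"g1": "cat", "g2": "dog"}  # hypothetical test questions with answers
    answers = {
        "w1": {"g1": "cat", "g2": "dog", "q1": "positive", "q2": "negative"},
        "w2": {"g1": "cat", "g2": "dog", "q1": "positive", "q2": "positive"},
        "w3": {"g1": "cat", "g2": "dog", "q1": "negative", "q2": "negative"},
        "w4": {"g1": "dog", "g2": "cat", "q1": "negative", "q2": "positive"},
    }
    print(majority_labels(answers, gold))
```

In this made-up data, worker w4 misses both test questions and is dropped, so their votes no longer count. Without that filter, question q1 would be a two-against-two tie.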
It’s not just your “level” that dictates what jobs you will perform but also your intelligence which is measured over time. CrowdFlower actually learns about their workers’ cognitive capabilities over time to see which users are the smartest and allocates tasks accordingly:
Unfortunately for some, it looks like this platform is moving towards the purest form of meritocracy, using nothing but skill to gauge performance. As the CEO says, “the perfect person for any unique job is statistically speaking, not going to be employed at your company“. That perfect person will definitely be found in the 2.5 billion people on this planet that are capable of doing digital tasks like this, only 5 million of whom are working at CrowdFlower.
The primary use case for CrowdFlower’s platform is something called “data wrangling”, a task that 67% of data scientists find the least interesting and most time consuming. Do you really want those extremely well paid and difficult-to-find data scientists that work for you doing data cleanup? CrowdFlower doesn’t think so.
They’re not alone though in their pursuit of capturing the world’s digital workforce. Researchers at the University of Texas at Austin published a paper that identified seven Mechanical Turk alternatives: ClickWorker, CloudFactory, CrowdComputing Systems, CrowdFlower, CrowdSource, MobileWorks, and oDesk. The paper is titled “Beyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms“, and it evaluates all seven of these platforms across the below variables:
While a comparative analysis of these 7 companies is outside the scope of this article, one thing is becoming very clear here. If AI is becoming the new electricity, data is becoming the new oil. There are loads of data science startups popping up, some of which may not even look like they’re data science companies. One example is Grammarly, a company that’s taken in $110 million for their “free spellchecker” which is now capturing ishtloads of data that it uses to create an algorithm that speaks perfect English, better than anyone ever possibly could.
Companies like CrowdFlower provide the “human in the loop” which trains our AI algorithms faster and makes them perform better. With a total addressable market (TAM) of around 100,000 data scientists working in 15,000 different companies, CrowdFlower has a great story to tell when they start thinking about an exit.