Researchers of Google Hit with Failure in a Test in Thailand

Artificial intelligence is often hailed as a miracle worker in medicine, especially in screening, where machine learning models have demonstrated expert-level skill at detecting problems. But like many technologies, performing well in the lab is one thing; doing so in real life is quite another. That is what Google researchers learned in a humbling test in rural Thailand.

Google Health created a deep learning system that examines images of the eye for evidence of diabetic retinopathy, a leading cause of vision loss around the world. Despite high theoretical accuracy, however, the tool proved impractical in real-world testing, frustrating both nurses and patients with inconsistent results and a general lack of harmony with on-the-ground practices.

The lessons learned here were hard. It must be said at the outset, though, that performing this kind of testing is a responsible and necessary step, and it is commendable that Google published these less than favorable results publicly. Their documentation makes clear that the team has taken the findings to heart, even if the accompanying blog post presents a rather sunny interpretation of events.

The research paper describes the deployment of a tool meant to augment the existing process by which patients at several clinics in Thailand are screened for diabetic retinopathy (DR). In that process, nurses see diabetic patients one at a time and take images of their eyes (a "fundus photo"), then send the photos in batches to ophthalmologists, who evaluate them and return results, usually at least 4-5 weeks later because of high demand.

The Google System

The Google system was meant to provide ophthalmologist-like expertise in seconds. In internal tests, it identified degrees of DR with 90 percent accuracy. Nurses could then make a preliminary recommendation for a referral or further tests in minutes instead of a month, with an ophthalmologist verifying the automated decision within a week. It sounds great in theory.

However, that theory broke down as soon as the study's authors hit the ground. The study reports that they observed a high degree of variation in the eye-screening process across the eleven clinics in their research. While the methods of capturing and grading images were consistent across clinics, nurses had a significant degree of autonomy in how they organized the screening workflow, and different resources were available at each clinic.

The locations and settings where eye screenings took place also varied widely across clinics. Only some clinics had a dedicated screening room that could be darkened to ensure patients' pupils were large enough to take a high-quality fundus photo.

This variety of processes and conditions resulted in images being sent that did not meet the algorithm's high standards.

The deep learning system has stringent guidelines for the images it will assess. If an image has a somewhat dark or blurry area, for instance, the system will reject it, even in cases where it could still make a reliable prediction.
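To make the behavior concrete, here is a minimal sketch of the kind of hard quality gate described above. The function name, metrics, and thresholds are illustrative assumptions, not Google's actual criteria; the point is that a gate like this rejects borderline images outright rather than attempting a grade.

```python
def passes_quality_gate(mean_brightness: float, blur_score: float) -> bool:
    """Hypothetical pre-screening check: reject an image that is too dark
    or too blurry, regardless of whether a grade might still be possible.

    mean_brightness: average pixel intensity on a 0-255 scale (assumed metric)
    blur_score: 0.0 = perfectly sharp, higher = blurrier (assumed metric)
    """
    MIN_BRIGHTNESS = 40.0  # assumed cutoff for "too dark"
    MAX_BLUR = 0.35        # assumed cutoff for "too blurry"
    return mean_brightness >= MIN_BRIGHTNESS and blur_score <= MAX_BLUR


# A slightly dark but otherwise sharp, possibly gradable photo is rejected:
print(passes_quality_gate(mean_brightness=35.0, blur_score=0.10))   # False
# A bright, sharp photo passes the gate:
print(passes_quality_gate(mean_brightness=120.0, blur_score=0.10))  # True
```

A strict gate like this protects the model's accuracy numbers, but, as the clinics found, it shifts the cost onto nurses and patients whenever real-world lighting falls short of lab conditions.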