Today, coinciding with the kickoff of the 2020 TensorFlow Developer Summit, Google published Objectron, a pipeline that spots objects in 2D images and estimates their poses and sizes using an AI model. The company says it has implications for augmented reality, image retrieval, self-driving vehicles, and robotics. For example, it could help a factory floor robot avoid obstacles in real time.
Tracking 3D objects is a tricky prospect, particularly when dealing with limited compute resources (like a smartphone system-on-chip). It becomes even harder when the only imagery available (typically video) is 2D, owing to a lack of data and the diversity of shapes and appearances of objects.
To address this, the Google team behind Objectron developed a set of tools that allowed annotators to label 3D bounding boxes (i.e., rectangular borders) for objects using a split-screen view that displayed 2D video frames. The 3D bounding boxes were overlaid atop the frames alongside detected planes, camera positions, and point clouds. Annotators drew 3D bounding boxes in the 3D view and verified their locations by reviewing the projections in the 2D video frames. For static objects, they only had to annotate the target object in a single frame; the tool then propagated the object's location to all frames using ground truth camera pose information from AR session data.
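The propagation step can be illustrated with a minimal sketch. It is not Google's implementation; it assumes a simple pinhole camera, an axis-aligned box, and per-frame world-to-camera poses given as a rotation matrix and a translation vector. All function names and numbers are hypothetical and chosen only for illustration.

```python
from itertools import product


def box_corners(center, size):
    """Return the 8 corners of an axis-aligned 3D box in world coordinates."""
    cx, cy, cz = center
    sx, sy, sz = size
    return [
        (cx + dx * sx / 2, cy + dy * sy / 2, cz + dz * sz / 2)
        for dx, dy, dz in product((-1, 1), repeat=3)
    ]


def world_to_camera(point, rotation, translation):
    """Apply a world-to-camera pose: p_cam = R * p_world + t."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def project(point_cam, focal, principal):
    """Pinhole projection of a camera-space point (z > 0) to pixel coordinates."""
    x, y, z = point_cam
    return (focal * x / z + principal[0], focal * y / z + principal[1])


def propagate_box(center, size, poses, focal, principal):
    """Project one annotated box into every frame: 8 pixel corners per frame."""
    corners = box_corners(center, size)
    return [
        [project(world_to_camera(c, rot, trans), focal, principal) for c in corners]
        for rot, trans in poses
    ]


# Hypothetical scene: a 1 m cube centered 4 m in front of the first camera.
IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
poses = [
    (IDENTITY, (0.0, 0.0, 0.0)),   # frame 0: camera at the world origin
    (IDENTITY, (-0.5, 0.0, 0.0)),  # frame 1: camera moved 0.5 m to the right
]
frames = propagate_box(
    (0.0, 0.0, 4.0), (1.0, 1.0, 1.0), poses, focal=500.0, principal=(320.0, 240.0)
)
```

Because the object is static, the box is defined once in world coordinates and only the camera pose changes from frame to frame, which is why a single annotation is enough. In this example, every corner shifts left in frame 1 because the camera moved to the right.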
To supplement the real-world data and boost the accuracy of the AI model's predictions, the team developed an engine that placed virtual objects into scenes containing AR session data. It used detected planar surfaces, camera poses, and estimated lighting to generate physically plausible placements with lighting that matches the scene. This resulted in high-quality synthetic data with rendered objects that respected the scene geometry and fit seamlessly into real backgrounds. In validation tests, accuracy increased by about 10% with the synthetic data.
Better still, the team says the current version of the Objectron model is lightweight enough to run in real time on flagship mobile devices. It can process around 25 frames per second on the Adreno 650 mobile graphics chip found in phones like the Sony Xperia 1 II, LG V60 ThinQ, and Samsung Galaxy S20+. Objectron is available in MediaPipe, a framework for building cross-platform AI pipelines consisting of fast inference and media processing (such as video decoding). Models trained to recognize chairs and shoes are available, as well as an end-to-end demo app.
In the future, the team plans to share additional solutions with the research and development community to stimulate new research efforts, applications, and use cases. It also intends to scale the Objectron model to more categories of objects and to further improve its on-device performance. MediaPipe, which Google open-sourced roughly a year ago, offers a quick and dirty way to perform hair segmentation, hand tracking, multi-hand tracking, face detection, object detection, and other such tasks in a modular fashion.
Let's see whether Google succeeds.