
How to Use Google’s ML Kit to Enhance Pepper With AI (Part 2)


Welcome to the second part of this blog series, “How to Use Google’s ML Kit to Enhance Pepper With AI”! In case you missed the first part, I recommend starting there for an introduction to what we’re building and how to get started.

In this article, we are going to look at something cool we can build by combining Google’s ML Kit with the humanoid social robot Pepper. Imagine you could ask the robot in natural language to go and pick something up for you. For some reason, the all-time favorite question about Pepper is “Can it bring me a coffee?” That is a considerably challenging endeavor that encompasses complex tasks in several areas of AI and robotics, and Pepper is not quite ready for it yet. We can, however, take a step in this direction by embedding image classification to recognize the objects around it or, better yet, object detection to identify the position of those objects in the room and point at them. Have a look at this video illustrating what that looks like:

ML Kit’s Object Detection API

With this demo, our idea is to use object detection to recognize the objects in an image together with their positions, so that Pepper can localize them in a room. Pepper should be able to answer the question of which objects it can see and even point at them when asked to.

To simply categorize everyday objects, the base model of the Image Labeling API returns pretty good results. It is a general-purpose classifier that can identify general objects, places, activities, animal species, and products out of more than 400 categories, and it takes approximately 200 ms for inference when run on Pepper. However, since we also want to know the position of the objects, Image Labeling is not enough. The Object Detection and Tracking API is the better fit.

Custom models

The API offers two modes that are optimized for two core use cases: tracking the most prominent object across images and detecting multiple objects in a static image. Although it can optionally classify the detected objects, the coarse base classifier used by default and trained by Google is not enough for our use case, as it only classifies into five broad categories: Place, Fashion good, Home good, Plant, and Food. As with Image Labeling, you can use the API either with the base models or with more targeted custom TensorFlow Lite models. These can be bundled with the app or downloaded from the cloud using Firebase. The APIs are compatible with a selection of pre-trained models published on TensorFlow Hub and with custom models trained with TensorFlow, AutoML Vision Edge, or TensorFlow Lite Model Maker, provided they meet certain requirements.

Google released a family of image classification models called EfficientNet in May 2019, which achieved state-of-the-art accuracy with an order of magnitude fewer computations and parameters. In 2020, it followed up with EfficientNet-Lite, which runs on TensorFlow Lite and is designed for performance on mobile CPU, GPU, and Edge TPU. It brings the power of EfficientNet to edge devices and comes in five variants, ranging from the low latency and small model size option (EfficientNet-Lite0) to the high accuracy option (EfficientNet-Lite4). The largest variant, the integer-only quantized EfficientNet-Lite4, achieves 80.4 % ImageNet top-1 accuracy. However, on Pepper’s processor, inference with this model takes around 15 seconds! Unfortunately, that is too long for any kind of interactive application, so we have to trade off accuracy and settle for the smaller, less accurate variants.

Even the Lite0 variant still has a higher latency (over one second) than previous mobile models such as MobileNet V2, which makes the interaction less fluid. That is why, although EfficientNet-Lite would also be a very good candidate for the job, our custom model for the Object Detector is an object labeler based on MobileNet V2, optimized for TFLite and trained by Google using quantization-aware training. It yields pretty good results in about 0.8 seconds. This model can be found on TensorFlow Hub as the “Google Mobile Object Labeler”.


Here you can find the full code of the application we are building throughout this series.

When this demo is selected, either via voice or via touch, the activity replaces the menu with this fragment. Its layout includes a PreviewView showing the camera image currently being processed, on top of which the predicted information is drawn, the home button to go back, and a button to repeat the rules. For our demo purposes, whenever this fragment is shown, the analyzer runs continuously in the background and updates the results on the screen in the form of bounding boxes and label text, even if no question was asked. Once the view is created, Pepper briefly explains how the demo works.

With regard to the architecture, we have a fragment that uses data binding to access the views, a ViewModel to store the data, and an Analyzer helper class for the recognizer.

How to build the model

In the onCreate method of our fragment, after inflating and initializing the views, we start by building the LocalModel we are going to use with the analyzer. So that it can be found, our custom TFLite model needs to be located in the assets folder of the project.

How to analyze the images with an object detector

Let’s start with our recognizer, which is quite simple. It takes the image, the model, and a callback, declared as a function type and passed as a lambda, that receives the list of detected objects as its parameter. That way, we are informed asynchronously when the results are ready.

We create a CustomObjectDetector with our LocalModel and, in the options, enable the detection of multiple objects and their classification. We also select SINGLE_IMAGE_MODE, which analyzes each image independently.

Preparing the input for our detector is also straightforward: the QiSDK action for taking a picture gives us the camera image as a bitmap, and converting a bitmap into the detector’s input format takes a single line of code.

On completion, we return the detected objects sorted by the confidence of their labels to be processed in the fragment.
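The sorting step can be illustrated with a small sketch in plain Java. The classes `SketchLabel` and `SketchObject` are hypothetical stand-ins for ML Kit’s detected objects and labels, introduced only so that the example runs without Android dependencies. Ordering each object by its most confident label is an assumption about what “sorted by the confidence of their labels” means here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an ML Kit label (text plus confidence).
final class SketchLabel {
    final String text;
    final float confidence;

    SketchLabel(String text, float confidence) {
        this.text = text;
        this.confidence = confidence;
    }
}

// Hypothetical stand-in for a detected object carrying its labels.
final class SketchObject {
    final List<SketchLabel> labels;

    SketchObject(SketchLabel... labels) {
        this.labels = Arrays.asList(labels);
    }

    // Highest confidence among this object's labels, or 0 if it has none.
    float bestConfidence() {
        float best = 0f;
        for (SketchLabel label : labels) {
            best = Math.max(best, label.confidence);
        }
        return best;
    }
}

final class ConfidenceSort {
    // Returns a new list ordered from the most to the least confident object.
    static List<SketchObject> sortByConfidence(List<SketchObject> detected) {
        List<SketchObject> sorted = new ArrayList<>(detected);
        sorted.sort((a, b) -> Float.compare(b.bestConfidence(), a.bestConfidence()));
        return sorted;
    }
}
```

Copying the list before sorting leaves the detector’s original result untouched, which is convenient if other observers read the same list.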

How to process and show the results

The process will work asynchronously in the following way: in the fragment, every time a new image is created, we will set it as the source in the preview and start the analyzer, passing the created local model and the image as arguments. We then observe the results of the analyzer in order to process them and present them to the user via tablet and voice.

For that purpose, we set up the image observer and the analyzer observer and take the first image. The results from the analyzer come as a list of objects of type DetectedObject (part of the ML Kit vision package), each of which contains a bounding box, a tracking ID, and the labels for that object. The labels, in turn, each have fields for the text, its confidence, and its index.

We save the labels to a chat variable so that they are available in the voice interaction. We also use the labels to update the results on the screen. Because the models return their labels in English, these may need to be translated depending on the language of the robot. If necessary, we use the TextTranslate API for that.

The next thing to do is to calculate roughly in which area of the image each object is situated. We do that by simply dividing the image into six areas and determining in which of them the center of the object lies, which we can derive from its bounding box. We then combine the labels, the areas, and the bounding boxes and, whenever the confidence is higher than the recommended threshold of 0.35, draw them with a helper class over the preview in our resultsView.
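As a minimal sketch of this calculation, the following plain Java class maps the center of a bounding box to one of six areas and applies the 0.35 threshold. The layout of three columns by two rows, the numbering from 1 to 6 running left to right and top to bottom, and all names used here are assumptions made for illustration. In the app, the coordinates would come from the bounding box of each detected object.

```java
final class GridArea {
    // Assumed layout: 3 columns x 2 rows, numbered 1..6 row by row.
    static final int COLUMNS = 3;
    static final int ROWS = 2;
    // Confidence threshold mentioned in the text.
    static final float MIN_CONFIDENCE = 0.35f;

    // Returns the area (1..6) that contains the center of the bounding box.
    static int areaOf(int left, int top, int right, int bottom,
                      int imageWidth, int imageHeight) {
        int centerX = (left + right) / 2;
        int centerY = (top + bottom) / 2;
        // Scale the center to a column and row index, clamped to the grid
        // in case the box touches or exceeds the image border.
        int column = clamp(centerX * COLUMNS / imageWidth, 0, COLUMNS - 1);
        int row = clamp(centerY * ROWS / imageHeight, 0, ROWS - 1);
        return row * COLUMNS + column + 1;
    }

    // Only results strictly above the threshold are drawn.
    static boolean shouldDisplay(float confidence) {
        return confidence > MIN_CONFIDENCE;
    }

    private static int clamp(int value, int min, int max) {
        return Math.max(min, Math.min(max, value));
    }
}
```

For a 640×480 image, a box in the top-left corner falls into area 1 and a box in the bottom-right corner into area 6.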

The results, including bounding boxes, labels, and confidence, drawn over the preview look something like this:

Robot Pepper with tablet on chest showing face recognition via squares around face

If you notice that the bounding boxes do not match the objects exactly and wonder whether this is an issue with the object detector: it is not. It is caused by Pepper’s constant, lively movement. Do not forget that it is a social robot imitating natural human movements. Sometimes it simply moves a little too fast before the image is updated with the new content, which causes these small differences.

Voice interaction: “Pepper, what do you see?”

Now, to the voice interaction part: when asked, we want Pepper to be able to say what is around it or which objects it can see. To make this robust, we include many different ways of phrasing the question. Whenever Pepper hears one of them, it checks the chat variable that holds the latest results and, if there is any content, says it out loud. If nothing was detected, it adapts its answer accordingly.
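The answer logic can be sketched in plain Java as follows. This is only an illustration of the branching described above, not the dialogue definition used in the app: the class name, the method, and the exact phrases are hypothetical, and the list of labels stands in for the content of the chat variable.

```java
import java.util.List;

final class SeenObjectsAnswer {
    // Builds the sentence Pepper could say for the latest labels.
    static String describe(List<String> labels) {
        // Nothing detected: adapt the answer instead of listing objects.
        if (labels == null || labels.isEmpty()) {
            return "I cannot recognize any object right now.";
        }
        if (labels.size() == 1) {
            return "I can see one object: " + labels.get(0) + ".";
        }
        // Join all labels with commas, and the last one with "and".
        String allButLast = String.join(", ", labels.subList(0, labels.size() - 1));
        String last = labels.get(labels.size() - 1);
        return "I can see these objects: " + allButLast + " and " + last + ".";
    }
}
```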

Voice interaction: “Pepper, where is it?”

In this demo, we also want Pepper to point to an object when asked about its location. As the SDK does not currently provide a method to point in a specific direction, we simplify the task and approximate it by defining areas from 1 to 6. We determine the area by dividing the image into six parts with a 3×2 grid and calculating to which part the center of each object belongs. For each of those areas, we programmed a short animation using the Animation Editor tool included in the plugin. The Editor lets you define a series of movements of all the robot’s movable parts and their positions over a period of time. In our case, we want Pepper to point to the requested area with either the right or the left arm.

Although the recognition runs continuously, the pointing needs to be triggered by the user asking where an object is to be found.

Once again, we make use of a bookmark to connect with the logic and reach the method in the fragment from its listener in the Activity. Using the variable, we check among the current results to find the object with the mentioned name and get its area.
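A minimal sketch of this lookup in plain Java could look like the following. The class `ObjectLocator` and its methods are hypothetical, and matching names case-insensitively after trimming is an assumption about how a spoken name might be compared with the detected labels.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class ObjectLocator {
    // Latest results: normalized label -> area (1..6).
    private final Map<String, Integer> areaByLabel = new LinkedHashMap<>();

    // Forget the previous results before storing those of a new image.
    void clear() {
        areaByLabel.clear();
    }

    // Remember in which area an object with this label was seen.
    void update(String label, int area) {
        areaByLabel.put(normalize(label), area);
    }

    // Returns the area of the named object, or -1 if it is not
    // among the current results.
    int areaOf(String spokenName) {
        Integer area = areaByLabel.get(normalize(spokenName));
        return area == null ? -1 : area;
    }

    private static String normalize(String name) {
        return name.trim().toLowerCase(Locale.ROOT);
    }
}
```

Returning -1 for an unknown name gives the caller a simple way to let Pepper answer that it cannot see the object instead of playing a pointing animation.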

The only thing left to do is to play the animation that points in that direction.

Conclusion and next steps

That is it! That is how we can make a robot point to a certain object identified by object detection. Building on this, the pointing can be made much more sophisticated and precise. Another possible use of object detection with the robot is to track an object. One could make Pepper follow a presented object with its head or even with its entire body by walking towards it, similar to how it follows humans. That would bring us a step closer to Pepper fetching a coffee, the wish of many 🙂

I hope you enjoyed this demo! Check out the other articles of this series, where we’re going to see more use cases and how to implement them in our ML-Kit-powered Android app for the Pepper robot!

  1. Introduction
  2. Demo with ML Kit’s Object Detection API (this article)
