Welcome to the fourth part of this blog series “Enhancing Pepper robot with AI with Google’s ML Kit”! Now, we want to further enhance Pepper’s abilities by teaching it to read text aloud. The goal of this demo is to be able to ask Pepper what the text in front of it says, which it can then read aloud by means of OCR (Optical Character Recognition) through the ML Kit Text Recognition API.
In case you missed the previous parts, I recommend starting here for an introduction to what we are building in this series. In the previous articles, we explored ways to leverage ML Kit to have the Pepper robot recognize what’s being drawn or handwritten on its tablet and to recognize the position of the objects around it.
But before we start with the implementation of the new feature in our Android app, let’s look at an example video to see what we want to build.
Text Recognition
The ML Kit Text Recognition API can recognize text in images or videos. There are two versions of this API available. Version 1 is capable of recognizing text in any Latin-based character set, while version 2 recognizes text in any Chinese, Devanagari, Japanese, Korean, and Latin character set.
The Text Recognizer segments text into blocks, lines, elements, and symbols. For all detected text blocks, lines, and elements, the API returns the bounding boxes, corner points, rotation information, confidence score, recognized languages, and recognized text. The Text Recognition API uses an unbundled library that must be downloaded, either when the app is installed or when it is first launched. We opted to configure the app to automatically download the ML model to the device right after installation, so that it is immediately available to use, as explained in this guide.
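To give an idea of what that structure looks like in code, here is a small illustrative snippet (not part of our app) that walks through the hierarchy of a com.google.mlkit.vision.text.Text result and logs what it finds:

// Illustrative only: walking the block/line/element hierarchy of a recognition result
fun logTextStructure(visionText: Text) {
    for (block in visionText.textBlocks) {
        Timber.d("Block '${block.text}' at ${block.boundingBox}, language: ${block.recognizedLanguage}")
        for (line in block.lines) {
            Timber.d("  Line '${line.text}' with corner points ${line.cornerPoints?.joinToString()}")
            for (element in line.elements) {
                Timber.d("    Element '${element.text}' at ${element.boundingBox}")
            }
        }
    }
}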
We will base the recognition on images taken with Pepper’s head camera rather than the one mounted on the chest tablet, since the head camera offers the most appropriate angle for this task.
Implementation
Here you can find the full code of the application we are building throughout this series.
When this demo is selected, either via voice or via touch, the activity replaces the menu with this fragment. Its layout is very similar to that of Part 2, where we used the Object Detection API: it includes a PreviewView showing the camera image currently being processed, a TextView to show the predicted information, the home Button to go back, and a button to repeat the rules. We show a preview of what the camera is seeing so that we know exactly which image is being processed and can therefore better understand the obtained results. This feedback helps us get used to working with the demo, for example regarding how, at which height, and at which angle we need to hold the text for it to be recognized best; in a production application it might not be needed. Also, for our demo purposes, the recognizer runs continuously in the background whenever this fragment is shown, updating both the current image and the recognized text on the screen even if no question was asked.
With regards to the architecture, similarly to Part 2, we have a fragment where we use data binding to access the views, a ViewModel to store the data, and an Analyzer helper class that wraps the recognizer.
How to analyze the images
Let’s start with the analyzer class which, as you will see, is very straightforward, since the TextRecognition ML Kit library is very simple to use.
It takes the image and a lambda function that we use as a callback with the recognized text as its parameter. That way, we will be informed asynchronously in the ViewModel when the results are ready.
Preparing the input to our recognizer is also straightforward, since conversion from a bitmap, which is the format in which the QiSDK take-picture action returns the camera image, is done in one line of code. An InputImage does not necessarily need to be created from a bitmap but could also be created from other sources, such as images taken with the CameraX API, a file URI, a ByteBuffer, or a ByteArray, depending on what’s more convenient in each particular case.
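Just as an illustration (we don’t use these in the demo), here are some of the other factory methods InputImage offers; the parameter values are placeholders:

// Illustrative only: other ways to create an InputImage (parameters are placeholders)
fun alternativeInputImages(context: Context, uri: Uri, mediaImage: Image, bytes: ByteArray) {
    val fromFile = InputImage.fromFilePath(context, uri)        // e.g. an image picked from storage
    val fromMedia = InputImage.fromMediaImage(mediaImage, 0)    // e.g. a frame delivered by CameraX, with its rotation
    val fromBytes = InputImage.fromByteArray(bytes, 640, 480, 0, InputImage.IMAGE_FORMAT_NV21)
}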
We create an instance of TextRecognizer and then process the obtained InputImage with the recognizer client using a task that asynchronously returns the recognized Text. The listeners notify us when the task has either been completed successfully or failed with an exception and we pass those results back. In this simple case, we are only interested in the text extracted but the results also contain more information, such as the position and language of each text block, the lines it is composed of, and more, which might be useful for other applications.
class ImageTextAnalyzer {

    fun extractTextFromImage(input: Bitmap, completion: (String?) -> Unit) {
        val image = InputImage.fromBitmap(input, 0)
        val textRecognizer = TextRecognition.getClient(TextRecognizerOptions.DEFAULT_OPTIONS)

        // Extract the recognition results
        textRecognizer.process(image)
            .addOnSuccessListener { visionText ->
                Timber.i("ImageAnalyzer recognized text: ${visionText.text}")
                completion(visionText.text.replace("\n", " "))
            }
            .addOnFailureListener { e ->
                Timber.e("Error processing the image: $e")
                completion(null)
            }
    }
}
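In the ViewModel, using this analyzer then boils down to something like the sketch below. This is a simplified approximation rather than the exact code from the repository: the class name and the readText LiveData are stand-ins, and the “Nothing recognized” marker matches the check you will see in the fragment code further down.

// Simplified sketch of how a ViewModel can use the analyzer (names are approximate)
class ReadTextViewModel : ViewModel() {

    private val imageTextAnalyzer = ImageTextAnalyzer()
    private val readText = MutableLiveData<String>()

    fun getReadText(): LiveData<String> = readText

    fun extractTextFromImage(image: Bitmap) {
        imageTextAnalyzer.extractTextFromImage(image) { result ->
            // Publish the extracted text, or a marker value the fragment checks for
            readText.postValue(if (result.isNullOrBlank()) "Nothing recognized" else result)
        }
    }
}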
How to process and show the results
The process will run in an asynchronous manner by analyzing the images whenever they become available from the camera and then observing the asynchronous results of the analyzer in order to process them and present them to the user via tablet and voice.
Hence, after the UI initializations, we set an observer for the last image taken to become available. Capturing images using the QiSDK usually takes around 400ms. Once we are notified that a new image is ready, we update the preview with the Bitmap and start the analysis of the text in the image using ML Kit Text Recognition.
// This is triggered when a new image has been taken with the head camera
viewModel.getLastImage().observe(viewLifecycleOwner) { image ->
    Timber.d("Got new image")
    // Show the preview on the tablet
    try {
        requireActivity().runOnUiThread {
            binding.readingImageView.setImageBitmap(image)
        }
    } catch (e: Exception) {
        Timber.w("Could not show recognition results due to $e")
    }
    // Analyze the image to extract text
    viewModel.extractTextFromImage(image)
}
We then set another observer that is notified when the results of the analyzer, i.e. the text extracted from the last image, become available, and we start taking the first image asynchronously.
We update the results in the UI and also store them in a chat variable as soon as they are received, making them available in the dialog so that Pepper can always use the most recent results when needed in the voice interaction. Besides that, a new image is always taken directly after the previous one has been processed. In the ViewModel, the method takeImage makes a call to PepperActions which, in turn, uses the QiSDK to take a picture with Pepper’s head camera and returns it in bitmap format.
// Get the text from the last image, update UI and chat
viewModel.getReadText().observe(viewLifecycleOwner) { text ->
    if (text != "Nothing recognized") {
        Timber.i("Found this text in the image: $text")
        updateRecognizedText(text)
    } else {
        updateRecognizedText("")
    }
    // Take the next image
    this.viewModel.takeImage(mainViewModel.qiContext)
}
private fun updateRecognizedText(text: String) {
    try {
        requireActivity().runOnUiThread {
            binding.readingResultsTextView.text = text
        }
    } catch (e: Exception) {
        Timber.w("Could not show recognition results due to $e")
    }
    // Save the results in a variable for them to be available in the chat
    mainViewModel.setQiChatVariable(getString(R.string.recognizedText), text)
}
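For reference, here is a rough sketch of the remaining plumbing described above: how takeImage could obtain a bitmap from Pepper’s head camera via the QiSDK, and how setQiChatVariable could push a value into the chat. The names mirror the description in this article but are approximations, not the exact repository code; the blocking QiSDK calls are assumed to run on a background thread, and lastImage and qiChatbot are assumed to already exist in the respective ViewModels.

// Approximate sketch of the QiSDK plumbing (not the exact repo code)
object PepperActions {

    // Blocking call: takes a picture with Pepper's head camera and returns it as a Bitmap
    fun takeHeadPicture(qiContext: QiContext): Bitmap? {
        val takePicture = TakePictureBuilder.with(qiContext).build()
        val timestampedImage = takePicture.run()
        val buffer = timestampedImage.image.value.data
        buffer.rewind()
        val bytes = ByteArray(buffer.remaining())
        buffer.get(bytes)
        return BitmapFactory.decodeByteArray(bytes, 0, bytes.size)
    }
}

// In the demo's ViewModel: take the picture off the UI thread and publish it to the observers
private val lastImage = MutableLiveData<Bitmap>()

fun getLastImage(): LiveData<Bitmap> = lastImage

fun takeImage(qiContext: QiContext?) {
    qiContext ?: return
    viewModelScope.launch(Dispatchers.IO) {
        PepperActions.takeHeadPicture(qiContext)?.let { lastImage.postValue(it) }
    }
}

// In the main ViewModel: expose a value to QiChat under the given variable name
fun setQiChatVariable(name: String, value: String) {
    qiChatbot?.variable(name)?.async()?.setValue(value)
}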
Voice interaction: “Pepper, can you read this?”
Now, to the voice interaction part: When asked, we want Pepper to be able to read some text that is in front of it.
When using the QiChat language to program the interaction, it’s important to keep in mind the several ways in which users may pose the question, so as to cover all variations of it. To any of the variants, Pepper responds by checking the contents of the chat variable updated with the last results and, if text is available, reading it out. If nothing has been detected, it adapts its answer accordingly. Since the application is continuously taking images and the recognizer is always running, the content is updated often enough to always be up to date.
concept:(readthat) ["read [this that] {please}" "what does it say [there here]" "what's written [there here]" "can you {please} read [this that]"]

u:(~readthat) ["^exist(recognizedText) It says \pau=500\ $recognizedText" "Sorry, I don't see any text right now"]
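The chat itself is set up elsewhere in the app; as a rough reminder of how a topic like this can be wired up with the QiSDK (the resource name reading_topic is an assumption, not necessarily the one used in the repository):

// Rough sketch: building a Chat that runs a QiChat topic (resource name is an assumption)
val topic = TopicBuilder.with(qiContext)
    .withResource(R.raw.reading_topic)
    .build()

val qiChatbot = QiChatbotBuilder.with(qiContext)
    .withTopic(topic)
    .build()

val chat = ChatBuilder.with(qiContext)
    .withChatbot(qiChatbot)
    .build()

// Keep a reference to qiChatbot so chat variables such as recognizedText can be updated later
chat.async().run()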
Conclusion and applications
That is it! That’s how we can make a robot read text aloud by means of OCR using ML Kit.
This feature can be the core of a myriad of tasks in a social robot such as Pepper. It could, for instance, be of great help as a companion for the visually impaired, the elderly, young children, or any person needing assistance reading a sign, a document, a prescription, medication, or, in general, any machine-written text. It can also be used to identify products, as well as to look for specific data on their labels, such as the expiration date. It can furthermore be used by robots to identify people wearing a name tag, in order to know how to refer to them. Yet another use of being able to read signs would be to give robots an additional source of information for spatial navigation, helping them orient themselves in changing or unknown environments, among many other possibilities.
I hope you enjoyed the implementation of this demo! Check out the other articles of this series, where we look at more use cases and how to implement them in our ML Kit-powered Android app for the Pepper robot!
- Introduction
- Demo with ML Kit’s Object Detection API
- Demo with ML Kit’s Digital Ink Recognition API
- Demo with ML Kit’s Text Recognition API (this article)