This is the third article of our blog post series about Deep Learning for mobile devices. The first post tackled some of the theoretical background of on-device machine learning, including quantization and state-of-the-art model architectures. The second explored how to do quantization-aware model training with the TensorFlow Object Detection API. In this article, we will describe how to convert a model to TensorFlow Lite and how to build an AI-powered mobile app by using the model in an Android application.
As we're going to be working with the model we trained in part 2 of this series, I recommend starting there and coming back afterwards if you haven't read it yet and are interested in reproducing our use-case. However, if you have a different use-case in mind or just want to find out how to create awesome apps enhanced with machine learning, you're very welcome to keep reading, as the steps described here will still apply.
The use-case we’re building is an app that – thanks to an object detector trained on the cars196 dataset – is able to continuously identify cars seen by the phone’s back camera in real time and display the most probable classifications as well as where they are situated on the screen. Cool, right? No more wondering what make and model those cars on the streets are. Let’s see how to do that!
Converting the model to TensorFlow Lite
In order to develop an AI-powered app, we decided to go for TensorFlow Lite, which allows the inference to happen on-device. Because all computation happens on the Android device itself, this avoids issues around latency, privacy, connectivity and power consumption. It also allows enhancing Android apps with machine learning models in an accessible way.
First, we need to convert the TensorFlow model we trained to the TensorFlow Lite format (an optimized FlatBuffer format identified by the .tflite file extension). Starting with a regular TensorFlow model and then converting it is always the way to go, since one cannot directly create or train a model using TensorFlow Lite.
To do that, we can use the TensorFlow Lite Converter in one of the following two ways:
- Using the Python API
- Via the Command line
The recommended way is using the Python API. It offers more features and makes it easier to convert models as part of the model development pipeline and to apply optimizations, such as post-training quantization and adding metadata. Therefore, we make use of the API in a short Python script like this:
import tensorflow as tf

# Convert the model
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
tflite_model = converter.convert()

# Save the model.
with open('model.tflite', 'wb') as f:
    f.write(tflite_model)
It's as simple as that. We create a converter, convert the saved model and save it as a file with the .tflite extension. Converting a Keras model is also possible using the converter.
Although TensorFlow Lite plans to provide high-performance on-device inference for any TensorFlow model, the interpreter currently supports only a limited subset of operators that have been optimized for on-device use. Keep that in mind if you're embedding a custom model other than the one we've trained in this series, one of the pre-trained models the TFLite team offers here, or one from the Hub. Some models will require additional steps to work with TensorFlow Lite. You can check which operators are available here: Operator compatibility.
The model we converted has the following input and four outputs:
- input: a float32 tensor of shape [1, 640, 640, 3]
- detection_boxes: a float32 tensor of shape [1, num_boxes, 4] with box locations
- detection_classes: a float32 tensor of shape [1, num_boxes] with class indices
- detection_scores: a float32 tensor of shape [1, num_boxes] with class scores
- num_boxes: a float32 tensor of size 1 containing the number of detected boxes
This is important to know, since we will be working with them in our application.
Model Metadata
The APIs are designed with general rather than model-specific methods so that they can be used for all kinds of model tasks. Therefore, metadata, a source of knowledge about what the model does and information about its inputs and outputs, is required to adapt the code to a specific model.
TensorFlow Lite metadata provides a standard for model descriptions. The metadata consists of both human-readable parts, which convey the best practice when using the model, and machine-readable parts that are leveraged by code generators, such as the TensorFlow Lite Android code generator and the Android Studio ML Binding feature. In fact, it is mandatory in order to be able to use the TensorFlow Lite Android code generator, as well as the ObjectDetector API from the Task Library.
Adding metadata to the model
In order to be able to add metadata to your model, you will need a Python programming environment setup for running TensorFlow. There is a detailed guide on how to set this up here.
After setting up the Python programming environment, install the tflite-support toolkit:
pip install tflite-support
There are three parts that need to be present in the model metadata:
- Model information – Overall description of the model as well as items such as license terms.
- Input information – Description of the inputs and pre-processing required such as normalization.
- Output information – Description of the output and post-processing required such as mapping to labels.
TensorFlow Lite metadata for inputs and outputs is not designed for specific model types but rather for input and output types, which must consist of the following or a combination of the following:
- Feature – Unsigned integer or float32.
- Image – Either an RGB or grayscale image.
- Bounding box – Rectangular shape bounding boxes. The schema supports several numbering schemes from which we will be using the Boundaries type. It represents the bounding box by a combination of boundaries in the form: {left, top, right, bottom}.
The associated files also need to be included. In this case, the labelmap.txt file, which contains the labels.
As this feature is relatively new, the official website only provides an example for populating metadata into Image Classification models. Here you can find the metadata writer Python script we used for our Object Detector, where we describe all information about the input and outputs and include the associated file.
In order to execute it, you need to pass the model, the labels file and the export directory as follows:
python metadata_writer_for_object_detector.py --model_file=detect.tflite --label_file=labelmap.txt --export_directory=export |
At the time of writing this post, a new Metadata Writer library is under development for image classifiers and object detectors as part of the tflite-support library. It aims at simplifying the process by means of a wrapper class, which will make the task much easier for developers since they will not need to code all of the above.
Once metadata has been successfully added, inferencing can be as easy as just a few lines of code, since the model then contains a rich description of how to use it.
Importing the model
To import the newly converted model in an Android Project in Android Studio we will use the new Android Studio ML Model Binding and import the TensorFlow Lite model through the graphical interface. Open your project or create a new one and import it by clicking on File, then New > Other > TensorFlow Lite Model. After selecting the location of your TFLite file, the tooling will automatically configure the module’s dependency with ML Model binding and all dependencies will be inserted into the Android module’s build.gradle file.
After the model has been successfully imported, you will see information about the model, such as the description and the shape of its inputs and outputs. Since our model contains metadata, sample code to execute the inference that we can simply copy/paste will additionally be provided there. For a model that doesn't yet contain metadata, you would see minimal information and standard sample code using TensorBuffers for the inputs and the outputs. Both variants are always available in Kotlin and Java (for those who still have good reasons to use Java in their Android apps 😉 ). You can see this info again anytime by opening the model file in Android Studio.
Note: the ML Model Binding is a new component and it requires Android Studio version 4.1 or above.
If you're using the TensorFlow Lite Task Library or the Interpreter API for the inference (more on that later), the way to import the model is to include it in your Android project under the assets folder.
Structure of the application
Our application is composed of an Activity, a ViewModel, where the recognition data will be stored as a LiveData list object, and a data model for the recognition item objects. Recognition item objects have fields for the label, the probability and the location of the bounding box. Additionally, we will also work with two helper classes for displaying the labels and the bounding boxes. The layout of the activity holds a PreviewView for the camera preview and an ImageView over it on which we will draw the results. The Activity will update the views whenever there is new data.
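As a minimal sketch, the Recognition data model could look like the following. The field and class names here are assumptions based on the description above; see the full source for the exact implementation:

// Sketch of the Recognition data model: label, probability and bounding box location
data class Recognition(
    val label: String,       // e.g. the detected car make and model
    val confidence: Float,   // probability between 0 and 1
    var location: RectF      // bounding box, later translated to view coordinates
)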
In this article, we will just cover the most relevant parts of the code but, if you’re interested, you can find the full code of the application here. Please keep in mind that all TensorFlow Lite libraries are still at a very early stage and are therefore subject to change. The code represents the current state as of the publication of this article.
These are the steps to follow to run the model and display meaningful information about the camera images for the user:
- Gather the data: capture the camera stream and pass the frames (we only keep the last image) to the analyzer, the function that performs the inference.
- Transform the data: since raw input data generally will not match the input data format expected by the model, some adjustments need to be made. For example, you might need to resize, crop or rotate an image or change the image format for it to be compatible with the input the model expects.
- Load the model: load the .tflite model into memory, which contains the model’s execution graph.
- Run the inference: use the TensorFlow Lite APIs to execute the model using the input and extract results of the prediction in form of outputs.
- Interpret the results: extract meaningful information from the results that is relevant to the application and apply transformations in case needed.
- Display the results: present the acquired information to the user.
Gathering the data
Although we could use static images, we want our app to recognize cars directly from the camera stream as we walk by them. Currently, the easiest way to capture the camera stream is to use the CameraX library. It is part of the Jetpack support library and offers an easy-to-use interface that makes working with the camera much easier than it was with camera2. Another great advantage is its lifecycle awareness.
We will be using both the preview use-case, to display a preview on the screen, and the image analysis use-case, to extract the image buffers and perform the analysis. We build both and attach them to the lifecycle of the activity, as sketched below. For the preview, we simply attach the output of the preview object to the PreviewView object from our layout. As for the image analysis, each image is provided to the analyze method of a class implementing ImageAnalysis.Analyzer, where it can access the image data via an ImageProxy.
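Here is a minimal sketch of how the two use-cases could be bound to the activity's lifecycle with CameraX. The names previewView, cameraExecutor and CarsAnalyzer are assumptions based on our setup; the full source contains the exact code:

val cameraProviderFuture = ProcessCameraProvider.getInstance(this)
cameraProviderFuture.addListener({
    val cameraProvider = cameraProviderFuture.get()

    // Preview use-case: render the camera feed into the PreviewView from the layout
    val preview = Preview.Builder().build().also {
        it.setSurfaceProvider(previewView.surfaceProvider)
    }

    // Image analysis use-case: keep only the latest frame and hand it to our analyzer
    val imageAnalysis = ImageAnalysis.Builder()
        .setBackpressureStrategy(ImageAnalysis.STRATEGY_KEEP_ONLY_LATEST)
        .build().also {
            it.setAnalyzer(cameraExecutor, CarsAnalyzer())
        }

    // Bind both use-cases to the activity lifecycle using the back camera
    cameraProvider.bindToLifecycle(
        this, CameraSelector.DEFAULT_BACK_CAMERA, preview, imageAnalysis
    )
}, ContextCompat.getMainExecutor(this))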
Transforming the data
We will avail ourselves of the TensorFlow Lite Android Support Library for transforming the data. It provides high-level APIs that help us transform raw input data into the form required by the model and interpret the model's output, thus reducing the amount of boilerplate code required. It supports common data formats for inputs and outputs, including images and arrays. It also provides pre- and post-processing units that perform tasks such as image resizing, rotating and cropping.
In our case, the size of the image used by the model is 640 x 640 pixels, as mentioned earlier.
Usually, the captured image will be larger than that; on newer devices it will likely be something like 1920 x 1080 pixels. That's why we need to resize. The question is, how do we fit a non-square image into a square? There are several ways to do that. We will go for centerCrop. In this operation, only the center is kept and the surroundings outside of this box are discarded. For that reason, our app will work best when the cars are centered in the image as much as possible, since objects too close to the edges will unfortunately fall outside the cropped image.
The next step is rotation. We need to take into account the current rotation of the device in order to know if and how much the image needs to be rotated to pass it to the model in the expected orientation. When the device is used in portrait mode, for example, the image will need to be rotated three times (to the right), whereas in landscape mode, it will depend on the direction.
We convert the input from the back camera to a bitmap using the following Kotlin extension (the image format used may vary among different devices):
fun Image.toBitmap(): Bitmap {
    val yBuffer = planes[0].buffer
    val vuBuffer = planes[2].buffer

    val ySize = yBuffer.remaining()
    val vuSize = vuBuffer.remaining()

    val nv21 = ByteArray(ySize + vuSize)
    yBuffer.get(nv21, 0, ySize)
    vuBuffer.get(nv21, ySize, vuSize)

    val yuvImage = YuvImage(nv21, ImageFormat.NV21, this.width, this.height, null)
    val out = ByteArrayOutputStream()
    yuvImage.compressToJpeg(Rect(0, 0, yuvImage.width, yuvImage.height), 50, out)
    val imageBytes = out.toByteArray()
    return BitmapFactory.decodeByteArray(imageBytes, 0, imageBytes.size)
}
This is the code to create an ImageProcessor, apply the operations we described to the converted bitmap, and feed the result into a TensorImage:
// Create an image processor
val imageProcessor = ImageProcessor.Builder()
    // Center crop the image
    .add(ResizeWithCropOrPadOp(HEIGHT, WIDTH))
    // Rotate
    .add(Rot90Op(calculateNecessaryRotation()))
    .build()

var tImage = TensorImage(DataType.UINT8)
tImage.load(imageProxy.image!!.toBitmap())
tImage = imageProcessor.process(tImage)
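The calculateNecessaryRotation() helper above is our own function. A minimal sketch of what it could look like, shown here as a parameterized variant that maps the target rotation reported by the ImageAnalysis use-case to a number of quarter-turns for Rot90Op. Only the portrait value (3) comes from the text above; the remaining values are assumptions and should be verified on your device:

// Sketch: map the device rotation to the number of 90° turns passed to Rot90Op.
fun calculateNecessaryRotation(targetRotation: Int): Int =
    when (targetRotation) {
        Surface.ROTATION_0 -> 3    // portrait: rotate three times
        Surface.ROTATION_90 -> 0   // landscape
        Surface.ROTATION_180 -> 1  // assumption
        Surface.ROTATION_270 -> 2  // assumption
        else -> 0
    }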
A tip that is always useful for troubleshooting: check the final image the model receives to ensure the applied operations were correct and the result is indeed what we expect, as this has a lot of potential to be the reason why the model returns wrong results.
Optionally, we extract the dominant color from the image using the Palette API to compute the most suitable color for displaying text over it, which we will use later.
val palette = Palette.from(bitmap).generate()
val color = palette.dominantSwatch?.bodyTextColor
Loading the model
After importing the model into the project as described above, we need just one line of code in our analyze method to load it into memory:
private val carsModel = CarsModel.newInstance(context)
The class from which we create a new instance is automatically generated by the TensorFlow Lite Android Wrapper Code Generator from the imported model. We supply the activity Context as argument.
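Optionally, the generated class can also be configured with a Model.Options object, for example to enable the GPU delegate or set the number of threads. This is a hedged sketch: it assumes the tensorflow-lite-gpu dependency listed in the next section is included, and the values are just examples:

// Sketch: create the generated wrapper with GPU acceleration and four threads
val options = Model.Options.Builder()
    .setDevice(Model.Device.GPU)
    .setNumThreads(4)
    .build()
private val carsModel = CarsModel.newInstance(context, options)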
Running the inference
When talking about inference in this context, we refer to the process of running a machine learning model in order to process the input data, which is the camera stream in our case, and make predictions, such as detecting, classifying and localizing the objects it’s been trained to recognize.
There are currently several ways to run the inference with Object Detection TensorFlow Lite models in Android:
- Using the recently released ML Model Binding in combination with the TensorFlow Lite Android Wrapper Code Generator, part of the TensorFlow Lite Android Support Library. With the Model Binding explained earlier, Android Studio automatically configures the project settings and generates wrapper classes based on the model metadata to enable the integration on Android. Hence, the implementation is very straightforward, as we will not need to interact directly with the internal ByteBuffers. The following dependencies are used:
  - implementation "org.tensorflow:tensorflow-lite-support:0.1.0"
  - implementation "org.tensorflow:tensorflow-lite-metadata:0.1.0"
  - implementation "org.tensorflow:tensorflow-lite-gpu:2.4.0" (optional, for accelerating model inference through the use of delegates and setting the number of threads)
Note: TensorFlow Lite wrapper code generator is in experimental (beta) phase.
Note: the ML Model Binding is a new component and it requires Android Studio version 4.1 or above.
- Using the ObjectDetector API from the TensorFlow Lite Task Library. It provides clean and easy-to-use model interfaces for popular machine learning tasks, and it also includes image processing functions and label map locale support. To use it in the app, use the AAR hosted at MavenCentral for the Task Vision library (a minimal sketch of this API is shown after this list). The following dependency is used:
  - implementation "org.tensorflow:tensorflow-lite-task-vision:0.2.0"
- Using the TensorFlow Lite Interpreter Java API. This option does not provide high-level methods like the previous ones, which is why it is recommended only if you are using a platform other than Android or iOS, or if you are already familiar with the TensorFlow Lite APIs; it is not a first choice for beginners. The following dependencies are used:
  - implementation "org.tensorflow:tensorflow-lite:2.5.0"
  - implementation "org.tensorflow:tensorflow-lite-metadata:0.2.0"
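For comparison, this is roughly what option 2, the ObjectDetector from the Task Library, could look like. We do not use this path in our app, and the option values shown (max results, score threshold) are just examples:

// Sketch of the Task Library ObjectDetector (option 2); not used in our app
val objectDetectorOptions = ObjectDetector.ObjectDetectorOptions.builder()
    .setMaxResults(5)
    .setScoreThreshold(0.4f)
    .build()
val detector = ObjectDetector.createFromFileAndOptions(context, "detect.tflite", objectDetectorOptions)
val results = detector.detect(TensorImage.fromBitmap(bitmap))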
Support by ML Model Binding for TensorFlow models in domains other than Image Classification and Style Transfer is, as mentioned earlier, currently still limited. However, it is advancing very rapidly, and Android Studio 4.2 will officially support Object Detection models. Thus, in this post we will focus on the first option and use ML Model Binding and the Wrapper Code Generator in our app.
Using the generated code, running the inference takes just one line of code in our analyzer method, with the TensorImage we created earlier as its argument:
val outputs = carsModel.process(tImage)
The variable outputs will contain the raw results, which need to be interpreted to extract meaningful information.
Interpreting the results
Number of boxes and detection scores
The number of boxes and the confidence score values can be used as exposed from the API, by extracting them from the TensorBuffers converted to arrays without further manipulation. For the number of boxes we get an integer value, and for the scores an array of floating point values between 0 and 1 representing the probability that a class was detected.
The higher the confidence, the more probable it is that the detected class is indeed right. Results with low confidence should mostly be ignored, as it is not probable that they represent the objects in the image. The most suitable value for this threshold will depend on the application. The threshold below which we ignore results in our application is 0.4.
val numBoxes = outputs.numberOfDetectionsAsTensorBuffer.intArray[0]
val detectionScores = outputs.scoreAsTensorBuffer.floatArray
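With these two values available, applying the 0.4 threshold mentioned above could be as simple as the following sketch (in our app the filtering happens while building the result list):

// Keep only the detections whose confidence reaches our 0.4 threshold
val confidentIndices = (0 until numBoxes).filter { detectionScores[it] >= 0.4f }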
Categories
The code generator is still under development in an experimental phase (note that the version numbers all start with 0) and, as explained in the release notes of Android Studio 4.1, the current implementation officially only supports image classification and style transfer models. This is probably why we experienced errors when trying to use the categoryAsCategoryList property suggested by the code generator, even with our model enhanced with metadata.
That's why, for the meantime, we came up with a workaround based on the util functions and a TensorLabel to map the category returned as an integer to its corresponding name, which is finally shown to the user graphically. This conversion works as follows, using the Android Support Library once again:
- We include the labels text file in the assets folder of the Android project.
- Load them into memory using the Util function and the activity context:
  var associatedAxisLabels: List<String> = FileUtil.loadLabels(context, "labelmap.txt")
- Create a TensorProcessor:
  val processor = TensorProcessor.Builder().build()
- Create a TensorBuffer of a fixed size equal to the number of labels in the file:
  val buffer = TensorBuffer.createFixedSize(intArrayOf(1, 196), DataType.UINT8)
- Finally, create a TensorLabel with those labels:
  val labels = TensorLabel(associatedAxisLabels, processor.process(buffer))
- Extract the detected categories from the TensorBuffer output the same way we did with the others and convert the object into an integer array:
  val detectionClasses = outputs.categoryAsTensorBuffer.intArray
- Use our array of N integers, each indicating the index of a class label from the labels file, to map the results and obtain the category labels as strings:
  labels.categoryList[detectionClasses[i]].label
Bounding Boxes
The output includes an array of bounding boxes. However, we can’t simply draw those on top of the preview, since that wouldn’t make much sense. They need to be interpreted and afterwards adapted before being shown to the user. Remember that the pixel values output by the model refer to the position in the cropped and scaled image, so we must undo those operations and then translate between the different coordinate systems to fit the preview image where they will be displayed. This means we need to do some math!
For each detected object, the model will return an array of four numbers representing a bounding rectangle that surrounds its position with the numbers ordered as follows: [ top, left, bottom, right ].
So, first of all, we extract this information from the float array returned from the inference and create a 2-dimensional array of numBoxes rows with four values each:
val boxes = outputs.locationAsTensorBuffer
val detectionBoxes = Array(numBoxes) { FloatArray(4) }
for (i in detectionBoxes.indices) {
    detectionBoxes[i] = boxes.floatArray.copyOfRange(
        4 * i,
        4 * i + 4
    )
}
The numbers are floating point values between 0 and 1 indicating the position in the processed image. So we multiply by the height and width, respectively, to obtain the position in that image in pixels and form a RectF object with them.
After that, we apply the inverse transformation of what we did in the image processing step earlier for each of the boxes, in order to translate them to the initial image captured with the Image Analysis use-case. Don’t forget the exact order (top, left, bottom, right), which is different from the order in the RectF constructor (left, top, right, bottom)!
This would be the code for those operations:
imageProcessor.inverseTransform(
    RectF(
        detectionBoxes[i][1] * WIDTH,
        detectionBoxes[i][0] * HEIGHT,
        detectionBoxes[i][3] * WIDTH,
        detectionBoxes[i][2] * HEIGHT
    ),
    imageProxy.height, imageProxy.width
)
Now, we have the exact position of the boxes in the initial image, which is what the analyzer function is in charge of computing. The rest of the transformations are part of displaying them.
Once we have all the information we need, we will store it as a list of Recognition item objects, which have fields for the label, the probability and the location of the bounding box.
for (i in 0 until MAX_RESULT_DISPLAY) {
    items.add(
        Recognition(
            labels.categoryList[detectionClasses[i]].label,
            detectionScores[i],
            imageProcessor.inverseTransform(
                RectF(
                    detectionBoxes[i][1] * WIDTH,
                    detectionBoxes[i][0] * HEIGHT,
                    detectionBoxes[i][3] * WIDTH,
                    detectionBoxes[i][2] * HEIGHT
                ),
                imageProxy.height, imageProxy.width
            )
        )
    )
}
These will be fed to a LiveData object in our ViewModel and the Activity will update the views with the results whenever new data is available.
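A minimal sketch of that ViewModel could look like this; the class and property names are assumptions, see the full source for the exact code:

class RecognitionListViewModel : ViewModel() {
    // Backing property so only the ViewModel can post new results
    private val _recognitionList = MutableLiveData<List<Recognition>>()
    val recognitionList: LiveData<List<Recognition>> = _recognitionList

    fun updateData(recognitions: List<Recognition>) {
        // postValue because the analyzer runs on a background executor
        _recognitionList.postValue(recognitions)
    }
}

In the Activity, we then observe recognitionList and redraw the boxes and labels whenever it changes.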
Displaying the results
Finally, we show the user the results graphically on top of the camera preview in the form of a bounding box for the recognized car and a label indicating the class and its confidence every time the values change.
As promised, there is still one last step to be done: to translate the boxes into yet another coordinate system, the one from our PreviewView. We do that by mapping the coordinates using a correction matrix we build with this method, which takes as arguments both the ImageProxy obtained and our PreviewView.
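As an illustration only, a deliberately simplified version of such a correction matrix, assuming the analysis image and the PreviewView share the same orientation and only differ in scale, could be built like this:

// Simplified sketch: scale from analysis-image coordinates to PreviewView coordinates.
// The real method also needs to account for rotation and the preview's scale type.
fun getCorrectionMatrix(imageProxy: ImageProxy, previewView: PreviewView): Matrix {
    val matrix = Matrix()
    matrix.setScale(
        previewView.width.toFloat() / imageProxy.width,
        previewView.height.toFloat() / imageProxy.height
    )
    return matrix
}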
Now, by means of the matrix, we can finally translate our RectF object. We also add some width compensation to the sides (orientation-aware) to our box to improve the fitting, since the aspect ratio of both systems differs, and they’re ready to be drawn onto our view using our utils class.
(...)
matrix.mapRect(i.location)
val widthCompensation = originalImageWidth - WIDTH
i.location = addCompensation(i.location, widthCompensation)
recognizedBoundingBox.drawRect(canvas, i.location)
recognizedLabelText.drawText(canvas, i.location.left, i.location.top, i.label, i.confidence)
And voilà! We built an Android application that is able to recognize cars from the smartphone’s camera stream by using a neural network we trained with the TensorFlow Object Detection API. I hope you enjoyed this article!