Enabling robots to perceive the world around us and interact with it as we do requires object tracking, i.e. the localization and classification of objects in their environment. This post summarizes the experience I gained while developing such an object detection and tracking system on an embedded device.
When inovex offered me the chance to work on an object tracking project, my first thoughts were full of excitement. The geek in me was only thinking about the interesting algorithms involved in the project. But on second thought, I was a bit concerned about how this technology might be misused; not at inovex for sure, but this blog post is available to everybody. Obviously, with an evil mind nearly every technology can be used in a malicious way. I believe that it is the responsibility of every (software) engineer to think about the implications of their programs and algorithms. So, brave software developer, when you read this, think about where your implementation will be used. Nevertheless, I am confident that you will use this technology to aid people in their daily lives rather than to harm them. With this out of the way, let’s start with the geeky part of this blog.
Nearly every sci-fi story includes autonomous robots that react to their environment much like geeks like you and me. For the robots of the present to behave like the ones shown in movies, they need real-time information about the objects in their environment. Sending the camera feed to a distant server farm over the internet and waiting for the results to come back is neither scalable nor feasible under a real-time constraint. Of course, this does not only apply to robots. Thinking of (semi-)autonomous factories, cameras could monitor the whole production line, sending only the necessary information about the status and position of the produced goods instead of streaming a video feed from each camera to a massive computation cluster.
In order to extract this information from a camera stream in real time on an edge device, one has to cope with two problems, or as we say, challenges. First, we want to know which object is where within the current camera frame. Second, we want to know where the objects from the last frame are in the current frame and whether objects have entered or left our field of view. I will tackle these “challenges” one by one in the next sections. Since I don’t consider my laptop an edge device, I used a Raspberry Pi 3 B+, a webcam and the Coral Edge TPU for prototyping.
The Detection Part
Since I have the Coral Edge TPU at hand, I will use a Convolutional Neural Network (CNN) for the detection part. If you are now thinking “What the hell is he talking about?”, I recommend reading our blog post on Deep Learning Fundamentals. Nevertheless, even for most master’s students in computer vision, these CNNs are magical black boxes, which you serve your image and pray that they show mercy and return your craved results. In our case that would be the location of the object(s) in the image, in the form of a hopefully not too arbitrary bounding box around each object, as well as a class like cat or dog assigned to it.
To get a feeling for how computationally expensive these magical black boxes are, I ran the Tiny YOLOv3 (just a fancy name for a specific architecture of a CNN) on the Raspberry Pi and it took around 30 seconds to compute the results of one frame.
Obviously, we are far away from real time. But thanks to Moore’s law there are now specialized hardware accelerators for neural networks, such as Tensor Processing Units (TPUs), that fit in your hand. A quick Google search shows that there are three major options: the NVIDIA Jetson, the Coral Edge TPU and the Intel Neural Compute Stick. It seemed to me that it would be a good idea to go with the Coral Edge TPU, because it supports TensorFlow Lite and comes from Google, which has already proven to have experience in AI; see AlphaGo for example.
Since I chose to use the Coral Edge TPU (a small gray box) to pack my magical black box in, the magical black box has to fulfill some requirements, which are stated here. Like all programmers, I’m quite a lazy person, so I decided to use one of the pre-compiled object detection CNNs provided on Coral’s website. That way I can avoid a lot of work on fixing possibly unsupported features.
Note: The Coral TPU does not support every feature of TensorFlow or even TensorFlow Lite. If an operation is not supported by the TPU, that part of the model has to run on the host, which can have a dramatic effect on the inference time (the total time needed to compute the result for a single input image).
Currently, there are three different object detection networks available on Coral’s website: MobileNet SSD v1 and v2 trained on the COCO dataset, as well as MobileNet SSD v2 trained on faces. So since v2 is always better than v1 and I want to detect objects, not only faces, I chose MobileNet SSD v2 trained on COCO. With this setting, the magical black box inside the gray box is now capable of detecting cats and dogs in the image (and another 88 objects as well).
Note: If you want to detect other classes or use this in any professional setting, I highly recommend that you fine-tune the model with your own data.
Due to execution speed and the limited computational power of the Raspberry Pi, I decided to program everything in C++. In retrospect, it was not the best idea to use the C++ TensorFlow Lite API. But since I’m a little bit stubborn, I managed to get everything up and running. If one of the readers of this blog post is as naive as I am, or for some reason has to use the C++ TensorFlow Lite API, here are the pitfalls which cost me most of my time and nerves. So check that
- you use at least TensorFlow version 1.14 (as stated in some forum entry). Neither the documentation nor the compiler will tell you that. Otherwise it will only fail at runtime, when allocating the tensors on the TPU. I used TensorFlow 2.0 for this project.
- you use the same version of Edge TPU API as stated in the documentation (there are two versions by now)
- you are in the plugdev user group to have the rights to read the USB device
- your input image is (this depends on the model used)
  - in RGB format
  - of type uint8
  - 300×300 pixels in size (unsqueezed)
- you copy the image to the input pointer
  uint8_t* input = interpreter->tensor(interpreter->inputs()[0])->data.uint8;
  std::memcpy(input, img_data.data(), img_data.size());
- you interpret the bounding box output in the order: y, x, height, width
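The image-format items in the checklist above can be sketched in a few lines. This is only an illustration, assuming the camera delivers an interleaved BGR frame that has already been resized to 300×300; the function name `bgrToRgbInput` and the constants are made up for this example and are not part of the TensorFlow Lite or Edge TPU API.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Input size expected by the model used in this post (300x300, 3 channels, uint8).
constexpr std::size_t kWidth = 300;
constexpr std::size_t kHeight = 300;
constexpr std::size_t kChannels = 3;

// Hypothetical helper: converts an interleaved BGR frame (assumed camera layout)
// into the interleaved RGB byte buffer that can be copied to the input tensor.
std::vector<std::uint8_t> bgrToRgbInput(const std::vector<std::uint8_t>& bgr) {
    if (bgr.size() != kWidth * kHeight * kChannels) {
        throw std::invalid_argument("frame must be 300x300 with 3 channels");
    }
    std::vector<std::uint8_t> rgb(bgr.size());
    for (std::size_t i = 0; i < bgr.size(); i += kChannels) {
        rgb[i] = bgr[i + 2];      // R comes from the third BGR byte
        rgb[i + 1] = bgr[i + 1];  // G stays in the middle
        rgb[i + 2] = bgr[i];      // B comes from the first BGR byte
    }
    return rgb;
}
```

The returned vector has exactly the size the model expects, so it can be handed to the memcpy call from the checklist as img_data.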
Once the TPU is running, the inference time drops from 30 seconds to around 100 milliseconds. This can be reduced even further, to around 14 milliseconds, when using a USB 3.0 port, which is not available on the Raspberry Pi 3, but is on my laptop and on the Raspberry Pi 4 (untested!). So for now, 100 milliseconds or 10 frames per second is something I can work with. Still, we are not at real time yet, since I consider real time to be 30 frames per second, which is the frame rate of the webcam. But how do we increase the frame rate of the detection to 30 fps without a USB 3.0 connection to the TPU? The simple answer is, we don’t. How we can achieve 30 fps without increasing the detection frame rate is part of the next section.
So the speed of the detection is fine now. What can we expect from our magical black box without fine-tuning it to our needs? Well, during development I tested it quite often with my own face, and the result depends heavily on lighting conditions as well as object poses and random noise. To spare you from seeing my distressed debugging face, I tested it with my afternoon snack for you.
As you can see, our magical black box does a good job for the caffeinated tea/soft drink, but fails hard when it comes to an apple or banana. Of course, if I had trained the magical black box with my own data of the apple and the banana, it would be much better. But for now I can stick to my caffeinated tea/soft drink and go on. For a real application, it should definitely be trained on a specialized dataset. But even with training on your own dataset, it’s not guaranteed that there will be no misclassifications. If you are interested in diving deeper into this topic, you can start with this blog post. Without proper training, however, it is also possible to get giant cell phones on a parking lot in front of the office, which serve as buttons for the largest keyboard I have ever seen.
As far as you can trust an educated guess from a random master’s student in computer science (i.e. me), I would say that there is no gigantic keyboard built with smartphones as its buttons. So I guess that this heavy misclassification comes from the viewing angle onto the cars, which might not be present in the COCO dataset. So if I forgot to mention it: if you want to use this technology in a professional setting, retrain the network with your own dataset!
Another issue with these detections is that, in the case of the bottle, the detector sometimes misses the object entirely. Thinking of the real world, it is quite unlikely that the bottle vanishes in one frame and reappears in the next. So we might want to stick with the last detection, or do even better if the object was detected in many previous frames. To do that we also need to know whether the bottle of the last frame is the same bottle as in the current frame. All these “challenges” will be tackled in the next part.
The Object Tracking Part
So for now, we have an object detector which provides us with the locations and classes of objects in nearly every third frame. Certainly one could stick with that and make decisions based on each frame independently. But wouldn’t it be much cooler if we extended our knowledge of the objects surrounding us to the next dimension? By that I mean the time dimension, not another depth dimension. Knowing that a person who you asked for their name a minute ago is the same person standing in front of you now might save you from looking like an idiot and asking for the name again.
To expand our system to the time dimension, we have to track our objects. Basically that’s a matching task from one detection to the next. As stated in the first part, we want to also filter noisy detections and hold on to objects which were present in the past but are not present in the current frame. In addition to this we’ll have to estimate where our objects are in the frames which are dropped by the magical black box due to the low detection rate.
Finally we want to scale well with the number of objects we track. Unlike the object detector which provides us with 0-20 detections in a single inference, each object has its own tracker assigned to it (in most cases). This means that the computation needed to calculate all trackers is somewhat linear with the number of objects currently tracked.
With these requirements at hand, I dived into the haystack of object tracking papers and looked for something that meets our requirements. Unfortunately, most papers about object tracking I came across made the implicit assumption that the world waits until their algorithm is ready for the next frame to calculate. Or in other words, they simply don’t care about the execution speed and process each frame of the video file. Certainly this does not represent the real world and skipping frames makes the tracking challenge more difficult. Additionally, most papers do not take the time the detection needs to process a frame into account, which introduces a time gap between the current tracking frame and the last detection. Diving deeper I found the needle or paper we were looking for: [SORT 2016].
In this paper, the authors proposed a tracking method and named it creatively: Simple Online and Realtime Tracking (SORT). As the name suggests, the method is quite simple. They split the tracking problem into an assignment problem and a filtering problem. For both problems well-established methods exist, so they only had to choose from them and stick them together.
In the interest of comprehensibility, I will start with the filtering problem and then come back to the assignment problem. So for now let’s assume that we know the assignment of each detection to its tracker. What’s left is to smooth out our noisy detections and estimate the location of the object between the detection frames and possibly further into the future. The control engineers amongst you might have guessed it: we are about to use a Kalman Filter.
You can imagine a Kalman Filter as a blind person that guesses the position and size of our object over time using the description of different people.
For those who want it a little more precise: A Kalman Filter keeps an internal state, let’s say the position and the velocity of the tracked object. Furthermore, it has a motion model, which describes how the object is moving through the environment. For example, a car has different movement constraints than a human. With that motion model, it can predict where the object will be, given the current state. As a simple example, I am at some starting point and walk straight ahead at one meter per second. So assuming I don’t change my direction and speed, every one of you can tell me where I will be in the future relative to my starting point. Since we don’t know the true state of ourselves or the object, we have to measure it.
Therefore we give the Kalman Filter our measurements as well as how much it can trust them, i.e. how noisy our measurements are. Now the only thing left for the filtering part is generating these measurements and defining a state for our object tracking purpose. The state in our case will be the center position of the object (x, y) as well as the width-to-height ratio and the area. You can find more details on that in the paper. Additionally, I calculated the velocity of the position and area from two detections and fed these into the measurements. After a few tests, I was convinced that calculating the growth of the area directly is a bad idea, since it gets added (or subtracted) linearly but the area grows non-linearly. So I tracked the squared area and calculated the squared growth of the area, which turns out to be fine [Kalman Filter 1960].
Once a new detection appears, we initialize a new Kalman Filter. All Kalman Filters can now predict where their objects should be according to their internal motion model and state. We use these predictions to assign the new detections to the trackers. To do this, an algorithm called the Hungarian Method is used. This algorithm calculates the best match according to some weight/cost between pairs. The authors of the SORT paper used the Intersection over Union (IoU) score for that. This score measures how much the detection overlaps with the prediction of the Kalman Filter. So one can say that the Hungarian Method maximizes the overlap of all detections with the Kalman Filters. Sidenote: the Hungarian Method actually minimizes the cost/scores, so we need to put in 1-IoU.
If a detection has no IoU score above a certain threshold, then it is considered a new object and a new Kalman Filter is initialized. Before we trust the new Kalman Filter, it has to get assignments within the next <insert a number which fits your needs> frames. If a Kalman Filter gets no detection assigned <insert a number which fits your needs> times in a row, then it is considered lost and is deleted. Since a Kalman Filter update/prediction is only a small number of relatively small matrix multiplications, its computation time is negligible. The Hungarian Method is also so efficient that I could run it for each tracked class separately, which simplifies the assignment problem even more.
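The bookkeeping described above could look roughly like this. It is one possible reading of the rules, not my original code: the struct and function names are hypothetical, and the default values of `minHits` and `maxMisses` are placeholders for the two numbers you have to choose yourself.

```cpp
#include <algorithm>
#include <vector>

// Per-object bookkeeping that would sit next to each Kalman Filter.
struct Track {
    int id = 0;
    int hits = 0;            // frames in which a detection was assigned
    int misses = 0;          // consecutive frames without an assignment
    bool confirmed = false;  // only confirmed tracks are trusted
};

// Placeholder thresholds; pick values that fit your needs.
struct LifecycleConfig {
    int minHits = 3;
    int maxMisses = 5;
};

// Call when the assignment step matched a detection to this track.
void markMatched(Track& t, const LifecycleConfig& cfg) {
    ++t.hits;
    t.misses = 0;
    if (t.hits >= cfg.minHits) t.confirmed = true;
}

// Call when no detection was assigned to this track in the current frame.
void markMissed(Track& t) {
    ++t.misses;
}

// Remove every track that went unmatched too many times in a row.
void pruneLost(std::vector<Track>& tracks, const LifecycleConfig& cfg) {
    tracks.erase(std::remove_if(tracks.begin(), tracks.end(),
                                [&cfg](const Track& t) {
                                    return t.misses >= cfg.maxMisses;
                                }),
                 tracks.end());
}
```

Running pruneLost once per frame after the assignment keeps the list of trackers small, which matters because the cost of prediction grows with the number of tracked objects.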
So now everything is set and we can run our object tracking algorithm on top of the object detection algorithm. In my experiments, I discovered that moving a caffeinated drink sideways quickly often results in a tracking loss. This happens because the bottle is thin and the assignment score is based on overlap. To counteract this, I used a distance metric instead of an IoU metric. But this introduces frequent ID switching when there are a lot of similar objects. Keep in mind that the Hungarian Method minimizes the total cost, so we also need a normalized score for the distance, the lower the better. So I calculated the distance and normalized it by the diameter of the image in pixels. And there you have it: a real-time object detection and tracking system on a Raspberry Pi.
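As a small sketch of that distance-based cost: the function below divides the Euclidean distance between two box centers by the image diagonal, which is how I read “diameter of the image” here. The names `Point` and `normalizedDistance` are made up for this example.

```cpp
#include <cmath>
#include <stdexcept>

struct Point { double x, y; };  // center of a bounding box, in pixels

// Distance between two centers divided by the image diagonal. For points
// inside the image the result lies in [0, 1], and lower means a better match.
double normalizedDistance(const Point& a, const Point& b,
                          double imageWidth, double imageHeight) {
    const double diagonal = std::hypot(imageWidth, imageHeight);
    if (diagonal <= 0.0) {
        throw std::invalid_argument("image size must be positive");
    }
    return std::hypot(a.x - b.x, a.y - b.y) / diagonal;
}
```

Because the value is already a cost, it can be put into the cost matrix of the Hungarian Method directly, without the 1-score flip that the IoU needs.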
So, brave reader, you’ve made it to the end. And you hopefully learned something new on the way here. Maybe you want to build your own more or less functional low cost real time object tracker and detector. Maybe you can afford a new Raspberry Pi 4 with USB 3.0 and enjoy the higher detection rate from that.
And if you want to build your own object tracking and detection system, you might want to put an additional tracker in between the detection and the Kalman Filter, such as a KCF, MOSSE or MedianFlow tracker. These make actual use of the information inside the image, which should help with switched assignments or the recovery of lost tracks. To learn more about our computer vision portfolio, head on over to our website!
- [Kalman Filter 1960] Kalman, Rudolph Emil (1960) “A New Approach to Linear Filtering and Prediction Problems” In: Transactions of the ASME, Journal of Basic Engineering, Volume 82, pp. 35-45
- [SORT 2016] Bewley et al. (2016) “Simple Online and Realtime Tracking” In: 2016 IEEE International Conference on Image Processing (ICIP), pp. 3464-3468