
Edge AI: From Idea to Production With NVIDIA DeepStream


Do you have an amazing idea for an Edge AI application and a model ready to go? Great! But what comes next? How do you turn it into a real product? We’ve created an AI-based live video processing application that runs on NVIDIA Jetson and uses NVIDIA’s DeepStream framework.

We use our application as a real-world example to discuss various aspects of the application and product development. We’ll start with the basics such as framework, tooling, and hardware considerations. Then we’ll move on to testing and performance enhancements. Finally, we’ll round off by discussing the process and benefits of setting up our own Yocto-based OS, underlining the significant value this step brings to the overall product.

Join us as we dive into the process of bringing an Edge AI application to life.

Every product begins with an idea

We started off with our initial product idea: a smart AI-based camera designed to label the body poses of individuals in a live video stream, processing all data on the device. Much like audio subtitles in videos and movies, it aims to describe the actions of the visually detected person in the stream.

In our initial implementation, our primary focus is on detecting basic body poses. We envision expanding this foundation in future development by incorporating additional features, such as generating descriptive sentences that can be displayed as subtitles or read aloud by a natural speech AI. This expansion aims to assist the visually impaired in comprehending the content of the video.

Our focus is not on the development of the AI model itself. Designing architectures for AI models, creating training data sets, training and validation are highly specific to the designated task of the model. Therefore, we tap into the NVIDIA NGC repository for its pre-trained models, streamlining our process by skipping straight to application development.
This approach is suitable for demonstrations. For deployment in production environments, one could either use models for which NVIDIA provides enterprise support, or create and train custom models.

With no overly restrictive hardware requirements, we decided to run our application on an NVIDIA Jetson Orin NX with 8 GB of RAM on a reComputer J401 carrier board.

Picking the perfect tools: selection and reasoning

For rapid development, we utilize the NVIDIA DeepStream framework, which is a set of building blocks for AI video processing applications based on the open-source GStreamer framework. GStreamer operates on a pipeline principle, where media is passed through a series of modular elements — each performing a specific function, such as decoding, filtering, encoding, or streaming. These elements can be linked together in virtually endless combinations, allowing for highly customizable and flexible media handling. GStreamer’s plugin-based architecture enables developers to extend its capabilities and adapt it to various use cases, with DeepStream providing a set of such plugins that add AI inference capabilities into GStreamer pipelines.
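
To make the pipeline principle concrete, here is a minimal sketch that assembles a pipeline description in the textual syntax understood by gst-launch-1.0 and gst_parse_launch, where elements are linked with "!". The helper names element and build_pipeline are our own illustration and not part of GStreamer. The example uses generic GStreamer elements rather than DeepStream ones, so it is not the pipeline of our application.

```python
def element(name, **props):
    """Render one element and its properties, e.g. 'videotestsrc num-buffers=90'."""
    rendered = [name]
    for key, value in props.items():
        # GStreamer property names use dashes, Python keyword arguments cannot.
        rendered.append(f"{key.replace('_', '-')}={value}")
    return " ".join(rendered)


def build_pipeline(*elements):
    """Link the given elements in order using the '!' separator."""
    return " ! ".join(elements)


description = build_pipeline(
    element("videotestsrc", num_buffers=90),  # synthetic test video source
    element("videoconvert"),                  # colorspace conversion
    element("autovideosink", sync="true"),    # display sink
)
print(description)
```

The resulting string can be passed to gst-launch-1.0 on a machine with GStreamer installed. In a DeepStream application, inference and tracking elements would sit between source and sink in the same way.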

The NVIDIA DeepStream framework offers two modes of operation: a user-friendly Graph Composer for visually constructing your video pipeline with a flowchart-like interface, and a more traditional programming approach using its Python or C++ APIs in conjunction with GStreamer’s C API. While the Graph Composer is intuitive for smaller projects, it becomes less practical for larger, more complex applications: extensive diagrams with a multitude of nodes and connections are hard to navigate and manage. Because the graphical method falls short in large-scale, real-world scenarios, we opted to develop our application in C++. We used one of the existing DeepStream demo applications as a starting point, which gave us a basic framework to build on and tailor to our needs.

CI/CD & toolchains

When opting for a traditional development approach without the use of graphical low-code methods, it’s essential to consider the necessary tools for success.

Our objective is to work within a modern development environment. This environment should include a modern build system with fast incremental builds. We prioritize off-device development capabilities and seamless CI/CD integration. Ensuring reproducible build results is essential. We also aim for a stable build environment that all developers can share. This consistency extends to our CI systems for uniform outcomes.

To meet these requirements, we established our own CMake-based build infrastructure and devised a Build Environment Container that includes a cross-compilation toolchain. Although NVIDIA currently offers multiple containers for development purposes, we cannot use those off-the-shelf containers for our purpose. The Docker containers for running graphical tools like TAO and Graph Composer require NVIDIA GPUs to be passed through into the Docker environment. This is not feasible for developers whose working machines have no such GPU, and it is often not feasible for CI servers either. Additionally, we have not come across a readily available cross-compilation container that enables compilation for ARM-based NVIDIA Jetson Orin boards on x86-64 Linux hosts.

Using our build Docker container, we configured a GitLab CI/CD pipeline to build and validate newly pushed code within our Git repositories. This approach allows us to follow a modern, efficient, and reliable development process.


The Orin NX System-on-Module (SOM) features a 6-core ARM Cortex-A78AE CPU with the ARMv8.2-A 64-bit instruction set, as well as a 1024-core NVIDIA Ampere architecture GPU with 32 Tensor Cores capable of achieving an AI performance of up to 70 trillion operations per second (TOPS). Additionally, it includes an NVIDIA Deep Learning Accelerator (NVDLA) engine to which deep learning inference workloads can be offloaded.

Our chosen development kit, the reComputer J4011, includes a carrier board that hosts the NVIDIA Jetson Orin NX 8GB SOM and provides access to its peripherals through the SOM’s 260-pin SO-DIMM connector. The carrier board offers various peripheral connectors, such as 2 out of 4 possible CSI camera channels, USB, HDMI, Network, GPIOs, serial communication interfaces, and more.

While this development kit is well-suited for initial prototyping of Edge AI applications, it is not suitable for production. Various vendors offer Jetson modules in systems designed for different deployment conditions. Depending on the requirements and scale of your use case, a custom design built around one of the Jetson SoMs or SoCs may also be appropriate.


For our demo setup, we chose the Raspberry Pi High-Quality camera due to its availability, low cost, changeable lens, and out-of-the-box compatibility with the Orin NX developer kit. However, when considering a camera for actual production deployment, we would select a camera module based on the specific needs of the project. Factors such as the sensor quality, resilience to dust and moisture, as well as the preferred method of connection—whether it’s direct CSI linkage or via RTSP streaming across Ethernet—will influence the choice of camera module.

With the camera we chose, the out-of-the-box compatibility of its IMX477 sensor allowed us to enable the camera simply by using the “jetson-io” tool provided by Jetson Linux to select the attached camera device. In the case of our development board, this level of support is provided only for IMX477 and IMX219 image sensors. For other cameras or platforms, you will need to carefully evaluate whether the sensor is supported by the linux-tegra kernel and provide your integration in the form of a device tree overlay.

Edge AI application development with NVIDIA DeepStream

NVIDIA’s model catalog has a few ready-to-use options. For example, ActionRecognitionNet can recognize actions like bike riding. However, given our limited space for testing and our preference to use a model without additional training, the actions ActionRecognitionNet recognizes by default are impractical for us. In contrast, PoseClassificationNet is more suitable as it specializes in identifying body poses. It does, however, require some pre-processing, since it analyzes the positions of joints in 3D space over time rather than working directly with image data. Nevertheless, PoseClassificationNet is the best choice for our needs, so we decided to build our demo based on it, resulting in a pipeline using three models.

Step-by-Step: video processing pipeline

Functional view of our Edge AI Application

In the first step, the application needs to be able to identify and select the correct objects (persons, in our case) in the video. For this, the aptly named PeopleNet is used.
From a single image, it can be impossible to tell if a person is standing still or walking, sitting or in the process of standing up, etc. Therefore, we need to be able to tell how a person’s pose changes over time. When processing multiple persons at the same time in the same video stream, we also need to tell them apart and track their movements so that poses can be classified individually. This tracking of moving objects across multiple video frames is performed by a so-called tracker, implemented by the “nvtracker” DeepStream element.
Finally, for each tracked person, the action performed has to be classified. This happens in two stages: First, the “skeleton” of the person is determined by the BodyPose3DNet model. The detected poses, encoded as joint positions moving through 3D space over a certain time, are then passed to the PoseClassificationNet model. We developed a custom plugin in C++ to transform the BodyPose3DNet output into a format that can be fed into PoseClassificationNet. PoseClassificationNet then determines the pose. In the demo app, the six poses “sitting”, “standing”, “walking”, “jumping”, “getting up” and “sitting down” are supported.
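
Since poses are classified over time and per tracked person, some component has to collect joint positions per track ID across frames. The following is a minimal sketch of such a sliding window, written in Python for brevity (our plugin is in C++). The class name PoseHistory, the window length, and the joint format are hypothetical and chosen for illustration only.

```python
from collections import defaultdict, deque


class PoseHistory:
    """Keeps the most recent frames of joint positions for each tracked person."""

    def __init__(self, window_frames):
        # One bounded queue per track ID; old frames fall out automatically.
        self._tracks = defaultdict(lambda: deque(maxlen=window_frames))

    def add(self, track_id, joints):
        """Store one frame of joints, given as a list of (x, y, z) tuples."""
        self._tracks[track_id].append(list(joints))

    def window(self, track_id):
        """Return the stored frames for a track, oldest first (empty if unknown)."""
        return list(self._tracks.get(track_id, ()))

    def drop(self, track_id):
        """Forget a track, e.g. when the tracker reports that the person left."""
        self._tracks.pop(track_id, None)


history = PoseHistory(window_frames=3)
for frame_index in range(5):
    history.add(track_id=7, joints=[(float(frame_index), 0.0, 0.0)])
history.add(track_id=9, joints=[(100.0, 0.0, 0.0)])

print(len(history.window(7)))  # only the last 3 of 5 frames are kept
print(len(history.window(9)))
```

Keying the history by the tracker’s ID is what keeps the poses of several persons in the same stream from being mixed up.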

Challenges in development

The most significant issue we encountered stemmed from the custom plugin, which initially had multiple bugs related to the data layout and copying to CUDA memory for the PoseClassificationNet input tensors. It took some time to locate and fix these issues because the AI model acts as a “black box”. Since we get no error messages, only incorrect classification results, it is hard to determine whether an issue is due to bad input data, incorrect processing in the model, or incorrect interpretation of the output tensors. This shows that solid testing (e.g. with unit tests) of the non-AI components of an application is important. Overall, developing with DeepStream in C++ was a satisfactory experience, although the documentation could be better: for some parts, we had to read the source code to fully understand plugin behavior.
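
To illustrate the kind of data layout involved, here is a minimal sketch that packs a sequence of per-frame joint positions into a flat, channel-major buffer. It is written in Python for brevity and is not our plugin code. The (channels, frames, joints) ordering and the zero-padding of missing frames are assumptions for illustration. The actual dimensions and ordering must be taken from the model card of the model in use.

```python
def pack_sequence(frames, num_frames, num_joints, num_channels=3):
    """Pack frames[t][v] = (x, y, z) into a flat buffer in (C, T, V) order.

    Element (c, t, v) lands at index (c * num_frames + t) * num_joints + v.
    Frames that are not yet available stay zero (assumed padding strategy).
    """
    if len(frames) > num_frames:
        raise ValueError("more frames than the window holds")
    buffer = [0.0] * (num_channels * num_frames * num_joints)
    for t, joints in enumerate(frames):
        if len(joints) != num_joints:
            raise ValueError(f"frame {t}: expected {num_joints} joints, got {len(joints)}")
        for v, coords in enumerate(joints):
            for c in range(num_channels):
                buffer[(c * num_frames + t) * num_joints + v] = float(coords[c])
    return buffer


# Two frames with two joints each, packed into a window of three frames.
frames = [
    [(1, 2, 3), (4, 5, 6)],
    [(7, 8, 9), (10, 11, 12)],
]
packed = pack_sequence(frames, num_frames=3, num_joints=2)
print(packed[:6])  # all x coordinates: frame 0, frame 1, then the padded frame
```

A small function like this can be checked with hand-computed expectations, which is far easier than diagnosing a wrong index from the classification results alone.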

Testing AI applications

Testing Edge AI applications presents its own unique set of challenges. Without preliminary filtering or other countermeasures, AI models will inevitably produce output for any input, including fundamentally incorrect or erroneous inputs – without any indication of faulty or unlikely inputs.
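
One possible countermeasure is a plausibility check on the data before it reaches the model. The sketch below rejects skeleton frames with a wrong joint count, non-finite values, or coordinates outside a configured range. The function name and the thresholds are hypothetical, and such a filter only catches obviously broken inputs, not subtly wrong ones.

```python
import math


def is_plausible_skeleton(joints, expected_joints, max_abs_coord):
    """Return True if one frame of (x, y, z) joints looks usable as model input."""
    if len(joints) != expected_joints:
        return False
    for coords in joints:
        if len(coords) != 3:
            return False
        for value in coords:
            # NaN and infinity would silently propagate through the model.
            if not math.isfinite(value) or abs(value) > max_abs_coord:
                return False
    return True


good = [(0.0, 1.0, 2.0), (0.5, 0.5, 0.5)]
broken = [(0.0, float("nan"), 2.0), (0.5, 0.5, 0.5)]

print(is_plausible_skeleton(good, expected_joints=2, max_abs_coord=10.0))
print(is_plausible_skeleton(broken, expected_joints=2, max_abs_coord=10.0))
```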

Because there is no direct, predictable way to trace how a model arrives at its conclusion for a given input, testing AI applications is difficult. As a result, testing must occur on multiple levels. For instance, in our example, the correctness of the model inputs can be ensured through unit testing. GStreamer supports us here.

The fact that the models yield deterministic results given the same inputs and preconditions aids in testing. However, minor variations in initial conditions can lead to significantly different results. Does this remind anyone else of chaos theory?! 😉

Typically, when testing video-analysis AI applications at the component or integration level, a specified and manually labeled set of test videos is used, which were intentionally not included in the original model training data set.
Instead, we opted for a simpler approach due to time constraints for our small demonstration project. We used pre-recorded videos and added a file export function for classification results to the program. This allowed us to estimate the impact of changes in code or configuration, although we were unable to measure it quantitatively. Testing at the component or integration level was still possible using this method.
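
As an illustration of this approach, the following sketch compares two exported result files and reports how often they agree. The export format (one CSV row per frame and track: frame number, track ID, label) is an assumption made for this example and not necessarily the format our application writes.

```python
import csv
import io


def load_results(stream):
    """Read rows of frame,track_id,label into a dict keyed by (frame, track_id)."""
    results = {}
    for frame, track_id, label in csv.reader(stream):
        results[(int(frame), int(track_id))] = label
    return results


def agreement(baseline, candidate):
    """Fraction of entries present in both runs that carry the same label."""
    shared = baseline.keys() & candidate.keys()
    if not shared:
        return None  # nothing to compare
    same = sum(1 for key in shared if baseline[key] == candidate[key])
    return same / len(shared)


# In practice these would be files written by two runs on the same video.
baseline = load_results(io.StringIO("0,1,standing\n1,1,standing\n2,1,walking\n3,1,walking\n"))
candidate = load_results(io.StringIO("0,1,standing\n1,1,walking\n2,1,walking\n3,1,walking\n"))

print(agreement(baseline, candidate))  # 0.75
```

Note that this only measures how much a change shifts the output relative to an earlier run. Without manually labeled ground truth, it says nothing about which of the two runs is more accurate.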

Performance and profiling

NVIDIA offers a suite of analysis tools for evaluating performance across the various components of the Jetson DeepStream AI stack. These tools include NVIDIA Nsight Systems for system-level tracing, NVIDIA Nsight Compute for performance analysis of CUDA kernels, and NVIDIA Nsight Graphics for assessing graphics performance. Together, they help developers optimize their AI applications by providing detailed insights into system operations and performance bottlenecks.

Of these tools, NVIDIA Nsight Systems is a good starting point for whole system performance analysis. It can capture and analyze system-wide performance data, including general information such as CPU and memory usage, and NVIDIA-specifics such as CUDA and TensorRT data, providing a comprehensive view of the application’s behavior across the entire system.

We used Nsight Systems to investigate a performance issue where the frame rate dropped to approximately 1 fps when more than one person appeared in the video. Looking at the Nsight Systems trace, we discovered that the problem was not a throughput problem, as the frames were still being processed at the desired 30 fps in the inference pipeline. Instead, we identified a latency issue, with each additional person causing an increase in latency. This resulted in the sink element dropping frames as they arrived “too late”. Only occasionally did a frame arrive in time to be presented, leading to the appearance of a low frame rate. With this insight, we were able to address the problem for the demo by increasing the allowed latency of the video processing pipeline.
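
The difference between throughput and latency can be shown with a deliberately simplified model: every frame is processed, but the sink only presents frames whose pipeline latency stays within the allowed budget. The numbers below are hypothetical and the function is not GStreamer’s actual synchronization logic.

```python
def count_presented(latencies_ms, allowed_latency_ms):
    """Count the frames that reach the sink within the allowed latency."""
    return sum(1 for latency in latencies_ms if latency <= allowed_latency_ms)


# One second of video at 30 fps: all 30 frames are processed,
# but 29 of them take 180 ms to pass through the pipeline.
latencies_ms = [180.0] * 29 + [90.0]

print(count_presented(latencies_ms, allowed_latency_ms=100.0))  # 1
print(count_presented(latencies_ms, allowed_latency_ms=200.0))  # 30
```

With a budget of 100 ms, only one frame per second is shown even though the pipeline still processes 30 frames per second. Raising the budget makes all frames visible again, at the cost of a longer delay between capture and display.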

Screenshot of Nsight Systems UI showing trace data

The other tools, Nsight Compute and Nsight Graphics, can be used for deeper analysis of GPU computing and rendering performance, respectively. We did not evaluate these tools for our application.

Yocto Linux: a solution to security and maintenance challenges

The default operating system available for the development kit we use is a pre-configured Ubuntu 20.04 LTS / Jetson Linux 35.3 with JetPack 5.1.1.
While this is a perfectly fine starting point for initial experimentation, it is not well-suited as an embedded product platform.

When developing an actual product, it is important to also pay close attention to the supporting system components, such as the operating system running on the device. The OS not only hosts and runs the application software and links it to the underlying hardware, but also oversees update mechanisms, auxiliary services, security aspects, and other functions. Employing the pre-configured, versatile development environment provided with the development kit leads to a substantial dependency on the continuous development and support of the operating system from upstream sources. This reliance can make it more difficult to effectively track and manage the software that is installed, introducing ambiguities into the Software Bill of Materials (SBOM) and contributing to its increase in size and complexity.

Instead, we have decided to use our own customized OS based on the Yocto meta-tegra layer. In our experience, creating our own custom Yocto-based Linux distribution for the Jetson Orin NX board has been a great success.
Building on this groundwork, we could further extend the OS by integrating additional Yocto layers, such as those from the Mender project, to make the update mechanism more robust. This would help make our solution production-ready and future-proof with regard to the Cyber Resilience Act recently adopted by the EU parliament.

Wrapping things up

We met our goal of developing a live video processing app with AI on Jetson, learned from the hurdles and solutions in edge AI implementation, and reviewed the steps for making an edge AI app production-ready. This experience has equipped us with the knowledge to construct a robust, production-ready, and field-updatable Edge AI platform, poised to empower our customers to innovate at the edge.

Dominik Helleberg
Head of Mobile Development and Smart Devices