This blog post describes the integration of federated learning (FL) into the KOSMoS platform, an ecosystem for cross-company exchange of production data. We first introduce the main components of the KOSMoS ecosystem before we explain the concepts used to achieve federated training. Basic knowledge about FL is required to understand this blog post. If you need a refresher, check out the previous part of this series, “Federated Learning: A Guide to Collaborative Training with Decentralized Sensitive Data – Part 1”.
The research project “Collaborative Smart Contracting Platform for Digital Value Networks”, KOSMoS for short, enables cross-company, data-driven business models built on secure infrastructure, in particular a blockchain and FL. The project primarily targets machine tool manufacturers and their customers, but the KOSMoS system can also be extended and applied to other industries. Use cases realized in KOSMoS, such as dynamic leasing or transparent maintenance, rely on decentralized customer data that is recorded at the production machine during operation.
The KOSMoS ecosystem, visualised in Figure 1, is divided into two logical parts: a global part that runs in the cloud (top) and a local part (bottom). The global KOSMoS system provides access to the system and handles the processing of data in the blockchain or the analytics platform. The KOSMoS edge is located at the machine operator and collects the data that sensors record during production. A single machine operator can run several KOSMoS edges, and edges can be deployed at several different customers, as shown in Figure 2. For more details, visit https://www.kosmos-bmbf.de/.
Collaborative Predictive Maintenance in KOSMoS
For the transparent maintenance use case, it is desirable to train predictive maintenance models that detect costly failures before they occur. Training a model to predict the remaining useful lifetime requires a variety of machine outage recordings. However, operators avoid machine outages at all costs, so very few recordings are available. Additionally, the recordings may contain sensitive information about the operator’s production and thus cannot be accessed directly.
In KOSMoS, the machine data is constantly recorded and available on the KOSMoS edge. Thus, whenever one of the rare outages occurs, it is captured at the affected operator. As soon as a sufficient number of outages is available, FL can be applied to them.
FL collaboratively combines the scarce outage recordings of the operators. It preserves the privacy of the training data by design and makes the whole collection of outage recordings usable for training. The upcoming sections describe how we use different open-source technologies to integrate FL into KOSMoS.
Integration of Federated Learning in KOSMoS
The KOSMoS ecosystem is based on several components that use Docker and Kubernetes for deployment and execution. To integrate FL into the existing systems, we use Flower as the FL framework and MLflow for logging. All the software used is open source, which is an important requirement for the KOSMoS project.
Docker and Kubernetes
The FL setup in KOSMoS consists of three Docker images: FL server, FL client and MLflow server. The FL server and the MLflow server are deployed in the KOSMoS global component using Kubernetes and are permanently operational. The FL clients are deployed at the edge of the machine operators and connect to the FL server via the internet. The server then checks whether the clients meet predefined criteria, such as providing a certain number of training samples. We added this functionality to enable client selection and to make FL a constantly available service, since a session normally ends as soon as the federated training has finished. When enough suitable clients are available, the actual FL takes place with Flower.
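The client check described above can be sketched in a few lines of plain Python. This is only an illustration and not the KOSMoS implementation: the `ClientInfo` structure, the function names and both thresholds are hypothetical, since the post does not state the actual criteria or values.

```python
from dataclasses import dataclass


@dataclass
class ClientInfo:
    """Metadata a client reports when registering (hypothetical structure)."""
    client_id: str
    num_samples: int


# Hypothetical thresholds; the actual values are not given in the post.
MIN_SAMPLES_PER_CLIENT = 100
MIN_SUITABLE_CLIENTS = 2


def select_clients(registered, min_samples=MIN_SAMPLES_PER_CLIENT):
    """Keep only clients that provide at least `min_samples` training samples."""
    return [c for c in registered if c.num_samples >= min_samples]


def ready_for_training(registered, min_samples=MIN_SAMPLES_PER_CLIENT,
                       min_clients=MIN_SUITABLE_CLIENTS):
    """Return True once enough suitable clients have registered."""
    return len(select_clients(registered, min_samples)) >= min_clients


registered = [
    ClientInfo("operator-a", 250),
    ClientInfo("operator-b", 40),   # too few samples, filtered out
    ClientInfo("operator-c", 120),
]
suitable = select_clients(registered)
print([c.client_id for c in suitable])
print(ready_for_training(registered))
```

Keeping the selection separate from the readiness check means the server can keep accepting registrations and only hand over to the FL framework once the second function returns true.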
Flower
The main advantage of Flower is that it can be used with any machine learning framework and that the federated training process can be easily adapted. The methods for aggregation, training and evaluation can be defined and exchanged as needed. The decision to use Flower was based on an evaluation presented in a previous blog post, “Federated Learning: Frameworks for Decentralized Private Training – Part 2”.
The Flower server runs as a sub-process of the FL server. In the same way, the Flower clients run as part of the FL clients. After enough suitable clients have registered, the actual training takes place: each Flower client trains a model on its local data and sends it to the Flower server. The training data is loaded from the local database, which continuously tracks the machine data. The Flower server then aggregates all client models into one global model and sends it back to the clients. This process is repeated until the predefined number of training rounds is reached. As we use Keras, training runs on a GPU if the edge device has one.
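To make the aggregation step more concrete, here is a minimal sketch of a sample-weighted average of client models, known as federated averaging (FedAvg). It assumes this aggregation method, which the post does not name explicitly, and it uses flat lists of floats in place of real Keras weights. The function name and the example numbers are made up for illustration; in practice Flower performs this step through its strategy classes.

```python
def federated_average(client_updates):
    """Sample-weighted average of client model weights (FedAvg).

    `client_updates` is a list of (weights, num_samples) pairs, where
    `weights` is a flat list of floats standing in for model parameters.
    """
    total_samples = sum(n for _, n in client_updates)
    if total_samples == 0:
        raise ValueError("no training samples reported")
    num_params = len(client_updates[0][0])
    averaged = [0.0] * num_params
    for weights, n in client_updates:
        if len(weights) != num_params:
            raise ValueError("clients must share one model architecture")
        # Each client contributes in proportion to its share of the samples.
        for i, w in enumerate(weights):
            averaged[i] += w * n / total_samples
    return averaged


# Two hypothetical clients: the second holds three times as much data.
updates = [
    ([1.0, 2.0], 100),
    ([3.0, 6.0], 300),
]
global_weights = federated_average(updates)
print(global_weights)
```

Weighting by sample count lets operators with more recordings influence the global model more strongly, while only model weights and sample counts are exchanged, never the recordings themselves.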
MLflow
Both the server and the clients evaluate the global model on their own data. The resulting metrics, such as loss and accuracy, are then logged to an MLflow server. MLflow is an open-source tool that provides a clear overview of machine learning training sessions and their progress. It can be easily integrated into projects, and we use it to log the progress of the federated learning sessions. It is deployed as a constantly running server in the KOSMoS global component.
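The per-round logging could look roughly like the sketch below. It is an assumption about how such logging might be organised, not the project's actual code. The helper `log_evaluation`, the naming scheme for metrics and the example values are hypothetical. The logger is passed in as a callable with the shape `(key, value, step=...)`, which matches `mlflow.log_metric`, so the sketch runs without an MLflow server.

```python
def log_evaluation(log_metric, source, round_num, metrics):
    """Log one party's evaluation metrics for one federated round.

    `log_metric` is any callable with the shape (key, value, step=...).
    `source` identifies who evaluated the global model, for example
    "server" or a client name (hypothetical naming scheme).
    """
    for name, value in metrics.items():
        log_metric(f"{source}_{name}", float(value), step=round_num)


# Stand-in logger so the sketch runs without an MLflow server.
logged = []


def record(key, value, step=None):
    logged.append((key, value, step))


# Made-up example values for one training round.
log_evaluation(record, "server", 1, {"loss": 0.42, "accuracy": 0.88})
log_evaluation(record, "client_a", 1, {"loss": 0.51, "accuracy": 0.83})
print(logged)
```

With MLflow installed, passing `mlflow.log_metric` instead of `record` inside an active run would send the same values to the tracking server, with the training round as the step so that each metric can be plotted over the rounds.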
Conclusion
Incorporating generic FL into the KOSMoS ecosystem was a challenging task. With the help of Docker and Kubernetes, however, the new federated server and client can be deployed. In addition, an MLflow server tracks the distributed learning process on a central logging server. The actual training is done with the Flower FL framework, which allows the use of any machine learning framework and other necessary libraries. The setup described thus enables federated training with any dataset and machine learning framework. The implementation of this use case is open source and can be found on GitHub. It is divided into three repositories: one for the federated learning server, one for the federated learning client and one for the federated learning resources.
Acknowledgments
This blog post describes the integration of FL into the KOSMoS ecosystem. The research behind it started with my bachelor thesis “Evaluation of Federated Learning in Deep Learning” at inovex Lab and now continues in the research project “KOSMoS – Collaborative Smart Contracting Platform for Digital Value Networks”. The research project is funded by the Federal Ministry of Education and Research (BMBF) under reference number 02P17D026 and supervised by Projektträger Karlsruhe (PTKA). The responsibility for the content lies with the authors.