For this to be successful, quality assurance is required that ensures the recorded measurement data provides robust and reproducible results. This, in turn, requires key performance indicators for the quality of measurement data and a clear quality statement.
In order to be able to deliver such a statement, large amounts of measurement data need to be efficiently processed. Scalable processes that allow the algorithms to be parallelised and automatically distributed to several systems are necessary for this purpose.
In the project ‘Expert system for quality assessment and evaluation of three-dimensional mass spectroscopy data (EM²Q)’, carried out in collaboration with Mannheim University of Applied Sciences, a scalable data analysis platform has been developed with which measurement data from imaging mass spectrometry can be examined in terms of its quality and comparability, in order to arrive at a quality statement that is as objective and clear as possible. The computing capacity and the degree of parallelisation of the algorithms can be increased flexibly as required by providing additional resources.
The platform is not restricted to mass spectrometry; it can be transferred to a wide variety of professional applications that call for scalable data analysis.
Challenges of imaging mass spectrometry
The technology of imaging mass spectrometry allows local and spatial measurement of tissue samples or sections at the molecular level. In this context, it is not only the distribution of one individual substance that is analysed; all substances present in the tissue that can be captured within the measurement area of the device are simultaneously detected.
Naturally, this generates a large amount of data, because the local and/or spatial intensity distribution of every potential substance represented by a measurement value is stored. The current generation of devices, which includes those used at Mannheim University of Applied Sciences, generally records 200,000 mass values simultaneously, producing 200,000 intensity distributions. The task now is to suitably process that quantity of measurement data in an appropriate time in order to be able to answer biological and medical questions.
Another major challenge for mass spectrometry is generating reproducible and comparable measurements. Due to the many parameters in sample preparation and the various measuring methods and adjustment possibilities on the devices, it is still almost impossible to compare measurements from different research groups or to evaluate the combined measurement data collected.
Support for typical workflows
Essentially, the platform is intended to support two different user groups in their daily work. The first group is the operators of the mass spectrometers: they take the measurements using the device, record the measurement data by uploading it (supplemented by appropriate metadata) to the EM²Q platform, and carry out further data processing steps there. Typical steps include transforming file formats to optimise data access, peak detection, noise reduction and calculating various quality metrics.
The second user group is the data scientists, who develop new algorithms based on the data sets stored in the EM²Q platform, test new data analysis methods and optimise existing approaches.
Flexibility through container technology
Of course, many scripts, programs and libraries for the evaluation of mass spectrometry data have already been created in the context of academic research (e.g. MALDIquant), but in various programming languages, such as Python, R, C++ and others.
In order to reuse them rather than reprogram them, the program artefacts are each packaged together with all their dependencies in Docker container images. In a first stage, base images with the operating system, runtime environment, system tools and libraries are made available for all relevant programming languages. The application images, which contain the code for data transformations or analyses and basic configurations, are then built on top of them.
This results in lightweight, stand-alone, executable software packages that can run on any platform that provides a Docker engine.
Scalability through Kubernetes
Kubernetes (K8s) is a platform for automating the deployment, scaling and administration of container applications. Kubernetes was originally conceived by Google and handed over to the Cloud Native Computing Foundation (CNCF). It has since developed into the de facto standard, at least in the cloud sector, and is also used in the EM²Q project.
An application is instantiated in Kubernetes as a pod, i.e. as a group of containers that share their storage, their network and their runtime configuration.
Kubernetes can scale the number of pods and uses its scheduler to distribute them across the nodes, thus adapting to current load requirements on the one hand and making optimal use of the available resources in the cluster on the other.
Kubernetes is available as a service on all popular public clouds, but it can also be operated on private clouds or on bare metal in one's own data centre. This also makes the container applications installed on it portable. As part of the EM²Q project, we have implemented installations on the Google Kubernetes Engine, the inovex private cloud (inovex Cloud Services, based on OpenStack) and as a bare-metal installation on a GPU computing cluster at Mannheim University of Applied Sciences.
Acceleration through GPUs
Among other things, the project showed how graphics processors (GPUs) can be accessed from Kubernetes.
For this purpose, Dockerfiles were used that dynamically locate the Nvidia drivers on the cluster side and also install the CUDA libraries. In addition, the cuDF library from RAPIDS was tested.
The cuDF library offers a pandas-like API, allowing developers to benefit from GPU acceleration even without detailed knowledge of CUDA programming. In our tests, a data frame (DF) with 100,000 rows to which a mathematical operation was applied was processed about 32 times faster with cuDF than with the corresponding CPU version.
CUDA programming for Python with the JIT compiler Numba was also tested. To this end, a simple averaging filter, with which the spectrometry intensities can be smoothed, was implemented. The intensities of each image point are stored as a column in a data frame. Using the GPU, it is possible to smooth the columns of the entire DF in parallel, whereas the CPU version of the filter can only process the individual columns of the DF serially. Compared to the CPU version (the SciPy implementation), we were able to achieve a speed-up of 1.5 to 2 times.
The images providing the Nvidia and CUDA/cuDF functionality were used as base images for additional Dockerfiles with their own applications. The images were deployed as an Argo workflow.
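The structure of such a filter can be illustrated with a minimal plain-Python sketch. It is not the Numba kernel or the SciPy implementation used in the project; the function names, the window size and the truncated window at the edges are assumptions made for illustration. The point it shows is that every column is smoothed independently of the others, which is what makes the work easy to spread over GPU threads.

```python
from typing import List


def moving_average(column: List[float], window: int = 3) -> List[float]:
    """Smooth one column with a centred moving average.

    Near the edges the window is truncated, so the output has the
    same length as the input.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = []
    for i in range(len(column)):
        lo = max(0, i - half)
        hi = min(len(column), i + half + 1)
        smoothed.append(sum(column[lo:hi]) / (hi - lo))
    return smoothed


def smooth_columns(columns: List[List[float]], window: int = 3) -> List[List[float]]:
    """Serial CPU-style version: one column (image point) after the other.

    No column depends on another, so a GPU version could assign
    the columns to parallel threads instead.
    """
    return [moving_average(column, window) for column in columns]


columns = [
    [0.0, 3.0, 0.0, 3.0, 0.0],  # intensities of image point 1
    [1.0, 1.0, 1.0, 1.0, 1.0],  # intensities of image point 2
]
smoothed = smooth_columns(columns)
```

A constant column stays unchanged, while the alternating column is flattened towards its local mean.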
Data storage and data processing capacity are decoupled in the EM²Q system. While the required computing capacity is provided via Kubernetes, different, ideally distributed storage systems, such as NFS, S3 or HDFS, can be mounted, depending on the environment.
The workflow steps in data processing to be carried out by the operators are generally dependent on each other or build on each other. Argo is used to orchestrate them into data processing pipelines on the EM²Q platform.
The dependencies between the processing containers, the detailed configurations of the containers and their scaling are described declaratively (in YAML) and passed on to the Kubernetes scheduler when a workflow begins.
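The idea of declaring dependencies and letting the system derive the execution order can be sketched in a few lines of Python. This is not Argo's YAML format; it is only an illustration using the standard library module graphlib (Python 3.9 or later), and the step names are hypothetical labels for the processing steps mentioned in this article.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on (hypothetical names).
dependencies = {
    "convert_to_parquet": set(),
    "smooth": {"convert_to_parquet"},
    "remove_baseline": {"smooth"},
    "normalise": {"remove_baseline"},
    "detect_peaks": {"normalise"},
    "quality_metrics": {"detect_peaks"},
    "imaging": {"detect_peaks"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

# Collect "waves" of steps: all steps in one wave have their
# dependencies satisfied and could run at the same time.
waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())
    waves.append(ready)
    sorter.done(*ready)  # mark the whole wave as finished

for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

In this sketch the last wave contains two steps, because both depend only on the peak detection and not on each other.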
In the EM²Q project, we achieve parallel processing of a data set by partitioning it and processing each partition, which contains a defined number of spectra, with a pod.
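A minimal sketch of such a partitioning, assuming the spectra are addressed by consecutive indices; the function name and the example sizes are illustrative and not taken from the project code.

```python
from typing import List


def partition_ranges(n_spectra: int, partition_size: int) -> List[range]:
    """Split spectrum indices 0..n_spectra-1 into consecutive partitions."""
    if partition_size < 1:
        raise ValueError("partition_size must be at least 1")
    return [
        range(start, min(start + partition_size, n_spectra))
        for start in range(0, n_spectra, partition_size)
    ]


partitions = partition_ranges(n_spectra=10, partition_size=4)

# Each range would be handed to one pod as its share of the data set.
for pod_index, spectra in enumerate(partitions):
    print(f"pod {pod_index}: spectra {spectra.start}..{spectra.stop - 1}")
```

The last partition may be smaller than the others when the number of spectra is not a multiple of the partition size.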
Transformations and quality metrics
In the first step, the implemented standard workflow converts the proprietary input data format into Parquet, which is well suited to analytical calculations due to its column orientation and efficient compression.
Afterwards, different standard algorithms are applied to the spectra: first, the Savitzky-Golay filter, which is widely used in spectrometry, smooths them and reduces noise.
In the next step, unwanted, systematic background disturbances are removed and the baseline of the spectra is thus set to zero. Each spectrum is then normalised according to the total ion count (TIC) method, i.e. the number of registered molecules in each spectrum is normalised to make the spectra comparable with each other.
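The total ion count step can be sketched as follows, assuming the common variant in which every intensity is divided by the sum of all intensities of its spectrum. The function name and the handling of empty spectra are assumptions for illustration, not the project's implementation.

```python
from typing import List


def tic_normalise(intensities: List[float]) -> List[float]:
    """Divide every intensity by the spectrum's total ion count (TIC)."""
    tic = sum(intensities)
    if tic <= 0:
        # Empty or all-zero spectrum: nothing to scale.
        return list(intensities)
    return [value / tic for value in intensities]


spectrum_a = [2.0, 6.0, 2.0]     # weak spectrum
spectrum_b = [20.0, 60.0, 20.0]  # same shape, ten times the signal
norm_a = tic_normalise(spectrum_a)
norm_b = tic_normalise(spectrum_b)
```

After normalisation the two example spectra are identical, although the second one was recorded with ten times the overall signal.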
The next essential step is peak detection in all spectra: local maxima that stand out significantly from the determined local noise are searched for across the complete spectrum.
This creates a list of peak masses and intensities for each spectrum, which can be analysed or used for imaging algorithms in further steps, in order to locally resolve interesting properties such as the distribution of individual molecules in the tissue sample.
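A simplified sketch of this kind of peak detection is shown below. It makes two assumptions that are not taken from the project: the noise is estimated once per spectrum with the median absolute deviation (MAD) rather than locally, and a peak must exceed the median by a hypothetical threshold of `min_snr` times that noise.

```python
from statistics import median
from typing import List, Tuple


def estimate_noise(intensities: List[float]) -> float:
    """Robust noise estimate: median absolute deviation (MAD)."""
    centre = median(intensities)
    return median(abs(value - centre) for value in intensities)


def detect_peaks(
    masses: List[float],
    intensities: List[float],
    min_snr: float = 3.0,
) -> List[Tuple[float, float]]:
    """Return (mass, intensity) for local maxima that stand out from the noise."""
    level = median(intensities)
    noise = estimate_noise(intensities)
    peaks = []
    for i in range(1, len(intensities) - 1):
        is_local_max = intensities[i - 1] < intensities[i] >= intensities[i + 1]
        if is_local_max and intensities[i] - level > min_snr * noise:
            peaks.append((masses[i], intensities[i]))
    return peaks


masses = [100.0, 100.1, 100.2, 100.3, 100.4, 100.5, 100.6, 100.7, 100.8]
intensities = [1.0, 2.0, 1.0, 2.0, 9.0, 2.0, 1.0, 2.0, 1.0]
peaks = detect_peaks(masses, intensities)
```

The small local maxima at intensity 2.0 are rejected as noise; only the maximum at mass 100.4 ends up in the peak list.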
In order to be able to assess the quality of the measurements, the following standard quality metrics are additionally calculated per spectrum from the peak detection results:
- Peak width (narrow peaks mean a high resolution, so are usually desirable)
- Signal-to-noise ratio (a high value suggests meaningful data)
- Number of peaks (the number of peaks should be within a moderate range)
Quality characteristics such as these can also be displayed spatially, and thus allow conclusions to be drawn about how the recording quality develops in the course of a measurement, for example when spectra that are scanned later by the analysing device have wider peaks. Such a quality drift can then be taken into account in the recording and compared with other measurements.
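As an illustration of the first metric, the following sketch computes a peak width as the full width at half maximum (FWHM) with linear interpolation on both flanks. Using the FWHM as the width definition is an assumption; the article does not state which definition the project uses, and the function name is hypothetical.

```python
from typing import List


def fwhm(masses: List[float], intensities: List[float], peak_index: int) -> float:
    """Full width at half maximum of the peak at peak_index."""
    apex = intensities[peak_index]
    if apex <= 0:
        raise ValueError("peak intensity must be positive")
    half = apex / 2.0

    def crossing(step: int) -> float:
        i = peak_index
        # Walk away from the apex while the next point is still above half height.
        while 0 <= i + step < len(intensities) and intensities[i + step] > half:
            i += step
        j = i + step
        if not 0 <= j < len(intensities):
            return masses[i]  # flank is cut off by the edge of the spectrum
        # Interpolate between the last point above half height (i)
        # and the first point at or below it (j).
        fraction = (intensities[i] - half) / (intensities[i] - intensities[j])
        return masses[i] + fraction * (masses[j] - masses[i])

    return crossing(+1) - crossing(-1)


masses = [0.0, 1.0, 2.0, 3.0, 4.0]
intensities = [0.0, 4.0, 8.0, 4.0, 0.0]
width = fwhm(masses, intensities, peak_index=2)
```

Computed per spectrum and plotted against the position of the spectrum in the sample, a value like this would make a drift towards wider peaks visible.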
Through the use of the platform, many data sets are processed by different algorithms with multiple processing approaches. As a result, it is easy to lose track of which experiments were carried out with which (hyper-)parameters, where the results can be found and how they can be compared with other experiments. With several users, this problem only becomes greater and demands data and model governance. That is why we have developed a metadata management component.
The metadata is captured automatically through integration with Kubernetes, Argo workflows and the various processing containers, and ultimately supplemented with information collected manually by the operator via a GUI.
A simple web application through which data can be uploaded and appropriate processing workflows can be selected and started is a sufficient user interface for the operator. The results obtained are systematically stored and can be visualised. The collected metadata can be queried and used for targeted searches of experiments that have already been performed.
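What such a targeted search could look like is sketched below. The record structure, field names, parameter names and example values are purely hypothetical and only illustrate the principle of storing the workflow, its parameters and the result location together; they do not describe the actual EM²Q metadata component.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ExperimentRecord:
    """One processed experiment; all field names are illustrative."""
    dataset: str
    workflow: str
    parameters: Dict[str, float] = field(default_factory=dict)
    result_path: str = ""


def find_experiments(
    records: List[ExperimentRecord], workflow: str, **parameters: float
) -> List[ExperimentRecord]:
    """Return records of a workflow whose parameters match all given values."""
    return [
        record
        for record in records
        if record.workflow == workflow
        and all(record.parameters.get(key) == value for key, value in parameters.items())
    ]


# Hypothetical example entries.
records = [
    ExperimentRecord("sample_a", "standard", {"window": 11, "min_snr": 3.0}, "results/a1"),
    ExperimentRecord("sample_a", "standard", {"window": 21, "min_snr": 3.0}, "results/a2"),
    ExperimentRecord("sample_b", "standard", {"window": 11, "min_snr": 5.0}, "results/b1"),
]
matches = find_experiments(records, "standard", window=11)
```

Here the search returns the two experiments that were run with a window of 11, together with the paths under which their results were stored.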
Jupyter Notebooks, made available via JupyterHub, have been integrated into the EM²Q system as a tool for data scientists. This allows exploratory data analysis to be carried out. In addition to standard libraries (scikit-learn), Apache Spark, a distributed analytics engine for big data, and distributed TensorFlow for deep learning are available, in particular for compute-intensive analyses.
In order to successfully master the interaction of the various apps and the platform, installation and maintenance are largely automated (‘Infrastructure as Code’): GitLab is used for source code management, CI/CD pipelines and the Docker registry, the set-up of Kubernetes, including the VPN, is defined in Terraform scripts, and the technical dependencies between the applications are resolved by Helm charts.
For operational monitoring of the EM²Q platform, the relevant metrics of the platform and the various pods are collected, stored in Prometheus and visualised via Grafana.
The log files, which are generated by each app and are therefore spread across the cluster, are collected via an ELK stack and can thus be evaluated centrally.
The EM²Q project has developed a scalable data analysis platform that can be used to process and analyse measurement data from imaging mass spectrometry.
For this purpose, a wide range of frameworks and algorithms, which can be used both exploratively and via structured workflows and can be supplemented very flexibly, are available via lightweight container applications.
This extremely modular solution is operated on Kubernetes platforms, making it independent of specific hardware or cloud virtualisation. The EM²Q platform enables analysis of large quantities of data by taking advantage of the scaling capability, scheduling, and configuration management of Kubernetes.
The project is being funded by the German Federal Ministry of Economics and Technology as a cooperative project between inovex and Mannheim University of Applied Sciences as part of the German Central Innovation Programme for small and medium-sized enterprises.