This is the second part of a three-part series covering a real-life microservice migration with all its flaws and benefits. In the first part, we described why we saw the need to change a monolithic application into a microservice-oriented architecture. In this part, we walk through the changes we actually made to the system and explain why we opted not to follow some ideas.
What We Changed
We started the slow migration to a microservice-oriented system by setting up a single additional service with an easily separable feature set. This allowed us to redistribute existing data and tasks gradually without having to adjust or rebuild the existing monolithic application. From a technical point of view, the migration nonetheless involved a significant number of changes, which we describe in the following sections.
We spent very little time on preparations and basically just started building. As a result, many of the requirements were only discovered while setting up the first new service. In our case, early mistakes were not severe, since the new parts of the application were not yet enabled for any users. This meant we could test and refine for a longer period of time. Testing itself was more difficult, though: during the first months after the initial setup of the new services, we had no feedback from real users. Later, we fortunately got some early testers from within our customer's company, and from that point on more feedback started pouring in.
More than two years had passed since the initial commit to the project, and most of our dependencies had gone through many versions in that time. Setting up completely new services allowed us to start over with all components and use the latest versions of libraries and frameworks such as Spring and Gradle. This was clearly more difficult to do in the old application, since major version upgrades often brought breaking changes and incompatibilities, especially in a larger application like this one.
Some of the experience we gained from setting up a service with new versions helped us later on. Because we now knew the differences between the versions, we were able to introduce them in the old application as well.
When we set up the first new microservice, we started to separate data by following a one-database-per-service pattern. In our setup, this means creating additional databases inside the existing DBMS. We continued with this method for each of the services that were created later on, and to this day we run a single DBMS with one individual schema for each service. To guarantee the availability of the database, we additionally set up a distributed Patroni cluster on top of our Postgres DBMS.
Because resources for virtual machines and the like were limited, we had no option to set up multiple independent DBMSs. Had that been possible, each system could have handled only the data of a single service, which could have meant an independent load distribution. With the current setup, we can distribute the load from read operations across all instances of the database, but if one service started to generate a high load through numerous write operations, other services might be impacted as well. So far we are in a comfortable position: the operations on the database are nowhere close to the limits of what the DBMS can handle.
We knew from the beginning that
- we would not have many services
- the services would be introduced slowly over a longer period of time
This led to the decision that we needed neither a discovery service nor an API gateway. In the spirit of keeping it simple, we continued to use an existing HAProxy, where prefix matching masks different internal addresses under the same domain and identical ports. The location of services is therefore transparent to the client code in both our browser-based web applications and our mobile apps. The HAProxy is also able to distribute requests over multiple available instances of running services. Currently, we must extend the proxy configuration manually for every new service. Nonetheless, this is an easy win and does not require any additional resources.
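To illustrate the routing idea, here is a minimal Java sketch of prefix matching combined with round-robin distribution. It is not HAProxy configuration and not taken from the project: the class name, the path prefixes and the internal addresses are hypothetical, and the "longest prefix wins" rule is a simplifying assumption made for this sketch.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative stand-in for a prefix-matching proxy; all names are hypothetical. */
final class PrefixRouter {
    // Path prefix -> internal instance addresses that serve it.
    private final Map<String, List<String>> backends = new LinkedHashMap<>();
    // Path prefix -> request counter used for round-robin selection.
    private final Map<String, AtomicInteger> counters = new HashMap<>();

    /** Corresponds to the manual step of adding a new service to the proxy. */
    void register(String prefix, List<String> instances) {
        if (instances.isEmpty()) {
            throw new IllegalArgumentException("at least one instance is required");
        }
        backends.put(prefix, List.copyOf(instances));
        counters.put(prefix, new AtomicInteger());
    }

    /** Picks the longest registered prefix of the path, then rotates through its instances. */
    Optional<String> route(String path) {
        String best = null;
        for (String prefix : backends.keySet()) {
            if (path.startsWith(prefix) && (best == null || prefix.length() > best.length())) {
                best = prefix;
            }
        }
        if (best == null) {
            return Optional.empty(); // no service is registered for this path
        }
        List<String> instances = backends.get(best);
        int index = Math.floorMod(counters.get(best).getAndIncrement(), instances.size());
        return Optional.of(instances.get(index));
    }
}
```

With a catch-all prefix `/` registered for the main application and a more specific prefix for a new service, the client keeps calling one domain while requests end up on different internal machines.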
The changes in the architecture of our applications also affected the operational side of things. We operate a more traditional VM setup inside an OpenStack cluster instead of something like Kubernetes, and we had to add machines for the new service. This kind of setup is rigid rather than dynamic, which means the system cannot easily react to the load created by incoming requests. At the same time, it makes costs very predictable.
If you can avoid doing things more than once, you will probably want to do so. Manually reconfiguring applications one by one is exactly this kind of repetitive but avoidable work: with central configuration management, you do not have to change every single config file by hand when a shared attribute changes its value. Central configuration management requires resources of its own, however, and we did not have access to these resources at this point.
Additional effort also arises from having to update each service and its dependencies individually. Yet this separation of components makes the overall system less fragile. If one component cannot be updated in one application due to conflicts, this does not have to affect any other application. This takes some of the pressure off when an individual service requires an update.
Considering the last two sections, the costs of running and managing the machines are comparatively high for such a small project, mainly because everything is still supposed to be failsafe. This requires redundancy, which may cost considerably more money in the microservice context simply because more than one application has to be duplicated. Whether any other system architecture would have helped to reduce those costs, we cannot tell with certainty. Cloud-service-based systems, Kubernetes or other options bring their own drawbacks along with them.
Security is always a complex topic and authentication is a big part of it. In our project we have to handle the usual user logins for our web and mobile apps, but there are additional concerns:
- A hierarchical system for users, groups, roles and authorities. This is used to specify which entity is allowed to access resources and execute actions.
- A large number of external IoT devices that access the system and are accessed or controlled by it.
- An external application that is tightly linked with our system and shares the user authentication data.
The aforementioned hierarchical groups are built as trees and can be built individually by users to manage access rights within their companies. This means they can get quite big and the permission system becomes even more complex this way.
The first step we took to make the authentication process more manageable was to use JSON Web Tokens (JWT) for all applications. Before this, users were identified by a form of temporary, cached UUID stored on the client and in the servers’ database. The external devices already had a token-based authentication mechanism. This was a good step forward, especially in terms of overall security and access management across multiple backend services.
One remaining problem concerned the group information required to manage resource access for individual users. All of this information still resided only inside the main application service and was not directly available to the other services. We therefore decided to add the required information to the JWT, but since it included numerous group IDs, the token became a little too big for our taste.
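To get a feeling for why group IDs bloat a token, the following Java sketch measures the Base64URL-encoded payload segment of a token for a given number of group IDs. It is a rough illustration, not the project's token format: the claim name `groups`, the numeric six-digit IDs and the hand-built JSON are assumptions, and the header and signature segments are left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

/** Estimates how the JWT payload segment grows with the number of embedded group IDs. */
final class TokenSizeSketch {

    /** Builds the claims JSON by hand; "groups" is a hypothetical claim name. */
    static String claims(String subject, List<Integer> groupIds) {
        String groups = groupIds.stream().map(String::valueOf).collect(Collectors.joining(","));
        return "{\"sub\":\"" + subject + "\",\"groups\":[" + groups + "]}";
    }

    /** Length of the payload once Base64URL-encoded without padding, as used inside a JWT. */
    static int encodedPayloadLength(String claimsJson) {
        byte[] raw = claimsJson.getBytes(StandardCharsets.UTF_8);
        return Base64.getUrlEncoder().withoutPadding().encodeToString(raw).length();
    }

    /** Generates made-up six-digit group IDs for the experiment. */
    static List<Integer> sampleGroupIds(int count) {
        return IntStream.range(0, count).map(i -> 100_000 + i).boxed().collect(Collectors.toList());
    }
}
```

Comparing `encodedPayloadLength(claims("user-1", sampleGroupIds(5)))` with the same call for 500 IDs shows the payload growing linearly with the number of groups, and the whole token travels with every single request.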
Later, we stopped using the token as a carrier for this additional information and moved it into a Redis cache. Now all the services can access the relevant information when needed without bloating the token. If you have the resources to create an additional service for making information available across services, you might want to do so.
In our case, these resources are not available, but the cache is a simple and sufficient solution. Within our system, Redis now holds information about:
- Groups and information on the hierarchy of a specific group in the tree:
  - The root group for it.
  - All parents, siblings and children.
- Permissions for individual roles.
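The hierarchy entries listed above can be derived from a simple child-to-parent mapping. The following Java sketch shows one way to compute them and to lay them out as cache entries. It uses plain in-memory maps instead of a Redis client, and the key names such as `group:<id>:root` are hypothetical. It also assumes that "all parents" means every ancestor up to the root and that the groups form a proper tree without cycles.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Derives per-group hierarchy information that could be stored in a cache. */
final class GroupHierarchySketch {
    // Group ID -> parent group ID; root groups map to null.
    private final Map<String, String> parentOf = new LinkedHashMap<>();

    void addRoot(String id) {
        parentOf.put(id, null);
    }

    void addChild(String id, String parentId) {
        parentOf.put(id, parentId);
    }

    /** All ancestors, nearest parent first, root last. */
    List<String> parents(String id) {
        List<String> result = new ArrayList<>();
        for (String parent = parentOf.get(id); parent != null; parent = parentOf.get(parent)) {
            result.add(parent);
        }
        return result;
    }

    String root(String id) {
        List<String> ancestors = parents(id);
        return ancestors.isEmpty() ? id : ancestors.get(ancestors.size() - 1);
    }

    /** Direct children in insertion order. */
    List<String> children(String id) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> entry : parentOf.entrySet()) {
            if (id.equals(entry.getValue())) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    /** Other groups that share the same direct parent. */
    List<String> siblings(String id) {
        String parent = parentOf.get(id);
        if (parent == null) {
            return List.of();
        }
        List<String> result = children(parent);
        result.remove(id);
        return result;
    }

    /** Hypothetical key layout; a real setup would write these entries to Redis. */
    Map<String, List<String>> cacheEntries(String id) {
        Map<String, List<String>> entries = new LinkedHashMap<>();
        entries.put("group:" + id + ":root", List.of(root(id)));
        entries.put("group:" + id + ":parents", parents(id));
        entries.put("group:" + id + ":siblings", siblings(id));
        entries.put("group:" + id + ":children", children(id));
        return entries;
    }
}
```

Precomputing these lists means a service only needs a key lookup to answer a permission question, at the price of having to refresh the entries whenever a user rearranges the tree.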
A new requirement that first arose when users began to communicate directly with the additional services was authentication for interservice communication. Interservice authentication was necessary in individual use cases where we had to break the pattern of independent services because one service required information from the main application for special types of user requests. One example of this is requests for interaction with the IoT devices. To this day, this feature remains within the old main service, which must be called in this context.
Another use case was requests for group hierarchies. This information used to be available only from the old main service but has since been moved to the Redis storage. For such an interservice request, we have to make sure the user has the required permissions to access the requested information. The solution we chose was to include the user token from the initial request. Since the information exchanged between our services is not supposed to be directly accessible to users, we additionally blocked outside access to those endpoints.
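Passing the user token along can be sketched with the JDK's built-in HTTP client classes. This is a minimal illustration rather than the project's code: the internal host name, the endpoint path and the use of an `Authorization: Bearer` header are assumptions. The request is only built here, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

/** Builds internal requests that carry the original caller's token. */
final class TokenForwarding {

    /**
     * Creates a GET request to another service and forwards the user's JWT unchanged,
     * so the receiving service can check that user's permissions itself.
     */
    static HttpRequest internalRequest(String internalBaseUrl, String path, String userJwt) {
        if (userJwt == null || userJwt.isBlank()) {
            throw new IllegalArgumentException("a user token is required for interservice calls");
        }
        return HttpRequest.newBuilder()
                .uri(URI.create(internalBaseUrl + path))
                .timeout(Duration.ofSeconds(5)) // arbitrary value chosen for the sketch
                .header("Authorization", "Bearer " + userJwt)
                .GET()
                .build();
    }
}
```

Because the receiving service sees the same token as the first one, no separate service credentials are needed. The downside is that every internal call depends on a user context being present.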
In our case, this interservice communication was mainly a consequence of the slow migration process. Concerns could not always be separated cleanly, so some data resided within a single service but was required by multiple services.
Besides the authentication issue, interservice communication also requires a solution for distributed transactions. At the moment, this is handled by evaluating the results of HTTP calls, which is not completely safe. For example, the caller can run into a timeout even though the HTTP call has already been made, so it cannot tell whether the operation was carried out. A better solution for this problem would have been to use message queues or correlation IDs for retries.
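The correlation ID idea can be sketched as follows. The caller attaches the same ID to every retry of one logical operation, and the receiver remembers which IDs it has already processed, so a retry after a timeout does not execute the operation twice. This is a simplified, hypothetical illustration: all class names are made up, results are kept in memory within one process, and a real setup would need storage shared by all instances plus an expiry policy.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.Supplier;

/** Receiver side: runs an operation at most once per correlation ID. */
final class IdempotentReceiver {
    private final Map<String, String> resultsByCorrelationId = new ConcurrentHashMap<>();

    /** Repeated calls with a known ID return the stored result instead of re-running the action. */
    String handle(String correlationId, Supplier<String> action) {
        // The action must return a non-null result, otherwise nothing is remembered.
        return resultsByCorrelationId.computeIfAbsent(correlationId, id -> action.get());
    }
}

/** Caller side: retries a call while reusing the same correlation ID. */
final class RetryingCaller {
    static String callWithRetry(String correlationId, int maxAttempts, Function<String, String> call) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.apply(correlationId);
            } catch (RuntimeException e) {
                lastFailure = e; // e.g. a timeout; the receiver may or may not have done the work
            }
        }
        throw lastFailure;
    }
}
```

If the first attempt times out after the receiver has already done the work, the second attempt with the same ID simply gets the stored result back.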
Besides the changes we have made to the overall architecture already, we have a wishlist of things we would like to implement in the future. These mainly concern the distribution of features across services:
- Authentication Service: A central point within the system that handles authentication and authorization for all other components. This would solve the duplication of information about permissions across multiple services.
- Mail Service: Currently multiple services send emails to users. This mainly means a duplication of code, templates and translations. While this is also partially solvable with a library, a separate service with queuing capabilities would be preferred.
- A regrouping of components between services. Some components are still together in one application for historical reasons. We would like them to be together based on functional and technical connections.
On the other end of things, we decided against micro frontends. As a result, the frontend is still served entirely by the old main application. One of the reasons was that we serve a single-page application (SPA) built with Angular. We expected a switch to multiple frontends to bring much higher complexity and worse performance than a single SPA. We had already experienced both of these effects before, mainly during the migration from AngularJS (Angular 1) to Angular 7. That migration impacted our frontend development for several months, and we still have individual components in the Angular application that were developed with AngularJS. For this reason, we did not want to risk a similar explosion in complexity right then.
One aspect arising from a single frontend for multiple backends is the question of compatibility. Before the introduction of multiple services, backend and frontend were automatically deployed in one step and required a page reload at the most. Now we have independent deployments that have to be built considering compatibility with the respective other side of the application.
Interestingly enough, the separation of backend applications also had an effect on the frontend, even though there is still only one frontend. We previously had end-to-end tests running as part of the build and release pipelines and wanted to preserve this behaviour. An alternative would have been to run frontend tests on the development environment after a new version was deployed, but we would rather make sure that code runs as expected before deploying it. This means we have to run multiple backend services and other components such as Redis inside the pipeline, which makes pipelines more complex and their execution more time-consuming. To limit the worst effects of this, we built base images that contain external components and can be spun up quickly.
As you can see in the illustration, we also added Elasticsearch to our infrastructure. This is almost irrelevant for most parts of the migration, since only one of the services works with Elasticsearch. Because this service has to be available during testing as well, it did have a small impact on our testing pipeline.
Currently there is no testing matrix allowing multiple services to be tested with each other on different versions. We only test the latest version from our master branches of the respective services with each other.
One of the things we could not yet address sufficiently is duplicate code. Examples of this are our solutions for logging, health checks and data format converters. So far this has not led to any severe consequences, but we aim to improve it in the future anyway.
The new architecture has gotten bigger overall and components are better interconnected.
In case you are interested we also have a little fact sheet for you:
| Category | Technology |
|---|---|
| Operating System | OpenStack, Ubuntu Server |
| Message Broker | EMQX (MQTT) |
| Services and libraries | Java, Spring Boot, embedded Jetty, Hystrix, Spring Security, Liquibase |
| Operation tools | Sentry (on-premise), Grafana, Prometheus, GitLab CI, PhraseApp |
In this part, we described what we changed on our way to a partial multiservice architecture and which challenges we met along the way. In the final article of this series, we will discuss whether we would recommend making this change.