Kubernetes Logging with Fluentd and the Elastic Stack

Kubernetes and Docker are great tools to manage your microservices, but operators and developers need tools to debug those microservices if things go south. Log messages and application metrics are the usual tools in these cases. To centralize access to log events, the Elastic Stack with Elasticsearch and Kibana is a well-known toolset. In this blog post I want to show you how to integrate Kubernetes logging with the Elastic Stack. To start off, I will give an introduction to the logging mechanism of Kubernetes, then I'll show you how to collect the resulting log events and ship them into the Elastic Stack. I also provide a GitHub repository with a working demo. Finally, I highlight some considerations for production deployments.

Logging in Kubernetes

Kubernetes recommends that applications log to the standard streams stdout and stderr. Logging to these streams has many advantages: First of all, both streams have been part of Unix systems for decades, so the standard library of every major programming language supports writing to them. Nevertheless, it is advisable to use a mature logging framework to manage log verbosity and log format. Secondly, logging to stdout and stderr does not involve any network protocol. Network logging protocols like syslog or GELF solve the problem of shipping log messages to a central destination, but they come at a cost: developers need to implement those protocols and handle errors in the network. Finally, the two streams are endless by nature and thus a natural fit for an endless stream of log messages. This is why many modern manifestos for good application design, such as the twelve-factor app manifesto, recommend logging to stdout.

Kubernetes logs the content of the stdout and stderr streams of a pod to a file. It creates one file for each container in a pod. The default location for these files is /var/log/containers. The filename contains the pod name, the namespace of the pod, the container name, and the container ID. The file contains one JSON object per line of the two streams stdout and stderr. Kubernetes exposes the content of the log file to clients via its API. The following example shows the content of the file for a Kubernetes dashboard pod and the output of the kubectl logs command.
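The example below is illustrative rather than copied from a real cluster: the log file of a dashboard container might contain JSON lines like these (messages and timestamps are made up):

```json
{"log":"Starting HTTP server on port 9090\n","stream":"stdout","time":"2018-03-01T09:15:04.235Z"}
{"log":"Creating in-cluster Heapster client\n","stream":"stdout","time":"2018-03-01T09:15:04.391Z"}
```

Each object stores one line of container output in the log field, together with the originating stream and a timestamp. Running `kubectl logs <pod-name> -n kube-system` prints only the log field of each line, i.e. the raw output of the container.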

Hence, logging to stdout and the kubectl logs command are a powerful combination to troubleshoot problems of applications running inside a pod. However, Kubernetes deletes all log files of a pod when the pod itself is deleted, so you can't troubleshoot errors in pods that no longer exist. To solve this problem I use fluentd together with the Elastic Stack to store and view the logs in a central tool.

Fluentd

Fluentd is a flexible log data collector. It supports various inputs like log files or syslog and many outputs like Elasticsearch or Hadoop. Fluentd converts each log line to an event, and those events can be processed and enriched in the fluentd pipeline. I have chosen fluentd since there is a good Kubernetes metadata plugin. This plugin parses the name of the log file and uses this information to fetch additional metadata from the Kubernetes API. Metadata like labels and annotations are attached to the log event as additional fields, so you can search and filter by this information. Furthermore, we use the metadata to route the log events to the proper Elasticsearch indices. I use one index per pod, so I can implement a log rotation policy for each kind of pod. For example, you may want to store the logs of a back-end system for two weeks, but the access logs of the front-end for two days only.
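The heart of this pipeline is the fluentd configuration. The following is a minimal sketch rather than the full configuration from the demo repository; the Elasticsearch host name and the index prefix are assumptions:

```
<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  <parse>
    @type json        # each line is one JSON object written by the container runtime
  </parse>
</source>

<filter kubernetes.**>
  @type kubernetes_metadata   # attaches pod name, namespace, labels and annotations
</filter>

<match kubernetes.**>
  @type elasticsearch
  host elasticsearch.logging.svc
  port 9200
  logstash_format true        # creates time-based indices like <prefix>-YYYY.MM.DD
  logstash_prefix k8s-logs
</match>
```

To get one index per pod, the demo derives the index prefix from the pod name in the attached metadata instead of using a static value as shown here.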

To deploy fluentd into the Kubernetes cluster I have chosen a DaemonSet. A DaemonSet ensures that a certain pod is scheduled to each node exactly once. The fluentd pod mounts the /var/log/containers/ host volume to access the logs of all pods scheduled to that node, as well as a host volume for a fluentd position file. This position file records which log lines have already been shipped to the central log store. The fluentd setup described here can be created with the following yaml file. It contains the configuration of the DaemonSet and a ConfigMap. The ConfigMap holds the fluentd configuration:
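A sketch of such a DaemonSet, with illustrative names and a commonly used fluentd image (the demo repository contains the exact manifests, including the ConfigMap):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      # The metadata plugin needs RBAC permission to read pod metadata.
      serviceAccountName: fluentd
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
        volumeMounts:
        - name: varlog            # the container log files of this node
          mountPath: /var/log
        - name: positions         # position file survives fluentd pod restarts
          mountPath: /var/log/fluentd-pos
        - name: config            # fluentd configuration from the ConfigMap
          mountPath: /fluentd/etc
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: positions
        hostPath:
          path: /var/log/fluentd-pos
      - name: config
        configMap:
          name: fluentd-config
```

On Docker-based nodes the files in /var/log/containers are symlinks into /var/lib/docker/containers, so that directory may need to be mounted into the pod as well.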

The complete demonstration based on minikube can be found in this GitHub Repository.

Considerations for Production Deployments

In a production environment you have to implement log rotation for the stored log data. Since the above fluentd configuration generates one index per pod and day, this is easy. Elasticsearch Curator is a tool made for exactly this job. The minikube demonstration provides a good starting point to set up Curator in a Kubernetes environment.
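A Curator action file implementing such a retention policy could look like the following sketch (the index prefix and the retention period are illustrative, not taken from the demo):

```yaml
# Curator action file, e.g. executed periodically by a Kubernetes CronJob.
actions:
  1:
    action: delete_indices
    description: Delete back-end log indices older than 14 days
    options:
      ignore_empty_list: true
    filters:
    - filtertype: pattern      # select the indices of one kind of pod
      kind: prefix
      value: backend-
    - filtertype: age          # keep only the last 14 daily indices
      source: name
      direction: older
      timestring: '%Y.%m.%d'
      unit: days
      unit_count: 14
```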

As discussed above, logging to stdout is very easy. However, you'll want to log the data in a structured fashion to allow more efficient searches in the log data. The demo linked above contains two minimal example applications, normal_logging and structured_logging. The following snippet shows that their log outputs contain the same amount of information.
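Illustratively, the two applications might emit lines like these for the same event (the field names and values are made up, not the demo's exact output):

```
normal_logging:     time="2018-03-01T09:15:04Z" level=info msg="user logged in" user_id=42
structured_logging: {"time":"2018-03-01T09:15:04Z","level":"info","msg":"user logged in","user_id":42}
```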

The normal logging application prints the log message in a custom log format resembling a key=value format. The structured logging application logs the same amount of information as JSON, one complete JSON object per log message. Fluentd recognizes the JSON object in each line and uses it as the base for the log event stored in the Elasticsearch index, so Kibana can offer those keys as searchable fields in its front-end. The following screenshot shows a log message of the structured logger in the Kibana front-end. No additional parsing was configured in the fluentd pipeline.

Structured log message in the Kibana front-end

The same effect can be achieved by parsing the log messages of the normal logging application in the fluentd pipeline, but not without cost: somebody has to maintain the parsing code, which usually is a set of regular expressions. Since each application tends to have its own log format, the set of regular expressions will grow. Furthermore, fluentd has to spend CPU time on the parsing.
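Assuming the normal logging application writes lines such as `time="…" level=info msg="…"`, a parser filter in the fluentd pipeline might look like this sketch (the tag pattern and the regular expression are illustrative):

```
<filter kubernetes.var.log.containers.normal-logging-**>
  @type parser
  key_name log          # parse the raw container output stored in the log field
  reserve_data true     # keep the fields added by the metadata filter
  <parse>
    @type regexp
    expression /time="(?<time>[^"]*)" level=(?<level>\w+) msg="(?<msg>[^"]*)"/
  </parse>
</filter>
```

Every change to the application's log format silently breaks this expression, which is exactly the maintenance burden described above.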

My personal preference is to establish the following policy: logs to stdout have to be in JSON format. The disadvantage of this policy is that JSON is not very human-readable, and developers read logs a lot during development. A mature log handling library with configurable log formats can help developers implement this policy, so they can choose a more readable log format while developing a new feature.
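Such a policy can be sketched in a few lines of Python with the standard logging module; the formatter, the LOG_FORMAT environment variable, and its values are my own naming, not taken from the demo:

```python
import json
import logging
import os
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one complete JSON object per line."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        return json.dumps(payload)


def configure_logging():
    # Log to stdout, as Kubernetes expects.
    handler = logging.StreamHandler(sys.stdout)
    if os.environ.get("LOG_FORMAT") == "plain":
        # Human-readable format for local development.
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    else:
        # JSON by default, i.e. when running in the cluster.
        handler.setFormatter(JsonFormatter())
    logging.basicConfig(level=logging.INFO, handlers=[handler])


configure_logging()
logging.getLogger("demo").info("user logged in")
```

With this in place, fluentd can pick up the JSON fields without any parsing configuration, while a developer simply sets LOG_FORMAT=plain on their workstation.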

Read on

Have a look at our website to find out more about the services we offer in data center automation, write an email to info@inovex.de for more information or call +49 721 619 021-0.

Join us!

Looking for a job where you can work with cutting edge technology on a daily basis? We’re currently hiring Linux Systems Engineers in Karlsruhe, Pforzheim, Munich, Cologne and Hamburg!
