
Dive into Snorkel: Weak-Supervision on German Texts


How do we proceed if we have almost no labeled data for a machine learning model? One answer may be: combine all the knowledge we have (labeled data, distant supervision, subject matter experts) in one framework to get the best of each world. A trending framework that applies this data programming paradigm is Snorkel. In a nutshell, instead of relying on ground truths, Snorkel computes probabilistic labels that are noisy and not perfectly accurate. The hypothesis is that large but noisy datasets outperform small hand-labeled datasets. This blogpost investigates Snorkel for the task of detecting bad language in German texts.


The advent of recent machine learning (ML) methods such as deep neural networks (DNNs) has shifted the bottleneck from feature engineering to labeling immense amounts of data. "The key limiting factor to the development of human-level artificial intelligence" is not algorithms, but datasets. In research, many results rely on quality-labeled datasets that scale up to millions of samples. In real-world projects, however, it is common that historic data exists while labeled data is very limited. Hence, a prerequisite step for any kind of machine learning model is data labeling.

In practice, there are many ways to label data. First, hand-labeling by experts or instructed crowd workers sounds reasonable for smaller datasets. However, it requires time and becomes uneconomical for smaller businesses as datasets grow. Second, there are programmatic approaches based on rules, heuristics, or distant supervision. These may be effective, but given the complexity in various fields of machine learning, hand-crafted rules do not generalize across complex tasks. Therefore, a data programming framework that combines the advantages of both worlds in a single place seems to have high potential.

Scenario and Setup

This blogpost addresses the task of detecting offensive language in German Twitter posts about refugees and immigration. More specifically, we use the binary task from the GermEval2018 dataset, which classifies tweets as offensive or non-offensive. The dataset itself provides gold labels, which we remove from the training set for this scenario. However, we will use parts of the labeled dataset for validation and testing.

It should be mentioned that this blog post is a practical guide stepping through features of Snorkel. The implementation is reproducible and linked as Colab Notebook at the end of the blogpost. For more details on the concepts and theory, the original paper is highly recommended. That being said, let’s get started.

Data Preparation

In the first step, we fetch the GermEval dataset from the GitHub repository and read the two CSV files (training, testing) into pandas DataFrames. We use the binary classification labels (offense, no offense) in column 1.

Subsequently, we create four (instead of the common three) subsets for training, validation, testing, and development. The training set is a large dataset to be labeled by Snorkel, whereas the other subsets provide gold labels. The development set contains hand-labeled instances used to evaluate and optimize our labeling functions, as described in subsequent sections. We sample 100 instances per class to counteract any imbalance in the dataset. On top of this, the validation dataset is used to evaluate the label model's predictions and the test dataset is used to evaluate the classifier scores.
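The splitting step can be sketched as follows. This is a simplified sketch, not the notebook's exact code; the column names (`text`, `label`) and the helper `make_splits` are assumptions for illustration.

```python
import pandas as pd

def make_splits(df, dev_per_class=100, val_size=340, seed=42):
    """Shuffle, carve out a balanced dev set, then validation/test/train.

    The training split has its gold labels removed, since Snorkel is
    supposed to label it; dev/val/test keep their gold labels.
    """
    df = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    # balanced development set: first `dev_per_class` rows per class
    dev = df.groupby("label").head(dev_per_class)
    rest = df.drop(dev.index)
    val = rest.iloc[:val_size]
    test = rest.iloc[val_size:2 * val_size]
    train = rest.iloc[2 * val_size:].drop(columns=["label"])  # unlabeled pool
    return train, dev, val, test
```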

Let’s Snorkel

Having the four datasets, we move on to labeling the training data. Labeling functions are a core component of Snorkel's framework and provide a flexible interface to embed knowledge from different sources. The minimal requirement of a labeling function is to return a value that represents one or none (abstain) of the target classes.

Labeling Functions

For a better understanding, see the example below of identifying keywords in a tweet. We hypothesize that tweets containing references to politicians are more likely to be offensive. Hence, we search for keywords indicating any of these references.

To define a labeling function, Snorkel provides the LabelingFunction class and the labeling_function decorator. For NLP tasks, there is additionally the NLPLabelingFunction, which runs the text through a spaCy preprocessor; we configure it with spaCy's German model. For more details about the available options in spaCy, see its documentation.

For reusability, we can generalize the keyword lookup. Given the tokenized document, we check whether any of the keywords exist in the given text. If a keyword matches, the labeling function assigns the given label (in our case offensive), otherwise it abstains.
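A minimal sketch of this generalized lookup in plain Python. The keyword list and helper names are illustrative, not the notebook's actual values; in the notebook the lookup is wrapped in an NLPLabelingFunction operating on spaCy tokens.

```python
# Label constants; -1 is Snorkel's convention for "abstain".
ABSTAIN, NO_OFFENSE, OFFENSE = -1, 0, 1

# Illustrative keyword set -- the notebook's actual list differs.
POLITICIAN_KEYWORDS = {"merkel", "seehofer", "gauland"}

def keyword_lookup(tokens, keywords, label):
    """Return `label` if any token matches a keyword, else ABSTAIN."""
    return label if any(t.lower() in keywords for t in tokens) else ABSTAIN

def lf_has_politician(tokens):
    return keyword_lookup(tokens, POLITICIAN_KEYWORDS, OFFENSE)
```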

That’s it! The first labeling function is able to flag potentially offensive tweets.

Validating and Improving Labeling Functions

Given this simple labeling function as a starting point, we want to apply it to the training data and see how it actually performs. Snorkel provides a class PandasLFApplier, which makes it easy to apply labeling functions on a pandas DataFrame.
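Conceptually, PandasLFApplier does something like the following simplified stand-in (the real applier is faster and handles preprocessors; this sketch is only meant to show the shape of the label matrix):

```python
import numpy as np
import pandas as pd

ABSTAIN = -1

def apply_lfs(lfs, df):
    """Build the label matrix L: one row per instance, one column per LF."""
    L = np.full((len(df), len(lfs)), ABSTAIN, dtype=int)
    for i, (_, row) in enumerate(df.iterrows()):
        for j, lf in enumerate(lfs):
            L[i, j] = lf(row)
    return L
```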

In the above code snippet, lfs represents a list of labeling functions being applied to the training set. The apply method returns a labeling matrix which we define as L_train. This is an m x n matrix, where m equals the number of instances in the training set and n is the number of labeling functions in lfs. The entries indicate the prediction for each instance and labeling function. We can evaluate this matrix with LFAnalysis, which outputs the following summary.

Labeling function   P    Coverage  Overlaps  Conflicts
lf_has_politician   [1]  0.098565  0.0       0.0

We cover around 10% of the training instances, and the polarity (P) of our labeling function is 1, i.e. it only ever votes for the offensive class. The overlaps and conflicts become important once we have multiple labeling functions. To show the accuracy of the function, we apply our labeling functions on the development set.

Labeling function   P    Coverage  Correct  Incorrect  Accuracy
lf_has_politician   [1]  0.1       13       7          0.65

We achieve an empirical accuracy of 65% on the dev set, which is a good starting point. Be aware that assuming results on this small development set generalize to a large-scale corpus is overly optimistic. Nonetheless, the objective is to develop and tune each labeling function to balance coverage and accuracy, which turned out to be quite difficult in many cases. On the one hand, we require sufficient coverage to label enough data. On the other hand, the accuracy should be satisfactory.
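The coverage, overlap, and conflict columns above can be recomputed from the label matrix directly. A plain-NumPy sketch of what Snorkel's LFAnalysis reports (the function name `lf_summary` is ours):

```python
import numpy as np

ABSTAIN = -1

def lf_summary(L):
    """Per-LF coverage, overlap, and conflict fractions over label matrix L."""
    n, m = L.shape                       # n instances, m labeling functions
    labeled = L != ABSTAIN
    votes_per_row = labeled.sum(axis=1)
    # coverage: fraction of rows where this LF does not abstain
    coverage = labeled.mean(axis=0)
    # overlap: this LF votes and at least one other LF votes on the same row
    overlaps = (labeled & (votes_per_row[:, None] > 1)).mean(axis=0)
    # conflict: this LF votes and some LF on the same row votes differently
    conflict_rows = np.zeros(n, dtype=bool)
    for i in range(n):
        conflict_rows[i] = len(set(L[i][labeled[i]])) > 1
    conflicts = (labeled & conflict_rows[:, None]).mean(axis=0)
    return coverage, overlaps, conflicts
```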

Following this idea, one should define tens or hundreds of labeling functions. In our use-case, we experiment with:

  • Further keyword-based functions.
  • Common German swear words.
  • Upper case: texts containing long uppercase words may indicate offensive language.
  • Emoji polarity: we adapt the idea of word sentiment to emojis.
  • BERT sentiment: a sentiment analysis model pre-trained on German tweets, using huggingface transformers.
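As an illustration of one of these heuristics, the upper-case rule can be sketched as follows (the minimum word length is an assumption, not the notebook's tuned value):

```python
ABSTAIN, OFFENSE = -1, 1

def lf_upper_case(tokens, min_len=5):
    """Flag tweets containing long all-caps words (shouting) as offensive."""
    if any(len(t) >= min_len and t.isupper() for t in tokens):
        return OFFENSE
    return ABSTAIN
```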

In total, we implemented nine labeling functions resulting in the following matrix:

Labeling function     P       Coverage  Overlaps  Conflicts
lf_insult             [1]     0.169266  0.156166  0.147640
lf_has_nazi_words     [1]     0.039925  0.037014  0.032439
lf_has_politician     [1]     0.098565  0.095654  0.090040
lf_has_angry_puncts   [1]     0.085881  0.081306  0.072364
lf_stopword_ratio     [1]     0.045956  0.042628  0.041381
lf_upper_case         [1]     0.025577  0.023913  0.019755
lf_emoji_polarity     [1]     0.022250  0.020586  0.017675
lf_bert_sentiment     [0, 1]  0.913496  0.366604  0.324392
lf_positive_keywords  [0]     0.059472  0.056769  0.018299

In terms of coverage, many labeling functions label only small portions of the dataset. However, the labeling functions for sentiment analysis and insult detection cover much larger portions. All implementations and more exploration can be found in the linked Colab notebook at the end of the blogpost. At this point it should be mentioned that this setup is not suitable for real-world scenarios, because the labeling functions may introduce a strong bias. However, in the scope of this blogpost we skip this important analysis and focus on the technical aspects.

Snorkel Label Model

Up until this point, we defined functions to label our textual data. We could use these functions to create a simple label model ourselves. However, Snorkel provides useful tools that ease and speed up this process.

A simple approach is to hold a majority vote of the labeling functions for each instance. A majority voter provides a baseline against which to show the usefulness of more advanced generative labeling models. For this reason, Snorkel comes with a majority voting approach out of the box.
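Conceptually, the majority voter reduces to a per-row vote count over the label matrix; Snorkel's MajorityLabelVoter is the class to use in practice. A plain-NumPy sketch:

```python
import numpy as np

ABSTAIN = -1

def majority_vote(L, cardinality=2):
    """Majority vote per row; abstain on ties or when all LFs abstain."""
    preds = np.full(len(L), ABSTAIN, dtype=int)
    for i, row in enumerate(L):
        counts = np.bincount(row[row != ABSTAIN], minlength=cardinality)
        # require at least one vote and a unique winner
        if counts.sum() and (counts == counts.max()).sum() == 1:
            preds[i] = counts.argmax()
    return preds
```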

We achieve an empirical accuracy on the validation dataset of 63.5%. Note that we use the validation dataset, which has not been considered so far. Further, the majority voter predicts 86 of 340 samples (25.3%) as bad language.

However, using the majority model, we lose information in the case of a tie or when the majority of the labeling functions abstain for an instance. To resolve this and to improve performance, the real Snorkel magic comes into play. Instead of a majority vote, we use Snorkel's label model, which learns to weight the labeling functions based on their correlations. In other words, the label model is the crux of the matter in Snorkel. This paper and YouTube video provide detailed information about the model. In practice, the generative label model is completely abstracted away and only exposes a couple of parameters.

Using the generative Snorkel model, we achieve an accuracy of 65.0% on the validation dataset, slightly higher than the majority voter's 63.5%. Regarding the class distribution, the model predicts 113 of the total 340 samples (33.2%) as offensive. The generative approach of Snorkel thus predicts more offensive tweets while achieving a similar accuracy, despite having only nine labeling functions.

So far, we created labeling functions to classify text as offensive or non-offensive and used a majority voting approach as well as the generative Snorkel label model to create labels. In the scope of this blogpost, we snorkel on the water surface and do not dive deeper. Instead, let’s move on to using the labels in a classification task.


Training a Discriminative Classifier

The first question is: why not use the label model for the classification itself? In fact, the trained label model is a binary classifier that outputs probabilities for each instance. One reason for an additional classifier is that the labeling functions are overfitted to the given training dataset. Hence, the hypothesis is that the Snorkel model itself cannot generalize as well as a discriminative classifier trained on the probabilistic labels. Another problem is that we cannot generate labels for instances where all labeling functions abstain. Thus, we need a model that makes a discriminative decision p(y|x).

For this reason, we implement a simple decision tree classifier using scikit-learn to investigate the predictions on the snorkeled labels. The inputs are preprocessed using the tweet tokenizer of NLTK. Further, the features consist of unigrams, bigrams, and trigrams using the tf-idf vectorizer of scikit-learn.

Given the imbalance of the snorkeled dataset as described above, a decision tree classifier is chosen over other ML models like logistic regression. The decision tree may counteract the tendency of models to score better simply by favoring non-offensive predictions.
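A sketch of this setup with scikit-learn and NLTK (hyperparameters are defaults, not tuned values):

```python
from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# NLTK's rule-based tweet tokenizer needs no extra downloads.
tok = TweetTokenizer()

clf = Pipeline([
    # uni-, bi-, and trigram tf-idf features over tweet tokens
    ("tfidf", TfidfVectorizer(tokenizer=tok.tokenize, ngram_range=(1, 3))),
    ("tree", DecisionTreeClassifier(random_state=42)),
])
```

The pipeline is fitted on the snorkeled training texts and hard labels via `clf.fit(texts, labels)`.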

Predictions on Snorkeled Labels

With this setup, the decision tree classifier is fitted on the snorkeled dataset. However, the binary classification models of scikit-learn cannot consume probabilistic labels, so we have to map them to hard 0 or 1 labels. In a more advanced setup, one would choose a model that can use the probabilistic labels directly. For the sake of simplicity, we accept the loss of information at this point.
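This mapping, together with dropping the instances where the label model abstained, can be sketched as follows; Snorkel ships probs_to_preds and filter_unlabeled_dataframe as the real helpers, so the function names below are our own illustrative stand-ins.

```python
import numpy as np

ABSTAIN = -1

def probs_to_hard_labels(probs):
    """Collapse [p(0), p(1)] rows to hard 0/1 labels (cf. snorkel.utils.probs_to_preds)."""
    return np.argmax(probs, axis=1)

def drop_abstained(texts, preds):
    """Remove instances the label model abstained on (cf. filter_unlabeled_dataframe)."""
    mask = preds != ABSTAIN
    return [t for t, keep in zip(texts, mask) if keep], preds[mask]
```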

Thereupon, we evaluate the trained classifier on the test dataset, which has neither been used for labeling nor for training the classifier. We transform the test samples, predict the probabilities, and get the classification report of the test instances.

Class          Recall  Precision  F1
No-Offense     0.71    0.84       0.77
Offense        0.51    0.33       0.40
Weighted Avg.  0.64    0.67       0.65

The decision tree classifier achieves a weighted F1 score of 0.65. Further, the results show that offensive language is still difficult to detect. One explanation is that the labeling functions do not generalize as well as expected or that important features are missing. On top of this, the decision tree classifier is neither thoroughly investigated nor optimized. However, instead of tuning the above results, we keep the model and data labels as they are and continue by comparing the results against other approaches to show whether Snorkel provides any benefit at all.

Comparison to Other Models

Is the overhead of hand-crafting labeling functions worth it? To answer this question, the following table compares the results on the snorkeled dataset against the results of the decision tree classifier on the dataset created by the majority voting approach. Further, the baseline is a decision tree trained on the pure validation dataset (340 samples), which represents the case without any data programming approach at all.

Model                      No Offense – F1  Offense – F1  F1 – Weighted
Snorkel Label Dataset      0.77             0.40          0.65
Majority Voting Dataset    0.74             0.36          0.61
Validation Set (Baseline)  0.71             0.37          0.59

Even though the overall results are still not sufficient, the results suggest that labeling models (Snorkel and majority voting) outperform the baseline. Further, the Snorkel model has slight improvements over the majority voter.


Conclusion

Snorkel is a neat and easy-to-use framework for labeling unlabeled data. Without any doubt, however, one requires domain knowledge to identify offensive language patterns. In general, the offensive language task covers a large variety of types of offense (abuses, insults, profanities, …), which makes it difficult to create labeling functions with high coverage and still good accuracy. Hence, crafting, analyzing, and improving labeling functions only pays off at the scale of tens or hundreds of thousands of data instances.

The benefit of Snorkel for solving the problem of unlabeled datasets is still vague. As of today, less complex problems perform quite well with ML approaches that do not require much data. On the other hand, complex tasks like the detection of offensive language make data programming a time-consuming and expensive process. Nonetheless, Snorkel shows its improvements and may be of practical use on much larger datasets than the one used in this blogpost.

The Google colab notebook can be found here. Many thanks to Maximilian Blanck and Sebastian Blank for their input.

The classification of hate speech and offense is a very sensitive topic. On the one hand, platforms should inhibit the spread of hate speech, and with current legislation they are also required to do so. Due to the amount of user-generated content, AI can significantly help with that. On the other hand, the line between suppressing such content and censorship is thin. Therefore, such a system, when employed in production, should always be used in combination with human judgment, the training data must always be checked for biases, and the model for edge cases.

2 Comments

  1. Lovely tutorial, thanks!
    One question: should you also not report on Abstain F1 or is that only part of the LabelModel (generating examples)?

    1. Hi Jordy, thanks for your feedback. In response to your question, yes, abstaining is only part of the Snorkel model. If all labeling functions abstain, the Snorkel model cannot generate a label. These samples are therefore filtered with the filter_unlabeled_dataframe before classification.
      So an interesting (and missing) detail is how many training samples Snorkel actually labeled. However, I found that most samples are labeled, which is due to the high coverage of the BERT sentiment function. So an interesting future investigation could be to examine the importance of labeling functions (e.g. how many labels are actually just BERT sentiment scores?).
