Dive into Snorkel: Weak-Supervision on German Texts

Notice:
This post is older than 5 years – the content might be outdated.

How do we proceed if we have almost no labeled data for a machine learning model? One answer may be to combine all the knowledge we have (labeled data, distant supervision, subject matter experts) in one framework to get the best of each world. A popular framework for applying this data programming pattern is Snorkel. In a nutshell, instead of relying on ground truth, Snorkel computes probabilistic labels that are noisy and not perfectly accurate. The hypothesis is that large but noisy datasets outperform small hand-labeled datasets. This blogpost investigates Snorkel for the task of detecting bad language in German texts.

Background

The advent of recent machine learning (ML) methods such as Deep Neural Networks (DNNs) shifts the bottleneck from feature engineering to labeling immense amounts of data. "The key limiting factor to the development of human-level artificial intelligence" is not algorithms, but datasets. In research, many results rely on quality-labeled datasets that scale up to millions of samples. In real-world settings, however, it is common that historic data exists but labeled data is very limited. Hence, a prerequisite step for any kind of machine learning model is data labeling.

In practice, there are many ways to label data. First, performing hand-labeling by experts or advised crowd-workers sounds reasonable for smaller datasets. However, it requires time and becomes uneconomical for smaller businesses as datasets grow. Second, there are programmatic approaches based on rules, heuristics, or distant supervision. These may be effective, but given the complexity in various fields of machine learning, hand-crafted rules do not generalize over complex tasks. Therefore, a data programming framework that combines the advantages of both worlds in a single place seems to have high potential.

Scenario and Setup

This blogpost addresses a task to detect offensive language in German Twitter posts about refugees and immigration. More specifically, we use the binary task from the GermEval2018 dataset which classifies tweets as offensive or non-offensive. As a matter of fact, the dataset provides gold labels itself, which are removed from the training set for this scenario. However, we will use parts of the labeled dataset for validation and testing.

It should be mentioned that this blog post is a practical guide stepping through features of Snorkel. The implementation is reproducible and linked as Colab Notebook at the end of the blogpost. For more details on the concepts and theory, the original paper is highly recommended. That being said, let’s get started.

Data Preparation

In the first step, we fetch the GermEval dataset from the GitHub repository and read the two CSV files (training, testing) into pandas DataFrames. We use the binary classification labels (offense, no offense), which are stored in the column with index 1.

import pandas as pd

def read_germeval(path):
    # Column 0 holds the tweet text, column 1 the binary label.
    df = pd.read_csv(path, sep='\t', usecols=[0, 1], names=["text", "label"])
    # Encode the labels as integers: OTHER -> 0, OFFENSE -> 1.
    df.replace({"label": {"OTHER": 0, "OFFENSE": 1}}, inplace=True)
    return df

df_train = read_germeval("./GermEval-2018-Data/germeval2018.training.txt")
df_test = read_germeval("./GermEval-2018-Data/germeval2018.test.txt")

Subsequently, we create four (instead of the common three) subsets for training, validation, testing, and development. The training set is a large dataset to be labeled by Snorkel, whereas the other subsets provide gold labels. The development set contains hand-labeled instances used to evaluate and optimize our labeling functions as described in subsequent sections. For the development set, we sample 100 instances per class to counteract any imbalance in the dataset. On top of this, the validation dataset is used to evaluate the label model's predictions and the test dataset is used to evaluate the classifier scores.

df_dev = df_train.groupby('label').apply(lambda s: s.sample(100)).reset_index(level=0, drop=True)
df_train.drop(df_dev.index, inplace=True)

df_valid = df_test.sample(frac=0.1)
df_test.drop(df_valid.index, inplace=True)

print('Train:', len(df_train), '\t Dev:', len(df_dev), '\t Test:', len(df_test), '\t', 'Valid:', len(df_valid))
# Train: 4809    Dev: 200    Test: 3058      Valid: 340
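
The per-class sampling used for the development set can be illustrated without pandas. The following is a minimal sketch on a made-up label column; the helper name `sample_balanced` and the fixed seed are assumptions for illustration and are not part of the original notebook, which samples without a fixed seed.

```python
import random
from collections import defaultdict

def sample_balanced(labels, per_class, seed=0):
    """Return sorted indices with `per_class` randomly chosen items per label."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the split is reproducible
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    chosen = []
    for indices in by_label.values():
        chosen.extend(rng.sample(indices, per_class))
    return sorted(chosen)

# Toy label column: 8 non-offensive (0) and 4 offensive (1) rows.
toy_labels = [0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0]

dev_idx = sample_balanced(toy_labels, per_class=2)
# Everything that was not drawn for the development set stays in training.
train_idx = [i for i in range(len(toy_labels)) if i not in dev_idx]

print(len(dev_idx), len(train_idx))
```

Drawing the same number of instances per class keeps the development set balanced even though the full dataset is not, and removing the drawn indices keeps the subsets disjoint.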

Let’s Snorkel

With the four datasets in place, we move on to labeling the training data. For this purpose, labeling functions are a core component of the Snorkel framework and provide a flexible interface to embed knowledge from different sources. The minimal requirement of a labeling function is to return a value that represents either one of the target classes or none of them (abstain).

ABSTAIN = -1
NO_OFFENSE = 0
OFFENSE = 1

Labeling Functions

For a better understanding, see the example below, which identifies keywords in a tweet. We hypothesize that tweets containing references to politicians are more likely to be offensive. Hence, we search for keywords indicating any of these references.

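from snorkel.labeling.lf.nlp import nlp_labeling_function

# SPACY_LANGUAGE_ID is assumed to be defined beforehand and to name the German spaCy model.
politicians = ['merkel', 'obama', 'trump', 'putin', 'macron']

@nlp_labeling_function(language=SPACY_LANGUAGE_ID)
def lf_has_politician(x):
    # Vote OFFENSE if a politician keyword occurs in the tweet, otherwise abstain.
    return keyword_lookup(x, politicians, OFFENSE)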

To define a labeling function, Snorkel provides the LabelingFunction class and the labeling_function decorator. Specifically for NLP, we use the decorator for the NLPLabelingFunction, which preprocesses the text with spaCy and makes the parsed document available to the labeling function. We configure it to use spaCy's German model. For more details about the available options in spaCy, see its documentation.

For reusability, we can generalize the keyword lookup. Given the tokenized document, we check if any of the keywords exist in the given text. If a keyword matches, the labeling function defines the text as the given label (in our case offensive), otherwise as indefinite (abstain).

def keyword_lookup(x, keywords, label):
    # Concatenate the lowercased spaCy tokens so that keywords are matched as substrings.
    tokens = ''.join([token.lower_ for token in x.doc])
    if any(word.lower() in tokens for word in keywords):
        return label
    return ABSTAIN

That’s it! The first labeling function is able to flag tweets that mention one of the listed politicians.
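
To see how the lookup behaves without loading spaCy, here is a minimal sketch that applies the same matching logic to plain lists of token strings. The helper name `keyword_lookup_tokens` and the example token lists are assumptions made for illustration.

```python
ABSTAIN = -1
OFFENSE = 1

def keyword_lookup_tokens(tokens, keywords, label):
    """Same matching logic as keyword_lookup, but on a plain list of token strings."""
    text = ''.join(token.lower() for token in tokens)
    if any(word.lower() in text for word in keywords):
        return label
    return ABSTAIN

politicians = ['merkel', 'obama', 'trump', 'putin', 'macron']

# A keyword inside a longer token still matches, because matching is substring-based.
hit = keyword_lookup_tokens(['Die', 'Merkel-Regierung', 'schon', 'wieder'], politicians, OFFENSE)

# No keyword present: the function abstains instead of voting NO_OFFENSE.
miss = keyword_lookup_tokens(['Schönes', 'Wetter', 'heute'], politicians, OFFENSE)

# Caveat: tokens are joined without a separator, so a match can span two tokens.
spanning = keyword_lookup_tokens(['tru', 'mp'], politicians, OFFENSE)

print(hit, miss, spanning)  # 1 -1 1
```

The substring match is convenient for German compounds, but the last case shows that it can also produce matches across token boundaries, which is worth keeping in mind when inspecting false positives.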

Validating and Improving Labeling Functions

Given this simple labeling function as a starting point, we want to apply it to the training data and see how it actually performs. Snorkel provides a class PandasLFApplier, which makes it easy to apply labeling functions on a pandas DataFrame.

from snorkel.labeling import PandasLFApplier

lfs = [lf_has_politician]
applier = PandasLFApplier(lfs)
L_train = applier.apply(df_train)

In the above code snippet, lfs is the list of labeling functions being applied to the training set. The apply method returns a label matrix, which we call L_train. This is an m x n matrix, where m is the number of instances in the training set and n is the number of labeling functions in lfs. Each entry holds the label that one labeling function assigned to one instance. We can evaluate this matrix with LFAnalysis, which outputs the following summary.

from snorkel.labeling import LFAnalysis

LFAnalysis(L=L_train, lfs=lfs).lf_summary()

Labeling function	P	Coverage	Overlaps	Conflicts
lf_has_politician	[1]	0.098565	0.0	0.0

We cover around 10% of the training instances, and the polarity (P) of our labeling function is 1, i.e. it only ever votes for the offensive class. The overlaps and conflicts become important once we have multiple labeling functions. In order to measure the accuracy of the function, we apply our labeling functions to the development set.

L_dev = applier.apply(df_dev)
LFAnalysis(L=L_dev, lfs=lfs).lf_summary(Y=df_dev.label.values)

Labeling function	P	Coverage	Correct	Incorrect	Accuracy
lf_has_politician	[1]	0.1	13	7	0.65

We achieve an empirical accuracy of 65% on the dev set which is a good starting point. Be aware that the assumption of a generalization from this small development set to a large-scale corpus is overly simple. Nonetheless, the objective is to develop and tune the labeling function in order to balance the coverage and accuracy, which turned out to be quite difficult in many cases. On the one hand, we require sufficient coverage to label enough data. On the other hand, the accuracy should be satisfactory.
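
The reported dev-set figures are consistent with each other: a coverage of 0.1 on 200 instances means 20 votes, of which 13 are correct, giving 13 / 20 = 0.65. The following minimal sketch rebuilds these numbers from synthetic vote and gold-label lists; the helper name `empirical_accuracy` and the synthetic lists are assumptions, arranged only so that the counts match the table.

```python
ABSTAIN = -1

def empirical_accuracy(votes, gold):
    """Return (correct, incorrect, accuracy) over the instances the function voted on."""
    pairs = [(v, y) for v, y in zip(votes, gold) if v != ABSTAIN]
    if not pairs:
        return 0, 0, None  # accuracy is undefined if the function always abstains
    correct = sum(v == y for v, y in pairs)
    return correct, len(pairs) - correct, correct / len(pairs)

# Synthetic dev set with 200 instances (100 per class): the function votes
# OFFENSE on the first 20 and abstains on the remaining 180.
votes = [1] * 20 + [ABSTAIN] * 180
gold = [1] * 13 + [0] * 7 + [1] * 87 + [0] * 93

correct, incorrect, accuracy = empirical_accuracy(votes, gold)
coverage = (correct + incorrect) / len(votes)

print(correct, incorrect, accuracy, coverage)  # 13 7 0.65 0.1
```

Note that the accuracy is computed only over the 20 instances that received a vote, which is why a labeling function can be accurate and still contribute little when its coverage is low.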

Following this idea, one should define tens or hundreds of labeling functions. In our use-case, we experiment with:

Further keyword-based functions.
Common German swear words
Upper Case: Texts containing long uppercase words may indicate offensive language.
Emoji Polarity: We apply the idea of word sentiment to emojis in a similar manner.
BERT Sentiment: A pre-trained sentiment analysis model on German tweets using huggingface transformers.
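
As an illustration of how small such a heuristic can be, here is a minimal sketch of an upper-case check in plain Python. It is not the implementation from the notebook: the function name `lf_upper_case_sketch` and the `min_length` threshold of 5 letters are assumptions chosen for the example.

```python
ABSTAIN = -1
OFFENSE = 1

def lf_upper_case_sketch(text, min_length=5):
    """Vote OFFENSE if the text contains an all-caps word with at least min_length letters."""
    for word in text.split():
        # Ignore punctuation and digits attached to the word.
        letters = [ch for ch in word if ch.isalpha()]
        if len(letters) >= min_length and all(ch.isupper() for ch in letters):
            return OFFENSE
    return ABSTAIN

shouting = lf_upper_case_sketch("Das ist eine FRECHHEIT!")
short_caps = lf_upper_case_sketch("Grüße aus den USA")

print(shouting, short_caps)  # 1 -1
```

The length threshold keeps short abbreviations from triggering the function, which trades some coverage for accuracy in the sense discussed above.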

In total, we implemented nine labeling functions resulting in the following matrix:

Labeling function	P	Coverage	Overlaps	Conflicts
lf_insult	[1]	0.169266	0.156166	0.147640
lf_has_nazi_words	[1]	0.039925	0.037014	0.032439
lf_has_politician	[1]	0.098565	0.095654	0.090040
lf_has_angry_puncts	[1]	0.085881	0.081306	0.072364
lf_stopword_ratio	[1]	0.045956	0.042628	0.041381
lf_upper_case	[1]	0.025577	0.023913	0.019755
lf_emoji_polarity	[1]	0.022250	0.020586	0.017675
lf_bert_sentiment	[0, 1]	0.913496	0.366604	0.324392
lf_positive_keywords	[0]	0.059472	0.056769	0.018299

In terms of coverage, most labeling functions label only small portions of the dataset. However, the labeling functions for sentiment analysis and insult detection cover much larger portions. All implementations and further exploration can be found in the Colab notebook linked at the end of the blogpost. At this point it should be mentioned that the resulting dataset is not suitable for real scenarios, because the labeling functions may cause a strong bias. However, in the scope of this blogpost we skip this important analysis and focus on the technical aspects.
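
To make the Coverage, Overlaps, and Conflicts columns more concrete, the following minimal sketch recomputes them on a small, made-up label matrix in plain Python. The helper name `lf_stats` and the toy matrix are assumptions for illustration; the sketch reads the columns as the fraction of instances a function labels, the fraction it labels together with at least one other function, and the fraction where at least one other function disagrees with it.

```python
ABSTAIN = -1

def lf_stats(L):
    """Return one (coverage, overlaps, conflicts) tuple per labeling function.

    L is a list of rows: one row per instance, one column per labeling function.
    """
    m, n = len(L), len(L[0])
    stats = []
    for j in range(n):
        covered = overlaps = conflicts = 0
        for row in L:
            if row[j] == ABSTAIN:
                continue
            covered += 1
            # Votes of all other labeling functions on the same instance.
            others = [v for k, v in enumerate(row) if k != j and v != ABSTAIN]
            if others:
                overlaps += 1
            if any(v != row[j] for v in others):
                conflicts += 1
        stats.append((covered / m, overlaps / m, conflicts / m))
    return stats

# Toy label matrix: 5 instances, 2 labeling functions.
L_toy = [
    [1, -1],   # only the first function votes
    [1, 0],    # both vote and disagree
    [-1, 0],   # only the second function votes
    [-1, -1],  # both abstain
    [1, 1],    # both vote and agree
]

for coverage, overlaps, conflicts in lf_stats(L_toy):
    print(coverage, overlaps, conflicts)
```

Both toy functions label three of five instances, overlap on two, and conflict on one. By this reading, a function with a single polarity can only show conflicts if another function votes for a different class on the same instance.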

Snorkel Label Model

Up until this point, we defined functions to label our textual data. We could use these functions to create a simple label model ourselves. However, Snorkel provides useful tools that ease and speed up this process.

A simple approach is to hold a majority vote of labeling functions for each instance. A majority voter can provide a baseline to show the usefulness of more advanced generative models for labeling. For this reason, Snorkel by default comes with a majority voting approach.

from snorkel.labeling import MajorityLabelVoter

# Label matrix for the validation set, needed for scoring below.
L_valid = applier.apply(df_valid)

majority_model = MajorityLabelVoter()
preds_train = majority_model.predict(L=L_train)
majority_acc = majority_model.score(L=L_valid, Y=df_valid.label.values, tie_break_policy="random")["accuracy"]

We achieve an empirical accuracy of 63.5% on the validation dataset. Note that the validation dataset has not been used so far. Further, the majority voter predicts 86 of 340 samples (25.3%) as bad language.
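
The voting rule itself is simple enough to write down in a few lines. The following is a minimal sketch in plain Python, not Snorkel's implementation; the function name `majority_vote` is an assumption, and ties are resolved by abstaining here, whereas the scoring call above breaks ties randomly.

```python
from collections import Counter

ABSTAIN = -1

def majority_vote(row):
    """Return the most frequent non-abstain label, or ABSTAIN if there is no vote or a tie."""
    counts = Counter(v for v in row if v != ABSTAIN)
    if not counts:
        return ABSTAIN  # every labeling function abstained
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # stalemate between the two most frequent labels
    return ranked[0][0]

print(majority_vote([1, 1, 0]))     # 1
print(majority_vote([1, 0, -1]))    # -1 (tie)
print(majority_vote([-1, -1, -1]))  # -1 (no votes)
print(majority_vote([0, -1, -1]))   # 0
```

Every vote counts equally in this scheme, regardless of how reliable the labeling function behind it is.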

However, using the majority model, we lose information in a stalemate situation or if the majority of the labeling functions abstain for an instance. To resolve this and to improve the performance, the real Snorkel magic comes into play. Instead of a majority vote, we use a label model by Snorkel that learns to weight labeling functions based on their correlations. In other words, the label model is the crux of the matter in Snorkel. This paper and YouTube video provide detailed information about the model. In practice, the generative label model is fully abstracted away and only exposes a couple of parameters.

from snorkel.labeling import LabelModel

label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train, n_epochs=500, log_freq=50, seed=42)

After fitting the generative Snorkel model on the training dataset, we achieve an accuracy of 65.0% on the validation dataset, which is slightly higher than the majority voter. Regarding the class distribution, the model predicts 113 of the total 340 (33.2%) as offensive. The generative approach of Snorkel predicts more offensive tweets while achieving a similar accuracy, even with only nine labeling functions.
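
To build some intuition for why weighting votes can differ from counting them, consider the following simplified sketch. It is not Snorkel's label model: it assumes that the accuracy of each labeling function is already known, that the functions are conditionally independent given the true label, and that abstaining carries no information. Under these assumptions, each vote contributes the log-odds of its accuracy. The function name `weighted_vote_proba` and the accuracy values are made up for illustration.

```python
import math

ABSTAIN = -1

def weighted_vote_proba(row, accuracies, prior_offense=0.5):
    """P(offense | votes) for binary labels under the independence assumption.

    Accuracies must lie strictly between 0 and 1.
    """
    log_odds = math.log(prior_offense / (1 - prior_offense))
    for vote, acc in zip(row, accuracies):
        if vote == ABSTAIN:
            continue  # abstaining adds no evidence in this sketch
        weight = math.log(acc / (1 - acc))
        # A vote for class 1 pushes towards offense, a vote for class 0 away from it.
        log_odds += weight if vote == 1 else -weight
    return 1 / (1 + math.exp(-log_odds))

# One reliable function votes OFFENSE, two weak functions vote NO_OFFENSE.
accuracies = [0.9, 0.6, 0.6]
p_offense = weighted_vote_proba([1, 0, 0], accuracies)

print(round(p_offense, 3))  # 0.8
```

A plain majority vote would label this instance as non-offensive, while the weighted combination leans towards offensive because the single reliable vote outweighs the two weak ones. Snorkel's label model has to estimate such weights without gold labels, which is the part this sketch leaves out.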

So far, we created labeling functions to classify text as offensive or non-offensive and used a majority voting approach as well as the generative snorkel label model to create labels. In the scope of this blogpost, we snorkel on the water surface and do not dive deeper. Instead, let’s move on using the labels in a classification task.

Classification

The first question is: Why not use the label model for the classification itself? In fact, the trained label model is a binary classifier that outputs probabilities for each instance. One reason for an additional classifier is that the labeling functions are overfitted to the given training dataset. Hence, the hypothesis is that the Snorkel model itself cannot generalize as well as a discriminative classifier trained on the probabilistic labels. Another problem is that we cannot generate labels for an instance when all labeling functions abstain. Thus, we need a model that makes a discriminative decision p(y|x).

For this reason, we will implement a simple decision tree classifier using scikit-learn to investigate the predictions on the snorkeled labels. The inputs are preprocessed using the tweet tokenizer of nltk. Further, the features consist of unigrams, bigrams, and trigrams using the TF-IDF vectorizer of scikit-learn.

from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer

tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True)
vectorizer = TfidfVectorizer(tokenizer=tokenizer.tokenize, ngram_range=(1, 3))
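
To show what the n-gram range of 1 to 3 means for the feature space, here is a minimal sketch that enumerates the word n-grams of a tokenized text in plain Python. The helper name `word_ngrams` is an assumption, and the sketch only lists the candidate features; it does not compute any TF-IDF weights.

```python
def word_ngrams(tokens, ngram_range=(1, 3)):
    """List all word n-grams of the token sequence for n within ngram_range (inclusive)."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens over the sequence.
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

grams = word_ngrams(['das', 'ist', 'ein', 'test'])
print(len(grams))  # 9: four unigrams, three bigrams, two trigrams
print(grams)
```

Even a four-token text yields nine features, so the vocabulary grows quickly on a few thousand tweets, and many bigrams and trigrams occur only once.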

Given the imbalance of the snorkeled dataset described above, a decision tree classifier is chosen over other ML models like linear regression. The decision tree may counteract the tendency of models to achieve better results on non-offensive texts, the majority class.

from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier(random_state=0)

Predictions on Snorkeled Labels

With this setup, the decision tree classifier is fitted on the snorkeled dataset. However, the binary classification model of scikit-learn cannot use probabilistic labels, so we have to map our probabilistic labels to 0 or 1, respectively. In a more advanced setup, one would define the problem as a multi-class classification and use the probabilistic labels directly. For the sake of simplicity, we accept the loss of information at this point.

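from snorkel.labeling import filter_unlabeled_dataframe
from snorkel.utils import probs_to_preds

snorkel_label_probs = label_model.predict_proba(L=L_train)
# Drop instances on which all labeling functions abstained.
X, y = filter_unlabeled_dataframe(X=df_train.text, y=snorkel_label_probs, L=L_train)
X = vectorizer.fit_transform(X)
# Map the probabilistic labels to hard 0/1 labels for scikit-learn.
classifier.fit(X, probs_to_preds(probs=y))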
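
The two preparation steps, dropping unlabeled instances and turning probabilities into hard labels, can be sketched in plain Python as follows. This is an illustration of the idea rather than Snorkel's code: the function name `filter_and_harden` and the toy inputs are assumptions, and a tie between two classes simply resolves to the lower class index here.

```python
ABSTAIN = -1

def filter_and_harden(texts, probs, L):
    """Drop rows where every labeling function abstained, then take the argmax label."""
    kept_texts, hard_labels = [], []
    for text, p, row in zip(texts, probs, L):
        if all(v == ABSTAIN for v in row):
            continue  # no labeling function voted, so there is no usable label
        kept_texts.append(text)
        # Index of the most probable class; ties go to the lower index.
        hard_labels.append(max(range(len(p)), key=lambda k: p[k]))
    return kept_texts, hard_labels

texts = ['tweet a', 'tweet b', 'tweet c']
probs = [[0.2, 0.8], [0.5, 0.5], [0.9, 0.1]]
L_toy = [[1, -1], [-1, -1], [0, 0]]

kept, labels = filter_and_harden(texts, probs, L_toy)
print(kept, labels)  # ['tweet a', 'tweet c'] [1, 0]
```

The hard labels no longer distinguish a confident 0.9 from a borderline 0.55, which is the loss of information mentioned above.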

Next, we evaluate the trained classifier on the test dataset, which has been used neither for labeling nor for training the classifier. We transform the test samples, predict their labels, and get the classification report for the test instances.

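from sklearn.metrics import classification_report

# Reuse the vectorizer and classifier fitted on the snorkeled training data.
X_test = vectorizer.transform(df_test.text.tolist())
y_true = df_test.label.values
y_pred = classifier.predict(X_test)
print(classification_report(y_true, y_pred, labels=[0, 1]))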

Class	Recall	Precision	F1
No-Offense	0.71	0.84	0.77
Offense	0.51	0.33	0.40
Weighted Avg.	0.64	0.67	0.65

The decision tree classifier achieves a weighted F1 score of 0.65. Further, the results show that offensive language is still difficult to detect. One possible explanation is that the labeling functions do not generalize as well as expected or that important features are missing. On top of this, the decision tree classifier is neither thoroughly investigated nor optimized. However, instead of pushing the above results, we keep the model and data labels as they are and continue by comparing the results against other approaches to see whether Snorkel provides a benefit at all.
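
As a quick consistency check, the per-class F1 scores in the table follow from the reported precision and recall via the harmonic mean, F1 = 2 · P · R / (P + R). The sketch below recomputes them; the helper name `f1_from` is an assumption. The weighted average cannot be recomputed this way, because it additionally depends on the number of test instances per class, which the table does not list.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) per class as reported in the table above.
reported = {"No-Offense": (0.84, 0.71), "Offense": (0.33, 0.51)}

for name, (precision, recall) in reported.items():
    print(name, round(f1_from(precision, recall), 2))
# No-Offense 0.77
# Offense 0.4
```

The low precision of 0.33 on the offensive class is what pulls its F1 score down: roughly two out of three tweets flagged as offensive are false alarms.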

Comparison to Other Models

Is the overhead of hand-crafting labeling functions worth it? In order to answer this question, the following table compares the results of the decision tree classifier trained on the snorkeled dataset against the results of the same classifier trained on the dataset created by the majority voting approach. Further, the baseline is a decision tree trained on the pure validation dataset (340 samples), which represents the case without a data programming approach at all.

Model	No Offense – F1	Offense – F1	F1 – Weighted
Snorkel Label Dataset	0.77	0.40	0.65
Majority Voting Dataset	0.74	0.36	0.61
Validation Set (Baseline)	0.71	0.37	0.59

Even though the overall results are still not sufficient, the results suggest that labeling models (Snorkel and majority voting) outperform the baseline. Further, the Snorkel model has slight improvements over the majority voter.

Conclusion

Snorkel is a neat and easy-to-use framework to label unlabeled data. Without any doubt, however, one requires domain knowledge in order to identify offensive language patterns. In general, the offensive language task covers a large variety of different types of offense (abuses, insults, profanities, …), which makes it difficult to create labeling functions that have high coverage and still good accuracy. Hence, the time spent crafting, analyzing, and improving labeling functions only pays off with tens or hundreds of thousands of data instances.

The benefit of Snorkel for solving the problem of unlabeled datasets is still vague. As of today, ML approaches perform quite well on less complex problems without requiring much data. On the other hand, complex tasks like the detection of offensive language make data programming a time-consuming and expensive process. Nonetheless, Snorkel shows improvements over the baselines and may be of practical use on much larger datasets than the one used in this blogpost.

The Google colab notebook can be found here. Many thanks to Maximilian Blanck and Sebastian Blank for their input.

The classification of hate speech and offense is a very sensitive topic. On the one hand, platforms should inhibit the spread of hate speech, and with current legislation they are also required to do so. Due to the amount of user-generated content, AI can significantly help to do that. On the other hand, the line between suppressing such content and censorship is thin. Therefore, such a system, when employed in production, should always be used in combination with human judgment, the training data must always be checked for biases, and the model for edge cases.

2 Comments

Jordy says:

31.07.2020 at 8:46 a.m.

Lovely tutorial, thanks!
One question: should you also not report on Abstain F1 or is that only part of the LabelModel (generating examples)?

1. Pascal Fecht says:
  
  31.07.2020 at 12:33 p.m.
  
  Hi Jordy, thanks for your feedback. In response to your question, yes, abstaining is only part of the Snorkel model. If all labeling functions abstain, the Snorkel model cannot generate a label. These samples are therefore filtered with the filter_unlabeled_dataframe before classification.
  So an interesting (and missing) detail is how many training samples Snorkel actually labeled. However, I found that most samples are labeled, which is due to the high coverage of the BERT sentiment function. So an interesting future investigation could be to examine the importance of labeling functions (e.g. how many labels are actually just BERT sentiment scores?).
