{"id":17169,"date":"2020-07-30T13:26:11","date_gmt":"2020-07-30T11:26:11","guid":{"rendered":"https:\/\/www.inovex.de\/blog\/?p=17169"},"modified":"2025-08-21T07:09:31","modified_gmt":"2025-08-21T05:09:31","slug":"snorkel-weak-superversion-german-texts","status":"publish","type":"post","link":"https:\/\/www.inovex.de\/de\/blog\/snorkel-weak-superversion-german-texts\/","title":{"rendered":"Dive into Snorkel: Weak-Supervision on German Texts"},"content":{"rendered":"<p>How do we proceed if we have almost no labeled data for a machine learning model? One answer may be: combining all the knowledge we have (labeled data, distant supervision, subject matter experts) in one framework to get to the best of each world. A trending framework to apply this <strong>data programming<\/strong> pattern is\u00a0Snorkel. In a nutshell, instead of relying on ground truths, snorkel computes probabilistic labels that are noisy and not perfectly accurate. The hypothesis is that large but noisy datasets outperform small hand-labeled datasets.\u00a0This blogpost investigates Snorkel for the task of detecting bad language on German texts.<\/p>\n<p><!--more--><\/p>\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_85 counter-hierarchy ez-toc-counter ez-toc-custom ez-toc-container-direction\">\n<div class=\"ez-toc-title-container\"><p class=\"ez-toc-title\" style=\"cursor:inherit\"><\/p>\n<\/div><nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/www.inovex.de\/de\/blog\/snorkel-weak-superversion-german-texts\/#Background\" >Background<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/www.inovex.de\/de\/blog\/snorkel-weak-superversion-german-texts\/#Scenario-and-Setup\" >Scenario and Setup<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-3\" 
href=\"https:\/\/www.inovex.de\/de\/blog\/snorkel-weak-superversion-german-texts\/#Data-Preparation\" >Data Preparation<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/www.inovex.de\/de\/blog\/snorkel-weak-superversion-german-texts\/#Lets-Snorkel\" >Let&#8217;s Snorkel<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/www.inovex.de\/de\/blog\/snorkel-weak-superversion-german-texts\/#Labeling-Functions\" >Labeling Functions<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/www.inovex.de\/de\/blog\/snorkel-weak-superversion-german-texts\/#Validating-and-Improving-Labeling-Functions\" >Validating and Improving Labeling Functions<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-7\" href=\"https:\/\/www.inovex.de\/de\/blog\/snorkel-weak-superversion-german-texts\/#Snorkel-Label-Model\" >Snorkel Label Model<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-8\" href=\"https:\/\/www.inovex.de\/de\/blog\/snorkel-weak-superversion-german-texts\/#Classification\" >Classification<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-9\" href=\"https:\/\/www.inovex.de\/de\/blog\/snorkel-weak-superversion-german-texts\/#Predictions-on-Snorkeled-Labels\" >Predictions on Snorkeled Labels<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-10\" href=\"https:\/\/www.inovex.de\/de\/blog\/snorkel-weak-superversion-german-texts\/#Comparison-to-Other-Models\" >Comparison to Other Models<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-11\" 
href=\"https:\/\/www.inovex.de\/de\/blog\/snorkel-weak-superversion-german-texts\/#Conclusion\" >Conclusion<\/a><\/li><\/ul><\/nav><\/div>\n<h2><span class=\"ez-toc-section\" id=\"Background\"><\/span>Background<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The advent of recent machine learning (ML) such as Deep Neural Networks\u00a0(DNNs)\u00a0shifts the bottleneck from feature engineering to labeling immense amounts of data.\u00a0&#8222;The key limiting factor to the development of human-level artificial intelligence&#8220; are <a href=\"http:\/\/www.spacemachine.net\/views\/2016\/3\/datasets-over-algorithms\">not algorithms, but datasets<\/a>. In research, many results rely on quality-labeled datasets that scale up to millions of samples. In real-world scenarios, however, a common scenario is that historic data exists but labeled data is very limited. Hence,\u00a0a prerequisite step for any kind of machine learning model is data labeling.<\/p>\n<p>In practice, there are many ways to label data. First, performing hand-labeling by experts or advised crowd-workers sounds reasonable for smaller datasets. However, it requires time and becomes uneconomical for smaller businesses as datasets grow. Second, there are programmatic approaches based on rules, heuristics, or distant supervision. These may be effective, but given the complexity in various fields of machine learning, hand-crafted rules do not generalize over complex tasks. Therefore, a data programming framework that combines the advantages of both worlds in a single place seems to have high potential.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Scenario-and-Setup\"><\/span>Scenario and Setup<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>This blogpost addresses a task to detect offensive language in German Twitter posts about refugees and immigration. 
More specifically, we use the binary task from the <a href=\"https:\/\/github.com\/uds-lsv\/GermEval-2018-Data\">GermEval2018 dataset<\/a>, which classifies tweets as offensive or non-offensive. The dataset provides gold labels itself, which we remove from the training set for this scenario. However, we will use parts of the labeled dataset for validation and testing.<\/p>\n<p>This blog post is a practical guide that steps through the features of Snorkel. The implementation is reproducible and linked as a Colab notebook at the end of the blogpost. For more details on the\u00a0concepts and theory, the original <a href=\"https:\/\/arxiv.org\/abs\/1903.05844\">paper<\/a>\u00a0is highly recommended. That being said, let&#8217;s get started.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Data-Preparation\"><\/span>Data Preparation<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>In the first step, we fetch the GermEval dataset from the GitHub <a href=\"https:\/\/github.com\/uds-lsv\/GermEval-2018-Data\">repository<\/a> and read the two tab-separated files (training, testing) into pandas DataFrames. We use the tweet text in column 0 and the binary classification labels (offense, no offense) in column 1.<\/p>\n<pre class=\"lang:python decode:true \" title=\"Load the Germeval dataset\">import pandas as pd\r\n\r\ndef read_germeval(path):\r\n\r\n    # Column 0 holds the tweet text, column 1 the binary label\r\n\r\n    df = pd.read_csv(path, sep='\\t', usecols=[0, 1], names=[\"text\", \"label\"])\r\n\r\n    # Encode the string labels as integers: 0 = no offense, 1 = offense\r\n\r\n    df.replace({\"label\":  {\"OTHER\": 0, \"OFFENSE\": 1}}, inplace=True)\r\n\r\n    return df\r\n\r\ndf_train = read_germeval(\".\/GermEval-2018-Data\/germeval2018.training.txt\")\r\n\r\ndf_test = read_germeval(\".\/GermEval-2018-Data\/germeval2018.test.txt\")\r\n\r\n<\/pre>\n<p>Subsequently, we create four (instead of the common three) subsets for training, validation, testing, and <strong>development<\/strong>. The training set is a large dataset to be labeled by Snorkel, whereas the other subsets provide gold labels. 
The development set contains hand-labeled instances used to evaluate and optimize our labeling functions as described in subsequent sections. We sample 100 instances per class to counteract any imbalance in the dataset. On top of this, the validation dataset is used to evaluate the label model&#8217;s predictions and the test dataset is used to evaluate the classifier scores.<\/p>\n<pre class=\"lang:default decode:true\" title=\"Splitting the GermEval dataset in 4 subsets\">df_dev = df_train.groupby('label').apply(lambda s: s.sample(100)).reset_index(level=0, drop=True)\r\n\r\ndf_train.drop(df_dev.index, inplace=True)\r\n\r\ndf_valid = df_test.sample(frac=0.1)\r\n\r\ndf_test.drop(df_valid.index, inplace=True)\r\n\r\nprint('Train:', len(df_train), '\\t Dev:', len(df_dev), '\\t Test:', len(df_test), '\\t', 'Valid:', len(df_valid))\r\n\r\n# Train: 4809    Dev: 200    Test: 3058      Valid: 340\r\n\r\n<\/pre>\n<h2><span class=\"ez-toc-section\" id=\"Lets-Snorkel\"><\/span>Let&#8217;s Snorkel<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>With the four datasets in place, we move on to labeling the training data. For this purpose, labeling functions are a core component of Snorkel&#8217;s framework and provide a flexible interface to embed knowledge from different sources. The minimal requirement of a labeling function is to return a value that represents one or none (abstain) of the target classes.<\/p>\n<pre class=\"lang:python decode:true\">ABSTAIN = -1\r\n\r\nNO_OFFENSE = 0\r\n\r\nOFFENSE = 1\r\n\r\n<\/pre>\n<h3><span class=\"ez-toc-section\" id=\"Labeling-Functions\"><\/span>Labeling Functions<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>For a better understanding, see the example below, which identifies keywords in a tweet. We hypothesize that tweets referencing politicians are more likely to be offensive. 
Hence, we search for keywords indicating any of these references.<\/p>\n<pre class=\"lang:python decode:true\">from snorkel.labeling.lf.nlp import nlp_labeling_function\r\n\r\npoliticians = ['merkel', 'obama', 'trump', 'putin', 'macron']\r\n\r\n# SPACY_LANGUAGE_ID is defined elsewhere; it is assumed to name a German spaCy model\r\n\r\n@nlp_labeling_function(language=SPACY_LANGUAGE_ID)\r\ndef lf_has_politician(x):\r\n\r\n  return keyword_lookup(x, politicians, OFFENSE)<\/pre>\n<p>In order to define a labeling function, Snorkel provides the <a href=\"https:\/\/snorkel.readthedocs.io\/en\/master\/packages\/_autosummary\/labeling\/snorkel.labeling.LabelingFunction.html\">LabelingFunction<\/a>\u00a0class and the <a href=\"https:\/\/snorkel.readthedocs.io\/en\/master\/packages\/_autosummary\/labeling\/snorkel.labeling.labeling_function.html\">labeling_function<\/a> decorator. Since our task is NLP-specific, we use the decorator for the <a href=\"https:\/\/snorkel.readthedocs.io\/en\/master\/packages\/_autosummary\/labeling\/snorkel.labeling.lf.nlp.NLPLabelingFunction.html?highlight=NLPLabelingFunction#snorkel.labeling.lf.nlp.NLPLabelingFunction\">NLPLabelingFunction<\/a>,\u00a0which preprocesses the text with spaCy.\u00a0We configure spaCy as a preprocessor of the labeling functions and use its German model. For more details about the available options in spaCy, see its\u00a0<a href=\"https:\/\/spacy.io\/api\/doc\">documentation<\/a>.<\/p>\n<p>For reusability, we can generalize the keyword lookup. Given the tokenized document, we check if any of the keywords occurs in the lowercased text. If a keyword matches, the labeling function assigns the given label to the text (in our case\u00a0<em>offensive<\/em>), otherwise it abstains (<em>abstain<\/em>).<\/p>\n<pre class=\"lang:python decode:true\">def keyword_lookup(x, keywords, label):\r\n\r\n    # Concatenate the lowercased spaCy tokens into one string for substring matching\r\n\r\n    tokens = ''.join([token.lower_ for token in x.doc])\r\n\r\n    if any(word.lower() in tokens for word in keywords):\r\n\r\n        return label\r\n\r\n    return ABSTAIN<\/pre>\n<p>That&#8217;s it! 
The first labeling function is able to flag tweets that mention politicians.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Validating-and-Improving-Labeling-Functions\"><\/span>Validating and Improving Labeling Functions<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Given this simple labeling function as a starting point, we want to apply it to the training data and see how it actually performs. Snorkel provides a class <em>PandasLFApplier<\/em>, which makes it easy to apply labeling functions to a pandas DataFrame.<\/p>\n<pre class=\"lang:python decode:true\">from snorkel.labeling import PandasLFApplier\r\n\r\nlfs = [lf_has_politician]\r\n\r\napplier = PandasLFApplier(lfs)\r\n\r\nL_train = applier.apply(df_train)<\/pre>\n<p>In the above code snippet, <em>lfs<\/em> is the list of labeling functions being applied to the training set.\u00a0The\u00a0<em>apply<\/em> method\u00a0returns a labeling matrix, which we store as\u00a0<em>L_train<\/em>.\u00a0This is an <em>m<\/em> x <em>n<\/em> matrix, where <em>m<\/em> equals the number of instances in the training set and <em>n<\/em> is the number of labeling functions in <em>lfs<\/em>. Each entry holds the label that one labeling function assigned to one instance. 
We can evaluate this matrix with the\u00a0<em>LFAnalysis<\/em>\u00a0class, which outputs the following summary table.<\/p>\n<pre class=\"lang:default decode:true\">from snorkel.labeling import LFAnalysis\r\n\r\nLFAnalysis(L=L_train, lfs=lfs).lf_summary()<\/pre>\n<div class=\"table-1\">\n<table style=\"height: 48px;\" width=\"100%\">\n<thead>\n<tr style=\"height: 24px;\">\n<th style=\"height: 24px; width: 29%;\" align=\"left\"><span style=\"font-size: 12pt;\">\u00a0<\/span><\/th>\n<th style=\"height: 24px; width: 3%;\" align=\"center\"><span style=\"font-size: 12pt;\">P<\/span><\/th>\n<th style=\"height: 24px; width: 7%;\" align=\"center\"><span style=\"font-size: 12pt;\">Coverage<\/span><\/th>\n<th style=\"height: 24px; width: 7%;\" align=\"center\"><span style=\"font-size: 12pt;\">Overlaps<\/span><\/th>\n<th style=\"height: 24px; width: 10%;\" align=\"center\"><span style=\"font-size: 12pt;\">Conflicts<\/span><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr style=\"height: 24px;\">\n<td style=\"height: 24px; width: 29%;\" align=\"left\">lf_has_politician<\/td>\n<td style=\"height: 24px; width: 3%;\" align=\"center\"><span style=\"font-size: 12pt;\">[1]<\/span><\/td>\n<td style=\"height: 24px; width: 7%;\" align=\"center\">0.098565<\/td>\n<td style=\"height: 24px; width: 7%;\" align=\"center\"><span style=\"font-size: 12pt;\">0.0<\/span><\/td>\n<td style=\"height: 24px; width: 10%;\" align=\"center\"><span style=\"font-size: 12pt;\">0.0<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p>We cover around 10% of the training instances and the polarity (P) of our labeling function is [1], i.e. it only ever votes for the offensive class. 
The overlaps and conflicts become important once we have multiple labeling functions.\u00a0To measure the accuracy of the function, we apply it to the development set, for which gold labels are available.<\/p>\n<pre class=\"lang:python decode:true\">L_dev = applier.apply(df_dev)\r\n\r\nLFAnalysis(L=L_dev, lfs = lfs).lf_summary(Y = df_dev.label.values)<\/pre>\n<div class=\"table-1\">\n<table style=\"width: 99.1062%; height: 57px;\" width=\"100%\">\n<thead>\n<tr style=\"height: 48px;\">\n<th style=\"width: 27%; height: 48px; text-align: center;\" align=\"left\"><span style=\"font-size: 12pt;\">\u00a0<\/span><\/th>\n<th style=\"width: 3.03688%; height: 48px; text-align: center;\" align=\"left\"><span style=\"font-size: 12pt;\">P<\/span><\/th>\n<th style=\"width: 8.662%; height: 48px; text-align: center;\" align=\"left\"><span style=\"font-size: 12pt;\">Coverage<\/span><\/th>\n<th style=\"width: 7.75732%; height: 48px; text-align: center;\"><span style=\"font-size: 12pt;\">Correct<\/span><\/th>\n<th style=\"width: 9.01145%; height: 48px; text-align: center;\"><span style=\"font-size: 12pt;\">Incorrect<\/span><\/th>\n<th style=\"width: 17.8833%; height: 48px; text-align: center;\"><span style=\"font-size: 12pt;\">Accuracy<\/span><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr style=\"height: 24px;\">\n<td style=\"width: 27%; height: 10px; text-align: left;\" align=\"left\">lf_has_politician<\/td>\n<td style=\"width: 3.03688%; height: 10px; text-align: center;\" align=\"left\"><span style=\"font-size: 12pt;\">[1]<\/span><\/td>\n<td style=\"width: 8.662%; height: 10px; text-align: center;\" align=\"left\">0.1<\/td>\n<td style=\"width: 7.75732%; height: 10px; text-align: center;\"><span style=\"font-size: 12pt;\">13<\/span><\/td>\n<td style=\"width: 9.01145%; height: 10px; text-align: center;\">7<\/td>\n<td style=\"width: 17.8833%; height: 10px; text-align: center;\"><span style=\"font-size: 12pt;\">0.65<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p>We achieve an empirical accuracy of 
65%\u00a0on the dev set which is a good starting point. Be aware that the assumption of a generalization from this small development set to a large-scale corpus is overly simple. Nonetheless, the objective is to develop and tune the labeling function in order to\u00a0<strong>balance the coverage and accuracy<\/strong>, which turned out to be quite difficult in many cases. On the one hand, we require sufficient coverage to label enough data. On the other hand, the accuracy should be satisfactory.<\/p>\n<p>Following this idea, one should define tens or hundreds of labeling functions. In our use-case, we experiment with:<\/p>\n<ul>\n<li>Further <strong>k<\/strong><strong>eyword-based\u00a0<\/strong>functions.<\/li>\n<li>Common German <strong>s<\/strong><strong>wear words<\/strong><\/li>\n<li><strong>Upper Case<\/strong>: Texts containing long uppercase words may indicate the offensive language.<\/li>\n<li><strong>Emoji Polarity<\/strong>: We adapt the word sentiment in a similar manner to emojis.<\/li>\n<li><strong>BERT Sentiment:\u00a0<\/strong>A pre-trained sentiment analysis model on German tweets using huggingface transformers.<\/li>\n<\/ul>\n<p>In total, we implemented nine labeling functions resulting in the following matrix:<\/p>\n<div class=\"table-1\">\n<table style=\"height: 240px; width: 98.3456%;\" width=\"100%\">\n<thead>\n<tr style=\"height: 24px;\">\n<th style=\"height: 24px; width: 50.5014%;\" align=\"left\"><span style=\"font-size: 12pt;\">\u00a0Labeling function<\/span><\/th>\n<th style=\"height: 24px; width: 12.0344%; text-align: center;\" align=\"left\"><span style=\"font-size: 12pt;\">P<\/span><\/th>\n<th style=\"height: 24px; width: 12.1777%; text-align: center;\" align=\"left\"><span style=\"font-size: 12pt;\">Coverage<\/span><\/th>\n<th style=\"height: 24px; width: 11.533%; text-align: center;\" align=\"left\"><span style=\"font-size: 12pt;\">Overlaps<\/span><\/th>\n<th style=\"height: 24px; width: 12.9656%; text-align: center;\" 
align=\"left\"><span style=\"font-size: 12pt;\">Conflicts<\/span><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr style=\"height: 24px;\">\n<td style=\"height: 24px; width: 50.5014%;\"><span style=\"font-size: 12pt;\">lf_insult<\/span><\/td>\n<td style=\"height: 24px; width: 12.0344%; text-align: center;\"><span style=\"font-size: 12pt;\">[1]<\/span><\/td>\n<td style=\"height: 24px; width: 12.1777%; text-align: center;\"><span style=\"font-size: 12pt;\">\u00a00.169266\u00a0<\/span><\/td>\n<td style=\"height: 24px; width: 11.533%; text-align: center;\"><span style=\"font-size: 12pt;\"> 0.156166<\/span><\/td>\n<td style=\"height: 24px; width: 12.9656%; text-align: center;\"><span style=\"font-size: 12pt;\"> 0.147640<\/span><\/td>\n<\/tr>\n<tr style=\"height: 24px;\">\n<td style=\"height: 24px; width: 50.5014%;\"><span style=\"font-size: 12pt;\">lf_has_nazi_words<\/span><\/td>\n<td style=\"height: 24px; width: 12.0344%; text-align: center;\"><span style=\"font-size: 12pt;\">[1]<\/span><\/td>\n<td style=\"height: 24px; width: 12.1777%; text-align: center;\"><span style=\"font-size: 12pt;\">\u00a00.039925 <\/span><\/td>\n<td style=\"height: 24px; width: 11.533%; text-align: center;\"><span style=\"font-size: 12pt;\"> 0.037014<\/span><\/td>\n<td style=\"height: 24px; width: 12.9656%; text-align: center;\"><span style=\"font-size: 12pt;\">\u00a00.032439<\/span><\/td>\n<\/tr>\n<tr style=\"height: 24px;\">\n<td style=\"height: 24px; width: 50.5014%;\">\n<div>\n<div><span style=\"font-size: 12pt;\">lf_has_politician<\/span><\/div>\n<\/div>\n<\/td>\n<td style=\"height: 24px; width: 12.0344%; text-align: center;\"><span style=\"font-size: 12pt;\">[1]<\/span><\/td>\n<td style=\"height: 24px; width: 12.1777%; text-align: center;\"><span style=\"font-size: 12pt;\"> 0.098565<\/span><\/td>\n<td style=\"height: 24px; width: 11.533%; text-align: center;\">0.095654<\/td>\n<td style=\"height: 24px; width: 12.9656%; text-align: center;\"><span style=\"font-size: 12pt;\"> 
0.090040<\/span><\/td>\n<\/tr>\n<tr style=\"height: 24px;\">\n<td style=\"height: 24px; width: 50.5014%;\"><span style=\"font-size: 12pt;\">lf_has_angry_puncts<\/span><\/td>\n<td style=\"height: 24px; width: 12.0344%; text-align: center;\"><span style=\"font-size: 12pt;\">[1]<\/span><\/td>\n<td style=\"height: 24px; width: 12.1777%; text-align: center;\"><span style=\"font-size: 12pt;\">\u00a00.085881<\/span><\/td>\n<td style=\"height: 24px; width: 11.533%; text-align: center;\"><span style=\"font-size: 12pt;\">\u00a00.081306<\/span><\/td>\n<td style=\"height: 24px; width: 12.9656%; text-align: center;\"><span style=\"font-size: 12pt;\">\u00a00.072364<\/span><\/td>\n<\/tr>\n<tr style=\"height: 24px;\">\n<td style=\"height: 24px; width: 50.5014%;\"><span style=\"font-size: 12pt;\">lf_stopword_ratio<\/span><\/td>\n<td style=\"height: 24px; width: 12.0344%; text-align: center;\"><span style=\"font-size: 12pt;\">[1]<\/span><\/td>\n<td style=\"height: 24px; width: 12.1777%; text-align: center;\"><span style=\"font-size: 12pt;\">\u00a00.045956<\/span><\/td>\n<td style=\"height: 24px; width: 11.533%; text-align: center;\"><span style=\"font-size: 12pt;\"> 0.042628<\/span><\/td>\n<td style=\"height: 24px; width: 12.9656%; text-align: center;\"><span style=\"font-size: 12pt;\">\u00a00.041381<\/span><\/td>\n<\/tr>\n<tr style=\"height: 24px;\">\n<td style=\"height: 24px; width: 50.5014%;\"><span style=\"font-size: 12pt;\">lf_upper_case<\/span><\/td>\n<td style=\"height: 24px; width: 12.0344%; text-align: center;\"><span style=\"font-size: 12pt;\">[1]<\/span><\/td>\n<td style=\"height: 24px; width: 12.1777%; text-align: center;\"><span style=\"font-size: 12pt;\">0.025577<\/span><\/td>\n<td style=\"height: 24px; width: 11.533%; text-align: center;\"><span style=\"font-size: 12pt;\">0.023913<\/span><\/td>\n<td style=\"height: 24px; width: 12.9656%; text-align: center;\"><span style=\"font-size: 12pt;\">\u00a00.019755\u00a0<\/span><\/td>\n<\/tr>\n<tr style=\"height: 24px;\">\n<td 
style=\"height: 24px; width: 50.5014%;\"><span style=\"font-size: 12pt;\">lf_emoji_polarity<\/span><\/td>\n<td style=\"height: 24px; width: 12.0344%; text-align: center;\"><span style=\"font-size: 12pt;\">[1]<\/span><\/td>\n<td style=\"height: 24px; width: 12.1777%; text-align: center;\"><span style=\"font-size: 12pt;\">\u00a00.022250<\/span><\/td>\n<td style=\"height: 24px; width: 11.533%; text-align: center;\"><span style=\"font-size: 12pt;\">\u00a00.020586<\/span><\/td>\n<td style=\"height: 24px; width: 12.9656%; text-align: center;\"><span style=\"font-size: 12pt;\">\u00a00.017675<\/span><\/td>\n<\/tr>\n<tr style=\"height: 24px;\">\n<td style=\"height: 24px; width: 50.5014%;\"><span style=\"font-size: 12pt;\">lf_bert_sentiment<\/span><\/td>\n<td style=\"height: 24px; width: 12.0344%; text-align: center;\"><span style=\"font-size: 12pt;\">[0, 1]<\/span><\/td>\n<td style=\"height: 24px; width: 12.1777%; text-align: center;\"><span style=\"font-size: 12pt;\">0.913496<\/span><\/td>\n<td style=\"height: 24px; width: 11.533%; text-align: center;\"><span style=\"font-size: 12pt;\">0.366604<\/span><\/td>\n<td style=\"height: 24px; width: 12.9656%; text-align: center;\"><span style=\"font-size: 12pt;\"> 0.324392<\/span><\/td>\n<\/tr>\n<tr style=\"height: 24px;\">\n<td style=\"height: 24px; width: 50.5014%;\"><span style=\"font-size: 12pt;\">lf_positive_keywords<\/span><\/td>\n<td style=\"height: 24px; width: 12.0344%; text-align: center;\"><span style=\"font-size: 12pt;\">[0]<\/span><\/td>\n<td style=\"height: 24px; width: 12.1777%; text-align: center;\"><span style=\"font-size: 12pt;\">\u00a00.059472<\/span><\/td>\n<td style=\"height: 24px; width: 11.533%; text-align: center;\"><span style=\"font-size: 12pt;\"> 0.056769<\/span><\/td>\n<td style=\"height: 24px; width: 12.9656%; text-align: center;\"><span style=\"font-size: 12pt;\">0.018299<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p>In terms of coverage, many labeling functions analyse only small portions 
of the dataset. However, the labeling functions for sentiment analysis and insult detection cover much larger portions. All implementations and more exploration can be found in the linked Colab notebook at the end of the blogpost. At this point it should be mentioned that this dataset is not suitable for real scenarios, because labeling functions may cause a strong bias. However, in the scope of this blogpost we skip this important analysis and focus on the technical aspects.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Snorkel-Label-Model\"><\/span>Snorkel Label Model<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Up until this point, we defined functions to label our textual data. We could use these functions to create a simple label model ourselves. However,\u00a0Snorkel provides useful tools that ease and speed up this process.<\/p>\n<p>A simple approach is to hold a majority vote of labeling functions for each instance. A majority voter can provide a baseline to show the usefulness of more advanced generative models for labeling. For this reason, Snorkel ships with a majority voting approach.<\/p>\n<pre class=\"lang:python decode:true\" title=\"Using a majority voter label model\">from snorkel.labeling import MajorityLabelVoter\r\n\r\nmajority_model = MajorityLabelVoter()\r\n\r\npreds_train = majority_model.predict(L=L_train)\r\n\r\n# Apply the labeling functions to the validation set before scoring\r\n\r\nL_valid = applier.apply(df_valid)\r\n\r\nmajority_acc = majority_model.score(L=L_valid, Y=df_valid.label, tie_break_policy=\"random\")[\"accuracy\"]<\/pre>\n<p>We achieve an empirical accuracy of 63.5% on the validation dataset. Note that the validation dataset has not been used so far. Further, the majority voter predicts 86 of 340 samples (25.3%) as bad language.<\/p>\n<p>However, using the majority model, we lose information in a stalemate situation or if the majority of the labeling functions abstain for an instance. 
To resolve this and to improve the performance, the real Snorkel magic comes into play.\u00a0Instead of a majority vote, we use a label model by Snorkel that learns to weight labeling functions based on their correlations. In other words, the label model is the core of Snorkel. This <a href=\"https:\/\/arxiv.org\/abs\/1903.05844\">paper<\/a> and\u00a0<a href=\"https:\/\/www.youtube.com\/watch?v=RUPbYvzSrg0\">YouTube video<\/a> provide detailed information about the model.\u00a0In practice, the generative label model is used as a black box and exposes only a few parameters.<\/p>\n<pre class=\"lang:python decode:true\" title=\"Using a snorkel label model\">from snorkel.labeling import LabelModel\r\n\r\nlabel_model = LabelModel(cardinality=2, verbose=True)\r\n\r\nlabel_model.fit(L_train, n_epochs=500, log_freq=50, seed=42)\r\n\r\n<\/pre>\n<p>Fitting the generative Snorkel model on the training dataset, we achieve an accuracy of 65.0% on the validation dataset, which is slightly higher than the majority voter. Regarding the class distribution, the model predicts 113 of the total 340 (33.2%) as offensive. The generative approach of Snorkel thus predicts more offensive tweets while achieving a similar accuracy, even with only nine labeling functions.<\/p>\n<p>So far, we created labeling functions to classify text as offensive or non-offensive and used a majority voting approach as well as the generative Snorkel label model to create labels. In the scope of this blogpost, we snorkel on the water surface and do not dive deeper. Instead, let&#8217;s move on to using the labels in a classification task.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Classification\"><\/span>Classification<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The first question is:\u00a0<strong>Why not use the labeling model for the classification<\/strong> <strong>itself? 
<\/strong>In fact, the trained labeling model is a binary classifier that outputs probabilities for each instance.\u00a0One reason for an additional classifier is that the labeling functions are overfitted to the given training dataset. Hence, the hypothesis is that the Snorkel model itself cannot generalize as well as a discriminative classifier trained on the probabilistic labels. Another problem is that we cannot generate a label for an instance when all labeling functions abstain. Thus, we need a model that makes discriminative decisions p(y|x).<\/p>\n<p>For this reason, we will implement a simple decision tree classifier using scikit-learn to investigate the predictions on the snorkeled labels. The inputs are preprocessed using the tweet tokenizer of NLTK. Further, the features consist of unigrams, bigrams, and trigrams, weighted by the TF-IDF vectorizer of scikit-learn.<\/p>\n<pre class=\"lang:default decode:true\">from nltk.tokenize import TweetTokenizer\r\n\r\nfrom sklearn.feature_extraction.text import TfidfVectorizer\r\n\r\ntokenizer = TweetTokenizer(preserve_case=False, reduce_len=True)\r\n\r\nvectorizer = TfidfVectorizer(tokenizer=tokenizer.tokenize, ngram_range=(1, 3))<\/pre>\n<p>Given the imbalance of the snorkeled dataset described above, a decision tree classifier is chosen over alternatives such as linear models. 
The decision tree may counteract the tendency of models to perform better on the majority class, i.e. non-offensive texts.<\/p>\n<pre class=\"lang:default decode:true\">from sklearn.tree import DecisionTreeClassifier\r\n\r\nclassifier = DecisionTreeClassifier(random_state=0)<\/pre>\n<h3><span class=\"ez-toc-section\" id=\"Predictions-on-Snorkeled-Labels\"><\/span>Predictions on Snorkeled Labels<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>With this setup, the decision tree classifier is fitted on the\u00a0snorkeled dataset.\u00a0However, the scikit-learn classifier cannot be trained on probabilistic labels, so we have to map them to hard labels, 0 or 1. In a more advanced setup, one would define the problem as multi-class classification and use the probabilistic labels directly. For the sake of simplicity, we accept the loss of information at this point.<\/p>\n<pre class=\"lang:default decode:true\">from snorkel.labeling import filter_unlabeled_dataframe\r\n\r\nfrom snorkel.utils import probs_to_preds\r\n\r\nsnorkel_label_probs = label_model.predict_proba(L=L_train)\r\n\r\n# Drop instances on which all labeling functions abstained\r\n\r\nX, y = filter_unlabeled_dataframe(X=df_train.text, y=snorkel_label_probs, L=L_train)\r\n\r\nX = vectorizer.fit_transform(X)\r\n\r\n# Map the probabilistic labels to hard labels before fitting\r\n\r\nclassifier.fit(X, probs_to_preds(probs=y))<\/pre>\n<p>Afterwards, we evaluate the trained classifier on the test dataset, which has been used neither for labeling nor for training the classifier. 
We transform the test samples, predict their labels, and get the classification report for the test instances.<\/p>\n<pre class=\"lang:default decode:true\">from sklearn.metrics import classification_report\r\n\r\n# Reuse the vectorizer and classifier fitted on the snorkeled training data\r\n\r\nX = vectorizer.transform(df_test.text.tolist())\r\n\r\ny_true = df_test.label.values\r\n\r\ny_pred = classifier.predict(X)\r\n\r\nprint(classification_report(y_true, y_pred, labels=[0, 1]))<\/pre>\n<div class=\"table-1\">\n<table style=\"height: 96px; width: 70.1034%;\" width=\"70.1034%\">\n<thead>\n<tr style=\"height: 24px;\">\n<th style=\"height: 24px; width: 40.4668%;\" align=\"left\"><span style=\"font-size: 12pt;\">\u00a0Class<\/span><\/th>\n<th style=\"height: 24px; width: 15.3184%; text-align: center;\" align=\"left\"><span style=\"font-size: 12pt;\">Recall<\/span><\/th>\n<th style=\"height: 24px; width: 15.6074%; text-align: center;\" align=\"left\"><span style=\"font-size: 12pt;\">Precision<\/span><\/th>\n<th style=\"height: 24px; width: 23.2483%; text-align: center;\" align=\"left\"><span style=\"font-size: 12pt;\">F1<\/span><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr style=\"height: 24px;\">\n<td style=\"height: 24px; width: 40.4668%;\" align=\"left\"><span style=\"font-size: 12pt;\">No-Offense<\/span><\/td>\n<td style=\"height: 24px; width: 15.3184%; text-align: center;\" align=\"left\"><span style=\"font-size: 12pt;\">0.71<\/span><\/td>\n<td style=\"height: 24px; width: 15.6074%; text-align: center;\" align=\"left\"><span style=\"font-size: 12pt;\">0.84<\/span><\/td>\n<td style=\"height: 24px; width: 23.2483%; text-align: center;\" align=\"left\"><span style=\"font-size: 12pt;\">0.77<\/span><\/td>\n<\/tr>\n<tr style=\"height: 24px;\">\n<td style=\"height: 24px; width: 40.4668%;\"><span style=\"font-size: 12pt;\">Offense<\/span><\/td>\n<td style=\"height: 24px; width: 15.3184%; text-align: center;\"><span style=\"font-size: 12pt;\">0.51<\/span><\/td>\n<td style=\"height: 24px; width: 15.6074%; text-align: center;\"><span style=\"font-size: 
12pt;\">0.33<\/span><\/td>\n<td style=\"height: 24px; width: 23.2483%; text-align: center;\"><span style=\"font-size: 12pt;\">0.40<\/span><\/td>\n<\/tr>\n<tr style=\"height: 24px;\">\n<td style=\"height: 24px; width: 40.4668%;\"><span style=\"font-size: 12pt;\">Weighted Avg.<\/span><\/td>\n<td style=\"height: 24px; width: 15.3184%; text-align: center;\"><span style=\"font-size: 12pt;\">0.64<\/span><\/td>\n<td style=\"height: 24px; width: 15.6074%; text-align: center;\"><span style=\"font-size: 12pt;\">0.67<\/span><\/td>\n<td style=\"height: 24px; width: 23.2483%; text-align: center;\"><span style=\"font-size: 12pt;\">\u00a00.65<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p>The decision tree classifier achieves a weighted F1 score of 0.65. Further, the results show that offensive language is still difficult to detect. One explanation is that the labeling functions do not generalize as well as expected or that important features are missing. On top of this, the decision tree classifier has been neither thoroughly investigated nor optimized. However, instead of tuning the above results, we keep the model and data labels as they are and continue by comparing the results against other approaches to see whether Snorkel provides any benefit at all.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Comparison-to-Other-Models\"><\/span>Comparison to Other Models<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Is the overhead of hand-crafting labeling functions worth it? To answer this question, the following table compares the results of the decision tree classifier on the snorkeled dataset against its results on a dataset created by the majority voting approach. 
Further, the baseline is a decision tree trained on the pure validation dataset (340 samples) which represents the case without a data programming approach at all.<\/p>\n<div class=\"table-1\">\n<table style=\"height: 96px; width: 100%;\" width=\"100%\">\n<thead>\n<tr style=\"height: 24px;\">\n<th style=\"height: 24px; width: 32.0326%;\" align=\"left\"><span style=\"font-size: 12pt;\">\u00a0Model<\/span><\/th>\n<th style=\"height: 24px; width: 22.6236%; text-align: center;\" align=\"left\"><span style=\"font-size: 12pt;\">No Offense &#8211; F1<\/span><\/th>\n<th style=\"height: 24px; width: 21.0394%; text-align: center;\" align=\"left\"><span style=\"font-size: 12pt;\">Offense &#8211; F1<\/span><\/th>\n<th style=\"height: 24px; width: 25.3438%; text-align: center;\" align=\"left\"><span style=\"font-size: 12pt;\">F1 &#8211; Weighted<\/span><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr style=\"height: 24px;\">\n<td style=\"height: 24px; width: 32.0326%;\" align=\"left\"><span style=\"font-size: 12pt;\">Snorkel Label Dataset<\/span><\/td>\n<td style=\"height: 24px; width: 22.6236%; text-align: center;\" align=\"left\"><span style=\"font-size: 12pt;\"><strong>0.77<\/strong><\/span><\/td>\n<td style=\"height: 24px; width: 21.0394%; text-align: center;\" align=\"left\"><span style=\"font-size: 12pt;\"><strong>0.40<\/strong><\/span><\/td>\n<td style=\"height: 24px; width: 25.3438%; text-align: center;\" align=\"left\"><span style=\"font-size: 12pt;\"><strong>0.65<\/strong><\/span><\/td>\n<\/tr>\n<tr style=\"height: 24px;\">\n<td style=\"height: 24px; width: 32.0326%;\"><span style=\"font-size: 12pt;\">Majority Voting Dataset<\/span><\/td>\n<td style=\"height: 24px; width: 22.6236%; text-align: center;\"><span style=\"font-size: 12pt;\">0.74<\/span><\/td>\n<td style=\"height: 24px; width: 21.0394%; text-align: center;\"><span style=\"font-size: 12pt;\">0.36<\/span><\/td>\n<td style=\"height: 24px; width: 25.3438%; text-align: center;\"><span style=\"font-size: 
12pt;\">0.61<\/span><\/td>\n<\/tr>\n<tr style=\"height: 24px;\">\n<td style=\"height: 24px; width: 32.0326%;\"><span style=\"font-size: 12pt;\">Validation Set (Baseline)<\/span><\/td>\n<td style=\"height: 24px; width: 22.6236%; text-align: center;\"><span style=\"font-size: 12pt;\">0.71<\/span><\/td>\n<td style=\"height: 24px; width: 21.0394%; text-align: center;\"><span style=\"font-size: 12pt;\">0.37<\/span><\/td>\n<td style=\"height: 24px; width: 25.3438%; text-align: center;\"><span style=\"font-size: 12pt;\">0.59<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p>Even though the overall results are still not sufficient, they suggest that the labeling models (Snorkel and majority voting) outperform the baseline in terms of weighted F1. Further, the Snorkel model slightly improves on the majority voter.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Conclusion\"><\/span>Conclusion<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Snorkel is a neat and easy-to-use framework for labeling unlabeled data. However, one undoubtedly needs domain knowledge to identify offensive language patterns. In general, the offensive language task covers a large variety of offense types (abuse, insults, profanities, &#8230;), which makes it difficult to create labeling functions that have high coverage and still good accuracy. Hence, the time spent crafting, analyzing, and improving labeling functions only pays off with tens or hundreds of thousands of data instances.<\/p>\n<p>The benefit of Snorkel for solving the problem of unlabeled datasets is still vague. As of today, ML approaches perform quite well on less complex problems without requiring much data. On the other hand, complex tasks like the detection of offensive language make data programming a time-consuming and expensive process. 
Nonetheless, Snorkel shows improvements and may be of practical use on much larger datasets than the one used in this blogpost.<\/p>\n<p><em>The Google Colab notebook can be found <a href=\"https:\/\/drive.google.com\/file\/d\/177MU2-v74QU3phSWdojDSuyUApf9ZLYP\/view?usp=sharing\">here<\/a>. Many thanks to Maximilian Blanck and Sebastian Blank for their input.<\/em><\/p>\n<p style=\"font-size: 12pt; padding: 24px 32px; border-top: 1px solid #CCC; border-bottom: 1px solid #CCC;\">The classification of hate speech and offense is a very sensitive topic. On the one hand, platforms should inhibit the spread of hate speech, and under current legislation they are also required to do so. Given the amount of user-generated content, AI can help significantly with that. On the other hand, the line between suppressing such content and censorship is thin. Therefore, such a system, when employed in production, should always be used in combination with human judgment; the training data must always be checked for biases and the model for edge cases.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>How do we proceed if we have almost no labeled data for a machine learning model? One answer may be: combining all the knowledge we have (labeled data, distant supervision, subject matter experts) in one framework to get to the best of each world. A trending framework to apply this data programming pattern is\u00a0Snorkel. 
In [&hellip;]<\/p>\n","protected":false},"author":59,"featured_media":19389,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"inline_featured_image":false,"ep_exclude_from_search":false,"footnotes":""},"tags":[509,206,141],"service":[76,431,75],"coauthors":[{"id":59,"display_name":"Pascal Fecht","user_nicename":"pfecht"}],"class_list":["post-17169","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","tag-ai-2","tag-data-science","tag-nlp","service-artificial-intelligence","service-data-science","service-nlp"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.8 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Dive into Snorkel: Weak-Supervision on German Texts - inovex GmbH<\/title>\n<meta name=\"description\" content=\"How do we proceed if we have almost no labeled data for a machine learning model? One answer may be: combining all the knowledge we have in one framework to get to the best of each world.\u00a0This blogpost investigates the trending data programming framework Snorkel for the task of detecting bad language on German texts.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.inovex.de\/de\/blog\/snorkel-weak-superversion-german-texts\/\" \/>\n<meta property=\"og:locale\" content=\"de_DE\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Dive into Snorkel: Weak-Supervision on German Texts - inovex GmbH\" \/>\n<meta property=\"og:description\" content=\"How do we proceed if we have almost no labeled data for a machine learning model? 
One answer may be: combining all the knowledge we have in one framework to get to the best of each world.\u00a0This blogpost investigates the trending data programming framework Snorkel for the task of detecting bad language on German texts.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.inovex.de\/de\/blog\/snorkel-weak-superversion-german-texts\/\" \/>\n<meta property=\"og:site_name\" content=\"inovex GmbH\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/inovexde\" \/>\n<meta property=\"article:published_time\" content=\"2020-07-30T11:26:11+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-08-21T05:09:31+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/07\/snorkel.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1920\" \/>\n\t<meta property=\"og:image:height\" content=\"1080\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Pascal Fecht\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:image\" content=\"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/07\/snorkel-1024x576.png\" \/>\n<meta name=\"twitter:creator\" content=\"@inovexgmbh\" \/>\n<meta name=\"twitter:site\" content=\"@inovexgmbh\" \/>\n<meta name=\"twitter:label1\" content=\"Verfasst von\" \/>\n\t<meta name=\"twitter:data1\" content=\"Pascal Fecht\" \/>\n\t<meta name=\"twitter:label2\" content=\"Gesch\u00e4tzte Lesezeit\" \/>\n\t<meta name=\"twitter:data2\" content=\"12\u00a0Minuten\" \/>\n\t<meta name=\"twitter:label3\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data3\" content=\"Pascal Fecht\" \/>\n<script type=\"application\/ld+json\" 
class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/snorkel-weak-superversion-german-texts\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/snorkel-weak-superversion-german-texts\\\/\"},\"author\":{\"name\":\"Pascal Fecht\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/person\\\/8d749cbeb5212517fd4b27c45f5323f5\"},\"headline\":\"Dive into Snorkel: Weak-Supervision on German Texts\",\"datePublished\":\"2020-07-30T11:26:11+00:00\",\"dateModified\":\"2025-08-21T05:09:31+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/snorkel-weak-superversion-german-texts\\\/\"},\"wordCount\":2355,\"commentCount\":2,\"publisher\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/snorkel-weak-superversion-german-texts\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2020\\\/07\\\/snorkel.png\",\"keywords\":[\"Ai\",\"Data Science\",\"nlp\"],\"articleSection\":[\"Analytics\",\"English Content\",\"General\"],\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/snorkel-weak-superversion-german-texts\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/snorkel-weak-superversion-german-texts\\\/\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/snorkel-weak-superversion-german-texts\\\/\",\"name\":\"Dive into Snorkel: Weak-Supervision on German Texts - inovex 
GmbH\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/snorkel-weak-superversion-german-texts\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/snorkel-weak-superversion-german-texts\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2020\\\/07\\\/snorkel.png\",\"datePublished\":\"2020-07-30T11:26:11+00:00\",\"dateModified\":\"2025-08-21T05:09:31+00:00\",\"description\":\"How do we proceed if we have almost no labeled data for a machine learning model? One answer may be: combining all the knowledge we have in one framework to get to the best of each world.\u00a0This blogpost investigates the trending data programming framework Snorkel for the task of detecting bad language on German texts.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/snorkel-weak-superversion-german-texts\\\/#breadcrumb\"},\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/snorkel-weak-superversion-german-texts\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/snorkel-weak-superversion-german-texts\\\/#primaryimage\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2020\\\/07\\\/snorkel.png\",\"contentUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2020\\\/07\\\/snorkel.png\",\"width\":1920,\"height\":1080,\"caption\":\"The Snorkel Logo\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/snorkel-weak-superversion-german-texts\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Dive into Snorkel: Weak-Supervision on German 
Texts\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#website\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/\",\"name\":\"inovex GmbH\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"de\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#organization\",\"name\":\"inovex GmbH\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2021\\\/03\\\/inovex-logo-16-9-1.png\",\"contentUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2021\\\/03\\\/inovex-logo-16-9-1.png\",\"width\":1921,\"height\":1081,\"caption\":\"inovex GmbH\"},\"image\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.facebook.com\\\/inovexde\",\"https:\\\/\\\/x.com\\\/inovexgmbh\",\"https:\\\/\\\/www.instagram.com\\\/inovexlife\\\/\",\"https:\\\/\\\/www.linkedin.com\\\/company\\\/inovex\",\"https:\\\/\\\/www.youtube.com\\\/channel\\\/UC7r66GT14hROB_RQsQBAQUQ\"]},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/person\\\/8d749cbeb5212517fd4b27c45f5323f5\",\"name\":\"Pascal 
Fecht\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/74fdacbda8dec937236631d015771512e6e5d8595fef0200e4533a606147d3d6?s=96&d=retro&r=gd60589c65d60352f316d07e8c6ba0240\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/74fdacbda8dec937236631d015771512e6e5d8595fef0200e4533a606147d3d6?s=96&d=retro&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/74fdacbda8dec937236631d015771512e6e5d8595fef0200e4533a606147d3d6?s=96&d=retro&r=g\",\"caption\":\"Pascal Fecht\"},\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/author\\\/pfecht\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Dive into Snorkel: Weak-Supervision on German Texts - inovex GmbH","description":"How do we proceed if we have almost no labeled data for a machine learning model? One answer may be: combining all the knowledge we have in one framework to get to the best of each world.\u00a0This blogpost investigates the trending data programming framework Snorkel for the task of detecting bad language on German texts.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.inovex.de\/de\/blog\/snorkel-weak-superversion-german-texts\/","og_locale":"de_DE","og_type":"article","og_title":"Dive into Snorkel: Weak-Supervision on German Texts - inovex GmbH","og_description":"How do we proceed if we have almost no labeled data for a machine learning model? 
One answer may be: combining all the knowledge we have in one framework to get to the best of each world.\u00a0This blogpost investigates the trending data programming framework Snorkel for the task of detecting bad language on German texts.","og_url":"https:\/\/www.inovex.de\/de\/blog\/snorkel-weak-superversion-german-texts\/","og_site_name":"inovex GmbH","article_publisher":"https:\/\/www.facebook.com\/inovexde","article_published_time":"2020-07-30T11:26:11+00:00","article_modified_time":"2025-08-21T05:09:31+00:00","og_image":[{"width":1920,"height":1080,"url":"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/07\/snorkel.png","type":"image\/png"}],"author":"Pascal Fecht","twitter_card":"summary_large_image","twitter_image":"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/07\/snorkel-1024x576.png","twitter_creator":"@inovexgmbh","twitter_site":"@inovexgmbh","twitter_misc":{"Verfasst von":"Pascal Fecht","Gesch\u00e4tzte Lesezeit":"12\u00a0Minuten","Written by":"Pascal Fecht"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.inovex.de\/de\/blog\/snorkel-weak-superversion-german-texts\/#article","isPartOf":{"@id":"https:\/\/www.inovex.de\/de\/blog\/snorkel-weak-superversion-german-texts\/"},"author":{"name":"Pascal Fecht","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/person\/8d749cbeb5212517fd4b27c45f5323f5"},"headline":"Dive into Snorkel: Weak-Supervision on German Texts","datePublished":"2020-07-30T11:26:11+00:00","dateModified":"2025-08-21T05:09:31+00:00","mainEntityOfPage":{"@id":"https:\/\/www.inovex.de\/de\/blog\/snorkel-weak-superversion-german-texts\/"},"wordCount":2355,"commentCount":2,"publisher":{"@id":"https:\/\/www.inovex.de\/de\/#organization"},"image":{"@id":"https:\/\/www.inovex.de\/de\/blog\/snorkel-weak-superversion-german-texts\/#primaryimage"},"thumbnailUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/07\/snorkel.png","keywords":["Ai","Data 
Science","nlp"],"articleSection":["Analytics","English Content","General"],"inLanguage":"de","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.inovex.de\/de\/blog\/snorkel-weak-superversion-german-texts\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.inovex.de\/de\/blog\/snorkel-weak-superversion-german-texts\/","url":"https:\/\/www.inovex.de\/de\/blog\/snorkel-weak-superversion-german-texts\/","name":"Dive into Snorkel: Weak-Supervision on German Texts - inovex GmbH","isPartOf":{"@id":"https:\/\/www.inovex.de\/de\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.inovex.de\/de\/blog\/snorkel-weak-superversion-german-texts\/#primaryimage"},"image":{"@id":"https:\/\/www.inovex.de\/de\/blog\/snorkel-weak-superversion-german-texts\/#primaryimage"},"thumbnailUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/07\/snorkel.png","datePublished":"2020-07-30T11:26:11+00:00","dateModified":"2025-08-21T05:09:31+00:00","description":"How do we proceed if we have almost no labeled data for a machine learning model? 
One answer may be: combining all the knowledge we have in one framework to get to the best of each world.\u00a0This blogpost investigates the trending data programming framework Snorkel for the task of detecting bad language on German texts.","breadcrumb":{"@id":"https:\/\/www.inovex.de\/de\/blog\/snorkel-weak-superversion-german-texts\/#breadcrumb"},"inLanguage":"de","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.inovex.de\/de\/blog\/snorkel-weak-superversion-german-texts\/"]}]},{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/de\/blog\/snorkel-weak-superversion-german-texts\/#primaryimage","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/07\/snorkel.png","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/07\/snorkel.png","width":1920,"height":1080,"caption":"The Snorkel Logo"},{"@type":"BreadcrumbList","@id":"https:\/\/www.inovex.de\/de\/blog\/snorkel-weak-superversion-german-texts\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.inovex.de\/de\/"},{"@type":"ListItem","position":2,"name":"Dive into Snorkel: Weak-Supervision on German Texts"}]},{"@type":"WebSite","@id":"https:\/\/www.inovex.de\/de\/#website","url":"https:\/\/www.inovex.de\/de\/","name":"inovex GmbH","description":"","publisher":{"@id":"https:\/\/www.inovex.de\/de\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.inovex.de\/de\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"de"},{"@type":"Organization","@id":"https:\/\/www.inovex.de\/de\/#organization","name":"inovex 
GmbH","url":"https:\/\/www.inovex.de\/de\/","logo":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/logo\/image\/","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/2021\/03\/inovex-logo-16-9-1.png","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/2021\/03\/inovex-logo-16-9-1.png","width":1921,"height":1081,"caption":"inovex GmbH"},"image":{"@id":"https:\/\/www.inovex.de\/de\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/inovexde","https:\/\/x.com\/inovexgmbh","https:\/\/www.instagram.com\/inovexlife\/","https:\/\/www.linkedin.com\/company\/inovex","https:\/\/www.youtube.com\/channel\/UC7r66GT14hROB_RQsQBAQUQ"]},{"@type":"Person","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/person\/8d749cbeb5212517fd4b27c45f5323f5","name":"Pascal Fecht","image":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/secure.gravatar.com\/avatar\/74fdacbda8dec937236631d015771512e6e5d8595fef0200e4533a606147d3d6?s=96&d=retro&r=gd60589c65d60352f316d07e8c6ba0240","url":"https:\/\/secure.gravatar.com\/avatar\/74fdacbda8dec937236631d015771512e6e5d8595fef0200e4533a606147d3d6?s=96&d=retro&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/74fdacbda8dec937236631d015771512e6e5d8595fef0200e4533a606147d3d6?s=96&d=retro&r=g","caption":"Pascal 
Fecht"},"url":"https:\/\/www.inovex.de\/de\/blog\/author\/pfecht\/"}]}},"_links":{"self":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/17169","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/users\/59"}],"replies":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/comments?post=17169"}],"version-history":[{"count":4,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/17169\/revisions"}],"predecessor-version":[{"id":63288,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/17169\/revisions\/63288"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/media\/19389"}],"wp:attachment":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/media?parent=17169"}],"wp:term":[{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/tags?post=17169"},{"taxonomy":"service","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/service?post=17169"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/coauthors?post=17169"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}