Entity matching is used to identify the same entity in different datasets, as it occurs in data integration, for example. Simplified, entity matching uses binary classification to determine whether two samples describe the same entity (match) or not (non-match).
Contrastive learning is a deep-learning technique that can enhance the representation of a minority class. As the name suggests, a differentiation is made between samples. This requires one or more similar samples (positives) as well as dissimilar samples (negatives). A common approach is to use random negatives for contrastive learning. Here, related work shows optimisation potential through the careful selection of negatives. Accordingly, suitable negatives have the property of being similar to a sample without being a match. This similarity search problem also exists in the blocking step of an entity matching pipeline.
In this blog post, we examine whether entity matching’s blocking can be used for the sample selection of contrastive learning on text data. Additionally, the impact on learning domain-specific representations is investigated. We introduce a novel method for combining entity matching and contrastive learning in a mutually beneficial way: on the one hand, entity matching is characterised by an unbalanced ratio between its two classes, so contrastive learning seems a viable means to address this unbalanced classification problem. On the other hand, entity matching’s blocking is used for the sample selection of contrastive learning.
Let’s start with an introduction to entity matching, entity matching’s blocking, contrastive learning, and how they relate to each other. This is followed by a description of the approach developed in the context of my master’s thesis and the evaluation results.
Entity Matching
Entity matching is the identification of the same entity in different datasets through classification. For example, if sample a from dataset A and sample b from dataset B describe the same entity, they are considered a match; otherwise, they are considered a non-match. Entity matching is an unbalanced classification problem, as the number of matches is significantly smaller than the number of non-matches.
This article focuses on text data that contains at least a product name and a description. Entity matching is used in online retailing, especially in comparison portals, which show the user an overview of prices, product names, and descriptions of the same product across various retailers and marketplaces. Depending on the seller’s website, the associated product descriptions may be available in inconsistent data formats, e.g., unstructured in text form or structured as a table. Challenges are incompleteness, redundancy, and incorrect information. Furthermore, text data includes different spellings, misspellings, abbreviations, and the ambiguity of natural language.
To exemplify further, the following table shows three product names and descriptions. Lines 1 and 2 form a match, while line 3 matches neither line 1 nor line 2. For human readers, the distinction is obvious from the headphone generation in the product name (2nd Gen., 2nd Generation). Based on the product description alone, this is no longer quite so simple, as lines 1 and 2, for example, contain no directly matching information.
Line | Product Name | Description |
---|---|---|
1 | SOUNDWAVE headphones 2nd Gen. with charging case | Experience music in a whole new way with the new in-ear headphones from SOUNDWAVE. The wireless headphones support attachments in 5 different sizes. |
2 | SOUNDWAVE headphones with charging case 2nd Generation (123N6D) | - In-ear headphones - 5 hours battery life |
3 | SOUNDWAVE headphones with charging case 3rd Generation (123N7X) | - 6 hours battery life - Optional wired |
Deep Learning based Entity Matching
Classic entity matching approaches originate from the use case of data integration and are, among other things, rule-based; however, they reach their limits with unstructured and heterogeneous data, see (Barlaug & Gulla, 2021) and (Christen, 2012). Deep-learning-based entity matching, on the other hand, focuses on heterogeneous schemas and text.
A detailed introduction to types of data (structured, unstructured, semi-structured) and entity matching in the context of Big Data is given in this article: Similarity Search and Deduplication at Scale.
Entity matching with deep neural networks differs from the classical approach. A deep-learning-based pipeline consists of a trainable encoder and a trainable classification component, as shown in figure 1, which depicts a reference model for deep-learning-based entity matching (Barlaug & Gulla, 2021).

With DeepMatcher, Seq2SeqMatcher, and HierMatcher, a variety of machine-learning-based frameworks exist. A detailed comparison of the architectures can be found in the survey of (Barlaug & Gulla, 2021). Worth mentioning is the entity matching system Ditto, one of the first approaches based on Transformer encoders such as BERT and RoBERTa. As a result, Ditto is able to process homogeneous as well as heterogeneous data.
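To make this more concrete, here is a minimal sketch of how a pair of product records could be serialised into a single sequence and classified with a RoBERTa sequence-classification head via the Hugging Face transformers library. It illustrates the general idea of Transformer-based matching, not Ditto’s actual serialisation scheme, and the classification head is untrained.

```python
# Minimal sketch: classify a pair of product records as match / non-match
# with a RoBERTa encoder plus an (untrained) classification head.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

record_a = "SOUNDWAVE headphones 2nd Gen. with charging case"
record_b = "SOUNDWAVE headphones with charging case 2nd Generation (123N6D)"

# Both records are serialised into one sequence, separated by the model's
# separator token; the classification head then decides match vs. non-match.
inputs = tokenizer(record_a, record_b, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print("P(match) =", torch.softmax(logits, dim=-1)[0, 1].item())
```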
Blocking for Entity Matching
Blocking is a step in the inference phase of entity matching and is motivated by two problems of naive entity matching. First, entity matching is an unbalanced classification problem, as the number of matches is low. Second, the number of classifications required for two datasets A and B corresponds to the size of their Cartesian product \(|A \times B| = |A| \cdot |B|\); two datasets with 10,000 samples each already require 100 million comparisons.
The objective of blocking is to identify a smaller set of match candidates from the set of all possible pairs, to which the classification step is then applied. To select these candidates, blocks of similar samples are created that meet a filter criterion, and only candidates from the same block are classified. Ideally, the candidate set contains all true matches while being as small as possible, in order to minimise the number of classifications.
Character-based blocking is one category of blocking that filters by equality of words or parts of words, e.g., Standard Blocking, N-gram Blocking, or Canopy Clustering Blocking; an overview can be found in the work of (Christen, 2012). However, because it tests for equality, semantics and context are not taken into account.
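As an illustration, the following sketch shows a simplified form of Standard Blocking that uses lower-cased word tokens as blocking keys (an n-gram variant would use substrings of the tokens instead). The records are made up for this example.

```python
from collections import defaultdict
from itertools import product

records_a = {1: "SOUNDWAVE headphones 2nd Gen. with charging case",
             2: "SOUNDWAVE headphones with charging case 2nd Generation"}
records_b = {3: "SOUNDWAVE headphones with charging case 3rd Generation"}

def blocks(records):
    """Standard blocking: each lower-cased token becomes a blocking key."""
    index = defaultdict(set)
    for rid, text in records.items():
        for token in text.lower().split():
            index[token].add(rid)
    return index

blocks_a, blocks_b = blocks(records_a), blocks(records_b)

# Only pairs that share at least one blocking key become match candidates.
candidates = set()
for key in blocks_a.keys() & blocks_b.keys():
    candidates |= set(product(blocks_a[key], blocks_b[key]))
print(candidates)  # usually far fewer pairs than the full Cartesian product A x B
```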
In order to capture the semantics and context of text data with machine learning, words, sentences, or paragraphs are transformed into vectors. Using these embeddings, similarity or distance metrics such as the Euclidean distance can be used for filtering. Due to the high dimensionality of embeddings, approaches based on approximate nearest neighbour search are mostly used.
Approximate nearest neighbour-based blocking includes hashing-based approaches such as locality-sensitive hashing (LSH). (Azzalini et al., 2020) compare hashing-based approaches combined with dimensionality reduction techniques (t-SNE, PCA) and clustering (k-means, BIRCH). Clustering by product quantization is also viable.
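As a minimal sketch of embedding-based blocking, the snippet below uses faiss’s LSH index to retrieve a small candidate set per sample. The embeddings are random stand-ins; in practice they would come from a (contrastive) text encoder.

```python
import faiss
import numpy as np

d, n = 128, 10_000                                  # embedding dimension, number of samples
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((n, d)).astype("float32")  # stand-in embeddings

index = faiss.IndexLSH(d, 256)                      # 256-bit LSH codes
index.add(embeddings)

# Blocking: retrieve the 10 most similar samples (by hash code) per query and
# use only these as match candidates for the classification step.
_, candidate_ids = index.search(embeddings[:5], 10)
print(candidate_ids)
```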
A complete entity matching pipeline is only provided by DeepER, which uses LSH for blocking; other frameworks are limited to the classification step. With a focus on machine-learning-based blocking, Autoblock uses a bidirectional LSTM to optimise LSH. Further literature on deep learning for blocking can be found in the work of (Thirumuruganathan et al., 2021).
Contrastive Learning
Contrastive learning is a self-supervised training technique that encodes similarities and differences between input samples as a property of their embeddings. The similarity of samples is thus modeled as a short distance between their embeddings. This can help to improve the embeddings of minority classes, a problem that also occurs in entity matching.
To train a contrastive encoder in a self-supervised fashion, the following components are needed; figure 2 illustrates the corresponding simplified model:
- positive samples (green) – also called positives
- negative samples (red) – also called negatives
- a contrastive loss function (grey)

Self-supervised means that the required labels are generated automatically rather than through manual labeling. Positives are samples similar to a reference sample; for each reference sample, one or more such positives are generated. What similarity means depends on the input type. For images, corresponding samples are generated, e.g., by rotating or changing colour and brightness values. For text data, this can be implemented by swapping letters or words, among other things. The work of (Li et al., 2022) compares different methods for text augmentation.
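For text, a simple augmentation along these lines could look like the following sketch; the probabilities and the exact noise operations are illustrative choices, not the augmentation scheme of any specific paper.

```python
import random

def augment_text(text, p_swap=0.1, p_noise=0.05, seed=None):
    """Generate a positive by lightly corrupting a product description (sketch).

    p_swap:  probability of swapping a word with its right neighbour
    p_noise: probability of transposing two adjacent characters inside a word
    """
    rng = random.Random(seed)
    words = text.split()
    # word-level noise: swap neighbouring words
    for i in range(len(words) - 1):
        if rng.random() < p_swap:
            words[i], words[i + 1] = words[i + 1], words[i]
    # character-level noise: transpose two adjacent characters
    for i, w in enumerate(words):
        if len(w) > 3 and rng.random() < p_noise:
            j = rng.randrange(len(w) - 1)
            words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)

print(augment_text("SOUNDWAVE headphones 2nd Gen. with charging case", seed=42))
```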
Negatives are used for differentiation. A common approach is to use samples as negatives that are not explicitly labeled as positives. Usually, these are selected in-batch, meaning all remaining samples of a mini-batch that are not positives are used as negatives. The advantage of this approach is its simplicity, since the negatives are random. However, it has been shown for SimCLR and Momentum Contrast that this approach requires a large number of negatives and thus very large batch sizes.
To complete the training of a contrastive encoder, negatives and positives are combined in a contrastive loss function. Common contrastive loss functions are shown in table 2; a small sketch of the SupCon variant follows the table.
Loss Name | Number of Negatives per Reference Sample | Number of Positives per Reference Sample | Comment |
---|---|---|---|
Contrastive loss | 0 or 1 (the pair is either a positive or a negative pair) | 0 or 1 (the pair is either a positive or a negative pair) | Self-supervised; only negatives located within a margin are used; based on Euclidean distance |
Triplet loss | 1 | 1 | Self-supervised; based on Euclidean distance |
N-Pair loss | N-1 for batch size N | 1 | Self-supervised |
SupCon | Max. N-1 negatives for batch size N; all true matches occurring in the batch are excluded | At least 1; if the current training batch contains further matches, these are also taken into account | Supervised |
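To illustrate the in-batch setting, here is a compact PyTorch sketch of a SupCon-style loss. It assumes L2-normalised embeddings and cluster IDs as labels and is meant as an illustration, not the reference implementation.

```python
import torch

def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive (SupCon) loss over one mini-batch (sketch).

    embeddings: (N, d) tensor, assumed to be L2-normalised
    labels:     (N,) tensor of cluster IDs; samples sharing an ID count as positives
    """
    n = embeddings.size(0)
    sim = embeddings @ embeddings.T / temperature                  # pairwise similarities
    self_mask = torch.eye(n, dtype=torch.bool, device=sim.device)

    # denominator: all other samples in the batch (self-similarity removed)
    exp_sim = torch.exp(sim).masked_fill(self_mask, 0.0)
    log_prob = sim - torch.log(exp_sim.sum(dim=1, keepdim=True))

    # positives: samples with the same cluster ID, excluding the anchor itself
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    n_pos = pos_mask.sum(dim=1).clamp(min=1)
    return (-(log_prob * pos_mask).sum(dim=1) / n_pos).mean()

# Example: 4 samples, samples 0 and 1 share a cluster ID (a match)
z = torch.nn.functional.normalize(torch.randn(4, 8), dim=1)
print(supcon_loss(z, torch.tensor([0, 0, 1, 2])))
```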
Contrastive Entity Matching
Currently, RoBERTa-SupCon is the only concept that combines contrastive learning with entity matching. A special feature of this work is its source-aware sampling strategy, which avoids distortion through unlabelled matches: for each dataset to be compared, a separate training dataset is created that contains only explicitly known matches, and one of these training datasets is randomly selected per batch.
This also assigns cluster IDs per match, so contrastive learning can be used in a supervised setting. The loss function is SupCon, which means that negatives are selected in-batch. A positive is created by augmentation; additionally, all matches occurring in the batch are considered further positives.
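Very roughly, the batch-selection part of this strategy could be sketched as follows. This is a strong simplification of source-aware sampling as described in the paper, and the names are made up.

```python
import random

def source_aware_batches(training_sets, batch_size, rng=None):
    """Simplified sketch: one training set per source dataset; each mini-batch is
    drawn from a single, randomly chosen source, so samples from different
    sources are never mixed within one batch."""
    rng = rng or random.Random(0)
    while True:
        source = rng.choice(training_sets)
        yield rng.sample(source, min(batch_size, len(source)))

# Usage (hypothetical per-source pair lists):
# batches = source_aware_batches([pairs_from_abt, pairs_from_buy], batch_size=4)
# first_batch = next(batches)
```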
Synergy of Entity Matching and Contrastive Learning
Related work in the area of textual contrastive learning, described below, shows that there is potential for optimisation through selection criteria for negative samples: accordingly, suitable negative samples have the property of being similar to a reference sample while being a non-match at the same time. We recognise this as the nearest neighbour problem that is addressed in the blocking step of an entity matching pipeline:
(Gillick et al., 2019) propose a two-stage contrastive training to express the similarity between a tuple (text, entity). In the first stage, random negatives are selected from the training mini-batches. Based on this, the ten nearest neighbour entities are calculated for each training text using hashing, and these are used as negatives in the second training stage.
For open-domain question answering, (Karpukhin et al., 2020) train an encoder using an in-batch selection of k negatives. Random negatives, Okapi BM25, and positives of other reference samples were tested as selection strategies. It turns out that more than 20 selected non-random negatives no longer have any effect on the encoder. This contrasts with the observation that adding a single similar negative, found using Okapi BM25, to the random in-batch negatives greatly improves the contrastive encoder.
In their paper, Xiong et al. (2020) present a concept for contrastive document and passage retrieval with negatives selected by nearest neighbour approximation. The authors draw on an argument from importance sampling, which in this context means that suitable negative documents have a higher training loss and thus accelerate convergence. The nearest neighbour problem is implemented by cyclically updating a blocking index over the entire document corpus, which returns a random sample of the 200 most similar documents for each sample of a mini-batch.
For sentiment analysis on product ratings, (Robinson et al., 2020) show that better negatives reduce the need for more training data in contrastive learning. The authors confirm that increasing the number of negatives does not necessarily lead to better training results. The suitability of negatives is based on a minimum distance to the corresponding reference sample without being a match.
We note the following properties of negatives:
- Negative samples for contrastive learning are non-matches.
- Good negatives are similar to their reference sample.
As we examine the mutual influence of both concepts, contrastive learning is mapped into the flow of entity matching in order to design a contrastive encoder based on the properties above, as shown in figure 3. The initial encoder is an instance of RoBERTa. Positives are generated in a self-supervised fashion by data augmentation, i.e., adding noise and swapping words. As motivated by the related work above, blocking is moved into the contrastive training loop to select suitable negatives by nearest neighbour approximation. The index is generated from the embeddings of the contrastive encoder at a given point in time.

The blocking index was implemented as an inverted file index (IVF index) from the faiss Python library. An IVF index combines \(k\)-means clustering with an exhaustive search within the selected clusters. To select the negatives, the most similar cluster centroids are determined first, and then the most similar samples are identified within these clusters. Negative candidates are sorted by minimum Euclidean distance, and \(k\) of these candidates are used. Note that the number of negatives is not determined by the batch size, as is the case with the other loss functions, but by a parameter \(k > 0\).
To control the degree of similarity, the distance parameter \(m \geq 0\) is introduced, which shifts the offset in the nearest-neighbour ranking: the \(\tilde{k}\) samples at ranks \(m+1\) to \(m+k\) of the ranking are returned. For \(m = 0\), all direct \(k\)-nearest neighbours are used, i.e., \(\tilde{k} = k\). For \(m = 10\) and \(k = 25\), the nearest neighbours with ranks 11 to 35 are used.
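The following is a minimal sketch of this selection step with a faiss IVF index, covering the parameters \(k\) and \(m\). Function names, the number of clusters, and the number of probed clusters are illustrative assumptions; the thesis implementation may differ in detail.

```python
import faiss
import numpy as np

def select_negatives(embeddings, query_ids, k=4, m=0, n_clusters=100, n_probe=10):
    """Select similar negatives via an IVF blocking index (sketch).

    embeddings: (N, d) float32 array of current contrastive-encoder embeddings
    query_ids:  indices of the mini-batch samples acting as reference samples
    k:          number of negatives per reference sample
    m:          offset in the nearest-neighbour ranking (similarity distance)
    """
    d = embeddings.shape[1]
    quantizer = faiss.IndexFlatL2(d)
    index = faiss.IndexIVFFlat(quantizer, d, n_clusters)   # k-means + per-cluster search
    index.train(embeddings)
    index.add(embeddings)
    index.nprobe = n_probe                                  # clusters searched per query

    # retrieve k + m + 1 neighbours: +1 because the query itself is in the index
    _, neighbours = index.search(embeddings[query_ids], k + m + 1)
    negatives = []
    for qid, row in zip(query_ids, neighbours):
        row = [i for i in row if i != qid and i != -1]      # drop self and padding
        negatives.append(row[m:m + k])                      # skip the m most similar
    return negatives

# Example with random stand-in embeddings
emb = np.random.default_rng(0).standard_normal((5000, 64)).astype("float32")
print(select_negatives(emb, query_ids=[0, 1], k=4, m=10))
```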
It should be noted that there is a two-way dependency between the blocking index and the contrastive encoder: the quality of the blocking index influences the contrastive encoder, and the quality of the encoder in turn determines the embeddings from which the next version of the blocking index is built. To account for this, an update criterion for the blocking index is introduced.
To sum up, three parameters were introduced:
- Number of negatives \(k\)
- Similarity distance \(m\)
- Update criterion \(a\) for the blocking index (the index is rebuilt every \(a\) epochs)
Finally, this approach introduces Filtered SupCon (FSupCon) as a loss function. It is based on the SupCon loss and takes the adjusted negative selection into account:
\(L = \sum_{i \in I} L_{i} = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_{i} \cdot z_{p} / \tau)}{\sum_{n_{\tilde{k}} \in N_{\tilde{k}}(i)} \exp(z_{i} \cdot z_{n_{\tilde{k}}} / \tau)}\)
Let \(I \equiv \{1, \dots, N\}\) be an index set over a mini-batch and \(i \in I\) the index of a reference sample \(x_{i}\) with embedding \(z_{i}\). \(P(i)\) denotes the index set of the positives of \(x_{i}\), and \(A(i)\) refers to the index set of the training dataset without \(i\).
The index set of potential negatives is \(N(i) := \{n \in A(i) \mid y_{i} \neq y_{n} \vee y_{i} = y_{single}\}\). For \(N(i)\), it holds that the cluster ID \(y_{n}\) of the associated sample differs from the cluster ID \(y_{i}\) of \(x_{i}\). In addition, single entities with the cluster ID \(y_{single}\) are included, as these are non-matches to every other sample. \(N_{\tilde{k}}(i) \subseteq N(i)\) denotes the \(\tilde{k}\) negatives selected via the blocking index.
The similarity is determined by a similarity metric or a distance measure sim. In this work, Euclidean distance was used.
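Read as code, the loss could look roughly like the following PyTorch sketch, assuming the positives and the \(\tilde{k}\) selected negatives per reference sample are already provided as tensors. It is an illustration of the formula above, not the thesis implementation.

```python
import torch

def fsupcon_loss(z_anchor, z_pos, z_neg, temperature=0.1):
    """FSupCon-style loss for one mini-batch (sketch).

    z_anchor: (B, d)     embeddings of the reference samples
    z_pos:    (B, P, d)  embeddings of the positives per reference sample
    z_neg:    (B, K, d)  embeddings of the selected negatives per reference sample
    All embeddings are assumed to be L2-normalised.
    """
    # similarity of each anchor to its positives and its selected negatives
    sim_pos = torch.einsum("bd,bpd->bp", z_anchor, z_pos) / temperature  # (B, P)
    sim_neg = torch.einsum("bd,bkd->bk", z_anchor, z_neg) / temperature  # (B, K)

    # denominator: sum over the selected negatives only, as in the formula above
    log_denom = torch.logsumexp(sim_neg, dim=1, keepdim=True)            # (B, 1)
    log_prob = sim_pos - log_denom                                       # (B, P)

    # average over the positives of each anchor, then over the batch
    return -(log_prob.mean(dim=1)).mean()

# Example: batch of 2 reference samples, 1 positive and 4 selected negatives each
F = torch.nn.functional
print(fsupcon_loss(F.normalize(torch.randn(2, 8), dim=-1),
                   F.normalize(torch.randn(2, 1, 8), dim=-1),
                   F.normalize(torch.randn(2, 4, 8), dim=-1)))
```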
After training of the contrastive encoder is completed, a binary classifier is trained that uses the contrastive encoder for embedding generation. The resulting prototype can be considered an extension of RoBERTa-SupCon, as the preprocessing with source-aware sampling as well as the classifier architecture are retained.
Results
The approach was evaluated on the abt-buy dataset, which contains product descriptions and names of various electronic products. The dataset split of (Primpeli & Bizer, 2020) was used, which ensures 15 % matches in each of the training, validation, and test datasets. The contrastive encoder was evaluated based on the convergence of the loss, and the classifier was evaluated using the F1 score.
Before we go into detail about the individual parameters and their effects, note that tables 3 and 4 show only an extract of the tested settings.
k | m | a | Test F1 Score in % | Test Recall in % | Test Precision in % |
---|---|---|---|---|---|
2 | 0 | 1 | 49.25 | 81.93 | 35.21 |
2 | 0 | 5 | 63.03 | 82.74 | 50.91 |
4 | 0 | 1 | 72.76 | 87.56 | 62.24 |
4 | 0 | 5 | 74.46 | 91.26 | 62.88 |
8 | 0 | 1 | 64.24 | 83.49 | 52.21 |
8 | 0 | 5 | 72.33 | 87.73 | 61.11 |
16 | 0 | 1 | 20.48 | 68.02 | 19.64 |
16 | 0 | 5 | 53.71 | 81.52 | 40.05 |
Number of Negatives
The evaluation of the prototype on the abt-buy product dataset shows that using more negatives does not increase the F1 score, as shown in table 3. The best classifier is obtained with four negative samples.
As mentioned above, for a more complex problem with whole documents instead of short text paragraphs, more than 20 selected negatives yield no further improvement in contrastive training. In the context of this thesis, the assumption is that fewer selected negatives are sufficient for a simpler problem.
Update Criterion
Table 3 shows the F1 scores for different update rates of the blocking index, using \(a \in \{1, 5\}\). It was observed that when the blocking index is updated every epoch (\(a = 1\)), overfitting occurs in all cases within the first ten epochs. If updating is too infrequent (\(a = 10\)), the encoder converges to a local minimum due to too little correction of the blocking index. A tradeoff should therefore be made between updating the blocking index too frequently and too rarely, so that convergence is corrected at regular intervals. For the abt-buy dataset, an update every 5 epochs proved to be a good choice.
Similarity Distance
Table 4 shows different values for the similarity distance. The value \(m = -1\) means that the least similar negative candidates are used.
An offset of 100 to 200 in the nearest-neighbour ranking, out of a total of 1725 negative candidates, produced the highest F1 scores of all experiments. Thus, the best classifier uses a contrastive encoder with 4 negative samples, a blocking index update every 5 epochs, and a similarity distance of 100.
These offsets correspond to a similarity distance of 9.8 % and 11.6 % of the test dataset size, respectively. The experiments also showed that this similarity distance leads to the largest number of unique negative samples, as measured by cluster IDs.
Thus, the assumption about the similarity of suitable negatives holds only to a limited extent.
k | m | a | Test F1 Score in % | Test Recall in % | Test Precision in % |
---|---|---|---|---|---|
4 | 0 | 5 | 74.46 | 91.26 | 62.88 |
4 | 10 | 5 | 70.88 | 88.18 | 59.26 |
4 | 100 | 5 | 78.72 | 93.42 | 68.01 |
4 | 200 | 5 | 78.18 | 94.73 | 67.51 |
4 | -1 | 5 | 75.30 | 92.67 | 63.42 |
Comparison with the State of Research
The baseline RoBERTa-SupCon achieves an F1 score of 93.70 % with a batch size of 1024; without source-aware sampling, an F1 score of 34.24 % is reached. If RoBERTa-SupCon is replicated with a batch size of 4, in order to be comparable with the best results of this work, its F1 score is 73.58 %. In this setting, our approach exceeds the replicated RoBERTa-SupCon F1 score by 5.14 percentage points.
Additionally, for our approach, fewer than 32 training epochs are sufficient in all settings, whereas RoBERTa-SupCon requires 200 training epochs.
Compared to Ditto and DeepER + t-SNE, which do not use contrastive learning, the F1 score is between 10.61 and 12.26 percentage points lower, while the number of training epochs is similar.
Recall is higher than precision in all evaluated cases, i.e., more false positives are generated than false negatives. Nevertheless, the comparison of precision with RoBERTa-SupCon at batch size 4 shows an increase of 16.64 % in favour of our approach. It can be assumed that the concept creates a stronger delineation of potential false positives.
In addition, it should be noted that, in contrast to RoBERTa-SupCon, an unfrozen encoder works better in our setting; RoBERTa-SupCon proposes freezing the weights of the contrastive encoder while training the classifier.
Conclusion
Finally, the following observations can be made:
- Blocking methods originating from entity matching can be combined with contrastive learning to build a flexible contrastive encoder.
- The encoder built in this work converges faster in direct comparison with RoBERTa-SupCon.
- Only a few negative samples and a short training time are sufficient to achieve over 78 % F1 score for entity matching on the abt-buy dataset.
- The initial hypothesis of the mutual benefit of contrastive learning and entity matching is confirmed by the presented methodology.
- The similarity distance is the most sensitive and, at the same time, the decisive parameter for accommodating the unbalanced binary classification.