{"id":39092,"date":"2022-12-23T10:28:05","date_gmt":"2022-12-23T09:28:05","guid":{"rendered":"https:\/\/www.inovex.de\/?p=39092"},"modified":"2026-02-18T08:11:45","modified_gmt":"2026-02-18T07:11:45","slug":"similarity-search-deduplication","status":"publish","type":"post","link":"https:\/\/www.inovex.de\/de\/blog\/similarity-search-deduplication\/","title":{"rendered":"Similarity Search and Deduplication at Scale"},"content":{"rendered":"<p>With the increase in the application of large-scale data collection, analytics, and accompanying big data platforms over the past decade, the necessity for reliable entity-matching solutions is proportionately growing. The newly available data volumes need to be integrated, processed, and made usable before further value can be generated.<!--more--><\/p>\n<p>The fundamental problem of entity matching (also known as<a href=\"https:\/\/en.wikipedia.org\/wiki\/Record_linkage\" target=\"_blank\" rel=\"noopener\"> record linkage<\/a>, data linkage, reference reconciliation, data matching, and<a href=\"https:\/\/dl.acm.org\/doi\/10.14778\/2367502.2367564\" target=\"_blank\" rel=\"noopener\">\u00a0entity resolution<\/a> just to name a few) lies at the core of many commercial and enterprise applications such as:<\/p>\n<ul>\n<li>Deduplication<\/li>\n<li>Similarity search<\/li>\n<li>Data integration (merging of different data sources)<\/li>\n<\/ul>\n<p>which cover many different domains of applications like:<\/p>\n<ul>\n<li>Retail and logistics<\/li>\n<li>E-commerce<\/li>\n<li>Medical and healthcare<\/li>\n<li>Advertising<\/li>\n<li>Knowledge management<\/li>\n<\/ul>\n<p>Given multiple collections of entity representations like tables, files, or text, the entity matching problem is defined as identifying all sets of entity representations that reference the same entity. 
Most studies in the domain of entity matching work under the assumption that the collections of entity entries on which the algorithm is applied have homogeneous structures and similar schemas. It is also often assumed that the matching identifies only the uniquely identifiable representations, ignoring the partial matches and similarity ranks. Thus, a broader problem definition was introduced under the name of<a href=\"https:\/\/arxiv.org\/abs\/2106.08455\" target=\"_blank\" rel=\"noopener\"> Generalized Entity Matching<\/a>.<\/p>\n<p>The<a href=\"https:\/\/doi.org\/10.1145\/3418896\" target=\"_blank\" rel=\"noopener\"> challenges of generalized entity matching<\/a> in the Big Data context cover almost all the general challenges of big data solutions including dealing with data variety (heterogeneous data sources), velocity (the algorithms must be performant enough to process the data volume and provide the results at the required time by the use-case), and budget-aware processing (even if the data volume can be processed under the time restrictions, the compute-budget is usually a hard constraint for a solution to be profitable). In this post, I will focus on understanding the most important challenges and aspects of deduplication and similarity search which are critical for developing an effective solution based on these methods.<\/p>\n<p>Deduplication, in broad terms, deals with data that is already available or processed into a single data source and groups the data entries into sets that represent the same entity. A natural extension, which is usually part of the deduplication solution, is a merge and resolution algorithm that can reduce the groups of semantically equal entries into a new single entry. Deduplication is usually deployed as a background ad-hoc process in an enterprise with the goal of significantly improving the data quality. 
It is often part of a larger restructuring process, which can be followed by other services that ensure future data quality and prevent the need for deduplication in the future. These types of solutions are either developed internally or are provided by a B2B service.<\/p>\n<p>Similarity search on the other hand can be seen as a subtask within a deduplication solution, finding the same or similar entries in a data source, given a query entry or a set of attributes. The application of similarity search is broader and can often be customer-facing (B2C) through features such as document search, recommender systems, and auto-completion in data entry systems.<\/p>\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_82_2 counter-hierarchy ez-toc-counter ez-toc-custom ez-toc-container-direction\">\n<div class=\"ez-toc-title-container\"><p class=\"ez-toc-title\" style=\"cursor:inherit\"><\/p>\n<\/div><nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/www.inovex.de\/de\/blog\/similarity-search-deduplication\/#Challenges-and-constraints\" >Challenges and constraints<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/www.inovex.de\/de\/blog\/similarity-search-deduplication\/#Heterogeneity-of-data-sources\" >Heterogeneity of data sources<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/www.inovex.de\/de\/blog\/similarity-search-deduplication\/#Controlling-the-computational-cost\" >Controlling the computational cost<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/www.inovex.de\/de\/blog\/similarity-search-deduplication\/#Identifier-based-matching\" >Identifier based matching<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link 
ez-toc-heading-5\" href=\"https:\/\/www.inovex.de\/de\/blog\/similarity-search-deduplication\/#Attribute-based-matching\" >Attribute-based matching<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/www.inovex.de\/de\/blog\/similarity-search-deduplication\/#Deep-learning-approaches\" >Deep learning approaches<\/a><\/li><\/ul><\/nav><\/div>\n<h2><span class=\"ez-toc-section\" id=\"Challenges-and-constraints\"><\/span>Challenges and constraints<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>When designing an entity matching-based solution, we have to take into account different considerations and nuanced relations between the methodology used and the business impact. The main question that needs to be considered is: do we need the matches between entities to have 100% confidence or not (hard matches vs soft matches)? Just imagine the logistical hell if the result of deduplication leads to the merging of two unrelated products into a single entity or the legal consequences of merging two different people into a single entry.<\/p>\n<p>In practice, the different scenarios usually follow one of two patterns:<\/p>\n<ol>\n<li>The solution is applied to the data relevant to the core business, so the matches need to be done with absolute certainty. As a consequence, we will find fewer matches because of the strict matching criteria.<br \/>\n\u2192 There must be no false positives. Precision has to be 1.<\/li>\n<li>The matches are not mission critical and are used in a recommender, auto-completion, search, or similar system, both customer or internal facing. 
This allows for a more flexible definition of a match and we will find more matches overall.<br \/>\n\u2192 Precision and recall are balanced against each other.<\/li>\n<\/ol>\n<p>We will keep these two scenarios in mind when discussing the different methodologies.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Heterogeneity-of-data-sources\"><\/span>Heterogeneity of data sources<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The datasets or the collections of entity entries can be represented in various data formats. We can categorize those as:<\/p>\n<ul>\n<li>Structured (tabular data, relational database)<\/li>\n<li>Semi-structured (XML, JSON, graph, tagged document)<\/li>\n<li>Unstructured (PDF, text, document)<\/li>\n<\/ul>\n<p>When the data format is the same, we can further distinguish whether the collections share the same schema, categorizing them as:<\/p>\n<ul>\n<li>Homogeneous<\/li>\n<li>Heterogeneous<\/li>\n<\/ul>\n<p>Independently of the data formats and schemas, the entries in collections can contain multiple data types:<\/p>\n<ul>\n<li>Text<\/li>\n<li>Boolean flags<\/li>\n<li>Enum &#8211; generalization of boolean<\/li>\n<li>List\/set &#8211; generalization of many enum attributes<\/li>\n<li>Binary<\/li>\n<li>Graph<\/li>\n<\/ul>\n<p>When developing an entity-matching solution, the above properties of the specific datasets constrain the applicable methodologies we can use.<\/p>\n<p>For example, if working with an unstructured dataset, we can exclude the attribute-based matching and blocking algorithms. However, if we are confident that a majority of data in the unstructured dataset reliably contains certain information, we can exploit that knowledge to extract this information into attributes, and then apply the standard attribute-based matching.<\/p>\n<p>Another example might be the case of working in a heterogeneous setting where we are matching semi-structured JSON data with a structured relational database. 
In this case, we can seek to flatten and normalize the JSON. If this is not possible we can use any of the matching and blocking methods applicable to semi-structured or unstructured datasets as we can always easily cast a more structured dataset to a less structured one.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-39089\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/image2-1.png\" alt=\"example of semi-structured and unstructured data in text\" width=\"1200\" height=\"678\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/image2-1.png 1443w, https:\/\/www.inovex.de\/wp-content\/uploads\/image2-1-300x170.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/image2-1-1024x579.png 1024w, https:\/\/www.inovex.de\/wp-content\/uploads\/image2-1-768x434.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/image2-1-400x226.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/image2-1-720x406.png 720w, https:\/\/www.inovex.de\/wp-content\/uploads\/image2-1-360x204.png 360w\" sizes=\"auto, (max-width: 1200px) 100vw, 1200px\" \/><\/p>\n<h2><span class=\"ez-toc-section\" id=\"Controlling-the-computational-cost\"><\/span>Controlling the computational cost<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>During deduplication, the algorithm needs to consider all pairs of entries. It is obvious that a naive comparison of every pair of entities in the data would lead to the O(n\u00b2) run time. In the context of big data, such a solution is infeasible even on relatively small datasets. To overcome this limitation there exist many techniques such as<a href=\"http:\/\/ceur-ws.org\/Vol-789\/paper14.pdf\" target=\"_blank\" rel=\"noopener\"> efficient indexing<\/a>,<a href=\"https:\/\/arxiv.org\/abs\/1006.5309\" target=\"_blank\" rel=\"noopener\"> blocking,<\/a> or<a href=\"https:\/\/arxiv.org\/abs\/1103.2410\" target=\"_blank\" rel=\"noopener\"> message-passing techniques<\/a>. 
These techniques reduce the search space and facilitate parallelization of entity matching algorithms and deployment on the distributed computing platforms which can significantly increase the efficiency and reduce time to solution.<\/p>\n<p>Blocking is primarily used to prevent the unfeasible evaluation of the Cartesian product of all entities, but as a consequence, the reduced search space might also exclude some positive matches. Thus the question of reliability and other properties of the blocking index, such as pair completeness (recall), pair quality (precision), and reduction ratio, is important to consider within the specific problem context. Referencing back to the challenges and constraints section, let\u2019s consider the two defined scenarios. In scenario 1, no false positives are tolerated due to implied business consequences. A lower recall and a number of false negatives can be tolerated to a degree. In scenario 2 however, a balance between precision and recall is desirable. Some minimum level of both precision and recall is needed (true positives should outnumber false positives and if possible false negatives) in order to have a useful result in the recommender system.<\/p>\n<p>The most important factor to consider when choosing the blocking index is the<a href=\"https:\/\/rosap.ntl.bts.gov\/view\/dot\/13855\" target=\"_blank\" rel=\"noopener\"> cost-benefit trade-off<\/a>. In the case of products, if we choose a high-level product category as the blocking index, the resulting blocks will be large, which leads to many unnecessary comparisons between entity pairs, which in turn leads to more computational cost and longer time to solutions. 
In contrast, if the blocking index is too specific, like product manufacturer and color, the resulting blocks might be too small and miss some true matches between entities that are grouped in different blocks due to bad data quality.<\/p>\n<p>In an optimal case, we would already have a reliable blocking key (index), but these are mostly unavailable in heterogeneous datasets and are often unreliable even within a single large homogeneous data source (in the products example above, the manufacturer attribute might have multiple different values with typos or other changes for the same manufacturer).<\/p>\n<p>In cases where no existing attributes are available within entities that can be used as a blocking index, a synthetic (blocking) index can be computed. A variety of algorithms are available for computing the synthetic blocking index including<a href=\"https:\/\/openresearch-repository.anu.edu.au\/handle\/1885\/40723\" target=\"_blank\" rel=\"noopener\"> Bigram Indexing<\/a>,<a href=\"https:\/\/link.springer.com\/article\/10.1023\/A:1009761603038\" target=\"_blank\" rel=\"noopener\"> Sorted Neighborhood<\/a>, and<a href=\"https:\/\/dl.acm.org\/doi\/abs\/10.1145\/1142473.1142599\" target=\"_blank\" rel=\"noopener\"> Canopy Clustering<\/a>. The same cost-benefit trade-off applies to the synthetic blocking indexes, which are even more prone to sensitivity loss (losing the positive matches) when applied as a black box solution.<\/p>\n<p>For more critical solutions, one often includes some domain knowledge about the data and attributes when computing the synthetic blocking index. For example, one might exploit unreliable partitioning keys (like product manufacturer) to increase the coverage confidence (by clustering manufacturer values and using cluster indices) while still significantly reducing the search space.<\/p>\n<p>When dealing with data that lack useful metadata or attributes for blocking, one might have to compute the blocking index based on the content of entities. 
For this case, there is a possibility of using (deep learning) embeddings of data entries as the input to the clustering algorithm for producing a more reliable blocking index or finding the n-nearest-neighbors for comparison.<\/p>\n<p>For even more demanding use cases like continuous entity matching, please refer to<a href=\"https:\/\/link.springer.com\/article\/10.1007\/s11390-020-0350-4\" target=\"_blank\" rel=\"noopener\"> these<\/a><a href=\"https:\/\/dl.acm.org\/doi\/abs\/10.1145\/3377455\" target=\"_blank\" rel=\"noopener\"> two<\/a> comprehensive reviews of all available state-of-the-art blocking methods.<\/p>\n<p>After having the search space significantly reduced with blocking, it is still critical to exploit the possible parallelization of the matching process, especially in the case of deduplication or the partial pre-computation of similarities for similarity search. One consequence of parallel matching is that the result can contain overlapping groups, so an additional merging step needs to be introduced.<\/p>\n<p>Further optimizations can be performed on the level of query picking (sampling strategy) during the iterative deduplication process, but are costly and highly dataset-specific.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Identifier-based-matching\"><\/span>Identifier based matching<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>An identifier within a data source is an attribute which already matches identical entities. Every identifier is only valid on a certain scope, which typically does not cover all the available data (otherwise the problem would be already solved). It is often the case that during entity matching one has to exploit available identifiers which are incomplete, partially available, or span different scopes of the data. For example, if the dataset consists of multiple data sources, each data source can have its own unique identifiers. 
Some identifiers can have scopes that cover multiple data sources but only partially.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-39087\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/image1-1.png\" alt=\"depiction of connections between databases a, b and c\" width=\"987\" height=\"629\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/image1-1.png 2000w, https:\/\/www.inovex.de\/wp-content\/uploads\/image1-1-300x191.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/image1-1-1024x652.png 1024w, https:\/\/www.inovex.de\/wp-content\/uploads\/image1-1-768x489.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/image1-1-1536x978.png 1536w, https:\/\/www.inovex.de\/wp-content\/uploads\/image1-1-1920x1223.png 1920w, https:\/\/www.inovex.de\/wp-content\/uploads\/image1-1-400x255.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/image1-1-360x229.png 360w\" sizes=\"auto, (max-width: 987px) 100vw, 987px\" \/><\/p>\n<p>&nbsp;<\/p>\n<p>To exploit such overlapping partially-available identifiers, we can employ an iterative matching method to extend the scope of entity matching to the whole dataset. The main advantage of using the available identifier is the 100% confidence in the matches as required for scenario 1 in the \u201cChallenges and constraints\u201d section above (entity matching impacting the core business processes).<\/p>\n<p>Partially-available identifiers can also be used as labels for training the classifiers of attribute-based or deep learning-based matching approaches.<\/p>\n<p>Another very impactful methodology is using the identifiers (or attributes) as exclusion criteria. Those can be used in conjunction with matching criteria in a defined order of precedence to decide if the pair of representations is a true match. 
One can also take it a step further and integrate the exclusion criteria in the calculation of the blocking index, where the search space can be reduced even further.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Attribute-based-matching\"><\/span>Attribute-based matching<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Entities in structured datasets have a multitude of attributes, whose relevance for entity matching is not always clear. The similarity between individual attributes of different entities can be computed but takes on a different form depending on the data type. The following figure shows examples of the data types, their features, and similarity metrics that can be computed on them.<\/p>\n<p>&nbsp;<\/p>\n<table>\n<tbody>\n<tr>\n<td>Data type<\/td>\n<td>Feature<\/td>\n<td>Similarity \/ Distance metric<\/td>\n<\/tr>\n<tr>\n<td>Boolean<\/td>\n<td>Boolean<\/td>\n<td>Equality<\/td>\n<\/tr>\n<tr>\n<td>Int<\/td>\n<td>Int<\/td>\n<td>Equality<\/td>\n<\/tr>\n<tr>\n<td>Float<\/td>\n<td>Float<\/td>\n<td>Difference<\/td>\n<\/tr>\n<tr>\n<td>Enum<\/td>\n<td>Enum<\/td>\n<td>Equality<\/td>\n<\/tr>\n<tr>\n<td>Set (of enums or strings)<\/td>\n<td>Set<\/td>\n<td>Overlapping fraction<\/td>\n<\/tr>\n<tr>\n<td>Geo-Coordinates<\/td>\n<td>Geo-Coordinates<\/td>\n<td>Euclidean distance<\/p>\n<p>Tunnel distance<\/p>\n<p>Ellipsoidal-surface distance<\/p>\n<p>Road distance<\/td>\n<\/tr>\n<tr>\n<td>String<\/td>\n<td>Normalized String<\/td>\n<td>Smith-Waterman Edit distance<\/p>\n<p>Q-gram distance<\/p>\n<p>Jaro-Winkler distance<\/p>\n<p>Monge-Elkan distance<\/p>\n<p>Extended Jaccard coefficient<\/p>\n<p>SoftTFIDF similarity<\/p>\n<p>Longest Common Substring similarity<\/p>\n<p>Bag distance<\/p>\n<p>Compression distance<\/p>\n<p>Editex similarity<\/p>\n<p>Syllable Alignment distance<\/td>\n<\/tr>\n<tr>\n<td>String<\/td>\n<td>Vector Embedding<\/td>\n<td>Euclidean distance<\/p>\n<p>Cosine similarity<\/p>\n<p>Dot 
product<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<p>After the similarity scores of all attributes between the two entities are computed, they need to be combined into a single similarity metric. This can be done in different ways, from domain knowledge-driven manual weighting of individual similarity scores to automated data-based methods that require a labeled dataset. For the latter variant, any type of classifier model can be trained. Depending on the properties of the available dataset, one usually has to apply one of the following two approaches:<\/p>\n<ul>\n<li>Generating the dataset with manual labeling and<a href=\"https:\/\/www.inovex.de\/de\/blog\/intro-to-active-learning\/\" target=\"_blank\" rel=\"noopener\"> active learning<\/a><\/li>\n<li>Exploiting the available unique identifiers as labels<\/li>\n<\/ul>\n<p>Unique identifiers that are valid on a substantial subset of data are optimal to reuse for generating a labeled dataset. If the scope of the unique identifier is too small, then the labeled dataset it produces might be too biased and not directly useful for training. It can, however, serve as a starting point for the<a href=\"https:\/\/www.inovex.de\/de\/blog\/intro-to-active-learning\/\" target=\"_blank\" rel=\"noopener\"> active learning<\/a> approach, which is still better than starting from scratch.<\/p>\n<p>If no partial identifiers are present that can be used as a labeled dataset, then the only available alternative is the manual weighting of the similarity scores. This is bound to produce suboptimal results but might be used as a baseline in cases where there are no alternatives. The baseline can then be used as a starting point for active learning. 
<a href=\"https:\/\/www.inovex.de\/de\/blog\/intro-to-active-learning\/\" target=\"_blank\" rel=\"noopener\">Active learning methodology<\/a> supports the manual labeling process to maximize the performance gains for every labeled sample (by picking the samples for labeling based on the highest uncertainty of the classifier model).<\/p>\n<p>Independently of the dataset quality and the classifier confidence, with attribute matching, we can never be 100% confident in the matches as required for scenario 1 in the \u201cChallenges and constraints\u201d section. The only way to really use the attribute-based matching for scenario 1 is to have a manual validation step after the match generation. However, the uncertainty of the attribute-based matching is not an issue for scenario 2, where the output of the matching is only used as a recommendation generator in a business process or a customer experience.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Deep-learning-approaches\"><\/span>Deep learning approaches<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Despite the ongoing research<a href=\"https:\/\/ajph.aphapublications.org\/doi\/abs\/10.2105\/AJPH.36.12.1412\" target=\"_blank\" rel=\"noopener\"> since 1946<\/a> in different aspects and domains of entity matching, the problem is still open and does not have a satisfactory solution. The requirements of intensive human involvement in feature engineering, tuning, manual labeling, and integrating domain knowledge into the entity-matching solution largely remain. With the advances in deep learning, novel methods enable approaching the entity matching problem from a different perspective and potentially increasing the performance and reducing the need for human involvement in the development process.<\/p>\n<p>Promising deep learning concepts that enable a different type of entity matching are word and document encoders. 
These neural networks are trained to produce embeddings in latent space whose distance should be proportional to the similarity of the entities in the target domain. Such neural-network-based encoders come in different varieties and are suited for several domains. For example, using the distributed word representations with RNNs and LSTMs, the authors of<a href=\"https:\/\/dl.acm.org\/doi\/abs\/10.14778\/3236187.3236198\" target=\"_blank\" rel=\"noopener\"> DeepER<\/a> were able to develop a novel entity-matching system with high accuracy and efficiency that requires less human effort. An overview of deep learning-based entity-matching approaches can be found<a href=\"https:\/\/dl.acm.org\/doi\/10.1145\/3442200\" target=\"_blank\" rel=\"noopener\"> here<\/a>.<\/p>\n<p>While the above methods improved multiple aspects of entity matching and set the new state-of-the-art benchmark on datasets, they still rely on having available large and good-quality labeled datasets which are mostly not available in a real-world setting. Thus, novel deep learning approaches have been developed to specifically tackle the low resource settings. For example, an architecture combining transfer learning and active learning, dubbed<a href=\"https:\/\/arxiv.org\/pdf\/1906.08042.pdf\" target=\"_blank\" rel=\"noopener\"> Deep Transfer Active Learning<\/a>, was developed and exploits the publicly available datasets to train a high-resource model and transfer-learn to a low-resource target setting with very efficient active learning labeling, minimizing the human effort while still retaining the high performance. 
Another innovative deep learning approach is the<a href=\"https:\/\/arxiv.org\/pdf\/2004.00584.pdf\" target=\"_blank\" rel=\"noopener\"> Ditto architecture<\/a> which enables direct entity matching on datasets with heterogeneous schemas.<\/p>\n<p>Methods that can completely overcome the need for labeled datasets are self-supervised learning approaches, like<a href=\"https:\/\/arxiv.org\/pdf\/2108.08090.pdf\" target=\"_blank\" rel=\"noopener\"> CollaborER<\/a> and<a href=\"https:\/\/arxiv.org\/pdf\/2112.07887.pdf\" target=\"_blank\" rel=\"noopener\"> KRISS<\/a>, which achieve state-of-the-art entity matching performance and can even outperform some supervised-learning-based methods.<\/p>\n<p>In the follow-up post, I will cover more advanced techniques and architectures in entity matching including federated entity resolution and real-time data entity matching.<\/p>\n<p>And if you need help setting up and building your own entity-matching solution, feel free to contact me or anyone else at inovex. We love assessing, architecting, and building new custom solutions.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>With the increase in the application of large-scale data collection, analytics, and accompanying big data platforms over the past decade, the necessity for reliable entity-matching solutions is proportionately growing. 
The newly available data volumes need to be integrated, processed, and made usable before further value can be generated.<\/p>\n","protected":false},"author":199,"featured_media":40754,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"inline_featured_image":false,"ep_exclude_from_search":false,"footnotes":""},"tags":[509,511,77,385,179,225,151],"service":[76,411,431],"coauthors":[{"id":199,"display_name":"Darjan Salaj","user_nicename":"dsalaj"}],"class_list":["post-39092","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","tag-ai-2","tag-artificial-intelligence-2","tag-big-data","tag-data-engineering","tag-data-products","tag-data-science-in-production","tag-deep-learning","service-artificial-intelligence","service-data-engineering","service-data-science"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.5 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Similarity Search and Deduplication at Scale - inovex GmbH<\/title>\n<meta name=\"description\" content=\"The fundamental problem of entity matching lies at the core of many commercial and enterprise applications.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.inovex.de\/de\/blog\/similarity-search-deduplication\/\" \/>\n<meta property=\"og:locale\" content=\"de_DE\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Similarity Search and Deduplication at Scale - inovex GmbH\" \/>\n<meta property=\"og:description\" content=\"The fundamental problem of entity matching lies at the core of many commercial and enterprise applications.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.inovex.de\/de\/blog\/similarity-search-deduplication\/\" \/>\n<meta property=\"og:site_name\" content=\"inovex 
GmbH\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/inovexde\" \/>\n<meta property=\"article:published_time\" content=\"2022-12-23T09:28:05+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2026-02-18T07:11:45+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Similarity-Search-and-Deduplication-at-Scale.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1920\" \/>\n\t<meta property=\"og:image:height\" content=\"1080\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Darjan Salaj\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:image\" content=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Similarity-Search-and-Deduplication-at-Scale-1024x576.png\" \/>\n<meta name=\"twitter:creator\" content=\"@inovexgmbh\" \/>\n<meta name=\"twitter:site\" content=\"@inovexgmbh\" \/>\n<meta name=\"twitter:label1\" content=\"Verfasst von\" \/>\n\t<meta name=\"twitter:data1\" content=\"Darjan Salaj\" \/>\n\t<meta name=\"twitter:label2\" content=\"Gesch\u00e4tzte Lesezeit\" \/>\n\t<meta name=\"twitter:data2\" content=\"14\u00a0Minuten\" \/>\n\t<meta name=\"twitter:label3\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data3\" content=\"Darjan Salaj\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/similarity-search-deduplication\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/similarity-search-deduplication\\\/\"},\"author\":{\"name\":\"Darjan Salaj\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/person\\\/3eb5d9c43358af38d1e463c024df3da5\"},\"headline\":\"Similarity Search and Deduplication at 
Scale\",\"datePublished\":\"2022-12-23T09:28:05+00:00\",\"dateModified\":\"2026-02-18T07:11:45+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/similarity-search-deduplication\\\/\"},\"wordCount\":2770,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/similarity-search-deduplication\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/Similarity-Search-and-Deduplication-at-Scale.png\",\"keywords\":[\"Ai\",\"Artificial Intelligence\",\"Big Data\",\"Data Engineering\",\"Data Products\",\"Data Science in Production\",\"Deep Learning\"],\"articleSection\":[\"Analytics\",\"English Content\",\"General\"],\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/similarity-search-deduplication\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/similarity-search-deduplication\\\/\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/similarity-search-deduplication\\\/\",\"name\":\"Similarity Search and Deduplication at Scale - inovex GmbH\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/similarity-search-deduplication\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/similarity-search-deduplication\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/Similarity-Search-and-Deduplication-at-Scale.png\",\"datePublished\":\"2022-12-23T09:28:05+00:00\",\"dateModified\":\"2026-02-18T07:11:45+00:00\",\"description\":\"The fundamental problem of entity matching lies at the core of many commercial and enterprise 
applications.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/similarity-search-deduplication\\\/#breadcrumb\"},\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/similarity-search-deduplication\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/similarity-search-deduplication\\\/#primaryimage\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/Similarity-Search-and-Deduplication-at-Scale.png\",\"contentUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/Similarity-Search-and-Deduplication-at-Scale.png\",\"width\":1920,\"height\":1080,\"caption\":\"several similar and two identical documents in a simple illustration\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/similarity-search-deduplication\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Similarity Search and Deduplication at Scale\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#website\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/\",\"name\":\"inovex GmbH\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"de\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#organization\",\"name\":\"inovex 
GmbH\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2021\\\/03\\\/inovex-logo-16-9-1.png\",\"contentUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2021\\\/03\\\/inovex-logo-16-9-1.png\",\"width\":1921,\"height\":1081,\"caption\":\"inovex GmbH\"},\"image\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.facebook.com\\\/inovexde\",\"https:\\\/\\\/x.com\\\/inovexgmbh\",\"https:\\\/\\\/www.instagram.com\\\/inovexlife\\\/\",\"https:\\\/\\\/www.linkedin.com\\\/company\\\/inovex\",\"https:\\\/\\\/www.youtube.com\\\/channel\\\/UC7r66GT14hROB_RQsQBAQUQ\"]},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/person\\\/3eb5d9c43358af38d1e463c024df3da5\",\"name\":\"Darjan Salaj\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/cropped-profile_dsalaj_2024_hk-96x96.jpg9fcfeffc54642cddbc08ffd0e0638b80\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/cropped-profile_dsalaj_2024_hk-96x96.jpg\",\"contentUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/cropped-profile_dsalaj_2024_hk-96x96.jpg\",\"caption\":\"Darjan Salaj\"},\"description\":\"As a Data\\\/ML Engineer at inovex, I specialize in building robust pipelines for large-scale data processing in distributed systems. I focus on delivering high-impact data products by prioritizing data quality and master data governance, ensuring that machine learning models are built on a foundation of reliable, well-governed information. 
I have extensive experience collaborating with SAP customers to modernize their data landscapes and integrate intelligent solutions into their enterprise workflows.\",\"sameAs\":[\"https:\\\/\\\/dsalaj.com\\\/\",\"https:\\\/\\\/www.linkedin.com\\\/in\\\/darjan-salaj-78599818b\\\/\"],\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/author\\\/dsalaj\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Similarity Search and Deduplication at Scale - inovex GmbH","description":"The fundamental problem of entity matching lies at the core of many commercial and enterprise applications.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.inovex.de\/de\/blog\/similarity-search-deduplication\/","og_locale":"de_DE","og_type":"article","og_title":"Similarity Search and Deduplication at Scale - inovex GmbH","og_description":"The fundamental problem of entity matching lies at the core of many commercial and enterprise applications.","og_url":"https:\/\/www.inovex.de\/de\/blog\/similarity-search-deduplication\/","og_site_name":"inovex GmbH","article_publisher":"https:\/\/www.facebook.com\/inovexde","article_published_time":"2022-12-23T09:28:05+00:00","article_modified_time":"2026-02-18T07:11:45+00:00","og_image":[{"width":1920,"height":1080,"url":"https:\/\/www.inovex.de\/wp-content\/uploads\/Similarity-Search-and-Deduplication-at-Scale.png","type":"image\/png"}],"author":"Darjan Salaj","twitter_card":"summary_large_image","twitter_image":"https:\/\/www.inovex.de\/wp-content\/uploads\/Similarity-Search-and-Deduplication-at-Scale-1024x576.png","twitter_creator":"@inovexgmbh","twitter_site":"@inovexgmbh","twitter_misc":{"Verfasst von":"Darjan Salaj","Gesch\u00e4tzte Lesezeit":"14\u00a0Minuten","Written by":"Darjan 
Salaj"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.inovex.de\/de\/blog\/similarity-search-deduplication\/#article","isPartOf":{"@id":"https:\/\/www.inovex.de\/de\/blog\/similarity-search-deduplication\/"},"author":{"name":"Darjan Salaj","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/person\/3eb5d9c43358af38d1e463c024df3da5"},"headline":"Similarity Search and Deduplication at Scale","datePublished":"2022-12-23T09:28:05+00:00","dateModified":"2026-02-18T07:11:45+00:00","mainEntityOfPage":{"@id":"https:\/\/www.inovex.de\/de\/blog\/similarity-search-deduplication\/"},"wordCount":2770,"commentCount":0,"publisher":{"@id":"https:\/\/www.inovex.de\/de\/#organization"},"image":{"@id":"https:\/\/www.inovex.de\/de\/blog\/similarity-search-deduplication\/#primaryimage"},"thumbnailUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/Similarity-Search-and-Deduplication-at-Scale.png","keywords":["Ai","Artificial Intelligence","Big Data","Data Engineering","Data Products","Data Science in Production","Deep Learning"],"articleSection":["Analytics","English Content","General"],"inLanguage":"de","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.inovex.de\/de\/blog\/similarity-search-deduplication\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.inovex.de\/de\/blog\/similarity-search-deduplication\/","url":"https:\/\/www.inovex.de\/de\/blog\/similarity-search-deduplication\/","name":"Similarity Search and Deduplication at Scale - inovex 
GmbH","isPartOf":{"@id":"https:\/\/www.inovex.de\/de\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.inovex.de\/de\/blog\/similarity-search-deduplication\/#primaryimage"},"image":{"@id":"https:\/\/www.inovex.de\/de\/blog\/similarity-search-deduplication\/#primaryimage"},"thumbnailUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/Similarity-Search-and-Deduplication-at-Scale.png","datePublished":"2022-12-23T09:28:05+00:00","dateModified":"2026-02-18T07:11:45+00:00","description":"The fundamental problem of entity matching lies at the core of many commercial and enterprise applications.","breadcrumb":{"@id":"https:\/\/www.inovex.de\/de\/blog\/similarity-search-deduplication\/#breadcrumb"},"inLanguage":"de","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.inovex.de\/de\/blog\/similarity-search-deduplication\/"]}]},{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/de\/blog\/similarity-search-deduplication\/#primaryimage","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/Similarity-Search-and-Deduplication-at-Scale.png","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/Similarity-Search-and-Deduplication-at-Scale.png","width":1920,"height":1080,"caption":"several similar and two identical documents in a simple illustration"},{"@type":"BreadcrumbList","@id":"https:\/\/www.inovex.de\/de\/blog\/similarity-search-deduplication\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.inovex.de\/de\/"},{"@type":"ListItem","position":2,"name":"Similarity Search and Deduplication at Scale"}]},{"@type":"WebSite","@id":"https:\/\/www.inovex.de\/de\/#website","url":"https:\/\/www.inovex.de\/de\/","name":"inovex 
GmbH","description":"","publisher":{"@id":"https:\/\/www.inovex.de\/de\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.inovex.de\/de\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"de"},{"@type":"Organization","@id":"https:\/\/www.inovex.de\/de\/#organization","name":"inovex GmbH","url":"https:\/\/www.inovex.de\/de\/","logo":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/logo\/image\/","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/2021\/03\/inovex-logo-16-9-1.png","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/2021\/03\/inovex-logo-16-9-1.png","width":1921,"height":1081,"caption":"inovex GmbH"},"image":{"@id":"https:\/\/www.inovex.de\/de\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/inovexde","https:\/\/x.com\/inovexgmbh","https:\/\/www.instagram.com\/inovexlife\/","https:\/\/www.linkedin.com\/company\/inovex","https:\/\/www.youtube.com\/channel\/UC7r66GT14hROB_RQsQBAQUQ"]},{"@type":"Person","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/person\/3eb5d9c43358af38d1e463c024df3da5","name":"Darjan Salaj","image":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/wp-content\/uploads\/cropped-profile_dsalaj_2024_hk-96x96.jpg9fcfeffc54642cddbc08ffd0e0638b80","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/cropped-profile_dsalaj_2024_hk-96x96.jpg","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/cropped-profile_dsalaj_2024_hk-96x96.jpg","caption":"Darjan Salaj"},"description":"As a Data\/ML Engineer at inovex, I specialize in building robust pipelines for large-scale data processing in distributed systems. 
I focus on delivering high-impact data products by prioritizing data quality and master data governance, ensuring that machine learning models are built on a foundation of reliable, well-governed information. I have extensive experience collaborating with SAP customers to modernize their data landscapes and integrate intelligent solutions into their enterprise workflows.","sameAs":["https:\/\/dsalaj.com\/","https:\/\/www.linkedin.com\/in\/darjan-salaj-78599818b\/"],"url":"https:\/\/www.inovex.de\/de\/blog\/author\/dsalaj\/"}]}},"_links":{"self":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/39092","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/users\/199"}],"replies":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/comments?post=39092"}],"version-history":[{"count":6,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/39092\/revisions"}],"predecessor-version":[{"id":66235,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/39092\/revisions\/66235"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/media\/40754"}],"wp:attachment":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/media?parent=39092"}],"wp:term":[{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/tags?post=39092"},{"taxonomy":"service","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/service?post=39092"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/coauthors?post=39092"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}