{"id":59897,"date":"2025-01-17T09:15:15","date_gmt":"2025-01-17T08:15:15","guid":{"rendered":"https:\/\/www.inovex.de\/?p=59897"},"modified":"2026-01-08T10:59:32","modified_gmt":"2026-01-08T09:59:32","slug":"how-to-use-mimesis-and-dbt-to-test-data-pipelines","status":"publish","type":"post","link":"https:\/\/www.inovex.de\/de\/blog\/how-to-use-mimesis-and-dbt-to-test-data-pipelines\/","title":{"rendered":"How to Use Mimesis and dbt to Test Data Pipelines"},"content":{"rendered":"<p>In this blog post, we introduce a framework for data pipeline testing using <a href=\"https:\/\/www.getdbt.com\/\">dbt (data build tool)<\/a> and <a href=\"https:\/\/mimesis.name\/master\/\">Mimesis<\/a>, a fake data generator for Python.<\/p>\n<p><!--more--><\/p>\n<p>Data quality is of utmost importance for the success of data products. Ensuring the robustness and accuracy of data pipelines is key to achieving this quality. This is where data pipeline testing becomes essential. However, effectively testing data pipelines involves several challenges, including the availability of test data, automation, a proper definition of test cases, and the ability to run end-to-end data tests seamlessly during local development.<\/p>\n<p>This blog post introduces a framework for testing data pipelines built with dbt (data build tool) using Mimesis, a Python library for fake data generation. 
Specifically, we will<\/p>\n<ul>\n<li>use Pydantic to parse and validate dbt&#8217;s <code>schema.yml<\/code> files<\/li>\n<li>implement a generic approach to auto-generate realistic test data based on dbt schemas<\/li>\n<li>use the generated data to test a dbt data pipeline.<\/li>\n<\/ul>\n<p>Combining dbt&#8217;s built-in testing capabilities with Mimesis&#8217;s data generation abilities allows us to validate data pipelines effectively, ensuring data quality and accuracy.<\/p>\n<p>All files discussed in this article are available in an <a href=\"https:\/\/github.com\/inovex\/blog-dbt-mimesis\">accompanying GitHub repository<\/a>. If you want to follow along, make sure to clone the repository.<\/p>\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_82_2 counter-hierarchy ez-toc-counter ez-toc-custom ez-toc-container-direction\">\n<div class=\"ez-toc-title-container\"><p class=\"ez-toc-title\" style=\"cursor:inherit\"><\/p>\n<\/div><nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/www.inovex.de\/de\/blog\/how-to-use-mimesis-and-dbt-to-test-data-pipelines\/#Prerequisites\" >Prerequisites<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/www.inovex.de\/de\/blog\/how-to-use-mimesis-and-dbt-to-test-data-pipelines\/#Motivation\" >Motivation<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/www.inovex.de\/de\/blog\/how-to-use-mimesis-and-dbt-to-test-data-pipelines\/#Mimesis-%E2%80%93-Fake-Data-Generation\" >Mimesis &#8211; Fake Data Generation<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/www.inovex.de\/de\/blog\/how-to-use-mimesis-and-dbt-to-test-data-pipelines\/#Combine-dbt-and-Mimesis-for-Robust-Data-Pipeline-Testing\" >Combine dbt and Mimesis for Robust 
Data Pipeline Testing<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/www.inovex.de\/de\/blog\/how-to-use-mimesis-and-dbt-to-test-data-pipelines\/#Setting-Up-Your-Environment\" >Setting Up Your Environment<\/a><ul class='ez-toc-list-level-4' ><li class='ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/www.inovex.de\/de\/blog\/how-to-use-mimesis-and-dbt-to-test-data-pipelines\/#Option-1-Use-the-Development-Container-recommended\" >Option 1: Use the Development Container (recommended)<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-7\" href=\"https:\/\/www.inovex.de\/de\/blog\/how-to-use-mimesis-and-dbt-to-test-data-pipelines\/#Option-2-Manual-Setup\" >Option 2: Manual Setup<\/a><\/li><\/ul><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-8\" href=\"https:\/\/www.inovex.de\/de\/blog\/how-to-use-mimesis-and-dbt-to-test-data-pipelines\/#Testing-dbt-Pipelines-Using-Mimesis\" >Testing dbt Pipelines Using Mimesis<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-9\" href=\"https:\/\/www.inovex.de\/de\/blog\/how-to-use-mimesis-and-dbt-to-test-data-pipelines\/#Step-1-Parsing-dbt-Schemas-With-Pydantic\" >Step 1: Parsing dbt Schemas With Pydantic<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-10\" href=\"https:\/\/www.inovex.de\/de\/blog\/how-to-use-mimesis-and-dbt-to-test-data-pipelines\/#Step-2-Auto-Generating-Test-Data-with-Mimesis\" >Step 2: Auto-Generating Test Data with Mimesis<\/a><ul class='ez-toc-list-level-4' ><li class='ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-11\" href=\"https:\/\/www.inovex.de\/de\/blog\/how-to-use-mimesis-and-dbt-to-test-data-pipelines\/#Initialize-the-class\" >Initialize the class<\/a><\/li><li 
class='ez-toc-page-1 ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-12\" href=\"https:\/\/www.inovex.de\/de\/blog\/how-to-use-mimesis-and-dbt-to-test-data-pipelines\/#Generate-Random-Row-Counts\" >Generate Random Row Counts<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-13\" href=\"https:\/\/www.inovex.de\/de\/blog\/how-to-use-mimesis-and-dbt-to-test-data-pipelines\/#Generate-Unique-Values\" >Generate Unique Values<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-14\" href=\"https:\/\/www.inovex.de\/de\/blog\/how-to-use-mimesis-and-dbt-to-test-data-pipelines\/#Generate-Data-for-a-Table\" >Generate Data for a Table<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-15\" href=\"https:\/\/www.inovex.de\/de\/blog\/how-to-use-mimesis-and-dbt-to-test-data-pipelines\/#Generate-Data-for-the-Entire-Schema\" >Generate Data for the Entire Schema<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-16\" href=\"https:\/\/www.inovex.de\/de\/blog\/how-to-use-mimesis-and-dbt-to-test-data-pipelines\/#Full-Picture\" >Full Picture<\/a><\/li><\/ul><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-17\" href=\"https:\/\/www.inovex.de\/de\/blog\/how-to-use-mimesis-and-dbt-to-test-data-pipelines\/#Using-Mimesis-to-Test-dbt-Pipelines\" >Using Mimesis to Test dbt Pipelines<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-18\" href=\"https:\/\/www.inovex.de\/de\/blog\/how-to-use-mimesis-and-dbt-to-test-data-pipelines\/#Conclusion\" >Conclusion<\/a><\/li><\/ul><\/nav><\/div>\n<h2><span class=\"ez-toc-section\" id=\"Prerequisites\"><\/span>Prerequisites<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>We assume the reader to be familiar with the following topics:<\/p>\n<ul>\n<li><a 
href=\"https:\/\/www.python.org\/\">Python<\/a> Basics, incl. basic object-oriented programming<\/li>\n<li><a href=\"https:\/\/docs.getdbt.com\/docs\/build\/documentation\">dbt<\/a> Basics, incl. model definitions, seeds, basic project structures, and data tests<\/li>\n<li><a href=\"https:\/\/docs.pydantic.dev\/latest\/\">Pydantic<\/a> Basics, incl. model definitions and parsing YAML files into Pydantic models<\/li>\n<\/ul>\n<p>If you want to follow along, you must have Python 3.10 installed on your machine. Alternatively, our <a href=\"https:\/\/github.com\/inovex\/blog-dbt-mimesis\">GitHub repository<\/a> contains a <a href=\"https:\/\/containers.dev\/\">Development Container<\/a> Specification that allows you to set up a working environment quickly (requires <a href=\"https:\/\/www.docker.com\/\">Docker<\/a> to be installed on your machine).<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Motivation\"><\/span>Motivation<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>As discussed in detail in <a href=\"https:\/\/www.inovex.de\/de\/blog\/ensuring-data-quality-a-data-engineers-perspective\/\">this blog post<\/a>, data quality is paramount for successful data products. High-quality data is data that meets domain-specific assumptions to a high degree. This includes:<\/p>\n<ul>\n<li><strong>Semantic correctness<\/strong>: e.g., an email address matches a particular pattern, or the measured value of a sensor is always within a specific interval.<\/li>\n<li><strong>Syntactic correctness<\/strong>: e.g., fields and data types are correct, and constraints are met.<\/li>\n<li><strong>Completeness<\/strong>.<\/li>\n<\/ul>\n<p>Meeting the requirements above heavily depends on the correctness and robustness of data pipelines, which makes testing them so important. Testing data pipelines ensures accurate transformations and consistent schemas. It helps to catch errors early and to prevent downstream impacts. 
However, effectively running data tests comes with some challenges:<\/p>\n<ul>\n<li><strong>Realistic Test Data<\/strong>: Using production data can raise privacy concerns, while manually creating (realistic) test datasets is time-consuming and often incomplete.<\/li>\n<li><strong>Dynamic Environments<\/strong>: Adapting to changing schemas or new sources can introduce errors.<\/li>\n<li><strong>Automation in Testing<\/strong>: Running data pipelines is often time-consuming and costly. Development teams need the means to run data tests automatically, many times throughout the development lifecycle, to detect problems early on.<\/li>\n<li><strong>Local Test Execution:\u00a0<\/strong>Not only should data pipeline tests be automated, but developers should also be able to run them on their local machines during development. This requires that developers can easily generate test data and execute data pipeline tests end-to-end.<\/li>\n<\/ul>\n<p>Tools like dbt simplify data pipeline testing by providing a framework for modular, testable SQL-based workflows. But to truly test a pipeline effectively, we also need realistic datasets to simulate real-world scenarios &#8211; this is where libraries like <a href=\"https:\/\/mimesis.name\/master\/\">Mimesis<\/a>, a powerful fake data generator, come into play. The framework we introduce in this blog post aims to tackle the challenges above. Our approach allows developers to quickly auto-generate test data based on a schema definition and run data pipeline tests both locally and as part of CI\/CD.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Mimesis-%E2%80%93-Fake-Data-Generation\"><\/span>Mimesis &#8211; Fake Data Generation<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Mimesis is a Python library that generates realistic data for testing purposes. It supports a wide range of data types, making it ideal for testing data pipelines without relying on production datasets. 
It also offers built-in providers for generating data related to various areas, including food, people, transportation, and addresses across multiple locales.<\/p>\n<p>Let&#8217;s look at how easy it is to generate fake data with Mimesis.<\/p>\n<pre class=\"lang:python decode:true\">from mimesis import Person\r\nfrom mimesis.locales import Locale\r\n\r\n# Create an instance of the person provider for English data\r\nperson = Person(locale=Locale.EN)\r\n\r\n# Generate a name and an email address\r\nprint(person.full_name()) # e.g., \"John Doe\"\r\nprint(person.email()) # e.g., \"john.doe@example.com\"<\/pre>\n<p>Furthermore, we can use the <span class=\"lang:default decode:true crayon-inline \">Fieldset<\/span>\u00a0class to generate multiple values at once when we need larger amounts of data.<\/p>\n<pre class=\"lang:python decode:true\">from mimesis import Fieldset\r\nfrom mimesis.locales import Locale\r\n\r\n# Create an instance of Fieldset and generate ten usernames\r\nfieldset = Fieldset(locale=Locale.EN)\r\nusernames = fieldset(\"username\", i=10) # generates a list of 10 usernames<\/pre>\n<p>Now that we have a basic understanding of how Mimesis operates, we&#8217;re all set to generate fake data and use it to test our dbt pipelines.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Combine-dbt-and-Mimesis-for-Robust-Data-Pipeline-Testing\"><\/span>Combine dbt and Mimesis for Robust Data Pipeline Testing<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Combining dbt&#8217;s transformation and testing capabilities with Mimesis&#8217;s ability to generate realistic test data allows us to create a strong framework for building reliable, scalable data pipelines. In the following sections, we&#8217;ll make our way up to testing our dbt pipelines with auto-generated fake data step by step.<\/p>\n<p>First, we&#8217;ll set up our environment and install the necessary dependencies. Next, we&#8217;ll look into parsing dbt schemas using Pydantic. 
Finally, we&#8217;ll explore using the Pydantic models to automatically generate fake data for an arbitrary dbt seed or model.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Setting-Up-Your-Environment\"><\/span>Setting Up Your Environment<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>First, clone the GitHub repository and navigate to the root directory:<\/p>\n<pre class=\"lang:default decode:true\">git clone https:\/\/github.com\/inovex\/blog-dbt-mimesis.git\r\ncd blog-dbt-mimesis<\/pre>\n<p>There are two options to set up your environment:<\/p>\n<h4><span class=\"ez-toc-section\" id=\"Option-1-Use-the-Development-Container-recommended\"><\/span>Option 1: Use the Development Container (recommended)<span class=\"ez-toc-section-end\"><\/span><\/h4>\n<p>The repository includes a <a href=\"https:\/\/containers.dev\/\">Development Container<\/a> Specification for quick setup. All necessary dependencies are already installed if you run the code inside the Development Container.<\/p>\n<h4><span class=\"ez-toc-section\" id=\"Option-2-Manual-Setup\"><\/span>Option 2: Manual Setup<span class=\"ez-toc-section-end\"><\/span><\/h4>\n<p>Install the required dependencies using <a href=\"https:\/\/python-poetry.org\/\">Poetry<\/a>. This assumes you have <a href=\"https:\/\/www.python.org\/downloads\/release\/python-3100\/\">Python 3.10<\/a> installed on your machine.<\/p>\n<pre class=\"lang:default decode:true\"># install poetry\r\npip install poetry\r\n# install project dependencies\r\npoetry install\r\n# install dbt dependencies\r\npoetry run dbt deps --project-dir dbt_mimesis_example<\/pre>\n<p>This project uses <a href=\"https:\/\/duckdb.org\/\">DuckDB<\/a>. To create a DuckDB database, you must install the DuckDB CLI on your machine. You can follow <a href=\"https:\/\/duckdb.org\/docs\/installation\/\">this guide<\/a> to install it on your OS. 
Next, run the following command to create a database file inside the <span class=\"lang:default decode:true crayon-inline\">dbt_mimesis_example<\/span> directory:<\/p>\n<pre class=\"lang:default decode:true\">duckdb dbt_mimesis_example\/dev.duckdb \"SELECT 'Database created successfully';\"<\/pre>\n<h2><span class=\"ez-toc-section\" id=\"Testing-dbt-Pipelines-Using-Mimesis\"><\/span>Testing dbt Pipelines Using Mimesis<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Having a working environment, we are ready to look at the code we will use to generate realistic data to test our dbt pipelines.<\/p>\n<figure id=\"attachment_59916\" aria-describedby=\"caption-attachment-59916\" style=\"width: 880px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-59916 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Bildschirmfoto-2024-12-12-um-11.42.41.png\" alt=\"dbt data lineage graph\" width=\"880\" height=\"353\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Bildschirmfoto-2024-12-12-um-11.42.41.png 880w, https:\/\/www.inovex.de\/wp-content\/uploads\/Bildschirmfoto-2024-12-12-um-11.42.41-300x120.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/Bildschirmfoto-2024-12-12-um-11.42.41-768x308.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/Bildschirmfoto-2024-12-12-um-11.42.41-400x160.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/Bildschirmfoto-2024-12-12-um-11.42.41-360x144.png 360w\" sizes=\"auto, (max-width: 880px) 100vw, 880px\" \/><figcaption id=\"caption-attachment-59916\" class=\"wp-caption-text\">Lineage graph of our dbt project<\/figcaption><\/figure>\n<p>The lineage graph of the <code>dbt_mimesis_example<\/code> dbt project shows that there are two seeds &#8211; namely <code>raw_airplanes<\/code> and <code>raw_flights<\/code> &#8211; and three downstream models depending on those seeds: <code>airplanes<\/code>, <code>cities<\/code>, and <code>flights<\/code>. 
In dbt, seeds are CSV files typically located in the <code>seeds<\/code> directory that can be loaded into a data warehouse using the <code>dbt seed<\/code> command. Hence, to properly test our dbt pipeline, we need to ensure that we have two CSV files: <code>raw_airplanes.csv<\/code> and <code>raw_flights.csv<\/code>. To this end, we will use Pydantic to parse the <code>schema.yml<\/code> file inside the <code>seeds<\/code> directory. Subsequently, we&#8217;ll use the parsed schema definition to auto-generate fake data.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Step-1-Parsing-dbt-Schemas-With-Pydantic\"><\/span>Step 1: Parsing dbt Schemas With Pydantic<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p><a href=\"https:\/\/docs.pydantic.dev\/latest\/\">Pydantic<\/a> is a Python library that validates data using type annotations. It allows you to define models as Python classes, validate data against those models, and parse various input formats (e.g., JSON, YAML) into structured Python objects. 
This makes it an ideal tool for working with dbt&#8217;s <span class=\"lang:default decode:true crayon-inline\">schema.yml<\/span> files, as it ensures that the schema definitions are valid and compatible with downstream processes like our test data generation.<\/p>\n<p>Inside <code>data_generator\/models.py<\/code>, we define a few Pydantic models to parse our dbt <span class=\"lang:default decode:true crayon-inline \">schema.yml<\/span>\u00a0files into structured Python objects:<\/p>\n<pre class=\"lang:python decode:true\">from enum import Enum\r\nfrom pydantic import AliasChoices, BaseModel, Field\r\n\r\n\r\nclass DataType(Enum):\r\n    DATE = \"date\"\r\n    VARCHAR = \"varchar\"\r\n    INTEGER = \"integer\"\r\n\r\n\r\nclass DBTColumn(BaseModel):\r\n    \"\"\"Basic DBT column\"\"\"\r\n\r\n    name: str\r\n    data_type: DataType\r\n    data_tests: list[str] = []\r\n    meta: dict[str, str | bool] = {}\r\n\r\n\r\nclass DBTTable(BaseModel):\r\n    \"\"\"DBT Table\"\"\"\r\n\r\n    name: str\r\n    columns: list[DBTColumn]\r\n\r\n\r\nclass DBTSchema(BaseModel):\r\n    \"\"\"DBT Schema\"\"\"\r\n\r\n    models: list[DBTTable] = Field(validation_alias=AliasChoices(\"models\", \"seeds\"))<\/pre>\n<p>The <span class=\"lang:default decode:true crayon-inline\">DBTSchema<\/span>\u00a0model represents a list of <span class=\"lang:default decode:true crayon-inline \">DBTTable<\/span> objects. In turn, a <span class=\"lang:default decode:true crayon-inline\">DBTTable<\/span> consists of a name and a list of columns represented by the <span class=\"lang:default decode:true crayon-inline \">DBTColumn<\/span> model. 
Finally, a <span class=\"lang:default decode:true crayon-inline\">DBTColumn<\/span> has a name, a data type, an optional list of data tests, and an optional <span class=\"lang:default decode:true crayon-inline\">meta<\/span> dictionary containing metadata about the column.<\/p>\n<p>We can now use these models to parse our YAML-based dbt schema definition into structured Python objects:<\/p>\n<pre class=\"lang:python decode:true\">from pydantic_yaml import parse_yaml_file_as\r\nfrom data_generator.models import DBTSchema\r\n\r\nschema = parse_yaml_file_as(model_type=DBTSchema, file=\"dbt_mimesis_example\/seeds\/schema.yml\")\r\nprint(schema) # prints a DBTSchema object representing the parsed schema\r\n<\/pre>\n<h3><span class=\"ez-toc-section\" id=\"Step-2-Auto-Generating-Test-Data-with-Mimesis\"><\/span>Step 2: Auto-Generating Test Data with Mimesis<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>The next step is to use the structured Python objects representing the dbt schema to generate fake data. To this end, we create a <code>TestDataGenerator<\/code> class inside <code>data_generator\/generator.py<\/code> that implements functionality to generate the data. Below, we&#8217;ll break it down step by step.<\/p>\n<h4><span class=\"ez-toc-section\" id=\"Initialize-the-class\"><\/span>Initialize the class<span class=\"ez-toc-section-end\"><\/span><\/h4>\n<p>Let&#8217;s define a constructor for our class. We need some key attributes like our <code>schema<\/code> and the <code>locale<\/code>, as well as a mapping of the data types in our schema to the corresponding Mimesis data types. Furthermore, we want to leverage the powerful providers Mimesis offers, so we add a key attribute <span class=\"lang:default decode:true crayon-inline\">field_aliases<\/span>, which allows us to map column names to the providers. The class has two more attributes that are initialized with <code>None<\/code> and an empty dictionary, respectively. 
We can use the <code>iterations<\/code> attribute later to decide how many rows should be generated for a given table. Similarly, the <code>reproducible_id_store<\/code> will help store primary keys for cross-referencing.<\/p>\n<pre class=\"lang:python decode:true\" title=\"data_generator\/generator.py\">import random\r\nimport pandas as pd\r\nfrom mimesis import Fieldset, Locale\r\nfrom mimesis.keys import maybe\r\n\r\nfrom data_generator.models import DBTColumn, DBTSchema, DBTTable\r\n\r\n# mapping of dbt data types to mimesis providers\r\nDATA_TYPE_MAPPING = {\r\n    \"VARCHAR\": {\"name\": \"text.word\"},\r\n    \"DATE\": {\"name\": \"datetime.date\"},\r\n    \"INTEGER\": {\"name\": \"integer_number\", \"start\": 0, \"end\": 1000},\r\n}\r\n\r\n\r\nclass TestDataGenerator:\r\n    def __init__(\r\n        self, schema: DBTSchema, locale: Locale = Locale.EN, data_type_mapping: dict = DATA_TYPE_MAPPING, field_aliases: dict | None = None\r\n    ) -&gt; None:\r\n        self.schema = schema\r\n        self.reproducible_id_store: dict[str, list] = {}\r\n        self.fieldset = Fieldset(locale)\r\n        # fall back to an empty dict to avoid a shared mutable default argument\r\n        self.field_aliases = {\r\n            key: {\"name\": value} for key, value in (field_aliases or {}).items()\r\n        }\r\n        self.data_type_mapping = data_type_mapping\r\n        self.iterations = None<\/pre>\n<h4><span class=\"ez-toc-section\" id=\"Generate-Random-Row-Counts\"><\/span>Generate Random Row Counts<span class=\"ez-toc-section-end\"><\/span><\/h4>\n<p>It&#8217;s not very realistic if each generated table has the same number of rows. Therefore, we&#8217;ll add a method <span class=\"lang:default decode:true crayon-inline \">_generate_random_iterations<\/span> that assigns a random number of rows to each table within a specified range (i.e., between <span class=\"lang:default decode:true crayon-inline\">min_rows<\/span> and <span class=\"lang:default decode:true crayon-inline\">max_rows<\/span>). 
We store the resulting dictionary in the <code>iterations<\/code> class attribute mentioned earlier.<\/p>\n<pre class=\"lang:python decode:true\">    def _generate_random_iterations(self, min_rows: int, max_rows: int) -&gt; dict:\r\n        \"\"\"Generate a random number of iterations for each table within the specified limits\r\n\r\n        Parameters\r\n        ----------\r\n        min_rows : int\r\n            Minimum number of rows to be generated for a table\r\n        max_rows : int\r\n            Maximum number of rows to be generated for a table\r\n\r\n        Returns\r\n        -------\r\n        dict\r\n            Returns a dictionary with the table names as keys and their corresponding row numbers as values\r\n        \"\"\"\r\n\r\n        return {\r\n            table.name: random.randint(min_rows, max_rows)\r\n            for table in self.schema.models\r\n        }<\/pre>\n<h4><span class=\"ez-toc-section\" id=\"Generate-Unique-Values\"><\/span>Generate Unique Values<span class=\"ez-toc-section-end\"><\/span><\/h4>\n<p>In some cases, our schema contains a uniqueness constraint in the form of a <span class=\"lang:default decode:true crayon-inline\">unique<\/span> data test, or a column is defined as a primary key. In these cases, we need a way to generate unique values. The <code>_generate_unique_values<\/code> method takes as inputs a <span class=\"lang:default decode:true crayon-inline \">DBTTable<\/span> object and the <span class=\"lang:default decode:true crayon-inline \">DBTColumn<\/span> for which the values should be generated, and returns a list of unique values generated using Mimesis.<\/p>\n<p>Certain Mimesis providers may not generate enough unique values. For instance, the <span class=\"lang:default decode:true crayon-inline \">Airplane<\/span> provider can only produce ~300 unique airplane models. 
This becomes problematic if the number of iterations specified for the given table is larger than the maximum number of unique values available. Therefore, we must handle this edge case and cap the number of rows generated for the particular table at the number of unique values available.<\/p>\n<pre class=\"lang:python decode:true\">    def _generate_unique_values(\r\n        self, table: DBTTable, column: DBTColumn, iterations: int = None\r\n    ) -&gt; list:\r\n        \"\"\"Generate a specified number of unique values using Mimesis\r\n\r\n        Parameters\r\n        ----------\r\n        table : DBTTable\r\n            Table for which the values are generated\r\n        column : DBTColumn\r\n            Column for which the values are generated\r\n        iterations : int, optional\r\n            Number of unique values to generate, defaults to the table's row count\r\n\r\n        Returns\r\n        -------\r\n        list\r\n            Returns a list of unique values\r\n        \"\"\"\r\n        iterations = (\r\n            iterations if iterations is not None else self.iterations[table.name]\r\n        )\r\n        unique_values = set()\r\n        consecutive_no_increase = 0\r\n\r\n        while len(unique_values) &lt; iterations:\r\n            previous_len = len(unique_values)\r\n            new_values = self.fieldset(\r\n                **self.field_aliases.get(\r\n                    column.name, self.data_type_mapping[column.data_type.value.upper()]\r\n                ),\r\n                i=iterations * 2,\r\n            )\r\n            unique_values.update(new_values)\r\n\r\n            # check whether any new values have been added since previous iteration\r\n            if len(unique_values) == previous_len:\r\n                consecutive_no_increase += 1\r\n            else:\r\n                consecutive_no_increase = 0\r\n\r\n            if consecutive_no_increase == 3:\r\n                # not enough values available, restarting with lower number of iterations for given table\r\n                print(\r\n                    f\"Not enough unique values for {column.name}. Creating maximum number available.\"\r\n                )\r\n                self.iterations[table.name] = len(unique_values)\r\n                self.generate_data()\r\n                break\r\n\r\n        return list(unique_values)[:iterations]<\/pre>\n<h4><span class=\"ez-toc-section\" id=\"Generate-Data-for-a-Table\"><\/span>Generate Data for a Table<span class=\"ez-toc-section-end\"><\/span><\/h4>\n<p>Next, we need a method that creates synthetic data for a single table by iterating over its columns. The <code>_generate_test_data_for_table<\/code> method takes a <code>DBTTable<\/code> object as input and returns a <a href=\"https:\/\/pandas.pydata.org\/\">Pandas<\/a> data frame with the generated fake data for the corresponding table.<\/p>\n<p>In the context of relational databases, there are usually relationships between tables. Whenever a column references a primary key from another table, it is referred to as a <em>foreign key<\/em>. A value in a foreign key column must either be null or exist as a primary key in the referenced table. This logical dependency is known as referential integrity. Mimesis does not natively support referential integrity when generating data. Therefore, we apply some logic to consider primary and foreign keys during data generation. To this end, we use the <span class=\"lang:default decode:true crayon-inline\">meta<\/span> field, an optional part of dbt&#8217;s schema definitions. 
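<\/p>\n<p>For illustration, such <code>meta<\/code> entries might look as follows (a hypothetical sketch; the column names here do not necessarily match the repository&#8217;s actual <code>schema.yml<\/code>):<\/p>

```yaml
seeds:
  - name: raw_flights
    columns:
      - name: flight_id
        data_type: varchar
        data_tests:
          - unique
          - not_null
        meta:
          primary_key: true
      - name: airplane_id
        data_type: varchar
        meta:
          foreign_key: raw_airplanes.airplane_id
```

<p>A <code>primary_key: true<\/code> flag marks a key column, while a <code>foreign_key<\/code> entry of the form <code>referenced_table.column<\/code> points at the primary key it references.<\/p>\n<p>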
Specifically, we added metadata describing whether a column is a primary or foreign key to the <code>dbt_mimesis_example\/seeds\/schema.yml<\/code> file.<\/p>\n<p>Within the <code>_generate_test_data_for_table<\/code> method, we also check whether a column is a primary key, a foreign key, or a regular column and handle it accordingly.<\/p>\n<pre class=\"lang:python decode:true\">    def _generate_test_data_for_table(self, table: DBTTable) -&gt; pd.DataFrame:\r\n        \"\"\"Generate test data for a given table\r\n\r\n        Parameters\r\n        ----------\r\n        table : DBTTable\r\n            pydantic model describing a dbt table\r\n\r\n        Returns\r\n        -------\r\n        pd.DataFrame\r\n            Returns a pandas DataFrame with the generated data based on the table's schema\r\n        \"\"\"\r\n\r\n        schema_data = {}\r\n\r\n        for column in table.columns:\r\n            # check if column has primary\/foreign key constraints\r\n            primary_key = column.meta.get(\"primary_key\", None)\r\n            foreign_key = column.meta.get(\"foreign_key\", None)\r\n\r\n            # generate data according to column type\r\n            if foreign_key:\r\n                schema_data[column.name] = self._handle_key_column(\r\n                    table, column, foreign_key\r\n                )\r\n                continue\r\n\r\n            elif primary_key:\r\n                schema_data[column.name] = self._handle_key_column(table, column)\r\n                continue\r\n\r\n            schema_data[column.name] = self._handle_regular_column(table, column)\r\n\r\n        df = pd.DataFrame.from_dict(schema_data)\r\n        return df<\/pre>\n<h5>Handle Foreign and Primary Keys<\/h5>\n<p>In case the column is defined as a primary or foreign key column, the <code>_handle_key_column<\/code> method is called.<\/p>\n<p>First, it checks whether a set of values for the (referenced) primary key is already available in the <code>reproducible_id_store<\/code> class attribute (as you might remember, this is an initially empty dictionary). If not, it generates a unique set of values for the primary key and adds it to the dictionary. Finally, it returns either a random sample of the set of values available from the referenced primary key column or the set of values itself &#8211; depending on whether it&#8217;s a foreign or a primary key.<\/p>\n<pre class=\"lang:python decode:true\">    def _handle_key_column(\r\n        self, table: DBTTable, column: DBTColumn, foreign_key: str = None\r\n    ) -&gt; list:\r\n        \"\"\"Method to generate data for primary\/foreign key columns\r\n\r\n        Parameters\r\n        ----------\r\n        table : DBTTable\r\n            DBTTable object\r\n        column : DBTColumn\r\n            DBTColumn object\r\n        foreign_key : str, optional\r\n            foreign key, e.g., 'referenced_table.pk', by default None\r\n\r\n        Returns\r\n        -------\r\n        list\r\n            Returns the list of key values for the column\r\n        \"\"\"\r\n        reproducible_id = foreign_key if foreign_key else f\"{table.name}.{column.name}\"\r\n        iterations = self.iterations[foreign_key.split(\".\")[0]] if foreign_key else None\r\n\r\n        if reproducible_id not in self.reproducible_id_store.keys():\r\n            # store generated data in reproducible_id_store\r\n            self.reproducible_id_store[reproducible_id] = self._generate_unique_values(\r\n                table, column, iterations\r\n            )\r\n\r\n        if foreign_key is not None:\r\n            return random.choices(\r\n                self.reproducible_id_store[foreign_key], k=self.iterations[table.name]\r\n            )\r\n\r\n        return self.reproducible_id_store[reproducible_id]<\/pre>\n<h5>Handle Regular Columns<\/h5>\n<p>Alternatively, the column is handled as a regular, non-key column. In that case, it returns a list of unique values if the <code>unique<\/code> data test is set for the column. 
If the <code>unique<\/code> test is not set, it returns a list of generated values without enforcing uniqueness. That list might also include null values &#8211; depending on whether or not the <code>not_null<\/code> data test is part of the column specification.<\/p>\n<pre class=\"lang:python decode:true\">def _handle_regular_column(self, table: DBTTable, column: DBTColumn) -&gt; list:\r\n    \"\"\"Generate data for regular columns, i.e., columns without primary\/foreign key\r\n    constraints, taking nullability and uniqueness constraints into account\r\n\r\n    Parameters\r\n    ----------\r\n    table : DBTTable\r\n        DBTTable object\r\n    column : DBTColumn\r\n        DBTColumn object\r\n\r\n    Returns\r\n    -------\r\n    list\r\n        Returns a list of generated values\r\n    \"\"\"\r\n    if \"unique\" in column.data_tests:\r\n        return self._generate_unique_values(table=table, column=column)\r\n\r\n    probability_of_nones = 0 if \"not_null\" in column.data_tests else 0.1\r\n    return self.fieldset(\r\n        **self.field_aliases.get(\r\n            column.name,\r\n            self.data_type_mapping[column.data_type.value.upper()],\r\n        ),\r\n        i=self.iterations[table.name],\r\n        key=maybe(None, probability=probability_of_nones),\r\n    )<\/pre>\n<h4><span class=\"ez-toc-section\" id=\"Generate-Data-for-the-Entire-Schema\"><\/span>Generate Data for the Entire Schema<span class=\"ez-toc-section-end\"><\/span><\/h4>\n<p>Finally, we want to implement logic to generate data for an entire schema. 
The <span class=\"lang:default decode:true crayon-inline \">generate_data<\/span> method orchestrates the whole process by calling the helper methods we implemented to create data for all tables in a schema.<\/p>\n<pre class=\"lang:python decode:true\">def generate_data(\r\n    self, min_rows: int = 10, max_rows: int = 100\r\n) -&gt; dict[str, pd.DataFrame]:\r\n    \"\"\"Generate test data for a given schema\r\n\r\n    Parameters\r\n    ----------\r\n    min_rows : int, optional\r\n        Minimum number of rows to be generated for a table, by default 10\r\n    max_rows : int, optional\r\n        Maximum number of rows to be generated for a table, by default 100\r\n\r\n    Returns\r\n    -------\r\n    dict[str, pd.DataFrame]\r\n        Returns a dictionary with table names as keys and pandas DataFrames containing generated data as values\r\n    \"\"\"\r\n    if self.iterations is None:\r\n        self.iterations = self._generate_random_iterations(min_rows, max_rows)\r\n\r\n    generated_data = {}\r\n\r\n    for table in self.schema.models:\r\n        df = self._generate_test_data_for_table(table=table)\r\n        generated_data[table.name] = df\r\n\r\n    return generated_data<\/pre>\n<h4><span class=\"ez-toc-section\" id=\"Full-Picture\"><\/span>Full Picture<span class=\"ez-toc-section-end\"><\/span><\/h4>\n<p>We&#8217;re all set to use the <code>TestDataGenerator<\/code> to generate some actual data based on the <span class=\"lang:default decode:true crayon-inline \">schema.yml<\/span> file. 
Let&#8217;s add some field aliases to use Mimesis&#8217;s <span class=\"lang:default decode:true crayon-inline \">city<\/span> and <span class=\"lang:default decode:true crayon-inline \">airplane<\/span> providers.<\/p>\n<pre class=\"lang:python decode:true\">from pydantic_yaml import parse_yaml_file_as\r\nfrom data_generator.models import DBTSchema\r\n# adjust the import to the module where TestDataGenerator is defined in the repository\r\nfrom data_generator.generator import TestDataGenerator\r\n\r\nFIELD_ALIASES = {\r\n    \"OriginCityName\": \"city\",\r\n    \"DestCityName\": \"city\",\r\n    \"AirplaneModel\": \"airplane\",\r\n}\r\n\r\n# parse the schema\r\nschema = parse_yaml_file_as(model_type=DBTSchema, file=\"dbt_mimesis_example\/seeds\/schema.yml\")\r\n\r\n# instantiate a TestDataGenerator\r\ngenerator = TestDataGenerator(schema=schema, field_aliases=FIELD_ALIASES)\r\n\r\n# generate test data with row limits\r\ntest_data = generator.generate_data(min_rows=50, max_rows=200)\r\n\r\n# store the generated data inside our dbt seeds directory;\r\n# index=False keeps the pandas index out of the seed CSVs\r\nfor table_name, df in test_data.items():\r\n    print(f\"Generated data for {table_name}\")\r\n    df.to_csv(\r\n        f\"\/PATH\/TO\/YOUR\/PROJECT\/DIR\/dbt_mimesis_example\/seeds\/{table_name}.csv\",\r\n        index=False,\r\n    )<\/pre>\n<p>The repository also contains a <code>data_generator\/main.py<\/code> file with a simple CLI implemented using <a href=\"https:\/\/click.palletsprojects.com\/en\/stable\/\">click<\/a>. It allows you to generate test data for an arbitrary dbt schema using the following command:<\/p>\n<pre class=\"lang:sh decode:true\">poetry run python data_generator\/main.py --dbt-model-path dbt_mimesis_example\/seeds\/schema.yml --output-path dbt_mimesis_example\/seeds --min-rows 100 --max-rows 1000<\/pre>\n<p>In this case, we are generating a random number of rows between 100 and 1,000 for each of the seeds in our <code>dbt_mimesis_example\/seeds\/schema.yml<\/code>. 
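Before loading the seeds, it can be worth sanity-checking referential integrity across the generated CSVs. The sketch below shows the idea on small in-memory frames; the table and column names (`airplanes`, `flights`, `AirplaneId`) are hypothetical stand-ins for your own schema:

```python
import pandas as pd

# Hypothetical stand-ins for two generated seed files,
# e.g. airplanes.csv (primary key side) and flights.csv (foreign key side).
airplanes = pd.DataFrame({"AirplaneId": [1, 2, 3]})
flights = pd.DataFrame(
    {"FlightId": [10, 11, 12, 13], "AirplaneId": [1, 3, 3, 2]}
)

# Every foreign key value must reference an existing primary key value.
orphans = flights[~flights["AirplaneId"].isin(airplanes["AirplaneId"])]
assert orphans.empty, f"Orphaned foreign keys found:\n{orphans}"
```

A check like this could also run in CI right after the generation step, catching broken references before `dbt seed` is even invoked.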
You can adjust the number of rows by setting the <code>--min-rows<\/code> and <code>--max-rows<\/code> flags. <strong>Note: <\/strong>When using larger numbers of rows (e.g., a minimum of 1,000,000 rows or more), generating the data might take some time.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Using-Mimesis-to-Test-dbt-Pipelines\"><\/span>Using Mimesis to Test dbt Pipelines<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>We successfully generated test data based on our dbt seeds schema using the described method. Now that we have some CSV files in our dbt seeds directory, let&#8217;s run the <span class=\"lang:default decode:true crayon-inline\">dbt seed<\/span> command to load the data into our DuckDB database, then execute our dbt pipeline and tests:<\/p>\n<pre class=\"lang:sh decode:true\"># navigate to the dbt project directory\r\ncd dbt_mimesis_example\r\n\r\n# load seeds into database\r\npoetry run dbt seed\r\n\r\n# run dbt pipelines\r\npoetry run dbt run\r\n\r\n# perform dbt test\r\npoetry run dbt test<\/pre>\n<p>If everything went as expected, all tests should have passed, and you should now have five tables in your DuckDB database. Let&#8217;s inspect some values:<\/p>\n<pre class=\"lang:sh decode:true\"># enter the duckdb database\r\nduckdb dev.duckdb\r\n\r\n# list tables\r\n.tables\r\n\r\n# inspect some values from the flights table\r\nSELECT * FROM flights LIMIT 5;<\/pre>\n<p>In our case, the output of the SQL command looks like this. 
Yours should look similar but with different values, as Mimesis generates them randomly:<\/p>\n<figure id=\"attachment_59918\" aria-describedby=\"caption-attachment-59918\" style=\"width: 640px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-59918 size-large\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Bildschirmfoto-2024-12-12-um-12.58.27-1024x292.png\" alt=\"duckdb table output\" width=\"640\" height=\"183\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Bildschirmfoto-2024-12-12-um-12.58.27-1024x292.png 1024w, https:\/\/www.inovex.de\/wp-content\/uploads\/Bildschirmfoto-2024-12-12-um-12.58.27-300x85.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/Bildschirmfoto-2024-12-12-um-12.58.27-768x219.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/Bildschirmfoto-2024-12-12-um-12.58.27-400x114.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/Bildschirmfoto-2024-12-12-um-12.58.27-360x103.png 360w, https:\/\/www.inovex.de\/wp-content\/uploads\/Bildschirmfoto-2024-12-12-um-12.58.27.png 1046w\" sizes=\"auto, (max-width: 640px) 100vw, 640px\" \/><figcaption id=\"caption-attachment-59918\" class=\"wp-caption-text\">Sample values for the `flights` table<\/figcaption><\/figure>\n<h2><span class=\"ez-toc-section\" id=\"Conclusion\"><\/span>Conclusion<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>This blog post explored how to use Mimesis and dbt to test data pipelines. We implemented a data generator that automatically creates fake data based on Pydantic models derived from parsed dbt schemas.<\/p>\n<p>If you haven&#8217;t done so already, check out our <a href=\"https:\/\/github.com\/inovex\/blog-dbt-mimesis\">GitHub repository<\/a>. It includes a GitHub Actions pipeline that automates dbt data testing using the approach we discussed. Have you tried applying this method to your dbt pipelines? We would love to hear your feedback! Happy testing! 
\ud83e\udd73<\/p>\n<p>If you want to dive deeper into the topics covered in this blog post, we recommend checking out the following resources:<\/p>\n<ul>\n<li><a href=\"https:\/\/mimesis.name\/master\/about.html\">Mimesis Documentation<\/a><\/li>\n<li><a href=\"https:\/\/docs.getdbt.com\/docs\/build\/documentation\">dbt Documentation<\/a><\/li>\n<li><a href=\"https:\/\/docs.pydantic.dev\/latest\/\">Pydantic Documentation<\/a><\/li>\n<\/ul>\n","protected":false},"author":357,"coauthors":[{"id":357,"display_name":"Timo Hartmann","user_nicename":"thartmann"},{"id":302,"display_name":"Marvin Klossek","user_nicename":"mklossek"}]}