{"id":43280,"date":"2023-03-20T16:38:41","date_gmt":"2023-03-20T15:38:41","guid":{"rendered":"https:\/\/www.inovex.de\/?p=43280"},"modified":"2023-12-13T10:58:43","modified_gmt":"2023-12-13T09:58:43","slug":"documentai-challenges-real-world-usage","status":"publish","type":"post","link":"https:\/\/www.inovex.de\/de\/blog\/documentai-challenges-real-world-usage\/","title":{"rendered":"DocumentAI: Challenges and Real World Usage"},"content":{"rendered":"<p><span style=\"font-weight: 400\">The past years and even decades have seen rapid increases in automation and digitalization, with \u201cpaperless office\u201c being only one of the terms coined in recent years. <\/span><span style=\"font-weight: 400\">One area, however, appears to be less affected by this trend: documents. While documents <em>are<\/em> available via native\/scanned PDF files or images, their semi-structured format often needs additional processing, for example, OCR. <\/span><span style=\"font-weight: 400\">The task of extracting structured information from semi-structured documents is often referred to as Intelligent Document Processing (IDP) or DocumentAI. In the following sections, current trends, developments, and challenges in implementing DocumentAI systems are explored.<\/span><!--more--><\/p>\n<p><span style=\"font-weight: 400\">An important decision is: make or buy. While both <\/span><a href=\"https:\/\/azure.microsoft.com\/en-us\/products\/form-recognizer\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400\">Microsoft Azure<\/span><\/a><span style=\"font-weight: 400\"> and <\/span><a href=\"https:\/\/cloud.google.com\/document-ai-workbench\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400\">Google Cloud<\/span><\/a><span style=\"font-weight: 400\"> offer a complete package covering the model, model training, and labeling, concerns about vendor lock-in and requirements for confidential or private data may limit the use of the respective service. 
Additionally, it may prove valuable to develop the knowledge and capabilities in-house. As is often the case, no perfect system exists (yet). The field is still highly active and changing rapidly, with a new model getting published nearly every month. To answer the question of \u201cmake or buy\u201c it is helpful to make a detour to what is possible and what is available off-the-shelf.<\/span><\/p>\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_83 counter-hierarchy ez-toc-counter ez-toc-custom ez-toc-container-direction\">\n<div class=\"ez-toc-title-container\"><p class=\"ez-toc-title\" style=\"cursor:inherit\"><\/p>\n<\/div><nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/www.inovex.de\/de\/blog\/documentai-challenges-real-world-usage\/#Problem-The-ghost-from-the-past\" >Problem: The ghost from the past<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/www.inovex.de\/de\/blog\/documentai-challenges-real-world-usage\/#The-solution\" >The solution<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/www.inovex.de\/de\/blog\/documentai-challenges-real-world-usage\/#Traditional-approaches\" >Traditional approaches<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/www.inovex.de\/de\/blog\/documentai-challenges-real-world-usage\/#Modern-approaches\" >Modern approaches<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/www.inovex.de\/de\/blog\/documentai-challenges-real-world-usage\/#Recent-developments\" >Recent developments<\/a><ul class='ez-toc-list-level-4' ><li class='ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-6\" 
href=\"https:\/\/www.inovex.de\/de\/blog\/documentai-challenges-real-world-usage\/#ChatGPT\" >ChatGPT<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-7\" href=\"https:\/\/www.inovex.de\/de\/blog\/documentai-challenges-real-world-usage\/#%F0%9F%A4%97-and-the-power-of-collaboration\" >\ud83e\udd17 and the power of collaboration<\/a><\/li><\/ul><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-8\" href=\"https:\/\/www.inovex.de\/de\/blog\/documentai-challenges-real-world-usage\/#Challenges\" >Challenges<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-9\" href=\"https:\/\/www.inovex.de\/de\/blog\/documentai-challenges-real-world-usage\/#How-to-get-started-with-DocumentAI\" >How to get started with DocumentAI<\/a><\/li><\/ul><\/li><\/ul><\/nav><\/div>\n<h2><span class=\"ez-toc-section\" id=\"Problem-The-ghost-from-the-past\"><\/span>Problem: The ghost from the past<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-43290 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/document_ai-2.png\" alt=\"how text is stored within a PDF\" width=\"794\" height=\"349\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/document_ai-2.png 794w, https:\/\/www.inovex.de\/wp-content\/uploads\/document_ai-2-300x132.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/document_ai-2-768x338.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/document_ai-2-400x176.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/document_ai-2-360x158.png 360w\" sizes=\"auto, (max-width: 794px) 100vw, 794px\" \/><\/p>\n<p><span style=\"font-weight: 400\">Even if the document was a <em>native<\/em> PDF with a text layer and without any need for OCR, the problem persists: PDFs are a graphical format that only allows placing a letter, an image or a 
vector graphic on the canvas without any built-in capability to represent words, paragraphs, or tables. This is why \u201cselecting\u201c text in a PDF document sometimes does not follow the apparent visual structure of the document. In fact, this feature is only possible because the PDF reading app uses a heuristic to group letters to words, words to paragraphs, and so on. It becomes even worse when multi-pane or other complex layouts are involved. Often, documents and their underlying formats prioritize human readability and are easy to print \u2013 but they lack a well-formed machine-readable structure.<\/span><\/p>\n<p><span style=\"font-weight: 400\">This is not surprising, as the main driver for the document format was, as the name Portable Document Format (PDF) already suggests, the need for a format that <\/span><a href=\"https:\/\/en.wikipedia.org\/wiki\/History_of_PDF\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400\">renders a document exactly the same on every device<\/span><\/a><span style=\"font-weight: 400\"> regardless of the underlying operating system and configuration.<\/span><\/p>\n<p><span style=\"font-weight: 400\">Easily extracting information from the document was not on the agenda, thus even a simple task such as the extraction of the \u201ctotal sum\u201c field from a scanned and OCRed receipt may involve more than a simple lookup. For some cases, custom document-specific extraction rules can solve the problem \u2013 but this kind of approach does not scale and, therefore, quickly becomes infeasible.<\/span><\/p>\n<h2><span class=\"ez-toc-section\" id=\"The-solution\"><\/span>The solution<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><span style=\"font-weight: 400\">As we have seen, there is more to a \u201cdigital\u201c document than meets the eye \u2013 literally.<\/span><span style=\"font-weight: 400\"><br \/>\n<\/span><span style=\"font-weight: 400\">So how can the problem be addressed? 
Interestingly, the \u201cmundane\u201c field of DocumentAI\/<\/span><span style=\"font-weight: 400\">Intelligent Document Processing (IDP) leveraged <em>several<\/em> technologies from other disciplines and combined them \u2013 a first hint that extraction sometimes is surprisingly difficult.<\/span><\/p>\n<h3><span class=\"ez-toc-section\" id=\"Traditional-approaches\"><\/span>Traditional approaches<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p><span style=\"font-weight: 400\">Traditional approaches often work with handcrafted features, mainly using the visual location of a word or a letter (bounding box) on the canvas as well as the textual data itself. With feature engineering, both the bounding boxes and the textual information can be refined and integrated further: The \u201ctotal sum\u201c field, for example, is likely to consist mostly of numbers that may be placed close to the bottom of a page.<\/span><\/p>\n<p><span style=\"font-weight: 400\">While these approaches are relatively easy to implement via Decision Trees, and a combination of TF-IDF matrices and custom features, lots of tinkering, trial and error, and feature engineering are involved. Above all, however, models trained on one dataset tend to not generalize well to slightly different ones. Also, long-range dependencies between words are hard to model with Decision Trees (a solution would be to use RNNs or Attention Based Models such as the Transformer), resulting in a simplified representation of the meaning of text. On the upside, simpler models are easier to understand, diagnose, and interpret and faster to train. 
You do not need an expensive sports car to go grocery shopping.<\/span><\/p>\n<h3><span class=\"ez-toc-section\" id=\"Modern-approaches\"><\/span>Modern approaches<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p><span style=\"font-weight: 400\">But sometimes a regular car just is not enough: With <\/span><a href=\"https:\/\/arxiv.org\/pdf\/2002.08087.pdf\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400\">LAMBERT<\/span><\/a><span style=\"font-weight: 400\">, Garncarek et al. introduced a layout-aware encoder-only model in 2021 based on RoBERTa with added bounding box coordinates that yield significantly better performance than RoBERTa alone.<\/span><\/p>\n<p><span style=\"font-weight: 400\">One year earlier, Xu et al. from Microsoft Research Asia published <\/span><a href=\"https:\/\/arxiv.org\/pdf\/1912.13318.pdf\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400\">LayoutLM<\/span><\/a><span style=\"font-weight: 400\">, which uses the BERT architecture and bounding box coordinates but also adds image embeddings obtained from a Faster R-CNN, adding another <em>modality<\/em>, the image of a page, to the toolchain. LayoutLM is currently available in its <\/span><a href=\"https:\/\/arxiv.org\/pdf\/2204.08387.pdf\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400\">3rd iteration<\/span><\/a><span style=\"font-weight: 400\">, which swaps the Faster R-CNN backbone for a linear embedding for better and easier integration with the remaining BERT architecture. 
<\/span><span style=\"font-weight: 400\">At the time of writing, <\/span><a href=\"https:\/\/huggingface.co\/docs\/transformers\/model_doc\/layoutlmv3\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400\">LayoutLMv3<\/span><\/a><span style=\"font-weight: 400\"> can be regarded as the state-of-the-art model for <\/span><a href=\"https:\/\/paperswithcode.com\/paper\/layoutlmv3-pre-training-for-document-ai-with\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400\">several tasks<\/span><\/a><span style=\"font-weight: 400\">. That being said, only the first version was available under a permissive license, rendering the successors unfit for commercial applications.<\/span><\/p>\n<p><span style=\"font-weight: 400\">A promising alternative is <\/span><a href=\"https:\/\/arxiv.org\/pdf\/2202.13669.pdf\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400\">LiLT<\/span><\/a><span style=\"font-weight: 400\"> by Wang et al., published in 2022, which is available from the <\/span><a href=\"https:\/\/huggingface.co\/docs\/transformers\/main\/en\/model_doc\/lilt\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400\">huggingface hub<\/span><\/a><span style=\"font-weight: 400\"> under a permissive license that allows commercial use (MIT). LiLT again drops the image modality and uses only bounding box layout information, coined \u201clayout flow\u201c, together with the RoBERTa \u201ctext flow\u201c.\u00a0<\/span><span style=\"font-weight: 400\">What makes this model particularly interesting is that both flows are loosely coupled, which makes it possible to take the layout flow from a LiLT model trained on English receipts and fine-tune it with a corresponding off-the-shelf pre-trained German RoBERTa model serving as the text flow. 
This ability to move quickly from one language to another is an essential asset for many business applications.<\/span><\/p>\n<h3><span class=\"ez-toc-section\" id=\"Recent-developments\"><\/span>Recent developments<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p><span style=\"font-weight: 400\">While the models discussed in this section are much larger and more complex than traditional approaches, they also tend to generalize much better to related data and do not need to be custom fitted to the data set using feature engineering. <\/span><span style=\"font-weight: 400\">In contrast to the aforementioned models which rely on provided text information that was either gathered via <\/span><a href=\"https:\/\/github.com\/tesseract-ocr\/tesseract\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400\">OCR<\/span><\/a><span style=\"font-weight: 400\"> (for scans\/photos) or an available native text layer in the PDF, other solutions such as <\/span><a href=\"https:\/\/arxiv.org\/pdf\/2111.15664.pdf\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400\">Donut<\/span><\/a><span style=\"font-weight: 400\"> published by Kim et al. in 2022 are \u201cOCR-free\u201c, which is particularly interesting for a custom task such as handwriting recognition.<\/span><\/p>\n<p><span style=\"font-weight: 400\">Another promising direction is the \u201cre-discovery\u201c of the \u201cfull\u201c transformer with encoder and decoder blocks, such as the <\/span><a href=\"https:\/\/www.jmlr.org\/papers\/volume21\/20-074\/20-074.pdf\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400\">T5 model<\/span><\/a><span style=\"font-weight: 400\"> published in 2020 by Raffel et al., that does not need any bounding box annotations for training. Therefore, it is less costly and easier to train such a model, since human labeling of bounding boxes on the document structure is no longer needed. 
At training time, the model relies only on the text information and the key-value pairs that should be extracted. A database export would suffice. Powalski et al. from <\/span><a href=\"https:\/\/www.snowflake.com\/blog\/snowflake-to-acquire-applica\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400\">Applica (now part of Snowflake)<\/span><\/a><span style=\"font-weight: 400\"> enhanced the T5 model further and introduced the <\/span><a href=\"https:\/\/arxiv.org\/pdf\/2102.09550.pdf\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400\">TILT<\/span><\/a><span style=\"font-weight: 400\"> model, which makes it multi-modal by adding image and bounding box inputs, a process they call \u201c<\/span><a href=\"https:\/\/arxiv.org\/pdf\/2102.09550.pdf\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400\">Going Full-TILT Boogie<\/span><\/a><span style=\"font-weight: 400\">\u201c.<\/span><\/p>\n<p><span style=\"font-weight: 400\">And with the growing capabilities of <\/span><a href=\"https:\/\/huggingface.co\/tasks\/document-question-answering\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400\">Document Question Answering<\/span><\/a><span style=\"font-weight: 400\">, a pre-configured, learned structure might not be needed at all. Simply ask questions about the document in natural language: \u201cWhat is the total sum of all the items?\u201c. But for now, this only works reliably with simple documents. <\/span><\/p>\n<h4><span class=\"ez-toc-section\" id=\"ChatGPT\"><\/span>ChatGPT<span class=\"ez-toc-section-end\"><\/span><\/h4>\n<p><span style=\"font-weight: 400\">Another promising contender is \u2013 of course \u2013 ChatGPT and similar models. 
While they were not specifically trained as DocumentAI models, they have astounding capabilities, especially in the so-called <\/span><i><span style=\"font-weight: 400\">zero-shot <\/span><\/i><span style=\"font-weight: 400\">area: <\/span><a href=\"https:\/\/techcommunity.microsoft.com\/t5\/ai-applied-ai-blog\/revolutionize-your-enterprise-data-with-chatgpt-next-gen-apps-w\/ba-p\/3762087\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400\">ChatGPT, for example, is surprisingly good<\/span><\/a><span style=\"font-weight: 400\"> at extracting information from text even if it has never seen text data from the niche domain of concern. This is all thanks to the remarkable generalization capabilities these models possess.<\/span><\/p>\n<p><span style=\"font-weight: 400\">What ChatGPT lacks, however, is multi-modality: For now, it only accepts and processes text input without layout information. The latter is, as we have seen, crucial for complex documents with multi-level tables. Furthermore, few have both the resources and willingness to use such a complex and closed-source model within the organization, where privacy, compliance, and economic viability are major concerns. That being said, it is worthwhile to test-drive ChatGPT for DocumentAI tasks and less complex use cases.<\/span><\/p>\n<h4><span class=\"ez-toc-section\" id=\"%F0%9F%A4%97-and-the-power-of-collaboration\"><\/span>\ud83e\udd17 and the power of collaboration<span class=\"ez-toc-section-end\"><\/span><\/h4>\n<p><span style=\"font-weight: 400\">Nearly every recent contribution is available from the <\/span><a href=\"https:\/\/huggingface.co\/models\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400\">huggingface hub<\/span><\/a><span style=\"font-weight: 400\">, making it much easier for the Data Scientist \/ Machine Learning Engineer to get started and move to another model. 
See the <\/span><a href=\"https:\/\/huggingface.co\/blog\/document-ai\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400\">huggingface blog post<\/span><\/a><span style=\"font-weight: 400\"> on DocumentAI for further information on the relevant benchmark datasets, metrics, license issues, and related disciplines. <\/span><span style=\"font-weight: 400\">To keep up with the latest innovations and model releases, Paperswithcode provides <\/span><a href=\"https:\/\/paperswithcode.com\/sota\/semantic-entity-labeling-on-funsd\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400\">leaderboards<\/span><\/a><span style=\"font-weight: 400\"> on popular benchmark datasets, while Niels Rogge\u2019s <\/span><a href=\"https:\/\/github.com\/NielsRogge\/Transformers-Tutorials\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400\">Transformers Tutorials<\/span><\/a><span style=\"font-weight: 400\"> and <\/span><a href=\"https:\/\/www.philschmid.de\/fine-tuning-lilt\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400\">Philipp Schmid\u2019s<\/span><\/a><span style=\"font-weight: 400\"> blog are valuable hands-on resources.<\/span><\/p>\n<h2><span class=\"ez-toc-section\" id=\"Challenges\"><\/span>Challenges<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><span style=\"font-weight: 400\">Even though many models and benchmarks exist, real-world data tends to be different and more challenging to solve. 
Notable exceptions such as Applica\u2019s <\/span><a href=\"https:\/\/github.com\/applicaai\/kleister-nda\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400\">Kleister-NDA<\/span><\/a><span style=\"font-weight: 400\"> or IBM\u2019s <\/span><a href=\"https:\/\/github.com\/DS4SD\/DocLayNet\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400\">DocLayNet<\/span><\/a><span style=\"font-weight: 400\"> remain rare, and oftentimes simple tasks such as receipt understanding are far less challenging than complex documents with layout shifts and tables. For companies, it can therefore be difficult to assess whether their extraction tasks can be automated with the currently available models. Starting with a pilot means a considerable time investment \u2013 labeling the documents in particular can be costly, even if open source solutions such as <\/span><a href=\"https:\/\/labelstud.io\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400\">Label Studio<\/span><\/a><span style=\"font-weight: 400\"> are leveraged.<\/span><\/p>\n<p><span style=\"font-weight: 400\">What works well in a paper is not guaranteed to transfer to the \u201creal world\u201c. 
Even having a perfect model only means <em>one<\/em> part of the toolchain is ready.<\/span><\/p>\n<p><span style=\"font-weight: 400\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-43292 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/document_ai-3.png\" alt=\"DocumentAI systems\" width=\"794\" height=\"367\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/document_ai-3.png 794w, https:\/\/www.inovex.de\/wp-content\/uploads\/document_ai-3-300x139.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/document_ai-3-768x355.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/document_ai-3-400x185.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/document_ai-3-360x166.png 360w\" sizes=\"auto, (max-width: 794px) 100vw, 794px\" \/>\u00a0<\/span><\/p>\n<h3><span class=\"ez-toc-section\" id=\"How-to-get-started-with-DocumentAI\"><\/span>How to get started with DocumentAI<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p><span style=\"font-weight: 400\">Arguably more important than chasing benchmark scores and using the absolute \u201cbest-of-the-best\u201c approach is to establish data literacy within the company and get an overview of existing in-house data and the challenges faced with processing and automation. Keeping an overview of a challenging and fast-paced environment is difficult and requires technical expertise as well as practical experience and intuition. <\/span><span style=\"font-weight: 400\">Most of the mentioned models are publicly available. Still, their integration with existing systems and processes, a robust (re-)training and inference pipeline, and the toolchain for pre-processing and labeling are vital parts as well.<\/span><\/p>\n<p><span style=\"font-weight: 400\">To get a firm understanding of the opportunities of the technology and its use within your company, the following questions may guide you through the decision-making process: Can your company benefit from DocumentAI? 
How could it be integrated with existing systems? How to transform and ingest document data at scale? Is SaaS (software as a service) from big software vendors an option, or is a custom solution a better fit for the requirements?<\/span><\/p>\n<p><span style=\"font-weight: 400\">Off-the-shelf SaaS solutions have the advantage of a quick setup, low technical barriers, and the flexibility of an on-premise installation. But they are not very customizable, and in a rapidly changing environment settling for one provider can be problematic \u2013 especially if many capable models are freely available. Thus, for documents with complex tables or handwriting, a custom training process might be necessary.<\/span><\/p>\n<p><span style=\"font-weight: 400\">A follow-up blog post will explore the tools needed to implement DocumentAI at a company. So stay tuned for more stories about sports cars and a hands-on guide on how to drive Intelligent Document Processing home.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>The past years and even decades have seen rapid increases in automation and digitalization, with \u201cpaperless office\u201c being only one of the terms coined in recent years. Some areas, however, appear to be less affected by this trend: documents. 
While documents are available via native\/scanned PDF files or images, their semi-structured format often needs additional [&hellip;]<\/p>\n","protected":false},"author":337,"featured_media":43534,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"inline_featured_image":false,"ep_exclude_from_search":false,"footnotes":""},"tags":[733,225,578,579,140],"service":[76,414,431,75],"coauthors":[{"id":337,"display_name":"Malte B\u00fcttner","user_nicename":"mbuettner"}],"class_list":["post-43280","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","tag-cloud-en-2","tag-data-science-in-production","tag-information-extraction","tag-layoutlm","tag-machine-learning","service-artificial-intelligence","service-cloud","service-data-science","service-nlp"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.5 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>DocumentAI: Challenges and Real World Usage - inovex GmbH<\/title>\n<meta name=\"description\" content=\"This article explores recent developments in extracting structured information from semi-structured documents.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.inovex.de\/de\/blog\/documentai-challenges-real-world-usage\/\" \/>\n<meta property=\"og:locale\" content=\"de_DE\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"DocumentAI: Challenges and Real World Usage - inovex GmbH\" \/>\n<meta property=\"og:description\" content=\"This article explores recent developments in extracting structured information from semi-structured documents.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.inovex.de\/de\/blog\/documentai-challenges-real-world-usage\/\" \/>\n<meta property=\"og:site_name\" content=\"inovex 
GmbH\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/inovexde\" \/>\n<meta property=\"article:published_time\" content=\"2023-03-20T15:38:41+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2023-12-13T09:58:43+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.inovex.de\/wp-content\/uploads\/document-ai-header-2.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1920\" \/>\n\t<meta property=\"og:image:height\" content=\"1080\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Malte B\u00fcttner\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:image\" content=\"https:\/\/www.inovex.de\/wp-content\/uploads\/document-ai-header-2-1024x576.png\" \/>\n<meta name=\"twitter:creator\" content=\"@inovexgmbh\" \/>\n<meta name=\"twitter:site\" content=\"@inovexgmbh\" \/>\n<meta name=\"twitter:label1\" content=\"Verfasst von\" \/>\n\t<meta name=\"twitter:data1\" content=\"Malte B\u00fcttner\" \/>\n\t<meta name=\"twitter:label2\" content=\"Gesch\u00e4tzte Lesezeit\" \/>\n\t<meta name=\"twitter:data2\" content=\"9\u00a0Minuten\" \/>\n\t<meta name=\"twitter:label3\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data3\" content=\"Malte B\u00fcttner\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/documentai-challenges-real-world-usage\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/documentai-challenges-real-world-usage\\\/\"},\"author\":{\"name\":\"Malte B\u00fcttner\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/person\\\/1327bd076626e70f17bb045a68002602\"},\"headline\":\"DocumentAI: Challenges and Real World 
Usage\",\"datePublished\":\"2023-03-20T15:38:41+00:00\",\"dateModified\":\"2023-12-13T09:58:43+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/documentai-challenges-real-world-usage\\\/\"},\"wordCount\":1837,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/documentai-challenges-real-world-usage\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/document-ai-header-2.png\",\"keywords\":[\"Cloud\",\"Data Science in Production\",\"Information Extraction\",\"LayoutLM\",\"Machine Learning\"],\"articleSection\":[\"English Content\"],\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/documentai-challenges-real-world-usage\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/documentai-challenges-real-world-usage\\\/\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/documentai-challenges-real-world-usage\\\/\",\"name\":\"DocumentAI: Challenges and Real World Usage - inovex GmbH\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/documentai-challenges-real-world-usage\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/documentai-challenges-real-world-usage\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/document-ai-header-2.png\",\"datePublished\":\"2023-03-20T15:38:41+00:00\",\"dateModified\":\"2023-12-13T09:58:43+00:00\",\"description\":\"This article explores recent developments in extracting structured information from semi-structured 
documents.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/documentai-challenges-real-world-usage\\\/#breadcrumb\"},\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/documentai-challenges-real-world-usage\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/documentai-challenges-real-world-usage\\\/#primaryimage\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/document-ai-header-2.png\",\"contentUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/document-ai-header-2.png\",\"width\":1920,\"height\":1080,\"caption\":\"Illustration of heterogeneous data being extracted from a document using ai.\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/documentai-challenges-real-world-usage\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"DocumentAI: Challenges and Real World Usage\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#website\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/\",\"name\":\"inovex GmbH\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"de\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#organization\",\"name\":\"inovex 
GmbH\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2021\\\/03\\\/inovex-logo-16-9-1.png\",\"contentUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2021\\\/03\\\/inovex-logo-16-9-1.png\",\"width\":1921,\"height\":1081,\"caption\":\"inovex GmbH\"},\"image\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.facebook.com\\\/inovexde\",\"https:\\\/\\\/x.com\\\/inovexgmbh\",\"https:\\\/\\\/www.instagram.com\\\/inovexlife\\\/\",\"https:\\\/\\\/www.linkedin.com\\\/company\\\/inovex\",\"https:\\\/\\\/www.youtube.com\\\/channel\\\/UC7r66GT14hROB_RQsQBAQUQ\"]},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/person\\\/1327bd076626e70f17bb045a68002602\",\"name\":\"Malte B\u00fcttner\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/profile1-96x96.jpg2923d91c8cf793a60efbdeed40dd2728\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/profile1-96x96.jpg\",\"contentUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/profile1-96x96.jpg\",\"caption\":\"Malte B\u00fcttner\"},\"sameAs\":[\"https:\\\/\\\/www.linkedin.com\\\/in\\\/maltepaulb\\\/\"],\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/author\\\/mbuettner\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"DocumentAI: Challenges and Real World Usage - inovex GmbH","description":"This article explores recent developments in extracting structured information from semi-structured documents.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.inovex.de\/de\/blog\/documentai-challenges-real-world-usage\/","og_locale":"de_DE","og_type":"article","og_title":"DocumentAI: Challenges and Real World Usage - inovex GmbH","og_description":"This article explores recent developments in extracting structured information from semi-structured documents.","og_url":"https:\/\/www.inovex.de\/de\/blog\/documentai-challenges-real-world-usage\/","og_site_name":"inovex GmbH","article_publisher":"https:\/\/www.facebook.com\/inovexde","article_published_time":"2023-03-20T15:38:41+00:00","article_modified_time":"2023-12-13T09:58:43+00:00","og_image":[{"width":1920,"height":1080,"url":"https:\/\/www.inovex.de\/wp-content\/uploads\/document-ai-header-2.png","type":"image\/png"}],"author":"Malte B\u00fcttner","twitter_card":"summary_large_image","twitter_image":"https:\/\/www.inovex.de\/wp-content\/uploads\/document-ai-header-2-1024x576.png","twitter_creator":"@inovexgmbh","twitter_site":"@inovexgmbh","twitter_misc":{"Verfasst von":"Malte B\u00fcttner","Gesch\u00e4tzte Lesezeit":"9\u00a0Minuten","Written by":"Malte B\u00fcttner"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.inovex.de\/de\/blog\/documentai-challenges-real-world-usage\/#article","isPartOf":{"@id":"https:\/\/www.inovex.de\/de\/blog\/documentai-challenges-real-world-usage\/"},"author":{"name":"Malte B\u00fcttner","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/person\/1327bd076626e70f17bb045a68002602"},"headline":"DocumentAI: Challenges and Real World 
Usage","datePublished":"2023-03-20T15:38:41+00:00","dateModified":"2023-12-13T09:58:43+00:00","mainEntityOfPage":{"@id":"https:\/\/www.inovex.de\/de\/blog\/documentai-challenges-real-world-usage\/"},"wordCount":1837,"commentCount":0,"publisher":{"@id":"https:\/\/www.inovex.de\/de\/#organization"},"image":{"@id":"https:\/\/www.inovex.de\/de\/blog\/documentai-challenges-real-world-usage\/#primaryimage"},"thumbnailUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/document-ai-header-2.png","keywords":["Cloud","Data Science in Production","Information Extraction","LayoutLM","Machine Learning"],"articleSection":["English Content"],"inLanguage":"de","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.inovex.de\/de\/blog\/documentai-challenges-real-world-usage\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.inovex.de\/de\/blog\/documentai-challenges-real-world-usage\/","url":"https:\/\/www.inovex.de\/de\/blog\/documentai-challenges-real-world-usage\/","name":"DocumentAI: Challenges and Real World Usage - inovex GmbH","isPartOf":{"@id":"https:\/\/www.inovex.de\/de\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.inovex.de\/de\/blog\/documentai-challenges-real-world-usage\/#primaryimage"},"image":{"@id":"https:\/\/www.inovex.de\/de\/blog\/documentai-challenges-real-world-usage\/#primaryimage"},"thumbnailUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/document-ai-header-2.png","datePublished":"2023-03-20T15:38:41+00:00","dateModified":"2023-12-13T09:58:43+00:00","description":"This article explores recent developments in extracting structured information from semi-structured 
documents.","breadcrumb":{"@id":"https:\/\/www.inovex.de\/de\/blog\/documentai-challenges-real-world-usage\/#breadcrumb"},"inLanguage":"de","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.inovex.de\/de\/blog\/documentai-challenges-real-world-usage\/"]}]},{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/de\/blog\/documentai-challenges-real-world-usage\/#primaryimage","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/document-ai-header-2.png","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/document-ai-header-2.png","width":1920,"height":1080,"caption":"Illustration of heterogeneous data being extracted from a document using ai."},{"@type":"BreadcrumbList","@id":"https:\/\/www.inovex.de\/de\/blog\/documentai-challenges-real-world-usage\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.inovex.de\/de\/"},{"@type":"ListItem","position":2,"name":"DocumentAI: Challenges and Real World Usage"}]},{"@type":"WebSite","@id":"https:\/\/www.inovex.de\/de\/#website","url":"https:\/\/www.inovex.de\/de\/","name":"inovex GmbH","description":"","publisher":{"@id":"https:\/\/www.inovex.de\/de\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.inovex.de\/de\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"de"},{"@type":"Organization","@id":"https:\/\/www.inovex.de\/de\/#organization","name":"inovex GmbH","url":"https:\/\/www.inovex.de\/de\/","logo":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/logo\/image\/","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/2021\/03\/inovex-logo-16-9-1.png","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/2021\/03\/inovex-logo-16-9-1.png","width":1921,"height":1081,"caption":"inovex 
GmbH"},"image":{"@id":"https:\/\/www.inovex.de\/de\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/inovexde","https:\/\/x.com\/inovexgmbh","https:\/\/www.instagram.com\/inovexlife\/","https:\/\/www.linkedin.com\/company\/inovex","https:\/\/www.youtube.com\/channel\/UC7r66GT14hROB_RQsQBAQUQ"]},{"@type":"Person","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/person\/1327bd076626e70f17bb045a68002602","name":"Malte B\u00fcttner","image":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/wp-content\/uploads\/profile1-96x96.jpg2923d91c8cf793a60efbdeed40dd2728","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/profile1-96x96.jpg","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/profile1-96x96.jpg","caption":"Malte B\u00fcttner"},"sameAs":["https:\/\/www.linkedin.com\/in\/maltepaulb\/"],"url":"https:\/\/www.inovex.de\/de\/blog\/author\/mbuettner\/"}]}},"_links":{"self":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/43280","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/users\/337"}],"replies":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/comments?post=43280"}],"version-history":[{"count":5,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/43280\/revisions"}],"predecessor-version":[{"id":50266,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/43280\/revisions\/50266"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/media\/43534"}],"wp:attachment":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/media?parent=43280"}],"wp:term":[{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/tags?post=43280"},{"taxonomy":"service","embeddable":true,"href
":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/service?post=43280"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/coauthors?post=43280"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}