{"id":43797,"date":"2023-07-21T12:58:23","date_gmt":"2023-07-21T10:58:23","guid":{"rendered":"https:\/\/www.inovex.de\/?p=43797"},"modified":"2023-07-21T12:59:36","modified_gmt":"2023-07-21T10:59:36","slug":"exploiting-foundation-models-for-improved-language-identification-from-speech","status":"publish","type":"post","link":"https:\/\/www.inovex.de\/de\/blog\/exploiting-foundation-models-for-improved-language-identification-from-speech\/","title":{"rendered":"Exploiting Foundation Models for Improved Language Identification from Speech"},"content":{"rendered":"<p>In the current technological landscape, an extensive range of voice assistants and dictation software have been developed, capable of processing human speech for diverse purposes. Nevertheless, the majority of these applications necessitate manual language specification by the user, as they lack the capability of automated language identification. The simultaneous use of several languages is only possible to a very limited extent.<!--more--><\/p>\n<p><span id=\"more-43280\"><\/span>For voice assistants in private use (Google Assistant, Apple Siri, Amazon Alexa), automatic language identification is not strictly necessary, since the spoken language rarely changes. It is essential, however, for applications intended for public spaces. One example is a chatbot at a train station or airport that answers questions from a wide variety of people in different languages. Reliable identification of the spoken language is a prerequisite for easy and efficient verbal communication between such a chatbot and its users. 
Other applications for this kind of software, which are described later in detail, would be online translators or large video streaming platforms.<span id=\"Problem-The-ghost-from-the-past\"><\/span><\/p>\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_85 counter-hierarchy ez-toc-counter ez-toc-custom ez-toc-container-direction\">\n<div class=\"ez-toc-title-container\"><p class=\"ez-toc-title\" style=\"cursor:inherit\"><\/p>\n<\/div><nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/www.inovex.de\/de\/blog\/exploiting-foundation-models-for-improved-language-identification-from-speech\/#Current-models-for-language-identification\" >Current models for language identification<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/www.inovex.de\/de\/blog\/exploiting-foundation-models-for-improved-language-identification-from-speech\/#Learning-the-wrong-objectives\" >Learning the wrong objectives<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/www.inovex.de\/de\/blog\/exploiting-foundation-models-for-improved-language-identification-from-speech\/#The-Solution\" >The Solution<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/www.inovex.de\/de\/blog\/exploiting-foundation-models-for-improved-language-identification-from-speech\/#The-synergistic-combination-of-foundation-models\" >The synergistic combination of foundation models<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/www.inovex.de\/de\/blog\/exploiting-foundation-models-for-improved-language-identification-from-speech\/#Experimental-results-on-the-FLEURS-dataset\" >Experimental results on the FLEURS 
dataset<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/www.inovex.de\/de\/blog\/exploiting-foundation-models-for-improved-language-identification-from-speech\/#Practical-use-of-the-results\" >Practical use of the results<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-7\" href=\"https:\/\/www.inovex.de\/de\/blog\/exploiting-foundation-models-for-improved-language-identification-from-speech\/#Conclusion\" >Conclusion<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-8\" href=\"https:\/\/www.inovex.de\/de\/blog\/exploiting-foundation-models-for-improved-language-identification-from-speech\/#References\" >References<\/a><\/li><\/ul><\/nav><\/div>\n<h2><span class=\"ez-toc-section\" id=\"Current-models-for-language-identification\"><\/span><span id=\"Problem-The-ghost-from-the-past\"><\/span>Current models for language identification<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Several language identification models have already been developed but with significant limitations. Many of these models have narrow language coverage, while others suffer from high classification error rates. Consequently, current models do not possess the necessary accuracy for unambiguous language recognition, which is a critical requirement for the aforementioned applications. In these applications, a vast array of languages must be identified with very high accuracy in order to be useful in the real world.<\/p>\n<p>In some of the previous approaches, Convolutional Neural Networks (CNNs) (Revay and Teschke, 2019), a combination of CNNs and Recurrent Neural Networks (Bartz et al., 2017), and a combination of a Gaussian Mixture Model and a Support Vector Machine (Mitra, Garcia-Romero, and Espy-Wilson, 2008) were used to identify languages in audio. These models were created and tested using only six languages. 
They identify the spoken language with high accuracy (around 90 % classification accuracy). However, the number of languages covered by these approaches is too small to be of practical use.<\/p>\n<p>Models that recognize more languages include wav-to-vec-51 (Conneau et al., 2023), mSLAM-CTC (Bapna et al., 2022), and <a href=\"https:\/\/openai.com\/research\/whisper\" target=\"_blank\" rel=\"noopener\">Whisper<\/a> (Radford et al., 2022). These models achieve only mediocre results on the <a href=\"https:\/\/arxiv.org\/pdf\/2205.12446.pdf\" target=\"_blank\" rel=\"noopener\">FLEURS<\/a> test dataset (Conneau et al., 2023), which includes 102 languages:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-43812 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/language_identification_preliminary_results.png\" alt=\"Results for existing models for language identification\" width=\"482\" height=\"150\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/language_identification_preliminary_results.png 482w, https:\/\/www.inovex.de\/wp-content\/uploads\/language_identification_preliminary_results-300x93.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/language_identification_preliminary_results-400x124.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/language_identification_preliminary_results-360x112.png 360w\" sizes=\"auto, (max-width: 482px) 100vw, 482px\" \/><\/p>\n<p>Currently available models thus exhibit a clear trade-off between the number of languages they can identify and their identification accuracy. 
Researchers have not yet developed a reliable method that can identify many languages while achieving very high accuracy.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Learning-the-wrong-objectives\"><\/span>Learning the wrong objectives<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The biggest challenge in training new models for language identification is that the model can <strong>learn the wrong objectives<\/strong>. More precisely, the model may learn the voices or other audio characteristics of the training files rather than the differences between the languages themselves. The problem becomes more severe the smaller the dataset is: a large model can easily learn to distinguish a small number of voices, or other audio features such as the characteristics of the specific microphones used.<\/p>\n<p>We observed this effect while training and testing our own models. When part of the initial dataset was held out as a validation set, both training and validation accuracy were consistently very high. However, when we tested the model on a different dataset that covered the same languages but not the same voices, the accuracy regularly dropped to only marginally above what a random classifier would achieve. This indicates that the model had learned language-specific features only to a very limited extent, if at all, and had instead picked up unwanted features such as specific voices or audio characteristics.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"The-Solution\"><\/span>The Solution<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>To avoid learning the wrong objectives and to ensure that language-specific characteristics are learned, the training dataset should contain many different voices and audio characteristics. In addition, the relative share of speakers of each gender should be similar for each language. 
If, for example, the speakers of one language are mostly female while the speakers of another are mostly male, the model could simply learn to differentiate between the two genders and not between the two languages.<\/p>\n<p>Furthermore, validation and testing should use data from different speakers than training, so that the only characteristic shared by the training and validation\/test data is the language being spoken. In this article, we apply both measures by using the FLEURS dataset. This dataset includes many different speakers and has no speaker overlap between the training, development, and test sets.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"The-synergistic-combination-of-foundation-models\"><\/span><span id=\"The-solution\"><\/span>The synergistic combination of foundation models<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Our method, which combines and enhances current language identification models, can be summarised as training a new classifier model that uses existing large foundation models as embedding layers. In this approach, the existing models serve as feature extractors that pull relevant information out of the audio files. A newly trained model is then built on top of them; it represents only the last layers and performs the final classification.<\/p>\n<p>In the experiments, we built an instance of this approach that combines the penultimate layers of <a href=\"https:\/\/openai.com\/research\/whisper\" target=\"_blank\" rel=\"noopener\">Whisper<\/a> and a <a href=\"https:\/\/huggingface.co\/TalTechNLP\/voxlingua107-epaca-tdnn\" target=\"_blank\" rel=\"noopener\">Time Delay Neural Network<\/a>. 
We took the output vectors of the penultimate layers of these two models, concatenated them, and used these concatenated vectors from each audio file as input vectors for the training of a new classifier model. For training, validation, and testing, the corresponding audio files from the FLEURS training, development, and test dataset have been used. Using this approach, the penultimate layers of the two models are essentially combined. Instead of the original output layers of each model, the newly trained model acts as an output layer that can classify a combined input vector.<\/p>\n<p>An example of a simple, binary classification with one output neuron could look like this:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-43809 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/language_identification_method.png\" alt=\"An illustration showing the described method for language identification.\" width=\"1100\" height=\"603\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/language_identification_method.png 1100w, https:\/\/www.inovex.de\/wp-content\/uploads\/language_identification_method-300x164.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/language_identification_method-1024x561.png 1024w, https:\/\/www.inovex.de\/wp-content\/uploads\/language_identification_method-768x421.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/language_identification_method-400x219.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/language_identification_method-528x290.png 528w, https:\/\/www.inovex.de\/wp-content\/uploads\/language_identification_method-360x197.png 360w\" sizes=\"auto, (max-width: 1100px) 100vw, 1100px\" \/><\/p>\n<h2><span class=\"ez-toc-section\" id=\"Experimental-results-on-the-FLEURS-dataset\"><\/span>Experimental results on the FLEURS dataset<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>To develop this approach, we used the complete FLEURS dataset, which contains 102 languages. 
We evaluated this approach using the FLEURS test dataset and present the results of this evaluation, along with a comparison to existing models, in the following table:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-43814 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/language_identification_results.png\" alt=\"Final results for all methods for language identification.\" width=\"492\" height=\"192\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/language_identification_results.png 492w, https:\/\/www.inovex.de\/wp-content\/uploads\/language_identification_results-300x117.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/language_identification_results-400x156.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/language_identification_results-360x140.png 360w\" sizes=\"auto, (max-width: 492px) 100vw, 492px\" \/><\/p>\n<p>The poor results of the individual Whisper model can largely be attributed to the fact that it was not trained on many of the languages present in the FLEURS dataset. Conversely, the good results of the combination method are largely due to reusing Whisper and extending it with the missing classes. Nevertheless, the combination of the two models still outperforms each individual model when its classes are extended with the method described above (adding a new readout model): evaluated individually in this way, the Whisper model achieved an accuracy of around 87.5 % and the Time Delay Neural Network an accuracy of around 76.5 %.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Practical-use-of-the-results\"><\/span><span id=\"Challenges\"><\/span>Practical use of the results<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Higher language identification accuracy can benefit a variety of real-world scenarios. 
One example is the automatic generation of subtitles for online video platforms, where large numbers of user-generated videos in unknown languages are uploaded regularly. To choose the correct speech-to-text model for the transcription, the language first has to be identified. Service chatbots in public areas like airports or train stations, which talk to people from many countries, could also benefit: with language identification, a conversation becomes possible without the user having to select a language first.<\/p>\n<p>Additionally, the results could be relevant for online translators, as many of them can recognize the language of a text but not of speech. In the case of <a href=\"https:\/\/translate.google.com\/\" target=\"_blank\" rel=\"noopener\">Google Translate<\/a>, for example, a language must be selected before using the voice input. With text input, however, the language is identified automatically and the text is translated into the desired language. Improved language identification from speech could be used to identify the language of the voice input automatically as well.<\/p>\n<p>When training speech-to-text models, we can collect audio data in large amounts from the internet. However, when we need to train models for a specific language, we must verify the language of the audio data. Since verifying the language manually would take a lot of time, we could use a language identification model instead.<\/p>\n<p>Finally, because our approach only requires training the small classifier model and running the forward (inference) pass of the large foundation models, it is well suited to groups with limited compute resources. 
We ran all our experiments on a single node with a single GPU.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Conclusion\"><\/span><span id=\"Challenges\"><\/span>Conclusion<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Combining different models as an embedding layer and training a final classifier on top achieves higher accuracy than the current models achieve individually. This increase in classification accuracy, however, comes with an increase in inference time, because the forward pass has to be executed for multiple models. Additionally, the method requires collecting data and spending time on training the classifier. When implementing a similar method, one should consider the trade-off between inference time and classification accuracy. Overall, the method considerably reduces the initially described problem of low accuracy when classifying many languages.<\/p>\n<p>Furthermore, the idea of combining several models from a specific domain and training a model on top of them is not limited to language identification; it can also be applied to other tasks such as image classification. By using this method, we can potentially improve classification performance in other domains as well.<\/p>\n<p>The contents of this blog post are based on a master&#8217;s thesis written by Benedikt Augenstein.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"References\"><\/span><span id=\"Challenges\"><\/span>References<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Bapna, Ankur et al. (2022). \u201cmSLAM: Massively multilingual joint pre-training for speech and text\u201d. In: arXiv preprint arXiv:2202.01374.<\/p>\n<p>Bartz, Christian et al. (2017). \u201cLanguage identification using deep convolutional recurrent neural networks\u201d. In: Neural Information Processing: 24th International Conference, ICONIP 2017, Guangzhou, China, November 14\u201318, 2017, Proceedings, Part VI 24. Springer, pp. 
880\u2013889.<\/p>\n<p>Conneau, Alexis et al. (2023). \u201cFLEURS: Few-shot learning evaluation of universal representations of speech\u201d. In: 2022 IEEE Spoken Language Technology Workshop (SLT). IEEE, pp. 798\u2013805.<\/p>\n<p>Mitra, Vikramjit, Daniel Garcia-Romero, and Carol Y Espy-Wilson (2008). \u201cLanguage detection in audio content analysis\u201d. In: 2008 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, pp. 2109\u20132112.<\/p>\n<p>Radford, Alec et al. (2022). \u201cRobust speech recognition via large-scale weak supervision\u201d. In: arXiv preprint arXiv:2212.04356.<\/p>\n<p>Revay, Shauna and Matthew Teschke (2019). \u201cMulticlass language identification using deep learning on spectral images of audio signals\u201d. In: arXiv preprint arXiv:1905.04348.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In the current technological landscape, an extensive range of voice assistants and dictation software have been developed, capable of processing human speech for diverse purposes. Nevertheless, the majority of these applications necessitate manual language specification by the user, as they lack the capability of automated language identification. 
The simultaneous use of several languages is only [&hellip;]<\/p>\n","protected":false},"author":342,"featured_media":43805,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"inline_featured_image":false,"ep_exclude_from_search":false,"footnotes":""},"tags":[509,511,373,375,206,225,151,140],"service":[76,431,75],"coauthors":[{"id":342,"display_name":"Benedikt Augenstein","user_nicename":"baugenstein"},{"id":199,"display_name":"Darjan Salaj","user_nicename":"dsalaj"}],"class_list":["post-43797","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","tag-ai-2","tag-artificial-intelligence-2","tag-chatbot","tag-conversational-ai","tag-data-science","tag-data-science-in-production","tag-deep-learning","tag-machine-learning","service-artificial-intelligence","service-data-science","service-nlp"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.8 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Exploiting Foundation Models for Improved Language Identification from Speech - inovex GmbH<\/title>\n<meta name=\"description\" content=\"In this article, a new method for enhancing the accuracy for the task of language identification is described.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.inovex.de\/de\/blog\/exploiting-foundation-models-for-improved-language-identification-from-speech\/\" \/>\n<meta property=\"og:locale\" content=\"de_DE\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Exploiting Foundation Models for Improved Language Identification from Speech - inovex GmbH\" \/>\n<meta property=\"og:description\" content=\"In this article, a new method for enhancing the accuracy for the task of language identification is described.\" \/>\n<meta 
property=\"og:url\" content=\"https:\/\/www.inovex.de\/de\/blog\/exploiting-foundation-models-for-improved-language-identification-from-speech\/\" \/>\n<meta property=\"og:site_name\" content=\"inovex GmbH\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/inovexde\" \/>\n<meta property=\"article:published_time\" content=\"2023-07-21T10:58:23+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2023-07-21T10:59:36+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.inovex.de\/wp-content\/uploads\/sound_waves.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1280\" \/>\n\t<meta property=\"og:image:height\" content=\"640\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Benedikt Augenstein, Darjan Salaj\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:image\" content=\"https:\/\/www.inovex.de\/wp-content\/uploads\/sound_waves-1024x512.png\" \/>\n<meta name=\"twitter:creator\" content=\"@inovexgmbh\" \/>\n<meta name=\"twitter:site\" content=\"@inovexgmbh\" \/>\n<meta name=\"twitter:label1\" content=\"Verfasst von\" \/>\n\t<meta name=\"twitter:data1\" content=\"Benedikt Augenstein\" \/>\n\t<meta name=\"twitter:label2\" content=\"Gesch\u00e4tzte Lesezeit\" \/>\n\t<meta name=\"twitter:data2\" content=\"9\u00a0Minuten\" \/>\n\t<meta name=\"twitter:label3\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data3\" content=\"Benedikt Augenstein, Darjan Salaj\" \/>\n<script type=\"application\/ld+json\" 
class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/exploiting-foundation-models-for-improved-language-identification-from-speech\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/exploiting-foundation-models-for-improved-language-identification-from-speech\\\/\"},\"author\":{\"name\":\"Benedikt Augenstein\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/person\\\/2628f665b940d875d58e202b90e86307\"},\"headline\":\"Exploiting Foundation Models for Improved Language Identification from Speech\",\"datePublished\":\"2023-07-21T10:58:23+00:00\",\"dateModified\":\"2023-07-21T10:59:36+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/exploiting-foundation-models-for-improved-language-identification-from-speech\\\/\"},\"wordCount\":1756,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/exploiting-foundation-models-for-improved-language-identification-from-speech\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/sound_waves.png\",\"keywords\":[\"Ai\",\"Artificial Intelligence\",\"Chatbot\",\"Conversational Ai\",\"Data Science\",\"Data Science in Production\",\"Deep Learning\",\"Machine Learning\"],\"articleSection\":[\"Applications\",\"English Content\",\"inovex 
Lab\",\"Methods\"],\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/exploiting-foundation-models-for-improved-language-identification-from-speech\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/exploiting-foundation-models-for-improved-language-identification-from-speech\\\/\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/exploiting-foundation-models-for-improved-language-identification-from-speech\\\/\",\"name\":\"Exploiting Foundation Models for Improved Language Identification from Speech - inovex GmbH\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/exploiting-foundation-models-for-improved-language-identification-from-speech\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/exploiting-foundation-models-for-improved-language-identification-from-speech\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/sound_waves.png\",\"datePublished\":\"2023-07-21T10:58:23+00:00\",\"dateModified\":\"2023-07-21T10:59:36+00:00\",\"description\":\"In this article, a new method for enhancing the accuracy for the task of language identification is 
described.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/exploiting-foundation-models-for-improved-language-identification-from-speech\\\/#breadcrumb\"},\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/exploiting-foundation-models-for-improved-language-identification-from-speech\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/exploiting-foundation-models-for-improved-language-identification-from-speech\\\/#primaryimage\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/sound_waves.png\",\"contentUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/sound_waves.png\",\"width\":1280,\"height\":640},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/exploiting-foundation-models-for-improved-language-identification-from-speech\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Exploiting Foundation Models for Improved Language Identification from Speech\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#website\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/\",\"name\":\"inovex GmbH\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"de\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#organization\",\"name\":\"inovex 
GmbH\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2021\\\/03\\\/inovex-logo-16-9-1.png\",\"contentUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2021\\\/03\\\/inovex-logo-16-9-1.png\",\"width\":1921,\"height\":1081,\"caption\":\"inovex GmbH\"},\"image\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.facebook.com\\\/inovexde\",\"https:\\\/\\\/x.com\\\/inovexgmbh\",\"https:\\\/\\\/www.instagram.com\\\/inovexlife\\\/\",\"https:\\\/\\\/www.linkedin.com\\\/company\\\/inovex\",\"https:\\\/\\\/www.youtube.com\\\/channel\\\/UC7r66GT14hROB_RQsQBAQUQ\"]},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/person\\\/2628f665b940d875d58e202b90e86307\",\"name\":\"Benedikt Augenstein\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/05897b0b1f3914d551f24512a8e37a59d5893060749349beed84526b7be3188c?s=96&d=retro&r=gd6a30fe44f1c7a53e84027d8b54cfc41\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/05897b0b1f3914d551f24512a8e37a59d5893060749349beed84526b7be3188c?s=96&d=retro&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/05897b0b1f3914d551f24512a8e37a59d5893060749349beed84526b7be3188c?s=96&d=retro&r=g\",\"caption\":\"Benedikt Augenstein\"},\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/author\\\/baugenstein\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Exploiting Foundation Models for Improved Language Identification from Speech - inovex GmbH","description":"In this article, a new method for enhancing the accuracy for the task of language identification is described.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.inovex.de\/de\/blog\/exploiting-foundation-models-for-improved-language-identification-from-speech\/","og_locale":"de_DE","og_type":"article","og_title":"Exploiting Foundation Models for Improved Language Identification from Speech - inovex GmbH","og_description":"In this article, a new method for enhancing the accuracy for the task of language identification is described.","og_url":"https:\/\/www.inovex.de\/de\/blog\/exploiting-foundation-models-for-improved-language-identification-from-speech\/","og_site_name":"inovex GmbH","article_publisher":"https:\/\/www.facebook.com\/inovexde","article_published_time":"2023-07-21T10:58:23+00:00","article_modified_time":"2023-07-21T10:59:36+00:00","og_image":[{"width":1280,"height":640,"url":"https:\/\/www.inovex.de\/wp-content\/uploads\/sound_waves.png","type":"image\/png"}],"author":"Benedikt Augenstein, Darjan Salaj","twitter_card":"summary_large_image","twitter_image":"https:\/\/www.inovex.de\/wp-content\/uploads\/sound_waves-1024x512.png","twitter_creator":"@inovexgmbh","twitter_site":"@inovexgmbh","twitter_misc":{"Verfasst von":"Benedikt Augenstein","Gesch\u00e4tzte Lesezeit":"9\u00a0Minuten","Written by":"Benedikt Augenstein, Darjan 
Salaj"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.inovex.de\/de\/blog\/exploiting-foundation-models-for-improved-language-identification-from-speech\/#article","isPartOf":{"@id":"https:\/\/www.inovex.de\/de\/blog\/exploiting-foundation-models-for-improved-language-identification-from-speech\/"},"author":{"name":"Benedikt Augenstein","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/person\/2628f665b940d875d58e202b90e86307"},"headline":"Exploiting Foundation Models for Improved Language Identification from Speech","datePublished":"2023-07-21T10:58:23+00:00","dateModified":"2023-07-21T10:59:36+00:00","mainEntityOfPage":{"@id":"https:\/\/www.inovex.de\/de\/blog\/exploiting-foundation-models-for-improved-language-identification-from-speech\/"},"wordCount":1756,"commentCount":0,"publisher":{"@id":"https:\/\/www.inovex.de\/de\/#organization"},"image":{"@id":"https:\/\/www.inovex.de\/de\/blog\/exploiting-foundation-models-for-improved-language-identification-from-speech\/#primaryimage"},"thumbnailUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/sound_waves.png","keywords":["Ai","Artificial Intelligence","Chatbot","Conversational Ai","Data Science","Data Science in Production","Deep Learning","Machine Learning"],"articleSection":["Applications","English Content","inovex Lab","Methods"],"inLanguage":"de","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.inovex.de\/de\/blog\/exploiting-foundation-models-for-improved-language-identification-from-speech\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.inovex.de\/de\/blog\/exploiting-foundation-models-for-improved-language-identification-from-speech\/","url":"https:\/\/www.inovex.de\/de\/blog\/exploiting-foundation-models-for-improved-language-identification-from-speech\/","name":"Exploiting Foundation Models for Improved Language Identification from Speech - inovex 
GmbH","isPartOf":{"@id":"https:\/\/www.inovex.de\/de\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.inovex.de\/de\/blog\/exploiting-foundation-models-for-improved-language-identification-from-speech\/#primaryimage"},"image":{"@id":"https:\/\/www.inovex.de\/de\/blog\/exploiting-foundation-models-for-improved-language-identification-from-speech\/#primaryimage"},"thumbnailUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/sound_waves.png","datePublished":"2023-07-21T10:58:23+00:00","dateModified":"2023-07-21T10:59:36+00:00","description":"In this article, a new method for enhancing the accuracy for the task of language identification is described.","breadcrumb":{"@id":"https:\/\/www.inovex.de\/de\/blog\/exploiting-foundation-models-for-improved-language-identification-from-speech\/#breadcrumb"},"inLanguage":"de","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.inovex.de\/de\/blog\/exploiting-foundation-models-for-improved-language-identification-from-speech\/"]}]},{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/de\/blog\/exploiting-foundation-models-for-improved-language-identification-from-speech\/#primaryimage","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/sound_waves.png","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/sound_waves.png","width":1280,"height":640},{"@type":"BreadcrumbList","@id":"https:\/\/www.inovex.de\/de\/blog\/exploiting-foundation-models-for-improved-language-identification-from-speech\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.inovex.de\/de\/"},{"@type":"ListItem","position":2,"name":"Exploiting Foundation Models for Improved Language Identification from Speech"}]},{"@type":"WebSite","@id":"https:\/\/www.inovex.de\/de\/#website","url":"https:\/\/www.inovex.de\/de\/","name":"inovex 
GmbH","description":"","publisher":{"@id":"https:\/\/www.inovex.de\/de\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.inovex.de\/de\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"de"},{"@type":"Organization","@id":"https:\/\/www.inovex.de\/de\/#organization","name":"inovex GmbH","url":"https:\/\/www.inovex.de\/de\/","logo":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/logo\/image\/","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/2021\/03\/inovex-logo-16-9-1.png","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/2021\/03\/inovex-logo-16-9-1.png","width":1921,"height":1081,"caption":"inovex GmbH"},"image":{"@id":"https:\/\/www.inovex.de\/de\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/inovexde","https:\/\/x.com\/inovexgmbh","https:\/\/www.instagram.com\/inovexlife\/","https:\/\/www.linkedin.com\/company\/inovex","https:\/\/www.youtube.com\/channel\/UC7r66GT14hROB_RQsQBAQUQ"]},{"@type":"Person","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/person\/2628f665b940d875d58e202b90e86307","name":"Benedikt Augenstein","image":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/secure.gravatar.com\/avatar\/05897b0b1f3914d551f24512a8e37a59d5893060749349beed84526b7be3188c?s=96&d=retro&r=gd6a30fe44f1c7a53e84027d8b54cfc41","url":"https:\/\/secure.gravatar.com\/avatar\/05897b0b1f3914d551f24512a8e37a59d5893060749349beed84526b7be3188c?s=96&d=retro&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/05897b0b1f3914d551f24512a8e37a59d5893060749349beed84526b7be3188c?s=96&d=retro&r=g","caption":"Benedikt 
Augenstein"},"url":"https:\/\/www.inovex.de\/de\/blog\/author\/baugenstein\/"}]}},"_links":{"self":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/43797","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/users\/342"}],"replies":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/comments?post=43797"}],"version-history":[{"count":6,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/43797\/revisions"}],"predecessor-version":[{"id":50790,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/43797\/revisions\/50790"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/media\/43805"}],"wp:attachment":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/media?parent=43797"}],"wp:term":[{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/tags?post=43797"},{"taxonomy":"service","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/service?post=43797"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/coauthors?post=43797"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}