{"id":64788,"date":"2025-11-14T11:10:22","date_gmt":"2025-11-14T10:10:22","guid":{"rendered":"https:\/\/www.inovex.de\/?p=64788"},"modified":"2026-05-13T12:17:35","modified_gmt":"2026-05-13T10:17:35","slug":"llm-batch-processing-with-ray-vllm-gpu-efficiency-data-privacy","status":"publish","type":"post","link":"https:\/\/www.inovex.de\/de\/blog\/llm-batch-processing-with-ray-vllm-gpu-efficiency-data-privacy\/","title":{"rendered":"A Batch Made In Heaven? Efficient Prompt Processing with Ray &#038; vLLM"},"content":{"rendered":"<p>Batch processing of data can be significantly more cost-effective, as requests are handled together with consistent resource utilization \u2013 this is especially useful for prompt batch processing with LLMs, which mostly requires costly GPU resources.<\/p>\n<p>In contrast, processing individual prompts on the fly can create an inefficient usage pattern \u2013 periods of high demand alternate with idle time when no requests are being processed, yet infrastructure costs remain constant regardless of utilization.<\/p>\n<p>While providers like OpenAI offer a <a href=\"https:\/\/platform.openai.com\/docs\/guides\/batch\">Batch API<\/a> for processing many prompts at once, the goal of this article is to showcase how the same can be achieved with open-source models and open-source technologies, potentially hosted on private infrastructure, so that this capability is also available to projects whose compliance regulations do not allow the use of external providers.<\/p>\n<p>Many clients prioritize the secure processing of their data, especially when utilizing technologies like large language models (LLMs). 
By hosting open-source models in a private data infrastructure, companies can ensure that their data never leaves their own environment, significantly enhancing their compliance posture.<\/p>\n<p>Another advantage of hosting open-source technology is the flexibility to customize the solution to the business needs, for example, when trying out different open-source models.<\/p>\n<p>For scenarios where GPU-efficient LLM batch processing and the aforementioned privacy restrictions play a role, Ray and its ecosystem can be a very good choice, as we will see in the next sections.<\/p>\n<p><!--more--><\/p>\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_85 counter-hierarchy ez-toc-counter ez-toc-custom ez-toc-container-direction\">\n<div class=\"ez-toc-title-container\"><p class=\"ez-toc-title\" style=\"cursor:inherit\"><\/p>\n<\/div><nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/www.inovex.de\/de\/blog\/llm-batch-processing-with-ray-vllm-gpu-efficiency-data-privacy\/#Ray-Overview\" >Ray Overview<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/www.inovex.de\/de\/blog\/llm-batch-processing-with-ray-vllm-gpu-efficiency-data-privacy\/#LLM-Batch-Inference-with-vLLM-on-Ray\" >LLM Batch Inference with vLLM on Ray<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/www.inovex.de\/de\/blog\/llm-batch-processing-with-ray-vllm-gpu-efficiency-data-privacy\/#vLLM\" >vLLM<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/www.inovex.de\/de\/blog\/llm-batch-processing-with-ray-vllm-gpu-efficiency-data-privacy\/#Ray-Problem-statement\" >Ray &amp; Problem statement<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 
ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/www.inovex.de\/de\/blog\/llm-batch-processing-with-ray-vllm-gpu-efficiency-data-privacy\/#Architecture\" >Architecture<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/www.inovex.de\/de\/blog\/llm-batch-processing-with-ray-vllm-gpu-efficiency-data-privacy\/#Summary-Challenges\" >Summary &amp; Challenges<\/a><\/li><\/ul><\/nav><\/div>\n<h2><span class=\"ez-toc-section\" id=\"Ray-Overview\"><\/span>Ray Overview<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>In the project&#8217;s <a href=\"https:\/\/docs.ray.io\/en\/latest\/ray-overview\/getting-started.html\">own words<\/a>, Ray is \u201can open-source framework to build and scale your ML and Python applications easily\u201d. Prominent companies such as <a href=\"https:\/\/engineering.atspotify.com\/2023\/02\/unleashing-ml-innovation-at-spotify-with-ray\/\">Spotify<\/a>, <a href=\"https:\/\/www.youtube.com\/watch?v=CqiL5QQnN64&amp;t=510s\">OpenAI<\/a> and <a href=\"https:\/\/aws.amazon.com\/blogs\/opensource\/amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-amazon-ec2\/\">Amazon<\/a> have already established parts of their machine learning platforms on Ray, showcasing its reliability in real-world applications. At first, the Ray ecosystem can feel overwhelming with all its different sub-projects. The good news is that for building such an LLM Batch Service (scope of this blog post), we mostly care about <a href=\"https:\/\/docs.ray.io\/en\/latest\/data\/data.html\">Ray Data<\/a>, which targets the distributed and scalable processing of datasets. This includes loading data from an external storage system and then feeding it into an ML model \u2013 in our case, this will be an LLM, as we will see later.<\/p>\n<p>The mission statement quoted above also mentions scalability. 
This can be achieved with the Kubernetes operator <a href=\"https:\/\/github.com\/ray-project\/kuberay\">KubeRay<\/a>. This integration with Kubernetes allows organizations to dynamically adjust their resources based on workload demands, which helps to balance performance and cost. As LLMs can be GPU-intensive, this feature is invaluable for managing computational resources effectively.<\/p>\n<p>Additionally, Ray is deeply rooted in the Python machine learning ecosystem, making it compatible with popular libraries such as TensorFlow, PyTorch (since October 2025, Ray is <a href=\"https:\/\/pytorch.org\/blog\/pytorch-foundation-welcomes-ray-to-deliver-a-unified-open-source-ai-compute-stack\/\">part of the PyTorch Foundation<\/a>, bringing the projects even closer together), Hugging Face &#8211; and most relevant for this blog post: vLLM. This integration enables developers to build LLM applications tailored to their specific needs while reusing existing tools and frameworks.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"LLM-Batch-Inference-with-vLLM-on-Ray\"><\/span>LLM Batch Inference with vLLM on Ray<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<h3><span class=\"ez-toc-section\" id=\"vLLM\"><\/span>vLLM<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>vLLM is a popular choice for serving LLMs. With a simple command like <code>vllm serve meta-llama\/Meta-Llama-3.1-8B-Instruct<\/code> you can spin up an OpenAI API-compatible server to handle your requests.<\/p>\n<p>While it also offers a batch entry point, where you can reference a .jsonl file containing multiple prompts, it does not expose a REST interface for it.<\/p>\n<p>Also, it does not offer an advanced scalability and hardware allocation mechanism like Ray. 
In fact, vLLM <a href=\"https:\/\/docs.vllm.ai\/en\/v0.10.2\/serving\/offline_inference.html#ray-data-llm-api\">recommends using Ray in such cases<\/a>.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Ray-Problem-statement\"><\/span>Ray &amp; Problem statement<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Our task is to feed a corpus of independent prompts into our LLM and receive one output per prompt. Hence, it boils down to a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Data_parallelism\">data parallelism problem<\/a>.<\/p>\n<p>Meaning: because the prompts are independent of each other, we can in theory shorten the execution time of the batch inference almost linearly by adding more GPUs. In the extreme, this would mean having one dedicated GPU per prompt, which would be anything but cost-efficient. Hence, we rather feed in the input prompts in smaller batches so that every GPU processes a share of them.<\/p>\n<div>\n<dl id=\"attachment_64796\">\n<dt>\n<p><figure style=\"width: 640px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Ray-Blog-Copy-of-Problem-Statement-Parallelity-1024x653.jpg\" alt=\"\" width=\"640\" height=\"408\" \/><figcaption class=\"wp-caption-text\"><em>Fig. 1.<\/em> To utilize the costly GPU hardware that our LLM is running on, we need to submit batches of prompts.<\/figcaption><\/figure><\/dt>\n<\/dl>\n<\/div>\n<p>This is where Ray supports us with its concepts, which also reflect how a Ray Cluster (see <em>Fig. 
2.<\/em>) is built up in general.<\/p>\n<div>\n<dl id=\"attachment_64794\">\n<dt>\n<p><figure id=\"attachment_64981\" aria-describedby=\"caption-attachment-64981\" style=\"width: 1086px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-64981 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Ray-Blog-Ray-Cluster-1.jpg\" alt=\"\" width=\"1086\" height=\"920\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Ray-Blog-Ray-Cluster-1.jpg 1086w, https:\/\/www.inovex.de\/wp-content\/uploads\/Ray-Blog-Ray-Cluster-1-300x254.jpg 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/Ray-Blog-Ray-Cluster-1-1024x867.jpg 1024w, https:\/\/www.inovex.de\/wp-content\/uploads\/Ray-Blog-Ray-Cluster-1-768x651.jpg 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/Ray-Blog-Ray-Cluster-1-400x339.jpg 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/Ray-Blog-Ray-Cluster-1-360x305.jpg 360w\" sizes=\"auto, (max-width: 1086px) 100vw, 1086px\" \/><figcaption id=\"caption-attachment-64981\" class=\"wp-caption-text\"><em>Fig. 2.<\/em> The Ray Cluster architecture with different worker types (CPU\/GPU). <em>Note<\/em>: This is still a very simplified representation of the Ray cluster. For example, those different workers can live either on the same Kubernetes Pod or on different ones.<\/figcaption><\/figure><\/dt>\n<\/dl>\n<\/div>\n<p>As a distributed system, Ray has a head node, which coordinates the job execution and is the entry point for submitting jobs to the Ray cluster. Apart from the head, there are 0-N worker nodes. So in its simplest form, a Ray cluster can consist of only the Ray head, which then also does the actual job computation. Depending on the workload, more workers will be spawned. You can be very specific about the hardware resources these workers should provide. For instance, you can require that your workload runs on hardware with any kind of GPU (fractional GPUs are also possible). 
However, it is also possible to be much more specific by referencing a particular GPU family like <code>NVIDIA_TESLA_V100<\/code>. This is particularly useful, as vLLM and Hugging Face sometimes recommend specific hardware for running a given LLM.<\/p>\n<p>In that way you can define <a href=\"https:\/\/docs.ray.io\/en\/latest\/cluster\/kubernetes\/user-guides\/config.html#introduction\">worker group specifications via KubeRay<\/a> (for a snippet of what that looks like for a GPU worker group, see <em>Fig. 3<\/em>), which act like blueprints\/templates and specify the concrete hardware onto which the Ray worker Kubernetes pod should be provisioned.<\/p>\n<p>For running the batch inference via vLLM on Ray, <code>ray.data<\/code> provides some valuable abstractions, which make such a use case considerably easier to implement (see <em>Box 1<\/em>).<\/p>\n<pre class=\"\">import ray\r\nfrom ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor\r\n\r\nconfig = vLLMEngineProcessorConfig(\r\n    model_source=\"meta-llama\/Meta-Llama-3.1-8B-Instruct\",\r\n    engine_kwargs=dict(\r\n        enable_prefix_caching=True,\r\n        enable_chunked_prefill=True,\r\n        max_num_batched_tokens=4096,\r\n    ),\r\n    concurrency=4,\r\n    accelerator_type=\"V100\",  # string value of ray.util.accelerators.NVIDIA_TESLA_V100\r\n    batch_size=64,\r\n)\r\nprocessor = build_llm_processor(\r\n    config,\r\n    preprocess=lambda row: dict(\r\n        messages=[\r\n            {\"role\": \"system\", \"content\": \"You summarize inovex blog posts\"},\r\n            {\"role\": \"user\", \"content\": f\"Summarize the following blog and extract its key points in a few sentences: {row['blog_content']}\"},\r\n        ],\r\n        sampling_params=dict(\r\n            temperature=0.3,\r\n            max_tokens=20,\r\n            detokenize=False,\r\n        ),\r\n    ),\r\n    postprocess=lambda row: dict(\r\n        resp=row[\"generated_text\"],\r\n    ),\r\n)\r\n\r\nds = 
ray.data.read_json(paths=\"blog_posts.jsonl\", filesystem=fs)  # fs is your PyArrow filesystem\r\n\r\nds = processor(ds)\r\nfor row in ds.take_all():\r\n    print(row)<\/pre>\n<blockquote><p><em><strong>Box 1.<\/strong> Ray Data gives us useful abstractions for implementing the use case of offline batch inference via vLLM.<\/em><\/p>\n<p><em><code><strong>config<\/strong><\/code> (line 4): This config sets up everything that we would usually also pass into vLLM \u2013 from the obvious <code>model_source<\/code> to options like <code>enable_prefix_caching<\/code>. The <code>engine_kwargs<\/code> are passed through to the vLLM engine, so any other engine option can be set here as well.<\/em><\/p>\n<p><em>Additionally, we can specify <code>concurrency=4<\/code> (line 11). In this case, the KubeRay Autoscaler will then \u201corder\u201d four Ray worker nodes \u2013 each with an NVIDIA V100 GPU \u2013 from Kubernetes. Each worker node will then launch one replica of the specified vLLM engine process. Of course, after the vLLM processing is done, the KubeRay Autoscaler will shut down those resources again so that the costly GPU hardware is only up for as long as required.<\/em><\/p>\n<p><em><code><strong>batch_size<\/strong><\/code> (line 13) corresponds to the number of prompts that are fed into a vLLM engine process at once. The goal is to strike a good balance between fully utilizing the GPU and not overloading it. This, of course, heavily depends on the model and GPU combination for your use case and might require some experimentation.<\/em><\/p>\n<p><em>For more information on the config object, see the <a href=\"https:\/\/docs.ray.io\/en\/latest\/data\/api\/doc\/ray.data.llm.vLLMEngineProcessorConfig.html#ray.data.llm.vLLMEngineProcessorConfig\">API reference<\/a>.<\/em><\/p>\n<p><em><code><strong>build_llm_processor<\/strong><\/code> (line 15) allows us to (optionally) do some pre- and post-processing of the prompts\/output. 
It is important that each row handed to the engine has a <code>messages<\/code> field in the OpenAI chat format (here produced by the <code>preprocess<\/code> function), so that the vLLM process can operate on it.<\/em><br \/>\n<em><strong>Note<\/strong>: This is still a very minimalistic example, which does not take into account the more complex <a href=\"https:\/\/platform.openai.com\/docs\/guides\/batch#1-prepare-your-batch-file\">JSONL structure<\/a> of the official OpenAI specification. For conciseness, this is not done within this blog post.<\/em><\/p><\/blockquote>\n<p>&nbsp;<\/p>\n<p>Ray Data offers multiple integrations for loading data into the Ray Cluster as a distributed dataset. Also, there is support for directly <a href=\"https:\/\/docs.ray.io\/en\/latest\/data\/api\/doc\/ray.data.read_json.html\">loading JSONL data<\/a>, the format the <a href=\"https:\/\/platform.openai.com\/docs\/guides\/batch#1-prepare-your-batch-file\">OpenAI Batch API<\/a> uses to specify multiple prompts in one file, with which a batch job can be initiated. It is very flexible, as you can provide it with any PyArrow filesystem (S3, Blob storage, GCS, &#8230;). From there on, Ray Data will take care of feeding the amount of data specified by <code>batch_size<\/code> into the vLLM process(es) (see <em>Box 1<\/em>).<\/p>\n<p>If you still want to do custom distributed pre-processing on your dataset before loading it into vLLM, you can choose from many \u201cgeneral purpose\u201d functions from the <a href=\"https:\/\/docs.ray.io\/en\/latest\/data\/api\/dataset.html\">Dataset API<\/a>.<\/p>\n<p>This is something that I really appreciate about the <code>ray.data<\/code> APIs: the heavy lifting is abstracted away from the programmer. With basic building blocks for mapping\/filtering\/\u2026 your dataset, you have a very concise yet powerful toolbox at hand to solve hard distributed data processing problems. If you know Apache Spark, you might recognize a huge resemblance here! 
Indeed, some data teams have already shifted their data processing away from <a href=\"https:\/\/aws.amazon.com\/blogs\/opensource\/amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-amazon-ec2\/\">Spark towards Ray Data<\/a>.<\/p>\n<p>With those basic building blocks, you could achieve everything that <code>vLLMEngineProcessorConfig<\/code> and <code>build_llm_processor<\/code> (see <em>Box 1<\/em>) do with only a bit more code. So those higher-level abstractions are essentially \u201csyntactic sugar\u201d over the basic building blocks: the Ray team noticed that many people use vLLM within their Ray Data pipelines and made this use case even more approachable.<\/p>\n<p>While this blog post focuses on a very specific part, which is batch LLM inference, you can think about different use cases in which you could further process the data with Ray. For example, you could also process the output data of those LLMs, enhance it with some user data, save it to another database, and then continue the processing. Think ETL: all those results can be fed into other processes further down the pipeline.<\/p>\n<p>Such a workflow can greatly benefit from the worker group specifications that we already mentioned (and which you can see in more detail in <em>Fig. 3<\/em>): in the first step of loading data, you might want to use a CPU-heavy worker group, while the LLM stage can then be tackled by a GPU-heavy worker group.<\/p>\n<p>&nbsp;<\/p>\n<div>\n<dl id=\"attachment_64792\">\n<dt>\n<p><figure style=\"width: 640px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Ray-Blog-ETL-scale-1024x797.jpg\" alt=\"\" width=\"640\" height=\"498\" \/><figcaption class=\"wp-caption-text\"><em>Fig. 
3.<\/em> Ray worker groups with different specifications (CPU\/GPU) and the corresponding KubeRay specification.<\/figcaption><\/figure><\/dt>\n<dd><\/dd>\n<\/dl>\n<\/div>\n<h2><span class=\"ez-toc-section\" id=\"Architecture\"><\/span>Architecture<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>While the last section gave an overview of how Ray &amp; vLLM can be used to implement the core functionality of LLM Batch Inference, this chapter outlines a potential architecture for implementing the OpenAI HTTP Batch specification, so that the service can be used with <a href=\"https:\/\/platform.openai.com\/docs\/guides\/batch?lang=python#2-upload-your-batch-input-file\">official client libraries<\/a>, as if we were interacting with OpenAI APIs.<\/p>\n<p>Up to this point, this is offline inference at its best (or worst, for that matter \ud83d\ude42): we don\u2019t yet expose that capability to the outside world via an HTTP interface.<br \/>\nFor that, we need an HTTP server that implements the endpoints of the <a href=\"https:\/\/developers.openai.com\/api\/reference\/go\/resources\/batches\/methods\/create\">OpenAI Batch API<\/a> &#8211; for example via FastAPI.<br \/>\nThat service handles JSONL uploads from the clients and stores them on an external storage system, from where the Ray job can later pick them up (for the whole architecture see <em>Fig. 4<\/em>).<br \/>\nFor tracking the various file uploads and the corresponding job statuses, a SQL database also comes in handy.<br \/>\nWith this information, the FastAPI app can eventually submit a <a href=\"https:\/\/docs.ray.io\/en\/latest\/cluster\/running-applications\/job-submission\/sdk.html#ray-job-sdk\">Ray job via the Python SDK<\/a> to our Ray Cluster. 
At its core, this Ray job does exactly what we showcased in the last section: processing the prompts via vLLM.<\/p>\n<p>&nbsp;<\/p>\n<div>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<figure id=\"attachment_64979\" aria-describedby=\"caption-attachment-64979\" style=\"width: 1162px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-64979 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Ray-Blog-Architecture-1.jpg\" alt=\"\" width=\"1162\" height=\"860\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Ray-Blog-Architecture-1.jpg 1162w, https:\/\/www.inovex.de\/wp-content\/uploads\/Ray-Blog-Architecture-1-300x222.jpg 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/Ray-Blog-Architecture-1-1024x758.jpg 1024w, https:\/\/www.inovex.de\/wp-content\/uploads\/Ray-Blog-Architecture-1-768x568.jpg 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/Ray-Blog-Architecture-1-400x296.jpg 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/Ray-Blog-Architecture-1-360x266.jpg 360w\" sizes=\"auto, (max-width: 1162px) 100vw, 1162px\" \/><figcaption id=\"caption-attachment-64979\" class=\"wp-caption-text\"><strong><em>Fig. 4.<\/em> Overall architecture: 1.) The user submits a JSONL file via the <code>\/files<\/code> endpoint, which returns a file-id to the user. The FastAPI service processes the data, does some validation and then saves the file on blob storage and records it in the SQL DB, so that it can be referenced for further processing. 2.) The user initiates the batch job by calling the <code>\/batches<\/code> endpoint with the file-id from the previous step. With that information, the Ray process loads the JSONL data from the blob storage and starts processing the prompts on vLLM. 3.) The user polls the job status using the job-id. 
Once the job has finished, Ray stores the results on blob storage and the FastAPI service serves them to the user.<\/strong><\/figcaption><\/figure>\n<\/div>\n<p>&nbsp;<\/p>\n<div>\n<dl id=\"attachment_64798\">\n<dt>\n<p><figure style=\"width: 2560px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/diagram-scaled.png\" alt=\"\" width=\"2560\" height=\"2377\" \/><figcaption class=\"wp-caption-text\">Fig. 5. Close-up sequence diagram<\/figcaption><\/figure><\/dt>\n<\/dl>\n<\/div>\n<p>Besides the GPU-intensive vLLM workload, we can, of course, also execute arbitrary Python code \u2013 in this case, the Ray job also stores its output on our cloud storage and updates the job status in the SQL database.<\/p>\n<p class=\"infobox\">For those additional library dependencies, Ray supports various ways of getting them into the cluster (<a href=\"https:\/\/docs.ray.io\/en\/latest\/ray-core\/handling-dependencies.html\">uv, conda, &#8230;<\/a>).<\/p>\n<p>This Batch API can be especially beneficial in your overall data architecture if you also provide \u201clive LLM APIs\u201d that operate 24\/7, as the Batch API can take over load from use cases that need to process thousands of documents asynchronously. If those documents were instead submitted to the live LLM APIs, they could slow these APIs down and, in consequence, disrupt the user experience in systems like your company&#8217;s Chat UI, where a smooth synchronous message exchange between the user and the LLM is crucial.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Summary-Challenges\"><\/span>Summary &amp; Challenges<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>That\u2019s it! 
We&#8217;ve seen how we can achieve a cost-effective, OpenAI-compatible Batch API service with Ray and vLLM hosted on our own infrastructure!<\/p>\n<p>Ray makes it quite straightforward to solve such problems with its code abstractions &amp; concepts: loading the data from various storage systems and configuring the LLM to run on GPU hardware. However, there were some parts where the learning curve felt quite steep, for example, when using vLLM code directly within a Ray process to further customize the LLM&#8217;s behavior (no example of this is given here, as it would go beyond the scope of this blog post). While there is the <a href=\"https:\/\/docs.ray.io\/en\/latest\/ray-observability\/user-guides\/debug-apps\/ray-debugging.html\">Ray Debugger<\/a>, it is still not easy to drill into a bug on that level. That said, this holds true for most of my debugging experience with distributed systems. Ray actively monitors the needs of the wider ML community\/landscape and quickly integrates new trends into one of its sub-libraries, which is great. As a result, Ray versions are released frequently with crucial updates to keep up with the fast pace of the GenAI landscape \u2013 so be sure to have a strong foundation to cope with updating those dependencies often.<\/p>\n<p>So in summary, if you are searching for an open-source foundation to build your ML platform on, Ray might be what you are looking for.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Batch processing of data can be significantly more cost-effective, as requests are handled together with consistent resource utilization \u2013 this is especially useful for prompt batch processing with LLMs, which mostly requires costly GPU resources. 
In contrast, processing individual prompts on the fly can create an inefficient usage pattern \u2013 periods of high demand [&hellip;]<\/p>\n","protected":false},"author":260,"featured_media":64974,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"inline_featured_image":false,"ep_exclude_from_search":false,"footnotes":""},"tags":[509,511,77,385],"service":[76],"coauthors":[{"id":260,"display_name":"Kolja Maier","user_nicename":"kmaier"}],"class_list":["post-64788","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","tag-ai-2","tag-artificial-intelligence-2","tag-big-data","tag-data-engineering","service-artificial-intelligence"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.8 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Efficient Prompt Processing with Ray &amp; vLLM<\/title>\n<meta name=\"description\" content=\"Leveraging Ray &amp; vLLM for GPU efficient batch processing of prompts\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.inovex.de\/de\/blog\/llm-batch-processing-with-ray-vllm-gpu-efficiency-data-privacy\/\" \/>\n<meta property=\"og:locale\" content=\"de_DE\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Efficient Prompt Processing with Ray &amp; vLLM\" \/>\n<meta property=\"og:description\" content=\"Leveraging Ray &amp; vLLM for GPU efficient batch processing of prompts\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.inovex.de\/de\/blog\/llm-batch-processing-with-ray-vllm-gpu-efficiency-data-privacy\/\" \/>\n<meta property=\"og:site_name\" content=\"inovex GmbH\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/inovexde\" \/>\n<meta property=\"article:published_time\" 
content=\"2025-11-14T10:10:22+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2026-05-13T10:17:35+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.inovex.de\/wp-content\/uploads\/LLM-Batch-processing-with-Ray-and-vLLM.-A-batch-made-in-heaven-.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1500\" \/>\n\t<meta property=\"og:image:height\" content=\"880\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Kolja Maier\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:image\" content=\"https:\/\/www.inovex.de\/wp-content\/uploads\/LLM-Batch-processing-with-Ray-and-vLLM.-A-batch-made-in-heaven--1024x601.png\" \/>\n<meta name=\"twitter:creator\" content=\"@inovexgmbh\" \/>\n<meta name=\"twitter:site\" content=\"@inovexgmbh\" \/>\n<meta name=\"twitter:label1\" content=\"Verfasst von\" \/>\n\t<meta name=\"twitter:data1\" content=\"Kolja Maier\" \/>\n\t<meta name=\"twitter:label2\" content=\"Gesch\u00e4tzte Lesezeit\" \/>\n\t<meta name=\"twitter:data2\" content=\"13\u00a0Minuten\" \/>\n\t<meta name=\"twitter:label3\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data3\" content=\"Kolja Maier\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/llm-batch-processing-with-ray-vllm-gpu-efficiency-data-privacy\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/llm-batch-processing-with-ray-vllm-gpu-efficiency-data-privacy\\\/\"},\"author\":{\"name\":\"Kolja Maier\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/person\\\/cf95dd2e4dd018a16a457538186fcb9e\"},\"headline\":\"A Batch Made In Heaven? 
Efficient Prompt Processing with Ray &#038; vLLM\",\"datePublished\":\"2025-11-14T10:10:22+00:00\",\"dateModified\":\"2026-05-13T10:17:35+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/llm-batch-processing-with-ray-vllm-gpu-efficiency-data-privacy\\\/\"},\"wordCount\":2393,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/llm-batch-processing-with-ray-vllm-gpu-efficiency-data-privacy\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/LLM-Batch-processing-with-Ray-and-vLLM.-A-batch-made-in-heaven-.png\",\"keywords\":[\"Ai\",\"Artificial Intelligence\",\"Big Data\",\"Data Engineering\"],\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/llm-batch-processing-with-ray-vllm-gpu-efficiency-data-privacy\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/llm-batch-processing-with-ray-vllm-gpu-efficiency-data-privacy\\\/\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/llm-batch-processing-with-ray-vllm-gpu-efficiency-data-privacy\\\/\",\"name\":\"Efficient Prompt Processing with Ray & 
vLLM\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/llm-batch-processing-with-ray-vllm-gpu-efficiency-data-privacy\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/llm-batch-processing-with-ray-vllm-gpu-efficiency-data-privacy\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/LLM-Batch-processing-with-Ray-and-vLLM.-A-batch-made-in-heaven-.png\",\"datePublished\":\"2025-11-14T10:10:22+00:00\",\"dateModified\":\"2026-05-13T10:17:35+00:00\",\"description\":\"Leveraging Ray & vLLM for GPU efficient batch processing of prompts\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/llm-batch-processing-with-ray-vllm-gpu-efficiency-data-privacy\\\/#breadcrumb\"},\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/llm-batch-processing-with-ray-vllm-gpu-efficiency-data-privacy\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/llm-batch-processing-with-ray-vllm-gpu-efficiency-data-privacy\\\/#primaryimage\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/LLM-Batch-processing-with-Ray-and-vLLM.-A-batch-made-in-heaven-.png\",\"contentUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/LLM-Batch-processing-with-Ray-and-vLLM.-A-batch-made-in-heaven-.png\",\"width\":1500,\"height\":880},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/llm-batch-processing-with-ray-vllm-gpu-efficiency-data-privacy\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"A Batch Made In Heaven? 
Efficient Prompt Processing with Ray &#038; vLLM\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#website\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/\",\"name\":\"inovex GmbH\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"de\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#organization\",\"name\":\"inovex GmbH\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2021\\\/03\\\/inovex-logo-16-9-1.png\",\"contentUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2021\\\/03\\\/inovex-logo-16-9-1.png\",\"width\":1921,\"height\":1081,\"caption\":\"inovex GmbH\"},\"image\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.facebook.com\\\/inovexde\",\"https:\\\/\\\/x.com\\\/inovexgmbh\",\"https:\\\/\\\/www.instagram.com\\\/inovexlife\\\/\",\"https:\\\/\\\/www.linkedin.com\\\/company\\\/inovex\",\"https:\\\/\\\/www.youtube.com\\\/channel\\\/UC7r66GT14hROB_RQsQBAQUQ\"]},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/person\\\/cf95dd2e4dd018a16a457538186fcb9e\",\"name\":\"Kolja 
Maier\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/cropped-kmaierr-96x96.jpg8c8a43b8dac94f85b7cd2eae2dc34818\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/cropped-kmaierr-96x96.jpg\",\"contentUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/cropped-kmaierr-96x96.jpg\",\"caption\":\"Kolja Maier\"},\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/author\\\/kmaier\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Efficient Prompt Processing with Ray & vLLM","description":"Leveraging Ray & vLLM for GPU efficient batch processing of prompts","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.inovex.de\/de\/blog\/llm-batch-processing-with-ray-vllm-gpu-efficiency-data-privacy\/","og_locale":"de_DE","og_type":"article","og_title":"Efficient Prompt Processing with Ray & vLLM","og_description":"Leveraging Ray & vLLM for GPU efficient batch processing of prompts","og_url":"https:\/\/www.inovex.de\/de\/blog\/llm-batch-processing-with-ray-vllm-gpu-efficiency-data-privacy\/","og_site_name":"inovex GmbH","article_publisher":"https:\/\/www.facebook.com\/inovexde","article_published_time":"2025-11-14T10:10:22+00:00","article_modified_time":"2026-05-13T10:17:35+00:00","og_image":[{"width":1500,"height":880,"url":"https:\/\/www.inovex.de\/wp-content\/uploads\/LLM-Batch-processing-with-Ray-and-vLLM.-A-batch-made-in-heaven-.png","type":"image\/png"}],"author":"Kolja Maier","twitter_card":"summary_large_image","twitter_image":"https:\/\/www.inovex.de\/wp-content\/uploads\/LLM-Batch-processing-with-Ray-and-vLLM.-A-batch-made-in-heaven--1024x601.png","twitter_creator":"@inovexgmbh","twitter_site":"@inovexgmbh","twitter_misc":{"Verfasst von":"Kolja Maier","Gesch\u00e4tzte 
Lesezeit":"13\u00a0Minuten","Written by":"Kolja Maier"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.inovex.de\/de\/blog\/llm-batch-processing-with-ray-vllm-gpu-efficiency-data-privacy\/#article","isPartOf":{"@id":"https:\/\/www.inovex.de\/de\/blog\/llm-batch-processing-with-ray-vllm-gpu-efficiency-data-privacy\/"},"author":{"name":"Kolja Maier","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/person\/cf95dd2e4dd018a16a457538186fcb9e"},"headline":"A Batch Made In Heaven? Efficient Prompt Processing with Ray &#038; vLLM","datePublished":"2025-11-14T10:10:22+00:00","dateModified":"2026-05-13T10:17:35+00:00","mainEntityOfPage":{"@id":"https:\/\/www.inovex.de\/de\/blog\/llm-batch-processing-with-ray-vllm-gpu-efficiency-data-privacy\/"},"wordCount":2393,"commentCount":0,"publisher":{"@id":"https:\/\/www.inovex.de\/de\/#organization"},"image":{"@id":"https:\/\/www.inovex.de\/de\/blog\/llm-batch-processing-with-ray-vllm-gpu-efficiency-data-privacy\/#primaryimage"},"thumbnailUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/LLM-Batch-processing-with-Ray-and-vLLM.-A-batch-made-in-heaven-.png","keywords":["Ai","Artificial Intelligence","Big Data","Data Engineering"],"inLanguage":"de","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.inovex.de\/de\/blog\/llm-batch-processing-with-ray-vllm-gpu-efficiency-data-privacy\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.inovex.de\/de\/blog\/llm-batch-processing-with-ray-vllm-gpu-efficiency-data-privacy\/","url":"https:\/\/www.inovex.de\/de\/blog\/llm-batch-processing-with-ray-vllm-gpu-efficiency-data-privacy\/","name":"Efficient Prompt Processing with Ray & 
vLLM","isPartOf":{"@id":"https:\/\/www.inovex.de\/de\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.inovex.de\/de\/blog\/llm-batch-processing-with-ray-vllm-gpu-efficiency-data-privacy\/#primaryimage"},"image":{"@id":"https:\/\/www.inovex.de\/de\/blog\/llm-batch-processing-with-ray-vllm-gpu-efficiency-data-privacy\/#primaryimage"},"thumbnailUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/LLM-Batch-processing-with-Ray-and-vLLM.-A-batch-made-in-heaven-.png","datePublished":"2025-11-14T10:10:22+00:00","dateModified":"2026-05-13T10:17:35+00:00","description":"Leveraging Ray & vLLM for GPU efficient batch processing of prompts","breadcrumb":{"@id":"https:\/\/www.inovex.de\/de\/blog\/llm-batch-processing-with-ray-vllm-gpu-efficiency-data-privacy\/#breadcrumb"},"inLanguage":"de","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.inovex.de\/de\/blog\/llm-batch-processing-with-ray-vllm-gpu-efficiency-data-privacy\/"]}]},{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/de\/blog\/llm-batch-processing-with-ray-vllm-gpu-efficiency-data-privacy\/#primaryimage","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/LLM-Batch-processing-with-Ray-and-vLLM.-A-batch-made-in-heaven-.png","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/LLM-Batch-processing-with-Ray-and-vLLM.-A-batch-made-in-heaven-.png","width":1500,"height":880},{"@type":"BreadcrumbList","@id":"https:\/\/www.inovex.de\/de\/blog\/llm-batch-processing-with-ray-vllm-gpu-efficiency-data-privacy\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.inovex.de\/de\/"},{"@type":"ListItem","position":2,"name":"A Batch Made In Heaven? 
Efficient Prompt Processing with Ray &#038; vLLM"}]},{"@type":"WebSite","@id":"https:\/\/www.inovex.de\/de\/#website","url":"https:\/\/www.inovex.de\/de\/","name":"inovex GmbH","description":"","publisher":{"@id":"https:\/\/www.inovex.de\/de\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.inovex.de\/de\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"de"},{"@type":"Organization","@id":"https:\/\/www.inovex.de\/de\/#organization","name":"inovex GmbH","url":"https:\/\/www.inovex.de\/de\/","logo":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/logo\/image\/","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/2021\/03\/inovex-logo-16-9-1.png","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/2021\/03\/inovex-logo-16-9-1.png","width":1921,"height":1081,"caption":"inovex GmbH"},"image":{"@id":"https:\/\/www.inovex.de\/de\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/inovexde","https:\/\/x.com\/inovexgmbh","https:\/\/www.instagram.com\/inovexlife\/","https:\/\/www.linkedin.com\/company\/inovex","https:\/\/www.youtube.com\/channel\/UC7r66GT14hROB_RQsQBAQUQ"]},{"@type":"Person","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/person\/cf95dd2e4dd018a16a457538186fcb9e","name":"Kolja Maier","image":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/wp-content\/uploads\/cropped-kmaierr-96x96.jpg8c8a43b8dac94f85b7cd2eae2dc34818","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/cropped-kmaierr-96x96.jpg","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/cropped-kmaierr-96x96.jpg","caption":"Kolja 
Maier"},"url":"https:\/\/www.inovex.de\/de\/blog\/author\/kmaier\/"}]}},"_links":{"self":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/64788","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/users\/260"}],"replies":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/comments?post=64788"}],"version-history":[{"count":12,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/64788\/revisions"}],"predecessor-version":[{"id":67509,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/64788\/revisions\/67509"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/media\/64974"}],"wp:attachment":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/media?parent=64788"}],"wp:term":[{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/tags?post=64788"},{"taxonomy":"service","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/service?post=64788"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/coauthors?post=64788"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}