{"id":45870,"date":"2023-08-09T07:39:46","date_gmt":"2023-08-09T05:39:46","guid":{"rendered":"https:\/\/www.inovex.de\/?p=45870"},"modified":"2025-07-10T10:06:01","modified_gmt":"2025-07-10T08:06:01","slug":"bentoml-for-mlops-from-prototype-to-production","status":"publish","type":"post","link":"https:\/\/www.inovex.de\/de\/blog\/bentoml-for-mlops-from-prototype-to-production\/","title":{"rendered":"BentoML for MLOps: From Prototype to Production"},"content":{"rendered":"<p>In this blog post, we make the case for BentoML: an MLOps tool that is both easy to start with \u2013 even for prototyping \u2013 and production-ready for low-to-medium-scale projects. While there are many MLOps solutions out there, the ease of going from prototype to production is what makes BentoML stand out. ML, and especially <a href=\"https:\/\/www.inovex.de\/en\/our-services\/artificial-intelligence\/generative-ai\/\">Generative AI<\/a>, is moving incredibly fast these days, but BentoML can help you prototype quickly with a solid foundation, very little custom code, and, most importantly, the ability to scale.<!--more--><\/p>\n<p>Putting machine learning models into production can be a challenging task, as the field of <a href=\"https:\/\/www.inovex.de\/en\/our-services\/data-science\/mlops\/\">MLOps<\/a> is still relatively young. Several aspects need to be considered, such as scalability, model versioning, and the complexity of the software stack. MLOps tools like BentoML do a lot of the heavy lifting for you, but it can seem hard to commit to one at an <a href=\"https:\/\/www.inovex.de\/de\/blog\/a-conceptual-view-on-the-machine-learning-life-cycle\/\">early stage of the product<\/a>.<\/p>\n<p>That is why a first iteration often starts with a custom-built REST API based on frameworks such as Flask, Django, or FastAPI. 
While these frameworks are powerful and highly customisable, they tend to require a considerable amount of custom code in later iterations, which can slow down development. Features that at first seem merely \u201cnice to have\u201d, such as monitoring, log management, or the ability to batch similar model requests, soon become necessities. And especially with state-of-the-art models, storage and memory management can be a real challenge.<\/p>\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_83 counter-hierarchy ez-toc-counter ez-toc-custom ez-toc-container-direction\">\n<div class=\"ez-toc-title-container\"><p class=\"ez-toc-title\" style=\"cursor:inherit\"><\/p>\n<\/div><nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/www.inovex.de\/de\/blog\/bentoml-for-mlops-from-prototype-to-production\/#The-Architectural-Journey\" >The Architectural Journey<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/www.inovex.de\/de\/blog\/bentoml-for-mlops-from-prototype-to-production\/#First-Iteration-Flask-only\" >First Iteration: Flask only<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/www.inovex.de\/de\/blog\/bentoml-for-mlops-from-prototype-to-production\/#First-Iteration-Shortcomings\" >First Iteration: Shortcomings<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/www.inovex.de\/de\/blog\/bentoml-for-mlops-from-prototype-to-production\/#Second-Iteration-Flask-Redis-Redis-Queue\" >Second Iteration: Flask + Redis + Redis Queue<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-5\" 
href=\"https:\/\/www.inovex.de\/de\/blog\/bentoml-for-mlops-from-prototype-to-production\/#Second-Iteration-Shortcomings\" >Second Iteration: Shortcomings<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/www.inovex.de\/de\/blog\/bentoml-for-mlops-from-prototype-to-production\/#Third-Iteration-BentoML\" >Third Iteration: BentoML<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-7\" href=\"https:\/\/www.inovex.de\/de\/blog\/bentoml-for-mlops-from-prototype-to-production\/#Load-Models-to-BentoML-Model-Registry\" >Load Models to BentoML Model Registry<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-8\" href=\"https:\/\/www.inovex.de\/de\/blog\/bentoml-for-mlops-from-prototype-to-production\/#Building-the-Bento\" >Building the Bento<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-9\" href=\"https:\/\/www.inovex.de\/de\/blog\/bentoml-for-mlops-from-prototype-to-production\/#Creating-the-main-Script-and-defining-Schemas\" >Creating the main Script and defining Schemas<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-10\" href=\"https:\/\/www.inovex.de\/de\/blog\/bentoml-for-mlops-from-prototype-to-production\/#Containerization\" >Containerization<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-11\" href=\"https:\/\/www.inovex.de\/de\/blog\/bentoml-for-mlops-from-prototype-to-production\/#BentoML-Custom-Runnable\" >BentoML Custom Runnable<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-12\" href=\"https:\/\/www.inovex.de\/de\/blog\/bentoml-for-mlops-from-prototype-to-production\/#But-there-also-were-some-issues%E2%80%A6\" >But there also were some issues&#8230;<\/a><\/li><\/ul><\/li><li 
class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-13\" href=\"https:\/\/www.inovex.de\/de\/blog\/bentoml-for-mlops-from-prototype-to-production\/#Using-BentoML-Conclusion\" >Using BentoML: Conclusion<\/a><\/li><\/ul><\/nav><\/div>\n<h2><span class=\"ez-toc-section\" id=\"The-Architectural-Journey\"><\/span>The Architectural Journey<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The following paragraphs describe the evolution of a model-serving backend in the context of parrot, a student-driven internal research project. Its purpose is to build an AI platform that provides access to state-of-the-art generative AI models via a modern and intuitive user interface. The discussion of advantages, disadvantages, and architectural considerations in this post is therefore the result of our very own technological journey and an example of the challenges a real-world MLOps solution faces in a production setting.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"First-Iteration-Flask-only\"><\/span>First Iteration: Flask only<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Three of the most popular libraries for building a RESTful API in Python are <a href=\"https:\/\/flask.palletsprojects.com\/en\/2.3.x\/\" target=\"_blank\" rel=\"noopener\">Flask<\/a>, <a href=\"https:\/\/fastapi.tiangolo.com\/\" target=\"_blank\" rel=\"noopener\">FastAPI<\/a>, and <a href=\"https:\/\/www.django-rest-framework.org\/\" target=\"_blank\" rel=\"noopener\">Django (REST framework)<\/a>. While all three are excellent choices for most common use cases, serving a machine learning model with any of them can be challenging, as the following sections show.<\/p>\n<p>In the early phases of the product life cycle, we wanted to start with a simple, well-established technology that we trusted and that would leave plenty of room for customisation in future iterations. 
That is why we chose Flask instead of a dedicated MLOps tool: it does not lock you into a specific way of doing things. To quote the <a href=\"https:\/\/flask.palletsprojects.com\/en\/2.3.x\/design\/\" target=\"_blank\" rel=\"noopener\">documentation<\/a>: \u201cFlask can be everything you need and nothing you don\u2019t\u201d.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-46208 aligncenter\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/first_iteration.drawio.svg\" alt=\"\" width=\"546\" height=\"331\" \/><\/p>\n<p>In the beginning, a simple endpoint with some custom code that serves a model to one user at a time is sufficient. This is where Flask makes a lot of sense: we already knew the framework and quickly had something simple to show to stakeholders. But the more features we added to the barebones model-serving application, the more we had to think about functionality that had only seemed \u201cnice to have\u201d at first. And the more users the application is expected to serve, the more urgent these issues become. It is easy to draw a simple sketch with a regular pencil, but you will have a hard time creating a construction drawing with nothing but that pencil.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"First-Iteration-Shortcomings\"><\/span>First Iteration: Shortcomings<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>The more people use the app, the more bugs and crashes surface, and they have to be fixed quickly. Diagnosing a problem requires detailed application logs. Logging is not hard to implement, but doing it right takes time, especially when multiple services and containers are involved. 
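<\/p>\n<p>What \u201cdoing it right\u201d means depends on your log stack, but a common first step is structured logging. This means writing one machine-readable record per line, tagged with the service that emitted it, so that logs from several containers can be collected and filtered in one place. The following minimal sketch shows one way to do this using only Python\u2019s standard <span class=\"lang:python decode:true crayon-inline\">logging<\/span> module. The <span class=\"lang:python decode:true crayon-inline\">JSONFormatter<\/span> class, the field names, and the service name are illustrative assumptions, not part of Flask or of any particular logging product.<\/p>\n<pre class=\"lang:python decode:true\">import json\r\nimport logging\r\n\r\n\r\nclass JSONFormatter(logging.Formatter):\r\n    \"\"\"Render each log record as one JSON object per line (illustrative sketch).\"\"\"\r\n\r\n    def __init__(self, service: str):\r\n        super().__init__()\r\n        # Hypothetical service tag that tells the logs of different containers apart\r\n        self.service = service\r\n\r\n    def format(self, record: logging.LogRecord) -&gt; str:\r\n        payload = {\r\n            \"time\": self.formatTime(record),\r\n            \"level\": record.levelname,\r\n            \"service\": self.service,\r\n            \"logger\": record.name,\r\n            \"message\": record.getMessage(),\r\n        }\r\n        # Keep tracebacks inside the same record instead of spreading them over many lines\r\n        if record.exc_info:\r\n            payload[\"exception\"] = self.formatException(record.exc_info)\r\n        return json.dumps(payload)\r\n\r\n\r\n# Attach the formatter to a logger (names are examples only)\r\nhandler = logging.StreamHandler()\r\nhandler.setFormatter(JSONFormatter(service=\"model-api\"))\r\nlogger = logging.getLogger(\"model-api\")\r\nlogger.setLevel(logging.INFO)\r\nlogger.addHandler(handler)\r\n\r\nlogger.info(\"model %s loaded\", \"example-model\")<\/pre>\n<p>Each call such as <span class=\"lang:python decode:true crayon-inline\">logger.info(...)<\/span> then writes a single JSON line that a log collector can parse as JSON.<\/p>\n<p>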
A sound logging setup is just one example of why deploying an ML model is not as simple as it may seem.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Second-Iteration-Flask-Redis-Redis-Queue\"><\/span>Second Iteration: Flask + Redis + Redis Queue<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Arguably, the biggest concern is the load each user generates. Especially when serving state-of-the-art generative large language models (LLMs), performance and optimisation are never an afterthought but a necessity. To handle the load better and to be able to scale, we refactored the application to decouple the API from the resource-intensive model execution. A Redis instance kept track of all model prompts sent to the API, and separate worker containers picked tasks from this queue and wrote their model\u2019s output back to Redis.<\/p>\n<p><a href=\"https:\/\/www.inovex.de\/wp-content\/uploads\/second_iteration.drawio-1.svg\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-46216 aligncenter\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/second_iteration.drawio-1.svg\" alt=\"\" width=\"881\" height=\"312\" \/><\/a><\/p>\n<p>With this architecture, we gained a lot of stability because errors in the model execution no longer compromised the core API process. It also allowed us to quickly scale the computing resources according to our needs. The downside, however, was the significant amount of time required for refactoring and, most importantly, the added complexity. 
And since model serving is only a means to an end and not the core product, it was hard to justify spending more and more time on what was essentially another product.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Second-Iteration-Shortcomings\"><\/span>Second Iteration: Shortcomings<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Apart from these concerns, another technical issue made us turn our back on our custom-tailored solution. The issue was the way Python\u2019s memory management interacts with \u201ccopy on write\u201d, a topic even further removed from our product.<br \/>\nSo why does it concern us? <a href=\"https:\/\/github.com\/rq\/rq\" target=\"_blank\" rel=\"noopener\">Redis Queue (rq)<\/a> uses forks under the hood to isolate the processes that execute jobs from each other. Whenever a model inference job is fetched from the queue, the worker\u2019s main process is forked, and it is this fork that runs the model inference. The nice thing about forking is that memory is not duplicated up front. Parent and child initially share the same memory pages, and a page is only copied once one of them writes to it. So in theory, a generative AI model weighing several gigabytes only needs to reside in the main process, while its children read it without copying it, as long as they do not modify it. But we only want to run a model, not train it. So we should be good, right?<br \/>\nUnfortunately not. As it happens, in CPython even reading an object means writing to it, because every new reference to it updates the reference count stored in the object itself. Instead of one model shared by several processes, every process ends up consuming multiple gigabytes of its own. With large models, this makes memory usage explode really fast. 
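<\/p>\n<p>The effect is easy to observe without rq or a real model. In the minimal CPython sketch below, a plain list stands in for the model, an assumption made purely for illustration. Binding a second name to the list leaves its contents untouched, yet the interpreter writes an updated reference count into the object itself. In a forked child process, a write like this is enough to make the operating system copy the memory page it lands on.<\/p>\n<pre class=\"lang:python decode:true\">import sys\r\n\r\n# Hypothetical stand-in for a large, already loaded model object\r\nmodel = [0.0] * 1_000_000\r\n\r\nbefore = sys.getrefcount(model)\r\n\r\n# A purely \"read-only\" use: we only bind another name to the object ...\r\nalias = model\r\n\r\n# ... yet CPython has stored an incremented reference count inside the object header\r\nafter = sys.getrefcount(model)\r\nprint(f\"reference count before: {before}, after: {after}\")<\/pre>\n<p>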
For more information on how and why Python behaves the way it does, see the excellent \u201c<a href=\"https:\/\/luis-sena.medium.com\/understanding-and-optimizing-python-multi-process-memory-management-24e1e5e79047\" target=\"_blank\" rel=\"noopener\">Understanding and Optimizing Python multi-process Memory Management<\/a>\u201d post by Luis Sena.<\/p>\n<p>There is a <a href=\"https:\/\/github.com\/rq\/rq\/issues\/1088\" target=\"_blank\" rel=\"noopener\">workaround in rq<\/a>: using SimpleWorker instead of the regular worker class, since SimpleWorker runs jobs in the worker process itself rather than in a fork. This comes with its own drawbacks, though. With only a single process active per worker at any given moment, tasks can only be processed one after another, not simultaneously. It also means that most of the advantages of Redis and rq, such as process isolation and multiprocessing, no longer apply.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Third-Iteration-BentoML\"><\/span>Third Iteration: BentoML<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><span style=\"font-weight: 400\">\u201cKeep it simple, stupid\u201d does not mean that you have to reinvent the wheel. <\/span><span style=\"font-weight: 400\">After looking into BentoML, we realised that it already solved many of the issues we had run into with our custom MLOps application:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">good integration into the existing Python ML ecosystem (scikit-learn, PyTorch, TensorFlow, Transformers, ONNX, and LightGBM, to name a few)<\/span><\/li>\n<li style=\"font-weight: 400\"><a href=\"https:\/\/docs.bentoml.org\/en\/latest\/guides\/batching.html\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400\">(adaptive) batching<\/span><\/a><span style=\"font-weight: 400\"> of similar requests<\/span><\/li>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">a <\/span><a 
href=\"https:\/\/github.com\/bentoml\/BentoML\/discussions\/2242\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400\">queue<\/span><\/a><span style=\"font-weight: 400\"> similar to Celery and separate worker processes with model sharing between processes<\/span><\/li>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">a central model registry with versioning<\/span><\/li>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">and, above all, a lot less boilerplate code<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400\">It also comes with many operational features:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">integrated logging (with support for <\/span><a href=\"https:\/\/opentelemetry.io\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400\">OpenTelemetry<\/span><\/a><span style=\"font-weight: 400\">)<\/span><\/li>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">easy to set up, deploy, and monitor via Prometheus metrics endpoints<\/span><\/li>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">easily scalable and Kubernetes-friendly (with <\/span><a href=\"https:\/\/modelserving.com\/blog\/yatai-10-model-deployment-on-kubernetes-made-easy\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400\">Yatai<\/span><\/a><span style=\"font-weight: 400\">)<\/span><\/li>\n<\/ul>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-46254 aligncenter\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/third_iteration.drawio-1.svg\" alt=\"\" width=\"765\" height=\"381\" \/><\/p>\n<p><span style=\"font-weight: 400\">For another insightful post about the advantages of using a specialised MLOps tool like BentoML instead of Flask or FastAPI, check out \u201c<\/span><a href=\"https:\/\/modelserving.com\/blog\/breaking-up-with-flask-amp-fastapi-why-ml-model-serving-requires-a-specialized-framework\" target=\"_blank\" 
rel=\"noopener\"><span style=\"font-weight: 400\">Breaking Up With Flask &amp; FastAPI: Why ML Model Serving Requires A Specialized Framework<\/span><\/a><span style=\"font-weight: 400\">\u201d by BentoML\u2019s Head of Product, Tim Liu.<\/span><\/p>\n<p><span style=\"font-weight: 400\">The most accessible way to convey the advantages of BentoML is to show the actual code of the application. So sit back, grab a pencil for sketches, and enjoy the ride!<\/span><\/p>\n<h3><span class=\"ez-toc-section\" id=\"Load-Models-to-BentoML-Model-Registry\"><\/span>Load Models to BentoML Model Registry<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Before you can build your bento with <span class=\"lang:zsh decode:true crayon-inline\">bentoml build<\/span>, you need to save your models to your local BentoML model registry. You can do this by running the <a href=\"https:\/\/github.com\/inovex\/blog-post-bentoml\/blob\/main\/scripts\/load_models_into_bento.py\" target=\"_blank\" rel=\"noopener\">load_models_into_bento.py<\/a> script. It creates a Transformers pipeline from the loaded model and tokenizer. 
This pipeline is then saved to the local model registry.<\/p>\n<pre class=\"lang:python decode:true\">import bentoml\r\nfrom transformers import AutoModelForCausalLM, AutoTokenizer, pipeline\r\n\r\n# model_identifier, model_name, PIPELINE_TASK and PIPELINE_PREFIX are defined\r\n# elsewhere in load_models_into_bento.py (not shown here)\r\nmodel = AutoModelForCausalLM.from_pretrained(\r\n    model_identifier, trust_remote_code=True, revision=\"main\"\r\n)\r\ntokenizer = AutoTokenizer.from_pretrained(model_identifier)\r\n\r\n# Wrap model and tokenizer in a pipeline and save it to the local model registry;\r\n# \"\/\" is not allowed in BentoML model names, hence the replacement\r\ngenerator = pipeline(PIPELINE_TASK, model=model, tokenizer=tokenizer)\r\nbentoml.transformers.save_model(\r\n    f'{PIPELINE_PREFIX}{model_name.replace(\"\/\", \"-\")}', generator\r\n)<\/pre>\n<h3><span class=\"ez-toc-section\" id=\"Building-the-Bento\"><\/span>Building the Bento<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>To build a bento, you need a <a href=\"https:\/\/github.com\/inovex\/blog-post-bentoml\/blob\/main\/bentofile.yaml\" target=\"_blank\" rel=\"noopener\">bentofile.yaml<\/a> and a <a href=\"https:\/\/github.com\/inovex\/blog-post-bentoml\/blob\/main\/service.py\" target=\"_blank\" rel=\"noopener\">service.py<\/a>. The <a href=\"https:\/\/github.com\/inovex\/blog-post-bentoml\/blob\/main\/bentofile.yaml\" target=\"_blank\" rel=\"noopener\">bentofile.yaml<\/a> serves as the build configuration for BentoML. Here you specify the service, which files to include, and which dependencies to install when building the bento. More on this can be found in the <a href=\"https:\/\/docs.bentoml.org\/en\/latest\/concepts\/bento.html#bento-build-options\">BentoML documentation on build options<\/a>. We set the dependency section to <span class=\"lang:default decode:true crayon-inline\">pip_args: \"-e \/home\/bentoml\/bento\/src\/.\"<\/span>. 
With this, pip installs our package from src\/ in editable mode, together with the dependencies declared in its metadata. Note that pip does not read <a href=\"https:\/\/github.com\/inovex\/blog-bentoml\/blob\/main\/poetry.lock\" target=\"_blank\" rel=\"noopener\">poetry.lock<\/a> itself.<\/p>\n<pre class=\"lang:yaml decode:true \" title=\"bentofile.yaml\">service: \"service:service\"\r\ninclude:\r\n- \"src\/\"\r\n- \"poetry.lock\"\r\n- \"pyproject.toml\"\r\n- \"README.md\"\r\n- \"service.py\"\r\npython:\r\n  pip_args: \"-e \/home\/bentoml\/bento\/src\/.\"\r\n<\/pre>\n<h3><span class=\"ez-toc-section\" id=\"Creating-the-main-Script-and-defining-Schemas\"><\/span>Creating the main Script and defining Schemas<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>The <a href=\"https:\/\/github.com\/inovex\/blog-bentoml\/blob\/main\/service.py\" target=\"_blank\" rel=\"noopener\">service.py<\/a> acts as the main script. Here, we first specify the input and output schemas. After declaring <a href=\"https:\/\/github.com\/pydantic\/pydantic\" target=\"_blank\" rel=\"noopener\">Pydantic<\/a> classes for the inputs and outputs, with lower and upper bounds on the input values, BentoML automatically creates the corresponding <a href=\"https:\/\/www.openapis.org\/\" target=\"_blank\" rel=\"noopener\">OpenAPI<\/a> specification (including schemas!), very similar to how FastAPI does it. 
A specification gives consumers of the API a complete and detailed picture of how the API behaves, which inputs are allowed, and which outputs to expect.<\/p>\n<pre class=\"lang:python decode:true \">from typing import List\r\n\r\nfrom bentoml.io import JSON\r\nfrom pydantic import BaseModel, conint, constr\r\n\r\n# AllowedModels is an Enum of the servable model names, defined elsewhere in service.py (not shown)\r\n\r\n\r\n# input schema: the bounds are validated on every request\r\nclass InputSchema(BaseModel):\r\n    max_length: conint(ge=1, le=500)\r\n    n_sequences: conint(ge=1, le=5)\r\n    prompt: constr(min_length=1, max_length=250)\r\n    selected_model: AllowedModels\r\n\r\n\r\n# output schema\r\nclass GeneratedText(BaseModel):\r\n    length: int\r\n    max_length: int\r\n    prompt: str\r\n    model_output: str\r\n    model: AllowedModels\r\n\r\n\r\nclass OutputSchema(BaseModel):\r\n    doc_list: List[GeneratedText]\r\n\r\n\r\n# IO descriptors that tie the schemas to the API endpoint\r\noutput_spec = JSON(pydantic_model=OutputSchema)\r\ninput_spec = JSON(pydantic_model=InputSchema)<\/pre>\n<p>BentoML then translates these schemas into the OpenAPI specification. The InputSchema, for example, is represented like this:<\/p>\n<pre class=\"lang:yaml decode:true\">...\r\n    InputSchema:\r\n      properties:\r\n        max_length:\r\n          maximum: 500\r\n          minimum: 1\r\n          title: Max Length\r\n          type: integer\r\n        n_sequences:\r\n          maximum: 5\r\n          minimum: 1\r\n          title: N Sequences\r\n          type: integer\r\n        prompt:\r\n          maxLength: 250\r\n          minLength: 1\r\n          title: Prompt\r\n          type: string\r\n        selected_model:\r\n          $ref: '#\/components\/schemas\/AllowedModels'\r\n      required:\r\n      - max_length\r\n      - n_sequences\r\n      - prompt\r\n      - selected_model\r\n      title: InputSchema\r\n      type: object\r\n...<\/pre>\n<p>Once we have a BentoML runner for each model we want to serve, we can create the service object. 
The <a href=\"https:\/\/docs.bentoml.org\/en\/latest\/concepts\/runner.html\" target=\"_blank\" rel=\"noopener\">runner<\/a> holds a model and its execution context.<\/p>\n<pre class=\"lang:python decode:true\">import bentoml\r\n\r\n# One runner per servable model; get_model_runner is a helper from the\r\n# accompanying repository (not shown here)\r\nMODEL_TO_RUNNER = {\r\n    model.value: get_model_runner(model.value, PIPELINE_PREFIX) for model in AllowedModels\r\n}\r\n\r\n# service definition: every runner the service uses must be registered here\r\nservice = bentoml.Service(\r\n    SERVICE_NAME,\r\n    runners=list(MODEL_TO_RUNNER.values()),\r\n)<\/pre>\n<p>Using the service object and the IO descriptors defined above, we can create the endpoint that clients will call.<\/p>\n<pre class=\"lang:python decode:true\">@service.api(\r\n    route=\"\/code-completion\",\r\n    input=input_spec,\r\n    output=output_spec,\r\n)\r\ndef completion(input_data: InputSchema) -&gt; OutputSchema:\r\n    # input_data arrives as a validated InputSchema instance\r\n    ...<\/pre>\n<p>The function body contains the business logic that runs whenever the endpoint is called. It starts with a bit of input processing, then calls the <span class=\"lang:default decode:true crayon-inline\">run()<\/span> method of the matching runner, and ends with some output processing.<\/p>\n<div>\n<pre class=\"lang:python decode:true\"># prompt, max_length, n_sequences and selected_model are extracted from\r\n# input_data during input processing (not shown here)\r\nrunner = MODEL_TO_RUNNER.get(selected_model)\r\ngenerated_text = runner.run(\r\n    prompt, max_length=max_length, num_return_sequences=n_sequences\r\n)<\/pre>\n<\/div>\n<p>Finally, run <span class=\"lang:zsh decode:true crayon-inline\">bentoml build<\/span>. This creates a bento in your local bento store, which is located at <span class=\"lang:zsh decode:true crayon-inline \">~\/bentoml\/bentos\/<\/span> by default.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Containerization\"><\/span>Containerization<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>When building a bento, BentoML auto-generates a Dockerfile that adheres to many of the <a href=\"https:\/\/pythonspeed.com\/docker\/\" target=\"_blank\" rel=\"noopener\">best practices for packaging Python software<\/a>. 
This feature freed us from having to optimise the images ourselves. See below for an example Dockerfile that uses a sensible base image, adds a non-privileged run-time user, and sets relevant Python-specific environment variables.<\/p>\n<pre class=\"lang:default decode:true\" title=\"Dockerfile\">FROM python:3.9-slim as base-container\r\n\r\nENV LANG=C.UTF-8\r\nENV LC_ALL=C.UTF-8\r\nENV PYTHONIOENCODING=UTF-8\r\nENV PYTHONUNBUFFERED=1\r\n\r\nUSER root\r\nENV DEBIAN_FRONTEND=noninteractive\r\nRUN rm -f \/etc\/apt\/apt.conf.d\/docker-clean; echo 'Binary::apt::APT::Keep-Downloaded-Packages \"true\";' &gt; \/etc\/apt\/apt.conf.d\/keep-cache\r\nRUN set -eux &amp;&amp; \\\r\n    apt-get update -y &amp;&amp; \\\r\n    apt-get install -q -y --no-install-recommends --allow-remove-essential \\\r\n        ca-certificates gnupg2 bash build-essential \r\n# Block SETUP_BENTO_USER\r\nARG BENTO_USER=bentoml\r\nARG BENTO_USER_UID=1034\r\nARG BENTO_USER_GID=1034\r\nRUN groupadd -g $BENTO_USER_GID -o $BENTO_USER &amp;&amp; useradd -m -u $BENTO_USER_UID -g $BENTO_USER_GID -o -r $BENTO_USER\r\nARG BENTO_PATH=\/home\/bentoml\/bento\r\nENV BENTO_PATH=$BENTO_PATH\r\nENV BENTOML_HOME=\/home\/bentoml\/\r\n\r\nRUN mkdir $BENTO_PATH &amp;&amp; chown bentoml:bentoml $BENTO_PATH -R\r\nWORKDIR $BENTO_PATH\r\n\r\n# Block SETUP_BENTO_COMPONENTS\r\nCOPY --chown=bentoml:bentoml .\/env\/python .\/env\/python\/\r\n# install python packages with install.sh\r\nRUN bash -euxo pipefail \/home\/bentoml\/bento\/env\/python\/install.sh\r\nCOPY --chown=bentoml:bentoml . 
.\/\r\n\r\n# Block SETUP_BENTO_ENTRYPOINT\r\nRUN rm -rf \/var\/lib\/{apt,cache,log}\r\n# Default port for BentoServer\r\nEXPOSE 3000\r\n\r\n# Expose Prometheus port\r\nEXPOSE 3001\r\nRUN chmod +x \/home\/bentoml\/bento\/env\/docker\/entrypoint.sh\r\n\r\nUSER bentoml\r\nENTRYPOINT [ \"\/home\/bentoml\/bento\/env\/docker\/entrypoint.sh\" ]<\/pre>\n<p>Even better: you can <a href=\"https:\/\/docs.bentoml.org\/en\/latest\/guides\/containerization.html\" target=\"_blank\" rel=\"noopener\">customise this process<\/a> further by modifying the Jinja2 templates the Dockerfile is generated from!<\/p>\n<h3><span class=\"ez-toc-section\" id=\"BentoML-Custom-Runnable\"><\/span>BentoML Custom Runnable<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>While the \ud83e\udd17 Transformers integration is good, a few features are still missing.<br \/>\nFor example, you might want to use ONNX models and Hugging Face\u2019s <a href=\"https:\/\/huggingface.co\/docs\/optimum\/onnxruntime\/usage_guides\/pipelines\" target=\"_blank\" rel=\"noopener\">ONNX Runtime accelerator<\/a> from <a href=\"https:\/\/huggingface.co\/docs\/optimum\/index\" target=\"_blank\" rel=\"noopener\">Optimum<\/a> to boost inference performance. At the time of writing, BentoML did not support this out of the box. However, since BentoML\u2019s runner concept is customisable, you can easily create a custom runner for this purpose, or whenever the runners shipped with BentoML do not fit your use case.<\/p>\n<p>Thankfully, creating custom runners is very straightforward. 
The following code snippet uses Optimum\u2019s ONNX Runtime integration within a custom runnable:<\/p>\n<pre class=\"lang:python decode:true\" title=\"Onnx custom runner\">from typing import List\r\n\r\nimport bentoml\r\nfrom optimum.onnxruntime import ORTModelForCausalLM\r\nfrom transformers import AutoTokenizer, pipeline\r\n\r\n\r\nclass ONNXRunnable(bentoml.Runnable):\r\n    # Scheduling hints for BentoML (assumption: CPU-only inference)\r\n    SUPPORTED_RESOURCES = (\"cpu\",)\r\n    SUPPORTS_CPU_MULTI_THREADING = True\r\n\r\n    def __init__(self, model_name: str):\r\n        # Look up the saved model in the local BentoML model registry\r\n        model = bentoml.transformers.get(model_name)\r\n        model_path = model.path\r\n\r\n        # Load the ONNX export with Optimum (expects decoder_model.onnx in model_path)\r\n        self.model = ORTModelForCausalLM.from_pretrained(\r\n            model_path, file_name=\"decoder_model.onnx\"\r\n        )\r\n        self.tokenizer = AutoTokenizer.from_pretrained(model_path)\r\n        self.generator = pipeline(\r\n            \"text-generation\", model=self.model, tokenizer=self.tokenizer\r\n        )\r\n\r\n    @bentoml.Runnable.method(batchable=False)\r\n    def generate(self, prompt: str, **kwargs) -&gt; List:\r\n        output = self.generator(text_inputs=prompt, **kwargs)\r\n        return output<\/pre>\n<p>Additionally, the <a href=\"https:\/\/github.com\/bentoml\/BentoML\/releases\/tag\/v1.0.22\" target=\"_blank\" rel=\"noopener\">BentoML 1.0.22 release<\/a> ships another interesting feature: <a href=\"https:\/\/github.com\/bentoml\/OpenLLM\" target=\"_blank\" rel=\"noopener\">OpenLLM<\/a>. It makes it possible to get some of the most powerful language models up and running in no time. All that is needed is to set up the environment and run <span class=\"lang:zsh decode:true crayon-inline \">openllm start\/build [falcon | flan_t5 | dolly_v2 | chatglm | stablelm | starcoder]<\/span>. Deploying large, complex models has never been easier!<\/p>\n<h3><span class=\"ez-toc-section\" id=\"But-there-also-were-some-issues%E2%80%A6\"><\/span>But there also were some issues&#8230;<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p><span style=\"font-weight: 400\">While using BentoML, we found that it is still a bit buggy in places, e.g. 
we had issues with the <a href=\"https:\/\/docs.bentoml.org\/en\/latest\/reference\/cli.html#bentoml-models-export\" target=\"_blank\" rel=\"noopener\">export<\/a> functionality in CI\/CD pipelines and with <span class=\"lang:zsh decode:true crayon-inline\">bentoml containerize<\/span><\/span><span style=\"font-weight: 400\"> on WSL or within non-privileged deployment pipelines. The latter is not a big problem, though, since tools such as <\/span><a href=\"https:\/\/github.com\/GoogleContainerTools\/kaniko\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400\">kaniko<\/span><\/a><span style=\"font-weight: 400\"> can build the generated Dockerfiles. And naturally, the advantage of a tool doing so much for you comes with the disadvantage that not every part of the toolchain is easy to customise.<\/span><\/p>\n<h2><span class=\"ez-toc-section\" id=\"Using-BentoML-Conclusion\"><\/span>Using BentoML: Conclusion<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><span style=\"font-weight: 400\">In this blog post, we have shown how you can use BentoML to quickly serve your ML models. In the end, \u201cBentoML can be (almost) everything you need it to be\u201d. For small to medium-scale generative AI and other ML projects, we cannot recommend BentoML enough. It simplifies so much, allowing you to focus on your business logic. For bigger projects, however, where you have the capacity and knowledge in your team, we recommend more specialised tools like <a href=\"https:\/\/github.com\/huggingface\/text-generation-inference\" target=\"_blank\" rel=\"noopener\">Hugging Face Text Generation Inference<\/a>, <a href=\"https:\/\/github.com\/allegroai\/clearml-serving\" target=\"_blank\" rel=\"noopener\">ClearML Serving<\/a>, or <a href=\"https:\/\/developer.nvidia.com\/nvidia-triton-inference-server\" target=\"_blank\" rel=\"noopener\">NVIDIA Triton<\/a> \u2013 which is also available as an <a href=\"https:\/\/docs.bentoml.org\/en\/latest\/integrations\/triton.html\" target=\"_blank\" rel=\"noopener\">alternative model serving backend<\/a> in BentoML. You can also build a custom serving layer on top of Docker and Kubernetes to fully address your specific requirements<\/span><span style=\"font-weight: 400\">.<\/span><\/p>\n<p><span style=\"font-weight: 400\">If you want to try out BentoML yourself, check out the README in our <a href=\"https:\/\/github.com\/inovex\/blog-bentoml\" target=\"_blank\" rel=\"noopener\">repository<\/a> that accompanies this blog post.<\/span><\/p>\n<p>Last but not least, we want to thank the inovex working students from the parrot team, especially Lennart Krauch and Thomas Jonas, for supporting the implementation of the queue-based architecture and BentoML at parrot.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this blog post, we make the case for BentoML: An MLOps tool that is both easy to start with \u2013 even for prototyping \u2013 and production-ready for low-to-medium-scale projects. While there are many MLOps solutions out there, the ease of going from prototyping to production is what makes BentoML stand out. 
ML and especially [&hellip;]<\/p>\n","protected":false},"author":333,"featured_media":47356,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"inline_featured_image":false,"ep_exclude_from_search":false,"footnotes":""},"tags":[511,206,225,140,1064],"service":[76,431,432,75],"coauthors":[{"id":333,"display_name":"Nico Gro\u00dfkreuz","user_nicename":"ngrosskreuz"},{"id":337,"display_name":"Malte B\u00fcttner","user_nicename":"mbuettner"}],"class_list":["post-45870","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","tag-artificial-intelligence-2","tag-data-science","tag-data-science-in-production","tag-machine-learning","tag-mlops-2","service-artificial-intelligence","service-data-science","service-devops","service-nlp"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.6 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>BentoML for MLOps: From Prototype to Production - inovex GmbH<\/title>\n<meta name=\"description\" content=\"BentoML is an MLOps tool that is both easy to start with \u2013 even for prototyping \u2013 and production-ready for low-to-medium-scale projects.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.inovex.de\/de\/blog\/bentoml-for-mlops-from-prototype-to-production\/\" \/>\n<meta property=\"og:locale\" content=\"de_DE\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"BentoML for MLOps: From Prototype to Production - inovex GmbH\" \/>\n<meta property=\"og:description\" content=\"BentoML is an MLOps tool that is both easy to start with \u2013 even for prototyping \u2013 and production-ready for low-to-medium-scale projects.\" \/>\n<meta property=\"og:url\" 
content=\"https:\/\/www.inovex.de\/de\/blog\/bentoml-for-mlops-from-prototype-to-production\/\" \/>\n<meta property=\"og:site_name\" content=\"inovex GmbH\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/inovexde\" \/>\n<meta property=\"article:published_time\" content=\"2023-08-09T05:39:46+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-07-10T08:06:01+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.inovex.de\/wp-content\/uploads\/mlops-mit-BentoML.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1920\" \/>\n\t<meta property=\"og:image:height\" content=\"1080\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Nico Gro\u00dfkreuz, Malte B\u00fcttner\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:image\" content=\"https:\/\/www.inovex.de\/wp-content\/uploads\/mlops-mit-BentoML-1024x576.png\" \/>\n<meta name=\"twitter:creator\" content=\"@inovexgmbh\" \/>\n<meta name=\"twitter:site\" content=\"@inovexgmbh\" \/>\n<meta name=\"twitter:label1\" content=\"Verfasst von\" \/>\n\t<meta name=\"twitter:data1\" content=\"Nico Gro\u00dfkreuz\" \/>\n\t<meta name=\"twitter:label2\" content=\"Gesch\u00e4tzte Lesezeit\" \/>\n\t<meta name=\"twitter:data2\" content=\"14\u00a0Minuten\" \/>\n\t<meta name=\"twitter:label3\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data3\" content=\"Nico Gro\u00dfkreuz, Malte B\u00fcttner\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/bentoml-for-mlops-from-prototype-to-production\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/bentoml-for-mlops-from-prototype-to-production\\\/\"},\"author\":{\"name\":\"Nico 
Gro\u00dfkreuz\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/person\\\/c7e5b32e161141081b2a64ea762a7442\"},\"headline\":\"BentoML for MLOps: From Prototype to Production\",\"datePublished\":\"2023-08-09T05:39:46+00:00\",\"dateModified\":\"2025-07-10T08:06:01+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/bentoml-for-mlops-from-prototype-to-production\\\/\"},\"wordCount\":2210,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/bentoml-for-mlops-from-prototype-to-production\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/mlops-mit-BentoML.png\",\"keywords\":[\"Artificial Intelligence\",\"Data Science\",\"Data Science in Production\",\"Machine Learning\",\"MLOps\"],\"articleSection\":[\"Analytics\",\"English Content\",\"General\"],\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/bentoml-for-mlops-from-prototype-to-production\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/bentoml-for-mlops-from-prototype-to-production\\\/\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/bentoml-for-mlops-from-prototype-to-production\\\/\",\"name\":\"BentoML for MLOps: From Prototype to Production - inovex 
GmbH\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/bentoml-for-mlops-from-prototype-to-production\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/bentoml-for-mlops-from-prototype-to-production\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/mlops-mit-BentoML.png\",\"datePublished\":\"2023-08-09T05:39:46+00:00\",\"dateModified\":\"2025-07-10T08:06:01+00:00\",\"description\":\"BentoML is an MLOps tool that is both easy to start with \u2013 even for prototyping \u2013 and production-ready for low-to-medium-scale projects.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/bentoml-for-mlops-from-prototype-to-production\\\/#breadcrumb\"},\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/bentoml-for-mlops-from-prototype-to-production\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/bentoml-for-mlops-from-prototype-to-production\\\/#primaryimage\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/mlops-mit-BentoML.png\",\"contentUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/mlops-mit-BentoML.png\",\"width\":1920,\"height\":1080,\"caption\":\"Typografisch MLOps, wobei das O durch das Logo von BentoML ersetzt ist.\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/bentoml-for-mlops-from-prototype-to-production\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"BentoML for MLOps: From Prototype to 
Production\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#website\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/\",\"name\":\"inovex GmbH\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"de\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#organization\",\"name\":\"inovex GmbH\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2021\\\/03\\\/inovex-logo-16-9-1.png\",\"contentUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2021\\\/03\\\/inovex-logo-16-9-1.png\",\"width\":1921,\"height\":1081,\"caption\":\"inovex GmbH\"},\"image\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.facebook.com\\\/inovexde\",\"https:\\\/\\\/x.com\\\/inovexgmbh\",\"https:\\\/\\\/www.instagram.com\\\/inovexlife\\\/\",\"https:\\\/\\\/www.linkedin.com\\\/company\\\/inovex\",\"https:\\\/\\\/www.youtube.com\\\/channel\\\/UC7r66GT14hROB_RQsQBAQUQ\"]},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/person\\\/c7e5b32e161141081b2a64ea762a7442\",\"name\":\"Nico 
Gro\u00dfkreuz\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/nico_grosskreuz-96x96.jpg2188df45d60aab51d2ee4c5c475af4b0\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/nico_grosskreuz-96x96.jpg\",\"contentUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/nico_grosskreuz-96x96.jpg\",\"caption\":\"Nico Gro\u00dfkreuz\"},\"sameAs\":[\"https:\\\/\\\/www.linkedin.com\\\/in\\\/nico-gro\u00dfkreuz-3a7a26153\\\/\"],\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/author\\\/ngrosskreuz\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"BentoML for MLOps: From Prototype to Production - inovex GmbH","description":"BentoML is an MLOps tool that is both easy to start with \u2013 even for prototyping \u2013 and production-ready for low-to-medium-scale projects.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.inovex.de\/de\/blog\/bentoml-for-mlops-from-prototype-to-production\/","og_locale":"de_DE","og_type":"article","og_title":"BentoML for MLOps: From Prototype to Production - inovex GmbH","og_description":"BentoML is an MLOps tool that is both easy to start with \u2013 even for prototyping \u2013 and production-ready for low-to-medium-scale projects.","og_url":"https:\/\/www.inovex.de\/de\/blog\/bentoml-for-mlops-from-prototype-to-production\/","og_site_name":"inovex GmbH","article_publisher":"https:\/\/www.facebook.com\/inovexde","article_published_time":"2023-08-09T05:39:46+00:00","article_modified_time":"2025-07-10T08:06:01+00:00","og_image":[{"width":1920,"height":1080,"url":"https:\/\/www.inovex.de\/wp-content\/uploads\/mlops-mit-BentoML.png","type":"image\/png"}],"author":"Nico Gro\u00dfkreuz, Malte 
B\u00fcttner","twitter_card":"summary_large_image","twitter_image":"https:\/\/www.inovex.de\/wp-content\/uploads\/mlops-mit-BentoML-1024x576.png","twitter_creator":"@inovexgmbh","twitter_site":"@inovexgmbh","twitter_misc":{"Verfasst von":"Nico Gro\u00dfkreuz","Gesch\u00e4tzte Lesezeit":"14\u00a0Minuten","Written by":"Nico Gro\u00dfkreuz, Malte B\u00fcttner"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.inovex.de\/de\/blog\/bentoml-for-mlops-from-prototype-to-production\/#article","isPartOf":{"@id":"https:\/\/www.inovex.de\/de\/blog\/bentoml-for-mlops-from-prototype-to-production\/"},"author":{"name":"Nico Gro\u00dfkreuz","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/person\/c7e5b32e161141081b2a64ea762a7442"},"headline":"BentoML for MLOps: From Prototype to Production","datePublished":"2023-08-09T05:39:46+00:00","dateModified":"2025-07-10T08:06:01+00:00","mainEntityOfPage":{"@id":"https:\/\/www.inovex.de\/de\/blog\/bentoml-for-mlops-from-prototype-to-production\/"},"wordCount":2210,"commentCount":0,"publisher":{"@id":"https:\/\/www.inovex.de\/de\/#organization"},"image":{"@id":"https:\/\/www.inovex.de\/de\/blog\/bentoml-for-mlops-from-prototype-to-production\/#primaryimage"},"thumbnailUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/mlops-mit-BentoML.png","keywords":["Artificial Intelligence","Data Science","Data Science in Production","Machine Learning","MLOps"],"articleSection":["Analytics","English Content","General"],"inLanguage":"de","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.inovex.de\/de\/blog\/bentoml-for-mlops-from-prototype-to-production\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.inovex.de\/de\/blog\/bentoml-for-mlops-from-prototype-to-production\/","url":"https:\/\/www.inovex.de\/de\/blog\/bentoml-for-mlops-from-prototype-to-production\/","name":"BentoML for MLOps: From Prototype to Production - inovex 
GmbH","isPartOf":{"@id":"https:\/\/www.inovex.de\/de\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.inovex.de\/de\/blog\/bentoml-for-mlops-from-prototype-to-production\/#primaryimage"},"image":{"@id":"https:\/\/www.inovex.de\/de\/blog\/bentoml-for-mlops-from-prototype-to-production\/#primaryimage"},"thumbnailUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/mlops-mit-BentoML.png","datePublished":"2023-08-09T05:39:46+00:00","dateModified":"2025-07-10T08:06:01+00:00","description":"BentoML is an MLOps tool that is both easy to start with \u2013 even for prototyping \u2013 and production-ready for low-to-medium-scale projects.","breadcrumb":{"@id":"https:\/\/www.inovex.de\/de\/blog\/bentoml-for-mlops-from-prototype-to-production\/#breadcrumb"},"inLanguage":"de","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.inovex.de\/de\/blog\/bentoml-for-mlops-from-prototype-to-production\/"]}]},{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/de\/blog\/bentoml-for-mlops-from-prototype-to-production\/#primaryimage","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/mlops-mit-BentoML.png","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/mlops-mit-BentoML.png","width":1920,"height":1080,"caption":"Typografisch MLOps, wobei das O durch das Logo von BentoML ersetzt ist."},{"@type":"BreadcrumbList","@id":"https:\/\/www.inovex.de\/de\/blog\/bentoml-for-mlops-from-prototype-to-production\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.inovex.de\/de\/"},{"@type":"ListItem","position":2,"name":"BentoML for MLOps: From Prototype to Production"}]},{"@type":"WebSite","@id":"https:\/\/www.inovex.de\/de\/#website","url":"https:\/\/www.inovex.de\/de\/","name":"inovex 
GmbH","description":"","publisher":{"@id":"https:\/\/www.inovex.de\/de\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.inovex.de\/de\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"de"},{"@type":"Organization","@id":"https:\/\/www.inovex.de\/de\/#organization","name":"inovex GmbH","url":"https:\/\/www.inovex.de\/de\/","logo":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/logo\/image\/","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/2021\/03\/inovex-logo-16-9-1.png","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/2021\/03\/inovex-logo-16-9-1.png","width":1921,"height":1081,"caption":"inovex GmbH"},"image":{"@id":"https:\/\/www.inovex.de\/de\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/inovexde","https:\/\/x.com\/inovexgmbh","https:\/\/www.instagram.com\/inovexlife\/","https:\/\/www.linkedin.com\/company\/inovex","https:\/\/www.youtube.com\/channel\/UC7r66GT14hROB_RQsQBAQUQ"]},{"@type":"Person","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/person\/c7e5b32e161141081b2a64ea762a7442","name":"Nico Gro\u00dfkreuz","image":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/wp-content\/uploads\/nico_grosskreuz-96x96.jpg2188df45d60aab51d2ee4c5c475af4b0","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/nico_grosskreuz-96x96.jpg","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/nico_grosskreuz-96x96.jpg","caption":"Nico 
Gro\u00dfkreuz"},"sameAs":["https:\/\/www.linkedin.com\/in\/nico-gro\u00dfkreuz-3a7a26153\/"],"url":"https:\/\/www.inovex.de\/de\/blog\/author\/ngrosskreuz\/"}]}},"_links":{"self":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/45870","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/users\/333"}],"replies":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/comments?post=45870"}],"version-history":[{"count":8,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/45870\/revisions"}],"predecessor-version":[{"id":62799,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/45870\/revisions\/62799"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/media\/47356"}],"wp:attachment":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/media?parent=45870"}],"wp:term":[{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/tags?post=45870"},{"taxonomy":"service","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/service?post=45870"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/coauthors?post=45870"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}