In this blog post, we make the case for BentoML: an MLOps tool that is both easy to start with – even for prototyping – and production-ready for low-to-medium-scale projects. While there are many MLOps solutions out there, the ease of going from prototyping to production is what makes BentoML stand out. ML, and especially Generative AI, is moving incredibly fast these days, but BentoML helps you prototype quickly on a solid foundation, with only a small amount of custom code and, most importantly, the ability to scale.
Putting Machine Learning models into production can be a challenging task, as the field of MLOps is still relatively young. Several things need to be considered, such as scalability, model versioning, and the complexity of the software stack. MLOps tools like BentoML do a lot of the heavy lifting for you, but it can be hard to settle on one at an early stage of the product.
That is why a first iteration might involve a custom-built REST API leveraging tools such as Flask, Django, or FastAPI. While these frameworks are powerful and highly customisable, they often lead to a considerable amount of custom code in future iterations and can slow down development. Features that seemed to be simply “nice to have”, such as monitoring, log management, or the ability to group similar (model) requests together, will become a necessity. And especially with state-of-the-art models, storage and memory management can become a challenge.
The Architectural Journey
The following paragraphs describe the evolution of a model-serving backend in the context of a student-driven internal research project called parrot. Its purpose is to build an AI platform that provides access to state-of-the-art Generative AI models via a modern and intuitive user interface. Thus, the subsequent discussion of advantages, disadvantages, and architectural considerations is the result of our very own technological journey and therefore an example of the challenges that a real-world MLOps solution faces in a production setting.
First Iteration: Flask only
There are three popular libraries for building a RESTful API in Python: Flask, FastAPI, and Django (REST framework). While all three are excellent choices for most common use cases, serving a machine learning model with any of the aforementioned libraries can be challenging – as you can see in the following paragraphs.
In the early phases of the product life cycle, we wanted to start with a simple, trusted technology that is well-established and ideally offers lots of room for customisation in future iterations. That is why we chose Flask instead of a dedicated MLOps tool, as it does not lock you into a specific way of doing things. To quote the documentation: “Flask can be everything you need and nothing you don’t”.
In the beginning, having a simple endpoint with some custom code that serves a model for one user at a time is sufficient. This is where Flask does make a lot of sense: We knew the framework already and had something very simple to show to stakeholders. But the more features were added to the barebones model serving application, the more we had to think about necessary functionality that only seemed to be “nice-to-have“ at first. And the more users the application is expected to serve, the more urgent these issues become. It is easy and straightforward to draw a simple sketch with a regular pencil, but you will have a hard time creating a construction drawing with nothing but it.
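To make this concrete, a first iteration along these lines can be as small as the following sketch. This is not the actual parrot code: the route and the model are placeholders, and a real endpoint would add validation and error handling.

from flask import Flask, jsonify, request
from transformers import pipeline

app = Flask(__name__)
# placeholder model; parrot serves much larger Generative AI models
generator = pipeline("text-generation", model="gpt2")


@app.route("/completion", methods=["POST"])
def completion():
    payload = request.get_json()
    # the model runs synchronously inside the request handler,
    # blocking the web worker for the duration of the inference
    output = generator(payload["prompt"], max_length=payload.get("max_length", 50))
    return jsonify(output)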
First Iteration: Shortcomings
The more users are using the app, the more bugs and crashes are going to be discovered, which have to be fixed quickly. To be able to diagnose a problem, detailed application logs are a must. While this is not hard to implement, it does take some time to do it right, especially when running multiple services and containers. Having a sound logging setup serves as an example of why deploying an ML model is not that simple.
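Even a minimal setup already has to answer questions about log format, levels, and how output from multiple processes and containers is correlated. A rough sketch of the starting point (the logger name is illustrative):

import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s [%(process)d] %(message)s",
)

logger = logging.getLogger("parrot.api")
# a production setup additionally needs request IDs and centralised log shipping
logger.info("model request received")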
Second Iteration: Flask + Redis + Redis Queue
Arguably, the biggest concern is the load that each user generates. Especially when state-of-the-art Generative AI Large Language Models (LLMs) are served, performance and optimisation are never an afterthought but a necessity. To better handle the load and be able to scale, we decided to refactor the application by decoupling the API from the resource-intensive model execution. We leveraged a Redis instance that kept track of all the model prompts sent to the API. Separate worker containers then picked a task from the queue and wrote their respective model’s output back to Redis.
With this architecture, we gained a lot of stability because errors in the model execution no longer compromised the core API process. It also allowed us to quickly scale the computing resources according to our needs. The downside, however, was the significant amount of time required for refactoring and, most importantly, the added complexity. And since model serving is only a means to an end and not the core product, it was hard to justify spending more and more time developing what was essentially another product.
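In terms of code, the decoupling boils down to the API enqueueing a job that a separate worker container executes. A simplified sketch, in which the queue name, host, and the worker-side function run_model_inference are illustrative:

from redis import Redis
from rq import Queue

redis_connection = Redis(host="redis", port=6379)
queue = Queue("model-inference", connection=redis_connection)

# inside the API endpoint: enqueue the prompt instead of running the model in-process
# (run_model_inference is an illustrative worker-side function)
job = queue.enqueue(run_model_inference, prompt, job_timeout=300)
# a worker container picks up the job; the API later reads job.result back from Redis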
Second Iteration: Shortcomings
Apart from these concerns, another technical issue made us turn our back on our custom-tailored solution: Python and the way it interacts with “copy-on-write”, something that is even further removed from our actual product.
So why does this concern us? Redis Queue (rq) uses forks under the hood to separate the worker processes from each other. Thus, whenever a model inference job is fetched from the queue, the main process is forked, and it is this fork that runs the model inference. The nice thing about forking is that no memory object is duplicated up front; it is merely referenced. So in theory, the Generative AI model, which is multiple gigabytes in size, can reside in the main process alone, while its children read it via a reference – unless they modify it. But we only want to run a model, not train it. So we should be good, right?
Unfortunately not. As it happens, reading the model object does in fact mean writing to it due to Python’s reference counting. Instead of one model for multiple processes, every process consumes multiple gigabytes separately. With large models, this makes memory usage explode really fast. For more information on how and why Python behaves the way it does, see the excellent “Understanding and Optimizing Python multi-process Memory Management“ post by Luis Sena.
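The effect is easy to observe: even binding an additional name to an object writes to the reference count stored in the object itself, which is enough to make the operating system copy the touched memory pages in a forked child. A tiny illustration:

import sys

model = ["weights"] * 1000  # stand-in for a large model object

print(sys.getrefcount(model))  # 2: the variable plus the temporary reference held by getrefcount
another_reference = model      # "just reading" the object by binding a new name ...
print(sys.getrefcount(model))  # 3: ... already modified the count stored inside the object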
It is possible to make use of a workaround in rq: using SimpleWorker instead of the regular worker class. This comes with its own drawbacks though. With only a single process active at any given moment for each worker, tasks can only be processed one after another, not simultaneously. And that also means that most of the advantages of Redis and rq like process isolation and multiprocessing do not apply anymore.
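For reference, the workaround merely swaps the worker class when the worker process is started (a sketch; queue name and host are illustrative):

from redis import Redis
from rq import Queue
from rq.worker import SimpleWorker

redis_connection = Redis(host="redis", port=6379)
queue = Queue("model-inference", connection=redis_connection)

# SimpleWorker executes jobs in the worker's main process instead of forking,
# so the multi-gigabyte model is loaded only once per worker container
worker = SimpleWorker([queue], connection=redis_connection)
worker.work()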
Third Iteration: BentoML
“Keep it simple, stupid” does not necessarily mean that you should reinvent the wheel. After looking into BentoML, we realised that a lot of the issues we ran into with our custom MLOps application were already solved by it:
- good integration into the existing Python ML ecosystem (Scikit-Learn, PyTorch, TensorFlow, Transformers, ONNX, LightGBM to name a few)
- (adaptive) batching of similar requests (see the sketch after these lists)
- a queue similar to Celery and separate worker processes with model sharing between processes
- a central model registry with versioning
- and, above all, a lot less boilerplate code
It also comes with a lot of operational features:
- integrated logging (with support for Open Telemetry)
- easy to set up, deploy and monitor via Prometheus metrics endpoints
- easily scalable and k8s friendly (with Yatai)
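To make the batching point from the first list more tangible: a runner method only needs to be declared batchable, and BentoML’s serving layer groups concurrent requests into a single model call. A schematic sketch with a placeholder model, not the parrot code:

import bentoml
from transformers import pipeline


class BatchableTextGenerator(bentoml.Runnable):
    def __init__(self):
        # placeholder model for illustration
        self.pipeline = pipeline("text-generation", model="gpt2")

    @bentoml.Runnable.method(batchable=True, batch_dim=0)
    def generate(self, prompts: list) -> list:
        # BentoML's adaptive batching collects concurrent requests into `prompts`
        # and dispatches them to the model as one batch
        return self.pipeline(prompts)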
For another insightful post about the advantages of using a specialised MLOps tool like BentoML instead of Flask and FastAPI, check out “Breaking Up With Flask & FastAPI: Why ML Model Serving Requires A Specialized Framework” by BentoML’s Head of Product Tim Liu.
The easiest and most accessible way to get across the advantages of using BentoML is to show the actual code of the application. So sit back, grab a pencil for sketches, and enjoy the ride!
Load Models to BentoML Model Registry
Before you can build your bento with bentoml build, you need to load your models into your local BentoML model registry. You can do this by executing the load_models_into_bento.py script. It generates a Transformers pipeline from the loaded model and tokenizer and saves this pipeline to the local model registry.
model = AutoModelForCausalLM.from_pretrained(
    model_identifier, trust_remote_code=True, revision="main"
)
tokenizer = AutoTokenizer.from_pretrained(model_identifier)
generator = pipeline(PIPELINE_TASK, model=model, tokenizer=tokenizer)

bentoml.transformers.save_model(
    f'{PIPELINE_PREFIX}{model_name.replace("/", "-")}', generator
)
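Afterwards, you can quickly check that the pipelines actually landed in the local model store (by default at ~/bentoml/models). A minimal sketch using BentoML’s Python API:

import bentoml

# list all models in the local BentoML model store
for model in bentoml.models.list():
    print(model.tag)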
Building the Bento
To build a bento, you need a bentofile.yaml and a service.py. The bentofile.yaml serves as the config file for BentoML. Here you specify the name, which files to include, and which dependencies to install when building the bento. You can find more on this here. We set the dependency section to pip_args: "-e /home/bentoml/bento/src/.". With this, all dependencies from the poetry.lock are installed.
service: "service:service" include: - "src/" - "poetry.lock" - "pyproject.toml" - "README.md" - "service.py" python: pip_args: "-e /home/bentoml/bento/src/." |
Creating the main Script and defining Schemas
The service.py acts as the main script. Here, we first specify input and output schemas. After declaring a Pydantic class with upper and lower bounds for the inputs and outputs respectively, BentoML creates the corresponding OpenAPI specification (including schemas!) automatically – very similar to how FastAPI does it. A specification gives consumers of the API a complete and detailed picture of how the API behaves, which inputs are allowed, and what outputs are to be expected.
# input schema
class InputSchema(BaseModel):
    max_length: conint(ge=1, le=500)
    n_sequences: conint(ge=1, le=5)
    prompt: constr(min_length=1, max_length=250)
    selected_model: AllowedModels


# output schema
class GeneratedText(BaseModel):
    length: int
    max_length: int
    prompt: str
    model_output: str
    model: AllowedModels


class OutputSchema(BaseModel):
    doc_list: List[GeneratedText]


output_spec = JSON(pydantic_model=OutputSchema)
input_spec = JSON(pydantic_model=InputSchema)
Finally, the schemas are translated into the OpenAPI specification. The InputSchema for example is represented like this:
...
InputSchema:
  properties:
    max_length:
      maximum: 500
      minimum: 1
      title: Max Length
      type: integer
    n_sequences:
      maximum: 5
      minimum: 1
      title: N Sequences
      type: integer
    prompt:
      maxLength: 250
      minLength: 1
      title: Prompt
      type: string
    selected_model:
      $ref: '#/components/schemas/AllowedModels'
  required:
    - max_length
    - n_sequences
    - prompt
    - selected_model
  title: InputSchema
  type: object
...
After obtaining a BentoML runner for each model we want to serve, we can create the service object. A runner holds a model and its execution context.
MODEL_TO_RUNNER = {
    model.value: get_model_runner(model.value, PIPELINE_PREFIX)
    for model in AllowedModels
}

# service definition
service = bentoml.Service(
    SERVICE_NAME,
    runners=list(MODEL_TO_RUNNER.values()),
)
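The helper get_model_runner is not shown here; conceptually it only needs to fetch the stored pipeline by its tag and turn it into a runner. A sketch of such a helper (the tag format mirrors the save script above):

def get_model_runner(model_name: str, prefix: str):
    # fetch the pipeline saved by load_models_into_bento.py and wrap it in a runner
    tag = f'{prefix}{model_name.replace("/", "-")}:latest'
    return bentoml.transformers.get(tag).to_runner()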
Using the service object and the schemas, we can define the endpoint we want to expose.
@service.api(
    route="/code-completion",
    input=input_spec,
    output=output_spec,
)
def completion(input_data: InputSchema) -> OutputSchema:
    ...
Inside this function, we specify the business logic that runs when the endpoint is called: first a bit of input processing, then the call to the run() function of the correct runner, and finally some output processing.
runner = MODEL_TO_RUNNER.get(selected_model)
generated_text = runner.run(
    prompt, max_length=max_length, num_return_sequences=n_sequences
)
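Put together, the endpoint body looks roughly like the following condensed sketch. The actual pre- and post-processing in parrot is a bit more involved, and the mapping into the GeneratedText fields is illustrative:

@service.api(route="/code-completion", input=input_spec, output=output_spec)
def completion(input_data: InputSchema) -> OutputSchema:
    # input processing: the runner map is keyed by the enum value
    selected_model = input_data.selected_model.value
    prompt = input_data.prompt
    max_length = input_data.max_length
    n_sequences = input_data.n_sequences

    # model inference via the runner
    runner = MODEL_TO_RUNNER.get(selected_model)
    generated_text = runner.run(
        prompt, max_length=max_length, num_return_sequences=n_sequences
    )

    # output processing: wrap the raw pipeline output into the response schema
    docs = [
        GeneratedText(
            length=len(item["generated_text"]),  # assuming `length` stores the output length
            max_length=max_length,
            prompt=prompt,
            model_output=item["generated_text"],
            model=input_data.selected_model,
        )
        for item in generated_text
    ]
    return OutputSchema(doc_list=docs)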
Then, you can call bentoml build. This will create a bento in your local BentoML bento registry, which is by default at ~/bentoml/bentos/.
Containerization
When building a bento, BentoML auto-generates a Dockerfile that adheres to many of the best practices of packaging Python software. This feature freed us from having to optimise the images ourselves. See below for an example Dockerfile that uses a sensible base image, adds a non-privileged runtime user, and sets relevant Python-specific environment variables.
FROM python:3.9-slim as base-container

ENV LANG=C.UTF-8
ENV LC_ALL=C.UTF-8
ENV PYTHONIOENCODING=UTF-8
ENV PYTHONUNBUFFERED=1

USER root
ENV DEBIAN_FRONTEND=noninteractive
RUN rm -f /etc/apt/apt.conf.d/docker-clean; echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' > /etc/apt/apt.conf.d/keep-cache
RUN set -eux && \
    apt-get update -y && \
    apt-get install -q -y --no-install-recommends --allow-remove-essential \
        ca-certificates gnupg2 bash build-essential

# Block SETUP_BENTO_USER
ARG BENTO_USER=bentoml
ARG BENTO_USER_UID=1034
ARG BENTO_USER_GID=1034
RUN groupadd -g $BENTO_USER_GID -o $BENTO_USER && useradd -m -u $BENTO_USER_UID -g $BENTO_USER_GID -o -r $BENTO_USER

ARG BENTO_PATH=/home/bentoml/bento
ENV BENTO_PATH=$BENTO_PATH
ENV BENTOML_HOME=/home/bentoml/

RUN mkdir $BENTO_PATH && chown bentoml:bentoml $BENTO_PATH -R
WORKDIR $BENTO_PATH

# Block SETUP_BENTO_COMPONENTS
COPY --chown=bentoml:bentoml ./env/python ./env/python/
# install python packages with install.sh
RUN bash -euxo pipefail /home/bentoml/bento/env/python/install.sh
COPY --chown=bentoml:bentoml . ./

# Block SETUP_BENTO_ENTRYPOINT
RUN rm -rf /var/lib/{apt,cache,log}

# Default port for BentoServer
EXPOSE 3000
# Expose Prometheus port
EXPOSE 3001

RUN chmod +x /home/bentoml/bento/env/docker/entrypoint.sh

USER bentoml

ENTRYPOINT [ "/home/bentoml/bento/env/docker/entrypoint.sh" ]
Even better: You can customise this process further by modifying Jinja2 templates!
BentoML Custom Runnable
While the 🤗 Transformers integration is good, there are a few things that are not present yet.
For example, you might want to use ONNX models and the ONNX Runtime accelerator from 🤗 Optimum to boost inference performance. At the time of writing, this feature was not implemented (yet). However, since BentoML’s Runner concept is also customisable, you can easily create a custom Runner for this purpose – or whenever the runners shipped with BentoML do not work for your use case.
Thankfully, creating custom runners is very straightforward. The following code snippet allows using the ONNX Runtime via Optimum:
class ONNXRunnable(bentoml.Runnable):
    def __init__(self, model_name: str):
        model = bentoml.transformers.get(model_name)
        model_path = model.path
        self.model = ORTModelForCausalLM.from_pretrained(
            model_path, file_name="decoder_model.onnx"
        )
        self.tokenizer = AutoTokenizer.from_pretrained(model.path)
        self.generator = pipeline(
            "text-generation", model=self.model, tokenizer=self.tokenizer
        )

    @bentoml.Runnable.method(batchable=False)
    def generate(self, prompt: str, **kwargs) -> List:
        output = self.generator(text_inputs=prompt, **kwargs)
        return output
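The custom Runnable is then wrapped into a regular runner and handed to the service just like the built-in ones. A sketch, in which the runner name and the model tag are illustrative:

onnx_runner = bentoml.Runner(
    ONNXRunnable,
    name="onnx_code_completion",
    runnable_init_params={"model_name": "pipeline-some-model"},  # illustrative model tag
)

service = bentoml.Service(SERVICE_NAME, runners=[onnx_runner])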
Additionally, the recent BentoML 1.0.22 release ships another interesting feature: OpenLLM. This makes it possible to have some of the most powerful language models up and running in no time. All that is needed is to set up the environment and type openllm start/build [falcon | flan_t5 | dolly_v2 | chatglm | stablelm | starcoder]. Deploying large complex models has never been easier!
But there were also some issues…
When using BentoML, we realised that it is still a bit rough around the edges: for example, we had issues with the export functionality in CI/CD pipelines and with bentoml containerize on WSL or within non-privileged deployment pipelines. The latter is not a big problem, since tools like kaniko can be leveraged to build the provided Dockerfiles. And obviously, the advantage of a tool that does so much for you comes with the disadvantage that not every part of the toolchain is easily customisable.
Using BentoML: Conclusion
In this blog post, we have shown how you can use BentoML to quickly serve your ML models. In the end, “BentoML can be (almost) everything you will need it to be”. For small to medium-scale Generative AI and other ML projects, we cannot recommend BentoML enough: it simplifies so much, allowing you to pay more attention to your business logic. For bigger projects, however, where you have the capacity and knowledge in your team, we recommend utilising more specialised tools like Hugging Face Text Generation Inference, ClearML Serving, or NVIDIA Triton – which is also available as an alternative model serving backend in BentoML. You can also create a custom serving layer on top of Docker and Kubernetes to fully address your specific requirements.
If you want to try out BentoML yourself, check out the README in the repository that accompanies this blog post.
Last but not least, we want to thank the inovex working students from the parrot team, especially Lennart Krauch and Thomas Jonas, for supporting the implementation of the queue-based architecture and BentoML at parrot.