Code Assistant: How to Self-Host Your Own


The release of the Code Assistant GitHub Copilot to the public in June 2021 marked the beginning of a new kind of helper in the tool belt of developers – alongside existing ones such as linters and formatters.

While basic code completion has been on the market for years with varying degrees of complexity, a tool that understands code and completes it in a meaningful way that transcends simple parameter suggestions was a novelty.

This blog article shows how to build a state-of-the-art Code Assistant using several open source tools created by Hugging Face 🤗:

  • Text Generation Inference as the inference backend
  • huggingface’s VSCode extension for completions right in the IDE
  • Chat UI as a browser frontend

… all via a single docker-compose file 🔥! This file and all the others discussed in this article are available in an accompanying repository.

Wait… Have We Been There Already?

Kite was one of the companies that provided a more advanced variant of code completion and gave up on the task for various reasons. In late 2022 the company gave the following explanation:

First, we failed to deliver our vision of AI-assisted programming because we were 10+ years too early to market, i.e. the tech is not ready yet.

We built the most-advanced AI for helping developers at the time, but it fell short of the 10× improvement required to break through because the state of the art for ML on code is not good enough. You can see this in Github Copilot, which is built by Github in collaboration with Open AI. As of late 2022, Copilot shows a lot of promise but still has a long way to go.

But in “late“ 2023 you can run a publicly available model that even beats ChatGPT and older versions of GPT-4 on your personal computer! One year in AI moves blazingly fast and can cover a decade…

Challenge Accepted

Ever since Copilot was released, the open source LLM community has tried its best to replicate its functionality. ChatGPT and GPT-4 raised the bar even higher. The release of StarCoder by the BigCode project was a major milestone for the open LLM community: the first truly powerful large language model for code generation that was released to the public under a responsible but nonetheless open license. The code wars had begun and the source was with StarCoder.

While it still performed considerably worse than the proprietary and walled GPT-4 (67 in March) and ChatGPT (48.1) models on the HumanEval benchmark with 32.9 points, it positioned itself successfully within striking distance.

The releases of Llama 2 and subsequently Code Llama – both by Meta – are also important waypoints. Code Llama achieved an impressive HumanEval pass@1 score of 48.8, beating ChatGPT. A few days later, WizardCoder built on top of Code Llama, achieving 73.2 pass@1, which even surpasses GPT-4’s March score!

Why Bother with Self-Hosting?

While Coding Assistant services like GitHub Copilot and tabnine (which allows VPC and air-gapped installs) already exist, there are many reasons to self-host one for your company or even yourself.

  • Full control over all the moving parts, models and software
  • The ability to easily fine-tune models on your own data
  • No vendor lock-in
  • The fact that by now many of the most capable models are public anyway
  • Various compliance reasons

On August 22, Hugging Face 🤗 announced an enterprise Code Assistant called SafeCoder, which brings together StarCoder (and other models), an inference endpoint and a VSCode extension in a single managed package. SafeCoder addresses many of the points above, but hides most of its moving parts behind its managed service – by design. Luckily, the main components are open source and readily available. In the following, we will set up everything that is needed to run your very own Coding Assistant serviced by you.

Prerequisites

The best and most performant way to run LLMs today is by leveraging GPUs or TPUs. This article assumes that you have an NVIDIA GPU with CUDA support and at least 10 gigabytes of VRAM at your disposal. Be sure to install an up-to-date driver and CUDA version. You will also need Docker (or another container engine like Podman) and the NVIDIA Container Toolkit.

First Component: The Inference Engine

The core of the Coding Assistant is the backend that handles the user’s completion requests and generates new tokens based on them. For this we will use huggingface’s Text Generation Inference, which powers Inference Endpoints and the Inference API – a well-tested and vital part of huggingface’s infrastructure. Note that the license for the software changed slightly recently: TGI (Text Generation Inference) from 1.0 onwards uses a new license called HFOIL 1.0, which restricts commercial use. Olivier Dehaene, the maintainer of the project, summarises the implications of the license as follows:

building and selling a chat app for example that uses TGI as a backend is ok whatever the version you use
building and selling a Inference Endpoint like experience using TGI 1.0+ requires an agreement with HF

While this summary should give you a basic understanding of what is possible under the license, be sure to consult a lawyer to get a thorough understanding of whether your use case is covered or not.

The Model: WizardCoder

We will use a quantised and optimised version of a SOTA Code Assistant model called WizardCoder. There are several options available today for quantised models: GPTQ, GGML, GGUF… Tom Jobbins aka “TheBloke“ gives a good introduction here. Since GGUF is not yet available for Text Generation Inference, we will stick to GPTQ. For the model to run properly, you will need roughly 10 gigabytes of available VRAM. If you happen to have more than that available, feel free to try the 34B model, or the slightly better 34B Phind model, which unfortunately is not yet available in a 13B version. Also, check the “Big Code Models Leaderboard“ on huggingface regularly to select the best-performing model for your use case.

Setting up Text Generation Inference

Create a docker-compose.yml file with the following contents:
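
The exact contents depend on your GPU and on the model you pick; the following is a minimal sketch, assuming the TGI 1.0.3 image and TheBloke’s GPTQ quantisation of WizardCoder-Python-13B. Adapt the image tag, model ID and GPU settings to your setup:

```yaml
version: "3.8"

services:
  text-generation:
    # Assumption: TGI 1.0.3 image; pick the tag that matches your needs (and license considerations).
    image: ghcr.io/huggingface/text-generation-inference:1.0.3
    container_name: text-generation
    ports:
      - "8080:80"          # TGI listens on port 80 inside the container
    shm_size: "1g"
    volumes:
      - ./data:/data       # models are downloaded here on first start
    environment:
      # Defaults can be overridden via an .env file (see below).
      - MODEL_ID=${MODEL_ID:-TheBloke/WizardCoder-Python-13B-V1.0-GPTQ}
      - QUANTIZE=${QUANTIZE:-gptq}
      - MAX_INPUT_LENGTH=${MAX_INPUT_LENGTH:-4000}
      - MAX_TOTAL_TOKENS=${MAX_TOTAL_TOKENS:-4096}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```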

Optionally, create an .env file with:
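
If you use variable substitution as in the sketch above, the .env file simply overrides those defaults, so you can swap the model without touching the compose file. The variable names below are tied to that sketch:

```env
MODEL_ID=TheBloke/WizardCoder-Python-13B-V1.0-GPTQ
QUANTIZE=gptq
MAX_INPUT_LENGTH=4000
MAX_TOTAL_TOKENS=4096
```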

Finally, use sudo docker compose up -d to run the text generation service. It will now be available at localhost:8080. sudo docker container ls gives you a list of all running container instances. Next, type sudo docker logs text-generation --follow to get live output of the TGI container logs. This is particularly helpful for debugging. As you can see in the logs, TGI will download the model the first time it is run and save it to the data folder that is mounted as a volume inside the container.

To test if everything was set up correctly, try to send the following POST request to your API from a new terminal window/tab:
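
A plain curl call against TGI’s /generate route is enough for a smoke test; the prompt here is an arbitrary example:

```bash
curl http://localhost:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "def fibonacci(n):"}'
```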

Now, you should get a response back from the API and also see the request in the container logs! Note that the quality of the response may very well be lacking, since we did not configure any parameters for our request, as this is just to test the basic functionality. You should now have Text Generation Inference up and running on your machine with WizardCoder as a model. Well done!

Second Component: The VSCode Extension

Next, we will setup a plugin for Visual Studio Code that allows us to query TGI conveniently from our IDE! For this we will use huggingface’s VSCode extension available from the marketplace. The plugin is actively developed and thankfully a recent update made it possible to configure the max_new_tokens parameter, which controls how long the model’s response can be. A larger number allows for longer code to be generated but also results in more load.

Setting up the Extension

Once you have installed the plugin, head over to the extension settings. We will need to configure a few parameters:

  1. First, change the Hugging Face Code: Config Template to WizardLM/WizardCoder-Python-34B-V1.0
  2. Next, configure the Hugging Face Code: Model ID Or Endpoint setting and change it to http://YOUR-SERVER-ADDRESS-OR-IP:8080/generate or localhost if TGI runs on the same machine.

To test if everything works as intended, create a new .py file and copy over the following text. Since we are using an instruction model, the model will perform best when prompted properly:
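
A sketch of such a file could look like this; WizardCoder expects an Alpaca-style instruction format, and the concrete task in the comments is just an example:

```python
# Below is an instruction that describes a task.
# Write a response that appropriately completes the request.

### Instruction:
# Write a Python function that checks whether a given string is a palindrome.

### Response:
def is_palindrome(text: str) -> bool:
```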

Then move your cursor to the end of the function definition’s line and hit enter. You should see a spinning circle at the bottom of the window and should be greeted with some (hopefully functional) code!

Third Component: The Chat UI

Would it not be convenient to also be able to access the Code Assistant from your web browser without needing to open an IDE? Certainly! And this is where another great piece of open source software comes into play: huggingface’s Chat UI. It is the very same code that drives HuggingChat, a very well put together variant of the familiar ChatGPT UI.

Setting up Chat UI

First, clone the repository and create a file called .env.local in its root directory with the following contents:
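
Chat UI is configured entirely via environment variables; the MODELS variable holds a JSON description of the backend. The following is only a sketch – the model name, prompt template and generation parameters are assumptions you will want to tune, and the MongoDB hostname matches the container name used later on:

```env
MONGODB_URL=mongodb://mongo-chatui:27017
PUBLIC_APP_NAME=Code Assistant

# The model list is a JSON string wrapped in backticks.
MODELS=`[
  {
    "name": "WizardCoder",
    "endpoints": [{"url": "http://YOUR-SERVER-ADDRESS-OR-IP:8080/generate_stream"}],
    "chatPromptTemplate": "{{#each messages}}{{#ifUser}}### Instruction:\n{{content}}\n\n{{/ifUser}}{{#ifAssistant}}### Response:\n{{content}}\n\n{{/ifAssistant}}{{/each}}### Response:\n",
    "parameters": {
      "temperature": 0.2,
      "max_new_tokens": 1024,
      "truncate": 3072,
      "stop": ["</s>"]
    }
  }
]`
```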

There is still a lot of room for improvement, especially in the chatPromptTemplate section. See here for further information.

Unfortunately, no prebuilt Docker image exists for Chat UI. Thus, we have to build the image ourselves. The .env and .env.local files are needed at build-time, so be sure to have them ready. Run the following command in the root directory of the Chat UI repository:
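
Something along these lines should do; the image tag is an arbitrary choice:

```bash
sudo docker build -t chat-ui:latest .
```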

Next, create a new folder and create a new docker-compose.yml file with the following contents. It is important that the .env file from Chat UI is not in the same folder hierarchy as the docker-compose.yml (hence the new folder), since Docker Compose would otherwise try to parse and use the .env file, which leads to parsing errors due to the JSON string formatting. And we do not need the .env file and its contents at runtime, anyway.
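
A minimal sketch, assuming the chat-ui:latest image built above and a MongoDB container named mongo-chatui (matching the MONGODB_URL from .env.local):

```yaml
version: "3.8"

services:
  chat-ui:
    # Assumption: the image built from the Chat UI repository in the previous step.
    image: chat-ui:latest
    container_name: chat-ui
    ports:
      - "3000:3000"
    depends_on:
      - mongo-chatui

  mongo-chatui:
    image: mongo:latest
    container_name: mongo-chatui
    volumes:
      - ./mongo:/data/db   # persist chat history across restarts
```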

Now, we can test-drive Chat UI. To do so, type sudo docker compose up -d in the directory of the docker-compose.yml (as before with TGI) and be sure to also keep an eye on the logs via sudo docker container logs chat-ui --follow. If all works as expected, you should be able to access the UI on port 3000!

Code Assistant example using the UI

Putting Everything Together

It is also possible, of course, to use one combined docker-compose file if you are willing to host the backend, frontend and database on the same machine. Copy the data folder from earlier so the models do not need to be re-downloaded. You might also have to remove the old Chat UI and database containers using sudo docker container remove chat-ui mongo-chatui.
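
A combined file could look roughly like this – essentially the two earlier sketches merged into one, with the same assumptions about image tags, model ID and folder names:

```yaml
version: "3.8"

services:
  text-generation:
    image: ghcr.io/huggingface/text-generation-inference:1.0.3
    container_name: text-generation
    ports:
      - "8080:80"
    shm_size: "1g"
    volumes:
      - ./data:/data
    environment:
      - MODEL_ID=${MODEL_ID:-TheBloke/WizardCoder-Python-13B-V1.0-GPTQ}
      - QUANTIZE=${QUANTIZE:-gptq}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  chat-ui:
    image: chat-ui:latest
    container_name: chat-ui
    ports:
      - "3000:3000"
    depends_on:
      - mongo-chatui
      - text-generation

  mongo-chatui:
    image: mongo:latest
    container_name: mongo-chatui
    volumes:
      - ./mongo:/data/db
```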

Do not forget to change the endpoints parameter in the MODELS variable of Chat UI’s .env.local to "endpoints":[{"url":"http://text-generation:/generate_stream"}], since we can now conveniently use the container address of the shared Docker network. Remember, you have to re-build the image after adapting the .env.local file.

Great! Now you can start the backend, the frontend and the database with one single sudo docker compose up -d.

Bonus: Adding HTTPS

Up to this point, the API and UI are served only via HTTP. It is therefore advisable to secure the traffic with HTTPS, with the help of a reverse proxy like nginx. Without HTTPS, you will not be able to access the UI from destinations other than localhost.

Create a new directory called nginx and inside of it a new file nginx.conf. The specific settings depend on which registrar you are using and on whether you only want to make the service available to your local network.

This nginx.conf template can serve as a starting point:
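
The server name, certificate paths and the routing split between UI and API below are assumptions – adjust them to your domain and to which services you actually want to expose:

```nginx
events {}

http {
    include /etc/nginx/mime.types;

    # Redirect plain HTTP to HTTPS.
    server {
        listen 80;
        server_name code-assistant.example.com;   # assumption: replace with your domain
        return 301 https://$host$request_uri;
    }

    server {
        listen 443 ssl;
        server_name code-assistant.example.com;   # assumption: replace with your domain

        ssl_certificate     /etc/nginx/certificates/fullchain.pem;
        ssl_certificate_key /etc/nginx/certificates/privkey.pem;

        # Chat UI frontend
        location / {
            proxy_pass http://chat-ui:3000;
            proxy_set_header Host $host;
            proxy_set_header X-Forwarded-Proto $scheme;
            proxy_buffering off;   # keep token streaming responsive
        }

        # Text Generation Inference API (e.g. for the VSCode extension)
        location /api/ {
            proxy_pass http://text-generation:80/;
            proxy_set_header Host $host;
            proxy_buffering off;
        }
    }
}
```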

You also need to add the nginx service to your existing docker-compose.yml.
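
Added under the existing services: key of the combined file, this could look roughly as follows; the mounted paths have to match the nginx.conf above:

```yaml
  nginx:
    image: nginx:latest
    container_name: nginx
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
      - ./certificates:/etc/nginx/certificates:ro
    depends_on:
      - chat-ui
      - text-generation
```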

Now you only need to generate the certificates, save them in the certificates folder and restart everything.
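
For a quick test without a registrar, one option is a self-signed certificate generated with openssl (browsers will warn about it; for real deployments use certificates from your registrar or Let's Encrypt):

```bash
# Self-signed certificate for testing only; the domain is an assumption.
mkdir -p certificates
openssl req -x509 -newkey rsa:4096 -sha256 -days 365 -nodes \
  -keyout certificates/privkey.pem -out certificates/fullchain.pem \
  -subj "/CN=code-assistant.example.com"
```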

This is it!

Good job. You now have all the components needed to self-host your very own Code Assistant. Thanks to the awesome people at huggingface, it is easier than ever. And maybe you even learned a thing or two along the way. Before you put it into production though, you may want to do a final load test, e.g. via locust. Doing so, you get an understanding of how many users are able to use the service at the same time. For this you will need to write a small locust-file.py – and for that you could kindly ask WizardCoder to help you out 🧙‍♀️.
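
A minimal sketch of such a file, hitting TGI’s /generate endpoint with an arbitrary prompt and parameters you will want to adjust:

```python
# locust-file.py
# Run with: locust -f locust-file.py --host http://localhost:8080
from locust import HttpUser, task, between


class CodeAssistantUser(HttpUser):
    # Simulated users wait 1 to 5 seconds between completion requests.
    wait_time = between(1, 5)

    @task
    def generate(self):
        self.client.post(
            "/generate",
            json={
                "inputs": "def fibonacci(n):",
                "parameters": {"max_new_tokens": 60},
            },
        )
```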
