Open Models on Vertex AI with Hugging Face: Serving multiple LoRA Adapters on Vertex AI
TL;DR
Are you struggling to serve multiple LoRA adapters on Vertex AI? This blog post in the Open Models on Vertex AI with Hugging Face series provides a practical example of deploying a Gemma 2 model with multiple LoRA adapters on Vertex AI using custom handlers to enable SQL and code generation tasks.
Introduction
Imagine this: you have a Gemma 2-based coding application for beginners deployed on Vertex AI, and you want to expand its capabilities to support both intermediate coders and SQL developers. You can achieve this by fine-tuning Gemma 2 with LoRA, a parameter-efficient fine-tuning technique that adapts a large language model (LLM) to new tasks by adding a small number of trainable parameters (an adapter) instead of retraining the entire model. With LoRA, you can use your own private data within your own controlled environment to build CODE and SQL Gemma adapters, and direct user requests to the appropriate adapter when you serve the model. This leads to the following question:
How do you serve multiple LoRA adapters on Vertex AI?
This article shows you how to deploy LoRA adapters on Vertex AI using the Hugging Face Deep Learning Container and its custom handler. For simplicity, you use the gemma-2-2b-it-lora-sql and gemma-2-2b-it-lora-magicoder adapters, which let you handle both SQL and code generation requests with Gemma 2.
Recap: What are Hugging Face Deep Learning Containers and custom handlers?
Hugging Face Deep Learning Containers (DLCs) are your fast track to deploying Hugging Face models on Vertex AI. These optimized Docker containers come pre-loaded with all the essential dependencies (Transformers, Datasets, Tokenizers, and more), so you don’t have to wrestle with them.
But what if you need a more customized approach? This is where Custom Handlers come into play. Custom Handlers are Python classes you can define to gain fine-grained control over your inference pipeline, from pre-processing to post-processing.
With a Custom Handler, you can tailor your deployment, adding extra steps, custom measurements, or logging. It’s all about flexibility and giving you the ability to build exactly what you need.
In this scenario, you can use a handler to define the custom inference pipeline.
For each user request, you use a proprietary model like Gemini, or even an open model like Gemma itself, as an LLM router to categorize the request as either SQL query generation or CODE generation. Based on this categorization, the request is directed to the appropriate LoRA adapter for prediction. This approach requires only a single deployment of the base model, and the small size of LoRA adapters makes it possible to load multiple adapters onto the same Vertex AI Endpoint using a Hugging Face Deep Learning Container, which reduces the cost of hosting your open models.
Now that you have a general understanding of how custom handlers allow you to serve multiple LoRA Adapters on Vertex AI, let’s have a look at how to implement it on Vertex AI. You can find the notebook here.
Custom handlers in action: Serving Gemma 2 with multiple LoRA adapters
To serve Gemma 2 with multiple LoRA adapters, you start by crafting a handler.py module. As discussed in Open Models on Vertex AI with Hugging Face: Custom handlers, the handler module must have a class named EndpointHandler and include both the `__init__` and `__call__` methods. Beyond that, you can add other methods within the class, or even functions outside of it, and utilize them within your class methods.
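As a quick reminder of that structure, a bare-bones handler.py could look like the sketch below; the method bodies are just placeholders for the model-specific logic added in the next sections.

from typing import Any, Dict, List


class EndpointHandler:
    def __init__(self, model_dir: str, **kwargs: Any) -> None:
        # Load the model, tokenizer, adapters, and any other resources here
        ...

    def __call__(self, data: Dict[str, Any]) -> Dict[str, List[Any]]:
        # Run the custom inference pipeline over data["instances"] and return the results
        return {"predictions": []}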
Here is the `__init__` method of the EndpointHandler class for this scenario.
import os
from typing import Any

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig as TGenerationConfig
from vertexai.generative_models import GenerativeModel


class EndpointHandler:
    def __init__(
        self,
        model_dir: str = "google/gemma-2-2b-it",
        sql_adapter_id: str = "/tmp/model/google-cloud-partnership/gemma-2-2b-it-lora-sql",
        magicoder_adapter_id: str = "/tmp/model/google-cloud-partnership/gemma-2-2b-it-lora-magicoder",
        router_model_id: str = "gemini-1.5-flash",
        **kwargs: Any,
    ) -> None:
        # Load the Gemma 2 tokenizer and base model from the Hugging Face Hub
        self.tokenizer = AutoTokenizer.from_pretrained(model_dir, token=os.getenv("HF_TOKEN"))
        self.model = AutoModelForCausalLM.from_pretrained(
            model_dir,
            low_cpu_mem_usage=True,
            torch_dtype=torch.float16,
            device_map="auto",
            token=os.getenv("HF_TOKEN"),
        )
        # Load both LoRA adapters from the mounted artifact directory
        self.model.load_adapter(sql_adapter_id, adapter_name="sql_adapter", is_trainable=False, device_map="auto", offload_folder="/tmp/offload")
        self.model.load_adapter(magicoder_adapter_id, adapter_name="magicoder_adapter", is_trainable=False, device_map="auto", offload_folder="/tmp/offload")
        # Gemini 1.5 Flash acts as the router that classifies incoming prompts
        self.router_model = GenerativeModel(router_model_id)
The EndpointHandler class’s `__init__` method initializes Gemma 2 as the base model and its tokenizer as the preprocessor. It also loads gemma-2-2b-it-lora-sql and gemma-2-2b-it-lora-magicoder as LoRA adapters using the Hugging Face Transformers library, registering them under the adapter names sql_adapter and magicoder_adapter. Additionally, it initializes Gemini 1.5 Flash as the router model. Note that both adapters are loaded from a local /tmp folder, which is mounted during deployment on Vertex AI, while the base model is loaded directly from the Hugging Face Hub.
With the `__init__` method defined, you then implement the EndpointHandler class’s `__call__` method. The `__call__` method for this scenario is shown below.
from typing import Any, Dict, List

import torch
from huggingface_inference_toolkit.logging import logger


class EndpointHandler:
    def __init__(...):
        ...

    def __call__(self, data: Dict[str, Any]) -> Dict[str, List[Any]]:
        logger.info("Processing new request")
        predictions = []
        for instance in data["instances"]:
            # Check the prompt
            if "inputs" not in instance:
                raise ValueError("The request body must contain the `inputs` key.")

            # Get the adapter label from the Gemini router
            # (`route_prompt` is a module-level helper defined in handler.py)
            logger.info(f'Getting the adapter label for the prompt: {instance["inputs"]}')
            prompt = instance["inputs"]
            prompt_classification = route_prompt(prompt, self.router_model)["classification"]

            # Set the adapter model
            logger.info(f"Setting the model to the {prompt_classification} adapter")
            if prompt_classification == "SQL":
                self.model.set_adapter("sql_adapter")
            else:
                self.model.set_adapter("magicoder_adapter")

            # Prepare the input
            logger.info("Preparing the input for the prompt")
            messages = [{"role": "user", "content": prompt}]
            input_ids = self.tokenizer.apply_chat_template(
                messages,
                return_tensors="pt",
            ).to(self.model.device)

            # Generate the prediction
            logger.info("Generating the prediction")
            input_len = input_ids.shape[-1]
            with torch.inference_mode():
                generation_config = instance.get(
                    "parameters", {"temperature": 0.7, "do_sample": True}
                )
                generation = self.model.generate(
                    input_ids=input_ids,
                    generation_config=TGenerationConfig(**generation_config),
                )
                generation = generation[0][input_len:]

            response = self.tokenizer.decode(generation, skip_special_tokens=True)
            logger.info(f"Generated response: {response[:50]}...")
            predictions.append(response)

        logger.info(f"Successfully processed {len(predictions)} instances")
        return {"predictions": predictions}
The `__call__` method contains the inference logic. For every instance, it takes the user’s prompt and classifies it with the Gemini-based router via the route_prompt helper (a sketch of this helper follows below). The sql_adapter is selected if the resulting label is SQL; otherwise, the magicoder_adapter is selected. The prompt is then preprocessed with the chat template, the resulting token IDs are passed to the model with the chosen adapter, and the generated tokens are decoded into text predictions that are returned in a dictionary.
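The route_prompt helper itself is not shown in the snippets above. A minimal sketch of what it could look like, assuming the Gemini router is simply asked for a single-word label, is given below; the routing prompt and parsing logic are illustrative assumptions rather than the exact implementation.

from vertexai.generative_models import GenerativeModel


def route_prompt(prompt: str, router_model: GenerativeModel) -> dict:
    """Classify a user prompt as either SQL or CODE generation (illustrative sketch)."""
    routing_prompt = (
        "Classify the following request as either SQL or CODE generation. "
        "Answer with a single word: SQL or CODE.\n\n"
        f"Request: {prompt}"
    )
    response = router_model.generate_content(routing_prompt)
    # Default to CODE whenever the router does not clearly answer SQL
    label = "SQL" if "SQL" in response.text.strip().upper() else "CODE"
    return {"classification": label}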
To use the custom handler in combination with the Hugging Face Deep Learning Container (DLC) for PyTorch Inference, you can either upload the artifacts (handler.py, requirements.txt, and the adapters) to a Google Cloud Storage (GCS) bucket, or instead use the publicly available weights from the Hugging Face Hub. You can download the gemma-2-2b-it-lora-sql and gemma-2-2b-it-lora-magicoder adapters from the Hugging Face Hub with the Hugging Face CLI (huggingface-cli), or use a locally cached version (if available in your HF_HOME directory). Your GCS bucket should end up containing handler.py, requirements.txt, and both adapter folders.
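If you prefer to stay in Python rather than use huggingface-cli, a minimal sketch of downloading both adapters with the huggingface_hub library might look like the following; the local artifacts/ directory is an illustrative choice, and you would then copy its contents to your GCS bucket.

from huggingface_hub import snapshot_download

# Download both LoRA adapters locally before copying them to the GCS bucket
for adapter_repo in (
    "google-cloud-partnership/gemma-2-2b-it-lora-sql",
    "google-cloud-partnership/gemma-2-2b-it-lora-magicoder",
):
    snapshot_download(
        repo_id=adapter_repo,
        local_dir=f"artifacts/{adapter_repo}",  # illustrative local path
    )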
After you upload both the custom handler code (handler.py) and the LoRA adapter weights to a Google Cloud Storage (GCS) bucket, you deploy the Gemma 2 model and its adapters to a Vertex AI Endpoint by first registering them in the Vertex AI Model Registry, the centralized repository that manages the lifecycle of your ML models on Vertex AI. Below you can see how to register the model using the upload method of the Vertex AI SDK for Python.
from google.cloud.aiplatform import Model
from huggingface_hub import get_token

model = Model.upload(
    display_name="google--gemma2-tgi-multi-lora-model",
    # `serve_uri` points to the GCS folder containing handler.py, requirements.txt and the adapters
    artifact_uri=str(serve_uri),
    serving_container_image_uri="us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311",
    serving_container_ports=[8080],
    serving_container_environment_variables={
        "HUGGING_FACE_HUB_TOKEN": get_token(),
    },
)
model.wait()
To upload a model, you specify the model name (display_name) and the Hugging Face DLC for PyTorch Inference (serving_container_image_uri), which will be used to serve the model. Optional settings include the serving_container_environment_variables, such as the Hugging Face Hub token (HUGGING_FACE_HUB_TOKEN), which is required in this example because google/gemma-2-2b-it is a gated model, and the serving_container_ports, where the Vertex AI endpoint will be exposed (8080 by default). Additional information about supported upload arguments can be found in its Python reference. Upon successful registration, a new model version appears in the Vertex AI Model Registry UI.
After registering the model, you can deploy your custom-handler-enabled Gemma 2 model with multiple LoRA adapters to an Endpoint within Vertex AI Prediction for scalable online and batch inference. The code example below demonstrates how to create an endpoint and deploy the model using the deploy method of the Vertex AI SDK for Python.
from google.cloud.aiplatform import Endpoint

deployed_model = model.deploy(
    endpoint=Endpoint.create(display_name="google--gemma2-tgi-multi-lora-endpoint"),
    machine_type="g2-standard-4",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
)
Setting the endpoint is optional when deploying the model; by default, an endpoint is created with the model display name plus the suffix “_endpoint”. Additionally, you can specify the instance type (machine_type), the accelerator (accelerator_type), and the number of accelerators (accelerator_count). More information on the supported deploy arguments can be found in its Python reference. You can verify the status of the endpoint in the Vertex AI Prediction UI.
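If you prefer to check programmatically rather than in the UI, a small sketch using the endpoint object returned by deploy could look like this (a hedged example, not part of the original notebook):

# List the models currently deployed to the endpoint returned by `model.deploy(...)`
for deployed in deployed_model.list_models():
    print(deployed.id, deployed.display_name)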
The deployment of your model to Vertex AI will take some time. Once the endpoint is ready, you can use the Gemma 2 model and its adapters to generate either code or SQL predictions through the Vertex AI SDK for Python, gcloud, or cURL. Below is an example of predictions generated via the Vertex AI SDK for Python.
prediction_request = {
    "instances": [
        {
            "inputs": "I have a table called orders with columns order_id (INT), customer_id (INT), order_date (DATE), and total_amount (DECIMAL). I need to find the total revenue generated in the month of October 2023. How can I write a SQL query to achieve this?",
            "parameters": {
                "do_sample": True,
                "temperature": 0.7
            }
        },
        {
            "inputs": "# Context: You have a list of numbers called `my_numbers`.\n # Question: How do I calculate the sum of all the numbers in `my_numbers` using a built-in function?\n # Example `my_numbers` list:\n my_numbers = [1, 2, 3, 4, 5]",
            "parameters": {
                "do_sample": True,
                "temperature": 0.7
            }
        }
    ]
}

output = deployed_model.predict(instances=prediction_request["instances"])

for prediction in output.predictions:
    print("------- Prediction -------")
    print(prediction)
    print("--------------------------\n")

# ------- Prediction -------
# ```sql SELECT SUM(total_amount) ... ```
# --------------------------
# ------- Prediction -------
# ```python total_sum = sum() ... ```
# --------------------------
Conclusion
Are you struggling to serve multiple LoRA adapters on Vertex AI? This blog post demonstrated how to deploy a Gemma 2 model with multiple LoRA adapters on Vertex AI using custom handlers to enable SQL and code generation tasks, assuming that you have already fine-tuned Gemma 2 with LoRA on your own private data within your own controlled environment. The example shows how custom handlers can be leveraged for complex deployments of open models, even in combination with proprietary models. We explored the implementation of custom handlers and their deployment for online predictions on Vertex AI. These techniques, combined with the resources below, can be applied to a wide range of generative AI applications.
So what are you going to build next?
What’s next
Explore these resources to dive deeper into Vertex AI Model Garden and Hugging Face Deep Learning containers.
Documentation
GitHub examples
Thanks for reading
I hope you enjoyed the article. If so, follow me, clap 👏 for this article, or leave a comment. Also, let’s connect on LinkedIn or X to share feedback and any questions 🤗 about Vertex AI you would like answered.
Special thanks to Gus Martins, Alvaro Bartolome and Simon Pagezy for feedback and support!