Quickstart - Deploy Hugging Face Models with SageMaker JumpStart

Why use SageMaker JumpStart for Hugging Face models?

Amazon SageMaker JumpStart lets you deploy the most popular open Hugging Face models with one click, inside your own AWS account. JumpStart offers a curated selection of model checkpoints for tasks including text generation, embeddings, vision, and audio. Most models are deployed using the official Hugging Face Deep Learning Containers with a sensible default instance type, so you can move from idea to production in minutes.

In this quickstart guide, we will deploy Qwen/Qwen2.5-14B-Instruct.

1. Prerequisites

Requirement	Details
AWS account with SageMaker enabled	An AWS account that will contain all your AWS resources.
An IAM role to access SageMaker AI	Learn more about how IAM works with SageMaker AI in this guide.
SageMaker Studio domain and user profile	We recommend using SageMaker Studio for straightforward deployment and inference. Follow this guide.
Service quotas	Most LLMs need GPU instances (e.g. ml.g5). Verify you have quota for `ml.g5.24xlarge` or request it.
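
The quota check in the last row can also be scripted. The following is a minimal sketch, not an official recipe: it assumes the Service Quotas API lists the relevant quota under a name of the form `<instance type> for endpoint usage`, and the helper names `find_endpoint_quota` and `list_sagemaker_quotas` are hypothetical. `boto3` is imported lazily so the lookup helper can be tried on sample data without AWS credentials.

```python
def find_endpoint_quota(quotas, instance_type):
    """Return the endpoint-usage quota value for `instance_type`, or None if absent."""
    # Assumption: quota names follow the "<instance type> for endpoint usage" pattern.
    wanted = f"{instance_type} for endpoint usage"
    for quota in quotas:
        if quota["QuotaName"] == wanted:
            return quota["Value"]
    return None


def list_sagemaker_quotas(region_name=None):
    """Fetch all SageMaker quotas via the Service Quotas API (needs AWS credentials)."""
    import boto3  # imported here so the helper above works without boto3 installed

    client = boto3.client("service-quotas", region_name=region_name)
    paginator = client.get_paginator("list_service_quotas")
    quotas = []
    for page in paginator.paginate(ServiceCode="sagemaker"):
        quotas.extend(page["Quotas"])
    return quotas
```

With valid credentials, `find_endpoint_quota(list_sagemaker_quotas(), "ml.g5.24xlarge")` should return the applied quota value; a result of `0` or `None` suggests a quota increase request is needed before deploying.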

These docs and examples use the SageMaker Python SDK v3, which introduces a new framework-agnostic API built around ModelBuilder (inference) and ModelTrainer (training), replacing the v2 HuggingFaceModel and HuggingFace classes. Install it with pip install "sagemaker>=3.0.0".
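
Because the v2 and v3 APIs differ, it can help to confirm which major version is installed before running the examples. The sketch below uses only the standard library; the function names `major_version` and `sagemaker_sdk_is_v3` are hypothetical helpers, not part of the SDK.

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string):
    """Return the leading integer of a version string such as '3.0.0'."""
    return int(version_string.split(".")[0])


def sagemaker_sdk_is_v3():
    """Return True if an installed `sagemaker` package reports major version 3 or later."""
    try:
        return major_version(version("sagemaker")) >= 3
    except PackageNotFoundError:
        # The package is not installed in this environment.
        return False
```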

2. Endpoint deployment

To deploy a Hugging Face model to SageMaker from the JumpStart catalog:

Open SageMaker → JumpStart.
Filter by “Hugging Face” or search for your model (e.g. Qwen2.5-14B).
Click Deploy → (optional) adjust instance size / count → Deploy.
Wait until the endpoint's status under Endpoints shows In service.
Copy the Endpoint name (or ARN) for later use.

Alternatively, you can start from the Hugging Face Model Hub:

Open the model page → click Deploy → SageMaker → JumpStart tab (shown only if the model is available in JumpStart).
Copy the code snippet and use it from a SageMaker Notebook instance.

# SageMaker JumpStart models can be deployed with ModelBuilder by passing the
# JumpStart model ID as `model`. ModelBuilder resolves the JumpStart artifacts and
# container, and runs the deployment in network isolation.
# Set `instance_type` to one the model supports (see the model card): ModelBuilder's
# auto-detection otherwise picks a CPU instance, which LLMs don't support.
import json
from sagemaker.serve import ModelBuilder

# use the `role_arn` parameter to use a different role
model_builder = ModelBuilder(
    model="huggingface-llm-qwen2-5-14b-instruct",
    instance_type="ml.g5.24xlarge",
)
model_builder.build()

predictor = model_builder.deploy(accept_eula=True)

payload = {
    "inputs": "what is machine learning?",
    "parameters": {"max_new_tokens": 256},
}
response = predictor.invoke(body=json.dumps(payload), content_type="application/json")
print(json.loads(response.body.read()))

The endpoint creation can take several minutes, depending on the size of the model.
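
While the endpoint is being created, its status can be polled instead of watching the console. This is a minimal sketch assuming a boto3 SageMaker client (`boto3.client("sagemaker")`) is passed in as `sm_client`; the function name and the default polling interval are illustrative choices, not values from the SDK.

```python
import time


def wait_until_in_service(sm_client, endpoint_name, poll_seconds=30, max_attempts=60):
    """Poll DescribeEndpoint until the endpoint is InService; raise on failure or timeout."""
    for _ in range(max_attempts):
        status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        # Still in a transitional state such as Creating or Updating; check again later.
        time.sleep(poll_seconds)
    raise TimeoutError(f"Endpoint {endpoint_name} not InService after {max_attempts} checks")
```

boto3 also ships a built-in waiter for this, `sm_client.get_waiter("endpoint_in_service")`; the explicit loop is shown here only to make the status transitions visible.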

3. Test interactively

If you deployed through the console, copy the endpoint name and reuse it in your code.

import json
from sagemaker.core.resources import Endpoint

endpoint_name = "MY ENDPOINT NAME"
predictor = Endpoint.get(endpoint_name=endpoint_name)
payload = {
    "messages": [
        {
            "role": "system",
            "content": "You are a passionate data scientist."
        },
        {
            "role": "user",
            "content": "what is machine learning?"
        }
    ],
    "max_tokens": 2048,
    "temperature": 0.7,
    "top_p": 0.9,
    "stream": False
}

response = predictor.invoke(body=json.dumps(payload), content_type="application/json")
print(json.loads(response.body.read()))

The endpoint accepts requests in the OpenAI Chat Completions format, as the `messages` payload above shows.
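
Building the request and reading the reply can be wrapped in two small helpers. This is a sketch under the assumption that the response body follows the OpenAI Chat Completions schema (`choices[0].message.content`); the helper names `build_chat_payload` and `extract_reply` are hypothetical, and the actual response shape should be checked against your endpoint's output.

```python
import json


def build_chat_payload(user_message, system_message=None, max_tokens=2048, temperature=0.7):
    """Assemble a Chat Completions-style request body as a dict."""
    messages = []
    if system_message:
        messages.append({"role": "system", "content": system_message})
    messages.append({"role": "user", "content": user_message})
    return {
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stream": False,
    }


def extract_reply(response_body):
    """Return the assistant text, assuming a Chat Completions-style response."""
    if isinstance(response_body, (str, bytes)):
        response_body = json.loads(response_body)
    return response_body["choices"][0]["message"]["content"]
```

The dict returned by `build_chat_payload` can be serialized with `json.dumps` and passed as the request body, and `extract_reply` can be applied to the bytes read from the response.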

4. Clean-up

To avoid unnecessary costs, delete the SageMaker endpoint when you’re done, either from the Deployments → Endpoints page in the console or with the following code snippet:

predictor.delete()
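
An endpoint usually has an endpoint configuration and one or more model resources behind it. If you want to remove those as well, the following sketch uses the boto3 SageMaker client directly. It assumes `sm_client` is `boto3.client("sagemaker")` and that the endpoint's production variants reference models by `ModelName`; variants without that key (for example, endpoints built on inference components) are skipped. The function name `delete_endpoint_stack` is hypothetical.

```python
def delete_endpoint_stack(sm_client, endpoint_name):
    """Delete an endpoint plus the endpoint config and models it references."""
    # Look up the dependent resources before deleting anything.
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [
        variant["ModelName"]
        for variant in config["ProductionVariants"]
        if "ModelName" in variant
    ]

    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for name in model_names:
        sm_client.delete_model(ModelName=name)

    return {"endpoint": endpoint_name, "endpoint_config": config_name, "models": model_names}
```

The returned dict lists what was deleted, which is handy for logging. Deletion is irreversible, so double-check the endpoint name before running it.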