As open-source LLMs gain momentum, Magistral AI by Mistral is emerging as a top choice for developers and enterprises looking to build fast, cost-effective and privacy-centric AI systems.
In this guide, you’ll learn how to deploy Magistral AI on AWS EC2 with support from Hugging Face’s Transformers and Accelerate libraries, giving you the power to serve real-time generative AI workloads at scale.
Whether you're building an AI assistant, RAG system or internal LLM search, this guide will get you up and running in minutes.
Before you begin
You’ll need an AWS account with access to GPU-backed EC2 instances, an EC2 key pair for SSH, and a Hugging Face account (required if the model checkpoint you choose is gated).
Step-by-Step Guide to Deploying Magistral AI on EC2
1. Go to the AWS EC2 Console
2. Choose Ubuntu 22.04 LTS as the AMI (the commands in this guide assume Ubuntu; Amazon Linux 2 also works, but uses yum instead of apt and logs in as ec2-user)
3. Select a GPU instance such as g4dn.xlarge, p3.2xlarge, or g5.xlarge, making sure it has enough GPU memory for the checkpoint you plan to serve
4. Create a security group with port 22 (SSH) and optionally port 8000 or 5000 open for API access
5. Launch the instance and connect via SSH
ssh -i your-key.pem ubuntu@your-ec2-public-ip
sudo apt update && sudo apt upgrade -y
sudo apt install python3-pip python3-venv git -y
pip3 install --upgrade pip
If you’re using a GPU instance, install the NVIDIA drivers, reboot, and then confirm the GPU is visible:
sudo apt install -y nvidia-driver-525
sudo reboot
nvidia-smi
Next, set up a Python virtual environment and install PyTorch (CUDA 11.8 build) along with the Hugging Face libraries:
python3 -m venv venv
source venv/bin/activate
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu118
pip install transformers accelerate huggingface_hub
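Before downloading any model weights, it’s worth confirming that PyTorch can actually see the GPU. A minimal check, run from the activated virtual environment:

import torch

# Should print True and the name of the attached GPU (e.g. a T4 on g4dn.xlarge)
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))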
You can use any of Mistral’s open-weight LLMs published on the Hugging Face Hub (run huggingface-cli login first if the checkpoint is gated):
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = "mistralai/Magistral-7B" # Example model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda()
inputs = tokenizer("Write a short story about a robot learning emotions.", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
You can also use the pipeline helper from Transformers for simplified inference.
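A minimal sketch of the pipeline route, reusing the same placeholder model ID as above:

from transformers import pipeline
import torch

# Placeholder model ID: point this at the checkpoint you loaded earlier
generator = pipeline(
    "text-generation",
    model="mistralai/Magistral-7B",
    torch_dtype=torch.float16,
    device=0,  # first GPU
)
result = generator("Write a short story about a robot learning emotions.", max_new_tokens=100)
print(result[0]["generated_text"])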
To serve the model over HTTP, use FastAPI or Flask to expose an endpoint:
pip install fastapi uvicorn
A basic FastAPI app (save it as app.py so the uvicorn command below can find it):
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the tokenizer and model once at startup (same placeholder ID as above)
model_id = "mistralai/Magistral-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda()

app = FastAPI()

class Prompt(BaseModel):
    text: str

@app.post("/generate")
def generate_text(prompt: Prompt):
    inputs = tokenizer(prompt.text, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=150)
    return {"response": tokenizer.decode(outputs[0], skip_special_tokens=True)}
Then run the server
uvicorn app:app --host 0.0.0.0 --port 8000
Access the interactive API docs at:
http://your-ec2-ip:8000/docs
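To test the endpoint from your own machine (this assumes port 8000 is open in your security group; install requests with pip install requests first), a quick client call might look like this:

import requests

# Replace your-ec2-ip with the instance’s public IP or DNS name
resp = requests.post(
    "http://your-ec2-ip:8000/generate",
    json={"text": "Summarize the benefits of self-hosting an LLM."},
    timeout=120,
)
print(resp.json()["response"])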
If managing EC2 feels heavy, try Hugging Face Inference Endpoints for managed hosting: just upload your model or deploy one of Mistral’s pre-trained checkpoints directly from the Hub.
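Once an endpoint is running, calling it from Python takes only a few lines with huggingface_hub; the endpoint URL and token below are placeholders:

from huggingface_hub import InferenceClient

# Placeholder URL and token: copy yours from the Inference Endpoints console
client = InferenceClient(
    model="https://your-endpoint-name.endpoints.huggingface.cloud",
    token="hf_xxx",
)
print(client.text_generation("Write a haiku about GPUs.", max_new_tokens=50))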
Hosting Magistral AI on AWS EC2 with Hugging Face gives you full control, GPU-optimized performance and cost-effective deployment of your own private LLM infrastructure.
From chatbots to content generation and enterprise search, this setup can scale with your AI ambitions.
Deploy Magistral AI today and unleash the full power of open-source LLMs securely, affordably and at scale!
Contact us today to develop custom applications using Magistral AI, from smart assistants to enterprise-grade LLM workflows tailored to your unique use case.