
Why Choose the Qwen 2.5 Model for Your Business?
Cost Efficiency (Open Source)
Lower Long-Term Costs
Customised Data Control
Pre-Trained Model
Get Your Qwen 2.5 AI Model Running in a Day
Free Installation Guide - Step by Step Instructions Inside!
We want to install and run the Qwen-2.5 model on our local server using Hugging Face, but are unsure how to properly set up the environment, manage dependencies, and execute a prompt.
We will go through the step-by-step process of setting up the environment, installing dependencies, downloading the model from Hugging Face, running a test prompt, optimising inference, and exposing the model through a simple API.
Before installation, ensure that the local server has the following: a Debian/Ubuntu-based Linux OS (the commands below use apt), Python 3 with pip, Git, an NVIDIA GPU with a recent driver (recommended for practical inference speed), and enough disk space and RAM for the Qwen2.5-7B or 14B weights.
Start by updating the system and installing required packages:
sudo apt update && sudo apt upgrade -y
sudo apt install -y python3 python3-pip git
For an NVIDIA GPU, install the driver and a CUDA-enabled PyTorch build (the wheel bundles the CUDA and cuDNN libraries):
sudo apt install -y nvidia-driver-525
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Verify GPU installation:
nvidia-smi
If you see GPU details, it’s installed correctly.
To isolate dependencies, create and activate a virtual environment:
python3 -m venv qwen_env
source qwen_env/bin/activate
Now, install Hugging Face Transformers, PyTorch, and other required libraries:
pip install torch transformers accelerate
pip install sentencepiece
Confirm installation:
python -c "import torch; print(torch.cuda.is_available())"
If it prints True, CUDA is enabled for GPU acceleration.
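As an optional extra check, the short sketch below uses standard torch.cuda calls to print the detected GPU name and its total memory, which helps confirm there is enough VRAM for the 7B weights:

import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print("GPU:", torch.cuda.get_device_name(0))
    print(f"VRAM: {props.total_memory / 1024**3:.1f} GB")
else:
    print("No CUDA device detected; inference will run on the CPU.")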
Use the Hugging Face CLI to pull the model:
pip install huggingface_hub
huggingface-cli login  # (Optional, required for some models)
Then, download the Qwen-2.5 model:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B"  # Change to Qwen2.5-14B if needed

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

print("Model loaded successfully!")
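Optionally, the weights can be pre-fetched into the local Hugging Face cache so the first from_pretrained call does not have to download them; this is a small sketch using huggingface_hub's snapshot_download:

from huggingface_hub import snapshot_download

# Download all model files into the local Hugging Face cache
snapshot_download(repo_id="Qwen/Qwen2.5-7B")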
Now, let’s test text generation using Qwen-2.5:
def generate_text(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")  # Use "cpu" if no GPU
    output = model.generate(**inputs, max_length=100)
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Example usage
print(generate_text("What is the meaning of life?"))
If the setup is correct, we should see an AI-generated response.
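The generate() call above uses default decoding settings. If the output feels repetitive, standard Transformers sampling arguments can be passed to generate(); the sketch below reuses the tokenizer and model loaded earlier, with an illustrative generate_text_sampled helper and untuned example values:

def generate_text_sampled(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")  # Use "cpu" if no GPU
    output = model.generate(
        **inputs,
        max_new_tokens=200,  # Length of the generated continuation
        do_sample=True,      # Sample instead of greedy decoding
        temperature=0.7,     # Lower values give more deterministic output
        top_p=0.9,           # Nucleus sampling cutoff
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Example usage
print(generate_text_sampled("Explain quantum entanglement in simple terms."))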
Enable Half-Precision (FP16) for Faster Inference
Modify the model loading to use torch_dtype=torch.float16:
import torch

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
Use DeepSpeed or BitsAndBytes for Memory Efficiency
Install additional tools for better memory usage:
pip install bitsandbytes deepspeed
Then, modify model loading:
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config, device_map="auto")
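For tighter memory budgets, bitsandbytes also supports 4-bit loading. The sketch below is an optional variant of the same pattern (reusing model_name and AutoModelForCausalLM from earlier), using the commonly recommended NF4 quantization with FP16 compute; the right trade-off depends on your hardware and accuracy needs:

import torch
from transformers import BitsAndBytesConfig

bnb_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,                     # Quantize weights to 4 bits
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # Do the compute in FP16
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config_4bit, device_map="auto"
)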
To access Qwen-2.5 via an API, use FastAPI:
pip install fastapi uvicorn
Create a simple API (app.py):
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()

model_name = "Qwen/Qwen2.5-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Request body schema so the prompt is read from the JSON payload
class GenerateRequest(BaseModel):
    prompt: str

@app.post("/generate")
async def generate(request: GenerateRequest):
    inputs = tokenizer(request.prompt, return_tensors="pt").to("cuda")  # Use "cpu" if no GPU
    output = model.generate(**inputs, max_length=200)
    return {"response": tokenizer.decode(output[0], skip_special_tokens=True)}

# Run API
# uvicorn app:app --host 0.0.0.0 --port 8000
This allows you to send prompts via HTTP requests:
curl -X POST "http://localhost:8000/generate" -H "Content-Type: application/json" -d '{"prompt": "Tell me about quantum physics"}'
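The same endpoint can also be called from Python. Here is a minimal sketch using the requests library, assuming the API is running locally on port 8000 as above:

import requests

# Send a prompt to the local Qwen-2.5 API and print the generated text
resp = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "Tell me about quantum physics"},
)
resp.raise_for_status()
print(resp.json()["response"])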
Hosting Qwen-2.5 on a local server provides cost efficiency, lower long-term costs, and full control over your data.
For better performance, enable FP16, quantization, or DeepSpeed optimizations.
Ready to transform your business with our technology solutions? Contact us today to leverage our AI/ML expertise.