
    How to Install Qwen-2.5 Model on a Local Server Using Hugging Face

     


    Problem

    We want to install and run the Qwen-2.5 model on our local server using Hugging Face, but are unsure how to properly set up the environment, manage dependencies, and execute a prompt.

    Solution

    We will go through the step-by-step process of:

    • Setting up the local server with required dependencies.
    • Installing Hugging Face Transformers & PyTorch for model inference.
    • Downloading and loading the Qwen-2.5 model for text generation.
    • Running the model locally and testing an AI-generated response.

     

    1. System Requirements

    Before installation, ensure that the local server has the following:

    • Operating System: Ubuntu 22.04 (or similar)
    • GPU Support (Optional but Recommended): NVIDIA GPU with CUDA support
    • RAM: At least 16GB (32GB+ recommended for large models)
    • Disk Space: At least 50GB free for model storage
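To sanity-check these numbers before installing anything, a minimal sketch using only the Python standard library (Ubuntu ships with python3; Linux paths assumed) could look like this:

import shutil

# Rough preflight check: free disk space on / and total RAM from /proc/meminfo
free_gb = shutil.disk_usage("/").free / 1e9
with open("/proc/meminfo") as f:
    ram_gb = int(f.readline().split()[1]) / 1e6  # MemTotal is the first line, reported in kB
print(f"Free disk: {free_gb:.0f} GB, total RAM: {ram_gb:.0f} GB")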

     

    2. Install System Dependencies

    Start by updating the system and installing required packages:

sudo apt update && sudo apt upgrade -y
sudo apt install -y python3 python3-pip git

     

    For NVIDIA GPU, install CUDA & cuDNN:

sudo apt install -y nvidia-driver-525
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

     

    Verify GPU installation:

    nvidia-smi 

    If you see GPU details, it’s installed correctly.

    3. Set Up a Virtual Environment (Recommended)

    To isolate dependencies, create and activate a virtual environment:

python3 -m venv qwen_env
source qwen_env/bin/activate
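To confirm the environment is active, you can check which interpreter Python now resolves to; the printed path should point inside qwen_env:

import sys

# Should print a path ending in qwen_env/bin/python if the virtual environment is active
print(sys.executable)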

    4. Installing Hugging Face Transformers & Dependencies

    Now, install Hugging Face Transformers, PyTorch, and other required libraries:

pip install torch transformers accelerate
pip install sentencepiece

    Confirm installation:

    python -c "import torch; print(torch.cuda.is_available())" 

    If it prints True, CUDA is enabled for GPU acceleration.
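To also see which GPU PyTorch detected and how much memory it offers, the following optional check (assuming a single CUDA device at index 0) can be run:

import torch

# Optional: print the detected GPU and its total memory
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1e9:.1f} GB VRAM")
else:
    print("No CUDA device detected; the model will run on CPU.")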

    5. Download the Qwen-2.5 Model from Hugging Face

Install the Hugging Face Hub client and log in if you need access to gated models:

pip install huggingface_hub
huggingface-cli login  # (Optional, required for some models)

Then, download and load the Qwen-2.5 model in Python:

     

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B"  # Change to Qwen2.5-14B if needed

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

print("Model loaded successfully!")
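Note that from_pretrained() downloads the weights into the local Hugging Face cache on first use, which can take a while (roughly 15 GB for the 7B variant). If you prefer to fetch the files ahead of time, a sketch using huggingface_hub could look like this:

from huggingface_hub import snapshot_download

# Pre-download the model files into the local Hugging Face cache
snapshot_download(repo_id="Qwen/Qwen2.5-7B")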

    6. Running Qwen-2.5 Locally & Executing a Prompt

    Now, let’s test text generation using Qwen-2.5:

def generate_text(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")  # Use "cpu" if no GPU
    output = model.generate(**inputs, max_length=100)
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Example usage
print(generate_text("What is the meaning of life?"))

     

    If the setup is correct, we should see an AI-generated response.
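Keep in mind that Qwen/Qwen2.5-7B is the base model, so it continues text rather than following instructions. If you load the instruction-tuned variant (Qwen/Qwen2.5-7B-Instruct) instead, a sketch using the tokenizer's chat template could look like this:

# Assumes model_name = "Qwen/Qwen2.5-7B-Instruct" was loaded above
messages = [{"role": "user", "content": "What is the meaning of life?"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))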

    7. Optimizing Performance (For Large Models)

    Enable Half-Precision (FP16) for Faster Inference

    Modify the model loading to use torch_dtype=torch.float16:

import torch

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

Use DeepSpeed or BitsAndBytes for Memory Efficiency

     

    Install additional tools for better memory usage:

    pip install bitsandbytes deepspeed

     

    Then, modify model loading:

from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config, device_map="auto")
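bitsandbytes also supports 4-bit loading, which reduces memory further than 8-bit at some cost in output quality. A sketch of that configuration, under the same setup as above:

from transformers import BitsAndBytesConfig
import torch

# 4-bit alternative: smaller footprint than 8-bit, with some quality trade-off
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config, device_map="auto")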

     

    8. Running Qwen-2.5 as an API (Optional)

    To access Qwen-2.5 via an API, use FastAPI:

    pip install fastapi uvicorn

     

    Create a simple API (app.py):

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()

model_name = "Qwen/Qwen2.5-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Request body: {"prompt": "..."}
class PromptRequest(BaseModel):
    prompt: str

@app.post("/generate")
async def generate(request: PromptRequest):
    inputs = tokenizer(request.prompt, return_tensors="pt").to("cuda")  # Use "cpu" if no GPU
    output = model.generate(**inputs, max_length=200)
    return {"response": tokenizer.decode(output[0], skip_special_tokens=True)}

# Run API
# uvicorn app:app --host 0.0.0.0 --port 8000

     

    This allows you to send prompts via HTTP requests:

    curl -X POST "http://localhost:8000/generate" -H "Content-Type: application/json" -d '{"prompt": "Tell me about quantum physics"}'
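The same request can also be sent from Python, for example with the requests library (assuming the server above is running on port 8000):

import requests

# Call the local Qwen-2.5 API from Python
resp = requests.post("http://localhost:8000/generate", json={"prompt": "Tell me about quantum physics"})
print(resp.json()["response"])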

     

    Conclusion

    Hosting Qwen-2.5 on a local server provides:

    • Full control over deployment and performance tuning
    • Lower long-term costs vs. cloud-hosted models
    • Better security since no data leaves your server

    For better performance, enable FP16, quantization, or DeepSpeed optimizations.
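As a rough rule of thumb, a 7B-parameter model needs about 2 bytes per parameter in FP16 (around 14-15 GB of GPU memory), roughly half that with 8-bit quantization, and roughly a quarter with 4-bit, which is why these options matter on smaller GPUs.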

     

Ready to transform your business with our technology solutions? Contact us today to leverage our AI/ML expertise.
