Software Development

Running LLMs Locally: Using Ollama, LM Studio, and HuggingFace on a Budget

How to serve and fine-tune models like Mistral or LLaMA 3 on your own hardware.

With the rise of powerful open-weight models like Mistral, LLaMA 3, and Gemma, running large language models (LLMs) locally has become more accessible than ever—even on consumer-grade hardware.

This guide covers:

  • Best tools for local LLM inference (Ollama, LM Studio, HuggingFace)
  • Hardware requirements (CPU vs. GPU, RAM, quantization)
  • Running models efficiently (GGUF, AWQ, and GPTQ formats)
  • Fine-tuning on a budget (LoRA, QLoRA, and dataset preparation)
  • Performance benchmarks (speed vs. quality trade-offs)

1. Why Run LLMs Locally?

  • Privacy – No data leaves your machine.
  • Cost savings – Avoid API fees (OpenAI, Anthropic, etc.).
  • Customization – Fine-tune models for specific tasks.
  • Offline access – Use AI without internet.

Best for:

  • 🔹 Developers experimenting with AI
  • 🔹 Researchers needing full model control
  • 🔹 Businesses handling sensitive data

2. Step-by-Step: Running LLMs Locally

Option 1: Ollama (Simplest Setup)

Ollama provides pre-built models with one-command installation.

Installation:

# Linux/macOS (on Windows, use WSL2 or the native installer from ollama.com)
curl -fsSL https://ollama.com/install.sh | sh

Running Models:

# Download a model (Mistral 7B)
ollama pull mistral

# Start interactive chat
ollama run mistral

# You can also pass a prompt directly on the command line
ollama run mistral "Explain quantum computing in simple terms"

Tip: Other available models include llama3, gemma, and phi3.
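
Ollama also runs a local REST API on port 11434, which makes it easy to call the model from your own code. A minimal Python sketch, assuming the requests package is installed and the Ollama service is running:

import requests

# Ollama's local API listens on port 11434 by default
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral",
        "prompt": "Explain quantum computing in simple terms",
        "stream": False,  # return one JSON object instead of a token stream
    },
    timeout=300,
)
print(response.json()["response"])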

Option 2: LM Studio (GUI for Windows/Mac)

Perfect for users who prefer a graphical interface.

Installation Steps:

  1. Download from lmstudio.ai
  2. Install and launch the application
  3. Search for models in the “Discover” tab (e.g., “TheBloke/Mistral-7B-GGUF”)
  4. Download the Q4_K_M version (good balance of quality/speed)
  5. Load the model and start chatting
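
LM Studio can additionally expose the loaded model through a local OpenAI-compatible server (enable it in the app; the default port is typically 1234). A minimal sketch with the openai Python client — the port and model identifier below are assumptions, so match them to what LM Studio shows:

from openai import OpenAI

# Point the OpenAI client at LM Studio's local server (no real API key needed)
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

completion = client.chat.completions.create(
    model="local-model",  # LM Studio displays the exact identifier to use here
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms"}],
)
print(completion.choices[0].message.content)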

Option 3: HuggingFace Transformers (Most Flexible)

For Python developers who want full control.

Basic Setup:

# Create virtual environment
python -m venv llm-env
source llm-env/bin/activate  # Windows: llm-env\Scripts\activate

# Install dependencies
pip install torch transformers accelerate
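
Before loading a model, it is worth confirming that PyTorch actually sees your GPU; device_map="auto" silently falls back to CPU otherwise. A quick check:

import torch

# If this prints False, inference will run on CPU (much slower for 7B models)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))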

Running Inference:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

inputs = tokenizer("Explain quantum computing", return_tensors="pt").to(model.device)  # follows the model onto GPU or CPU
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
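
Since Mistral-7B-Instruct is trained on a chat format, wrapping the prompt with the tokenizer's chat template usually improves answers. A short sketch reusing the model and tokenizer loaded above (requires a reasonably recent transformers release):

# Build a properly formatted instruction prompt from a chat message list
messages = [{"role": "user", "content": "Explain quantum computing"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))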

3. Hardware Requirements

Model Size   | Minimum RAM (CPU) | Recommended GPU
7B (4-bit)   | 8GB               | RTX 3060 (12GB)
13B (4-bit)  | 16GB              | RTX 3090 (24GB)
70B (4-bit)  | 32GB+             | A100 (40GB)

Quantization Tip: Use GGUF (CPU) or GPTQ (GPU) formats to reduce memory usage:

# For GGUF models (CPU optimized)
wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q4_K_M.gguf
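
To run a GGUF file like this from Python, the llama-cpp-python package (pip install llama-cpp-python) is a common choice. A minimal sketch, assuming the file downloaded above sits in the current directory:

from llama_cpp import Llama

# Load the quantized GGUF model on CPU; raise n_threads to match your core count
llm = Llama(
    model_path="mistral-7b-v0.1.Q4_K_M.gguf",
    n_ctx=2048,      # context window
    n_threads=8,     # CPU threads used for inference
)

output = llm("Q: Explain quantum computing in one paragraph.\nA:", max_tokens=200)
print(output["choices"][0]["text"])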

4. Fine-Tuning Guide (QLoRA)

Adapt models to your specific needs with limited hardware.

Step 1: Install Requirements

pip install transformers accelerate peft bitsandbytes datasets

Step 2: Prepare Dataset

Example format (JSON):

[
    {
        "instruction": "Explain quantum computing",
        "input": "",
        "output": "Quantum computing uses qubits..."
    },
    {
        "instruction": "Write a poem about AI",
        "input": "",
        "output": "In silicon minds, dreams take flight..."
    }
]
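
The trainer in Step 3 works on a dataset object with a single text column, so the JSON records need to be loaded and flattened into training strings first. A minimal sketch with the datasets library — train.json and the prompt template below are just example choices:

from datasets import load_dataset

# Load the JSON records into a HuggingFace Dataset
dataset = load_dataset("json", data_files="train.json", split="train")

def to_text(example):
    # Collapse instruction/input/output into one training string
    prompt = example["instruction"]
    if example["input"]:
        prompt += "\n" + example["input"]
    return {"text": f"### Instruction:\n{prompt}\n\n### Response:\n{example['output']}"}

dataset = dataset.map(to_text)
print(dataset[0]["text"])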

Step 3: Fine-Tuning Script

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig
from trl import SFTTrainer

model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Mistral ships without a pad token

# Load the base model in 4-bit (the "Q" in QLoRA) via bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA configuration
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=1,
    fp16=True
)

# `dataset` is the formatted dataset from Step 2 (a single "text" column);
# older trl releases also need dataset_text_field="text" passed to SFTTrainer
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    args=training_args,
    tokenizer=tokenizer  # renamed to processing_class in recent trl versions
)

trainer.train()
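
After training, only the small LoRA adapter needs saving; for deployment it can later be merged into a full-precision copy of the base model. A sketch of both steps (the output paths are just examples):

# Save just the LoRA adapter weights (typically a few hundred MB)
trainer.model.save_pretrained("./mistral-lora-adapter")
tokenizer.save_pretrained("./mistral-lora-adapter")

# Later, merge the adapter into a full-precision base model for inference/serving
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.float16, device_map="auto"
)
merged = PeftModel.from_pretrained(base, "./mistral-lora-adapter").merge_and_unload()
merged.save_pretrained("./mistral-7b-finetuned")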

5. Performance Benchmarks

Model (~7B)       | Speed (tokens/sec) | Memory used
Mistral (FP16)    | 25-40 (A100)       | 14GB VRAM
LLaMA 3 (4-bit)   | 15-25 (RTX 3060)   | 6GB VRAM
Phi-3 (GGUF)      | 10-20 (CPU)        | 8GB RAM
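
Your numbers will vary with hardware, context length, and quantization; here is a rough way to measure tokens per second yourself using the transformers setup from section 2 (a sketch, not a rigorous benchmark):

import time
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

inputs = tokenizer("Explain quantum computing", return_tensors="pt").to(model.device)

start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=200)
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]  # count only generated tokens
print(f"{new_tokens / elapsed:.1f} tokens/sec")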

6. Where to Download Models

  • HuggingFace Hub (huggingface.co/models) – the largest catalog, including TheBloke's quantized GGUF/GPTQ/AWQ builds
  • Ollama library (ollama.com/library) – models packaged for a single ollama pull
  • LM Studio's "Discover" tab – searches HuggingFace directly from the GUI

Conclusion

Running LLMs locally is now affordable and practical, thanks to tools like Ollama, LM Studio, and HuggingFace. By using quantization and LoRA, even mid-range PCs can handle 7B-13B models efficiently.

Next Steps:

  1. Try Ollama for the easiest setup.
  2. Experiment with QLoRA for fine-tuning.
  3. Join communities (r/LocalLLaMA, HuggingFace Discord).

Eleftheria Drosopoulou

Eleftheria is an experienced Business Analyst with a robust background in the computer software industry. Proficient in computer software training, digital marketing, HTML scripting, and Microsoft Office, she brings a wealth of technical skills to the table. She also loves writing articles on various tech subjects, with a talent for translating complex concepts into accessible content.