Running LLMs Locally: Using Ollama, LM Studio, and HuggingFace on a Budget
How to serve and fine-tune models like Mistral or LLaMA 3 on your own hardware.
With the rise of powerful open-weight models like Mistral, LLaMA 3, and Gemma, running large language models (LLMs) locally has become more accessible than ever—even on consumer-grade hardware.
This guide covers:
- ✅ Best tools for local LLM inference (Ollama, LM Studio, HuggingFace)
- ✅ Hardware requirements (CPU vs. GPU, RAM, quantization)
- ✅ Running models efficiently (GGUF, AWQ, and GPTQ formats)
- ✅ Fine-tuning on a budget (LoRA, QLoRA, and dataset preparation)
- ✅ Performance benchmarks (speed vs. quality trade-offs)
1. Why Run LLMs Locally?
- Privacy – No data leaves your machine.
- Cost savings – Avoid API fees (OpenAI, Anthropic, etc.).
- Customization – Fine-tune models for specific tasks.
- Offline access – Use AI without internet.
Best for:
- 🔹 Developers experimenting with AI
- 🔹 Researchers needing full model control
- 🔹 Businesses handling sensitive data
2. Step-by-Step: Running LLMs Locally
Option 1: Ollama (Simplest Setup)
Ollama provides pre-built models with one-command installation.
Installation:
```bash
# Linux/Mac (Windows requires WSL2)
curl -fsSL https://ollama.com/install.sh | sh
```
Running Models:
```bash
# Download a model (Mistral 7B)
ollama pull mistral

# Start interactive chat
ollama run mistral

# Or pass a prompt directly for a one-off response
ollama run mistral "Explain quantum computing in simple terms"
```
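Beyond the CLI, Ollama also serves a local REST API (by default on http://localhost:11434), so you can call a pulled model from your own code. A minimal sketch using the requests library; treat the details (endpoint, response fields) as version-dependent:

```python
import requests

# Ollama's local API listens on port 11434 by default
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral",
        "prompt": "Explain quantum computing in simple terms",
        "stream": False,  # return the full answer as a single JSON object
    },
)
print(response.json()["response"])
```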
Tip: Other available models include llama3, gemma, and phi3.
Option 2: LM Studio (GUI for Windows/Mac)
Perfect for users who prefer a graphical interface.
Installation Steps:
- Download from lmstudio.ai
- Install and launch the application
- Search for models in the “Discover” tab (e.g., “TheBloke/Mistral-7B-GGUF”)
- Download the Q4_K_M version (good balance of quality/speed)
- Load the model and start chatting
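LM Studio can also expose the loaded model through a local, OpenAI-compatible server (by default on port 1234). A hedged sketch using the openai Python client; the port and the model identifier depend on your setup and the model you have loaded:

```python
from openai import OpenAI

# Point the OpenAI client at LM Studio's local server (default port assumed)
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")  # any non-empty key works locally

completion = client.chat.completions.create(
    model="local-model",  # placeholder; LM Studio answers with whichever model is currently loaded
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms"}],
)
print(completion.choices[0].message.content)
```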
Option 3: HuggingFace Transformers (Most Flexible)
For Python developers who want full control.
Basic Setup:
```bash
# Create virtual environment
python -m venv llm-env
source llm-env/bin/activate  # Windows: llm-env\Scripts\activate

# Install dependencies
pip install torch transformers accelerate
```
Running Inference:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

inputs = tokenizer("Explain quantum computing", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0]))
```
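Instruction-tuned checkpoints like Mistral-7B-Instruct generally respond better when the prompt is wrapped in their chat template rather than passed as raw text. A small sketch using the tokenizer's built-in template (reusing the model and tokenizer loaded above; requires a reasonably recent transformers version):

```python
# tokenizer and model come from the snippet above
messages = [{"role": "user", "content": "Explain quantum computing"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```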
3. Hardware Requirements
| Model Size | Minimum RAM (CPU) | Recommended GPU |
|---|---|---|
| 7B (4-bit) | 8 GB RAM | RTX 3060 (12 GB) |
| 13B (4-bit) | 16 GB RAM | RTX 3090 (24 GB) |
| 70B (4-bit) | 32 GB+ RAM | A100 (40 GB) |
Quantization Tip: Use GGUF (CPU) or GPTQ (GPU) formats to reduce memory usage:
```bash
# For GGUF models (CPU optimized)
wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q4_K_M.gguf
```
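GGUF files are run through llama.cpp or its Python bindings rather than transformers. A hedged sketch using the llama-cpp-python package (pip install llama-cpp-python); the filename matches the wget command above:

```python
from llama_cpp import Llama

# Load the quantized GGUF file downloaded above; n_ctx sets the context window
llm = Llama(model_path="mistral-7b-v0.1.Q4_K_M.gguf", n_ctx=2048)

# Simple completion call; max_tokens limits the length of the reply
output = llm("Explain quantum computing in simple terms", max_tokens=200)
print(output["choices"][0]["text"])
```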
4. Fine-Tuning Guide (QLoRA)
Adapt models to your specific needs with limited hardware.
Step 1: Install Requirements
```bash
pip install transformers accelerate peft bitsandbytes datasets trl
```
Step 2: Prepare Dataset
Example format (JSON):
[ { "instruction": "Explain quantum computing", "input": "", "output": "Quantum computing uses qubits..." }, { "instruction": "Write a poem about AI", "input": "", "output": "In silicon minds, dreams take flight..." } ]
Step 3: Fine-Tuning Script
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig
from trl import SFTTrainer

model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)

# LoRA configuration
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=1,
    fp16=True
)

# `dataset` is the formatted dataset built in the Step 2 snippet above
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    args=training_args,
    tokenizer=tokenizer
)
trainer.train()
```
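After training, you typically save only the small LoRA adapter and, if you want a standalone model for inference, merge it back into the base weights. A hedged sketch with peft; the output paths are placeholders:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Save only the LoRA adapter weights (a small fraction of the full model size)
trainer.save_model("./mistral-7b-lora-adapter")

# Later: reload the base model and merge the adapter into it for plain inference
base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", device_map="auto")
merged = PeftModel.from_pretrained(base_model, "./mistral-7b-lora-adapter").merge_and_unload()
merged.save_pretrained("./mistral-7b-finetuned")
```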
5. Performance Benchmarks
| Model (7B) | Speed (tokens/sec) | Memory Used |
|---|---|---|
| Mistral (FP16) | 25-40 (A100) | 14 GB VRAM |
| LLaMA 3 (4-bit) | 15-25 (RTX 3060) | 6 GB VRAM |
| Phi-3 (GGUF) | 10-20 (CPU) | 8 GB RAM |
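Throughput varies widely with hardware, quantization, and context length, so it is worth measuring on your own machine. A minimal sketch that times transformers generation, reusing the model and tokenizer loaded in the inference example:

```python
import time

# Reuse `model` and `tokenizer` from the inference example above
prompt = "Explain quantum computing in simple terms"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.time()
outputs = model.generate(**inputs, max_new_tokens=200)
elapsed = time.time() - start

# Count only newly generated tokens, not the prompt
generated_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{generated_tokens / elapsed:.1f} tokens/sec")
```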
6. Where to Download Models
- HuggingFace Model Hub (Mistral, LLaMA 3, Gemma)
- TheBloke’s Quantized Models (GGUF, GPTQ)
- Ollama Library (Pre-packaged models)
Conclusion
Running LLMs locally is now affordable and practical, thanks to tools like Ollama, LM Studio, and HuggingFace. By using quantization and LoRA, even mid-range PCs can handle 7B-13B models efficiently.
Next Steps:
- Try Ollama for the easiest setup.
- Experiment with QLoRA for fine-tuning.
- Join communities (r/LocalLLaMA, HuggingFace Discord).