Running LLMs Locally: Using Ollama, LM Studio, and HuggingFace on a Budget
How to serve and fine-tune models like Mistral or LLaMA 3 on your own hardware.
With the rise of powerful open-weight models like Mistral, LLaMA 3, and Gemma, running large language models (LLMs) locally has become more accessible than ever—even on consumer-grade hardware.
This guide covers:
- ✅ Best tools for local LLM inference (Ollama, LM Studio, HuggingFace)
- ✅ Hardware requirements (CPU vs. GPU, RAM, quantization)
- ✅ Running models efficiently (GGUF, AWQ, and GPTQ formats)
- ✅ Fine-tuning on a budget (LoRA, QLoRA, and dataset preparation)
- ✅ Performance benchmarks (speed vs. quality trade-offs)
1. Why Run LLMs Locally?
- Privacy – No data leaves your machine.
- Cost savings – Avoid API fees (OpenAI, Anthropic, etc.).
- Customization – Fine-tune models for specific tasks.
- Offline access – Use AI without internet.
Best for:
- 🔹 Developers experimenting with AI
- 🔹 Researchers needing full model control
- 🔹 Businesses handling sensitive data
2. Step-by-Step: Running LLMs Locally
Option 1: Ollama (Simplest Setup)
Ollama provides pre-built models with one-command installation.
Installation:
```bash
# Linux/Mac (Windows requires WSL2)
curl -fsSL https://ollama.com/install.sh | sh
```
Running Models:
```bash
# Download a model (Mistral 7B)
ollama pull mistral

# Start interactive chat
ollama run mistral

# Or pass a prompt directly for a one-off response
ollama run mistral "Explain quantum computing in simple terms"
```
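Beyond the CLI, Ollama also serves a local REST API (by default on http://localhost:11434), so you can call a pulled model from your own code. A minimal sketch using the requests library; treat the details (endpoint, response fields) as version-dependent:

```python
import requests

# Ollama's local API listens on port 11434 by default
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral",
        "prompt": "Explain quantum computing in simple terms",
        "stream": False,  # return the full answer as a single JSON object
    },
)
print(response.json()["response"])
```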
Tip: Other available models include llama3, gemma, and phi3.
Option 2: LM Studio (GUI for Windows/Mac)
Perfect for users who prefer a graphical interface.
Installation Steps:
- Download from lmstudio.ai
- Install and launch the application
- Search for models in the “Discover” tab (e.g., “TheBloke/Mistral-7B-GGUF”)
- Download the Q4_K_M version (good balance of quality/speed)
- Load the model and start chatting
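LM Studio can also expose the loaded model through a local, OpenAI-compatible server (by default on port 1234). A hedged sketch using the openai Python client; the port and the model identifier depend on your setup and the model you have loaded:

```python
from openai import OpenAI

# Point the OpenAI client at LM Studio's local server (default port assumed)
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")  # any non-empty key works locally

completion = client.chat.completions.create(
    model="local-model",  # placeholder; LM Studio answers with whichever model is currently loaded
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms"}],
)
print(completion.choices[0].message.content)
```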
Option 3: HuggingFace Transformers (Most Flexible)
For Python developers who want full control.
Basic Setup:
```bash
# Create virtual environment
python -m venv llm-env
source llm-env/bin/activate  # Windows: llm-env\Scripts\activate

# Install dependencies
pip install torch transformers accelerate
```
Running Inference:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

inputs = tokenizer("Explain quantum computing", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0]))
```
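Instruction-tuned checkpoints like Mistral-7B-Instruct generally respond better when the prompt is wrapped in their chat template rather than passed as raw text. A small sketch using the tokenizer's built-in template (reusing the model and tokenizer loaded above; requires a reasonably recent transformers version):

```python
# tokenizer and model come from the snippet above
messages = [{"role": "user", "content": "Explain quantum computing"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```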
3. Hardware Requirements
| Model Size | Minimum RAM (CPU) | Recommended GPU |
|---|---|---|
| 7B (4-bit) | 8 GB RAM | RTX 3060 (12 GB) |
| 13B (4-bit) | 16 GB RAM | RTX 3090 (24 GB) |
| 70B (4-bit) | 32 GB+ RAM | A100 (40 GB) |
Quantization Tip: Use GGUF (CPU) or GPTQ (GPU) formats to reduce memory usage:
```bash
# For GGUF models (CPU optimized)
wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q4_K_M.gguf
```
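GGUF files are run through llama.cpp or its Python bindings rather than transformers. A hedged sketch using the llama-cpp-python package (pip install llama-cpp-python); the filename matches the wget command above:

```python
from llama_cpp import Llama

# Load the quantized GGUF file downloaded above; n_ctx sets the context window
llm = Llama(model_path="mistral-7b-v0.1.Q4_K_M.gguf", n_ctx=2048)

# Simple completion call; max_tokens limits the length of the reply
output = llm("Explain quantum computing in simple terms", max_tokens=200)
print(output["choices"][0]["text"])
```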
4. Fine-Tuning Guide (QLoRA)
Adapt models to your specific needs with limited hardware.
Step 1: Install Requirements
```bash
pip install transformers accelerate peft bitsandbytes datasets trl
```
Step 2: Prepare Dataset
Example format (JSON):
[ { "instruction": "Explain quantum computing", "input": "", "output": "Quantum computing uses qubits..." }, { "instruction": "Write a poem about AI", "input": "", "output": "In silicon minds, dreams take flight..." } ]
Step 3: Fine-Tuning Script
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig
from trl import SFTTrainer

model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)

# LoRA configuration
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=1,
    fp16=True
)

# `dataset` is the formatted dataset built in the Step 2 snippet above
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    args=training_args,
    tokenizer=tokenizer
)
trainer.train()
```
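After training, you typically save only the small LoRA adapter and, if you want a standalone model for inference, merge it back into the base weights. A hedged sketch with peft; the output paths are placeholders:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Save only the LoRA adapter weights (a small fraction of the full model size)
trainer.save_model("./mistral-7b-lora-adapter")

# Later: reload the base model and merge the adapter into it for plain inference
base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", device_map="auto")
merged = PeftModel.from_pretrained(base_model, "./mistral-7b-lora-adapter").merge_and_unload()
merged.save_pretrained("./mistral-7b-finetuned")
```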
5. Performance Benchmarks
| Model (7B) | Speed (tokens/sec) | Memory Used |
|---|---|---|
| Mistral (FP16) | 25-40 (A100) | 14 GB VRAM |
| LLaMA 3 (4-bit) | 15-25 (RTX 3060) | 6 GB VRAM |
| Phi-3 (GGUF) | 10-20 (CPU) | 8 GB RAM |
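Throughput varies widely with hardware, quantization, and context length, so it is worth measuring on your own machine. A minimal sketch that times transformers generation, reusing the model and tokenizer loaded in the inference example:

```python
import time

# Reuse `model` and `tokenizer` from the inference example above
prompt = "Explain quantum computing in simple terms"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.time()
outputs = model.generate(**inputs, max_new_tokens=200)
elapsed = time.time() - start

# Count only newly generated tokens, not the prompt
generated_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{generated_tokens / elapsed:.1f} tokens/sec")
```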
6. Where to Download Models
- HuggingFace Model Hub (Mistral, LLaMA 3, Gemma)
- TheBloke’s Quantized Models (GGUF, GPTQ)
- Ollama Library (Pre-packaged models)
Conclusion
Running LLMs locally is now affordable and practical, thanks to tools like Ollama, LM Studio, and HuggingFace. By using quantization and LoRA, even mid-range PCs can handle 7B-13B models efficiently.
Next Steps:
- Try Ollama for the easiest setup.
- Experiment with QLoRA for fine-tuning.
- Join communities (r/LocalLLaMA, HuggingFace Discord).