A local chatbot that runs entirely in your browser. Download a language model once and chat with it privately. Say no to OpenAI ads!
Model Status
Console Logs
$
Model Selection
(╯°-°)╯︵ CAUTION ... Download models over a Wi-Fi connection: model weights are large and can use significant mobile data!
Model Zoo
All models are 4-bit quantised to reduce memory footprint and speed up loading, while inference runs in FP32 compute shaders for maximum GPU driver compatibility. FP16 would be faster, but it has limited support on older GPUs.
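To give a feel for what 4-bit quantisation means, here is a toy sketch: each float weight is mapped to one of 16 signed integer levels plus a shared scale, so two weights fit in a single byte instead of four bytes each. This is an illustration only; the actual scheme used by WebLLM model builds is group-wise and more sophisticated.

```javascript
// Toy 4-bit quantisation: map each weight to a signed 4-bit code (-8..7)
// with one shared scale factor. Real schemes quantise per group of weights.
function quantize4bit(weights) {
  const max = Math.max(...weights.map(Math.abs));
  const scale = max / 7; // largest magnitude maps to code 7
  const codes = weights.map((w) =>
    Math.max(-8, Math.min(7, Math.round(w / scale)))
  );
  return { codes, scale };
}

function dequantize4bit({ codes, scale }) {
  return codes.map((c) => c * scale);
}

const weights = [0.12, -0.53, 0.91, -0.07];
const q = quantize4bit(weights);
const restored = dequantize4bit(q);
// Each restored value is within one quantisation step (q.scale) of the original.
```

At 4 bits per weight (plus per-group scales), storage drops to roughly an eighth of FP32, which is why the downloads below are far smaller than the full-precision checkpoints.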
SmolLM2
360M (~580 MB) — Fast, ideal for low-end GPUs.
1.7B (~2.7 GB) — Good balance of speed and quality.
Developed by Hugging Face. Compact instruction-tuned models optimised for efficiency while retaining surprisingly strong reasoning for their size.
Qwen 2.5
0.5B (~1.1 GB) — Tiny but multilingual (English, Chinese, and 20+ languages). Perfect as a lightweight mobile translator.
Developed by the Alibaba Qwen team. Part of the Qwen 2.5 series, with strong multilingual and coding capabilities.
Llama 3.2
1B (~1.1 GB) — Well-rounded and general-purpose.
3B (~3.0 GB) — Best, but needs 4 GB+ GPU VRAM.
Developed by Meta AI. Open-weight instruction-tuned models from the Llama 3.2 release, designed for on-device and edge deployment.
Model weights are downloaded on first use and stored in the browser's Cache Storage, where they persist across sessions, so subsequent visits load the model locally without re-downloading. Real-time inference is powered by WebLLM, which runs on your device's GPU via the WebGPU API.
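The load-then-chat flow above can be sketched with WebLLM's engine API. This is a minimal sketch, assuming a WebGPU-capable browser and the `@mlc-ai/web-llm` package; the model ID shown follows WebLLM's `q4f32` naming (4-bit weights, FP32 compute) and is an example, not necessarily the exact ID this app uses.

```javascript
// Hypothetical sketch: download (or load from Cache Storage) a model
// and run one chat turn with WebLLM. Browser-only: requires WebGPU.
async function chatOnce(prompt, modelId = "SmolLM2-360M-Instruct-q4f32_1-MLC") {
  const { CreateMLCEngine } = await import("@mlc-ai/web-llm");
  // First call fetches the weights and caches them in Cache Storage;
  // later calls reuse the cached copy instead of re-downloading.
  const engine = await CreateMLCEngine(modelId, {
    initProgressCallback: (p) => console.log(p.text), // drives the status UI
  });
  const reply = await engine.chat.completions.create({
    messages: [{ role: "user", content: prompt }],
  });
  return reply.choices[0].message.content;
}
```

The `chat.completions.create` call mirrors the familiar OpenAI-style API, but everything runs on your own GPU.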