A local chatbot that runs entirely in your browser. Download a language model locally and chat with it privately. Say no to OpenAI ads!
Initialisation
Status
GPU
Checking…
Model
None
Size
~1.1 GB
Tokens
—
Local Models
$
Model Zoo
All models are 4-bit quantised to reduce memory footprint and download size, while inference runs in FP32 compute shaders for broad GPU driver compatibility.
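As a rough sketch of what 4-bit quantisation means at inference time, the snippet below unpacks block-quantised weights back to FP32. It assumes a simple symmetric scheme with one scale per block; real WebLLM formats (e.g. q4f32) use different packing layouts and extra metadata, so this is illustrative only.

```javascript
// Sketch of block-wise 4-bit dequantisation (assumption: symmetric
// quantisation with a single per-block scale; real formats also store
// zero-points and use different pack orders).
function dequantise4bit(packed, scale) {
  // Each byte packs two 4-bit values (0..15); subtract 8 to recentre
  // around zero, then multiply by the block's scale to recover an
  // FP32 approximation of the original weight.
  const out = new Float32Array(packed.length * 2);
  for (let i = 0; i < packed.length; i++) {
    out[2 * i] = ((packed[i] & 0x0f) - 8) * scale;
    out[2 * i + 1] = ((packed[i] >> 4) - 8) * scale;
  }
  return out;
}
```

At 4 bits per weight plus a small amount of scale metadata, storage drops to roughly an eighth of FP32, which is what makes multi-billion-parameter models downloadable at all.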
SmolLM2
360M (~580 MB) — Fast, ideal for phones and low-end GPUs.
1.7B (~2.7 GB) — Good balance of speed and quality for desktop use.
Developed by Hugging Face. Compact instruction-tuned models optimised for efficiency while retaining surprisingly strong reasoning for their size.
Qwen 2.5
0.5B (~1.1 GB) — Tiny but multilingual (English, Chinese, and 20+ other languages).
Developed by Alibaba's Qwen team. Part of the Qwen 2.5 series, with strong multilingual and coding capabilities.
Llama 3.2
1B (~1.1 GB) — Well-rounded general-purpose assistant.
3B (~3.0 GB) — The most capable model offered here, but it needs 4 GB+ of GPU VRAM.
Developed by Meta AI. Open-weight instruction-tuned models from the Llama 3.2 release, designed for on-device and edge deployment.
Under the Hood
This chat interface runs a large language model (LLM) privately inside the browser. Inference is powered by WebLLM, which uses the GPU via the WebGPU API to run quantised (4-bit) versions of open-source models directly in the browser.
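A minimal sketch of that loading path: feature-detect WebGPU first, then hand off to WebLLM. The runnable part below is the detection; the commented portion assumes the `@mlc-ai/web-llm` package and one of its prebuilt model ids, which may differ from what this app actually ships.

```javascript
// Feature-detect WebGPU before attempting to load a model.
// Returns false outside a browser or when no adapter is available.
async function webgpuAvailable() {
  if (typeof navigator === "undefined" || !("gpu" in navigator)) return false;
  const adapter = await navigator.gpu.requestAdapter();
  return adapter !== null;
}

// In a browser with WebLLM installed, initialisation looks roughly like:
//   import { CreateMLCEngine } from "@mlc-ai/web-llm";
//   const engine = await CreateMLCEngine("Llama-3.2-1B-Instruct-q4f32_1-MLC", {
//     initProgressCallback: (p) => console.log(p.text),
//   });
//   const reply = await engine.chat.completions.create({
//     messages: [{ role: "user", content: "Hello!" }],
//   });
```

Checking for an adapter (not just `navigator.gpu`) matters: some browsers expose the API but return no adapter on unsupported hardware.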
Model weights are downloaded on first use and stored in the browser's Cache Storage, so they persist across sessions; subsequent visits load the model from cache instead of re-downloading it. Chrome for Android on recent hardware supports WebGPU and runs the smaller models comfortably.
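To see how much space the cached weights occupy, the standard `navigator.storage.estimate()` API can be queried. A minimal sketch:

```javascript
// Report persistent storage used by this origin (cached weights included),
// via the standard StorageManager API. Returns null where unavailable.
async function cacheUsage() {
  if (typeof navigator === "undefined" || !navigator.storage?.estimate) {
    return null; // API unavailable (e.g. outside a browser)
  }
  const { usage, quota } = await navigator.storage.estimate();
  return { usedGB: usage / 1024 ** 3, quotaGB: quota / 1024 ** 3 };
}
```

Note the estimate covers the whole origin, not just model weights, and browsers may round the figures for privacy.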
Note: if a model loads but errors on the first message, clear the model cache and try again; this usually resolves a corrupted WASM or weight download. For best results, download larger models over a stable Wi-Fi connection, unless you're ready to say goodbye to your mobile data!
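Clearing the model cache can also be done programmatically via the Cache Storage API. A sketch, assuming WebLLM's cache names contain "webllm" (inspect `caches.keys()` in DevTools to confirm the exact names this app uses):

```javascript
// Delete caches whose names suggest they hold WebLLM model data.
// (Assumption: the cache names contain "webllm"; verify before relying
// on this.) Returns the list of cache names that were deleted.
async function clearModelCache() {
  if (typeof caches === "undefined") return []; // not in a browser
  const names = await caches.keys();
  const deleted = [];
  for (const name of names) {
    if (name.includes("webllm")) {
      await caches.delete(name);
      deleted.push(name);
    }
  }
  return deleted;
}
```

After clearing, the next model load re-downloads the weights from scratch, so run this on Wi-Fi.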