A minimal GPT-style language model for character-level next-token prediction. Trains on a corpus of names (or any text) and learns to generate similar sequences. Pure JavaScript, no dependencies. Ported from Karpathy's micrograd-based makemore.
- Loads `input.txt` (one item per line, e.g. names)
- Trains a small transformer on next-character prediction
- Evaluates on held-out data (loss, perplexity, accuracy)
- Generates 20 sample sequences autoregressively
- Node.js
- `input.txt` in the project directory (lines of text to train on)
- `node microllm.js`

- `createRandom(seed)` – LCG-based seeded PRNG (reproducible runs).
- `gauss(mean, std)` – Box-Muller transform for Gaussian samples (weight init).
- `shuffle(arr)` – Fisher-Yates shuffle.
- `weightedChoice(weights)` – Samples an index from a discrete distribution (inference sampling).
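A minimal sketch of what these helpers might look like. The function names match the ones above, but the LCG constants and the explicit `rand` parameter are assumptions (the real script may use a module-level PRNG):

```javascript
// LCG-based seeded PRNG: same seed → same run.
// (Constants are an assumption; these are the common Numerical Recipes values.)
function createRandom(seed) {
  let state = seed >>> 0;
  return function () {
    state = (Math.imul(1664525, state) + 1013904223) >>> 0;
    return state / 4294967296; // uniform in [0, 1)
  };
}

// Box-Muller transform: two uniforms → one Gaussian sample.
function gauss(mean, std, rand) {
  const u1 = Math.max(rand(), 1e-12); // avoid log(0)
  const u2 = rand();
  const z = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
  return mean + std * z;
}

// Sample index i with probability weights[i] / sum(weights).
function weightedChoice(weights, rand) {
  const total = weights.reduce((a, b) => a + b, 0);
  let r = rand() * total;
  for (let i = 0; i < weights.length; i++) {
    r -= weights[i];
    if (r <= 0) return i;
  }
  return weights.length - 1; // numerical fallback
}
```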
Loads `input.txt`, shuffles the docs, and splits them 90% train / 10% validation. Docs are lines of text (e.g. names).
Character-level tokenizer. `chars = ['<BOS>', 'a', 'b', ...]` (sorted unique chars plus BOS). `stoi` / `itos` map between characters and integer ids. Sequences are wrapped with BOS at both start and end.
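As a sketch, the tokenizer might be built like this. The `buildTokenizer` name and the BOS-gets-id-0 convention are assumptions; only `chars`, `stoi`, `itos`, and the BOS wrapping come from the description above:

```javascript
// Build a character-level vocab from the training docs.
// '<BOS>' is given id 0 (an assumption); the rest are sorted unique chars.
function buildTokenizer(docs) {
  const uniq = [...new Set(docs.join(''))].sort();
  const chars = ['<BOS>', ...uniq];
  const stoi = new Map(chars.map((c, i) => [c, i]));
  const itos = new Map(chars.map((c, i) => [i, c]));
  // Wrap each sequence with BOS at start and end, as described above.
  const encode = (doc) => [0, ...[...doc].map((c) => stoi.get(c)), 0];
  const decode = (ids) =>
    ids.filter((i) => i !== 0).map((i) => itos.get(i)).join('');
  return { chars, stoi, itos, encode, decode };
}
```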
Scalar autograd engine. Each `Value` holds `data` and `grad` and tracks its inputs (`_prev`) and backward rule (`_backward`). Operations (`add`, `mul`, `pow`, `log`, `exp`, `relu`) build a DAG; `backward()` does reverse-mode differentiation via topological sort. Used for all model parameters and activations so gradients flow through the full graph.
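The core of such an engine fits in a few lines. This sketch shows just `add` and `mul` plus `backward()` (the real engine also has `pow`, `log`, `exp`, `relu`):

```javascript
// Minimal scalar autograd Value in the style described above.
class Value {
  constructor(data, prev = []) {
    this.data = data;
    this.grad = 0;
    this._prev = prev;          // inputs in the DAG
    this._backward = () => {};  // local gradient rule, set by each op
  }
  add(other) {
    const out = new Value(this.data + other.data, [this, other]);
    out._backward = () => {
      this.grad += out.grad;
      other.grad += out.grad;
    };
    return out;
  }
  mul(other) {
    const out = new Value(this.data * other.data, [this, other]);
    out._backward = () => {
      this.grad += other.data * out.grad;
      other.grad += this.data * out.grad;
    };
    return out;
  }
  backward() {
    // Topological sort, then apply the chain rule in reverse order.
    const topo = [];
    const seen = new Set();
    const build = (v) => {
      if (seen.has(v)) return;
      seen.add(v);
      v._prev.forEach(build);
      topo.push(v);
    };
    build(this);
    this.grad = 1;
    for (let i = topo.length - 1; i >= 0; i--) topo[i]._backward();
  }
}
```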
- `nEmbd` (16) – Embedding dimension.
- `nHead` (4) – Attention heads.
- `nLayer` (1) – Transformer blocks.
- `blockSize` (8) – Max context length.
- `matrix(nout, nin, std)` – Initializes an `nout × nin` matrix of `Value` objects with Gaussian(0, std).
State dict:
- `wte` – Token embeddings (vocab × nEmbd).
- `wpe` – Position embeddings (blockSize × nEmbd).
- `layer{i}.attn_wq/wk/wv/wo` – Attention Q, K, V, O projections.
- `layer{i}.mlp_fc1/fc2` – MLP (4× expand, ReLU² activation).
- `lm_head` – Output projection to vocab logits.
Heads and MLP output layers use `std=0` to start from identity-like behavior.
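A sketch of `matrix` under these conventions. The `randGauss` parameter and the stub `Value` class are illustrative stand-ins:

```javascript
// Minimal stand-in for the autograd Value described earlier.
class Value {
  constructor(data) { this.data = data; this.grad = 0; }
}

// nout × nin matrix of Value objects drawn from Gaussian(0, std).
// With std = 0 every weight starts at exactly 0, so the layer
// contributes nothing to the residual stream at initialization
// (identity-like start).
function matrix(nout, nin, std, randGauss) {
  return Array.from({ length: nout }, () =>
    Array.from({ length: nin }, () => new Value(randGauss() * std))
  );
}
```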
- `linear(x, w)` – Vector `x` times matrix `w`; returns `w @ x`.
- `softmax(logits)` – Numerically stable softmax (subtract max before exp).
- `rmsnorm(x)` – RMS normalization: scale by `(mean(x²) + ε)^(-0.5)`.
- `gpt(tokenId, posId, keys, values)` – Forward pass:
  - Embed: `tok_emb + pos_emb`, then RMSNorm.
  - Per layer:
    - Attention: RMSNorm → Q,K,V projections → per-head scaled dot-product attention over cached K,V → O projection → residual.
    - MLP: RMSNorm → fc1 → ReLU² → fc2 → residual.
  - Final projection to vocab logits.
Uses causal attention: at position t, only positions 0..t can attend (via keys/values built step-wise).
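Shown on plain numbers for brevity (the real versions operate element-wise on `Value` objects so gradients flow), `softmax` and `rmsnorm` might look like:

```javascript
// Numerically stable softmax: subtract the max before exponentiating
// so exp() never overflows.
function softmax(logits) {
  const m = Math.max(...logits);
  const exps = logits.map((x) => Math.exp(x - m));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// RMSNorm: scale x by (mean(x²) + ε)^(-1/2); no mean-centering, no bias.
function rmsnorm(x, eps = 1e-5) {
  const ms = x.reduce((a, v) => a + v * v, 0) / x.length;
  const scale = 1 / Math.sqrt(ms + eps);
  return x.map((v) => v * scale);
}
```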
Adam optimizer: first- and second-moment buffers (`m`, `v`), bias correction with β1^(step+1) and β2^(step+1), and a learning rate that decays linearly to 0 over training.
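A sketch of one such update over a flat parameter list. The `adamStep` name and the default hyperparameters are assumptions; the bias correction and linear decay match the description above:

```javascript
// One Adam step for params of shape {data, grad}, with bias correction
// and a learning rate that decays linearly to 0 over totalSteps.
function adamStep(params, m, v, step, totalSteps, baseLr = 1e-2,
                  beta1 = 0.9, beta2 = 0.999, eps = 1e-8) {
  const lr = baseLr * (1 - step / totalSteps); // linear decay to 0
  params.forEach((p, i) => {
    m[i] = beta1 * m[i] + (1 - beta1) * p.grad;          // first moment
    v[i] = beta2 * v[i] + (1 - beta2) * p.grad * p.grad; // second moment
    const mHat = m[i] / (1 - Math.pow(beta1, step + 1)); // bias correction
    const vHat = v[i] / (1 - Math.pow(beta2, step + 1));
    p.data -= lr * mHat / (Math.sqrt(vHat) + eps);
  });
}
```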
Each step: pick a doc, tokenize as `[BOS, ...chars, BOS]`, take up to `blockSize` positions. For each position, run the GPT forward pass, get logits → softmax → cross-entropy loss on the target. Average the loss over positions, call `backward()`, then do an Adam update. Zero grads after each step.
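The per-position loss described above reduces to −log p(target). On plain numbers (the real code composes it from `Value` ops so it is differentiable):

```javascript
// Cross-entropy for one position: -log softmax(logits)[target],
// computed with the same max-subtraction trick for stability.
function crossEntropy(logits, target) {
  const mx = Math.max(...logits);
  const exps = logits.map((x) => Math.exp(x - mx));
  const sum = exps.reduce((a, b) => a + b, 0);
  return -Math.log(exps[target] / sum);
}
```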
On up to 500 validation docs: forward only (no backward), compute cross-entropy, argmax accuracy, and perplexity exp(loss).
For each sample: start with BOS, autoregressively run the GPT, apply temperature scaling before the softmax, sample the next token via `weightedChoice`, decode and print until BOS or `blockSize`. Temperature 0.6 sharpens the distribution, favoring high-probability characters for more coherent outputs.
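The temperature step can be sketched as follows. `sampleWithTemperature` is a hypothetical name combining the scaling, the softmax, and the `weightedChoice`-style draw into one function:

```javascript
// Divide logits by T before softmax: T < 1 sharpens the distribution
// (favors likely tokens), T > 1 flattens it (more varied, riskier).
function sampleWithTemperature(logits, temperature, rand) {
  const scaled = logits.map((x) => x / temperature);
  const mx = Math.max(...scaled);
  const exps = scaled.map((x) => Math.exp(x - mx));
  const sum = exps.reduce((a, b) => a + b, 0);
  const probs = exps.map((e) => e / sum);
  // Inline weighted choice over the probability distribution.
  let r = rand();
  for (let i = 0; i < probs.length; i++) {
    r -= probs[i];
    if (r <= 0) return i;
  }
  return probs.length - 1; // numerical fallback
}
```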