
Quantization: Embeddings quantization, new packing format, Rtn quantizer#2238

Merged
jambayk merged 2 commits into main from jambayk/embeds-rtn
Nov 3, 2025

Conversation

@jambayk
Contributor

@jambayk jambayk commented Nov 3, 2025

Describe your changes

  • New QuantEmbedding module added to do input embedding quantization
    • Exports to GatherBlockQuantized; both torch script and dynamo export modes are supported. Model builder doesn't support it since it requires a change in the builder script.
    • If a quantization pass is responsible for quantizing both the embeds and lm_head of a model with originally tied weights, it can keep them tied in the pytorch model. During export, torch script duplicates the shared qweight because of the reshape, while dynamo keeps the reshape on the MatMulNBits (need to test whether it's more efficient for the reshape to be on the GatherBlockQuantized).
  • New Rtn pass that can be composed on top of other quantization passes. For example, gptq on the transformer layers and then rtn on the embedding and lm head to take advantage of weight tying.
  • New quantized model checkpoint format. Moved to the same format used by MatMulNBits and GatherBlockQuantized. There is now no overhead for unpacking and repacking the weights during export, so it is very fast.
    • Enforce the same restrictions on block size and weight shapes required by the contrib ops for compatibility.
    • We also enforce that the quantization dim is divisible by the block size. This makes the packing logic easier since we don't have to worry about padding, and it keeps the 3D qweight for Linears compatible with the 2D qweight for Embeddings.
  • Updated the autogptq and autoawq checkpoint exports to make the quantization parameters 2D as per the latest specs for the contrib operators.
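To make the packing format concrete, here is a minimal NumPy sketch of RTN (round-to-nearest) blockwise uint4 quantization with two nibbles packed per byte, the layout MatMulNBits/GatherBlockQuantized consume. This is illustrative only, not the actual Olive pass; the function name and the asymmetric uint4 scheme are assumptions for the example.

```python
# Hedged sketch: RTN blockwise 4-bit quantization + nibble packing.
# Not the real Olive implementation; names and scheme are illustrative.
import numpy as np

def rtn_quantize_pack(weight: np.ndarray, block_size: int = 32):
    """Quantize a (rows, cols) float weight to asymmetric uint4 per block
    along the last dim, then pack two 4-bit values per uint8.

    Mirrors the PR's restriction that the quantization dim be divisible
    by the block size, so no padding logic is needed.
    """
    rows, cols = weight.shape
    assert cols % block_size == 0, "quantization dim must divide evenly into blocks"
    blocks = weight.reshape(rows, cols // block_size, block_size)

    # Asymmetric uint4: map each block's [min, max] onto integers [0, 15].
    w_min = blocks.min(axis=-1, keepdims=True)
    w_max = blocks.max(axis=-1, keepdims=True)
    scale = (w_max - w_min) / 15.0
    scale = np.where(scale == 0, 1.0, scale)          # guard constant blocks
    zero = np.clip(np.round(-w_min / scale), 0, 15)   # zero point in [0, 15]

    q = np.clip(np.round(blocks / scale + zero), 0, 15).astype(np.uint8)
    q = q.reshape(rows, cols)
    # Pack two nibbles per byte: even element in the low nibble, odd in the high.
    packed = (q[:, 0::2] | (q[:, 1::2] << 4)).astype(np.uint8)
    # 2D scale/zero per (row, block), matching the contrib-op parameter shapes.
    return packed, scale.squeeze(-1).astype(np.float32), zero.squeeze(-1).astype(np.uint8)
```

Because the checkpoint already stores weights in this packed layout, export can copy them through without the unpack/repack round trip the old format required.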
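On the lookup side, GatherBlockQuantized behaves roughly like a gather over packed rows followed by block dequantization of only the selected rows. The sketch below shows that shape logic in NumPy; the function name and parameter shapes are assumptions for illustration, not the ONNX Runtime kernel.

```python
# Hedged sketch of a GatherBlockQuantized-style embedding lookup:
# gather packed uint4 rows by token id, then dequantize per block.
# Illustrative only; the real kernel is an ORT contrib op.
import numpy as np

def gather_block_dequant(packed, scale, zero, ids, block_size=32):
    """packed: (vocab, hidden // 2) uint8, two nibbles per byte.
    scale, zero: (vocab, hidden // block_size) per-block parameters.
    ids: integer array of token ids, any shape.
    """
    sel = packed[ids]                       # gather packed rows first (cheap)
    q = np.empty((*sel.shape[:-1], sel.shape[-1] * 2), dtype=np.uint8)
    q[..., 0::2] = sel & 0x0F               # low nibble: even elements
    q[..., 1::2] = sel >> 4                 # high nibble: odd elements
    blocks = q.reshape(*q.shape[:-1], -1, block_size).astype(np.float32)
    deq = (blocks - zero[ids][..., None].astype(np.float32)) * scale[ids][..., None]
    return deq.reshape(q.shape)             # (*ids.shape, hidden)
```

Gathering before dequantizing is what makes quantized embeddings cheap at runtime: only the rows for the tokens in the batch are ever expanded to floats.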

Checklist before requesting a review

  • Add unit tests for this change.
  • Make sure all tests can pass.
  • Update documents if necessary.
  • Lint and apply fixes to your code by running lintrunner -a
  • Is this a user-facing change? If yes, give a description of this change to be included in the release notes.

(Optional) Issue link

@jambayk jambayk requested a review from xiaoyu-work November 3, 2025 19:55
@jambayk jambayk changed the title Qauntization: Embeddings quantization, new packing format, Rtn quantizer Quantization: Embeddings quantization, new packing format, Rtn quantizer Nov 3, 2025
@xiaoyu-work
Collaborator

We are trying to improve our test coverage. Can you please add unit tests for the new files created in this PR?

@jambayk jambayk merged commit d645057 into main Nov 3, 2025
11 checks passed
@jambayk jambayk deleted the jambayk/embeds-rtn branch November 3, 2025 23:58
skywall pushed a commit to NXP/eiq-olive that referenced this pull request Mar 10, 2026
Merge in AITEC/eiq-olive from feature/EITO-565-rebase-to-newest-version-of-olive-0.9.3 to main

* commit 'fd44fa6a51e382d59a88e4fceec49042b7e2caa5': (370 commits)
  ruff safe fixes
  update rebased on badge readme
  fix import
  things I've missed during rebasing
  ruff
  Revert "ruff stuff"
  ruff stuff
  Bump up version to 0.10.1
  Fix cache output model name bug (microsoft#2249)
  HfModelHandler: Check for tokenizer_config.json instead of try/else (microsoft#2247)
  Quantization: Keep embeddings tied in SelectiveMixedPrecision, Clean overrides (microsoft#2246)
  TieWordEmbeddings: return model when no tieing detected (microsoft#2242)
  Static Quantization: Always patch `MinMaxCalibrator` (microsoft#2241)
  Release branch 0.10.0
  Add custom onnx model name support for output dir (microsoft#2235)
  TieWordEmbeddings: unquantized and quantized support (microsoft#2240)
  Quantization: Embeddings quantization, new packing format, Rtn quantizer (microsoft#2238)
  Add support for Quark onnx quantization (microsoft#2236)
  Spelling fixes (microsoft#2234)
  LLMAugmentedDataLoader: No decode phase for non-GQA model (microsoft#2204)
  ...
3 participants