
Quantization: Embeddings quantization, new packing format, Rtn quantizer#2238

Merged
jambayk merged 2 commits into main from jambayk/embeds-rtn
Nov 3, 2025

Conversation

@jambayk
Contributor

@jambayk jambayk commented Nov 3, 2025

Describe your changes

  • New QuantEmbedding module added to do input embedding quantization
    • Exports to GatherBlockQuantized; both torch script and dynamo export modes are supported. Model builder doesn't support it since it requires a change in the builder script.
    • If a quantization pass is responsible for quantizing both the embeds and lm_head of a model with originally tied weights, it can keep them tied in the pytorch model. During export, torch script duplicates the shared qweight because of the reshape, while dynamo keeps the reshape on the MatMulNBits (need to test whether it's more efficient for the reshape to be on the GatherBlockQuantized).
  • New Rtn pass that can be composed on top of other quantization passes. For example, gptq on the transformer layers and then rtn on the embedding and lm head to take advantage of weight tying.
  • New quantized model checkpoint format. Moved to the same format used by MatMulNBits and GatherBlockQuantized. There is now no overhead for unpacking and repacking the weights during export, so it is very fast.
    • Enforce the same restrictions on block size and weight shapes required by the contrib ops for compatibility.
    • We also enforce that the quantization dim is divisible by the block size. This makes the packing logic easier since we don't have to worry about padding, and it keeps the 3D qweight for Linears compatible with the 2D qweight for Embeddings.
  • Updated the autogptq and autoawq checkpoint exports to make the quantization parameters 2D as per the latest specs for the contrib operators.
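To make the packing format concrete, here is a minimal NumPy sketch of RTN (round-to-nearest) blockwise uint4 quantization with two nibbles packed per byte, the layout MatMulNBits/GatherBlockQuantized consume. This is illustrative only, not the actual Olive pass; the function name and the asymmetric uint4 scheme are assumptions for the example.

```python
# Hedged sketch: RTN blockwise 4-bit quantization + nibble packing.
# Not the real Olive implementation; names and scheme are illustrative.
import numpy as np

def rtn_quantize_pack(weight: np.ndarray, block_size: int = 32):
    """Quantize a (rows, cols) float weight to asymmetric uint4 per block
    along the last dim, then pack two 4-bit values per uint8.

    Mirrors the PR's restriction that the quantization dim be divisible
    by the block size, so no padding logic is needed.
    """
    rows, cols = weight.shape
    assert cols % block_size == 0, "quantization dim must divide evenly into blocks"
    blocks = weight.reshape(rows, cols // block_size, block_size)

    # Asymmetric uint4: map each block's [min, max] onto integers [0, 15].
    w_min = blocks.min(axis=-1, keepdims=True)
    w_max = blocks.max(axis=-1, keepdims=True)
    scale = (w_max - w_min) / 15.0
    scale = np.where(scale == 0, 1.0, scale)          # guard constant blocks
    zero = np.clip(np.round(-w_min / scale), 0, 15)   # zero point in [0, 15]

    q = np.clip(np.round(blocks / scale + zero), 0, 15).astype(np.uint8)
    q = q.reshape(rows, cols)
    # Pack two nibbles per byte: even element in the low nibble, odd in the high.
    packed = (q[:, 0::2] | (q[:, 1::2] << 4)).astype(np.uint8)
    # 2D scale/zero per (row, block), matching the contrib-op parameter shapes.
    return packed, scale.squeeze(-1).astype(np.float32), zero.squeeze(-1).astype(np.uint8)
```

Because the checkpoint already stores weights in this packed layout, export can copy them through without the unpack/repack round trip the old format required.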
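On the lookup side, GatherBlockQuantized behaves roughly like a gather over packed rows followed by block dequantization of only the selected rows. The sketch below shows that shape logic in NumPy; the function name and parameter shapes are assumptions for illustration, not the ONNX Runtime kernel.

```python
# Hedged sketch of a GatherBlockQuantized-style embedding lookup:
# gather packed uint4 rows by token id, then dequantize per block.
# Illustrative only; the real kernel is an ORT contrib op.
import numpy as np

def gather_block_dequant(packed, scale, zero, ids, block_size=32):
    """packed: (vocab, hidden // 2) uint8, two nibbles per byte.
    scale, zero: (vocab, hidden // block_size) per-block parameters.
    ids: integer array of token ids, any shape.
    """
    sel = packed[ids]                       # gather packed rows first (cheap)
    q = np.empty((*sel.shape[:-1], sel.shape[-1] * 2), dtype=np.uint8)
    q[..., 0::2] = sel & 0x0F               # low nibble: even elements
    q[..., 1::2] = sel >> 4                 # high nibble: odd elements
    blocks = q.reshape(*q.shape[:-1], -1, block_size).astype(np.float32)
    deq = (blocks - zero[ids][..., None].astype(np.float32)) * scale[ids][..., None]
    return deq.reshape(q.shape)             # (*ids.shape, hidden)
```

Gathering before dequantizing is what makes quantized embeddings cheap at runtime: only the rows for the tokens in the batch are ever expanded to floats.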

Checklist before requesting a review

  • Add unit tests for this change.
  • Make sure all tests can pass.
  • Update documents if necessary.
  • Lint and apply fixes to your code by running lintrunner -a
  • Is this a user-facing change? If yes, give a description of this change to be included in the release notes.

(Optional) Issue link

@jambayk jambayk requested a review from xiaoyu-work November 3, 2025 19:55
@jambayk jambayk changed the title Qauntization: Embeddings quantization, new packing format, Rtn quantizer Quantization: Embeddings quantization, new packing format, Rtn quantizer Nov 3, 2025
@xiaoyu-work
Collaborator

We are trying to improve our test coverage. Can you please add unit tests for the new files created in this PR?

@jambayk jambayk merged commit d645057 into main Nov 3, 2025
11 checks passed
@jambayk jambayk deleted the jambayk/embeds-rtn branch November 3, 2025 23:58
skywall pushed a commit to NXP/eiq-olive that referenced this pull request Mar 10, 2026
Merge in AITEC/eiq-olive from feature/EITO-565-rebase-to-newest-version-of-olive-0.9.3 to main

* commit 'fd44fa6a51e382d59a88e4fceec49042b7e2caa5': (370 commits)
  ruff safe fixes
  update rebased on badge readme
  fix import
  things I've missed during rebasing
  ruff
  Revert "ruff stuff"
  ruff stuff
  Bump up version to 0.10.1
  Fix cache output model name bug (microsoft#2249)
  HfModelHandler: Check for tokenizer_config.json instead of try/else (microsoft#2247)
  Quantization: Keep embeddings tied in SelectiveMixedPrecision, Clean overrides (microsoft#2246)
  TieWordEmbeddings: return model when no tieing detected (microsoft#2242)
  Static Quantization: Always patch `MinMaxCalibrator` (microsoft#2241)
  Release branch 0.10.0
  Add custom onnx model name support for output dir (microsoft#2235)
  TieWordEmbeddings: unquantized and quantized support (microsoft#2240)
  Quantization: Embeddings quantization, new packing format, Rtn quantizer (microsoft#2238)
  Add support for Quark onnx quantization (microsoft#2236)
  Spelling fixes (microsoft#2234)
  LLMAugmentedDataLoader: No decode phase for non-GQA model (microsoft#2204)
  ...
3 participants