Releases: microsoft/Olive
Olive-ai 0.11.0
Passes
- SelectiveMixedPrecision: Sensitivity score based algorithms
- Add Stable Diffusion lora pass
- Add ONNX conversion pass support for diffusers model
- MatMulNBitsToQDQ: Support 2bit
Quantization
- Quantization: Keep embeddings tied in SelectiveMixedPrecision, Clean overrides
- Add support for tensor-wise mixed precision in Quark onnx quantization
- Support multiple modes of ignored scopes simultaneously in Intel® OpenVINO weight compression and quantization passes
- Automatically tie kv cache i/o quantizers in AimetQuantization
- Quantization: Fix annotation
- Quantization: Utils to pack and unpack uint8 storage
- Quantization: Generalize olive quantized model loading in model builder
- VitisAI AMD NPU LLM Quantization - Add Windows + CUDA support for Quark Quantizer
- Quantization: tie_quant_modules util and new tests
- Few fixes for benchmark cli and quark quantization
- Quantization: Enable 2-bit in QuantModules
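To illustrate what packing low-bit values into uint8 storage involves, here is a minimal sketch in plain Python. The function names `pack_uint8` and `unpack_uint8`, and the lowest-bits-first layout, are assumptions made for this example; they are not Olive's actual utilities or storage format.

```python
def pack_uint8(values, bits):
    """Pack small unsigned integers into bytes, lowest bits first (assumed layout)."""
    if bits not in (2, 4):
        raise ValueError("this sketch only handles 2-bit and 4-bit values")
    per_byte = 8 // bits
    if len(values) % per_byte != 0:
        raise ValueError("length must be a multiple of the values stored per byte")
    packed = bytearray()
    for start in range(0, len(values), per_byte):
        byte = 0
        for offset, value in enumerate(values[start:start + per_byte]):
            if not 0 <= value < (1 << bits):
                raise ValueError(f"value {value} does not fit in {bits} bits")
            # Place each value in its own bit slot inside the byte.
            byte |= value << (offset * bits)
        packed.append(byte)
    return bytes(packed)


def unpack_uint8(packed, bits):
    """Inverse of pack_uint8: recover the list of small unsigned integers."""
    per_byte = 8 // bits
    mask = (1 << bits) - 1
    return [(byte >> (offset * bits)) & mask
            for byte in packed
            for offset in range(per_byte)]


codes = [0, 1, 2, 3, 3, 2, 1, 0]
packed_2bit = pack_uint8(codes, bits=2)   # four 2-bit codes per byte
restored = unpack_uint8(packed_2bit, bits=2)
```

With 2-bit codes, eight values fit in two bytes, which is the storage saving that motivates 2-bit support.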
CLI
Evaluation
Bug Fixes and other updates
- HfModelHandler: Check for tokenizer_config.json instead of try/else
- Fix cache output model name bug
- Fix a bug with an incorrect parameter type
- Update MatMulAddToGemm Graph Surgery when ReLU is present after Add
- UT: Remove azureml-evaluate-mlflow, Update optimum, autoawq dependencies
- Add dict check for HF model patch in Conversion pass
- UT: Use same input pytorch model in openvino test
- Remove gidx input from MatMulNBits graph surgery
- Add diffusers model handler
- Add sd lora data container and preprocessing funcs
- Add MB bf16 support for caption-onnx-graph cli
- Replace TRANSFORMERS_CACHE with HF_HUB_CACHE
- Disable lmhead during prefill phase in genai config
- Update device selection for bf16 for caption-onnx cli
- Add flag to apply DeduplicateHashedInitializersPass post graph surgery
- Add with_prior_preservation option for dreambooth
- Add RenameOutputDims, PackedAttentionToPackedMHA and PackedAttentionToLoopMHA graph surgeries for use with the Qwen VL model.
- Add popular model IO configs
Olive-ai 0.10.1
Improvements and Bug Fixes
Olive-ai 0.10.0
New Features
- Quark Quantization for ONNX Models (#2236): New `QuarkQuantization` pass via `olive run` with support for int8/uint8/int16/uint16/int32/uint32/bf16/bfp16 and CLE/SmoothQuant/AdaRound/AdaQuant.
- Embedding Quantization & RTN Improvements (#2238): Added `QuantEmbedding`, a composable `Rtn` pass, and a unified checkpoint format aligned with `MatMulNBits`/`GatherBlockQuantized` (block/shape constraints enforced; AutoGPTQ/AutoAWQ export updated to 2D params).
- Word Embedding Tying Surgery (#2240): `TieWordEmbeddings` ties input embeddings and `lm_head` for both unquantized (`Gemm`) and quantized (`MatMulNBits` + `GatherBlockQuantized`) graphs.
- Custom ONNX Model Naming (#2235): Allows specifying a custom ONNX model name in the output directory.
- Intel OpenVINO Weight Compression Pass (#2180): Adds NNCF-based weight compression for HF/ONNX models to OpenVINO or compressed ONNX.
Improvements
- AIMET Enhancements (#2158, #2187, #2215): Adds Sequential MSE, enables AIMET in `quantize` CLI, and supports manual precision overrides.
- GPTQ Updates (#2202, #2203): Supports user-provided module overrides and `transformers >= 4.53`.
- Quantization Export Compatibility (#2218): Updates checks for `ort-genai > 0.9.0` and fixes minor `OnnxDAG` name clashes.
- Torch Dynamo Export Alignment (#2185): `extract_adapter` recovers folded LoRA and decomposes DORA-fused `Gemm` to `MatMul` for quantization.
- Post-Surgery Deduplication (#2228): Runs `DeduplicateHashedInitializersPass` after surgeries to remove duplicate initializers.
- QNN Execution Provider: GPU Enablement (#2220): Enables QNN-EP GPU, updates `StaticLLM` and `ContextBinaryGeneration`, keeps NPU default.
- Run API Ergonomics (#2199): `olive.run()` now accepts a dict `run_config`.
- OpenVINO Config Overrides (#2191): Allows overriding `genai_config.json` properties in OV encapsulation.
- ReplaceAttentionMaskValue Robustness (#2213): Adds `Shape` to `ALLOWED_CONSUMER_OPS` for text-encoder graphs.
- Implicit Olive Version Tagging (#2183): Automatically embeds the Olive version in saved ONNX model protos.
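The idea behind removing duplicate initializers after graph surgery can be sketched in a few lines of plain Python. This is only an illustration of hash-based deduplication, not the actual `DeduplicateHashedInitializersPass`; the function `deduplicate_initializers` and the `(dtype, shape, raw_bytes)` tuple layout are hypothetical.

```python
import hashlib


def deduplicate_initializers(initializers):
    """Keep one copy of each distinct tensor.

    `initializers` maps a name to a (dtype, shape, raw_bytes) tuple.
    Returns (kept, renames), where `renames` maps each dropped name to the
    name of the identical tensor that was kept.
    """
    first_seen = {}
    kept = {}
    renames = {}
    for name, (dtype, shape, raw) in initializers.items():
        # Tensors count as duplicates only if dtype, shape and content all match.
        key = (dtype, tuple(shape), hashlib.sha256(raw).hexdigest())
        if key in first_seen:
            renames[name] = first_seen[key]
        else:
            first_seen[key] = name
            kept[name] = (dtype, shape, raw)
    return kept, renames


weights = bytes(range(16))
graph_initializers = {
    "embed.weight": ("float32", (2, 2), weights),
    "lm_head.weight": ("float32", (2, 2), weights),   # identical copy
    "bias": ("float32", (4,), weights),               # same bytes, different shape
}
kept, renames = deduplicate_initializers(graph_initializers)
```

A real pass would also rewrite every node input listed in `renames` so that consumers point at the surviving initializer.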
Olive-ai 0.9.3
New Features:
- Compatibility with Windows ML for ONNX model inference and evaluation (#2052, #2056, #2059, #2084).
- `Gptq` quantization supports `lm_head` quantization and more generic weight packing (#2137).
Improvements
- `optimize` CLI supports `WebGPU` execution provider (#2076) and `NVTensorRtRTX` execution provider (#2078).
- `quantize` CLI supports Gptq pass as an implementation (#2115).
- `Onnx static quantization` supports strided calibration data for lower memory usage (#2086).
- Extra options can be provided directly to the `ModelBuilder` pass (#2107).
- `LMEvaluator` has a new ORT backend with `IOBinding` leading to large speedup in runtime (#2133).
- `OnnxFloatToFloat16` allows more granular control through `op_include_list` and `node_include_list` (#2134).
- `AIMET` quantization pass: Support for exclude op types (#2055), pre-quantized models (#2111), LLM augmented dataloaders (#2108), LPBQ (#2119), and Adaround (#2140).
Deprecation
As per the deprecation warning in the previous release, the following Azure ML related features have been removed:
- Azure ML system
- Azure ML resource types: model, datastore, job outputs.
- Remote workflow
- Azure ML artifact packaging
Other removed features include:
- `IsolatedORT System` (#2070)
- `Quantization Aware Training` (#2089)
- `AppendPrePostProcessingOps` pass (#2090)
- `SNPE` passes (#2098)
Recipes Migration
All recipes have been migrated to olive-recipes repository.
Olive-ai 0.9.2
New Features:
- Selective Mixed Precision. (#1898)
- Native GPTQ Implementation with support for Selective Mixed Precision. (#1949)
- Blockwise RTN Quantization for ONNX models. (#1899)
- Ability to add custom metadata in ONNX model. (#1900)
- New simplified `olive optimize` CLI command and the `olive.quantize()` Python API for effortless model optimization with minimal developer input. See CLI usage and Python API docs for more details. (#1996)
- New command line `olive run-pass` provides advanced users ability to run individual passes. (#1904)
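As background for the blockwise RTN item above, the following is a minimal sketch of round-to-nearest quantization with one scale per block. It uses a symmetric scheme on a flat list of weights for simplicity; the function names are hypothetical, and Olive's pass may use a different scheme (for example asymmetric with zero points) and layout.

```python
def rtn_quantize(weights, block_size=4, bits=4):
    """Symmetric round-to-nearest quantization with one scale per block.

    Returns a list of (codes, scale) pairs, one per block.
    """
    qmax = (1 << (bits - 1)) - 1          # 7 for 4-bit symmetric codes
    blocks = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(v) for v in block)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        # Round each value to the nearest grid point and clamp to the code range.
        codes = [max(-qmax, min(qmax, round(v / scale))) for v in block]
        blocks.append((codes, scale))
    return blocks


def rtn_dequantize(blocks):
    """Map codes back to floats using each block's own scale."""
    return [code * scale for codes, scale in blocks for code in codes]


weights = [0.1, -0.2, 0.3, -0.4, 4.0, -3.0, 2.0, -1.0]
blocks = rtn_quantize(weights, block_size=4, bits=4)
restored = rtn_dequantize(blocks)
```

Because each block gets its own scale, the small-magnitude first block is not forced onto the coarse grid needed by the large-magnitude second block, and the per-element error stays within half a quantization step of that block.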
New Integrations
- GPTQModel. (#1999)
- AIMET (#2028). This is a work in progress.
- ONNX model support while targeting OpenVINO. (#2019)
- `QuarkQuantization`: AMD Quark quantization for LLMs. (#2010)
- `VitisGenerateModelLLM` for optimized LLM model generation for Vitis AI Execution Provider. (#2010)
Improvements
- New graph surgeries including `dla transformers`, `DecomposeRotaryEmbedding` and `DecomposeQuickGelu`. (#2018, #1972, #2000)
- Exposed `WorkflowOutput` in Python API and added unified APIs for CLI commands. (#1907)
- Refactored Docker system for simplified setup and execution. (#1990)
- ExtractAdapters:
  - Added support for DORA and LoHA adapters. (#1611)
- NVMO quantization:
- OnnxPeepholeOptimizer:
  - Removed `fuse_transpose_qat` and `patch_unsupported_argmax_operator`. (#1976)
Deprecation
Azure ML will be deprecated in the next release, including:
- Azure ML system
- Azure ML workspace model
- Remote workflow
Recipes Migration
All recipes are being migrated to the olive-recipes repository. New recipes will be added and maintained there going forward.
Olive-ai 0.9.1
Minor release to fix the following issues:
- OpenVINO Encapsulation pad_token_id fix (#1847)
- Add support for Nvidia TensorRT RTX execution provider in Olive (#1852)
- Basic support for ONNX auto EP selection introduced in onnxruntime v1.22.0 (#1854, #1863)
- Add Nvidia TensorRT-RTX Olive recipe for vit, clip and bert examples (#1858)
- gate optimum[openvino] version to <=1.24 (#1864)
Olive-ai 0.9.0
Feature Updates
- Implement lm-eval-harness based LLM quality evaluator for ONNX GenAI models #1720
- Update minimum supported target opset for ONNX to 17. #1741
- QDQ support for ModelBuilder pass #1736
- Refactor OnnxOpVersionConversion to conditionally use onnxscript version converter #1784
- HQQ Quantizer Pass #1799, #1835
- Introducing global definitions for Precision & PrecisionBits #1808
- Improvements in PeepholeOptimizer #1697, #1698
New Passes
- OnnxScriptFusion: ONNX script fusion
- OpenVINOEncapsulation, OpenVINOReshape, OpenVINOIoUpdate: OpenVINO encapsulation #1754
- TrtMatMulToConvTransform: Convert non-4D MatMul to Transpose-Conv-Transpose sequence
- OpenVINOOptimumConversion: Add optimum Intel® pass for converting a Huggingface Model to an OpenVINO Model
- Graph Surgeries
- MatMulAddToGemm: Graph surgery to transform a MatMul op followed by an Add op into a Gemm op
- PowReduceSumPowDiv2LpNorm: Graph surgery to merge Pow ReduceSum Pow Div pattern to L2Norm
- OnnxHqqQuantization: Implements 4-bit HQQ quantization
- VitisAIAddMetaData: Adds metadata to an ONNX model based on specified model attributes.
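The MatMul-plus-Add to Gemm rewrite listed under Graph Surgeries relies on the two forms computing the same result. The sketch below checks that equivalence in plain Python using the ONNX Gemm definition `Y = alpha * (A @ B) + beta * C`; the helper names are made up for this example and it does not touch an ONNX graph.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_bias(matrix, bias):
    """Add a 1-D bias to every row, as a broadcasting Add node would."""
    return [[value + b for value, b in zip(row, bias)] for row in matrix]


def gemm(a, b, c, alpha=1, beta=1):
    """Y = alpha * (A @ B) + beta * C, with C broadcast across rows."""
    return [[alpha * sum(x * y for x, y in zip(row, col)) + beta * cv
             for col, cv in zip(zip(*b), c)]
            for row in a]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [10, 20]

two_nodes = add_bias(matmul(A, B), C)   # MatMul followed by Add
one_node = gemm(A, B, C)                # single fused Gemm
```

Fusing the pair into one node leaves the output unchanged while giving the runtime a single op to schedule.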
New/Updated Examples
- Alibaba-NLP/gte #1695
- DeepSeek
  - OpenVINO #1786
- Google BERT
- Google VIT
- Intel BERT
- Laion Clip
- Llama3
  - OpenVINO #1786
- Meta Llama3
  - QDQ #1707
- OpenAI Clip (16 and 32)
- Phi3.5
- Phi4
  - OpenVINO #1828
- Qwen
- Resnet50
- Sentence Transformers CLIP
- Stable Diffusion
  - QDQ #1730
Deprecated Examples
Deprecated Passes
- InsertBeamSearchOp #1805
Olive-ai 0.8.0
New Features (Passes)
- `QuaRot` performs offline weight rotation
- `SpinQuant` performs offline weight rotation
- `StaticLLM` converts dynamic shaped llm into a static shaped llm for NPUs.
- `GraphSurgeries` applies surgeries to ONNX model. Surgeries are modular and individually configurable.
- `LoHa`, `LoKr` and `DoRA` finetuning
- `OnnxQuantizationPreprocess` applies quantization preprocessing.
- `EPContextBinaryGenerator` creates EP specific context binary onnx models.
- `ComposeOnnxModels` composes split onnx models.
- `OnnxIOFloat16ToFloat32` replaced with more generic `OnnxIODataTypeConverter`
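Offline weight rotation works because multiplying activations by an orthogonal matrix and the weights by its transpose leaves the layer output unchanged. The sketch below demonstrates only that invariance with a 2x2 rotation in plain Python; it is not the QuaRot or SpinQuant algorithm, and all names and values are illustrative.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(row) for row in zip(*m)]


theta = math.pi / 4
# A 2x2 rotation matrix is orthogonal: Q @ Q^T is the identity.
Q = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]

x = [[1.0, 100.0]]                 # activation row with one large outlier
W = [[0.5, -1.0], [2.0, 0.25]]     # layer weights

y_ref = matmul(x, W)               # original layer output

x_rot = matmul(x, Q)               # rotated activations
W_rot = matmul(transpose(Q), W)    # rotation folded into the weights offline
y_rot = matmul(x_rot, W_rot)       # (x Q)(Q^T W) = x W
```

In this example the rotation spreads the outlier across both channels, so the largest activation magnitude drops while the output stays the same, which is the property that makes rotated models easier to quantize.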
Command Line Interface
New command line tools have been added and existing tools have been improved.
- `generate_config_file` option to save the workflow config file.
- `extract-adapters` command to extract multiple adapters from a PyTorch model.
- Simplified `quantize` command
Improvements
- Better output model structure for workflow and CLI runs.
  - New `no_artifacts` options in workflow config to disable saving run artifacts such as footprints.
- Hf data preprocessing:
  - Dataset is truncated if `max_samples` is set.
  - Empty texts are filtered.
  - `padding_side` is configurable and defaults to `"right"`.
- `SplitModel` pass keeps QDQ nodes together in the same split.
- `OnnxPeepholeOptimizer`: constant folding + onnxoptimizer added.
- `CaptureSplitInfo`: Separate split for memory intensive module.
- `OnnxConversion`:
  - Dynamic shapes for dynamo export.
  - `optimize` option to perform constant folding and redundancies elimination on dynamo exported model.
- `GPTQ`: Default wikitext calibration dataset. Patch to support newer versions of `transformers`.
- `MatMulNBitsToQDQ`: `nodes_to_exclude` option.
- `SplitModel`: `split_assignments` option to provide custom split assignments.
- `CaptureSplitInfo`: `block_to_split` can be a single block (str) or multiple blocks (list).
- `OnnxMatMul4Quantizer`: Support onnxruntime 1.18+
- `OnnxQuantization`:
  - Support onnxruntime 1.18+.
  - `op_types_to_exclude` option.
  - `LLMAugmentedDataLoader` augments the calibration data for llms with kv cache and other missing inputs.
- New document theme and organization.
- Reimplement search logic to include passes in search space.
Examples:
- New QNN EP examples:
  - SLMs:
    - Phi-3.5
    - Deepseek R1 Distill
    - Llama 3.2
  - MobileNet
  - ResNet
  - CLIP VIT
  - BAAI/bge-small-en-v1.5
  - Table Transformer Detection
  - adetailer
- Deepseek R1 Distill Finetuning
- `timm` MobileNet
Olive-ai 0.7.1.1
Same as 0.7.1 with updated dependencies for nvmo extra and NVIDIA TensorRT Model Optimizer example doc.
Refer to the 0.7.1 release notes for other details.
Olive-ai 0.7.1
Command Line Interface
New command line tools have been added and existing tools have been improved.
- `olive --help` works as expected.
- `auto-opt`:
  - The command chooses a set of passes compatible with the provided model type, precision and accelerator information.
  - New options to split a model, either using `--num-splits` or `--cost-model`.
Improvements
- `ExtractAdapters`:
  - Support lora adapter nodes in Stable Diffusion unet or text-embedding models.
  - Default initializers for quantized adapter to run the model without adapter inputs.
- `GPTQ`:
  - Avoid saving unused bias weights (all zeros).
  - Set `use_exllama` to `False` by default to allow exporting and fine-tuning external GPTQ checkpoints.
- `AWQ`: Patch autoawq to run quantization on newer transformers versions.
- Atomic `SharedCache` operations
- New `CaptureSplitInfo` and `Split` passes to split models into components. Number of splits can be user provided or inferred from a cost model.
- `disable_search` is deprecated from pass configuration in an olive workflow config.
- `OrtSessionParamsTuning` redone to use olive search features.
- `OrtModelOptimizer` renamed to `OrtPeepholeOptimizer` and some bug fixes.
Examples:
- Stable Diffusion: New MultiLora Example
- Phi3: New int quantization example using `nvidia-modelopt`