
Conversation

@mengniwang95 (Contributor) commented Dec 26, 2025

User description

Type of Change

Example update

Description

KV cache quantization is supported.


PR Type

Enhancement


Description

  • Added support for KV cache quantization in the Llama4 example

  • Introduced a static_kv_dtype argument for key-value cache quantization

  • Updated the scripts to handle the static_kv_dtype parameter (see the usage sketch below)
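
As a quick usage sketch (the exact flag spelling and accepted values are assumptions; only an FP8 path is implied by the run_benchmark.sh changes), quantizing with a static KV cache dtype could look like:

  # Hypothetical invocation: --static_kv_dtype is the flag added by this PR;
  # fp8 is an assumed value inferred from the FP8 KV-cache handling below,
  # and the model flag is a placeholder, not copied from the diff.
  bash run_quant.sh --input_model=<llama4_model_path> --static_kv_dtype=fp8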


Diagram Walkthrough

flowchart LR
  A["Update main.py"] -- "Add static_kv_dtype" --> B["Modify setup_parser"]
  B -- "Pass static_kv_dtype" --> C["Update tune function"]
  C -- "Add static_kv_dtype arg" --> D["Modify run_benchmark.sh"]
  D -- "Handle kv_cache_dtype" --> E["Modify run_quant.sh"]
  E -- "Pass kv_cache_dtype" --> F["Update README.md"]

File Walkthrough

Relevant files
Enhancement
main.py
Enhance argument parsing and add KV cache quantization

examples/pytorch/multimodal-modeling/quantization/auto_round/llama4/main.py

  • Replaced BasicArgumentParser with standard argparse.ArgumentParser
  • Added static_kv_dtype argument for key-value cache quantization
  • Passed static_kv_dtype to the tune function (see the invocation sketch below)
+59/-24 
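
A minimal sketch of how the updated example might be launched directly, assuming standard argparse flags (--model is a placeholder name, not confirmed by the diff; --scheme and --static_kv_dtype come from this PR):

  # Hypothetical direct invocation of the updated example.
  # --scheme MXFP4 matches the default quoted in the review below;
  # --model and the fp8 value are assumptions for illustration.
  python main.py --model <llama4_model_path> --scheme MXFP4 --static_kv_dtype fp8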
run_benchmark.sh
Update benchmark script for KV quantization                           

examples/pytorch/multimodal-modeling/quantization/auto_round/llama4/run_benchmark.sh

  • Added --static_kv_dtype parameter parsing
  • Updated extra_model_args to include kv_cache_dtype
  • Added conditional logic for the FP8 KV cache (sketched below)
+12/-2   
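
A sketch of the conditional logic described above, assuming the argument-parsing style common to these example scripts (variable names are illustrative, not copied from the diff):

  # Parse --static_kv_dtype from the CLI, then only request an FP8 KV cache
  # from the benchmark backend when it was explicitly set.
  for var in "$@"; do
      case $var in
          --static_kv_dtype=*)
              static_kv_dtype=$(echo "$var" | cut -f2 -d=)
          ;;
      esac
  done
  if [ "$static_kv_dtype" = "fp8" ]; then
      extra_model_args="${extra_model_args},kv_cache_dtype=fp8"
  fi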
run_quant.sh
Update quantization script for KV quantization                     

examples/pytorch/multimodal-modeling/quantization/auto_round/llama4/run_quant.sh

  • Added --static_kv_dtype parameter parsing
  • Conditionally added static_kv_dtype to extra_cmd (sketched below)
+14/-6   
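
The pass-through described above might look like this sketch (variable names assumed from the script's existing extra_cmd convention):

  # Forward the flag to main.py only when the user supplied it, so the
  # default (no KV cache quantization) is left to the Python side.
  if [ -n "$static_kv_dtype" ]; then
      extra_cmd="${extra_cmd} --static_kv_dtype ${static_kv_dtype}"
  fi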
Documentation
README.md
Document KV quantization support                                                 

examples/pytorch/multimodal-modeling/quantization/auto_round/llama4/README.md

  • Updated documentation to reflect KV quantization support
+1/-0     

Signed-off-by: Mengni Wang <mengni.wang@intel.com>
@PRAgent4INC (Collaborator)

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Typo

There is a typo in the help text for the --scheme argument. It should be "quantization" instead of "quantizaion".

default="MXFP4",
type=str,
help="quantizaion scheme."
Redundant Argument

The reloading=False argument in the tune function call is redundant if False is already the default; remove it unless it deliberately overrides a non-default value.

reloading=False,

@PRAgent4INC (Collaborator)

PR Code Suggestions ✨
