Quantization with DirectML helps you scale further on Windows

DirectML support for Phi 3 mini launched last month and we’ve since made several improvements, unlocking more models and even better performance!

Developers can grab already quantized versions of Phi-3 mini (with variants for the 4k and 128k versions). They can now also get Phi 3 medium (4k and 128k)  and Mistral v0.2. Stay tuned for additional pre-quantized models! We’ve also shipped a gradio interface to make easier to test these models with the new ONNX Runtime Generate() API. Learn more.

Be sure to check out our Build sessions to learn more. See below for details.

See here to learn what our hardware vendor partners have to say:

What is quantization?

Memory bandwidth is often a bottleneck for getting models to run on entry-level and older hardware, especially when it comes to language models. This means that making language models smaller directly translates to increasing the breadth of devices developers can target.

There’s been a lot of research into reducing model size through quantization, a process that reduces the precision and therefore size of model weights.

Our goal is to ensure scalability, while also maintaining model accuracy, so we integrated support for models that have had Activation-Aware Quantization (AWQ) applied to them. AWQ is a technique that lets us reap the memory savings from quantization with only a minimal impact on accuracy. AWQ achieves this by identifying the top 1% of salient weights that are needed for maintaining model accuracy and then quantizes the remaining 99% of weights. This leads to much less accuracy loss with AWQ compared to other techniques.

The average person reads up to 5 words/second. Thanks to the significant memory wins from AWQ, Phi-3-mini runs at this speed or faster on older discrete GPUs and even laptop integrated GPUs. This translates into being able to run Phi-3-mini on hundreds of millions of devices!

Check out our Build talk below to see this in action!

Perplexity measurements

Perplexity is a measure used to quantify how well a model predicts a sample. Without getting into the math of it all, a lower perplexity score means the model is more certain about its predictions and suggests that the model’s probability distribution is closer to the true distribution of the data.

Perplexity can be thought of as a way to quantify the average number of branches in front of a model at each decision point. At each step, a lower perplexity would mean that the model has fewer, more confident choices to make, which reflects a more refined understanding of the topic. A higher perplexity would mean more, less confident choices and therefore choices that are less predictable, relevant, and/or varied in quality.

As you can see below our data shows that AWQ leads to a small loss in model accuracy with only a small increase in perplexity. In return, using AWQ means 4x smaller model weights, leading to a dramatic increase in the number of devices that can run Phi-3-mini!

Model variant Dataset Base model perplexity AWQ perplexity Difference
Phi3 mini 128k wikitext2 14.42 14.81 0.39
Phi3 mini 128k ptb 31.39 33.63 2.24
Phi3 mini 4k wikitext2 15.83 16.52 0.69
Phi3 mini 4k ptb 31.98 34.3 2.32

Learn more

Be sure check out the these sessions at Build to learn more:

Get Started

Check out the ONNX Runtime Generate() API repo to get started today: https://github.com/microsoft/onnxruntime-genai

See here for our chat app with a handy gradio interface: https://github.com/microsoft/onnxruntime-genai/tree/main/examples/chat_app

This lets developers choose from different types of language models that work best for their specific use case. Stay tuned for more!


We recommend upgrading to the latest drivers for the best performance.

Source: Windows Blog