Metal FlashAttention v2.5 w/ Neural Accelerators: delivering breakthrough performance on the Apple M5 chip

Draw Things brings breakthrough 4.6× performance gains to Apple silicon with Metal FlashAttention v2.5 w/ Neural Accelerators.

Nov 10, 2025

Draw Things is the fastest way to generate images or videos with your Apple silicon, locally and privately. Our work on Metal FlashAttention has been the bedrock of this claim.

The recent release of M5 debuts Draw Things with 3.5× performance improvements over M4. This was achieved through Apple’s MPSGraph API, first enabled in our 1.20250912.0 release. While MPSGraph API adequately implements Neural Accelerators support, it lacks the fine-grained memory and performance optimizations that we have developed through our ongoing Metal FlashAttention research.

Version 1.20251107.1 is our first release containing the preview version of Metal FlashAttention v2.5 w/ Neural Accelerators. It delivers up to 4.6× performance improvements on M5 over M41. With adequate cooling2, it sometimes outperforms M2 Max and narrows the gap to M3 Ultra. Its efficient memory management allows 5-second, 480p (448×768) video generation (with Wan 2.2 A14B models) on an M5 iPad with 16GiB of RAM.

We will discuss more Metal FlashAttention v2.5 w/ Neural Accelerators implementation details in a separate post on the Engineering@Draw Things channel, stay tuned!

Performance

Metal FlashAttention v2.5 w/ Neural Accelerators delivers the fastest performance on Apple’s non-Pro M-series chips, often rivaling or exceeding previous Max M-series chips. With several large image-generation models (FLUX.1 [schnell], a 12B-parameter model; Qwen Image, a 20B-parameter model; and HiDream, a 17B-parameter model), the M5 iPad can now generate high-resolution images in under a minute.

On A19 Pro, it improves on our previously reported numbers (with MPSGraph API), and now rivals M3 Pro (18c) performance.

2-Step Distilled Generation, End-to-End Time Spent (including loading model from SSD). This is updated from previous numbers.

2-Step Distilled Generation, Time Spent at 1 Sampling Step (excluding text encoding, image decoding or model loading). This is updated from previous numbers.

On M5 iPad, we observe 3.6× to 5.5× raw performance improvements over M4 iPad, and 3.3× to 4.6× end-to-end improvements. In the mid-range, M5 iPad performs as well as M2 Max3. At the high end, M5 iPad often runs about 80% slower than M3 Ultra (60c).4

4-Step Distilled Generation, End-to-End Time Spent (including loading model from SSD).

4-Step Distilled Generation, Time Spent at 1 Sampling Step (excluding text encoding, image decoding or model loading).

6-Step Distilled Generation, End-to-End Time Spent (including loading model from SSD). Significant time was spent on VAE decoding (~180s on iPad), which MFA v2.5 is not yet optimized for (3D convolution kernels).

Support

Metal FlashAttention v2.5 w/ Neural Accelerators leverages the Neural Accelerators across all key operators — including matrix multiplication, attention, and segmented matrix multiplication, the latter being essential for Mixture-of-Experts models.

Availability & Limitations

The source code of Metal FlashAttention v2.5 w/ Neural Accelerators is available at https://github.com/liuliu/ccv/tree/unstable/lib/nnc/mfa/v2. Draw Things 1.20251107.1 integrates the full implementation of Metal FlashAttention 2.5 w/ Neural Accelerators. It is currently released as a preview because there are still performance cliffs around odd attention sequence lengths and large head dimensions. BF16 support is turned off due to unsolved bugs. The neural accelerators-enabled shaders also take a significantly longer time to specialize (often 10s or more for the first generation). We expect to resolve these issues in future updates.

Edit (20251118): 1.20251117.1 now supports BF16 as well as binary artifacts cache for neural accelerators-enabled shaders.

The end-to-end result was obtained on an M5 iPad with the 1.20251107.1 build and iPadOS 26.1, compared against an M4 iPad with the same build and iPadOS 26.1, for 1280×1280 FLUX.1 [schnell] (5-bit) 4-step generation. For all measurements below, we ran the generation twice and took the second measurement.

We used an ice pad underneath the iPad for maximum cooling performance.

The end-to-end result was obtained on a Mac Studio with an M2 Max (38c) chip, using the 1.20251107.1 build and macOS 15.7.

The end-to-end result was obtained on a Mac Studio with an M3 Ultra (60c) chip, using the 1.20251107.1 build and macOS 15.6.1.

JustFunSoils

Nov 18

Thank you for all the attention to this app. Really great to see these improvements! One question, will there potentially a new macOS CLI binary release tied to this? I do not have a macOS dev environment at home, so need to rely on the binary releases. Thank you again!

Expand full comment

Xeo

Nov 12

I tested it on my iP17PM with SD3 Large 3.5 Q8, 20 Steps and it is around 2.4 x faster (6 minutes vs 14.5 minutes) iOS 26.1

Releases @ Draw Things

Discussion about this post

Ready for more?