<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[+inf]]></title><description><![CDATA[intelligence everywhere]]></description><link>https://inference.plus</link><image><url>https://substackcdn.com/image/fetch/$s_!gZ8m!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfe2d417-fd1f-47ea-80c8-c601cdf27c68_512x512.png</url><title>+inf</title><link>https://inference.plus</link></image><generator>Substack</generator><lastBuildDate>Wed, 15 Apr 2026 15:30:43 GMT</lastBuildDate><atom:link href="https://inference.plus/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Fluid Inference]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[fluidinferene@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[fluidinferene@substack.com]]></itunes:email><itunes:name><![CDATA[Brandon]]></itunes:name></itunes:owner><itunes:author><![CDATA[Brandon]]></itunes:author><googleplay:owner><![CDATA[fluidinferene@substack.com]]></googleplay:owner><googleplay:email><![CDATA[fluidinferene@substack.com]]></googleplay:email><googleplay:author><![CDATA[Brandon]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Teaching AI to Optimize AI Models for Edge Deployment]]></title><description><![CDATA[How our agent, M&#246;bius automated a Core ML port in ~12 h (vs. 
2 weeks), hit 0.99998 parity, and made it 3.5&#215; faster, while staying on the CPU.]]></description><link>https://inference.plus/p/teaching-ai-to-optimize-ai-models</link><guid isPermaLink="false">https://inference.plus/p/teaching-ai-to-optimize-ai-models</guid><dc:creator><![CDATA[Brandon]]></dc:creator><pubDate>Sat, 27 Sep 2025 04:47:07 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/9df1224c-9113-46d4-a9df-a5a0ba4f49a6_1280x500.png" length="0" type="image/png"/><content:encoded><![CDATA[<p><strong>TLDR:</strong> <em>We&#8217;re building M&#246;bius, an AI agent for model porting. In early tests, it helped convert Silero VAD from PyTorch to Core ML in 12 hours with human guidance, achieving 0.99998 accuracy parity and a 3.5&#215; CPU speedup. We&#8217;re extending this approach to OpenVINO, TensorRT, and other edge runtimes.</em></p><h1>Why voice activity detection matters for edge AI</h1><p>Voice activity detection (VAD) answers a simple question many times per second: does this audio slice contain speech? Because VADs rely on acoustics, they&#8217;re language-agnostic and typically under 5 MB, making them perfect for local, always-on use. They sit in front of automatic speech recognition (ASR), text-to-speech (TTS), and voice agent systems, improving quality and keeping latency predictable.</p><p>In this article, we show how we&#8217;re building M&#246;bius, our AI agent platform, to help developers bring modern AI models onto edge devices. Like Codex or Claude Code for general programming, M&#246;bius will work with a human. In this early test, we used our prototype to help convert Silero VAD, one of the most popular VAD models, from PyTorch to Apple&#8217;s native Core ML runtime. 
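</p><p>To make that per-frame decision concrete, here is a minimal sketch of how frame-level speech probabilities (the kind a VAD like Silero emits) are typically turned into speech/non-speech flags. The hysteresis thresholds below are illustrative, not Silero&#8217;s actual defaults.</p>

```python
def vad_gate(probs, on=0.6, off=0.4):
    """Turn per-frame speech probabilities into boolean speech flags.

    Hysteresis: enter 'speech' above `on`, leave only below `off`,
    which avoids flickering around a single threshold.
    Thresholds here are illustrative, not Silero's defaults.
    """
    speaking = False
    flags = []
    for p in probs:
        if not speaking and p >= on:
            speaking = True
        elif speaking and p <= off:
            speaking = False
        flags.append(speaking)
    return flags

print(vad_gate([0.1, 0.7, 0.5, 0.3, 0.8]))
# [False, True, True, False, True]
```

<p>Downstream components like ASR then run only on frames flagged as speech, which is what keeps always-on pipelines cheap.</p><p>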
While this example converts PyTorch to Core ML, we&#8217;re designing the approach to generalize to other runtimes like OpenVINO and TensorRT.</p><h1>Our manual conversion failed after two weeks</h1><p>Our team has ported several models, including <a href="https://huggingface.co/FluidInference/parakeet-tdt-0.6b-v3-coreml">FluidInference/parakeet-tdt-0.6b-v3-coreml</a> and <a href="https://huggingface.co/FluidInference/speaker-diarization-coreml">Pyannote models for speaker diarization</a>. We previously converted Silero VAD v5 manually. A team member without prior experience, mentored by a researcher with over 20 years in ML, spent about two weeks on it. The result was broken, with &#177;0.15 to 0.2 probability drift, no speed improvements, and inconsistent behavior.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ixpp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F244c7cee-d36e-47b7-a0ff-7a96550057e3_1600x441.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ixpp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F244c7cee-d36e-47b7-a0ff-7a96550057e3_1600x441.png 424w, https://substackcdn.com/image/fetch/$s_!Ixpp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F244c7cee-d36e-47b7-a0ff-7a96550057e3_1600x441.png 848w, https://substackcdn.com/image/fetch/$s_!Ixpp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F244c7cee-d36e-47b7-a0ff-7a96550057e3_1600x441.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Ixpp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F244c7cee-d36e-47b7-a0ff-7a96550057e3_1600x441.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ixpp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F244c7cee-d36e-47b7-a0ff-7a96550057e3_1600x441.png" width="1456" height="401" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/244c7cee-d36e-47b7-a0ff-7a96550057e3_1600x441.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:401,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ixpp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F244c7cee-d36e-47b7-a0ff-7a96550057e3_1600x441.png 424w, https://substackcdn.com/image/fetch/$s_!Ixpp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F244c7cee-d36e-47b7-a0ff-7a96550057e3_1600x441.png 848w, https://substackcdn.com/image/fetch/$s_!Ixpp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F244c7cee-d36e-47b7-a0ff-7a96550057e3_1600x441.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Ixpp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F244c7cee-d36e-47b7-a0ff-7a96550057e3_1600x441.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Our previous model&#8217;s probability distribution when run against the same test data used later. The blue and green curves are noticeably far apart.</figcaption></figure></div><p>This time, we tested our M&#246;bius prototype on v6 from scratch, with only select context from data we&#8217;ve collected. 
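</p><p>Throughout this post, parity between the original and ported models means comparing per-frame output probabilities. A minimal sketch of such a check, with made-up values rather than our actual test harness:</p>

```python
def parity_metrics(ref, test):
    """Compare per-frame speech probabilities from two model versions.

    Returns (max absolute drift, Pearson correlation). A faithful
    port shows drift near 0 and correlation near 1.0.
    """
    n = len(ref)
    drift = max(abs(a - b) for a, b in zip(ref, test))
    mr, mt = sum(ref) / n, sum(test) / n
    cov = sum((a - mr) * (b - mt) for a, b in zip(ref, test))
    sr = sum((a - mr) ** 2 for a in ref) ** 0.5
    st = sum((b - mt) ** 2 for b in test) ** 0.5
    return drift, cov / (sr * st)

# Made-up probabilities for illustration only.
pytorch_probs = [0.10, 0.82, 0.91, 0.15, 0.05]
coreml_probs = [0.11, 0.81, 0.90, 0.15, 0.05]
drift, corr = parity_metrics(pytorch_probs, coreml_probs)
print(round(drift, 2), round(corr, 3))
```

<p>A real harness runs comparisons like this over long, varied audio rather than a handful of frames.</p><p>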
Our goal wasn&#8217;t just another porting example but to demonstrate that given the right context and knowledge, an AI agent can search, reason, verify, and help optimize, turning a months-long process into hours while discovering optimizations most humans miss. Just as Vercel abstracted infrastructure for web apps, we want developers to bring models to edge applications just as easily.</p><h1>Why are we building M&#246;bius?</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Jm_x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facb4b737-4c44-463c-88d0-c6adfd9c3dbd_1054x1080.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Jm_x!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facb4b737-4c44-463c-88d0-c6adfd9c3dbd_1054x1080.gif 424w, https://substackcdn.com/image/fetch/$s_!Jm_x!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facb4b737-4c44-463c-88d0-c6adfd9c3dbd_1054x1080.gif 848w, https://substackcdn.com/image/fetch/$s_!Jm_x!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facb4b737-4c44-463c-88d0-c6adfd9c3dbd_1054x1080.gif 1272w, https://substackcdn.com/image/fetch/$s_!Jm_x!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facb4b737-4c44-463c-88d0-c6adfd9c3dbd_1054x1080.gif 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Jm_x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facb4b737-4c44-463c-88d0-c6adfd9c3dbd_1054x1080.gif" width="238" height="243.8709677419355" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/acb4b737-4c44-463c-88d0-c6adfd9c3dbd_1054x1080.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:1080,&quot;width&quot;:1054,&quot;resizeWidth&quot;:238,&quot;bytes&quot;:3853659,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://inference.plus/i/174669926?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facb4b737-4c44-463c-88d0-c6adfd9c3dbd_1054x1080.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Jm_x!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facb4b737-4c44-463c-88d0-c6adfd9c3dbd_1054x1080.gif 424w, https://substackcdn.com/image/fetch/$s_!Jm_x!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facb4b737-4c44-463c-88d0-c6adfd9c3dbd_1054x1080.gif 848w, https://substackcdn.com/image/fetch/$s_!Jm_x!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facb4b737-4c44-463c-88d0-c6adfd9c3dbd_1054x1080.gif 1272w, 
https://substackcdn.com/image/fetch/$s_!Jm_x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facb4b737-4c44-463c-88d0-c6adfd9c3dbd_1054x1080.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">M&#246;bius is named after <a href="https://en.wikipedia.org/wiki/M%C3%B6bius_strip">M&#246;bius strip</a></figcaption></figure></div><p>Our team&#8217;s background is in traditional ML and distributed systems, building recommendation systems and content moderation serving over a billion users. 
We&#8217;ve spent years scaling cloud infrastructure across AWS, Azure, and GCP. Edge systems are relatively new territory for us, but we&#8217;ve found that many concepts from distributed systems translate surprisingly well to edge optimization.</p><p>After tinkering with edge AI applications and deploying them to users, we quickly realized the hardware is catching up but the software layer lags far behind and isn&#8217;t improving fast enough. Running near-realtime workloads was too slow and drained battery life on most consumer CPUs and GPUs. Interestingly, enterprises and healthcare providers started reaching out, asking how they could run modern models locally for privacy and compliance reasons.</p><p>While some solutions exist for running local AI models on edge devices, most are only partially open or integrate poorly with native applications. We found this frustrating when building our own solution, so instead of waiting for others to solve the problem, we decided to tackle it ourselves and share our models and SDKs with everyone.</p><p>The fragmented ecosystem is evolving quickly, and new models are being released with different architectures. Instead of the traditional static compiler approach, we wanted to tackle this differently, using AI itself to solve the problem. That&#8217;s why we&#8217;re building M&#246;bius as an AI agent that captures these optimization patterns, making edge AI accessible to everyone, not just hardware or machine learning engineers. By bridging this gap, M&#246;bius lets teams leverage their existing distributed systems expertise while navigating the unique constraints of on-device inference.</p><h1>Core ML is more than Apple&#8217;s Neural Engine</h1><p>Before diving into how M&#246;bius solved this, it&#8217;s important to understand what Core ML really is. 
While it&#8217;s the only developer interface to Apple&#8217;s Neural Engine (ANE), Core ML is Apple&#8217;s complete on-device machine learning runtime, able to target the CPU, GPU, and ANE.</p><p>Core ML uses heuristics, operator support tables, and runtime constraints to decide where each part of your model runs. You get some control through the model configuration&#8217;s computeUnits setting, which lets you limit or prefer certain compute units, though placement is never guaranteed.</p><p>What makes Core ML powerful isn&#8217;t just hardware access but its deep ecosystem integration. <a href="https://developer.apple.com/videos/play/wwdc2022/10027">Ahead-of-time compilation</a> means models compile before runtime, helping with startup latency and enabling graph-level optimizations. The <a href="https://developer.apple.com/videos/play/wwdc2022/10027">unified memory architecture</a> lets the CPU, GPU, and ANE share the same memory pool, eliminating data copying between compute units.</p><p>For CPU execution, Core ML leverages Accelerate and <a href="https://developer.apple.com/videos/play/wwdc2024/10211">Basic Neural Network Subroutines (BNNS)</a> for many operations and can fuse adjacent ones. 
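</p><p>Core ML&#8217;s placement logic can be pictured as a per-operation cost lookup. The sketch below is purely illustrative: the op names, supported units, and costs are made up, and Apple&#8217;s real heuristics and support tables are not public.</p>

```python
# Toy compute-unit dispatcher: each op runs on the cheapest unit that
# both supports it and is allowed by the configuration.
# All ops, units, and costs here are hypothetical.
SUPPORT = {
    "conv":      {"cpu": 5.0, "gpu": 1.5, "ane": 1.0},
    "lstm":      {"cpu": 4.0, "gpu": 3.0},   # no ANE support
    "custom_op": {"cpu": 2.0},               # CPU fallback only
}

def dispatch(graph, allowed=("cpu", "gpu", "ane")):
    plan = {}
    for op in graph:
        costs = {u: c for u, c in SUPPORT[op].items() if u in allowed}
        plan[op] = min(costs, key=costs.get)  # cheapest permitted unit
    return plan

print(dispatch(["conv", "lstm", "custom_op"]))
# {'conv': 'ane', 'lstm': 'gpu', 'custom_op': 'cpu'}
print(dispatch(["conv", "lstm", "custom_op"], allowed=("cpu",)))
# {'conv': 'cpu', 'lstm': 'cpu', 'custom_op': 'cpu'}
```

<p>Restricting <code>allowed</code> plays the role of the computeUnits setting: you can force CPU-only execution, but you can&#8217;t force an op onto a unit that doesn&#8217;t support it.</p><p>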
It also supports <a href="https://apple.github.io/coremltools/docs-guides/source/opt-palettization-overview.html">weight palettization</a> as a compression technique, as well as more <a href="https://apple.github.io/coremltools/docs-guides/source/opt-quantization-overview.html">traditional quantization</a> depending on the OS version and target.<a href="https://developer.apple.com/videos/play/wwdc2022/10027"> Automatic compute unit selection</a> dispatches operations to the best unit based on support and cost.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!webq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70480806-8eb9-4085-a6fd-550361686e6f_1730x654.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!webq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70480806-8eb9-4085-a6fd-550361686e6f_1730x654.png 424w, https://substackcdn.com/image/fetch/$s_!webq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70480806-8eb9-4085-a6fd-550361686e6f_1730x654.png 848w, https://substackcdn.com/image/fetch/$s_!webq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70480806-8eb9-4085-a6fd-550361686e6f_1730x654.png 1272w, https://substackcdn.com/image/fetch/$s_!webq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70480806-8eb9-4085-a6fd-550361686e6f_1730x654.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!webq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70480806-8eb9-4085-a6fd-550361686e6f_1730x654.png" width="1456" height="550" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/70480806-8eb9-4085-a6fd-550361686e6f_1730x654.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:550,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Non-uniform lowering of precision using clustering&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Non-uniform lowering of precision using clustering" title="Non-uniform lowering of precision using clustering" srcset="https://substackcdn.com/image/fetch/$s_!webq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70480806-8eb9-4085-a6fd-550361686e6f_1730x654.png 424w, https://substackcdn.com/image/fetch/$s_!webq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70480806-8eb9-4085-a6fd-550361686e6f_1730x654.png 848w, https://substackcdn.com/image/fetch/$s_!webq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70480806-8eb9-4085-a6fd-550361686e6f_1730x654.png 1272w, https://substackcdn.com/image/fetch/$s_!webq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70480806-8eb9-4085-a6fd-550361686e6f_1730x654.png 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Non-uniform lowering of precision using clustering. 
https://apple.github.io/coremltools/docs-guides/source/opt-palettization-overview.html</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!okYc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81be53f-13d8-4975-8277-9c8578c16379_1344x832.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!okYc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81be53f-13d8-4975-8277-9c8578c16379_1344x832.png 424w, https://substackcdn.com/image/fetch/$s_!okYc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81be53f-13d8-4975-8277-9c8578c16379_1344x832.png 848w, https://substackcdn.com/image/fetch/$s_!okYc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81be53f-13d8-4975-8277-9c8578c16379_1344x832.png 1272w, https://substackcdn.com/image/fetch/$s_!okYc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81be53f-13d8-4975-8277-9c8578c16379_1344x832.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!okYc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81be53f-13d8-4975-8277-9c8578c16379_1344x832.png" width="1344" height="832" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a81be53f-13d8-4975-8277-9c8578c16379_1344x832.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:832,&quot;width&quot;:1344,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Uniform lowering of precision&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Uniform lowering of precision" title="Uniform lowering of precision" srcset="https://substackcdn.com/image/fetch/$s_!okYc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81be53f-13d8-4975-8277-9c8578c16379_1344x832.png 424w, https://substackcdn.com/image/fetch/$s_!okYc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81be53f-13d8-4975-8277-9c8578c16379_1344x832.png 848w, https://substackcdn.com/image/fetch/$s_!okYc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81be53f-13d8-4975-8277-9c8578c16379_1344x832.png 1272w, https://substackcdn.com/image/fetch/$s_!okYc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81be53f-13d8-4975-8277-9c8578c16379_1344x832.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" 
stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Process of quantizing to int8.https://apple.github.io/coremltools/docs-guides/source/opt-quantization-overview.html</figcaption></figure></div><p>As with any abstraction, there are trade-offs. Apple doesn&#8217;t expose all internal decision logic, and there are strict requirements for input/output dimensions, internal state, and static data structures. Not all operations convert the same, and you often have to remap operations to Core ML&#8217;s supported set.</p><h1>Model conversion requires translation, not just export</h1><p>Running a model in another runtime is like rewriting code in a different programming language. You&#8217;re taking the same logic and structure but lowering one operator graph into the target runtime&#8217;s supported operations without changing the math or state semantics.</p><p>This brings familiar challenges. Not all operators have direct equivalents. 
Most AI accelerators work best with static graphs and fixed input dimensions, so dynamic control flow and data-dependent shapes are hard to express. And tracing limitations mean not all operations export cleanly.</p><p>In practice we run into dimension limits, dynamic graphs, internal state, and unsupported operations. Despite these hurdles, porting to a native runtime is worth it. You gain access to vendor-tuned optimizations, and performance can be significantly better, sometimes even when the final model executes on CPU only, thanks to ahead-of-time graph compilation and optimized memory layouts.</p><p>Theoretically someone could craft custom kernels that outperform Core ML models, but those cases are extremely rare. On Apple devices, skipping Core ML means giving up Apple&#8217;s internal optimizations, which are opaque and otherwise inaccessible.</p><h1>Building M&#246;bius to help bring AI to the edge</h1><p>Rather than building new models or writing kernels from scratch, we&#8217;re developing M&#246;bius as a specialized AI agent to assist developers with model conversion. Like other coding agents, it&#8217;s designed to work alongside engineers, handling the repetitive and complex parts while humans provide guidance and verification. We&#8217;re creating an orchestration layer that combines frontier LLMs like Claude with domain-specific tooling and continuous learning through memory and models we&#8217;re training specifically for runtime optimization.</p><p>Each conversion teaches our system new patterns. We deliberately collect true positives: examples where the agent gets blocked, paired with the solutions that unblocked it. 
We&#8217;ve found this skews outputs toward positive outcomes more reliably than supplying true negatives, which tend to lead the agent down incorrect paths.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FP0z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd467af99-73db-4a8b-92d2-83d5c7152a5a_1600x597.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FP0z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd467af99-73db-4a8b-92d2-83d5c7152a5a_1600x597.png 424w, https://substackcdn.com/image/fetch/$s_!FP0z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd467af99-73db-4a8b-92d2-83d5c7152a5a_1600x597.png 848w, https://substackcdn.com/image/fetch/$s_!FP0z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd467af99-73db-4a8b-92d2-83d5c7152a5a_1600x597.png 1272w, https://substackcdn.com/image/fetch/$s_!FP0z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd467af99-73db-4a8b-92d2-83d5c7152a5a_1600x597.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FP0z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd467af99-73db-4a8b-92d2-83d5c7152a5a_1600x597.png" width="1456" height="543" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d467af99-73db-4a8b-92d2-83d5c7152a5a_1600x597.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:543,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FP0z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd467af99-73db-4a8b-92d2-83d5c7152a5a_1600x597.png 424w, https://substackcdn.com/image/fetch/$s_!FP0z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd467af99-73db-4a8b-92d2-83d5c7152a5a_1600x597.png 848w, https://substackcdn.com/image/fetch/$s_!FP0z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd467af99-73db-4a8b-92d2-83d5c7152a5a_1600x597.png 1272w, https://substackcdn.com/image/fetch/$s_!FP0z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd467af99-73db-4a8b-92d2-83d5c7152a5a_1600x597.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>While our roadmap leads toward more autonomy, our current prototype operates as an AI coding partner that can dramatically reduce the time needed to bring an AI model to an edge device. A developer still drives the process, but instead of spending weeks debugging numerical differences and rewriting operations, they can guide our agent through the conversion in hours. 
In essence, we&#8217;re democratizing access to expert-level model optimization, enabling any developer to achieve what previously required specialized ML engineering expertise.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3LJR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71a74aac-1179-4971-be8f-860bdc01f833_1294x672.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3LJR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71a74aac-1179-4971-be8f-860bdc01f833_1294x672.png 424w, https://substackcdn.com/image/fetch/$s_!3LJR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71a74aac-1179-4971-be8f-860bdc01f833_1294x672.png 848w, https://substackcdn.com/image/fetch/$s_!3LJR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71a74aac-1179-4971-be8f-860bdc01f833_1294x672.png 1272w, https://substackcdn.com/image/fetch/$s_!3LJR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71a74aac-1179-4971-be8f-860bdc01f833_1294x672.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3LJR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71a74aac-1179-4971-be8f-860bdc01f833_1294x672.png" width="1294" height="672" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/71a74aac-1179-4971-be8f-860bdc01f833_1294x672.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:672,&quot;width&quot;:1294,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3LJR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71a74aac-1179-4971-be8f-860bdc01f833_1294x672.png 424w, https://substackcdn.com/image/fetch/$s_!3LJR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71a74aac-1179-4971-be8f-860bdc01f833_1294x672.png 848w, https://substackcdn.com/image/fetch/$s_!3LJR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71a74aac-1179-4971-be8f-860bdc01f833_1294x672.png 1272w, https://substackcdn.com/image/fetch/$s_!3LJR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71a74aac-1179-4971-be8f-860bdc01f833_1294x672.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Mapping models and breaking apart black boxes</h2><p>Our prototype begins by understanding what it&#8217;s working with, gathering requirements from the user and clarifying the model&#8217;s use case. Typically, users need to provide access to the existing model as well as details about the desired platform and programming language for running the inference model.</p><p>To start our test, our team member provided a link to the Silero VAD GitHub repository and confirmation that the model would run on macOS 13+ and iOS 15+ as a prompt. If we increase the minimum OS version, we might get more optimizations, but users often lag behind in upgrades, so we prefer to support a wider number of users first.</p><p>The agent then tries to understand the model and researches how to optimize it based on the user&#8217;s requirements. 
After about 10 minutes of analysis, our prototype mapped out the model&#8217;s execution flow and chose a 16kHz sample rate, matching our existing Parakeet TDT v3 Core ML model for transcription.</p><p>The tricky part with Silero VAD is that it&#8217;s distributed in JIT format, pre-traced and heavily optimized for mobile and libtorch execution, which makes it harder to convert. JIT models are prepackaged versions of AI models built to run faster and more reliably, especially outside a developer&#8217;s laptop, but they lose the granular details of the original model.</p><p>The ONNX path failed after 30 minutes when the agent hit the If operation, which isn&#8217;t supported in Core ML. Since ONNX support is deprecated in Core ML Tools, the agent abandoned this approach after trying various version permutations.</p><p>Here&#8217;s where our prototype showed promise. After finding the <a href="https://github.com/FluidInference/mobius/blob/main/models/vad/silero-vad/coreml/convert-gguf.py">convert-gguf.py</a> script from <a href="https://github.com/ggml-org/whisper.cpp">whisper.cpp</a>, it realized it could instead rebuild the nn.Module classes from scratch in PyTorch using the provided weights. It loaded the weights, created separate Short-Time Fourier Transform (STFT), Encoder, and Decoder models, traced them individually, then verified each component matched the baseline. This was a strategy our manual attempt never considered.</p><h2>Finding and fixing three critical bugs</h2><p>The initial conversion looked successful, but when tested with real audio data, everything fell apart. The differences compared to the baseline model were significant.
For a VAD model, a difference of up to 30% is unusable.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rhsD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb1a9752-ee56-4d11-b254-cd4fb1cd1280_1600x882.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rhsD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb1a9752-ee56-4d11-b254-cd4fb1cd1280_1600x882.png 424w, https://substackcdn.com/image/fetch/$s_!rhsD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb1a9752-ee56-4d11-b254-cd4fb1cd1280_1600x882.png 848w, https://substackcdn.com/image/fetch/$s_!rhsD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb1a9752-ee56-4d11-b254-cd4fb1cd1280_1600x882.png 1272w, https://substackcdn.com/image/fetch/$s_!rhsD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb1a9752-ee56-4d11-b254-cd4fb1cd1280_1600x882.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rhsD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb1a9752-ee56-4d11-b254-cd4fb1cd1280_1600x882.png" width="1456" height="803" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cb1a9752-ee56-4d11-b254-cd4fb1cd1280_1600x882.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:803,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rhsD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb1a9752-ee56-4d11-b254-cd4fb1cd1280_1600x882.png 424w, https://substackcdn.com/image/fetch/$s_!rhsD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb1a9752-ee56-4d11-b254-cd4fb1cd1280_1600x882.png 848w, https://substackcdn.com/image/fetch/$s_!rhsD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb1a9752-ee56-4d11-b254-cd4fb1cd1280_1600x882.png 1272w, https://substackcdn.com/image/fetch/$s_!rhsD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb1a9752-ee56-4d11-b254-cd4fb1cd1280_1600x882.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Accuracy compared to baseline model over time</figcaption></figure></div><p>After prompting our agent to review its work, we discovered it had been comparing against the simplified baseline, not the actual PyTorch JIT model. This is a common problem with coding agents. 
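Catching this required comparing the ported outputs directly against the true PyTorch JIT baseline. A minimal sketch of that kind of parity check (the metric and sample numbers here are illustrative, not the actual test harness):

```python
# Illustrative parity check between ported-model outputs and the baseline.
# The specific metric and probabilities below are hypothetical; the post's
# 0.99998 parity figure comes from the real verification harness.
import math

def parity(baseline, ported):
    """Pearson correlation between two equal-length output sequences."""
    n = len(baseline)
    mb, mp = sum(baseline) / n, sum(ported) / n
    cov = sum((b - mb) * (p - mp) for b, p in zip(baseline, ported))
    sb = math.sqrt(sum((b - mb) ** 2 for b in baseline))
    sp = math.sqrt(sum((p - mp) ** 2 for p in ported))
    return cov / (sb * sp)

def max_abs_diff(baseline, ported):
    """Worst-case per-chunk disagreement in speech probability."""
    return max(abs(b - p) for b, p in zip(baseline, ported))

baseline = [0.02, 0.91, 0.88, 0.05, 0.97]  # per-chunk speech probabilities
ported = [0.02, 0.90, 0.89, 0.05, 0.96]
print(parity(baseline, ported), max_abs_diff(baseline, ported))
```

The key design point is comparing against the real JIT model's outputs on real audio, not against a simplified re-implementation.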
Once corrected, the agent systematically identified three root causes that compounded into complete failure.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GW7d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d77be51-e678-4e22-b0e6-42c0c8df0e4f_1294x476.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GW7d!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d77be51-e678-4e22-b0e6-42c0c8df0e4f_1294x476.png 424w, https://substackcdn.com/image/fetch/$s_!GW7d!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d77be51-e678-4e22-b0e6-42c0c8df0e4f_1294x476.png 848w, https://substackcdn.com/image/fetch/$s_!GW7d!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d77be51-e678-4e22-b0e6-42c0c8df0e4f_1294x476.png 1272w, https://substackcdn.com/image/fetch/$s_!GW7d!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d77be51-e678-4e22-b0e6-42c0c8df0e4f_1294x476.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GW7d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d77be51-e678-4e22-b0e6-42c0c8df0e4f_1294x476.png" width="1294" height="476" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6d77be51-e678-4e22-b0e6-42c0c8df0e4f_1294x476.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:476,&quot;width&quot;:1294,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GW7d!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d77be51-e678-4e22-b0e6-42c0c8df0e4f_1294x476.png 424w, https://substackcdn.com/image/fetch/$s_!GW7d!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d77be51-e678-4e22-b0e6-42c0c8df0e4f_1294x476.png 848w, https://substackcdn.com/image/fetch/$s_!GW7d!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d77be51-e678-4e22-b0e6-42c0c8df0e4f_1294x476.png 1272w, https://substackcdn.com/image/fetch/$s_!GW7d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d77be51-e678-4e22-b0e6-42c0c8df0e4f_1294x476.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This level of systematic debugging demonstrates the potential value of AI-assisted model porting. Our manual attempt, even with help from ChatGPT and Claude, had tried band-aid solutions like Squeeze-and-Excitation modifications without identifying these fundamental issues. Working with an agent that has the right context and tooling helped us understand the real problems rather than applying surface-level fixes.</p><h2>Speeding up with 256ms windows</h2><p>With correctness achieved, our prototype focused on speed. The initial Core ML model wasn&#8217;t showing significant improvements over PyTorch JIT, defeating one purpose of the conversion.</p><p>Here&#8217;s where the agent helped make a crucial discovery our team initially missed: processing 256ms windows instead of 32ms chunks provides dramatic speedups. This isn&#8217;t traditional batching, where you process multiple independent samples in parallel.
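Concretely, at the 16kHz sample rate a 32ms chunk is 512 samples and a 256ms window is 4,096, so one call covers eight chunks. A small sketch of the arithmetic (pure Python, illustrative only):

```python
# Window arithmetic for the 256 ms strategy: one model call per window
# instead of one call per 32 ms chunk. Values follow from the 16 kHz rate.
SAMPLE_RATE = 16_000   # Hz
CHUNK_MS = 32          # per-inference hop in the original streaming setup
WINDOW_MS = 256        # longer window passed to the model in a single call

chunk_samples = SAMPLE_RATE * CHUNK_MS // 1000     # 512
window_samples = SAMPLE_RATE * WINDOW_MS // 1000   # 4096
chunks_per_call = window_samples // chunk_samples  # 8

def split_windows(audio, size):
    """Split audio into consecutive fixed-size windows, dropping a trailing partial."""
    return [audio[i:i + size] for i in range(0, len(audio) - size + 1, size)]

ten_seconds = [0.0] * (SAMPLE_RATE * 10)
calls_chunked = len(split_windows(ten_seconds, chunk_samples))
calls_windowed = len(split_windows(ten_seconds, window_samples))
print(chunk_samples, window_samples, chunks_per_call)  # 512 4096 8
print(calls_chunked, calls_windowed)                   # 312 39
```

Eight times fewer dispatches into the runtime is where the overhead savings come from.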
It&#8217;s simply passing longer audio windows to the model in a single call.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yjhA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa04c6750-cd70-40ff-af83-163df6e5b11f_1591x195.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yjhA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa04c6750-cd70-40ff-af83-163df6e5b11f_1591x195.png 424w, https://substackcdn.com/image/fetch/$s_!yjhA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa04c6750-cd70-40ff-af83-163df6e5b11f_1591x195.png 848w, https://substackcdn.com/image/fetch/$s_!yjhA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa04c6750-cd70-40ff-af83-163df6e5b11f_1591x195.png 1272w, https://substackcdn.com/image/fetch/$s_!yjhA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa04c6750-cd70-40ff-af83-163df6e5b11f_1591x195.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yjhA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa04c6750-cd70-40ff-af83-163df6e5b11f_1591x195.png" width="1456" height="178" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a04c6750-cd70-40ff-af83-163df6e5b11f_1591x195.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:178,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yjhA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa04c6750-cd70-40ff-af83-163df6e5b11f_1591x195.png 424w, https://substackcdn.com/image/fetch/$s_!yjhA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa04c6750-cd70-40ff-af83-163df6e5b11f_1591x195.png 848w, https://substackcdn.com/image/fetch/$s_!yjhA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa04c6750-cd70-40ff-af83-163df6e5b11f_1591x195.png 1272w, https://substackcdn.com/image/fetch/$s_!yjhA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa04c6750-cd70-40ff-af83-163df6e5b11f_1591x195.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Visualization of windows versus individual calls</figcaption></figure></div><p><strong>The speedup comes from Core ML&#8217;s ahead-of-time graph optimizations, FP16 execution with BNNS and Accelerate, and a single-call unified graph that avoids per-window dispatch overhead. 
This results in better cache utilization and optimized memory patterns.</strong></p><p>To combine the eight probability outputs from a 256ms window, the agent chose noisy-OR over mean or max, treating each 32ms slice as an independent expert vote: the window&#8217;s speech probability becomes 1 &#8722; (1 &#8722; p1)(1 &#8722; p2)&#8230;(1 &#8722; p8). Noisy-OR better models whether voice existed in the last 256ms: unlike max, it avoids overreacting to spurious spikes; unlike mean, it preserves sensitivity to brief speech. This strategy, enabled by Core ML&#8217;s unified graph execution, led to a 3.5&#215; improvement in inference speed.</p><h2>Why this model runs faster on CPU than ANE</h2><p>The agent also tried quantization and palettization, reducing the model from float32 to float16 or int8. For most large language models, quantization provides significant speedups, but that&#8217;s generally untrue for Core ML models. For Silero VAD, at around 2MB, memory isn&#8217;t the bottleneck, and the overhead of dequantization exceeds any savings for a model this size.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SCVT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa04c8166-0dd8-45c9-b8b5-c9d4ef414452_1600x1193.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SCVT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa04c8166-0dd8-45c9-b8b5-c9d4ef414452_1600x1193.png 424w, https://substackcdn.com/image/fetch/$s_!SCVT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa04c8166-0dd8-45c9-b8b5-c9d4ef414452_1600x1193.png 848w,
https://substackcdn.com/image/fetch/$s_!SCVT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa04c8166-0dd8-45c9-b8b5-c9d4ef414452_1600x1193.png 1272w, https://substackcdn.com/image/fetch/$s_!SCVT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa04c8166-0dd8-45c9-b8b5-c9d4ef414452_1600x1193.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SCVT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa04c8166-0dd8-45c9-b8b5-c9d4ef414452_1600x1193.png" width="1456" height="1086" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a04c8166-0dd8-45c9-b8b5-c9d4ef414452_1600x1193.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1086,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SCVT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa04c8166-0dd8-45c9-b8b5-c9d4ef414452_1600x1193.png 424w, https://substackcdn.com/image/fetch/$s_!SCVT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa04c8166-0dd8-45c9-b8b5-c9d4ef414452_1600x1193.png 848w, 
https://substackcdn.com/image/fetch/$s_!SCVT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa04c8166-0dd8-45c9-b8b5-c9d4ef414452_1600x1193.png 1272w, https://substackcdn.com/image/fetch/$s_!SCVT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa04c8166-0dd8-45c9-b8b5-c9d4ef414452_1600x1193.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Performance comparison</figcaption></figure></div><p>Like any optimization technique, there&#8217;s always a sacrifice in accuracy, and how much depends on the model&#8217;s architecture. These trade-offs become especially apparent when comparing model size against latency, and accuracy against compression rate. </p><p>Note that there are other techniques, like <a href="https://apple.github.io/coremltools/docs-guides/source/opt-pruning-overview.html">pruning</a>, that the agent didn&#8217;t try; this is something we plan to investigate further. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OxOj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf80a093-edca-4ded-a935-b0fddee05c1c_1600x1070.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OxOj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf80a093-edca-4ded-a935-b0fddee05c1c_1600x1070.png 424w, https://substackcdn.com/image/fetch/$s_!OxOj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf80a093-edca-4ded-a935-b0fddee05c1c_1600x1070.png 848w, https://substackcdn.com/image/fetch/$s_!OxOj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf80a093-edca-4ded-a935-b0fddee05c1c_1600x1070.png 1272w, https://substackcdn.com/image/fetch/$s_!OxOj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf80a093-edca-4ded-a935-b0fddee05c1c_1600x1070.png 1456w" sizes="100vw"><img
src="https://substackcdn.com/image/fetch/$s_!OxOj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf80a093-edca-4ded-a935-b0fddee05c1c_1600x1070.png" width="1456" height="974" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cf80a093-edca-4ded-a935-b0fddee05c1c_1600x1070.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:974,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!OxOj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf80a093-edca-4ded-a935-b0fddee05c1c_1600x1070.png 424w, https://substackcdn.com/image/fetch/$s_!OxOj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf80a093-edca-4ded-a935-b0fddee05c1c_1600x1070.png 848w, https://substackcdn.com/image/fetch/$s_!OxOj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf80a093-edca-4ded-a935-b0fddee05c1c_1600x1070.png 1272w, https://substackcdn.com/image/fetch/$s_!OxOj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf80a093-edca-4ded-a935-b0fddee05c1c_1600x1070.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Accuracy comparison</figcaption></figure></div><p>When we profiled the mlpackage using Xcode, the results were unexpected. The Core ML model ran 100% on CPU. 
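</p><p>One way to corroborate this is to time the model under each compute-unit setting and compare. The following is a minimal, pure-Python sketch of such a harness; the two predict functions are hypothetical stand-ins for real Core ML predictions made with cpuOnly and cpuAndNeuralEngine compute units.</p>

```python
import statistics
import time

def mean_latency_ms(predict, warmup=3, runs=20):
    """Return the mean wall-clock latency of predict() in milliseconds."""
    for _ in range(warmup):
        predict()  # warm caches and trigger any lazy initialization
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        predict()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples)

# Hypothetical stand-ins: in practice these would wrap MLModel predictions
# on models loaded with cpuOnly and cpuAndNeuralEngine compute units.
def predict_cpu_only():
    sum(i * i for i in range(10_000))

def predict_cpu_and_ne():
    sum(i * i for i in range(10_000))

delta_ms = abs(mean_latency_ms(predict_cpu_only) - mean_latency_ms(predict_cpu_and_ne))
# A near-zero delta suggests the Neural Engine is not actually being exercised.
```

<p>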
We confirmed this by comparing latency between CPU-only and CPU+NE settings, which showed almost identical performance, with &lt; 0.05ms difference.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UmvX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7903eb4d-63e9-4123-a4d4-cb0e4b755347_1600x531.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UmvX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7903eb4d-63e9-4123-a4d4-cb0e4b755347_1600x531.png 424w, https://substackcdn.com/image/fetch/$s_!UmvX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7903eb4d-63e9-4123-a4d4-cb0e4b755347_1600x531.png 848w, https://substackcdn.com/image/fetch/$s_!UmvX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7903eb4d-63e9-4123-a4d4-cb0e4b755347_1600x531.png 1272w, https://substackcdn.com/image/fetch/$s_!UmvX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7903eb4d-63e9-4123-a4d4-cb0e4b755347_1600x531.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UmvX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7903eb4d-63e9-4123-a4d4-cb0e4b755347_1600x531.png" width="1456" height="483" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7903eb4d-63e9-4123-a4d4-cb0e4b755347_1600x531.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:483,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!UmvX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7903eb4d-63e9-4123-a4d4-cb0e4b755347_1600x531.png 424w, https://substackcdn.com/image/fetch/$s_!UmvX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7903eb4d-63e9-4123-a4d4-cb0e4b755347_1600x531.png 848w, https://substackcdn.com/image/fetch/$s_!UmvX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7903eb4d-63e9-4123-a4d4-cb0e4b755347_1600x531.png 1272w, https://substackcdn.com/image/fetch/$s_!UmvX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7903eb4d-63e9-4123-a4d4-cb0e4b755347_1600x531.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Accuracy and compression trade-off graph generated by M&#246;bius</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lKDX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F038d3b48-83c1-4c9c-b464-f981c3a2ef09_834x275.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lKDX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F038d3b48-83c1-4c9c-b464-f981c3a2ef09_834x275.png 424w, https://substackcdn.com/image/fetch/$s_!lKDX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F038d3b48-83c1-4c9c-b464-f981c3a2ef09_834x275.png 848w, 
https://substackcdn.com/image/fetch/$s_!lKDX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F038d3b48-83c1-4c9c-b464-f981c3a2ef09_834x275.png 1272w, https://substackcdn.com/image/fetch/$s_!lKDX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F038d3b48-83c1-4c9c-b464-f981c3a2ef09_834x275.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lKDX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F038d3b48-83c1-4c9c-b464-f981c3a2ef09_834x275.png" width="834" height="275" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/038d3b48-83c1-4c9c-b464-f981c3a2ef09_834x275.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:275,&quot;width&quot;:834,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lKDX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F038d3b48-83c1-4c9c-b464-f981c3a2ef09_834x275.png 424w, https://substackcdn.com/image/fetch/$s_!lKDX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F038d3b48-83c1-4c9c-b464-f981c3a2ef09_834x275.png 848w, 
https://substackcdn.com/image/fetch/$s_!lKDX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F038d3b48-83c1-4c9c-b464-f981c3a2ef09_834x275.png 1272w, https://substackcdn.com/image/fetch/$s_!lKDX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F038d3b48-83c1-4c9c-b464-f981c3a2ef09_834x275.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Xcode performance benchmark</figcaption></figure></div><p>This surprised us, as we had expected some operations to run 
on the ANE. The profiling revealed that while the operations could technically be supported by both GPU and ANE, Core ML&#8217;s execution engine determined that CPU-only execution was optimal.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!c4tJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fb7ac3e-b59a-4ba7-88b2-04f3a1e23932_883x491.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!c4tJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fb7ac3e-b59a-4ba7-88b2-04f3a1e23932_883x491.png 424w, https://substackcdn.com/image/fetch/$s_!c4tJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fb7ac3e-b59a-4ba7-88b2-04f3a1e23932_883x491.png 848w, https://substackcdn.com/image/fetch/$s_!c4tJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fb7ac3e-b59a-4ba7-88b2-04f3a1e23932_883x491.png 1272w, https://substackcdn.com/image/fetch/$s_!c4tJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fb7ac3e-b59a-4ba7-88b2-04f3a1e23932_883x491.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!c4tJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fb7ac3e-b59a-4ba7-88b2-04f3a1e23932_883x491.png" width="883" height="491" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6fb7ac3e-b59a-4ba7-88b2-04f3a1e23932_883x491.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:491,&quot;width&quot;:883,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!c4tJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fb7ac3e-b59a-4ba7-88b2-04f3a1e23932_883x491.png 424w, https://substackcdn.com/image/fetch/$s_!c4tJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fb7ac3e-b59a-4ba7-88b2-04f3a1e23932_883x491.png 848w, https://substackcdn.com/image/fetch/$s_!c4tJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fb7ac3e-b59a-4ba7-88b2-04f3a1e23932_883x491.png 1272w, https://substackcdn.com/image/fetch/$s_!c4tJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fb7ac3e-b59a-4ba7-88b2-04f3a1e23932_883x491.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Grey checkmarks mean they could technically support GPU and Neural Engine but Core ML chose CPU</figcaption></figure></div><p>The reason likely comes down to model size. At just 2MB, the overhead of transferring data to the ANE would exceed any computational benefits. Operations like STFT and custom convolutions are simply more efficient on CPU at this scale. Unfortunately, Apple provides limited documentation on Core ML&#8217;s internal decision-making, so this analysis is based on our best understanding of the framework&#8217;s behavior.</p><h2>From two weeks to twelve hours</h2><p>Despite running entirely on CPU, converting to the native runtime format still unlocked significant performance optimizations. The results speak for themselves.</p><p>Using our M&#246;bius prototype, development time dropped from two weeks to 12 hours, a 93% reduction. 
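</p><p>Parity between the original and converted models can be quantified with a per-frame correlation plus a maximum-drift check. Here is a minimal sketch in pure Python, using hypothetical per-frame speech probabilities in place of real PyTorch and Core ML outputs:</p>

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def max_drift(a, b):
    """Largest absolute per-frame disagreement between the two models."""
    return max(abs(x - y) for x, y in zip(a, b))

# Hypothetical per-frame speech probabilities from the two runtimes.
reference = [0.01, 0.02, 0.85, 0.97, 0.94, 0.12, 0.03]
converted = [0.01, 0.02, 0.86, 0.97, 0.93, 0.12, 0.03]

parity = pearson(reference, converted)   # close to 1.0 for a faithful port
drift = max_drift(reference, converted)  # the broken manual port drifted 0.15-0.2
```

<p>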
The human developer still made key decisions and verified results, but the AI agent handled the heavy lifting of debugging, testing variations, and systematically exploring optimizations. Accuracy remained near-perfect with 0.99998 correlation, compared to the broken manual attempt that suffered from &#177;0.15 to 0.2 drift on a 60-minute video with multiple speakers and varying volume. Most importantly, we achieved a 3.5&#215; speedup versus no improvement in the manual conversion.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YrbD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F016361a6-36b2-4b1a-b139-17395c50cead_1600x911.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YrbD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F016361a6-36b2-4b1a-b139-17395c50cead_1600x911.png 424w, https://substackcdn.com/image/fetch/$s_!YrbD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F016361a6-36b2-4b1a-b139-17395c50cead_1600x911.png 848w, https://substackcdn.com/image/fetch/$s_!YrbD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F016361a6-36b2-4b1a-b139-17395c50cead_1600x911.png 1272w, https://substackcdn.com/image/fetch/$s_!YrbD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F016361a6-36b2-4b1a-b139-17395c50cead_1600x911.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!YrbD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F016361a6-36b2-4b1a-b139-17395c50cead_1600x911.png" width="1456" height="829" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/016361a6-36b2-4b1a-b139-17395c50cead_1600x911.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:829,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YrbD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F016361a6-36b2-4b1a-b139-17395c50cead_1600x911.png 424w, https://substackcdn.com/image/fetch/$s_!YrbD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F016361a6-36b2-4b1a-b139-17395c50cead_1600x911.png 848w, https://substackcdn.com/image/fetch/$s_!YrbD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F016361a6-36b2-4b1a-b139-17395c50cead_1600x911.png 1272w, https://substackcdn.com/image/fetch/$s_!YrbD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F016361a6-36b2-4b1a-b139-17395c50cead_1600x911.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Quality comparison generated by M&#246;bius</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!o2ck!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f378f8d-fbc1-4862-8ab5-ae1e1ddd4849_1600x803.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!o2ck!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f378f8d-fbc1-4862-8ab5-ae1e1ddd4849_1600x803.png 424w, 
https://substackcdn.com/image/fetch/$s_!o2ck!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f378f8d-fbc1-4862-8ab5-ae1e1ddd4849_1600x803.png 848w, https://substackcdn.com/image/fetch/$s_!o2ck!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f378f8d-fbc1-4862-8ab5-ae1e1ddd4849_1600x803.png 1272w, https://substackcdn.com/image/fetch/$s_!o2ck!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f378f8d-fbc1-4862-8ab5-ae1e1ddd4849_1600x803.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!o2ck!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f378f8d-fbc1-4862-8ab5-ae1e1ddd4849_1600x803.png" width="1456" height="731" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8f378f8d-fbc1-4862-8ab5-ae1e1ddd4849_1600x803.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:731,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!o2ck!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f378f8d-fbc1-4862-8ab5-ae1e1ddd4849_1600x803.png 424w, 
https://substackcdn.com/image/fetch/$s_!o2ck!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f378f8d-fbc1-4862-8ab5-ae1e1ddd4849_1600x803.png 848w, https://substackcdn.com/image/fetch/$s_!o2ck!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f378f8d-fbc1-4862-8ab5-ae1e1ddd4849_1600x803.png 1272w, https://substackcdn.com/image/fetch/$s_!o2ck!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f378f8d-fbc1-4862-8ab5-ae1e1ddd4849_1600x803.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Performance comparison generated by M&#246;bius</figcaption></figure></div><p>What makes this promising is that the agent didn&#8217;t just speed up human work. Working together, the human-AI team discovered optimizations we had missed in our manual attempt: the 256ms window processing with noisy-OR aggregation wasn&#8217;t in our manual playbook. The agent systematically explored the solution space, verified each step, and found a better path.</p><p>For use cases that need a decision every 32ms, the 256ms window isn&#8217;t feasible. Even so, our tests show that the 32ms Core ML model runs at 33 RTFx, approximately 8% faster than the baseline PyTorch model, with far less latency jitter.</p><h1>What this means for edge AI</h1><p>While our prototype excels at converting simpler models like VAD, more complex architectures present challenges we&#8217;re still working to solve. Models with dynamic shapes, like Kokoro, still require significant human intervention. We believe these challenges are solvable as M&#246;bius matures: we&#8217;re extending it to handle these harder cases and exploring how it can rewrite certain operations to better utilize the ANE.</p><p>To understand the complexity gap, consider that Kokoro contains over 2,000 operations compared to Silero VAD&#8217;s 48 operations. 
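</p><p>The noisy-OR aggregation mentioned above is simple to state: a 256ms window contains speech unless every one of its 32ms chunks is silent. A minimal sketch, with hypothetical per-chunk probabilities:</p>

```python
def noisy_or(chunk_probs):
    """Aggregate per-chunk speech probabilities into one window-level probability.

    Under the noisy-OR model the window is silent only if every chunk is
    silent, so p(speech) = 1 - prod(1 - p_i).
    """
    p_silent = 1.0
    for p in chunk_probs:
        p_silent *= (1.0 - p)
    return 1.0 - p_silent

# A 256 ms window covers eight 32 ms chunks (hypothetical probabilities).
window = [0.05, 0.04, 0.06, 0.90, 0.95, 0.88, 0.07, 0.05]
speech_prob = noisy_or(window)  # dominated by the three confident chunks
```

<p>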
The model structure visualizations from Xcode clearly illustrate this dramatic difference in architectural complexity.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!M3Hq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F015ae090-a3c3-45eb-9a3f-5fd3f7629973_1600x320.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!M3Hq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F015ae090-a3c3-45eb-9a3f-5fd3f7629973_1600x320.png 424w, https://substackcdn.com/image/fetch/$s_!M3Hq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F015ae090-a3c3-45eb-9a3f-5fd3f7629973_1600x320.png 848w, https://substackcdn.com/image/fetch/$s_!M3Hq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F015ae090-a3c3-45eb-9a3f-5fd3f7629973_1600x320.png 1272w, https://substackcdn.com/image/fetch/$s_!M3Hq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F015ae090-a3c3-45eb-9a3f-5fd3f7629973_1600x320.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!M3Hq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F015ae090-a3c3-45eb-9a3f-5fd3f7629973_1600x320.png" width="1456" height="291" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/015ae090-a3c3-45eb-9a3f-5fd3f7629973_1600x320.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:291,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!M3Hq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F015ae090-a3c3-45eb-9a3f-5fd3f7629973_1600x320.png 424w, https://substackcdn.com/image/fetch/$s_!M3Hq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F015ae090-a3c3-45eb-9a3f-5fd3f7629973_1600x320.png 848w, https://substackcdn.com/image/fetch/$s_!M3Hq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F015ae090-a3c3-45eb-9a3f-5fd3f7629973_1600x320.png 1272w, https://substackcdn.com/image/fetch/$s_!M3Hq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F015ae090-a3c3-45eb-9a3f-5fd3f7629973_1600x320.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Model Structure of VAD by Xcode</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!hwRU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffddbaa3-9144-4095-b0eb-825594f71928_1600x440.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hwRU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffddbaa3-9144-4095-b0eb-825594f71928_1600x440.png 424w, https://substackcdn.com/image/fetch/$s_!hwRU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffddbaa3-9144-4095-b0eb-825594f71928_1600x440.png 848w, https://substackcdn.com/image/fetch/$s_!hwRU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffddbaa3-9144-4095-b0eb-825594f71928_1600x440.png 1272w, https://substackcdn.com/image/fetch/$s_!hwRU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffddbaa3-9144-4095-b0eb-825594f71928_1600x440.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hwRU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffddbaa3-9144-4095-b0eb-825594f71928_1600x440.png" width="1456" height="400" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ffddbaa3-9144-4095-b0eb-825594f71928_1600x440.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:400,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hwRU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffddbaa3-9144-4095-b0eb-825594f71928_1600x440.png 424w, https://substackcdn.com/image/fetch/$s_!hwRU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffddbaa3-9144-4095-b0eb-825594f71928_1600x440.png 848w, https://substackcdn.com/image/fetch/$s_!hwRU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffddbaa3-9144-4095-b0eb-825594f71928_1600x440.png 1272w, https://substackcdn.com/image/fetch/$s_!hwRU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffddbaa3-9144-4095-b0eb-825594f71928_1600x440.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Kokoro Model is so complicated that you have to zoom out 50x to see the whole model in Xcode</figcaption></figure></div><p>Native restrictions also limit certain workloads on AI accelerators like ANE and NPU. For LLMs, context windows are often restricted to under 5,000 tokens, and memory bandwidth becomes the primary bottleneck. Currently, Apple&#8217;s MLX framework or llama.cpp remain better solutions for local LLM inference on Apple devices.</p><p>Each successful conversion adds to our knowledge base. We&#8217;re already seeing community contributions like <a href="https://www.linkedin.com/posts/mhamzaqayyum_senko-is-now-3x-faster-on-mac-thanks-to-coreml-activity-7375931382088437761-czOz?utm_source=share&amp;utm_medium=member_desktop&amp;rcm=ACoAABxl4RsBOqqF70b3PyacBYu_abCfNO8U0xI&amp;source=inference.plus">CAM++ achieving 3&#215;</a> speedups after successfully converting the model to Core ML format. 
</p><p>This time around, we learned how to better handle PyTorch JIT models when the original PyTorch model isn&#8217;t available, working directly with the weights instead. While the VAD model itself is too small for meaningful ANE usage, converting to the native runtime still provides more flexibility for optimization. Eventually M&#246;bius may even be able to redesign models to run entirely on the neural engine.</p><h1>Bringing edge intelligence everywhere</h1><p>Getting models to run well on edge devices is crucial for AI&#8217;s future. Edge devices have tight resource budgets, so every bit of performance matters. While unified runtimes offer portability, they often lack the performance and efficiency of native runtimes. Yet there aren&#8217;t enough people with the deep expertise needed for this optimization work.</p><p>Native runtimes matter significantly, even on CPU. They are often tuned by the hardware maker and offer more flexibility for optimization. For ambient workloads on the edge, every performance gain is critical. Model conversion demands a deep understanding of multiple frameworks and the patience to debug subtle numerical differences. However, properly structured agents with domain knowledge can handle this complexity effectively, making AI-assisted development a practical path forward.</p><p>This case study demonstrates the potential for AI-assisted model conversion. While we&#8217;ve successfully tested our approach on several models beyond Silero VAD, we recognize this is early work with much more to explore. 
The same approach that achieved 3.5&#215; speedup for VAD extends naturally to emerging edge scenarios like fleet management systems running vision models on heterogeneous vehicle hardware, ambient monitoring with multimodal models on battery-powered sensors, and real-time data collection on industrial IoT devices.</p><p>As powerful multimodal models like Moondream3 and VoxCPM-0.5B emerge, the gap between cloud capabilities and edge constraints continues to widen, making systematic optimization increasingly vital. Each successful conversion teaches our system new patterns. What works for a 2MB VAD on Core ML informs how we&#8217;ll optimize tomorrow&#8217;s foundation models for tomorrow&#8217;s edge accelerators.</p><p>The real bottleneck in edge AI deployment isn&#8217;t the models or the hardware. It&#8217;s the expertise gap in optimization. We&#8217;re building M&#246;bius to help bridge this gap, creating an AI coding assistant that will make sophisticated model optimization accessible to more developers. Just as GitHub Copilot changed how developers write code, we believe tools like M&#246;bius can change how we optimize models for edge deployment, gradually dissolving the boundaries between machine learning engineers and software engineers.</p><p>At FluidInference, M&#246;bius represents our commitment to making edge AI optimization accessible through AI-assisted development. We&#8217;re building it to be like Claude Code or Codex for model porting, specializing in the complex task of bringing models to edge devices. We&#8217;ll be sharing more details as development continues and plan to eventually open source the platform. Join our Discord to be the first to hear about it. 
If you&#8217;re working on edge AI or model optimization, we&#8217;d love to connect at <a href="mailto:hello@fluidinference.com">hello@fluidinference.com</a>.</p><div><hr></div><p><em>Use the new VAD model: <a href="https://github.com/FluidInference/FluidAudio">https://github.com/FluidInference/FluidAudio</a> </em></p><p><em>Access M&#246;bius&#8217;s models and conversion code: <a href="https://github.com/FluidInference/mobius">github.com/FluidInference/mobius</a></em></p><p><em>Join our community: <a href="https://discord.gg/WNsvaCtmDe">https://discord.gg/WNsvaCtmDe</a></em></p>]]></content:encoded></item><item><title><![CDATA[Near-Real-Time Speaker Diarization on CoreML]]></title><description><![CDATA[Lessons learned converting PyTorch models to run locally at zero marginal cost on Apple's Neural Engine]]></description><link>https://inference.plus/p/low-latency-speaker-diarization-on</link><guid isPermaLink="false">https://inference.plus/p/low-latency-speaker-diarization-on</guid><dc:creator><![CDATA[Alex Weng]]></dc:creator><pubDate>Fri, 01 Aug 2025 16:25:51 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/7eeef802-7090-4057-bd8f-bc82a11db33b_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Voice-based AI apps are gaining significant popularity as Whisper models have democratized speech-to-text, but transcription alone isn't enough. <strong>Voice AI apps need to know who said what and when.</strong> That's where speaker diarization comes in: it detects and labels different speakers in real time, bringing essential structure and clarity to transcripts. In a meeting with many participants, you need to attribute action items and follow-ups to the right person.</p><p>Many existing diarization systems rely on cloud processing, but that introduces major cost and latency issues. 
Streaming audio buffers to the cloud introduces unpredictable delays, especially over cellular networks. Additionally, every minute of cloud inference incurs variable costs, from AI compute time to bandwidth, latency round trips, and storage; this makes cloud-based diarization expensive at scale. For example,<a href="https://picovoice.ai/blog/state-of-speaker-diarization/"> Picovoice's proprietary</a> diarization costs up to $3 per hour of audio, so processing 1,000 hours of meetings can cost up to $3,000.</p><p>While post-processed diarization can be more accurate, it introduces its own problems. One major issue is cold start: when users launch a voice app, post-processing systems initially assign generic labels like "Speaker 1" and "Speaker 2." These labels lack real-world context, forcing users to guess who said what after the fact. This adds cognitive friction and undermines the value of automation. <strong>In contrast, real-time diarization allows assigning names or roles as people begin speaking, maintaining consistent speaker identity both within a session and across sessions.</strong></p><p>This article details our attempt at converting diarization models to run efficiently on CoreML, what we learned along the way, and how the Fluid Inference platform helped us achieve this in a fraction of the time it would have taken otherwise.</p><h3>The Limitations of Current Local Diarization Solutions</h3><p>Some existing solutions let developers run speaker diarization models locally on devices like macOS, but they are either CPU-focused, written in Python, or carry proprietary licensing. <strong>As developers building local AI applications across Apple, Windows, and Android, we've experienced firsthand how limiting these constraints can be.</strong></p><p>Pyannote, while widely regarded as a benchmark for speaker diarization, presents notable limitations for on-device, real-time applications. 
Its Python-based implementation is ill-suited for native macOS development. <strong>Adding Python as a dependency introduces a substantial download size, requiring approximately 200MB for a minimal Python installation plus 350-500MB for the necessary computing libraries (NumPy, PyTorch, and SciPy).</strong> This is unsuitable for native applications, which are usually less than 100MB in total.</p><p>Another option was sherpa-onnx, which offers Swift support; however, it falls short for real-time, on-device diarization on macOS because it does not support the ANE (Apple Neural Engine), forcing inference onto the CPU. This results in higher latency and power usage, critical drawbacks for real-time applications. Additionally, sherpa-onnx lacks tight integration with Swift and CoreML, making it cumbersome to manage for macOS development with all the C++ header files and bindings. <strong>These limitations ultimately led us to favor a CoreML pipeline optimized for Swift processing and Apple hardware.</strong></p><h3>Engineering On-Device Diarization</h3><h4>Converting PyTorch Models to CoreML</h4><p>There were other speaker diarization models we considered for conversion, but we ultimately went with Pyannote&#8217;s <a href="https://huggingface.co/pyannote/segmentation-3.0">segmentation</a> model and <a href="https://github.com/wenet-e2e/wespeaker">WeSpeaker</a> embedding because of their popularity, accuracy, and speed.</p><p>Speaker diarization uses two models: a segmentation model and a speaker embedding model. Running these models efficiently on macOS and iOS required integration with Apple's on-device machine learning ecosystem. 
<strong>At the core of this is CoreML, Apple's framework for executing machine learning models directly on devices, which presented unique challenges and required careful engineering.</strong></p><p>The first major issue we encountered for model conversion was that the PyTorch library had functions that were unsupported in CoreML files. For example, when attempting to convert the embedding model, we encountered runtime errors such as:</p><pre><code># Original PyTorch code using unsupported operations
features = torch.vmap(fbank_partial)(waveforms)  # vmap not supported in CoreML

# Replacement: apply the filterbank to each waveform in a loop and stack the results
features = torch.stack([fbank_partial(w) for w in waveforms.to(fft_device)]).to(device)</code></pre><pre><code># Original code that caused: "RuntimeError: PyTorch convert function for op 'as_strided' not implemented"
return waveform.as_strided(sizes, strides)

# Our implementation using basic operations supported by CoreML
frames = []
for i in range(0, waveform.size(0) - window_size + 1, window_shift):
    frames.append(waveform[i:i + window_size].unsqueeze(0))
frames = torch.cat(frames, dim=0)</code></pre><p>Furthermore, the fbank computation, a critical precursor to many audio models, had to be re-implemented outside the converted model to ensure full compatibility. <strong>These adaptations required extensive knowledge of both PyTorch internals and CoreML capabilities, making the conversion process labor-intensive and error-prone.</strong></p><p>Another issue is dynamic graph versus static graph support; CoreML currently only supports static graph usage. This was encountered during our embedding model conversion process. During tracing, we discovered that Pyannote's internal logic contained conditional statements dependent on input audio dimensions. This led to a dynamic graph that torch.jit.trace couldn't resolve into a static representation. <strong>Our workaround involved modifying the code to handle only 1D audio inputs, effectively removing the problematic conditional paths and allowing for successful graph tracing.</strong></p><h4>Building a Real-Time Processing Pipeline</h4><p>We needed a system that could make accurate speaker assignments on the fly. <strong>However, during our research, we found most speaker diarization algorithms were designed for post-processing.</strong> This meant we needed to rethink traditional pipelines and get creative with optimization strategies to balance accuracy and the constraints of real-time, on-device inferencing.</p><p>While apps like Zoom or Slack Huddles appear to offer real-time speaker identification, it's important to note that they operate within their own ecosystems. These platforms own the audio routing, so they know exactly which user is speaking at any given time based on internal metadata&#8212;not from analyzing the audio itself. 
<strong>In contrast, our diarization system works on raw, single-channel audio with no privileged access to speaker identity, making the problem of local model device support difficult.</strong></p><p>The original Pyannote models were designed to operate on 10-second chunks using a sliding window approach to detect speaker segments from full audio files. Additionally, the speaker embeddings tend to degrade in accuracy when generated from less than 3 seconds of speech. <strong>However, during testing, this 10-second chunk design did not negatively impact the perceived real-time responsiveness of the system.</strong> Since the diarization is designed to run locally on the user's machine, using longer chunks also allows for smoother computation without compromising interactivity.</p><p>For every 10-second audio chunk, here's how our pipeline operates:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aPTv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5d44955-810c-44b6-bda2-e7d118a0ea99_820x1506.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aPTv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5d44955-810c-44b6-bda2-e7d118a0ea99_820x1506.png 424w, https://substackcdn.com/image/fetch/$s_!aPTv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5d44955-810c-44b6-bda2-e7d118a0ea99_820x1506.png 848w, https://substackcdn.com/image/fetch/$s_!aPTv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5d44955-810c-44b6-bda2-e7d118a0ea99_820x1506.png 1272w, 
https://substackcdn.com/image/fetch/$s_!aPTv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5d44955-810c-44b6-bda2-e7d118a0ea99_820x1506.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aPTv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5d44955-810c-44b6-bda2-e7d118a0ea99_820x1506.png" width="296" height="543.6292682926829" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b5d44955-810c-44b6-bda2-e7d118a0ea99_820x1506.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1506,&quot;width&quot;:820,&quot;resizeWidth&quot;:296,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aPTv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5d44955-810c-44b6-bda2-e7d118a0ea99_820x1506.png 424w, https://substackcdn.com/image/fetch/$s_!aPTv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5d44955-810c-44b6-bda2-e7d118a0ea99_820x1506.png 848w, https://substackcdn.com/image/fetch/$s_!aPTv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5d44955-810c-44b6-bda2-e7d118a0ea99_820x1506.png 1272w, 
https://substackcdn.com/image/fetch/$s_!aPTv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5d44955-810c-44b6-bda2-e7d118a0ea99_820x1506.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ol><li><p><strong>Speech Segmentation:</strong> We first feed the audio chunk into our Segmentation model, identifying precise time intervals within that 10-second window where speech activity occurs, effectively telling us when someone was speaking.</p></li><li><p><strong>Speaker Embedding Generation:</strong> Next, for each of these detected speech segments, we 
generate a speaker embedding. Think of this as a digital fingerprint for the speaker's voice during that specific segment: a compact numerical representation capturing their vocal characteristics.</p></li><li><p><strong>Speaker Assignment:</strong> Finally, we take this newly generated speaker embedding and compare it against our existing registry of known speakers. We then assign the segment to the closest matching speaker, ensuring consistent identification even as the conversation progresses.</p></li></ol><p><strong>This iterative process allows us to continuously update speaker information as audio comes in, providing dynamic diarization that keeps pace with the conversation.</strong> While it trades away some accuracy, it is sufficient for our real-time diarization needs.</p><h4><strong>Managing Voice Clustering in Real-Time</strong></h4><p>Our diarization system begins with a clean slate: an empty speaker registry. As our pipeline generates new speaker embeddings from incoming audio segments, we perform an immediate check against this existing list. 
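</p><p>That nearest-match check, together with the exponential-smoothing update we describe later, can be sketched in plain NumPy. This is a simplified illustration rather than our production code: the SpeakerRegistry name, the 0.7 similarity threshold, and the 0.9 smoothing factor are made-up values for the example.</p><pre><code>import numpy as np

class SpeakerRegistry:
    """Toy registry: cosine-match new embeddings, smooth known ones."""

    def __init__(self, threshold=0.7, alpha=0.9):
        self.threshold = threshold  # min cosine similarity to reuse a speaker
        self.alpha = alpha          # exponential smoothing factor
        self.embeddings = []        # one unit-norm vector per known speaker

    def assign(self, emb):
        emb = emb / np.linalg.norm(emb)
        if self.embeddings:
            # Unit-norm vectors, so a dot product is cosine similarity
            sims = [float(e @ emb) for e in self.embeddings]
            best = int(np.argmax(sims))
            if sims[best] &gt;= self.threshold:
                # Known speaker: drift the stored embedding toward the new one
                updated = self.alpha * self.embeddings[best] + (1 - self.alpha) * emb
                self.embeddings[best] = updated / np.linalg.norm(updated)
                return best
        # No close match: register a potential new speaker
        self.embeddings.append(emb)
        return len(self.embeddings) - 1</code></pre><p>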
If no sufficiently close match is found for a new embedding, we consider it a potential new speaker.</p><p>During our development, we encountered spurious speaker entries (e.g., from brief coughs, background noise, or short interjections), so we implemented a crucial filter: <strong>a new speaker embedding is only officially added to our database once that voice has been detected speaking for a cumulative duration exceeding three seconds, combined with some basic speech-to-noise ratio heuristics to reject non-speech sounds.</strong> These procedures helped ensure the robustness of our speaker identification.</p><p>Furthermore, human voices are not static; they can exhibit subtle acoustic drift over time due to factors like microphone changes, emotional state, or even just long conversations. <strong>To accommodate this, we continuously update our existing speaker embeddings using exponential smoothing.</strong> This technique allows the model to adapt to subtle changes in a speaker's voice over the course of a meeting, ensuring long-term accuracy without creating new, redundant speaker identities.</p><h3><strong>Speed vs. Accuracy Performance Trade-offs</strong></h3><p>Our diarization models achieved an average DER of 22.14% across standard benchmarks (excluding the AVA-AVD dataset), compared to the open-source Pyannote 3.0's 17.0%. <strong>While we trail by about 5 percentage points of DER on most datasets, FluidAudio speaker diarization was designed for real-time performance, and some degradation was expected,</strong><a href="https://arxiv.org/abs/1911.01255"> whereas Pyannote was designed primarily for post-batch processing</a>. Interestingly, we significantly outperform on the<a href="https://arxiv.org/pdf/2111.14448"> AVA-AVD</a> dataset (32.6% vs. 
49.1%), though this is largely due to our different handling of silent frames and non-speech audio, as well as different threshold tuning for some of the datasets on our end.</p><h3><strong>Quantization and Optimizations</strong></h3><p>Once we had the models in a working state, we naturally started exploring further optimizations to improve the speed and reduce the size of the models. <strong>Interestingly, coming from an LLM world where quantization usually delivers faster performance at some cost in accuracy, that assumption did not exactly translate to CoreML models.</strong> This is one area we have been able to automate completely, relying on our AI Agent to generate a comparison against the baseline model. The graph below was generated from a single natural-language prompt; working together with several other sub-agents, the agent produced it for a human to review, saving days of manual work for an ML engineer.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dDjF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e099a80-73fc-4079-a78b-5cc29d25e791_1600x1268.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dDjF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e099a80-73fc-4079-a78b-5cc29d25e791_1600x1268.png 424w, https://substackcdn.com/image/fetch/$s_!dDjF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e099a80-73fc-4079-a78b-5cc29d25e791_1600x1268.png 848w, 
https://substackcdn.com/image/fetch/$s_!dDjF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e099a80-73fc-4079-a78b-5cc29d25e791_1600x1268.png 1272w, https://substackcdn.com/image/fetch/$s_!dDjF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e099a80-73fc-4079-a78b-5cc29d25e791_1600x1268.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dDjF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e099a80-73fc-4079-a78b-5cc29d25e791_1600x1268.png" width="1456" height="1154" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0e099a80-73fc-4079-a78b-5cc29d25e791_1600x1268.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1154,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dDjF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e099a80-73fc-4079-a78b-5cc29d25e791_1600x1268.png 424w, https://substackcdn.com/image/fetch/$s_!dDjF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e099a80-73fc-4079-a78b-5cc29d25e791_1600x1268.png 848w, 
https://substackcdn.com/image/fetch/$s_!dDjF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e099a80-73fc-4079-a78b-5cc29d25e791_1600x1268.png 1272w, https://substackcdn.com/image/fetch/$s_!dDjF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e099a80-73fc-4079-a78b-5cc29d25e791_1600x1268.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Comparison of latency, precision and compression of various quantization settings in 
CoreML</figcaption></figure></div><p>The final recommendation was to keep the baseline fp32 models. Although the quantized variants compressed well, the size and latency gains were too small to justify the loss of precision for either model.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uxbY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf97b9e8-0892-478d-a092-b3be104b0e7c_1928x728.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uxbY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf97b9e8-0892-478d-a092-b3be104b0e7c_1928x728.png 424w, https://substackcdn.com/image/fetch/$s_!uxbY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf97b9e8-0892-478d-a092-b3be104b0e7c_1928x728.png 848w, https://substackcdn.com/image/fetch/$s_!uxbY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf97b9e8-0892-478d-a092-b3be104b0e7c_1928x728.png 1272w, https://substackcdn.com/image/fetch/$s_!uxbY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf97b9e8-0892-478d-a092-b3be104b0e7c_1928x728.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uxbY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf97b9e8-0892-478d-a092-b3be104b0e7c_1928x728.png" width="1456" height="550" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bf97b9e8-0892-478d-a092-b3be104b0e7c_1928x728.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:550,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:59758,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://inference.plus/i/169842087?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf97b9e8-0892-478d-a092-b3be104b0e7c_1928x728.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uxbY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf97b9e8-0892-478d-a092-b3be104b0e7c_1928x728.png 424w, https://substackcdn.com/image/fetch/$s_!uxbY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf97b9e8-0892-478d-a092-b3be104b0e7c_1928x728.png 848w, https://substackcdn.com/image/fetch/$s_!uxbY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf97b9e8-0892-478d-a092-b3be104b0e7c_1928x728.png 1272w, https://substackcdn.com/image/fetch/$s_!uxbY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf97b9e8-0892-478d-a092-b3be104b0e7c_1928x728.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">DER compared with public Pyannote benchmarks</figcaption></figure></div><p><strong>FluidAudio achieved 0.017 RTF (60x real-time) on a 2022 Apple M1, outperforming Pyannote's<a href="https://huggingface.co/pyannote/speaker-diarization-3.0"> 0.025</a> RTF (40x real-time) on an enterprise-grade Nvidia Tesla V100.</strong> This 50% speed advantage on far more accessible hardware is made possible by CoreML optimizations, leveraging ANE through native Swift integration. 
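For readers unfamiliar with the metric: real-time factor (RTF) is processing time divided by audio duration, and its inverse is the real-time speedup quoted above. A minimal sketch of the arithmetic (the function names are ours, not FluidAudio API):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    # RTF < 1 means the pipeline runs faster than real time.
    return processing_seconds / audio_seconds

def realtime_speedup(rtf: float) -> float:
    # Seconds of audio processed per wall-clock second.
    return 1.0 / rtf

# Figures from the text: FluidAudio on an M1 vs. Pyannote on a V100.
print(round(realtime_speedup(0.017)))  # about 60x real time
print(round(realtime_speedup(0.025)))  # 40x real time
```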
It's also worth noting that Pyannote performs post-processing clustering across all embeddings, an inherently computationally expensive step that contributed to its slower performance.</p><p><strong>We launched this publicly a month ago, and since then, we have seen overwhelming interest.</strong> The project quickly grew to over 400 stars in a month, with thousands of model downloads on Hugging Face and multiple production AI applications rolling it out in their apps.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fayZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a578d3b-e66b-4ff3-bfae-f9eeb119e219_1600x1142.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fayZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a578d3b-e66b-4ff3-bfae-f9eeb119e219_1600x1142.png 424w, https://substackcdn.com/image/fetch/$s_!fayZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a578d3b-e66b-4ff3-bfae-f9eeb119e219_1600x1142.png 848w, https://substackcdn.com/image/fetch/$s_!fayZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a578d3b-e66b-4ff3-bfae-f9eeb119e219_1600x1142.png 1272w, https://substackcdn.com/image/fetch/$s_!fayZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a578d3b-e66b-4ff3-bfae-f9eeb119e219_1600x1142.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!fayZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a578d3b-e66b-4ff3-bfae-f9eeb119e219_1600x1142.png" width="1456" height="1039" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9a578d3b-e66b-4ff3-bfae-f9eeb119e219_1600x1142.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1039,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fayZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a578d3b-e66b-4ff3-bfae-f9eeb119e219_1600x1142.png 424w, https://substackcdn.com/image/fetch/$s_!fayZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a578d3b-e66b-4ff3-bfae-f9eeb119e219_1600x1142.png 848w, https://substackcdn.com/image/fetch/$s_!fayZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a578d3b-e66b-4ff3-bfae-f9eeb119e219_1600x1142.png 1272w, https://substackcdn.com/image/fetch/$s_!fayZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a578d3b-e66b-4ff3-bfae-f9eeb119e219_1600x1142.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset 
pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>From Open Source to Enterprise Success</strong></h3><p><strong>We decided to tackle the challenging problem of local real-time speaker diarization because we believe in the long-term success of local model development,</strong> despite the mounting challenges we encountered while getting it running locally. With cloud inference, network latency alone would make real-time applications unusable, especially given the computational intensity of AI.</p><p>That knowledge investment has paid off. Thanks to our experience with the speaker diarization conversion, our Parakeet model is already available for testing in the same repository. 
<strong>Several production macOS and iOS applications have already integrated FluidAudio for speaker diarization,</strong> like<a href="https://slipbox.ai/"> slipbox.ai</a>, <a href="https://whisper.marksdo.com/en">whisper.marksdo.com</a>, and <a href="https://github.com/Beingpax/VoiceInk">Beingpax/VoiceInk</a>.</p><p><strong>We've even expanded beyond Apple platforms, with our models now running on Windows NPU as well, and we're working with Fortune 100 companies to deploy custom models on their AI accelerators.</strong></p><p>If you're interested in implementing on-device speaker diarization in your applications or would like to explore collaboration opportunities, join our <a href="https://discord.com/invite/WNsvaCtmDe">Discord</a> or drop us an email at hello@fluidinference.com.</p><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://inference.plus/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">We&#8217;ll soon publish a detailed write-up on how AI assisted our process for converting the Parakeet-TDT-v2 model. Subscribe to be the first to read it. 
</p></div></div></div>]]></content:encoded></item><item><title><![CDATA[Bringing State-of-the-Art AI Models to Intel&#174; NPUs]]></title><description><![CDATA[Partnering with Intel to optimize whisper-large-v3-turbo, qwen-3, and phi-4-mini for NPU acceleration on Intel&#174; AI PCs]]></description><link>https://inference.plus/p/bringing-state-of-the-art-ai-models</link><guid isPermaLink="false">https://inference.plus/p/bringing-state-of-the-art-ai-models</guid><dc:creator><![CDATA[Alex weng]]></dc:creator><pubDate>Tue, 29 Jul 2025 17:23:38 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/984fc385-df94-4b0c-b657-2511e145e2ee_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>At Fluid Inference, our goal is simple: make deploying and running local AI models as easy as running application code. For too long, developers have had to rely on cloud infrastructure even for basic AI tasks. But the landscape is rapidly evolving&#8212;Intel's new AI PCs with integrated NPUs are making it possible to run sophisticated AI workloads directly on consumer devices.</p><p>We're seeing an explosion of developer interest in deploying AI directly into their applications. From indie developers to engineers at major tech companies, there's unprecedented demand for native AI solutions. A fundamental shift is happening: developers are becoming much more educated about AI models, and hardware like Intel's NPUs is finally powerful enough to make local deployment practical.</p><p>This is why our collaboration with Intel is so exciting. 
Together, we optimized transformer models including Whisper v3 Turbo, Qwen3, and Phi-4-mini to harness the full potential of Intel&#174; Core&#8482; Ultra processors with integrated NPUs. These models, typically associated with cloud infrastructure and GPU-heavy workloads, now deliver real-time performance directly on Intel AI PCs.</p><p>For more details, read the full article <a href="https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/Bringing-AI-Back-to-the-Device-Real-World-Transformer-Models-on/post/1704424">here</a>. </p><h2>Our Approach</h2><p>We applied our agent-based optimization system to tackle the complexity of adapting transformer models for Intel NPU execution. Our system takes three key inputs: the models to optimize (Whisper v3 Turbo, Qwen3, Phi-4-mini), Intel NPU hardware specifications, and performance requirements.</p><p>The system uses a <strong>lead agent</strong> that coordinates the entire optimization pipeline, working with specialized agents to handle different aspects of the process. A <strong>researcher agent</strong> analyzes model architectures and identifies NPU-specific optimization opportunities. The <strong>optimization &amp; quantization agent</strong> implements these transformations using tools like OpenVINO&#8482;, handling precision reduction and graph optimizations. Finally, a <strong>benchmark agent</strong> validates performance on actual hardware, measuring latency, throughput, and accuracy.</p><p>This orchestrated approach allows for rapid iteration and optimization. The agents work together, sharing results and refining their strategies based on real hardware performance data. 
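In pseudocode, the loop the lead agent coordinates looks roughly like the sketch below. Every function name and signature here is illustrative, not our actual API; the point is the structure: the researcher proposes passes, the optimizer applies them, the benchmarker validates, and the lead agent iterates against a latency budget.

```python
# Illustrative sketch of the agent pipeline described above. All agent
# functions are hypothetical stand-ins, not Fluid Inference's real system.
from dataclasses import dataclass

@dataclass
class BenchmarkResult:
    latency_ms: float
    accuracy: float

def researcher(model: str) -> list[str]:
    # Proposes NPU-specific optimization passes for the architecture.
    return ["fp16_weights", "int8_linear", "fuse_attention"]

def optimizer(model: str, passes: list[str]) -> str:
    # Applies the passes (in practice, via tooling such as OpenVINO).
    return model + "+" + "+".join(passes) if passes else model

def benchmarker(artifact: str) -> BenchmarkResult:
    # Would measure on real hardware; stubbed so more passes -> lower latency.
    n_passes = artifact.count("+")
    return BenchmarkResult(latency_ms=310.0 - 50.0 * n_passes, accuracy=0.999)

def lead_agent(model: str, latency_budget_ms: float) -> str:
    passes: list[str] = []
    for candidate in researcher(model):
        result = benchmarker(optimizer(model, passes + [candidate]))
        if result.accuracy >= 0.99:      # keep passes that preserve accuracy
            passes.append(candidate)
        if benchmarker(optimizer(model, passes)).latency_ms <= latency_budget_ms:
            break                        # budget met, stop iterating
    return optimizer(model, passes)

print(lead_agent("whisper-v3-turbo", latency_budget_ms=200.0))
```

The real system shares benchmark results back into the research and optimization steps; the stub above only captures the accept/reject loop.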
This enabled us to optimize all three transformer models for Intel NPUs in just a matter of weeks, achieving the impressive performance gains detailed below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WIZd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90d1f9b1-8448-424b-a559-6034a5cad898_8548x2606.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WIZd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90d1f9b1-8448-424b-a559-6034a5cad898_8548x2606.png 424w, https://substackcdn.com/image/fetch/$s_!WIZd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90d1f9b1-8448-424b-a559-6034a5cad898_8548x2606.png 848w, https://substackcdn.com/image/fetch/$s_!WIZd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90d1f9b1-8448-424b-a559-6034a5cad898_8548x2606.png 1272w, https://substackcdn.com/image/fetch/$s_!WIZd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90d1f9b1-8448-424b-a559-6034a5cad898_8548x2606.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WIZd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90d1f9b1-8448-424b-a559-6034a5cad898_8548x2606.png" width="1456" height="444" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/90d1f9b1-8448-424b-a559-6034a5cad898_8548x2606.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:444,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:600756,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://inference.plus/i/169580600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90d1f9b1-8448-424b-a559-6034a5cad898_8548x2606.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WIZd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90d1f9b1-8448-424b-a559-6034a5cad898_8548x2606.png 424w, https://substackcdn.com/image/fetch/$s_!WIZd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90d1f9b1-8448-424b-a559-6034a5cad898_8548x2606.png 848w, https://substackcdn.com/image/fetch/$s_!WIZd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90d1f9b1-8448-424b-a559-6034a5cad898_8548x2606.png 1272w, https://substackcdn.com/image/fetch/$s_!WIZd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90d1f9b1-8448-424b-a559-6034a5cad898_8548x2606.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Results</h2><p>Whisper v3 Turbo runs 40% faster on Intel NPUs than on CPU (down from 0.31s to 0.19s per segment), and we didn't sacrifice any accuracy to get there. The models process audio in real-time, which is crucial for live applications.</p><p>For language tasks, Qwen3 and Phi-4-mini deliver about 70-75% of GPT-4's quality on summarization and Q&amp;A, pretty impressive for models running entirely offline. Power consumption dropped significantly compared to GPU inference, though exact numbers vary by workload. </p><p>These aren't just benchmark numbers. A Fortune 100 company is already using our NPU-optimized models in their next-generation AI application. </p><h2>Resources for Developers</h2><p>All our NPU-optimized models are available on <a href="https://huggingface.co/fluidinference">Hugging Face</a>. 
We're also building a native <a href="https://github.com/FluidInference/OpenVINO.GenAI.NET">.NET library</a> to make it easier to deploy GenAI workloads in Windows applications.</p><p>If you're looking to deploy local AI solutions&#8212;whether with your own models or open-source ones&#8212;reach out through <a href="https://fluidinference.com">fluidinference.com</a>. We'd love to help you bring AI directly to your users' devices.</p>]]></content:encoded></item><item><title><![CDATA[Where are the local AI apps?]]></title><description><![CDATA[Millions build AI apps with natural language, but local AI deployment remains complex. Why?]]></description><link>https://inference.plus/p/where-are-the-local-ai-apps</link><guid isPermaLink="false">https://inference.plus/p/where-are-the-local-ai-apps</guid><dc:creator><![CDATA[Brandon]]></dc:creator><pubDate>Sat, 05 Jul 2025 15:19:13 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/3cf89b79-620a-4c25-b702-6ae2f06ed236_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Millions of people can now build AI apps with cloud APIs using natural language. Yet their phone's AI chip sits mostly idle. Despite Apple and Microsoft's 2024 promises of on-device AI, we're still waiting. Where's the intelligence in Apple Intelligence? What happened to Windows Recall?</p><blockquote><p><em>"We're kind of like in this 1960-ish era where LLM compute is still very expensive for this new kind of computer and that forces the LLMs to be centralized in the cloud and we're all just thin clients that interact with it over the network... the personal computing revolution hasn't happened yet because it's just not economical."</em></p><p>Andrej Karpathy, June 2025, YC AI Startup School</p></blockquote><p>Just as personal computers democratized computing by moving it from centralized mainframes to individual devices, <strong>AI needs its own personal computing revolution</strong>. 
Cloud computing will not be replaced; the industry relies on it more than ever. But for AI to be truly personal, more AI workloads need to run on personal devices.</p><p>After eight months building a local-first meeting transcriber across macOS, iOS, and Windows, and a year before that building cloud AI agents, we have learned that <strong>deploying AI locally is 10x harder than using cloud services</strong>. Despite the cost and privacy benefits of local AI, there hasn't been a breakout app yet. The ecosystem is too fragmented, and everyone is playing catch-up on the model layer.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mM31!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82ed22c3-4c78-4a0e-82df-1e9411fc95cf_1920x904.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mM31!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82ed22c3-4c78-4a0e-82df-1e9411fc95cf_1920x904.png 424w, https://substackcdn.com/image/fetch/$s_!mM31!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82ed22c3-4c78-4a0e-82df-1e9411fc95cf_1920x904.png 848w, https://substackcdn.com/image/fetch/$s_!mM31!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82ed22c3-4c78-4a0e-82df-1e9411fc95cf_1920x904.png 1272w, https://substackcdn.com/image/fetch/$s_!mM31!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82ed22c3-4c78-4a0e-82df-1e9411fc95cf_1920x904.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!mM31!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82ed22c3-4c78-4a0e-82df-1e9411fc95cf_1920x904.png" width="1456" height="686" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/82ed22c3-4c78-4a0e-82df-1e9411fc95cf_1920x904.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:686,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;State of NPU&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="State of NPU" title="State of NPU" srcset="https://substackcdn.com/image/fetch/$s_!mM31!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82ed22c3-4c78-4a0e-82df-1e9411fc95cf_1920x904.png 424w, https://substackcdn.com/image/fetch/$s_!mM31!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82ed22c3-4c78-4a0e-82df-1e9411fc95cf_1920x904.png 848w, https://substackcdn.com/image/fetch/$s_!mM31!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82ed22c3-4c78-4a0e-82df-1e9411fc95cf_1920x904.png 1272w, https://substackcdn.com/image/fetch/$s_!mM31!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82ed22c3-4c78-4a0e-82df-1e9411fc95cf_1920x904.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button 
tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>There's a lot to discuss, but before that, it's worth looking at how we got here.</p><h2>The CNN era</h2><p>We've had AI features on our devices for years. Features like Face ID, fingerprinting, and object detection are all powered by machine learning models. Before transformer models became mainstream, the majority of models that ran on edge devices were convolutional neural networks (CNN) models like MobileNet, EfficientNet-Lite, and YOLO. They were exceptionally good at very specific tasks that involved vision.</p><p>Most devices have a CPU, GPU, and more recently an AI accelerator, all embedded onto a single SoC (System on Chip) in your devices. 
CNN models can run on any of these, but when a task runs billions or even trillions of times per day, it is worth optimizing for it.</p><p>CPUs are very general purpose and offer a lot of flexibility at the cost of performance. As graphical interfaces became more popular, it became clear that real-time graphics rendering was too intensive for CPUs, which is how GPUs were born. CPUs may have tens of cores, but GPUs have thousands of smaller, streamlined cores, making them extremely good at parallel processing.</p><p>At one point we had <a href="https://www.gq.com/story/apple-intelligence-ios-18">~200 models powering our iPhones</a> for tasks like object detection, classification, and OCR; likely even more now. Just as CPUs could technically handle graphics rendering, GPUs can handle the operations in CNN models as well. However, GPUs are extremely power hungry for edge devices, and that's why AI accelerators are needed. There are different marketing terms for AI accelerators, but they're commonly known as Neural Processing Units (NPUs); Apple calls its version the Apple Neural Engine (ANE). For a more comprehensive overview of the hardware side, watch <a href="https://www.youtube.com/watch?v=a9NprGqBr54">this video</a>; we will focus on model deployment.</p><p><strong>NPUs were originally designed to run CNN operations as efficiently as possible.</strong> They use lower-precision math (like int8 or fp16) instead of the full precision (fp32) that GPUs and CPUs use. This trade-off gives much better performance when the model can tolerate lower-precision calculations.</p><p>The benefits are clear: in the benchmark below, running a Stable Diffusion model on the NPU yields 4-6x more battery life than on the GPU. 
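To make the precision trade-off concrete, here is a toy example of symmetric int8 quantization, the kind of precision reduction NPUs favor. The scale value is arbitrary, chosen for illustration:

```python
def quantize_int8(x: float, scale: float) -> int:
    # Map a float to the nearest int8 step, clamped to [-128, 127].
    return max(-128, min(127, round(x / scale)))

def dequantize(q: int, scale: float) -> float:
    return q * scale

scale = 0.02                     # one int8 step represents 0.02 in float
w = 1.2345                       # an fp32 weight
q = quantize_int8(w, scale)      # stored as a single signed byte
w_hat = dequantize(q, scale)
print(q, w_hat, abs(w - w_hat))  # the rounding error is the accuracy cost
```

Values outside the representable range (here roughly ±2.55) clamp to the extremes; that clamping plus the rounding error is exactly the precision the int8/fp16 trade-off gives up in exchange for speed and power savings.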
<a href="https://arxiv.org/html/2409.14803v1">Research</a> has shown that NPUs are much more effective for edge AI computing and mobile applications, especially real-time and long running tasks. In practice, one should expect <a href="https://creativestrategies.com/research/white-paper-the-npu-wattage-advantage/">2-3x battery life improvements</a> when running on NPU as some operations tend to fall back to CPU due to a lack of support.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VxfJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e82512c-1913-4494-abe6-b9eda326fe57_1600x581.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VxfJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e82512c-1913-4494-abe6-b9eda326fe57_1600x581.png 424w, https://substackcdn.com/image/fetch/$s_!VxfJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e82512c-1913-4494-abe6-b9eda326fe57_1600x581.png 848w, https://substackcdn.com/image/fetch/$s_!VxfJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e82512c-1913-4494-abe6-b9eda326fe57_1600x581.png 1272w, https://substackcdn.com/image/fetch/$s_!VxfJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e82512c-1913-4494-abe6-b9eda326fe57_1600x581.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!VxfJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e82512c-1913-4494-abe6-b9eda326fe57_1600x581.png" width="1456" height="529" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5e82512c-1913-4494-abe6-b9eda326fe57_1600x581.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:529,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Running Stable Diffusion on an NPU delivers 4-6x better battery life than on GPU&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Running Stable Diffusion on an NPU delivers 4-6x better battery life than on GPU" title="Running Stable Diffusion on an NPU delivers 4-6x better battery life than on GPU" srcset="https://substackcdn.com/image/fetch/$s_!VxfJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e82512c-1913-4494-abe6-b9eda326fe57_1600x581.png 424w, https://substackcdn.com/image/fetch/$s_!VxfJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e82512c-1913-4494-abe6-b9eda326fe57_1600x581.png 848w, https://substackcdn.com/image/fetch/$s_!VxfJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e82512c-1913-4494-abe6-b9eda326fe57_1600x581.png 1272w, 
https://substackcdn.com/image/fetch/$s_!VxfJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e82512c-1913-4494-abe6-b9eda326fe57_1600x581.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><a href="https://github.com/apple/ml-stable-diffusion/issues/54#issuecomment-1345295645">https://github.com/apple/ml-stable-diffusion/issues/54#issuecomment-1345295645</a>&nbsp;</figcaption></figure></div><p>This NPU-CNN pairing worked well for a couple of years since most models running on the edge were built on CNN and most models deployed 
were vertically integrated. However, ChatGPT in 2023 disrupted the entire NPU ecosystem as the appetite for AI features grew.</p><h2>The &#8220;large&#8221; language model disruption</h2><p>Even though transformer models were already popular back in 2018, with models like BERT and RoBERTa widely used in natural language processing, it's fair to say that ChatGPT's success caught the industry off-guard and brought transformer models mainstream. Suddenly, models weren't measured in millions of parameters but billions. MobileFaceNet for Android face-unlock uses 0.99M parameters. GPT-3 has 175B. That's a 176,000x increase. ChatGPT's success wasn't just about text generation; it validated transformers as the architecture for the next wave of advanced models. Within months, we saw Stable Diffusion democratize image generation, GPT-4V enable visual understanding, and Whisper transform speech recognition.</p><p>For edge devices, this meant rethinking everything: camera apps that could describe scenes for the visually impaired, voice assistants that actually understood context, AR applications with spatial reasoning, and creative tools that generated content locally. 
While some workloads, like training and large-scale inference, will always need cloud resources, chipmakers recognized that bringing transformer capabilities to edge devices for inference would unlock entirely new product categories.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!f3GT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ff5e077-5650-4472-966e-e11808357600_1600x794.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!f3GT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ff5e077-5650-4472-966e-e11808357600_1600x794.png 424w, https://substackcdn.com/image/fetch/$s_!f3GT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ff5e077-5650-4472-966e-e11808357600_1600x794.png 848w, https://substackcdn.com/image/fetch/$s_!f3GT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ff5e077-5650-4472-966e-e11808357600_1600x794.png 1272w, https://substackcdn.com/image/fetch/$s_!f3GT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ff5e077-5650-4472-966e-e11808357600_1600x794.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!f3GT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ff5e077-5650-4472-966e-e11808357600_1600x794.png" width="1456" height="723" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6ff5e077-5650-4472-966e-e11808357600_1600x794.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:723,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Chart comparing NPU TOPS (Trillions of Operations Per Second) across different devices over the last couple of years&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Chart comparing NPU TOPS (Trillions of Operations Per Second) across different devices over the last couple of years" title="Chart comparing NPU TOPS (Trillions of Operations Per Second) across different devices over the last couple of years" srcset="https://substackcdn.com/image/fetch/$s_!f3GT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ff5e077-5650-4472-966e-e11808357600_1600x794.png 424w, https://substackcdn.com/image/fetch/$s_!f3GT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ff5e077-5650-4472-966e-e11808357600_1600x794.png 848w, https://substackcdn.com/image/fetch/$s_!f3GT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ff5e077-5650-4472-966e-e11808357600_1600x794.png 1272w, https://substackcdn.com/image/fetch/$s_!f3GT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ff5e077-5650-4472-966e-e11808357600_1600x794.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft 
pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Chipmakers scrambled to catch up, pouring investments into NPUs with ever-higher TOPS</figcaption></figure></div><p>Chipmakers scrambled to catch up, pouring investments into NPUs to push TOPS (Trillions of Operations Per Second). Going from models with a few million parameters to models with hundreds of billions wasn't just a scaling problem; it was a fundamental shift. Unlike classic neural networks, transformer-based models are far more dynamic; their inputs, outputs, and computation graphs can vary significantly, requiring new operations and memory patterns that NPUs weren't originally built to handle. 
Even when hardware supports parts of the model, precision-sensitive layers like softmax, LayerNorm, or attention often need higher numerical accuracy (e.g., FP32), which most NPUs do not support, forcing a fallback to CPU or GPU and breaking the performance gains of full offloading.</p><p>The hardware designed for static CNN operations now had to support dynamic transformers that were thousands of times larger in memory. Supporting them is technically possible, but it requires a lot of work.</p><h2>Deployment complexity</h2><p>A 'simple' meeting note-taker today needs five different AI models: speech recognition, speaker embedding, speaker segmentation, a language model for summarization, and voice activity detection. What used to be optional features have become the core of modern AI apps: AI has moved from nice-to-have add-on to the center of the experience in AI-native applications.</p><p>Deploy these five models across platforms and suddenly you're dealing with 25 model-to-NPU conversions: five models times five different NPU architectures (Apple Neural Engine, Qualcomm Hexagon, Intel NPU, AMD XDNA, Google Tensor). For larger models like speech recognition and summarization, you need to offer different variants based on the hardware, so 25 is just the minimum.</p><p>The conversion timeline is brutal. Whisper-v3-turbo was released in October 2024. Qualcomm just <strong><a href="https://github.com/quic/ai-hub-models/issues/115#issuecomment-3025228895">released</a></strong> support for it to run on their NPU in June 2025. By the time support arrives, newer models have already taken the ASR leaderboard. 
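</p><p>The combinatorics above can be sketched in a few lines. The labels are illustrative only; a real pipeline would attach a converter, parity check, and benchmark to each pair:</p>

```python
from itertools import product

# The five models a meeting note-taker needs, crossed with the five
# NPU families named above. Labels are illustrative, not real APIs.
models = ["speech-recognition", "speaker-embedding",
          "speaker-segmentation", "summarization-lm", "vad"]
npus = ["apple-ane", "qualcomm-hexagon", "intel-npu",
        "amd-xdna", "google-tensor"]

conversions = list(product(models, npus))
print(len(conversions))  # 25 separate porting-and-validation efforts
```

<p>Every element of that product is its own engineering effort, before even counting the per-hardware size variants larger models need.</p><p>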
State-of-the-art (SOTA) is being redefined every quarter.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VvY7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff120c22f-2799-472a-bee1-ed6e9f691ac5_795x462.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VvY7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff120c22f-2799-472a-bee1-ed6e9f691ac5_795x462.png 424w, https://substackcdn.com/image/fetch/$s_!VvY7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff120c22f-2799-472a-bee1-ed6e9f691ac5_795x462.png 848w, https://substackcdn.com/image/fetch/$s_!VvY7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff120c22f-2799-472a-bee1-ed6e9f691ac5_795x462.png 1272w, https://substackcdn.com/image/fetch/$s_!VvY7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff120c22f-2799-472a-bee1-ed6e9f691ac5_795x462.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VvY7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff120c22f-2799-472a-bee1-ed6e9f691ac5_795x462.png" width="795" height="462" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f120c22f-2799-472a-bee1-ed6e9f691ac5_795x462.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:462,&quot;width&quot;:795,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Automatic Speech Recognition (ASR) leaderboard showing model performance rankings&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Automatic Speech Recognition (ASR) leaderboard showing model performance rankings" title="Automatic Speech Recognition (ASR) leaderboard showing model performance rankings" srcset="https://substackcdn.com/image/fetch/$s_!VvY7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff120c22f-2799-472a-bee1-ed6e9f691ac5_795x462.png 424w, https://substackcdn.com/image/fetch/$s_!VvY7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff120c22f-2799-472a-bee1-ed6e9f691ac5_795x462.png 848w, https://substackcdn.com/image/fetch/$s_!VvY7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff120c22f-2799-472a-bee1-ed6e9f691ac5_795x462.png 1272w, https://substackcdn.com/image/fetch/$s_!VvY7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff120c22f-2799-472a-bee1-ed6e9f691ac5_795x462.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><a href="https://huggingface.co/spaces/hf-audio/open_asr_leaderboard">ASR leaderboard showing how quickly models evolve and surpass each other</a></figcaption></figure></div><p>Each new model requires platform-specific model optimization, performance validation against GPU versions, energy consumption testing, and native language bindings. <strong>80% of our engineering time went to the model layer alone.</strong> Meanwhile, the industry compounds the problem with constant churn: Microsoft rebranded their local AI solution four times in one year, Apple's new frameworks only work on their latest devices, and Copilot+ PCs heavily prefer Qualcomm devices.</p><p>The pace of model improvement makes the fragmentation worse. 
GPT-4o (May 2024) uses an <strong><a href="https://analyticsindiamag.com/ai-features/did-microsoft-spill-the-secrets-of-openai/">estimated 200B parameters</a></strong>. But smaller models are catching up fast; we'll likely see GPT-4o-level performance in sub-5B-parameter models by 2026. Each breakthrough brings new operations that NPUs must support. Each NPU requires new APIs. Each API needs framework integration. The cycle never ends. Even NVIDIA researchers argue "<strong><a href="https://arxiv.org/pdf/2506.02153">Small Models are the Future of Agentic AI</a></strong>".</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PWg3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9154c102-b755-4ad2-a73f-a574377b863b_1600x954.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PWg3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9154c102-b755-4ad2-a73f-a574377b863b_1600x954.png 424w, https://substackcdn.com/image/fetch/$s_!PWg3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9154c102-b755-4ad2-a73f-a574377b863b_1600x954.png 848w, https://substackcdn.com/image/fetch/$s_!PWg3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9154c102-b755-4ad2-a73f-a574377b863b_1600x954.png 1272w, https://substackcdn.com/image/fetch/$s_!PWg3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9154c102-b755-4ad2-a73f-a574377b863b_1600x954.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!PWg3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9154c102-b755-4ad2-a73f-a574377b863b_1600x954.png" width="1456" height="868" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9154c102-b755-4ad2-a73f-a574377b863b_1600x954.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:868,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Performance comparison chart between Small Language Models (SLMs) and Large Language Models (LLMs)&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Performance comparison chart between Small Language Models (SLMs) and Large Language Models (LLMs)" title="Performance comparison chart between Small Language Models (SLMs) and Large Language Models (LLMs)" srcset="https://substackcdn.com/image/fetch/$s_!PWg3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9154c102-b755-4ad2-a73f-a574377b863b_1600x954.png 424w, https://substackcdn.com/image/fetch/$s_!PWg3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9154c102-b755-4ad2-a73f-a574377b863b_1600x954.png 848w, https://substackcdn.com/image/fetch/$s_!PWg3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9154c102-b755-4ad2-a73f-a574377b863b_1600x954.png 1272w, 
https://substackcdn.com/image/fetch/$s_!PWg3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9154c102-b755-4ad2-a73f-a574377b863b_1600x954.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><a href="https://artificialanalysis.ai">artificialanalysis.ai</a></figcaption></figure></div><h2>Historical context</h2><p>This fragmentation problem isn't unique to NPUs; we've seen this pattern before. The 1970s minicomputer revolution forced developers to choose between incompatible architectures. 
When Apple moved to ARM, official Docker Desktop support took six months to arrive. There are still different CPU architectures today, but the problem has mostly been mitigated by solutions like LLVM and the "write once, run anywhere" promise of the likes of Java and Docker.</p><p>Running models on GPUs isn't a solved problem either. While NVIDIA's CUDA has become the de facto standard, developers still struggle with compatibility across different GPU generations, memory limitations, and the complexity of optimizing models for specific hardware. Even with CUDA's dominance, getting optimal performance requires deep expertise and careful tuning.</p><p>The machine learning community has been working on a solution: unified compiler frameworks that can translate AI models to run efficiently on any hardware. Think of these as the "LLVM for AI": just as LLVM lets programmers write code once and compile it for different CPUs, these ML compilers aim to let developers train a model once and deploy it anywhere.</p><p>Two major projects lead this effort. Apache TVM, started at the University of Washington, creates optimized code for different hardware targets. MLIR (Multi-Level Intermediate Representation), developed at Google, takes a more flexible approach with its "dialect" system that can represent AI computations at different levels of abstraction. Both promise to solve the fragmentation problem by automatically optimizing models for whatever hardware you have.</p><p>However, both approaches still rely on hardware vendors either exposing the right interfaces or contributing optimizations back to these projects, and both are slow processes. With SOTA models and new operations emerging every few months, it will take years before we see comprehensive support if we rely solely on these unified runtimes.</p><p><strong>And what happens when transformers get replaced? 
Do we restart this cycle with the next architecture?</strong></p><p><strong>The problem isn't just technical; it's economic and threatens AI democratization itself.</strong> Each hardware target requires specialized expertise, dedicated testing infrastructure, and ongoing maintenance as models and hardware evolve. For most developers, this overhead makes NPU optimization economically unfeasible despite the compelling efficiency gains.</p><h2>AI-driven deployment</h2><p>We don't need another compiler or framework. <strong>We need AI to solve its own deployment problem.</strong> We've been down that road before with LLVM, Java, and countless other "write once, run anywhere" promises. This time, we have something new: AI itself.</p><p>Coding agents are already writing production code, fixing bugs, and even architecting systems. Why not apply that same intelligence to the model deployment problem? Instead of waiting years for vendors to support new operations, an AI agent could analyze a PyTorch model, understand the target NPU's capabilities, and automatically generate the conversion code. When it hits a limitation, it doesn't give up; it finds workarounds, optimizes differently, or falls back gracefully.</p><p>Using AI for deployment isn't pie-in-the-sky thinking. We've already used a prototype agent to get speaker diarization models running on Apple's ANE, achieving significant efficiency gains over CPU. The agent patched PyTorch code, worked around unsupported operations, and delivered a model that actually ships in production. We are able to take matters into our own hands.</p><p>The key insight is that model optimization is fundamentally a pattern-matching problem with lots of edge cases. That's exactly what AI excels at. Feed it telemetry data, benchmark results, and hardware constraints, and it learns. When a new model architecture emerges, it develops optimization strategies. 
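</p><p>That graceful-fallback behavior can be sketched as a simple dispatch table. This is a hypothetical illustration of the pattern, not the agent's actual implementation: operations with an NPU kernel run there, and anything unsupported takes the CPU path:</p>

```python
# Hypothetical sketch: "npu_ops" maps operation names to NPU kernels;
# anything missing falls back to the (slower but complete) CPU table.
def make_dispatcher(npu_ops, cpu_ops):
    def run(op, *args):
        impl = npu_ops.get(op)
        if impl is None:        # op unsupported on this NPU
            impl = cpu_ops[op]  # fall back until vendors add support
        return impl(*args)
    return run

npu_ops = {"conv": lambda x: ("npu", x)}
cpu_ops = {"conv": lambda x: ("cpu", x),
           "softmax": lambda x: ("cpu", x)}  # no NPU softmax kernel
run = make_dispatcher(npu_ops, cpu_ops)
print(run("conv", 1))     # runs on the NPU path
print(run("softmax", 2))  # falls back to the CPU path
```

<p>The hard part is deciding when a fallback is acceptable and when the graph should be restructured instead; the dispatch itself stays this simple.</p><p>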
For operations not supported by the NPU, we can fall back to the CPU until vendors add support. The fragmentation problem becomes a data problem, and data problems are solvable.</p><h2>Breaking free</h2><p>Cloud AI won't disappear, just like cloud computing didn't kill on-premise servers. But the physics are undeniable: moving compute closer to data is always more efficient. <strong>Your phone recording a meeting shouldn't need to stream audio to a data center for transcription. Your laptop shouldn't need an internet connection to summarize a document.</strong> You cannot have ambient computing if it doesn&#8217;t work without the internet. These aren't radical ideas; they're obvious ones held back by implementation complexity.</p><p>The mainframe-to-PC transition took nearly two decades. We don't have that kind of time. The demand for AI is here now, privacy concerns are mounting, and edge hardware is already capable. <strong>As models eat more into the application code logic, what we need isn't more powerful chips or better frameworks. </strong>We need to stop treating model deployment like it's 1999, where every platform required manual optimization.</p><p>Deploying models to the edge needs to be as simple as deploying application code. Developers will choose between cloud and edge based on their use case, not technical limitations. The hardware exists. The models are small enough. The missing piece is deployment, and AI itself might be the answer.</p><div><hr></div><p><strong>Acknowledgements </strong></p><p>Thank you for the feedback from the community and friends, and special thanks to Ram and Bharat for their hours of debate and brainstorming. 
</p>]]></content:encoded></item><item><title><![CDATA[How are we going to get Intelligence everywhere?]]></title><description><![CDATA[Software took off the moment code became pure logic - portable bytes that ignored the underlying silicon. How will we take the next step for AI?]]></description><link>https://inference.plus/p/how-are-we-going-to-get-intelligence</link><guid isPermaLink="false">https://inference.plus/p/how-are-we-going-to-get-intelligence</guid><dc:creator><![CDATA[Brandon]]></dc:creator><pubDate>Thu, 19 Jun 2025 16:10:37 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/efc404b4-b8ba-4c4f-8472-a97339e1a7df_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>On June 12th 2025, the world was reminded how fragile our AI infrastructure really is. 
When <a href="https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1SsW">GCP</a> and <a href="https://blog.cloudflare.com/cloudflare-service-outage-june-12-2025/">Cloudflare</a> both went down for several hours, they took half the internet with them, including the majority of big AI providers like Gemini and Anthropic.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FZG9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa23980f2-a393-4bb7-a03f-eba0ac62b751_611x406.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FZG9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa23980f2-a393-4bb7-a03f-eba0ac62b751_611x406.webp 424w, https://substackcdn.com/image/fetch/$s_!FZG9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa23980f2-a393-4bb7-a03f-eba0ac62b751_611x406.webp 848w, https://substackcdn.com/image/fetch/$s_!FZG9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa23980f2-a393-4bb7-a03f-eba0ac62b751_611x406.webp 1272w, https://substackcdn.com/image/fetch/$s_!FZG9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa23980f2-a393-4bb7-a03f-eba0ac62b751_611x406.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FZG9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa23980f2-a393-4bb7-a03f-eba0ac62b751_611x406.webp" width="611" height="406" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a23980f2-a393-4bb7-a03f-eba0ac62b751_611x406.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:406,&quot;width&quot;:611,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:11556,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://brandonweng.substack.com/i/166333068?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa23980f2-a393-4bb7-a03f-eba0ac62b751_611x406.webp&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FZG9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa23980f2-a393-4bb7-a03f-eba0ac62b751_611x406.webp 424w, https://substackcdn.com/image/fetch/$s_!FZG9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa23980f2-a393-4bb7-a03f-eba0ac62b751_611x406.webp 848w, https://substackcdn.com/image/fetch/$s_!FZG9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa23980f2-a393-4bb7-a03f-eba0ac62b751_611x406.webp 1272w, https://substackcdn.com/image/fetch/$s_!FZG9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa23980f2-a393-4bb7-a03f-eba0ac62b751_611x406.webp 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Credits: Down Detector</figcaption></figure></div><p>This cascade of failures rendered AI apps like Cursor, Gemini, and Claude essentially unusable. For millions of users, their AI assistants simply vanished. To be fair to the providers, outages at this scale are quite rare: <a href="https://www.cnet.com/tech/services-and-software/fastly-internet-outage-explained-how-one-customer-broke-amazon-reddit-and-half-the-web/">the last major outage was when Fastly brought down Amazon, Reddit, Spotify, and others for 49 minutes in 2021</a>. 
However, it's a reminder that nearly all AI applications today are heavily dependent on the cloud and a stable internet connection.</p><h2>We Have the Hardware</h2><p>Sam Altman recently mentioned that running <a href="https://blog.samaltman.com/the-gentle-singularity">a single ChatGPT query is equivalent to turning on an oven for one second (&#8776;0.34 Wh per prompt), scaling to gigawatt-hours at billions of queries</a>. That might not sound like much, but imagine if every time you asked your phone a question, you were firing up your oven. That&#8217;s where Neural Processing Units (NPUs) come in. These specialized chips are designed to run AI inference (that's the process of actually using a trained model, as opposed to training it) using 5-10x less power than traditional processors.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oz5l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef0cf369-755f-434e-a70f-e1c9406dde73_750x465.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oz5l!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef0cf369-755f-434e-a70f-e1c9406dde73_750x465.webp 424w, https://substackcdn.com/image/fetch/$s_!oz5l!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef0cf369-755f-434e-a70f-e1c9406dde73_750x465.webp 848w, https://substackcdn.com/image/fetch/$s_!oz5l!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef0cf369-755f-434e-a70f-e1c9406dde73_750x465.webp 1272w, 
https://substackcdn.com/image/fetch/$s_!oz5l!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef0cf369-755f-434e-a70f-e1c9406dde73_750x465.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oz5l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef0cf369-755f-434e-a70f-e1c9406dde73_750x465.webp" width="750" height="465" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ef0cf369-755f-434e-a70f-e1c9406dde73_750x465.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:465,&quot;width&quot;:750,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:11208,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://brandonweng.substack.com/i/166333068?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef0cf369-755f-434e-a70f-e1c9406dde73_750x465.webp&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oz5l!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef0cf369-755f-434e-a70f-e1c9406dde73_750x465.webp 424w, https://substackcdn.com/image/fetch/$s_!oz5l!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef0cf369-755f-434e-a70f-e1c9406dde73_750x465.webp 848w, 
https://substackcdn.com/image/fetch/$s_!oz5l!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef0cf369-755f-434e-a70f-e1c9406dde73_750x465.webp 1272w, https://substackcdn.com/image/fetch/$s_!oz5l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef0cf369-755f-434e-a70f-e1c9406dde73_750x465.webp 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Typical power draw for an image generation task. 
<a href="https://creativestrategies.com/research/white-paper-the-npu-wattage-advantage/">The NPU Wattage Advantage - Creative Strategies</a></figcaption></figure></div><p>"NPU" has become the catch-all label for on-chip AI horsepower, but every vendor slaps its own badge on the same idea. Apple ships a Neural Engine (ANE), Qualcomm has Hexagon, MediaTek has an APU, Arm has Ethos, Intel has AI Boost, AMD has XDNA, Google has TPU in the cloud and Edge TPU on devices, Graphcore has an IPU, Horizon Robotics has a BPU, and the list keeps growing. Under the stickers? All of them are matrix engines built to chew through neural-net math - some even stretch to full-blown training.</p><p>This echoes the early days of computing when architectural incompatibility was the norm. IBM System/360 code (1964) wouldn't boot on a DEC PDP-11 (1970), and today an ANE binary won't magically light up on a rival NPU. Different decade, same fragmentation - just with tensors instead of punched cards. Throughout the 1970s minicomputer revolution, incompatible architectures forced developers to choose between competing platforms, fragmenting the software ecosystem.</p><p>The industry is attempting to solve this NPU fragmentation through abstraction layers, much like how graphics APIs eventually unified GPU programming. Google's Android Neural Networks API (NNAPI) tried to abstract NPU access across Android devices, though it was deprecated as an NDK API in Android 15, with developers now migrating to TensorFlow Lite or hardware-specific SDKs. Apple's Core ML provides a unified interface for their Neural Engine. ONNX promises model portability across platforms. But unlike graphics APIs that had decades to mature, or CUDA that could dominate through sheer market force, NPU standards are fragmenting faster than they're converging. 
Cross-platform solutions like ONNX or Foundry Local exist, but integration with native applications often reveals they're outdated, supporting models that are months behind state-of-the-art.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!m3J-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ddc796c-2c5a-46fc-8439-252262cf2253_8778x4131.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!m3J-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ddc796c-2c5a-46fc-8439-252262cf2253_8778x4131.png 424w, https://substackcdn.com/image/fetch/$s_!m3J-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ddc796c-2c5a-46fc-8439-252262cf2253_8778x4131.png 848w, https://substackcdn.com/image/fetch/$s_!m3J-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ddc796c-2c5a-46fc-8439-252262cf2253_8778x4131.png 1272w, https://substackcdn.com/image/fetch/$s_!m3J-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ddc796c-2c5a-46fc-8439-252262cf2253_8778x4131.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!m3J-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ddc796c-2c5a-46fc-8439-252262cf2253_8778x4131.png" width="1456" height="685" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4ddc796c-2c5a-46fc-8439-252262cf2253_8778x4131.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7b5ed0d4-2ec0-4a90-bb54-3cc79847a0dc_8778x4131.png&quot;,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:685,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1630912,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://brandonweng.substack.com/i/166333068?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b5ed0d4-2ec0-4a90-bb54-3cc79847a0dc_8778x4131.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!m3J-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ddc796c-2c5a-46fc-8439-252262cf2253_8778x4131.png 424w, https://substackcdn.com/image/fetch/$s_!m3J-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ddc796c-2c5a-46fc-8439-252262cf2253_8778x4131.png 848w, https://substackcdn.com/image/fetch/$s_!m3J-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ddc796c-2c5a-46fc-8439-252262cf2253_8778x4131.png 1272w, https://substackcdn.com/image/fetch/$s_!m3J-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ddc796c-2c5a-46fc-8439-252262cf2253_8778x4131.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button 
tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Fragmented System for running AI inference on Edge Devices</figcaption></figure></div><h2><strong>But We Can't Use It</strong></h2><p>The fragmentation is real, and it's holding everyone back. Apple's recent Foundation Model Framework shows what's possible with the right infrastructure &#8211; what used to take us two weeks to implement (getting speech to text and local LLMs running on a Mac) now takes less than a day, especially with their new SpeechAnalyzer and SpeechTranscriber APIs that offer significantly better ASR (Automatic Speech Recognition) models than their previous speech recognition offerings. But that's just on Apple's tightly controlled ecosystem and users have to be running OS 26+. 
Model updates were also said to arrive with OS updates, which raises its own worry: if the foundation models are refreshed only once a year they go stale, but if updates come too frequently they become a pain to maintain as well!</p><p>Microsoft can't seem to make up its mind about how to brand its toolchain either. From <a href="https://web.archive.org/web/20240308210544/https://learn.microsoft.com/en-us/windows/ai/">Windows AI</a> to <a href="https://blogs.windows.com/windowsexperience/2024/12/06/phi-silica-small-but-mighty-on-device-slm/#:~:text=Get%20started%20with%20Phi%20Silica%20in%20the%20Windows%20App%20SDK">Windows App SDK</a> to <a href="https://web.archive.org/web/20250429070422/https://learn.microsoft.com/en-us/windows/ai/">Windows Copilot Runtime</a> to <a href="https://learn.microsoft.com/en-us/windows/ai/overview">Windows AI Foundry</a> (its latest rebrand), the frequent rebranding itself illustrates the confusion and uncertainty in the space. Even Microsoft's Phi Silica model, which is supposed to ship with Windows 11, is still in developer preview after eight months, and you can only access it on Windows devices running Qualcomm chips.</p><p>Microsoft faces the challenge of supporting multiple chip makers, and partly because of that it's still behind Apple. Foundry Local supports Qualcomm with phi-4-mini-reasoning (Apr 2025) and deepseek-r1 (Jan 2025) on the NPU, but other chips still have to run inference on the CPU/GPU. 
Note that phi-4-mini-reasoning is a different model from the standard phi-4-mini, and <a href="https://github.com/openvinotoolkit/openvino.genai/issues/2082">developers report various implementation challenges even with the officially supported versions.</a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wuGy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccdc8088-4f94-4a14-96a2-e0403a3a2968_787x451.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wuGy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccdc8088-4f94-4a14-96a2-e0403a3a2968_787x451.png 424w, https://substackcdn.com/image/fetch/$s_!wuGy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccdc8088-4f94-4a14-96a2-e0403a3a2968_787x451.png 848w, https://substackcdn.com/image/fetch/$s_!wuGy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccdc8088-4f94-4a14-96a2-e0403a3a2968_787x451.png 1272w, https://substackcdn.com/image/fetch/$s_!wuGy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccdc8088-4f94-4a14-96a2-e0403a3a2968_787x451.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wuGy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccdc8088-4f94-4a14-96a2-e0403a3a2968_787x451.png" width="787" height="451" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ccdc8088-4f94-4a14-96a2-e0403a3a2968_787x451.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:451,&quot;width&quot;:787,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;State of NPU&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="State of NPU" title="State of NPU" srcset="https://substackcdn.com/image/fetch/$s_!wuGy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccdc8088-4f94-4a14-96a2-e0403a3a2968_787x451.png 424w, https://substackcdn.com/image/fetch/$s_!wuGy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccdc8088-4f94-4a14-96a2-e0403a3a2968_787x451.png 848w, https://substackcdn.com/image/fetch/$s_!wuGy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccdc8088-4f94-4a14-96a2-e0403a3a2968_787x451.png 1272w, https://substackcdn.com/image/fetch/$s_!wuGy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccdc8088-4f94-4a14-96a2-e0403a3a2968_787x451.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Foundry Local models supported on a Snapdragon Microsoft Surface Laptop as of June 2025</figcaption></figure></div><p>Then you have to worry about state-of-the-art models changing every couple of months and working with chip makers to ensure that minor changes in training translate properly when running inference on each platform.</p><p>The conversion pipeline itself is a nightmare. Most models start life as PyTorch checkpoints trained on NVIDIA GPUs. Getting them to run on edge devices means navigating a maze of format conversions between PyTorch, Safetensors, and ONNX. Each conversion step introduces potential precision loss, unsupported operations, and performance degradation. What worked perfectly in your training environment might completely fail or run 10x slower on the target device.</p><p>Most machine learning engineers operate in a Python-rich world. Models are trained and tuned in Python - but application developers work in their native language. 
Folks building solutions for endpoint devices need to distribute their apps in the native format. That means additional support for Swift, .NET, TypeScript, Rust, C++, Go, Kotlin, and many other languages that users can choose from. The SDK fragmentation alone is enough to make most developers stick with cloud APIs. And it's not just about converting LLM or ASR models either: once users get a taste of AI-rich features, they want more.</p><h2>And Users Want Everything</h2><p>You can offer a privacy-first, cost-efficient, offline experience to the end user, but it needs to be nearly as valuable as the cloud-first solutions, or else most of your users will not stick around. Only those who place a premium on privacy or have strict compliance requirements may end up making the trade-off.</p><p>Once you start building AI-native applications, you will run into requests for speaker diarization, vision capabilities, OCR, text-to-speech, and the like - capabilities that have traditionally run in the cloud on a GPU. Each of these requires its own model with a different architecture. The fragmentation problem compounds exponentially: it's hard enough getting a single LLM to run efficiently across different chips. Now multiply that by every AI capability users expect.</p><h2>The Path to Intelligence Everywhere</h2><p>Despite these challenges, the momentum is undeniable. Small language models are getting remarkably good, good enough for many everyday tasks. The power efficiency gains are real. 
And the demand for offline, private, always-available AI is only growing.</p><p>So how do we actually get intelligence everywhere? We're at an inflection point. The hardware is here. The models are quite capable already. What we need now is the connective tissue: the standards, frameworks, and tools that will make edge AI as seamless as cloud AI has become.</p><p>As with application code on CPUs (x86, x64, ARM), we're unlikely to see a holistic solution where the world converges on one architecture. However, the field needs to narrow down to 2-3 architectures, not one per chip maker. Even then, the underlying hardware should ideally be abstracted from the end developer through a platform-agnostic layer, much as Docker or the JVM made deploying application code on the CPU simple.</p><p>The underlying building blocks for models on NPUs and the techniques for conversion should not be the moat. They need to be open-sourced and shared with the community to let everyone accelerate the timeline. We all win if more inference happens on-device.</p><p>The path forward isn't mysterious. It's about removing artificial barriers and building the boring but essential infrastructure that lets intelligence flow to every device. By 2030, we'll look back at today's cloud-dependent AI the same way we look at dial-up internet: quaint, necessary for its time, but ultimately a stepping stone to something far more powerful and pervasive. Software took off the moment code became pure logic: portable bytes that ignored the underlying silicon. When AI models reach that same level of portability and cost-efficiency, intelligence will truly be everywhere we are.</p>]]></content:encoded></item></channel></rss>