GPU, CPU, or NPU? Matching Inference Workloads to Hardware Across Platforms
Choosing the right hardware for machine learning inference can make or break product performance, cost, and user experience. Between Apple’s Neural Engine on iOS, Android’s NNAPI devices, desktop GPUs, and server-class accelerators, the decision space is crowded—and rapidly evolving. This guide distills what matters for computer vision (CV), natural language processing (NLP), and automatic speech recognition (ASR), explains how quantization and batching change the equation, and offers practical portability and benchmarking tips for Core ML, TensorFlow Lite (TFLite), and ONNX Runtime. Where useful, we reference authoritative sources and real-world examples.
Why hardware choice matters: a quick historical lens
Mobile inference moved from CPU-only to heterogeneous acceleration in the last decade. Apple introduced the Neural Engine (ANE) with A11 in 2017 and continues to expand its capability; the A17 Pro Neural Engine is rated at up to 35 TOPS according to Apple’s newsroom announcement (Apple). On Android, the Neural Networks API (NNAPI), launched in Android 8.1 (Google), provides a standard interface to vendor accelerators such as Qualcomm’s Hexagon DSP and NPUs (Qualcomm). On desktop and in the data center, GPUs—augmented by libraries and runtimes like TensorRT and Triton—have become the default for high-throughput inference (NVIDIA TensorRT; NVIDIA Triton). Standardized benchmarking via MLPerf Inference has added transparency across systems (MLCommons).
Match the workload to the platform
Computer vision (CV)
Image classification and object detection models (e.g., MobileNet, YOLO variants) map well to mobile NPUs and GPU delegates. On iOS, Core ML can place supported ops on the ANE with impressive efficiency; Apple’s ML research blog shows diffusion models running with Core ML optimizations on Apple Silicon (Apple ML Research). On Android, TFLite with the NNAPI delegate can route convolution-heavy graphs to the device accelerator, with documented gains over CPU-only execution (TFLite best practices). In the data center, CNNs attain the highest throughput on GPUs; MLPerf Inference results consistently show order-of-magnitude or greater throughput advantages over CPU-only systems for vision tasks (MLPerf results).
NLP
Transformer-based models stress memory bandwidth and benefit from mixed precision and batching. On mobile, distilled and quantized variants (e.g., DistilBERT, MobileBERT) perform acceptably with NNAPI/ANE acceleration; larger LLMs typically require aggressive quantization (int8/int4) and operator fusion. On desktop/server, GPUs with tensor cores and optimized runtimes (e.g., TensorRT-LLM) dominate for throughput and latency, particularly with batch sizes greater than one.
ASR (speech)
Streaming ASR prioritizes low, stable latency at batch size 1. Google’s on-device recognizer demonstrated near-cloud parity using RNN-T on mobile with careful optimization (Google AI Blog). On iOS, Core ML can leverage ANE for acoustic models; on Android, NNAPI or the CPU/XNNPACK path may win for streaming if the accelerator scheduling adds jitter. In server settings, GPU inference with small micro-batches can deliver low-latency transcripts for multi-stream workloads.
Platform specifics: iOS, Android, desktop, and servers
iOS and Apple Silicon
Core ML routes layers across ANE, GPU (Metal Performance Shaders), and CPU (BNNS) depending on operator support and shape. For stable latency and energy efficiency, prefer ANE-friendly ops, fused activations, and static shapes when possible (Apple Core ML optimization; Metal Performance Shaders; BNNS).
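As an illustration, here is a minimal coremltools sketch that converts a traced PyTorch model and requests ANE-plus-CPU execution. MobileNetV3 and the iOS 17 deployment target are stand-ins for your own model and constraints, and actual ANE placement still depends on operator support.

```python
# Sketch: convert a traced PyTorch model to Core ML and prefer the ANE.
# MobileNetV3 is a stand-in for your model; shapes/targets are examples.
import torch
import torchvision
import coremltools as ct

model = torchvision.models.mobilenet_v3_small(weights="DEFAULT").eval()
example = torch.rand(1, 3, 224, 224)
traced = torch.jit.trace(model, example)

# A static input shape helps Core ML partition the graph onto the ANE.
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="image", shape=(1, 3, 224, 224))],
    compute_units=ct.ComputeUnit.CPU_AND_NE,  # prefer ANE, fall back to CPU
    minimum_deployment_target=ct.target.iOS17,
)
mlmodel.save("MobileNetV3.mlpackage")
```

Profiling in Xcode Instruments (or the Core ML performance report) then confirms which layers actually land on the ANE versus the GPU or CPU.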
Android devices
TFLite offers execution paths via XNNPACK (CPU), GPU delegates (OpenGL ES/Vulkan), and NNAPI. The NNAPI delegate taps vendor drivers when available (e.g., Snapdragon’s AI Engine) and falls back gracefully, but operators and shapes unsupported by a given driver may force partial CPU execution. Follow TFLite’s best practices for fusing ops and limiting dynamic shapes, and prefer the GPU delegate for image tasks with large textures (TFLite GPU).
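For a desk-side sanity check, the following Python sketch loads a TFLite model and optionally attaches an external delegate. The model path and delegate library name are placeholders; in a shipping Android app you would normally use the Java/Kotlin or C++ delegate APIs rather than Python.

```python
# Sketch: load a TFLite model and optionally attach an external delegate.
# "libvendor_delegate.so" is a hypothetical vendor delegate library;
# without it, execution falls back to the default CPU/XNNPACK path.
import numpy as np
import tensorflow as tf

delegates = []
try:
    delegates.append(tf.lite.experimental.load_delegate("libvendor_delegate.so"))
except (ValueError, OSError):
    pass  # no delegate available; stay on CPU/XNNPACK

interpreter = tf.lite.Interpreter(
    model_path="model.tflite",          # placeholder path
    experimental_delegates=delegates,
    num_threads=4,                      # thread tuning matters on CPU
)
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
interpreter.invoke()
result = interpreter.get_tensor(out["index"])
```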
Desktop and server
For throughput-intensive production (CV classification at scale, embedding generation, batched NLP), GPUs paired with CUDA/ROCm and optimized runtimes (TensorRT, ONNX Runtime with CUDA, Triton Inference Server) are the norm. CPUs still shine for small models, control-plane logic, pre/post-processing, or strict-cost environments; libraries like oneDNN and XNNPACK provide strong SIMD-optimized kernels (ONNX Runtime CUDA EP; oneDNN; XNNPACK).
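A minimal ONNX Runtime sketch along these lines, assuming a CUDA-capable machine; the model path, input name, and batch shape are placeholders.

```python
# Sketch: run an ONNX model with the CUDA execution provider when present,
# falling back to CPU. "model.onnx" and the input shape are placeholders.
import numpy as np
import onnxruntime as ort

providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
session = ort.InferenceSession("model.onnx", providers=providers)

input_name = session.get_inputs()[0].name
batch = np.random.rand(8, 3, 224, 224).astype(np.float32)  # batched CV input
outputs = session.run(None, {input_name: batch})
print("Active providers:", session.get_providers())
```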
Quantization and batching: the biggest levers
Quantization reduces precision (e.g., FP32 to INT8/INT4) to accelerate compute and reduce memory bandwidth. TFLite reports 2–4x speedups on ARM CPUs with post-training int8 quantization for many models, often with minimal accuracy loss when properly calibrated (TFLite PTQ). For best results, consider quantization-aware training (QAT) to preserve accuracy on sensitive tasks. On iOS, Core ML supports linear quantization and weight compression; some operators can remain in higher precision while others are quantized. On server GPUs, mixed precision (FP16/FP8) is standard; INT8 requires calibration but can boost throughput further (see vendor docs for your accelerator).
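Here is a hedged sketch of TFLite post-training int8 quantization. The SavedModel path and the random calibration generator are placeholders; real calibration data should mirror production inputs.

```python
# Sketch: TFLite post-training int8 quantization with a representative dataset.
# "saved_model_dir" and the random calibration inputs are placeholders.
import tensorflow as tf

def representative_dataset():
    # Yield a few hundred calibration samples shaped like real inputs.
    for _ in range(100):
        yield [tf.random.uniform([1, 224, 224, 3], dtype=tf.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Force full-integer kernels; keep float I/O if your pipeline expects it.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_int8 = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_int8)
```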
Batching trades latency for throughput. Mobile apps often run batch=1, prioritizing UX and energy. Server inference thrives on batching (or dynamic micro-batching) to saturate GPUs; Triton’s dynamic batcher can aggregate requests within tight windows to raise utilization (Triton dynamic batching). For streaming ASR or interactive NLP, consider small micro-batches or token-level scheduling to balance tail latency and throughput.
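To make the idea concrete, below is a generic micro-batching sketch (not Triton's implementation): requests are aggregated until the batch fills or a short window expires, then executed together.

```python
# Sketch: a generic dynamic micro-batcher. run_model is a placeholder for
# the actual batched inference call.
import queue
import threading
import time

MAX_BATCH = 16
WINDOW_S = 0.005  # 5 ms aggregation window

def run_model(batch_inputs):
    return [x * 2 for x in batch_inputs]  # placeholder compute

requests = queue.Queue()  # items are (input, reply_queue) tuples

def batcher_loop():
    while True:
        first = requests.get()              # block until one request arrives
        batch = [first]
        deadline = time.monotonic() + WINDOW_S
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(requests.get(timeout=remaining))
            except queue.Empty:
                break
        outputs = run_model([inp for inp, _ in batch])
        for (_, reply), out in zip(batch, outputs):
            reply.put(out)

threading.Thread(target=batcher_loop, daemon=True).start()

def infer(x):
    reply = queue.Queue(maxsize=1)
    requests.put((x, reply))
    return reply.get()  # caller blocks until its result is ready

print(infer(21))  # -> 42
```

In production, prefer a battle-tested scheduler such as Triton's dynamic batcher; this sketch only illustrates the latency-versus-throughput trade-off that the window and batch-size parameters control.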
Portability tips for Core ML, TFLite, and ONNX
- Prefer standard ops and avoid custom layers when possible. Check ONNX opset compatibility and Core ML supported layers before committing to an architecture (ONNX operators).
- Use official converters: coremltools for PyTorch/TF to Core ML (coremltools), TFLiteConverter for TF models (TFLite converter), and ONNX exporters in PyTorch/TF; a hedged export sketch follows this list.
- Constrain shapes where you can. Static shapes aid graph partitioning on ANE and NNAPI.
- Fuse pre/post-processing into the graph if the backend benefits; otherwise, pin it to CPU to avoid device hops.
- Test fallbacks: verify that partial-acceleration scenarios maintain acceptable latency when an op isn’t supported by the NPU/driver.
- For desktop Windows apps, ONNX Runtime with DirectML is a good portable path across GPUs (DirectML EP).
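As an example of the converter step above, here is a hedged PyTorch-to-ONNX export sketch; MobileNetV3, the opset version, and the I/O names are illustrative choices, not requirements.

```python
# Sketch: export a PyTorch model to ONNX with a pinned opset and named I/O.
# MobileNetV3 stands in for your model; opset 17 is an example choice.
import torch
import torchvision

model = torchvision.models.mobilenet_v3_small(weights="DEFAULT").eval()
dummy = torch.rand(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy,
    "model.onnx",
    opset_version=17,
    input_names=["image"],
    output_names=["logits"],
    # Keep shapes static unless the target truly needs a dynamic batch:
    # dynamic_axes={"image": {0: "batch"}, "logits": {0: "batch"}},
)
```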
Benchmarking methodology that actually predicts production
- Warm-up runs and steady-state: run several warm-ups to allow kernels to JIT/compile and caches to prime; record steady-state metrics (a measurement harness sketch follows this list).
- Latency distributions: capture P50/P90/P99, not just averages; tail latency drives UX and SLOs.
- Thermals and power: on mobile, measure energy and throttling using Xcode Instruments and system logs; on Android, use Perfetto and vendor perf tools (Xcode Instruments; Perfetto).
- Realistic inputs: benchmark with the same resolutions, sequence lengths, and audio sample rates you’ll ship.
- Accuracy checks: validate post-quantization accuracy drift with task-relevant metrics (mAP, WER, F1).
- Contention: simulate concurrent app activity, background services, and network IO where relevant.
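A minimal harness along these lines, with a placeholder run_once standing in for your actual inference call:

```python
# Sketch: a minimal latency harness with warm-up and percentile reporting.
# run_once is a placeholder for a single end-to-end inference.
import time
import numpy as np

def run_once():
    time.sleep(0.004)  # stand-in for model.predict(...) / interpreter.invoke()

WARMUP, ITERS = 20, 200

for _ in range(WARMUP):          # let kernels compile and caches warm up
    run_once()

latencies_ms = []
for _ in range(ITERS):
    start = time.perf_counter()
    run_once()
    latencies_ms.append((time.perf_counter() - start) * 1000.0)

p50, p90, p99 = np.percentile(latencies_ms, [50, 90, 99])
print(f"P50={p50:.2f} ms  P90={p90:.2f} ms  P99={p99:.2f} ms  "
      f"throughput ~ {1000.0 / np.mean(latencies_ms):.1f} req/s")
```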
Real-world glimpses
- On-device dictation and translation on iOS rely heavily on ANE acceleration for low-latency experiences; the A17 Pro highlights improved ANE throughput for such tasks (Apple).
- Google’s fully on-device ASR in Gboard showcased a compact RNN-T optimized for real-time inference on mobile (Google AI Blog).
- Stable Diffusion on Apple Silicon demonstrates how graph surgery, attention optimizations, and Core ML tooling enable heavy CV workloads on consumer devices (Apple ML Research).
Keyword deep dives
Inference
Inference is the runtime execution of a trained model to produce predictions. For practical engineering, the core metrics are latency (time per request), throughput (requests per second), and efficiency (performance per watt or per dollar). Your choice of hardware changes which metric you can optimize: mobile devices emphasize latency and energy at batch=1, while servers target high throughput with large or dynamic batches. Standardized suites like MLPerf Inference help contextualize expectations across hardware classes (MLPerf).
NPU
A Neural Processing Unit (NPU) is a specialized accelerator optimized for common ML ops (convolutions, attention, matrix multiplies) with low-precision arithmetic and high on-chip bandwidth. On iOS, the ANE automatically accelerates eligible Core ML layers; on Android, NNAPI exposes vendor NPUs and DSPs for TFLite graphs. NPUs are ideal when operator coverage matches your model; gaps can force costly CPU fallbacks, so audit operator support early (NNAPI).
GPU
GPUs excel at massively parallel math with mature software stacks. Mobile GPUs (Metal, Vulkan) shine for image-centric tasks and some transformers, while desktop/server GPUs dominate for batched CV, NLP, and embedding workloads. Runtimes such as TensorRT and ONNX Runtime CUDA provide kernel fusion, quantization, and graph optimizations to fully utilize the hardware (TensorRT; ONNX Runtime CUDA).
CPU
CPUs remain the most portable baseline for inference and are competitive for small models, light traffic, or heavy pre/post-processing. With vectorized libraries (oneDNN on x86, Accelerate/BNNS on Apple Silicon, XNNPACK on ARM), int8 or fp16 can yield large speedups without special hardware. For batch=1 interactive tasks, a well-optimized CPU path can be simpler and cost-effective.
Core ML
Core ML is Apple’s model format and runtime that orchestrates execution across ANE, GPU, and CPU. Use coremltools to convert from PyTorch/TF, prefer supported ops, and profile with Xcode Instruments to verify device placement. Apple documents performance strategies, including model quantization and flexible shapes for attention models (Core ML docs; coremltools).
TFLite
TensorFlow Lite targets mobile and embedded inference. It provides CPU (XNNPACK), GPU (OpenGL/Vulkan/Metal), and NNAPI delegates, plus built-in tooling for post-training quantization. Follow the official performance guide for thread tuning, delegate choice, and memory mapping of models to reduce startup overhead (TFLite performance).
ONNX
ONNX (Open Neural Network Exchange) is a portable graph format with broad framework support. ONNX Runtime runs models on CPU, CUDA, ROCm, TensorRT, and DirectML, making it ideal for cross-OS desktop apps and servers. Pay attention to opset versions and execution providers, and use the perf tuning tools to select kernels per platform (ONNX Runtime).
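A short sketch of the checks mentioned here, assuming a placeholder model path; which providers are available depends on how ONNX Runtime was built and installed.

```python
# Sketch: quick opset and execution-provider checks before deployment.
# "model.onnx" is a placeholder path.
import onnx
import onnxruntime as ort

model = onnx.load("model.onnx")
print("Opsets:", [(op.domain or "ai.onnx", op.version) for op in model.opset_import])

# Pick the preferred execution providers that this build actually supports.
preferred = ["TensorrtExecutionProvider", "CUDAExecutionProvider",
             "DmlExecutionProvider", "CPUExecutionProvider"]
available = ort.get_available_providers()
chosen = [p for p in preferred if p in available]
print("Using providers:", chosen)
session = ort.InferenceSession("model.onnx", providers=chosen)
```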
Quantization
Quantization trades precision for speed and memory savings. On ARM CPUs and many NPUs, int8 quantization often brings 2–4x latency reductions; per-channel quantization and proper calibration minimize accuracy loss. For transformers, weight-only quantization (int8/int4) and mixed-precision activations are popular compromises; consider QAT for sensitive tasks and validate with task metrics (TF Model Optimization).
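As one concrete option, ONNX Runtime's dynamic quantization stores weights as int8 and computes activation scales at runtime; the sketch below uses placeholder paths and should be followed by an accuracy check on task metrics (mAP, WER, F1).

```python
# Sketch: dynamic int8 quantization with ONNX Runtime's tooling.
# Paths are placeholders; validate accuracy on your task afterwards.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="model.onnx",
    model_output="model_int8.onnx",
    weight_type=QuantType.QInt8,   # weights stored as int8
)
```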
Benchmarking
Benchmarking should mirror production: identical sequence lengths, image sizes, and audio rates; realistic concurrency; warm-ups; and P50/P90/P99 reporting. On mobile, include energy and thermals; on servers, track GPU utilization, memory bandwidth, and request queues. Build automation so every model revision produces comparable metrics—ideally as part of CI.
Mobile AI
Mobile AI is moving experiences onto the device for privacy, latency, and cost control: real-time vision filters, offline speech, summarization, and translation. Success hinges on selecting the right delegate (ANE/NNAPI/GPU), constraining model size, applying quantization, and rigorously profiling on target devices and OS versions. Expect heterogeneous deployments: the same app may use the ANE on iOS flagships, NNAPI on premium Android, and CPU/GPU fallbacks on older phones.
Putting it together: a selection guide
- CV on mobile: prefer NPU/NNAPI or GPU delegate; quantize to int8; batch=1.
- Interactive NLP on mobile: distilled + quantized transformer; test NPU vs CPU/XNNPACK for stability; consider token-caching strategies.
- Streaming ASR: low-latency pipeline with minimal device hops; avoid ops that force fallback; micro-batch only if tail latency allows.
- Desktop apps: ONNX Runtime with DirectML/CUDA for portability; keep a CPU path for broad coverage.
- Server scale: GPU with TensorRT/ONNX Runtime, dynamic batching, mixed precision; consider CPU-only for small models and spiky loads.
Need help choosing and proving the winner?
If you want a rigorous, hardware-aware evaluation, Teyrex offers a benchmarking and portability service that compares Core ML, TFLite, and ONNX Runtime paths across representative devices and servers, reports latency distributions, energy, and accuracy deltas post-quantization, and delivers deployment-ready configurations. Get in touch at teyrex.com. If you are also scaling the surrounding web or dashboard components, our teams can help with full‑stack development and performant Next.js interfaces for your AI workloads.