TP10 - Language Model Inference in Multi-Device XR Settings

POSTER

Bakshree Mishra, Evelyn Hou, Siddarth Pattisapu, Sarita Adve

Language models now span a wide spectrum of sizes, including compact models capable of executing on edge devices. While prior work has optimized inference on individual devices, the user's full device ecosystem, which spans multiple devices and heterogeneous accelerators with vastly different compute, memory, and energy envelopes, remains an underutilized resource. One domain that exercises such a multi-device ecosystem is extended reality (XR), where AR/VR headsets paired with compute pucks make LLMs naturally useful for agentic behavior as well as spoken language understanding. We characterize the multi-device edge as a new mapping regime, distinct from on-SoC heterogeneous scheduling and datacenter disaggregation, and show that the dominant bottleneck of an inference task is jointly determined by hardware microarchitecture, inference task characteristics, and accumulated runtime state, motivating a workload- and history-aware placement abstraction.

In this work, we provide a methodology for orchestrating language model inference across multiple edge devices, including an analytical–empirical performance and energy predictor that captures system behavior and a 2D inference-space representation that characterizes inference tasks by their prompt and decode lengths. We show that the dominant hardware bottleneck of an inference task shifts across multi-turn iterations as context accumulates, making static hardware assignment suboptimal even within a single application session. We then propose a Mapper system that orchestrates inference across heterogeneous edge devices. We validate that the analytical performance model achieves a mean absolute percentage error (MAPE) within 2.7% across all systems, and that the analytical energy model is within 5.2% MAPE for total inference energy on the edge GPU. Leveraging the model, we divide the 2D inference space into regions of preferred hardware mappings under power, energy, and energy-delay constraints, show that no single configuration consistently achieves the best performance or energy efficiency, and show that the proposed Mapper outperforms the best static hardware mapping by up to 2.4X across the workload space.
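To make the placement idea concrete, the following is a minimal illustrative sketch, not the paper's actual predictor or Mapper: it assumes a hypothetical two-device ecosystem with invented throughput and power numbers, estimates latency and energy from prompt and decode lengths, and selects the device minimizing an energy-delay product at each point of the 2D inference space.

```python
# Illustrative sketch only. Device names, throughput/power figures, and the
# simple roofline-style cost model below are hypothetical placeholders, not
# measured values or the paper's analytical-empirical predictor.
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    prefill_tok_per_s: float  # throughput for processing the prompt (prefill)
    decode_tok_per_s: float   # throughput for generating output tokens (decode)
    power_w: float            # average power draw during inference


def predict(dev: Device, prompt_len: int, decode_len: int) -> tuple[float, float]:
    """Estimate (latency in s, energy in J) for one inference on one device."""
    latency = prompt_len / dev.prefill_tok_per_s + decode_len / dev.decode_tok_per_s
    return latency, latency * dev.power_w


def map_inference(devices: list[Device], prompt_len: int, decode_len: int) -> Device:
    """Pick the device minimizing energy-delay product for this (prompt, decode) point."""
    def edp(dev: Device) -> float:
        latency, energy = predict(dev, prompt_len, decode_len)
        return latency * energy
    return min(devices, key=edp)


if __name__ == "__main__":
    # Hypothetical headset NPU (low power, modest throughput) vs. puck GPU
    # (higher power, higher throughput); numbers are placeholders.
    headset = Device("headset-npu", prefill_tok_per_s=400, decode_tok_per_s=25, power_w=3.0)
    puck = Device("puck-gpu", prefill_tok_per_s=2000, decode_tok_per_s=60, power_w=12.0)

    # As context accumulates across turns, the prompt grows and the preferred
    # mapping can flip, which is why a static assignment can become suboptimal.
    for prompt_len in (128, 1024, 4096):
        choice = map_inference([headset, puck], prompt_len, decode_len=64)
        print(f"prompt={prompt_len:5d} -> {choice.name}")
```

In this toy setting, short prompts favor the lower-power headset device while long accumulated contexts favor the higher-throughput puck, illustrating why the bottleneck (and hence the preferred mapping) shifts across the 2D inference space and across multi-turn sessions.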
