Title HASC: Heterogeneity-aware Scheduling and Communication for Large Language Model Inference
Authors Hong Huize; Tae Hee Han (한태희)
Page pp.40-48
ISSN 2287-5026
Keywords LLM inference; Heterogeneity-aware scheduling; Topology-aware communication; Memory-aware batching; Distributed systems
Abstract Distributed Large Language Model (LLM) inference increasingly faces hardware heterogeneity, where compute, memory, and network resources vary across devices. Existing systems typically enforce uniform execution policies that treat all devices as homogeneous, which causes coordination failures when heterogeneous resources meet at step-level synchronization barriers. We show that the resulting performance degradation stems not from a single bottleneck but from the combined impact of compute imbalance, memory asymmetry, and topology-oblivious communication. This paper presents HASC (Heterogeneity-Aware Scheduling and Communication), a unified runtime framework that resolves these resource misalignments by jointly optimizing across all three dimensions. HASC uses online profiling to capture runtime hardware characteristics and dynamically adapts workload scheduling and collective communication to the detected heterogeneity. Evaluated on a heterogeneous multi-GPU cluster, HASC reduces per-token latency by up to 60.7% compared to the DeepSpeed Inference baseline, showing that cross-dimension coordination benefits LLM serving on non-uniform hardware.
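The effect of a step-level synchronization barrier on heterogeneous devices can be illustrated with a small sketch. This is not HASC's algorithm: it is a toy model, assuming each device has a single profiled throughput in tokens per second, and it covers only the compute-imbalance dimension (memory asymmetry and network topology are not modeled). The function names and the example device names and rates are hypothetical.

```python
from typing import Dict


def step_time(split: Dict[str, int], tokens_per_sec: Dict[str, float]) -> float:
    """Duration of one synchronized step: the barrier waits for the slowest device."""
    return max(split[d] / tokens_per_sec[d] for d in split)


def uniform_split(batch: int, tokens_per_sec: Dict[str, float]) -> Dict[str, int]:
    """Heterogeneity-oblivious policy: give every device the same share."""
    devices = list(tokens_per_sec)
    base, extra = divmod(batch, len(devices))
    # The first `extra` devices absorb the remainder, one item each.
    return {d: base + (1 if i < extra else 0) for i, d in enumerate(devices)}


def proportional_split(batch: int, tokens_per_sec: Dict[str, float]) -> Dict[str, int]:
    """Assumed heterogeneity-aware policy: share proportional to profiled throughput."""
    total = sum(tokens_per_sec.values())
    shares = {d: batch * r / total for d, r in tokens_per_sec.items()}
    split = {d: int(s) for d, s in shares.items()}
    # Largest-remainder rounding so the shares still sum to `batch`.
    leftover = batch - sum(split.values())
    by_remainder = sorted(shares, key=lambda d: shares[d] - split[d], reverse=True)
    for d in by_remainder[:leftover]:
        split[d] += 1
    return split


# Hypothetical profiled throughputs for two unequal GPUs.
rates = {"gpu_fast": 300.0, "gpu_slow": 100.0}
uniform = uniform_split(64, rates)
weighted = proportional_split(64, rates)
```

In this toy setting the uniform split assigns 32 items to each device, so the step takes about 0.32 s because the fast GPU idles at the barrier while the slow one finishes. The throughput-proportional split assigns 48 and 16 items, and both devices reach the barrier after about 0.16 s.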