Title Improving the Execution Speed of Transformer-based Object Tracking Models through Multi-head Attention Parallelization
Authors 김인모(Inmo Kim) ; 김명선(Myungsun Kim)
DOI https://doi.org/10.5573/ieie.2023.60.4.39
Page pp.39-47
ISSN 2287-5026
Keywords Transformer; Multi-head attention; CSWinTT; Object tracking; Multi-threading
Abstract With recent advances in deep learning-based object tracking, the technology is being used in various application fields such as sports game analysis, video security, and augmented reality. Users require not only high tracking accuracy but also high QoS in the form of fast tracking speed. In this study, we improve the object tracking speed of CSWinTT, a transformer-based object-tracking model currently regarded as a state-of-the-art tracking solution. The head operations of the Multi-Head Attention (MHA) in the encoder layer of this model account for most of the execution time of the transformer's entire inference procedure. Although each head operates on a different input, the heads are executed serially. To overcome this, we execute the head operations in parallel: the single MHA module is divided into one sub-module per head, and each sub-module runs in a multi-threading environment. Because a pure Python environment does not guarantee truly concurrent multi-threading, we port the implementation to C++ to enable complete multi-threading. In addition, kernels launched asynchronously by each thread can be executed as concurrently as possible inside the GPU. Across various experiments on the effect of MHA parallel execution, the average execution time of the encoder decreased by 56.8% and the average FPS increased by 63.3% compared to the existing method, while inference accuracy remained almost unchanged.