
References

[1] NVIDIA, NVIDIA Tesla V100 GPU Architecture, 2017.
[2] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi, “GPUWattch: Enabling energy optimizations in GPGPUs,” ACM SIGARCH Computer Architecture News, vol. 41, no. 3, pp. 487-498, 2013.
[3] V. Kandiah, S. Peverelle, M. Khairy, J. Pan, A. Manjunath, T. G. Rogers, T. M. Aamodt, and N. Hardavellas, “AccelWattch: A power modeling framework for modern GPUs,” Proc. of MICRO '21: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 738-753, 2021.
[4] N. Vijaykumar, G. Pekhimenko, A. Jog, A. Bhowmick, R. Ausavarungnirun, C. Das, M. Kandemir, T. C. Mowry, and O. Mutlu, “A case for core-assisted bottleneck acceleration in GPUs: Enabling flexible data compression with assist warps,” Proc. of the 42nd Annual International Symposium on Computer Architecture, pp. 41-53, 2015.
[5] S. Lee, K. Kim, G. Koo, H. Jeon, W. W. Ro, and M. Annavaram, “Warped-compression: Enabling power efficient GPUs through register compression,” Proc. of the 42nd Annual International Symposium on Computer Architecture, pp. 502-514, 2015.
[6] AMD, The Polaris Architecture, 2016.
[7] D. Wong, N. S. Kim, and M. Annavaram, “Approximating warps with intra-warp operand value similarity,” Proc. of 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2016.
[8] S. Sardashti and D. A. Wood, “Decoupled compressed cache: Exploiting spatial locality for energy-optimized compressed caching,” Proc. of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 62-73, 2013.
[9] T. M. Nguyen and D. Wentzlaff, “MORC: A manycore-oriented compressed cache,” Proc. of the 48th International Symposium on Microarchitecture, pp. 76-88, 2015.
[10] S. Hong, B. Abali, A. Buyuktosunoglu, M. B. Healy, and P. J. Nair, “Touché: Towards ideal and efficient cache compression by mitigating tag area overheads,” Proc. of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 453-465, 2019.
[11] G. Pekhimenko, E. Bolotin, N. Vijaykumar, O. Mutlu, T. C. Mowry, and S. W. Keckler, “A case for toggle-aware compression for GPU systems,” Proc. of 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2016.
[12] S. Lal, J. Lucas, and B. Juurlink, “E²MC: Entropy encoding based memory compression for GPUs,” Proc. of 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2017.
[13] G. Pekhimenko, V. Seshadri, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C. Mowry, “Base-delta-immediate compression: Practical data compression for on-chip caches,” Proc. of the 21st International Conference on Parallel Architectures and Compilation Techniques, pp. 377-388, 2012.
[14] G. Li, X. Chen, G. Sun, H. Hoffmann, Y. Liu, Y. Wang, and H. Yang, “A STT-RAM-based low-power hybrid register file for GPGPUs,” Proc. of the 52nd Annual Design Automation Conference, pp. 1-6, 2015.
[15] W. Jeon, J. H. Park, Y. Kim, G. Koo, and W. W. Ro, “Hi-End: Hierarchical, endurance-aware STT-MRAM-based register file for energy-efficient GPUs,” IEEE Access, vol. 8, pp. 127768-127780, 2020.
[16] C. E. Shannon, “A mathematical theory of communication,” Bell System Technical Journal, vol. 27, no. 3, July 1948.
[17] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, “Rodinia: A benchmark suite for heterogeneous computing,” Proc. of 2009 IEEE International Symposium on Workload Characterization (IISWC), 2009.
[18] G. M. Amdahl, “Validity of the single-processor approach to achieving large scale computing capabilities,” Proc. of Spring Joint Computer Conference, pp. 483-485, 1967.
[19] J. A. Stratton, C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang, N. Anssari, G. D. Liu, and W.-m. W. Hwu, “Parboil: A revised benchmark suite for scientific and commercial throughput computing,” Center for Reliable and High-Performance Computing, vol. 127, 2012.
[20] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, “Analyzing CUDA Workloads using a Detailed GPU Simulator,” Proc. of 2009 IEEE International Symposium on Performance Analysis of Systems and Software, 2009.
[21] NVIDIA, NVIDIA’s Fermi: The First Complete GPU Computing Architecture, 2009.