Reading List

2024

Martin Lücke, Oleksandr Zinenko, William S. Moses, Michel Steuwer, and Albert Cohen,
"The MLIR Transform Dialect. Your compiler is more powerful than you think."
arXiv preprint arXiv:2409.03864, September 2024.

Luigi Fusco, Mikhail Khalilov, Marcin Chrapek, Giridhar Chukkapalli, Thomas Schulthess, and Torsten Hoefler,
"Understanding Data Movement in Tightly Coupled Heterogeneous Systems: A Case Study with the Grace Hopper Superchip"
arXiv preprint arXiv:2408.11556, August 2024.

Elias Frantar, Roberto L. Castro, Jiale Chen, Torsten Hoefler, and Dan Alistarh,
"MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models"
arXiv preprint arXiv:2408.11743, August 2024.

Patrik Okanovic, Grzegorz Kwasniewski, Paolo Sylos Labini, Maciej Besta, Flavio Vella, and Torsten Hoefler,
"High Performance Unstructured SpMM Computation Using Tensor Cores"
arXiv preprint arXiv:2408.11551, August 2024.

Tanzima Z. Islam, Aniruddha Marathe, Holland Schutte, and Mohammad Zaeed,
"Data-Driven Analysis to Understand GPU Hardware Resource Usage of Optimizations"
arXiv preprint arXiv:2408.10143, August 2024.

Arunavo Dey, Aakash Dhakal, Tanzima Z. Islam, Jae-Seung Yeom, Tapasya Patki, Daniel Nichols, Alexander Movsesyan, and Abhinav Bhatele,
"Relative Performance Prediction Using Few-Shot Learning"
Proceedings of the 2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC '24), Torino, Italy, July 2024, pp. 1764-1769.

Daniel Nichols, Pranav Polasam, Harshitha Menon, Aniruddha Marathe, Todd Gamblin, and Abhinav Bhatele,
"Performance-aligned LLMs for generating fast code"
arXiv preprint arXiv:2404.18864, April 2024.

Dolores Miao, Ignacio Laguna, Giorgis Georgakoudis, Konstantinos Parasyris, and Cindy Rubio-González,
"An automated OpenMP mutation testing framework for performance optimization"
Parallel Computing, Volume 130, February 2024, 103097.

Hui Zhou, Ken Raffenetti, Yanfei Guo, Thomas Gillis, Robert Latham, and Rajeev Thakur,
"Designing and Prototyping Extensions to MPI in MPICH"
arXiv preprint arXiv:2402.12274, February 2024.

Riley Shipley, Garrett Hooten, David Boehme, Derek Schafer, Anthony Skjellum, and Olga Pearce,
"MPI Implementation Profiling for Better Application Performance"
arXiv preprint arXiv:2402.12203, February 2024.

Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman,
"SliceGPT: Compress large language models by deleting rows and columns"
arXiv preprint arXiv:2401.15024, January 2024.

Onur Cankur, Aditya Tomar, Daniel Nichols, Connor Scully-Allison, Katherine E. Isaacs, and Abhinav Bhatele,
"Automated Programmatic Performance Analysis of Parallel Programs"
arXiv preprint arXiv:2401.13150, January 2024.

Wei-Chen Lin, Simon McIntosh-Smith, and Tom Deakin,
"Preliminary report: Initial evaluation of StdPar implementations on AMD GPUs for HPC"
arXiv preprint arXiv:2401.02680, January 2024.

Aaron Jarmusch, Felipe Cabarcas, Swaroop Pophale, Andrew Kallai, Johannes Doerfert, Luke Peyralans, Seyong Lee, Joel Denny, and Sunita Chandrasekaran,
"CI/CD Efforts for Validation, Verification and Benchmarking OpenMP Implementations"
In International Workshop on OpenMP (IWOMP 2024), pp. 111-125, Cham: Springer Nature Switzerland, 2024.

Pouya Kousha,
"Designing Conversational AI Enabled Services and Performance Analysis Tools for High-Performance Computing"
PhD dissertation, The Ohio State University, 2024.

Augustine Wong,
"Out with outliers: making sense of multi-threaded application performance at scale with NonSequitur"
PhD dissertation, University of British Columbia, 2024.

Kun Qian, Yongqing Xi, Jiamin Cao, Jiaqi Gao, Yichi Xu, Yu Guan, Binzhang Fu, et al.,
"Alibaba HPN: A data center network for large language model training"
Proceedings of the ACM SIGCOMM 2024 Conference, pp. 691-706, 2024.

Hammad Ather, Jean Luca Bez, Yankun Xia, and Suren Byna,
"Drilling Down I/O Bottlenecks with Cross-layer I/O Profile Exploration"
Proceedings of the 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS '24), pp. 532-543, 2024.

Alan Smith, Gabriel H. Loh, Michael J. Schulte, Mike Ignatowski, Samuel Naffziger, Mike Mantor, Mark Fowler, Nathan Kalyanasundharam, et al.,
"Realizing the AMD Exascale Heterogeneous Processor Vision: Industry Product"
Proceedings of the 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA '24), pp. 876-889, 2024.

Eva Siegmann, Robert J. Harrison, David Carlson, Smeet Chheda, Anthony Curtis, Firat Coskun, Raul Gonzalez, Daniel Wood, and Nikolay A. Simakov,
"First Impressions of the Sapphire Rapids Processor with HBM for Scientific Workloads"
SN Computer Science, Volume 5, Issue 5, Article 623, 2024.

Christian Munley, Aaron Jarmusch, and Sunita Chandrasekaran,
"LLM4VV: Developing LLM-driven testsuite for compiler validation"
Future Generation Computer Systems, 2024 (In Press).

Murali Emani, Sam Foreman, Varuni Sastry, Zhen Xie, Siddhisanket Raskar, William Arnold, Rajeev Thakur, et al.,
"Toward a Holistic Performance Evaluation of Large Language Models Across Diverse AI Accelerators"
Proceedings of the 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW '24), pp. 1-10, 2024.

Javad Abdi, Gilead Posluns, Guozheng Zhang, Boxuan Wang, and Mark C. Jeffrey,
"When Is Parallelism Fearless and Zero-Cost with Rust?"
Proceedings of the 36th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '24), pp. 27-40, 2024.

Ke Fan, Suraj Kesavan, Steve Petruzza, and Sidharth Kumar,
"TinyProf: Towards Continuous Performance Introspection through Scalable Parallel I/O"
Proceedings of the ISC High Performance 2024 (39th International Conference), pp. 1-12, 2024.

Yuyang Jin, Haojie Wang, Runxin Zhong, Chen Zhang, Xia Liao, Feng Zhang, and Jidong Zhai,
"Graph-Centric Performance Analysis for Large-Scale Parallel Applications"
IEEE Transactions on Parallel and Distributed Systems, 2024 (Early Access).

Daniel Nichols, Aniruddha Marathe, Harshitha Menon, Todd Gamblin, and Abhinav Bhatele,
"HPC-Coder: Modeling Parallel Programs using Large Language Models"
Proceedings of the ISC High Performance 2024 (39th International Conference), pp. 1-12, 2024.

Daniel Nichols, Alexander Movsesyan, Jae-Seung Yeom, Abhik Sarkar, Daniel Milroy, Tapasya Patki, and Abhinav Bhatele,
"Predicting Cross-Architecture Performance of Parallel Programs"
Proceedings of the 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS '24), pp. 570-581, 2024.

Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, et al.,
"MegaScale: Scaling large language model training to more than 10,000 GPUs"
Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI '24), pp. 745-760, 2024.

Ahamed Al Nahian and Brian Demsky,
"FlowProf: Profiling Multi-threaded Programs using Information-Flow"
Proceedings of the 33rd ACM SIGPLAN International Conference on Compiler Construction (CC '24), pp. 137-149, 2024.

Connor Scully-Allison, Ian Lumsden, Katy Williams, Jesse Bartels, Michela Taufer, Stephanie Brink, Abhinav Bhatele, Olga Pearce, and Katherine E. Isaacs,
"Design Concerns for Integrated Scripting and Interactive Visualization in Notebook Environments"
IEEE Transactions on Visualization and Computer Graphics, 2024 (Early Access).

Sajal Dash, Isaac R. Lyngaas, Junqi Yin, Xiao Wang, Romain Egele, J. Austin Ellis, Matthias Maiterth, Guojing Cong, Feiyi Wang, and Prasanna Balaprakash,
"Optimizing distributed training on Frontier for large language models"
Proceedings of the ISC High Performance 2024 (39th International Conference), pp. 1-11, 2024.

Qidong Zhao, Milind Chabbi, and Xu Liu,
"EasyView: Bringing Performance Profiles into Integrated Development Environments"
Proceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO '24), pp. 386-398, 2024.

Feiyang Jin, Zhizhou Zhang, Rajkishore Barik, Gautam Korlam, and Milind Chabbi,
"Early notice: GenAI-based Datarace Fix for Real-World Golang Programs"
Preprint, 2024.

Alexandre Denis, Emmanuel Jeannot, Philippe Swartvagher, and Samuel Thibault,
"Tracing task‐based runtime systems: Feedbacks from the StarPU case"
Concurrency and Computation: Practice and Experience, Volume 36, Issue 3, e7920, 2024.

2023

Sofia Serrano, Zander Brumbaugh, and Noah A. Smith,
"Language Models: A Guide for the Perplexed"
arXiv preprint arXiv:2311.17301, November 2023.

Chris Cummins, Volker Seeker, Dejan Grubisic, Mostafa Elhoushi, Youwei Liang, Baptiste Roziere, Jonas Gehring, et al.,
"Large language models for compiler optimization"
arXiv preprint arXiv:2309.07062, September 2023.

Sean Lie,
"Cerebras architecture deep dive: First look inside the hardware/software co-design for deep learning"
IEEE Micro, vol. 43, no. 3, pp. 18-30, May-June 2023.

Abhinav Bhatele, Rakrish Dhakal, Alexander Movsesyan, Aditya Ranjan, Jordan Marry, and Onur Cankur,
"Pipit: Enabling programmatic analysis of parallel execution traces"
arXiv preprint arXiv:2306.11177, June 2023.

Camille Coti, Kevin Huck, and Allen D. Malony,
"STaKTAU: profiling HPC applications' operating system usage"
arXiv preprint arXiv:2304.11205, April 2023.

Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, et al.,
"Sparks of artificial general intelligence: Early experiments with GPT-4"
arXiv preprint arXiv:2303.12712, March 2023.

Yuanhao Wei,
"General Techniques for Efficient Concurrent Data Structures"
PhD dissertation, Carnegie Mellon University, 2023.

Siddhartha Visveswara Jayanti,
"Simple, Fast, Scalable, and Reliable Multiprocessor Algorithms"
PhD dissertation, Massachusetts Institute of Technology, 2023.

Yueming Hao, Nikhil Jain, Rob Van der Wijngaart, Nirmal Saxena, Yuanbo Fan, and Xu Liu,
"DrGPU: A top-down profiler for GPU applications"
Proceedings of the 2023 ACM/SPEC International Conference on Performance Engineering (ICPE '23), pp. 43-53, 2023.

Emery D. Berger, Sam Stern, and Juan Altmayer Pizzorno,
"Triangulating Python performance issues with SCALENE"
Proceedings of the 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI '23), pp. 51-64, 2023.

Mao Lin, Keren Zhou, and Pengfei Su,
"DrGPUM: Guiding memory optimization for GPU-accelerated applications"
Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (ASPLOS '23), pp. 164-178, 2023.

Andreas Herten,
"Many Cores, Many Models: GPU Programming Model vs. Vendor Compatibility Overview"
Proceedings of the SC'23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, pp. 1019-1026, 2023.

Lukas Trümper, Tal Ben-Nun, Philipp Schaad, Alexandru Calotoiu, and Torsten Hoefler,
"Performance embeddings: A similarity-based transfer tuning approach to performance optimization"
Proceedings of the 37th International Conference on Supercomputing (ICS '23), pp. 50-62, 2023.

Muhammad Aditya Sasongko, Milind Chabbi, Paul H. J. Kelly, and Didem Unat,
"Precise event sampling on AMD versus Intel: Quantitative and qualitative comparison"
IEEE Transactions on Parallel and Distributed Systems, vol. 34, no. 5, pp. 1594-1608, 2023.

Yating Wei, Zhiyong Wang, Zhongwei Wang, Yong Dai, Gongchang Ou, Han Gao, Haitao Yang, et al.,
"Visual diagnostics of parallel performance in training large-scale DNN models"
IEEE Transactions on Visualization and Computer Graphics, 2023.

Baodi Shan, Mauricio Araya-Polo, Abid M. Malik, and Barbara Chapman,
"MPI-based Remote OpenMP Offloading: A More Efficient and Easy-to-use Implementation"
In Proceedings of the 14th International Workshop on Programming Models and Applications for Multicores and Manycores, pp. 50-59, 2023.

Ayesha Afzal, Georg Hager, Stefano Markidis, and Gerhard Wellein,
"Making applications faster by asynchronous execution: Slowing down processes or relaxing MPI collectives"
Future Generation Computer Systems, vol. 148, pp. 472-487, 2023.

Eduardo Rosales, Matteo Basso, Andrea Rosà, and Walter Binder,
"Profiling and optimizing Java streams"
The Art, Science, and Engineering of Programming, vol. 7, no. 3, Article 10, 2023.

Daniel Reed, Dennis Gannon, and Jack Dongarra,
"HPC forecast: Cloudy and uncertain"
Communications of the ACM, vol. 66, no. 2, pp. 82-90, 2023.

2022

Daniel Reed, Dennis Gannon, and Jack Dongarra,
"Reinventing high performance computing: challenges and opportunities"
arXiv preprint arXiv:2203.02544, March 2022.

Keren Zhou,
"Performance Measurement, Analysis, and Optimization of GPU-accelerated Applications"
PhD dissertation, Rice University, 2022.

Srinivasan Ramesh,
"Performance Observability and Monitoring of High Performance Computing with Microservices"
PhD dissertation, University of Oregon, 2022.

Allen D. Malony and Sameer S. Shende,
"Translating High-Performance Computing Tools From Research to Practice: Experiences With the TAU Performance System"
Computing in Science & Engineering, vol. 24, no. 5, pp. 65-71, 2022.

Dan Zhang, Safeen Huda, Ebrahim Songhori, Kartik Prabhu, Quoc Le, Anna Goldie, and Azalia Mirhoseini,
"A full-stack search technique for domain optimized deep learning accelerators"
Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '22), pp. 27-42, 2022.

Juan Cruz Viotti and Mital Kinderkhedia,
"A survey of JSON-compatible binary serialization specifications"
arXiv preprint arXiv:2201.02089, 2022.

Galen M. Shipman, Jered Dominguez-Trujillo, Kevin Sheridan, and Sriram Swaminarayan,
"Assessing the Memory Wall in Complex Codes"
In Proceedings of the 2022 IEEE/ACM Workshop on Memory Centric High Performance Computing (MCHPC '22), pp. 30-35, IEEE, 2022.

Ahmad Alawneh, Mahmoud Khairy, and Timothy G. Rogers,
"A SIMT Analyzer for Multi-Threaded CPU Applications"
In Proceedings of the 2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS '22), pp. 248-250, IEEE, 2022.

Konstantin Levit-Gurevich, Alex Skaletsky, Michael Berezalsky, Yulia Kuznetcova, and Hila Yakov,
"Profiling Intel graphics architecture with long instruction traces"
In Proceedings of the 2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS '22), pp. 1-11, IEEE, 2022.

Yuting Jiang, Yifan Xiong, Lei Qu, Cheng Luo, Chen Tian, Peng Cheng, and Yongqiang Xiong,
"Moneo: Monitoring fine-grained metrics nonintrusively in AI infrastructure"
ACM SIGOPS Operating Systems Review, vol. 56, no. 1, pp. 18-25, 2022.

Pengcheng Li, Yixin Guo, Yingwei Luo, Xiaolin Wang, Zhenlin Wang, and Xu Liu,
"Graph neural networks based memory inefficiency detection using selective sampling"
Proceedings of the SC22: International Conference for High Performance Computing, Networking, Storage and Analysis (SC '22), pp. 1-14, 2022.

Efraim Rotem, Adi Yoaz, Lihu Rappoport, Stephen J. Robinson, Julius Yuli Mandelblat, Arik Gihon, Eliezer Weissmann, et al.,
"Intel Alder Lake CPU architectures"
IEEE Micro, vol. 42, no. 3, pp. 13-19, 2022.

Sascha Hunold, Jordy I. Ajanohoun, Ioannis Vardas, and Jesper Larsson Träff,
"An overhead analysis of MPI profiling and tracing tools"
Proceedings of the 2nd Workshop on Performance EngineeRing, Modelling, Analysis, and VisualizatiOn Strategy (PERMAVOST '22), pp. 5-13, 2022.

Onur Cankur and Abhinav Bhatele,
"Comparative evaluation of call graph generation by profiling tools"
In: International Conference on High Performance Computing, pp. 213-232. Springer, Cham, 2022.

Oliver Rausch, Tal Ben-Nun, Nikoli Dryden, Andrei Ivanov, Shigang Li, and Torsten Hoefler,
"A data-centric optimization framework for machine learning"
Proceedings of the 36th ACM International Conference on Supercomputing (ICS '22), pp. 1-13, 2022.

Matthew Leinhauser, René Widera, Sergei Bastrakov, Alexander Debus, Michael Bussmann, and Sunita Chandrasekaran,
"Metrics and design of an instruction roofline model for AMD GPUs"
ACM Transactions on Parallel Computing, vol. 9, no. 1, pp. 1-14, 2022.
[Source Code]

Jonathan Vincent, Jing Gong, Martin Karp, Adam Peplinski, Niclas Jansson, Artur Podobas, Andreas Jocksch, et al.,
"Strong scaling of OpenACC enabled Nek5000 on several GPU based HPC systems"
Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region (HPC Asia '22), pp. 94-102, 2022.

2021

Omar Aaziz, Benjamin Allan, James Brandt, Jeanine Cook, Karen Devine, James Elliott, Ann Gentile, et al.,
"Integrated system and application continuous performance monitoring and analysis capability"
Technical Report SAND2021-11184, Sandia National Laboratories, Albuquerque, NM, 2021.

Lexiang Huang and Timothy Zhu,
"tprof: Performance profiling via structural aggregation and automated analysis of distributed systems traces"
Proceedings of the ACM Symposium on Cloud Computing (SoCC '21), pp. 76-91, 2021.

Yang Liu, Wissam M. Sid-Lakhdar, Osni Marques, Xinran Zhu, Chang Meng, James W. Demmel, and Xiaoye S. Li,
"GPTune: Multitask learning for autotuning exascale applications"
Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '21), pp. 234-246, 2021.
[Source Code]
[GPTune User Guide]

Lechen Yu, Joachim Protze, Oscar Hernandez, and Vivek Sarkar,
"ARBALEST: dynamic detection of data mapping issues in heterogeneous OpenMP applications"
In Proceedings of the 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS '21), pp. 464-474, IEEE, 2021.

Tobias Gysi, Christoph Müller, Oleksandr Zinenko, Stephan Herhut, Eddie Davis, Tobias Wicky, Oliver Fuhrer, Torsten Hoefler, and Tobias Grosser,
"Domain-specific multi-level IR rewriting for GPU: The Open Earth compiler for GPU-accelerated climate simulation"
ACM Transactions on Architecture and Code Optimization (TACO), vol. 18, no. 4, pp. 1-23, 2021.

Dominik Ernst, Georg Hager, Matthias Knorr, Gerhard Wellein, and Markus Holzer,
"Opening the black box: Performance estimation during code generation for GPUs"
Proceedings of the 2021 IEEE 33rd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD '21), pp. 22-32, 2021.

Gabriel Marin, Alexey Alexandrov, and Tipp Moseley,
"Break dancing: low overhead, architecture neutral software branch tracing"
Proceedings of the 22nd ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES '21), pp. 122-133, 2021.

E. Bethel, Colleen Heinemann, and Talita Perciano,
"Performance tradeoffs in shared-memory platform portable implementations of a stencil kernel"
Technical Report, 2021.

Xin Zhao, Jin Zhou, Hui Guan, Wei Wang, Xu Liu, and Tongping Liu,
"NumaPerf: Predictive NUMA profiling"
Proceedings of the ACM International Conference on Supercomputing (ICS '21), pp. 52-62, 2021.

Robert Dietrich, Frank Winkler, Ronny Tschüter, and Matthias Weber,
"Enabling Performance Analysis of Kokkos Applications with Score-P"
In: Tools for High Performance Computing 2018/2019, pp. 169-182. Springer, Cham, 2021.

Jaemin Choi, Zane Fink, Sam White, Nitin Bhat, David F. Richards, and Laxmikant V. Kale,
"GPU-aware communication with UCX in parallel programming models: Charm++, MPI, and Python"
Proceedings of the 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW '21), pp. 479-488, 2021.

Tuowen Zhao, Mary Hall, Hans Johansen, and Samuel Williams,
"Improving communication by optimizing on-node data movement with data layout"
Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '21), pp. 304-317, 2021.

XM Shine Zhai, David Gutzwiller, Kunal Puri, and Charles Hirsch,
"GPU Acceleration of the FINE/FR CFD Solver in a Heterogeneous Environment with OpenACC Directives"
In Accelerator Programming Using Directives: 7th International Workshop, WACCPD 2020, Virtual Event, November 20, 2020, Proceedings, vol. 7, pp. 47-57, Springer International Publishing, 2021.

2020

Yuyang Jin, Haojie Wang, Teng Yu, Xiongchao Tang, Torsten Hoefler, Xu Liu, and Jidong Zhai,
"ScalAna: Automating scaling loss detection with graph analysis"
Proceedings of the SC20: International Conference for High Performance Computing, Networking, Storage and Analysis (SC '20), pp. 1-14, 2020.

Michael Knobloch and Bernd Mohr,
"Tools for GPU computing–debugging and performance analysis of heterogeneous HPC applications"
Supercomputing Frontiers and Innovations, vol. 7, no. 1, pp. 91-111, 2020.

Alexandru Calotoiu, Markus Geisenhofer, Florian Kummer, Marcus Ritter, Jens Weber, Torsten Hoefler, Martin Oberlack, and Felix Wolf,
"Empirical Modeling of Spatially Diverging Performance"
Proceedings of the 2020 IEEE/ACM International Workshop on HPC User Support Tools (HUST) and Workshop on Programming and Performance Visualization Tools (ProTools), pp. 71-80, 2020.

2019

Tal Ben-Nun and Torsten Hoefler,
"Demystifying parallel and distributed deep learning: An in-depth concurrency analysis"
ACM Computing Surveys (CSUR), Volume 52, Issue 4, Article 65, pp. 1-43, August 2019.

Hao Xu, Qingsen Wang, Shuang Song, Lizy Kurian John, and Xu Liu,
"Can we trust profiling results? Understanding and fixing the inaccuracy in modern profilers"
Proceedings of the ACM International Conference on Supercomputing (ICS '19), pp. 284-295, 2019.

2015

Charlie Curtsinger and Emery D. Berger,
"Coz: Finding code that counts with causal profiling"
Proceedings of the 25th Symposium on Operating Systems Principles (SOSP '15), pp. 184-197, 2015.

2014

Michael Bauer, Sean Treichler, and Alex Aiken,
"Singe: Leveraging warp specialization for high performance on GPUs"
Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '14), pp. 119-130, 2014.

2010

Nathan R. Tallent, Laksono Adhianto, and John M. Mellor-Crummey,
"Scalable identification of load imbalance in parallel executions using call path profiles"
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC '10), pp. 1-11, 2010.

Other Resources

Extrae Documentation,
"Extrae Documentation, Release 4.2.2"
BSC Performance Tools, August 30, 2024.

Aurelio A. Vivas Meza, Solomon Bekele, Thomas Applencourt, and Brice Videau,
"Tracing Heterogeneous Programming Models with LTTng and Babeltrace"
Argonne National Laboratory, September 17, 2023.

Daniel Walsh,
"Podman in Action: Secure, rootless containers for Kubernetes, microservices, and more"
Simon and Schuster, 2023.

Hong Jiang,
"Intel’s Ponte Vecchio GPU Architecture, Systems & Software"
Intel Corporation, August 2022.

CGO's Binary Profiling, Tracing, Sampling session,
"Binary Profiling, Tracing, Sampling session"
In Proceedings of the Conference on Code Generation and Optimization (CGO '22), ACM, 2022.

"Efficient IO with io_uring"