%0 journal article %@ 1095-7197 %A Munch, P.,Dravins, I.,Kronbichler, M.,Neytcheva, M. %D 2023 %J SIAM Journal on Scientific Computing %N %P S71-S96 %R doi:10.1137/22M1503270 %T Stage-parallel fully implicit Runge-Kutta implementations with optimal multilevel preconditioners at the scaling limit %U https://doi.org/10.1137/22M1503270 %X We present an implementation of a stage-parallel preconditioner for Radau IIA type fully implicit Runge–Kutta methods, which approximates the inverse of the Runge–Kutta matrix AQ from the Butcher tableau by the lower triangular matrix resulting from an LU decomposition and diagonalizes the system with as many blocks as stages. For the transformed system, we employ a block preconditioner where each block is distributed and solved by a subgroup of processes in parallel. For combination of partial results, we use either a communication pattern resembling Cannon’s algorithm or shared memory. A performance model and a large set of performance studies (including strong-scaling runs with up to 150k processes on 3k compute nodes) conducted for a time-dependent heat problem, using matrix-free finite element methods, indicate that the stage-parallel implementation can reach higher throughputs near the scaling limit. The achievable speedup increases linearly with the number of stages and is bounded by the number of stages. Furthermore, we show that the presented stage-parallel concepts are also applicable to the case that AQ is directly diagonalized, which requires either complex arithmetic or solutions of two-by-two blocks, both exposing about half the parallelism. Alternatively to distributing stages and assigning them to distinct processes, we discuss the possibility of batching operations from different stages together.