Conference Agenda

Overview and details of the sessions of this conference. Please select a date or location to show only sessions at that day or location. Please select a single session for detailed view (with abstracts and downloads if available).

Please note that all times are shown in the time zone of the conference. The current conference time is: 20th June 2026, 11:24:30am CEST

Session

Poster Blitz

Time:

Tuesday, 19/May/2026:

5:15pm - 6:00pm

Session Chair: Robert Speck

Location: Central Library (Lecture hall)

Build. 4.7

Support: Leon Thelen

Presentations

AI-Assisted Hardware Design with Multi-Platform RTL Verification and Physical Feasibility Analysis

Connor Bohannon, Kazutomo Yoshii

Argonne National Laboratory, United States of America

We present an automated RTL testing workflow for PCAComp (Principal Component Analysis-based compression hardware), a lossy compressor used in scientific data compression for near-sensor and on-chip processing. The proposed framework supports CPU-side testing via Verilator, which is currently implemented, as well as GPU-side testing using NVIDIA GEM for GPU-accelerated RTL simulation, which is under active development. A unified automation pipeline enables reproducible testing across platforms, while a JSON-driven configuration system allows flexible evaluation across different hardware configurations. The testing infrastructure bridges Scala-based Chisel hardware design with Python-based simulation, enabling seamless integration between hardware generation and verification. This work addresses verification bottlenecks in developing GPU-accelerated hardware for HPC applications, providing a scalable approach to multi-platform RTL verification for scientific computing workloads.

Next Steps in Large-Scale Density Functional Theory

Paul F Baumeister

Jülich Supercomputing Centre, Forschungszentrum Jülich, Jülich, Germany

The continued miniaturization of semiconductor devices has pushed characteristic length scales into the nanometer regime, where quantum mechanical effects dominate electronic, chemical, magnetic, and mechanical behavior. Density Functional Theory (DFT) provides a reliable and predictive framework for such systems, but conventional eigenvalue-based formulations scale cubically with system size and become computationally prohibitive for realistic nanoelectronic devices containing millions of atoms. In addition, the global communication patterns inherent to eigensolvers limit scalability on modern massively parallel high-performance computing (HPC) architectures.

Linear-scaling electronic structure methods offer a promising alternative. Density-matrix-based approaches achieve linear scaling by exploiting the nearsightedness of electronic matter, but they rely on a finite band gap and are therefore largely restricted to insulating systems. Metallic systems and semiconductor devices with conducting leads remain challenging. Green’s function-based DFT overcomes this limitation by enabling linear scaling through spatial truncation of long-range interactions while remaining applicable to metallic systems.

In this work, we present AngstromCube, a real-space Green’s Function Density Functional Theory (RSGF-DFT) application designed explicitly for large-scale GPU-accelerated HPC platforms. Instead of solving the Kohn–Sham eigenvalue problem, AngstromCube computes the time-independent Green’s function of the effective single-particle Hamiltonian. The electron density is obtained from the imaginary part of the diagonal Green’s function via the Kramers–Kronig relation, eliminating explicit band summations. The method requires contour integration in complex energy space, with sampling strategies informed by established techniques from the Korringa–Kohn–Rostoker multiple scattering community.
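The link between the diagonal Green's function and the density can be illustrated on a toy model. The following is a minimal sketch, not AngstromCube code: it assumes a hypothetical two-site Hamiltonian, evaluates G(z) = (z - H)^-1 analytically, and integrates -Im G_00 / π along the real axis with a small broadening `eta` instead of the complex-energy contour mentioned above. All names and parameter values are illustrative.

```python
import math

def g00(z, e1=0.0, e2=1.0, t=0.5):
    """Diagonal element G_00(z) of (z - H)^-1 for the toy H = [[e1, t], [t, e2]]."""
    return (z - e2) / ((z - e1) * (z - e2) - t * t)

def local_dos(energy, eta=0.05):
    """Local density of states on site 0: -Im G_00(E + i*eta) / pi."""
    return -g00(complex(energy, eta)).imag / math.pi

def integrate(f, a, b, steps):
    """Composite trapezoidal rule on [a, b]."""
    h = (b - a) / steps
    return h * (0.5 * f(a) + sum(f(a + i * h) for i in range(1, steps)) + 0.5 * f(b))

# Integrating over (almost) the whole energy axis recovers one state per site.
total_weight = integrate(local_dos, -50.0, 50.0, 10_000)
```

In a real calculation the integral would stop at the Fermi energy and be deformed into the complex plane, where the integrand is much smoother.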

To achieve plane-wave-level accuracy, all operators are represented on a uniform Cartesian real-space grid using high-order finite-difference stencils, with derivatives up to 16th order supported. Linear scaling is achieved by truncating the Green’s function beyond a finite spatial radius, introducing a second key convergence parameter in addition to the grid spacing. The resulting computational cost exhibits strong sensitivity to both parameters, motivating ongoing work on optimizing the prefactor and identifying crossover points where the linear-scaling approach becomes more efficient than conventional cubic-scaling DFT.
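As a small illustration of why the stencil order matters, the sketch below compares the standard 2nd- and 4th-order central finite-difference stencils for a second derivative on a uniform grid. It is a one-dimensional toy with textbook coefficients, not the stencils or data layout used in AngstromCube; the function names are hypothetical.

```python
import math

# Standard central-difference coefficients for the second derivative.
STENCILS = {
    2: [1.0, -2.0, 1.0],
    4: [-1.0 / 12, 4.0 / 3, -5.0 / 2, 4.0 / 3, -1.0 / 12],
}

def second_derivative(f, x, h, order=4):
    """Approximate f''(x) with a central stencil of the given accuracy order."""
    coeffs = STENCILS[order]
    half = len(coeffs) // 2
    total = sum(c * f(x + (j - half) * h) for j, c in enumerate(coeffs))
    return total / (h * h)

def error(order, h, x=0.7):
    """Absolute error against the exact second derivative of sin, which is -sin."""
    return abs(second_derivative(math.sin, x, h, order) - (-math.sin(x)))
```

At the same grid spacing, the higher-order stencil gives a much smaller error, which is the trade-off that lets a real-space grid approach plane-wave-level accuracy without an extremely fine mesh.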

AngstromCube employs the Projector Augmented Wave (PAW) method for electron–ion interactions. A key performance feature is the sparse treatment of non-local projector functions, which are expanded in a factorizable basis of Hermite–Gauss polynomials and evaluated on the fly on each GPU. This minimizes memory bandwidth requirements and avoids storage of large non-local operator matrices.

The core computation is a GPU-accelerated implicit Hamiltonian operator applied iteratively using the transpose-free Quasi-Minimal Residual (tfQMR) method to solve for Green’s function elements. The code is implemented in CUDA-enabled C++ with templated kernels supporting mixed precision, complex arithmetic, and multiple spin formulations. MPI-level parallelization is achieved by distributing independent columns of the truncated Green’s function across MPI tasks with customised load balancing. Mixed-precision strategies are employed to reduce time to solution while maintaining numerical stability.

AngstromCube demonstrates that real-space Green’s function DFT can deliver high physical fidelity together with true linear scaling on modern GPU-accelerated HPC systems, enabling first-principles simulations of nanoelectronic structures and a wide range of other material classes at previously inaccessible scales.

The Kronos Project: Hybrid Discrete Event Simulations of Computing Infrastructure

Kevin A. Brown

Argonne National Laboratory (ANL), United States of America

Parallel discrete event simulation (PDES) is a modeling methodology of key importance for studying critical behaviors across many fields, including science enterprise design and provisioning, internet and cybersecurity simulations, and simulations for hardware co-design. Despite significant advances in both extreme-scale computing systems and PDES modeling frameworks to take advantage of these platforms, the simulation requirements and computational complexity of PDES hardware co-design models are growing at an intractable rate. Consequently, the timescales over which these hardware co-design models operate are limited to only a few seconds of simulated wall-clock time, making long-timescale PDES simulations of future, disruptive extreme-scale infrastructures out of reach for current PDES frameworks.

We design a hybrid modeling and simulation framework, Kronos, that integrates PDES with surrogate models and demonstrate its initial effectiveness in improving hardware co-design simulation performance. Our approach involves automatically changing the modeling methodology during the execution of the simulation, switching to PDES for periods characterized by complex behaviors and switching to other modeling methodologies (such as analytical and machine learning) for periods that can be easily predicted by fast surrogates. Our efforts in this project focus on four major research objectives: (i) building a scalable workload module for hybrid simulations; (ii) creating machine-learning-driven and analytic surrogate models for PDES network and workload models; (iii) enabling online transitions between PDES and surrogate models; and (iv) automatically transitioning between models.

Along with providing an overview of this project, this poster will focus on the design of a *director* for orchestrating the activities of both the PDES and surrogate models in a hybrid simulation. The director performs three key functions. First, it transparently chooses the type of model (i.e., PDES or surrogate) used for predicting each event. Second, it provides a bidirectional communication channel between the PDES model and the surrogate model for exchanging training data and predictions between both models. And third, it can support phase detection mechanisms to determine when the simulation should be in PDES mode or surrogate mode. We demonstrate the utility of our director design by integrating it within the Kronos hybrid modeling framework and using it to drive hybrid simulations of runtime of large HPC workloads on a dragonfly network. This presentation will cover challenges encountered and strategies used in deploying our director design. It will also discuss how our approach can be applied to other simulation tools and application areas such as the design of energy efficient microarchitecture and resilient integrated research infrastructure.

Design of Identity and Access Control for the Quantum–HPC Hybrid Platform

Tomoya Yuki¹, Shin'ichi Miura¹, Takashi Uchida¹, Yuki Nakano¹, Miwako Tsuji¹,², Yuetsu Kodama¹, Tamiya Onodera¹, Mitsuhisa Sato¹,³

¹RIKEN R-CCS; ²University of Tsukuba; ³Juntendo University

This work describes a unified identity and access design for the Quantum–HPC hybrid platform that integrates quantum computers and supercomputers. The platform adopts OAuth2.0-based access tokens to enable workflows to securely access multiple computing systems. Job submission to HPC resources is performed via Slurm REST interfaces under token-based authorization. The user management component enforces identity verification procedures, partially automated through digital credential mechanisms. Separately, user information is subject to screening processes aligned with export control and security compliance requirements. The design supports secure and practical hybrid computational environments.

AI-based Sub-Grid Scale Closure for Large Eddy Simulations in the Human Larynx

Prakhar Rathore, Xin Liu, Rakesh Sarma, Luis Cifuentes, Andreas Lintermann

Forschungszentrum Jülich GmbH, Germany

As we move toward the Exascale era, the computational cost of direct numerical simulations (DNSs) remains a primary bottleneck for high-fidelity fluid flow analyses. While large eddy simulations (LESs) offer a more affordable alternative, they inherently lack the sub-grid-scale (SGS) details necessary for precisely describing turbulence. To bridge this gap, this work presents an AI-driven framework leveraging Super-Resolution (SR) networks to reconstruct lost turbulent information from coarse-grained inputs. This framework exploits a computational fluid dynamics (CFD) dataset of a human larynx [1, 2], where complex, transitional fluid patterns are generated during exhalation. Understanding these dynamics is critical for medical research, as conditions such as asthma, pneumonia, and COVID-19 significantly alter exhalation through airway constriction, inflammation, mucus accumulation, or altered respiration patterns, thereby changing local geometry, flow rates, and turbulence characteristics. Moreover, advanced simulations, e.g., LES or DNS, are increasingly vital because they resolve the fine-grained details of the fluid mechanics, which are necessary to accurately understand patient-specific flow physics and develop corresponding treatments. Adapting SR techniques originally developed in computer vision, in the form of a Convolutional Defiltering Model (CDM) architecture [4, 5], makes it possible to directly exploit the growing availability of high-fidelity data to learn representations of unresolved physics. The CDM’s ability to capture both local and global features via skip connections makes it ideal for the SGS reconstruction of turbulent flows, even when training data is limited or noisy.

The implementation utilizes the open-source AI4HPC library [3], which is specifically optimized for deep learning on large-scale datasets, ensuring scalability in high-performance computing (HPC) environments. A two-step training strategy is employed in the CDM: (i) DNS velocity fields are processed using Gaussian filtering to generate LES-level downsampling and a corresponding low-resolution training dataset; (ii) a U-Net model is trained on this filtered data to reconstruct highly resolved velocity fields and SGS quantities. This work demonstrates the efficacy of the AI4HPC ecosystem in managing complex 3D CFD datasets and provides a scalable, AI-based alternative to traditional turbulence-closure problems. By successfully reconstructing fine flow structures, this framework not only advances turbulence modeling but also supports efficient inference in HPC applications, ultimately enabling more detailed medical diagnostics at a fraction of the cost of traditional DNS.
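Step (i) of the training strategy, filtering followed by downsampling, can be sketched as follows. This is a minimal one-dimensional illustration with periodic boundaries and hypothetical parameter values (`sigma`, `factor`); the actual workflow operates on 3D DNS velocity fields and its filter settings are not given in the abstract.

```python
import math

def gaussian_kernel(sigma, radius):
    """Normalised discrete Gaussian weights on [-radius, radius]."""
    weights = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    norm = sum(weights)
    return [w / norm for w in weights]

def filter_and_downsample(field, sigma=2.0, factor=4):
    """Gaussian-filter a periodic 1D field, then keep every `factor`-th sample."""
    radius = int(3 * sigma)
    kernel = gaussian_kernel(sigma, radius)
    n = len(field)
    filtered = [
        sum(k * field[(i + j - radius) % n] for j, k in enumerate(kernel))
        for i in range(n)
    ]
    return filtered[::factor]

# Toy "DNS" signal: one large-scale mode plus one small-scale mode.
n = 64
dns = [math.sin(2 * math.pi * i / n) + 0.3 * math.sin(2 * math.pi * 16 * i / n)
       for i in range(n)]
les = filter_and_downsample(dns)  # coarse input for the super-resolution model
```

The pairs (`les`, `dns`) play the role of low-resolution input and high-resolution target when training the reconstruction network.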

References

[1] S. Voss, C. Arens, G. Janiga. Flow, Turbulence and Combustion, 102, no. 1, (2019).

[2] Abdelsamie A., Voß S., Berg P., et al. Computers & Fluids, 255, 105819, (2023).

[3] https://jlesc.github.io/projects/dnn_cfd/

[4] Fukami K., Fukagata K., and Taira K. arXiv preprint arXiv:2301.10937, (2023).

[5] Sarma R., Inanc E., et al. Frontiers in High Performance Computing, 2, 1444337, (2024).

RIKEN-braket: massively parallel simulator of quantum computers

Naoki Yoshioka¹, Nobuyasu Ito¹, Doru Thom Popovici², Anastasiia Butko²

¹RIKEN Center for Computational Science; ²Lawrence Berkeley National Laboratory

A state-vector simulator of quantum computers has been developed for use on supercomputers. Our simulator, RIKEN-braket, removes the conventional limitations on the number of MPI processes and the size of the state-vector data array imposed by commonly used parallelization methods. We demonstrate that our simulator scales efficiently up to 46 qubits on the supercomputer Fugaku, using up to 55,296 computing nodes. We also describe recent enhancements to the simulator, including gate fusion and support for multiple circuits to enable simulations of variational algorithms.
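To give a sense of scale, the memory footprint of such a run can be estimated with a few lines of arithmetic. The sketch assumes 16 bytes per amplitude (double-precision complex), which the abstract does not state; the qubit and node counts are taken from the text above.

```python
qubits = 46
nodes = 55_296
bytes_per_amplitude = 16  # assumption: complex128, i.e. two 64-bit floats

total_bytes = (2 ** qubits) * bytes_per_amplitude  # full state vector
per_node_gib = total_bytes / nodes / 2 ** 30       # even split across nodes

# 55,296 is not a power of two, unlike the process counts that
# conventional bit-wise partitioning schemes typically require.
is_power_of_two = nodes & (nodes - 1) == 0
```

Under this assumption the state vector occupies 2^50 bytes (1 PiB), or roughly 19 GiB per node when spread evenly.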

Following aging in HPC systems

Alix Tremodeux, Guillaume Pallez, Erven Rohou

Univ Rennes, Inria, CNRS, IRISA, France

Today, supercomputers are renewed every 5 to 7 years. This pace no longer seems sustainable economically, ecologically, or socially. We are already seeing a slowdown in this replacement frequency.
The goal of this project is to understand how supercomputer components age, with the intent to model the aging process in the scope of a supercomputer. The hope is to then be able to detect and react to the aging process.
In this poster we discuss our methodology to study the aging process. The study focuses on individual nodes (CPUs and GPUs), including the memory contained within the node.
The study relies on two experiments performed a few times a year during the life of the HPC system:
- A quantitative analysis that studies specific performance through microbenchmarks (like stress-ng).
- A qualitative analysis with a more general benchmark that simulates the behavior of an application (like HPL) on selected nodes (chosen by their specificities, like their location in the physical system).

Adaptive-precision interatomic potentials (APIP)

David Immel¹, Ralf Drautz², Godehard Sutmann¹,²

¹Jülich Supercomputing Centre (JSC), Forschungszentrum Jülich, Germany; ²Interdisciplinary Centre for Advanced Materials Simulations (ICAMS), Ruhr Universität Bochum, Germany

Computer simulations of atomistic systems are a cornerstone of research in Physics, Chemistry, Biology and Materials Science and account for a large share of applications on the world's supercomputers and high-performance computing clusters. Large-scale atomistic simulations rely on interatomic potentials providing an efficient representation of atomic energies and forces. Although modern machine learning (ML) potentials provide a density-functional theory-like accuracy, they are rarely used in large-scale simulations as they remain orders of magnitude slower than simple empirical potentials. Often only a small number of atoms strongly influences the correct outcome of a simulation, while the majority of atoms forms a background for proper field propagation or mechanical properties on a coarser level. Examples include bond-breaking at a crack tip for simulations of crack propagation, defect nucleation sites in simulations of plastic deformation, or reaction centres in biology. An adaptive-precision interatomic potential (APIP) overcomes the performance gap between the precise ML and the fast empirical potentials for such simulations: An APIP combines two potentials of different accuracy and computational cost into a multi-resolution description with an optimum of performance and precision in large complex atomistic systems. The required precision is determined per atom by a local structure analysis and updated automatically during simulation. Our APIP package of the molecular-dynamics simulator LAMMPS makes APIPs available to the community. A load-balancer prevents problems due to the atom-dependent force-calculation times, which makes the package suitable for large-scale atomistic simulations. Copper and tungsten have been used as demonstrator materials for nanoindentation simulations with embedded atom models as fast potentials and atomic cluster expansions as precise ML potentials, but in principle a broader class of potential combinations could be implemented.
All ML-specific observations were reproduced by the APIPs, but with a significant speedup of 20 to 30 times compared to the pure ML-potential nanoindentation simulations. The APIP package supports conservative potentials, i.e., the system can be described by a momentum-conserving Hamiltonian. Alternatively, one can use performance-optimised APIPs that aim for the highest speedup possible and conserve energy and momentum due to local corrections. Depending on both the combined potentials and the atomistic system, APIPs can achieve a speedup of one or two orders of magnitude compared to a pure precise simulation. Thus, the coupling of fast and precise models is very promising and allows either simulations of much larger systems or reduced resource requirements at constant system size.
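The idea of combining two potentials per atom can be sketched with a switching parameter. The code below is a hypothetical illustration, not the LAMMPS APIP package: it assumes a per-atom mixing E_i = λ_i E_i(fast) + (1 - λ_i) E_i(precise), a smoothstep switching function with made-up thresholds, and a scalar structure descriptor per atom. The precise potential is evaluated only where λ_i < 1, which is where the speedup comes from.

```python
def switching(descriptor, lo=1.0, hi=2.0):
    """Return lambda in [0, 1]: 1 means fast potential only, 0 means precise only."""
    if descriptor <= lo:
        return 1.0
    if descriptor >= hi:
        return 0.0
    t = (descriptor - lo) / (hi - lo)
    return 1.0 - (3 * t * t - 2 * t ** 3)  # smoothstep between the thresholds

def adaptive_energy(descriptors, e_fast, e_precise):
    """Sum mixed per-atom energies and count evaluations of the precise potential."""
    total = 0.0
    precise_calls = 0
    for i, d in enumerate(descriptors):
        lam = switching(d)
        energy = 0.0
        if lam > 0.0:
            energy += lam * e_fast(i)
        if lam < 1.0:
            energy += (1.0 - lam) * e_precise(i)
            precise_calls += 1
        total += energy
    return total, precise_calls

# 98 bulk-like atoms, one atom in the transition zone, one defect-like atom.
descriptors = [0.0] * 98 + [1.5, 3.0]
total, precise_calls = adaptive_energy(descriptors, lambda i: -3.5, lambda i: -3.6)
```

In this toy system only 2 of 100 atoms trigger the expensive potential, which also shows why per-atom cost becomes uneven and a load balancer is needed.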

Enhancing Scientific Image Prediction and Compression through AI Model Fine-Tuning

Joanna Ayulia¹, Ke Cui², Amarjit Singh², Kento Sato²

¹Kanazawa University, Japan; ²RIKEN Center for Computational Science, Japan

Introduction and Objectives

Modern scientific facilities, such as the SPring-8 synchrotron radiation facility, generate an overwhelming volume of data, particularly through techniques like X-ray Computed Tomography (XCT). This rapid data growth presents significant hurdles for efficient storage and high-speed transmission, necessitating the development of advanced compression techniques. While AI-based methods like TEZip have been introduced, they are often limited to time-evolutionary data and suffer from reduced accuracy because they are trained on non-scientific, general-purpose datasets.

The primary objective of this study is to design an AI-based compression framework capable of handling both time-evolutionary and non-time-evolutionary scientific images. By fine-tuning models specifically on scientific data, we aim to enhance prediction accuracy and reconstruction quality, thereby improving the overall compression ratio.

Proposed Methodology and Pipeline

We propose a multi-stage AI-based scientific image compression pipeline:

AI Prediction Phase: The original high-resolution image is first downsampled. This downsampled image is then passed into an AI super-resolution model to produce a reconstructed image.
Delta Image Generation: The system computes the pixel-wise difference between the reconstructed and original images to produce a delta image.
Data Compression Phase: Both the downsampled original image and the delta image are losslessly compressed using the FFV1 video codec. This approach ensures that the original scientific data can be perfectly recovered while leveraging the AI model's predictive power.
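The three phases above can be sketched end to end. This is a minimal illustration under stated assumptions: nearest-neighbour upsampling stands in for the AI super-resolution model, and `zlib` stands in for the FFV1 codec; images are small nested lists of 16-bit values and all function names are hypothetical. The point is only that storing the downsampled image plus the delta image allows exact recovery.

```python
import struct
import zlib

def downsample(img, f=2):
    """Keep every f-th pixel in both directions."""
    return [row[::f] for row in img[::f]]

def predict(small, f=2):
    """Stand-in for the super-resolution model: nearest-neighbour upsampling."""
    return [[v for v in row for _ in range(f)] for row in small for _ in range(f)]

def pack(img, fmt):
    """Flatten, serialise little-endian, and losslessly compress (zlib, not FFV1)."""
    flat = [v for row in img for v in row]
    return zlib.compress(struct.pack(f"<{len(flat)}{fmt}", *flat))

def unpack(blob, fmt, width):
    raw = zlib.decompress(blob)
    count = len(raw) // struct.calcsize("<" + fmt)
    flat = struct.unpack(f"<{count}{fmt}", raw)
    return [list(flat[i:i + width]) for i in range(0, len(flat), width)]

def encode(img, f=2):
    small = downsample(img, f)
    pred = predict(small, f)
    delta = [[o - p for o, p in zip(ro, rp)] for ro, rp in zip(img, pred)]
    return pack(small, "H"), pack(delta, "i")  # uint16 image, int32 delta

def decode(small_blob, delta_blob, width, f=2):
    small = unpack(small_blob, "H", width // f)
    delta = unpack(delta_blob, "i", width)
    pred = predict(small, f)
    return [[p + d for p, d in zip(rp, rd)] for rp, rd in zip(pred, delta)]

image = [[(x * 37 + y * 101) % 65536 for x in range(8)] for y in range(8)]
small_blob, delta_blob = encode(image)
restored = decode(small_blob, delta_blob, width=8)
```

The better the predictor, the closer the delta image is to zero and the better it compresses, which is the motivation for fine-tuning the model on scientific data.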

Experimental Setup and Fine-Tuning

We use a pre-trained SwinIR, a transformer-based image restoration model, and fine-tune it using the XCT-2K dataset from SPring-8, which contains 904 16-bit grayscale images.

Training Details: 700 image pairs were used for training, with the remaining images reserved for testing.
Technical Parameters: The model was trained for 300 epochs on an NVIDIA A100 GPU using a batch size of 4, the Charbonnier loss function, and an Adam optimizer with a cosine-annealing learning rate.
Metrics: Accuracy was measured via PSNR, while compressibility was evaluated using Shannon entropy and compression ratio.
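For reference, the two image-level metrics can be computed as in the sketch below. It uses the standard definitions of PSNR and Shannon entropy on flat pixel sequences; the default peak value of 65535 is an assumption that matches 16-bit images, and the abstract does not specify how the authors computed either quantity.

```python
import math
from collections import Counter

def psnr(original, reconstructed, max_value=65535):
    """Peak signal-to-noise ratio in dB for flat pixel sequences."""
    mse = sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)
    return math.inf if mse == 0 else 10 * math.log10(max_value ** 2 / mse)

def shannon_entropy(values):
    """Entropy in bits per symbol of the empirical value distribution."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())
```

A lower entropy of the delta image means fewer bits per pixel are needed on average, so entropy serves as a codec-independent indicator of compressibility.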

Results and Performance Analysis

The experimental results demonstrate that fine-tuning on scientific data significantly outperforms pre-trained baseline models.

Reconstruction Accuracy: The PSNR increased from a baseline of 69.71 dB to 71.52 dB after 300 epochs, indicating that the AI model became better at predicting scientific features.
Compressibility: The entropy of the delta images dropped from 6.4142 to 6.1457, indicating that the delta images became more compressible.
Compression Ratios: The delta image compression ratio improved from 2.357x to 2.415x, pushing the overall compression ratio from 2.227x to 2.279x.

Conclusion

The study confirms that fine-tuning the SwinIR model with specific scientific image data effectively enhances both reconstruction accuracy and the final compression ratio. Future research will explore more efficient encoding methods for the delta image to further push the boundaries of scientific data compression.

JUNIQ Benchmark Suite: Tracking Progress in Quantum Technology Readiness

Ashwin Kumar Karnad, Dennis Willsch, Carlos Daniel Gonzalez Calaza, Jhon Alejandro Montanez-Barrera, Orkun Şensebat, Madita Willsch, Kristel Michielsen

Jülich Supercomputing Centre (JSC), Germany

As quantum computing hardware rapidly evolves, traditional metrics like gate fidelity and Quantum Volume are insufficient for predicting real-world utility. Researchers and industry users need objective, application-centric benchmarks to gauge when quantum processors will be ready for practical tasks. However, the lack of standardized infrastructure often leads to non-reproducible performance claims and makes it difficult to track progress across different hardware generations and technologies.
We present the JUNIQ Benchmark Suite (https://go.fzj.de/juniq-benchmark-suite-page), a large collection of benchmark problems and a vendor-agnostic, open-source initiative provided by an independent research lab, designed to track quantum technology readiness over decades. Unlike suites focused on synthetic benchmarks, the JUNIQ Benchmark Suite emphasizes application domains. Examples include computational biology and remote sensing (Quantum SVM), logistics (Tail Assignment, TSP), and hardware diagnostics (Effective Qubit Temperature). By benchmarking the same relevant problems on systems ranging from the earliest publicly available QPU devices (e.g. IBM Q Experience in 2016 and D-Wave 2000Q systems in 2017) to the latest state-of-the-art devices such as Advantage2 (2025), we provide a clear overview of hardware maturation.
A key innovation of our work is the standardized benchmarking workflow, a CLI-based toolchain along with CI pipelines, developed to ensure strict reproducibility. This setup facilitates the entire benchmark lifecycle (setup, generation, verification and execution) while aiding dependency management and enabling CI testing. This ensures that previously generated problems remain reproducible in the future, solving the "dependency drift" problem common in the fast-moving quantum software ecosystem.
The JUNIQ Benchmark Suite hosts benchmark instances from various domains and is actively tracking performance on the latest analog and digital quantum processors available through the JUNIQ cloud platform and beyond. To foster a transparent ecosystem, we include templates and tooling that lower the barrier for the community to contribute new problem instances.
Keywords: Quantum computing, Benchmarking, QPU, Quantum computing applications

ScaFaCoS 2.0: A Performance-Portable Coulomb Solver Library for Exascale Simulations

Rodrigo A. C. Bartolomeu, Rene Halver, Godehard Sutmann

Forschungszentrum Jülich, Germany

The accurate and scalable evaluation of long-range Coulomb interactions remains a central challenge in molecular dynamics, soft-matter physics and materials science, especially as simulations target exascale architectures that are characterised by complex memory hierarchies and heterogeneous computing resources. We present ScaFaCoS 2.0, an exascale-ready library that provides performance-portable implementations of the Particle–Particle Particle–Mesh (P3M) method and the Ewald summation method. These are designed to operate efficiently across diverse hardware platforms.

A primary goal of ScaFaCoS 2.0 is to achieve performance portability to a wider range of architectures without duplicating code. Implementation of electrostatic solvers and methods is based on the Kokkos programming model, enabling the library to target CPUs and GPUs from multiple vendors while maintaining high efficiency from a single source base. By delegating execution and memory management to Kokkos abstractions, ScaFaCoS 2.0 aligns with emerging exascale software ecosystems and ensures that simulation workflows are future-proofed against rapidly evolving hardware.

ScaFaCoS 2.0 also leverages interoperable, exascale-capable libraries to enhance its functionality. In particular, Cabana manages particle data layouts and enables vectorisation and communication-aware particle operations, whereas HeFFTe provides the scalable, architecture-aware FFT capabilities required by mesh-based solvers.

In addition to portability, the library is designed around a modular execution model that enables short-range and long-range electrostatic calculations to be separated into distinct computational partitions. On HPC systems that support concurrent or heterogeneous partitioning, ScaFaCoS 2.0 enables these components to run independently. This makes it possible to assign and tune computational resources, such as CPU cores, GPUs and memory, according to the performance characteristics of each method. This separation reflects the differing algorithmic demands of real-space and reciprocal-space solvers, enabling more efficient utilisation of machine architectures, improved load balancing and flexible deployment strategies tailored to a given system or simulation workflow.

We present initial benchmarks demonstrating ScaFaCoS 2.0's performance across multiple architectures while preserving numerical accuracy and scalability. The results highlight the viability of a single, portable implementation for long-range electrostatics at scale, and demonstrate how a modular design and ecosystem integration can reduce development complexity while sustaining numerical efficiency and high performance.

Composability from numerical algorithms to programming model

Julien Gaupp, Emmanuel Agullo, Christian Perez

Inria, France

With the continuous advancement of knowledge, new methods and algorithms are regularly proposed, particularly in linear algebra. Implementing them efficiently requires not only strong expertise in linear algebra but also advanced skills in high-performance computing (HPC). To simplify this process for non-HPC experts and to avoid duplication of effort, a higher-level programming model specialized for linear algebra algorithms would be highly valuable.
In this poster, we present our current approach to this challenge, centered on the Sequential Task Flow (STF) programming model. This model offers several advantages: it is relatively easy to use, as the user only needs to declare task dependencies, while still enabling high performance. However, it can be somewhat verbose in practice. On the other hand, C++ already provides more compact abstractions, such as std::future, although these are typically less performant.
Our goal is to combine the strengths of both approaches to deliver a compact, easy-to-use, and high-performance solution that minimizes required code modifications. This objective is achieved through our library, STF++.
We illustrate our approach with numerical linear algebra algorithms expressed on top of STF++, using the HPC StarPU runtime system underneath.
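How a sequential task flow turns a plain sequence of task submissions into a dependency graph can be sketched with a toy runtime. The code below is not STF++ or StarPU; it is a hypothetical minimal model in which each task declares the data handles it reads and writes, and dependencies are inferred from the submission order.

```python
class TaskFlow:
    """Toy STF runtime: infers task dependencies from declared data accesses."""

    def __init__(self):
        self.tasks = []         # (name, reads, writes) in submission order
        self.deps = {}          # task index -> set of task indices it must wait for
        self._last_writer = {}  # data handle -> index of the last writing task
        self._readers = {}      # data handle -> readers since the last write

    def submit(self, name, reads=(), writes=()):
        idx = len(self.tasks)
        deps = set()
        for d in reads:
            if d in self._last_writer:
                deps.add(self._last_writer[d])     # read-after-write
        for d in writes:
            if d in self._last_writer:
                deps.add(self._last_writer[d])     # write-after-write
            deps.update(self._readers.get(d, ()))  # write-after-read
        for d in reads:
            self._readers.setdefault(d, []).append(idx)
        for d in writes:
            self._last_writer[d] = idx
            self._readers[d] = []
        self.tasks.append((name, tuple(reads), tuple(writes)))
        self.deps[idx] = deps
        return idx

# First steps of a tiled factorisation, written as ordinary sequential code.
flow = TaskFlow()
potrf = flow.submit("potrf", reads=["A00"], writes=["A00"])
trsm1 = flow.submit("trsm", reads=["A00"], writes=["A10"])
trsm2 = flow.submit("trsm", reads=["A00"], writes=["A20"])
syrk = flow.submit("syrk", reads=["A10"], writes=["A11"])
```

The two `trsm` tasks depend only on `potrf`, so a runtime is free to execute them in parallel even though the code that submitted them is sequential.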

Modeling and reuse of data partitioning code in HPC

Alix Peigue, Christian Perez

Inria, France

Partitioning data is essential to exploit the power of massively parallel machines. However, data partitioning management is still often left to application developers, who thus need to combine domain science expertise with HPC expertise. Some programming and execution models manage data partitioning internally to abstract this aspect. However, this leads to duplication of the logic describing how data is partitioned between different models or libraries. This duplication especially causes problems when dealing with complex data types, such as unstructured meshes. This poster presents a model-based approach that aims at minimizing the workload of porting data partitioning management code across various models. It details an initial feasibility study of providing a single implementation of unstructured mesh partitioning management code across COMET as a programming model and StarPU as an execution model.

Development of a state vector quantum computer simulator for GPUs

Naoto Aoki

RIKEN Center for Computational Science, Japan

1 Quantum Computer Simulation

The state-vector method stores the full quantum state in memory and applies quantum gates as matrix–vector multiplications. In multi-GPU systems, the state vector is partitioned and distributed across devices. When gates act on high-order qubits, the corresponding operations span multiple partitions and require inter-GPU communication whose volume is comparable to the 2^n-sized state vector.
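A minimal sketch of the state-vector method, in plain Python on a single device, shows where the communication comes from. The function below applies a 2x2 gate to one qubit by updating pairs of amplitudes whose indices differ only in the target bit; the names are illustrative and no GPU partitioning is modelled.

```python
import math

def apply_single_qubit_gate(state, gate, target):
    """Apply a 2x2 gate to qubit `target` of a state vector of length 2**n, in place."""
    stride = 1 << target
    for base in range(len(state)):
        if base & stride:
            continue  # visit each amplitude pair once, via its target-bit-0 member
        a0, a1 = state[base], state[base | stride]
        state[base] = gate[0][0] * a0 + gate[0][1] * a1
        state[base | stride] = gate[1][0] * a0 + gate[1][1] * a1
    return state

r = 1 / math.sqrt(2)
H = [[r, r], [r, -r]]  # Hadamard gate

n = 3
state = [0j] * (1 << n)
state[0] = 1 + 0j  # start in |000>
for q in range(n):
    apply_single_qubit_gate(state, H, q)
```

The two amplitudes of a pair sit `2**target` entries apart, so for high-order target qubits the partner falls into another partition of a distributed state vector.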

Qubit Reordering (QR) [1] alleviates this issue by dynamically modifying the mapping between logical qubits and indices so that fewer gates lead to communication.

2 Coordination in QR and GPU Peer-to-Peer Memory Access

QR requires synchronized reordering across GPUs to avoid data conflicts, which introduces non-negligible overhead, especially on systems with high communication bandwidth.

In contrast, GPU Peer-to-Peer (P2P) Memory Access, such as NVLink, enables a GPU to access another GPU’s memory directly and asynchronously without CPU involvement. This capability makes it possible to avoid QR entirely and rely instead on high-bandwidth P2P access. Based on this observation, this work proposes an alternative simulation method that eliminates QR and its associated coordination cost.

3 Exploiting Peer-to-Peer Memory Access

Gate operations in state-vector simulation consist of numerous matrix–vector multiplications over partial state vectors. The assignment of these partial vectors to GPUs largely determines P2P communication cost and overall performance.

The proposed method partitions the global state vector into contiguous segments across GPUs and defines the indices of partial state vectors for k-qubit gates using bit patterns. From these patterns, each GPU can identify the required local data and remote data to be fetched via P2P. With an appropriate index-mapping design, the method achieves a simple communication structure while entirely avoiding QR-based dynamic qubit remapping.
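The bit-pattern view of contiguous partitioning can be illustrated as follows. This is a generic sketch, not the index-mapping design of the poster: it assumes a power-of-two number of GPUs, so that the top bits of an amplitude index identify the owning GPU, and it only handles single-qubit gates. Function names are hypothetical.

```python
def owner(index, n_qubits, n_gpus_log2):
    """GPU that stores amplitude `index` when the vector is split into contiguous blocks."""
    return index >> (n_qubits - n_gpus_log2)

def partner_gpu(gpu, target, n_qubits, n_gpus_log2):
    """GPU holding the partner amplitudes for a gate on `target` (or `gpu` if local)."""
    local_bits = n_qubits - n_gpus_log2
    if target < local_bits:
        return gpu  # both amplitudes of every pair live on the same GPU
    return gpu ^ (1 << (target - local_bits))  # flip the matching GPU-index bit
```

For example, with 5 qubits on 8 GPUs each GPU holds 4 amplitudes; gates on qubits 0 and 1 are purely local, while gates on qubits 2 to 4 pair each GPU with exactly one peer, from which the remote half of each pair could be read over P2P.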

4 Evaluation of the Proposed Method

The method was implemented in CUDA on a machine equipped with eight NVIDIA A100 GPUs connected by NVLink (1.2 TB/s bidirectional). As a benchmark, a circuit applying Hadamard gates to all qubits was used. The proposed P2P-based implementation achieved up to an 8% speedup compared with a conventional QR-based implementation, demonstrating the practical advantage of bypassing QR.

5 Conclusions

A QR-free state-vector simulation method for NVLink-connected GPUs with Peer-to-Peer Memory Access was presented. Evaluation on an eight-A100 system showed a speedup of up to 8% over a QR-based method, indicating that removing QR can be beneficial for quantum simulation on high-bandwidth GPU platforms.

ACKNOWLEDGMENTS

The author expresses sincere gratitude to Naoki Yoshioka and Nobuyasu Ito for their valuable guidance on fundamental concepts. This presentation is based on results obtained from a project, JPNP20017, commissioned by the New Energy and Industrial Technology Development Organization (NEDO).

REFERENCES

[1] K. De Raedt, K. Michielsen, H. De Raedt, B. Trieu, G. Arnold, M. Richter, Th. Lippert, H. Watanabe, and N. Ito. 2007. Massively parallel quantum computer simulator. Comput. Phys. Commun. 176, 2 (Jan, 2007), 121–136. DOI: https://doi.org/10.1016/j.cpc.2006.08.007

Improving cardinality constraints for the annealer using binary search and adders

Daniel Warkentin

Forschungszentrum Jülich, Germany

Annealers are analog quantum computers that can solve only one type of problem: the Ising problem. However, there exists a one‑to‑one mapping between the Ising formulation and the Quadratic Unconstrained Binary Optimization (QUBO) problem. Since many optimization problems of practical interest can be expressed as QUBOs, annealers become promising candidates for solving them.
A central challenge in mapping a problem to QUBO form is the encoding of constraints, because constraints cannot be directly incorporated into the QUBO or into the annealer. A careful choice of encoding is therefore crucial for the performance of the annealer.
One of the most common constraints is the cardinality constraint, which enforces that exactly k out of n binary variables take the value 1, with the remaining variables set to 0. The standard encoding introduces a penalty term by squaring the difference between the sum of the binary variables and k, multiplied by a penalty factor, and adding this term to the objective function. Expanding this square yields coefficients for single variables and for pairs of variables, which fits naturally into the QUBO structure. However, this encoding induces all‑to‑all connectivity: every problem variable becomes coupled to every other variable. The physical connectivity of current annealer hardware is far more limited, making this standard encoding poorly suited for such architectures.
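The standard penalty encoding described above can be checked with a short sketch. Using x_i^2 = x_i for binary variables, the squared term expands to a linear coefficient of λ(1 − 2k) per variable, a coupling of 2λ per pair, and a constant offset λk². The function names and the dictionary representation below are illustrative assumptions, not part of any annealer SDK.

```python
from itertools import combinations, product


def cardinality_qubo(n, k, penalty=1.0):
    """Expand penalty * (sum_i x_i - k)^2 into QUBO coefficients,
    using x_i * x_i == x_i for binary variables."""
    linear = {i: penalty * (1 - 2 * k) for i in range(n)}
    quadratic = {(i, j): 2 * penalty for i, j in combinations(range(n), 2)}
    offset = penalty * k * k
    return linear, quadratic, offset


def energy(x, linear, quadratic, offset):
    """Evaluate the QUBO objective for a 0/1 assignment x."""
    e = offset
    e += sum(c * x[i] for i, c in linear.items())
    e += sum(c * x[i] * x[j] for (i, j), c in quadratic.items())
    return e


n, k = 4, 2
linear, quadratic, offset = cardinality_qubo(n, k)

# The expansion reproduces the squared penalty for every assignment.
for x in product((0, 1), repeat=n):
    assert energy(x, linear, quadratic, offset) == (sum(x) - k) ** 2

# One coupling per pair of variables: all-to-all connectivity.
print(len(quadratic))  # 6
```

The number of couplings grows as n(n − 1)/2, which is the all-to-all connectivity that makes this encoding a poor fit for sparsely connected hardware.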
Vyskocil and Djidjev propose an alternative encoding that uses only nearest neighbour connections and is optimized for the specific hardware topology of the annealer. In the special case k=1, information propagates along a line. As the number of variables grows, the performance of this constraint degrades because information must travel increasingly long distances.
This is where our binary‑search‑based idea comes into play: it yields an encoding in which information can propagate in logarithmically many steps between any two points. For general k, the method of Vyskocil and Djidjev requires O(kn) variables to represent the constraint. Our adder‑based construction significantly reduces this overhead for large k, and in combination with the binary‑search structure, information still flows efficiently. The total number of auxiliary variables required (those used for the binary search plus those used for the adders) is between n for k=1 and 5n for k=n/2. For all cardinality constraints with k>n/2, the constraint can be reformulated by summing over the negated variables.

Lossy and Lossless Data Compression of Meteorological Data for Lagrangian Transport Modeling

Lars Hoffmann, Sabine Griessbach, Olaf Stein

Jülich Supercomputing Centre, Forschungszentrum Jülich, Germany

High-resolution meteorological data are essential for accurate Lagrangian transport simulations but pose significant challenges in terms of storage, data transfer, and computational efficiency. In this study, we investigated the potential of lossless and lossy data compression techniques to reduce the data footprint of meteorological fields used in the MPTRAC Lagrangian transport model, using meteorological input from the ECMWF ERA5 reanalysis. We implemented and evaluated several state-of-the-art compression methods, including Zstandard (with and without quantization), ZFP, and SZ, applied to key meteorological variables such as wind components, temperature, humidity, and cloud properties. For each method, we assessed compression ratio, compression and decompression speed, and the errors introduced in the reconstructed data. As a practical application, we examined the impact of lossy compression on trajectory calculations performed with MPTRAC. Atmospheric transport simulations are particularly sensitive to accumulated errors in horizontal wind and vertical velocity fields; therefore, we analyzed how compression-induced perturbations affected transport pathways, trajectory deviations, and tracer conservation along simulated trajectories. Our results demonstrated the trade-offs between data reduction, computational performance, and physical accuracy and identified compression strategies that achieved substantial data volume reductions while preserving the fidelity of transport simulations, highlighting the potential of controlled lossy compression as an efficient tool for large-scale atmospheric modeling and data-intensive applications.
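The "lossless compressor with quantization" idea can be sketched as follows. This is not the MPTRAC implementation: Python's standard-library zlib stands in for Zstandard, the synthetic temperature-like series is invented for illustration, and the function names are hypothetical. Uniform quantization with step size `step` bounds the absolute reconstruction error by `step / 2`, and the resulting integers are then passed to the lossless compressor.

```python
import math
import struct
import zlib


def compress_quantized(values, step):
    """Quantize to integer multiples of `step`, then compress losslessly.
    zlib is used here only as a stand-in for Zstandard."""
    q = [round(v / step) for v in values]
    raw = struct.pack(f"<{len(q)}i", *q)  # 32-bit signed integers
    return zlib.compress(raw, 9)


def decompress_quantized(blob, step):
    """Invert compress_quantized up to the quantization error."""
    raw = zlib.decompress(blob)
    q = struct.unpack(f"<{len(raw) // 4}i", raw)
    return [qi * step for qi in q]


# Synthetic, smooth temperature-like signal in kelvin (illustrative only).
field = [250.0 + 10.0 * math.sin(0.01 * i) for i in range(10000)]
step = 0.01

blob = compress_quantized(field, step)
restored = decompress_quantized(blob, step)

original_bytes = 8 * len(field)  # size as 64-bit floats
max_error = max(abs(a - b) for a, b in zip(field, restored))
print(f"compression ratio: {original_bytes / len(blob):.1f}")
print(f"max abs error: {max_error:.4f} (bound {step / 2})")
```

The step size is the knob that trades data volume against accuracy, which is the kind of trade-off the trajectory experiments evaluate.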

An ANARI-Based Pipeline from ParaView to Unreal Engine, Enabled by JuSync for In-Situ HPC Visualization

Thomas George¹, Jens Henrik Göbbert¹, Jonathan Windgassen¹, Thomas Odaker², Elisabeth Mayer², Victor Mateevitsi³

¹Forschungszentrum Jülich GmbH, Germany; ²Leibniz-Rechenzentrum, Germany; ³Argonne National Laboratory, USA

We present a visualization and data transmission pipeline that directly connects ParaView's scientific data‑processing capabilities with Unreal Engine's photorealistic, interactive rendering environment, enabling real‑time exploration of large‑scale simulation results without intermediate disk I/O. The pipeline is enabled by JuSync, a middleware component that facilitates the workflow. In our implementation, a point‑based simulation is generated on‑the‑fly using Catalyst's in‑situ processing across multiple compute nodes on the JURECA HPC system. The simulation data is ingested by ParaView on the JURECA nodes, processed using filters, and prepared as mesh data (glyphs). ParaView with ANARI‑SDK streams the processed data as USD‑ASCII (USDA) scene graphs generated by ANARI‑USD by individual ranks acting as workers, which connect to a broker process on the same JURECA HPC system. The broker forwards these USDA streams to Unreal Engine via the ANARI‑USD device integrated into JuSync. ZeroMQ (ZMQ) handles the broker‑worker communication on the HPC as well as the communication to the Unreal Engine instance.

Within Unreal Engine, JuSync parses the USDA streams, extracts mesh geometry and attributes, and converts them into Unreal‑compatible real-time mesh formats using multiple Realtime Mesh Component (RMC) clients to accommodate higher vertex counts. This enables efficient rendering of large datasets. The converted meshes are dynamically updated in the DaVinCo application—a VR‑enabled Unreal Engine client—allowing interactive exploration on desktop workstations, large‑scale displays, or VR headsets.

By leveraging USD as a common interchange format and the ANARI abstraction layer with ZMQ, the pipeline eliminates the traditional workflow of writing large data files to disk, transferring them, and loading them into a rendering engine. It decouples scientific data processing from rendering while preserving metadata and hierarchical structure. The middleware receives data from HPC via ZeroMQ, with the broker‑worker logic implemented in the ANARI‑USD component on the HPC side, ensuring robust operation on systems such as JURECA, while not being limited to this system.

We demonstrate the pipeline with a point‑based particle dynamics simulation generated in situ via Catalyst and visualized in real time on a VR setup. Initial tests show the pipeline can handle datasets with tens of millions of points, with glyphs (tested with arrow and sphere meshes) applied to them. The approach is simulation-independent: any simulation that can render USDA files within the ParaView ANARI‑SDK interface using ANARI‑USD will work with Unreal Engine, regardless of the underlying simulation code.

Overall, the ANARI‑based pipeline from ParaView to Unreal Engine, enabled by JuSync, provides a practical foundation for real‑time scientific visualization in immersive environments, bridging the gap between high‑performance computing and modern game‑engine rendering. By eliminating disk I/O and leveraging USD‑based streaming, the pipeline delivers high‑fidelity visual feedback, enabling scientists to interactively inspect evolving simulations across a variety of platforms.

Centralised Dashboard for Continuous Benchmarking: From HPC Clusters to Quantum Processors

Filipe Guimaraes, Thomas Breuer, Wolfgang Frings, Ashwin Kumar Karnad, Carlos Daniel Gonzalez Calaza

Forschungszentrum Juelich GmbH, Germany

Do you know if your system performance has dropped since the last update? For many administrators and developers, this is a surprisingly hard question to answer. Continuous benchmarks might run in the background, but if the results are buried in text logs, CI/CD artifacts, or scattered repositories, critical issues go unnoticed until it is too late.
We present a solution to this visibility gap: a centralised, automated dashboard built within LLview, an open-source reporting platform [1].
Our framework takes a novel approach by separating the execution of benchmarks from their visualisation. This allows you to run tests wherever you prefer—on HPC clusters, cloud runners, or even quantum hardware—while the dashboard automatically "pulls" the results into a unified view. By using simple but generic configuration files, you define exactly what to measure and how to plot it, without writing a single line of frontend code.
In this poster, we demonstrate how this flexible tool is used to track standard performance metrics on supercomputers and, notably, to monitor the stability of Quantum Processor Units (QPU). We show how the same framework allows operators to track qubit gate fidelity and error rates over time just as easily as memory bandwidth. Join us to see how you can turn scattered logs into interactive insights, empowering both operators to maintain system health and users to verify application stability.
[1] http://llview.fz-juelich.de

Performance Characterization of TSMP2 on Heterogeneous HPC Architectures

Ana González-Nicolás¹, Stefan Poll¹, Jörg Benke¹, Paul Kenneth Rigor¹, Daniel Caviedes-Voullième¹,²

¹Forschungszentrum Jülich GmbH, Germany; ²Technical University of Dresden, Germany

Extreme-scale heterogeneous computing systems are becoming central to coupled Earth system simulations, where strongly interacting component models impose performance and scalability requirements. TSMP2, the second-generation Terrestrial Systems Modelling Platform, uses a Multiple Program Multiple Data (MPMD) coupling strategy to target modern heterogeneous nodes composed of multi-core CPUs and accelerators. In this contribution, we present a systematic performance characterization of TSMP2 across heterogeneous hardware hierarchies, including intra-chip, intra-node, and inter-node configurations. Experiments are conducted on current leadership-class platforms, including JURECA-DC and the exascale JUPITER system. To isolate architectural and coupling effects, we use a controlled idealized benchmark configuration with uniform soil properties. We analyze resource allocation strategies across heterogeneous deployments and identify architectural configurations that optimize efficiency. The results show trade-offs in the performance of coupled MPMD workflows on emerging HPC architectures and highlight challenges in executing multi-physics models at extreme scale. Our findings provide practical guidance for optimizing coupled simulation frameworks and offer insights that are applicable to other heterogeneous HPC applications.

JUBE: An Environment for systematic benchmarking and scientific workflows

Thomas Breuer, Filipe Guimaraes, Jan Oliver Mirus, Pit Steinbach, Wolfgang Frings

Forschungszentrum Juelich GmbH, Germany

A key aspect of developing research software is testing the installation and the expected results on various configurations, as well as benchmarking the performance, preferably continuously. This applies especially to software that targets high-performance computing (HPC) installations around Europe and the world. For these applications, performance, scalability, and efficiency are key metrics that need to be monitored and compared among systems. Due to the complexity of these technical installations, individual scripts written for a specific system lack portability, reusability and reproducibility.

These challenges were addressed by the development of the Jülich Benchmarking Environment (JUBE) [1] at the Jülich Supercomputing Centre (JSC). JUBE is a generic and lightweight framework that automates the systematic execution, monitoring, and analysis of applications. It is free, open-source software [2] implemented in Python that operates on a "declarative configuration" paradigm, where experiments are defined in human-readable YAML/XML files, automating script generation, job submission, and result analysis. Due to its standardized configuration format, it simplifies collaboration on research software and improves its usability. JUBE integrates seamlessly with CI/CD pipelines, enabling automated regression testing, performance tracking, and benchmarking as part of HPC software development workflows.

The entry barrier of JUBE is relatively low as it builds upon basic knowledge of the Linux shell and either XML or YAML, and extensive documentation including tutorials and advanced examples is available [2]. Offering a high degree of flexibility, JUBE may be used in every phase of the HPC software development pipeline. Example use cases comprise standard benchmarks to track a project's development in terms of performance, or systematic studies to explore parameter combinations, including orchestrating scaling experiments, which has already been shown to streamline the application process for HPC compute resources [3]. JUBE has been previously used to successfully automate a large variety of scientific codes and standard HPC benchmarks, with configurations available open-source [4]. The software can be easily installed, with existing configurations also available for the software managers EasyBuild [5] and Spack [6]. Further projects have been built on top of JUBE [7,8].
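The notion of a systematic study over parameter combinations can be illustrated with a minimal sketch. This is neither JUBE's configuration syntax nor its API; it only shows the underlying idea of expanding a declared parameter space into one concrete run per combination. The helper names, the parameter names, and the `./app` executable are hypothetical.

```python
from itertools import product


def expand_parameters(parameter_space):
    """Return one dict per combination of the declared parameter values."""
    names = list(parameter_space)
    value_lists = [parameter_space[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


def render_command(template, params):
    """Substitute one parameter combination into a command template."""
    return template.format(**params)


# Declared parameter space for a small scaling experiment (illustrative).
space = {"nodes": [1, 2, 4], "tasks_per_node": [24, 48]}
template = "srun -N {nodes} --ntasks-per-node={tasks_per_node} ./app"

runs = expand_parameters(space)
commands = [render_command(template, p) for p in runs]

for command in commands:
    print(command)
```

Keeping the parameter space as data, separate from the command template, is what makes such a study easy to rerun unchanged on another system.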

In conclusion, JUBE is well-established software that has already been used in several national and international projects and on numerous and diverse HPC systems [9-16]. Given its broad scope and range of applications, JUBE is likely to be of interest to those working in the HPC software sector.

This poster will provide an overview of JUBE, covering its core principles and presenting illustrative use cases to demonstrate JUBE's practical applications:
- benchmarking as part of the procurement of JUPITER, Europe’s first exascale supercomputer;
- a complex scientific workflow for energy system modelling [16];
- continuous insight into HPC system health by regular execution of applications, and the subsequent graphical presentation of their results.

[1] https://apps.fz-juelich.de/jsc/jube/docu/index.html
[2] https://github.com/FZJ-JSC/JUBE
[3] https://www.fz-juelich.de/en/jsc/jupiter/jureap
[4] https://github.com/FZJ-JSC/jubench
[5] EasyBuild: https://github.com/easybuilders/easybuild-easyconfigs/tree/develop/easybuild/easyconfigs/j/JUBE
[6] Spack: https://packages.spack.io/package.html?name=jube
[7] https://github.com/edf-hpc/unclebench
[8] https://dl.acm.org/doi/10.1145/3733723.3733740
[9] MAX CoE: https://max-centre.eu/impact-outcomes/key-achievements/benchmarking-and-profiling/
[10] RISC2: https://risc2-project.eu/?p=2251
[11] EoCoE: https://www.eocoe.eu/technical-challenges/programming-models/
[12] DEEP: https://deep-projects.eu/modular-supercomputing/software/benchmarking-and-tools/
[13] DEEP-EST: https://cordis.europa.eu/project/id/754304/reporting
[14] IO-SEA: https://cordis.europa.eu/project/id/955811/results
[15] EPICURE: https://epicure-hpc.eu/wp-content/uploads/2025/07/EPICURE-BEST-PRACTICE-GUIDE-Power-measurements-in-EuroHPC-machines_v1.0.pdf
[16] UNSEEN: https://juser.fz-juelich.de/record/1007796/files/UNSEEN_ISC_2023_Poster.pdf

Building multi-scale modeling tool for optimizing brain stimulation in Alzheimer's disease

Han Lu¹, Thorsten Hater¹, Mario Ibañez Bolado², Marvin Kaster³, Juliette Courson⁴, Fabian Czappa³, José Luis Bosque Orero², Borja Perez Pavon², Felix Wolf³, Thanos Manos⁴, Sandra Diaz¹

¹Forschungszentrum Jülich, Germany; ²University Cantabria; ³Technical University Darmstadt; ⁴CY Cergy Paris University

Non-invasive brain stimulation offers a promising alternative for treating neurological disorders like Alzheimer’s disease or depressive disorder. Yet optimization for individual clinical cases remains hindered by inefficient trial-and-error methods, and the role of individual brain areas in shaping whole-brain dynamics remains poorly understood given the complex inter-area connectivity. To address this, we leverage Arbor, an HPC-optimized library, to transition from empirical testing to high-fidelity digital twins by co-simulating it with other simulators. Our framework achieves exascale-ready neural modeling by integrating single-neuron biophysics with whole-brain dynamics. Key technical milestones include the implementation of scalable spike transmission, the integration of structural plasticity within the simulation kernel, and the development of a multi-scale co-simulation bridge between Arbor and The Virtual Brain (TVB). By combining morphologically detailed biophysical neurons with large-scale connectivity, this project aims to establish a scalable computational platform for investigating the long-term effects of stimulation parameters on complex neural architectures.

A Case Study on Hybrid Quantum-Classical Workflow Modeling

Simon Renard¹, Mar Tejedor², Rosa Badia², Marc Baboulin³, Gabriel Antoniu¹, Silvina Caino-Lores¹

¹Inria Rennes, France; ²Barcelona Supercomputing Center, Spain; ³Inria Saclay, France

Quantum Computing (QC) is increasingly integrated into High Performance Computing (HPC) environments, and is more and more associated with hybrid applications where classical processing stages orchestrate quantum execution. In practice, these applications can be seen as complex workflows where classical pre-processing stages decompose a quantum algorithm into collections of heterogeneous tasks, and additional classical tasks coordinate the hybrid execution, with non-trivial data dependencies and execution constraints amongst them. These workflows must then be mapped onto available computing resources, which may include local HPC nodes for QPU emulation, remote cloud quantum devices, or HPC systems coupled to on-premises quantum hardware.

However, existing quantum software development kits enable circuit execution, but do not provide workflow-level abstractions compatible with HPC systems, while existing HPC workflow managers support task-based execution, yet lack mechanisms to represent quantum tasks and their infrastructure-dependent behavior. Existing approaches for quantum task management in HPC environments are provided by vendors and industry stakeholders, often embedded via ad-hoc or proprietary code that limits portability, interoperability and transparency. Addressing the limited availability of open and flexible hybrid task management solutions requires workflow-oriented representations and characterization methods that enable transparent performance modeling, cross-layer telemetry acquisition, and infrastructure-aware task placement.

To address this gap, we investigate workflow-oriented methods for hybrid QC-HPC applications, focusing on task-level workflow modeling and characterization (e.g. compute intensity, memory footprint, and communication patterns), and on how such information can support task placement and resource allocation on heterogeneous infrastructures. In particular, we study methods to capture relevant telemetry and metadata across software and system layers, including execution times, data volumes, transfer costs, and infrastructure-dependent performance factors. Our objective is to develop workflow abstractions and characterization methods that enable monitoring and systematic evaluation of hybrid QC-HPC workloads.

We ground this work in a use case developed at Barcelona Supercomputing Center (BSC), where circuit cutting techniques are used to transform large quantum circuits into workflow graphs composed of multiple interdependent, but overall more manageable tasks. This workflow structure raises practical challenges in terms of task decomposition, resource allocation, and performance evaluation when executed on QC-HPC systems.

This poster focuses on the fundamental question of how quantum algorithms can be represented as task-based workflows. We present an overview of existing workflow abstractions and discuss their applicability to quantum algorithm decomposition. We illustrate this through a case study based on circuit cutting workflows developed at BSC. This use case serves as an initial application for examining how such quantum algorithms can be represented within existing workflow models. It also allows us to assess the availability of application and system telemetry required for future studies on task placement and execution strategies. This study will provide a first step towards enabling infrastructure-aware workflow execution of hybrid QC-HPC applications.

Automated code generation of advanced plasticity rules for the SpiNNaker neuromorphic platform using NESTML

Charl Linssen¹,³, Pooja Babu¹,³, Andreas Rogalski³, Bernhard Rumpe³, Abigail Morrison²,³

¹Simulation and Data Laboratory Neuroscience, Jülich Supercomputing Centre, Institute for Advanced Simulation, Jülich-Aachen Research Alliance, Forschungszentrum Jülich GmbH; ²Institute for Advanced Simulation IAS-6, Computational and Systems Neuroscience, Forschungszentrum Jülich GmbH; ³Software Engineering, RWTH Aachen University, Germany

Current neuromorphic and HPC workflows suffer from a usability gap, where mapping complex biological plasticity models to specialized hardware requires manual, error-prone, low-level coding. NESTML is a domain-specific modeling language that allows researchers in computational neuroscience to specify models of neurons and synapses in a precise and accessible way. These models can subsequently be used in dynamical simulations of spiking neural networks on various simulation platforms, such as NEST Simulator [3] or SpiNNaker [4]. This is achieved by means of platform-specific code generated by the NESTML toolchain. The generated code extends a simulation platform with new neuron models and synaptic plasticity rules that can then be instantiated in a network of any size. Combining a user-friendly modeling language with automated code generation makes large-scale neural network simulation accessible to neuroscience researchers without requiring any prior training in computer science [2].

In this work, we establish an extension of the NESTML code generation toolchain that adds support for simulation of advanced synaptic plasticity rules on the SpiNNaker neuromorphic hardware platform [4]. We demonstrate our approach with code generation for a spike-timing dependent plasticity (STDP) synapse model in a simple network. The dynamics of the network are solved using exact integration [5]. The necessary numerical integration routines are automatically generated by the subsidiary toolchain ODE-toolbox [6]. We validate the simulation results by comparing them to those obtained from NEST Simulator running on a standard CPU as a reference. The proposed solution is generic, so that, for instance, the triplet STDP rule [7] can be expressed directly in the DSL without requiring any further changes to the toolchain.
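As an illustration of what exact integration means for a linear system, consider a scalar leaky membrane dV/dt = (V_inf − V)/τ. This is a toy stand-in, not code generated by NESTML or ODE-toolbox, and the parameter values are arbitrary. Because the equation is linear, one step of size h can be taken with the propagator exp(−h/τ), which reproduces the analytic solution up to floating-point rounding, whereas a forward Euler step accumulates discretization error.

```python
import math


def exact_step(v, h, tau, v_inf):
    """Advance V by one step h using the exact propagator exp(-h / tau)."""
    propagator = math.exp(-h / tau)
    return v_inf + (v - v_inf) * propagator


def euler_step(v, h, tau, v_inf):
    """Advance V by one step h using forward Euler, for comparison."""
    return v + h * (v_inf - v) / tau


def integrate(step, v0, h, n_steps, tau, v_inf):
    """Apply `step` repeatedly, starting from v0."""
    v = v0
    for _ in range(n_steps):
        v = step(v, h, tau, v_inf)
    return v


# Arbitrary illustrative parameters (mV and ms).
tau, v_inf, v0 = 10.0, -55.0, -70.0
h, n_steps = 1.0, 50

analytic = v_inf + (v0 - v_inf) * math.exp(-h * n_steps / tau)
exact = integrate(exact_step, v0, h, n_steps, tau, v_inf)
euler = integrate(euler_step, v0, h, n_steps, tau, v_inf)

print(f"exact propagator error: {abs(exact - analytic):.2e}")
print(f"forward Euler error:    {abs(euler - analytic):.2e}")
```

The propagator depends only on h and τ, so it can be computed once before the simulation loop, which is why this scheme is attractive for generated code with a fixed time step.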

NESTML’s features make models easier to write and maintain, easier to discover, share and reuse, as well as interoperable between platforms. Adding simulation support for the SpiNNaker platform not only bolsters reproducibility in neuroscience by allowing results to be compared between platforms, but also accelerates research and development for applications including edge computing and neuromorphic HPC. This project establishes the foundation for precise benchmarking of performance, memory efficiency, and power consumption. Additionally, support for neuromodulation will enable reinforcement learning as a training method for the network, opening up a wide range of applications. These will further benefit from the upcoming SpiNNaker-2 system installation at the Jülich Supercomputing Centre, making NESTML a key enabler in the transition towards modular supercomputing.

[1] https://github.com/nest/nestml/

[2] Blundell et al., Frontiers in Neuroinformatics 12, 2018

[3] Gewaltig & Diesmann, Scholarpedia 2(4), 2007

[4] Furber et al., Proceedings of the IEEE 102(5), 2014

[5] Rotter and Diesmann, Biological Cybernetics 81, 1999

[6] https://ode-toolbox.readthedocs.org/

[7] Gjorgjieva, Clopath, Audet & Pfister, Proc. Natl. Acad. Sci. U.S.A. 108(48), 19383-19388, 2011

Modeling and simulating spiking neurons and synaptic plasticity with NESTML on HPC

Pooja Babu¹,², Charl Linssen¹,², Abigail Morrison¹,²

¹Forschungszentrum Jülich, Germany; ²Software Engineering, RWTH Aachen University, Germany

NESTML is a domain-specific modeling language for spiking neuronal networks incorporating synaptic plasticity [1]. It has been designed over the last 10 years to support researchers in computational neuroscience by allowing them to specify models of neurons and synapses in a precise and intuitive way. These models can subsequently be used in dynamical simulations of spiking neural networks (for potentially very large network sizes), by means of high-performance simulation code generated by the NESTML toolchain. The code extends a simulation platform (such as NEST Simulator [2], NEST GPU [3], or SpiNNaker [4]) with new and easy-to-specify neuron and synapse models, formulated in NESTML. Combining a user-friendly modeling language with automated code generation makes large-scale neural network simulation accessible to neuroscience researchers without requiring any training in computer science [5].

NESTML features a concise yet expressive syntax, inspired by Python. There is direct language support for (spike) events, differential equations, convolutions, stochasticity, and arbitrary algorithms using imperative programming concepts, in addition to flexible event management using handler functions and prioritization. These features make models easier to write and maintain and make models in general more findable, accessible, interoperable, and reusable (‘FAIR’ principles).

Models specified in the NESTML syntax are processed by an open-source toolchain that generates fast code for a given target simulator platform. Here, we demonstrate the code generation approach of NESTML for the NEST simulator, along with the performance and memory benchmarks for large-scale simulations. We achieve this by running a balanced random network of adaptive exponential (AdEx) integrate-and-fire neurons with Spike-Timing Dependent Plasticity (STDP) synapses. The networks are simulated on a high-performance computing (HPC) cluster. We perform strong scaling and weak scaling experiments and assess the performance of the network with NESTML-generated models, as compared to the NEST built-in models.

We compare the following combinations of neuron and synapse models: (i) NEST Simulator built-in neuron model + NEST Simulator built-in synapse model, (ii) NESTML neuron model + NEST Simulator built-in synapse model, (iii) NESTML neuron model + NESTML synapse model. We show that the NESTML-generated models perform nearly as well as the handwritten models, with a small reduction in performance (5% - 6%) and a higher memory footprint (30%), specifically for combination (iii), which can be attributed to the generic and model-agnostic process of code generation. We believe that this slight loss in performance is more than compensated for by the significant time savings achieved in writing and verifying the numerics of new models, as the use of NESTML allows the modeling process to be carried out in an agile and incremental manner, further speeding up the entire model development cycle. In the future, we will focus on optimizations and improvements in the toolchain leading to performance gains that put the generated code on par with or even above the NEST built-in models.

[1] https://nestml.readthedocs.io/
[2] Gewaltig & Diesmann, Scholarpedia 2(4), 2007
[3] Golosio et al., Frontiers in Computational Neuroscience, 2021
[4] Furber et al., Proceedings of the IEEE 102(5), 2014
[5] Blundell et al., Frontiers in Neuroinformatics 12, 2018

Plant your virtual trees! A distributed GPU-Native Octree for the FMM

Ioannis Lilikakis, Arijus Lengvenis, Ivo Kabadshow, Holger Dachsel

Forschungszentrum Jülich, Germany

We present an NVShmem-distributed octree for hierarchical domain decompositions using uniformly resolved grids of multiple depths. Storing parent nodes yields a multi-layer octree that is particularly well suited for applications requiring scalable, multiply-resolved decomposition on heterogeneous GPU systems. We store application-required data contiguously in different resolutions. The implementation is optimized for modern high-performance GPU computing environments. It is fully GPU-aware and supports distributed computing via SHMEM-based one-sided communication for efficient halo exchange. This enables scalable performance across multiple GPUs and compute nodes with minimal communication overhead. Our library is implemented as a header-only C++20 and CUDA template framework. It is modular by design, allowing user-defined data storage and different space-filling curves (e.g., Hilbert, Morton, or striped variants). The index datatype is also customizable, ensuring portability across both low- and high-bit architectures as well as arbitrary tree depth.

From JUWELS to JUPITER: Scaling ICON Toward Kilometer-Scale Earth System Simulation and Implications for a German Earth System Model

Sabine Griessbach, Manoel Römmer, Lars Hoffmann, Catrin Meyer

Forschungszentrum Jülich GmbH, Germany

The development of next-generation Earth system models is tightly coupled to advances in extreme-scale computing. In this contribution, we present performance analyses of the German weather forecast and climate model ICON (ICOsahedral Nonhydrostatic model) across heterogeneous leadership systems, namely JUWELS Cluster, JUWELS Booster, and the exascale system JUPITER. We present strong- and weak-scaling results on CPU and GPU architectures, identifying memory constraints and the impact of heterogeneous physics on load balance and energy efficiency.

Pushing toward kilometer-scale global climate simulations and large ensembles fundamentally shifts application requirements: higher resolution increases communication intensity, stresses memory bandwidth, and amplifies I/O demands. These constraints directly shape optimization strategies and portability decisions across systems.

The natESM initiative coordinates the development of a German Earth system model with ICON-atmosphere as its core component. While ICON provides atmosphere, ocean, and land modules, GPU porting is currently most advanced for the atmospheric component. We give an overview of the current Earth system modelling landscape in Germany, with emphasis on modularization, coupling strategies, and heterogeneous porting status as enabling factors for future fully coupled, high-resolution simulations.

Using ICON and natESM as concrete application drivers, we connect scientific ambition with architectural realities and illustrate how Earth system modeling both drives and exploits advances in extreme-scale HPC architectures.

Controlling the temperature of computing sites by injecting useful, non-invasive tasks

Kouds Halitim¹, Thomas Collignon³,⁴, Raphaël Bleuse¹, Sophie Cerf¹, Bogdan Robu², Eric Rutten¹, Lionel Seinturier³, Alexandre van Kempen⁴

¹Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LIG, Grenoble, France; ²Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, Grenoble, France; ³Univ. Lille, Inria, CNRS, Centrale Lille, CRIStAL, Lille, France; ⁴Qarnot Computing, Montrouge, France

Efficient cooling is a critical bottleneck for next-generation High-Performance Computing (HPC) and large-scale data centers, directly impacting both peak performance and environmental sustainability. Air cooling has long been the preferred method for cooling computing servers, but as power consumption at computing sites increases, this technique has shown limitations in efficiency and environmental impact. Water cooling addresses these issues by cooling more efficiently and by allowing reuse of the heat produced by servers, which until now has been dissipated into the atmosphere. Some computing resource providers maintain a geographically distributed infrastructure that allows the heat produced by servers to be fed into heating networks (or other sites requiring a heat source). A data center with such a heat reuse system requires good management of computing loads and water temperature in order to maintain service and heating guarantees. Complying with temperature constraints at the outlet of the cooling circuit is essential but complicated by the variable and unknown nature of server usage. Thus, to ensure a sufficiently high temperature, it is customary to run synthetic loads (e.g., CPUburn, mining) on unused servers in order to increase resource utilization and the system's outlet temperature. However, this approach wastes energy and diverts computing capacity to unproductive calculations. In this work, we exploit the variability of the computational load as a lever for thermal action: when additional heat input is required, useful tasks are injected onto selected servers in order to increase their utilization. We present the implementation of such a non-intrusive task injection system on a real infrastructure. We identify categories of useful tasks that can be used for injection (e.g., log file compression, software compilation, unit testing) and describe how they are executed by a dynamically controllable resource provisioning system added in parallel to the existing system. In addition, a feedback loop uses the measured outlet water temperature to continuously adjust the load of injected tasks, thereby compensating for user-induced fluctuations. Using a model of energy consumption and the thermal dynamics of the outlet water, we design a two-level controller: an upper-level module calculates target workload profiles that comply with outlet temperature constraints, while a lower-level controller tracks temperature targets by regulating the size of injected task batches for better management of system dynamics and disturbances.
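
The feedback idea can be illustrated with a toy simulation. The sketch below is not the authors' controller: it assumes a simple proportional-integral (PI) law for the lower level, a first-order thermal model with made-up coefficients, and utilization expressed as a fraction between 0 and 1. All names and constants are hypothetical.

```python
def simulate(user_load, target_c=60.0, kp=0.05, ki=0.01):
    """Track an outlet temperature target by adjusting injected load.

    user_load: per-step fraction of capacity used by regular users (0..1).
    Returns a list of (outlet_temperature_c, injected_fraction) per step.
    """
    temp_c = 45.0      # assumed initial outlet temperature
    integral = 0.0     # accumulated temperature error
    trace = []
    for load in user_load:
        error = target_c - temp_c
        integral += error
        # PI law, clipped to the capacity left over by users
        injected = min(max(kp * error + ki * integral, 0.0), 1.0 - load)
        # Toy thermal model (assumption): the outlet relaxes toward a
        # steady state that grows linearly with total utilization.
        steady_c = 40.0 + 30.0 * (load + injected)
        temp_c += 0.2 * (steady_c - temp_c)
        trace.append((temp_c, injected))
    return trace


# User load jumps from 30% to 50% halfway through the run.
profile = [0.3] * 100 + [0.5] * 100
trace = simulate(profile)
```

In this toy setting the injected fraction settles near 0.37 while users occupy 30% of capacity and drops to about 0.17 once user load rises to 50%, so the outlet temperature returns to the target after the disturbance.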

GPUs are Fast, You May be Not

Arijus Lengvenis, Ivo Kabadshow, Holger Dachsel, Ioannis Lilikakis

Forschungszentrum Jülich, Germany

Scientific libraries for molecular dynamics must operate at sub-millisecond latency to be practically useful, requiring efficient strong scaling across many GPUs. At scale, even moderately sized systems reduce to only thousands or hundreds of particles per device, and extracting parallelism from these small work sets demands a detailed understanding of the underlying algorithm. We present two case studies from our CUDA optimisation of the Fast Multipole Method, each illustrating how identifying the right atomic work unit transforms GPU utilisation. In the far-field phase, we initially assigned one thread per box-box tensor contraction, which starved the GPU at upper octree levels where few boxes exist. By decomposing each contraction into p² fully independent operations, each assignable to a separate thread with no communication overhead, we recovered utilisation at coarse levels and gained an additional dimension for tuning occupancy across all levels, yielding roughly 15 to 20% faster wall-time M2L performance on a 7000-particle sodium chloride system. In the near-field phase, parallelising over particles is inherently limited and load-imbalanced, as the number of pairwise interactions can scale as N² while work is distributed over only N threads. Parallelising naively over interactions instead introduces memory inefficiency and many redundant costly atomic force updates. We resolve this by tiling the interaction space into warp-sized blocks, enabling efficient local processing while asymptotically parallelising over N² interactions, and opening a natural path to tensor core acceleration in future work. This alone reduces near-field runtime by nearly 50% on the same system compared to either approach.
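
The near-field tiling idea can be sketched on the CPU. The Python code below is an illustration only, not the CUDA implementation described above: it assumes unit charges with a bare 1/r² pair force, uses a tile edge of 32 to mimic a warp, and all function names are hypothetical. Each tile is an independent work unit that accumulates forces locally and writes them back once, which is where a GPU version would issue a single atomic update per row instead of one per interaction.

```python
import random

WARP = 32  # tile edge, chosen to mirror the CUDA warp size


def pair_force(pi, pj):
    """Force on particle i from particle j (unit charges, no cutoff)."""
    d = [a - b for a, b in zip(pi, pj)]
    r2 = sum(c * c for c in d)
    inv_r3 = r2 ** -1.5
    return [c * inv_r3 for c in d]


def forces_direct(pos):
    """Reference: one accumulator per particle, looping over all partners."""
    n = len(pos)
    out = [[0.0] * 3 for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                f = pair_force(pos[i], pos[j])
                for k in range(3):
                    out[i][k] += f[k]
    return out


def forces_tiled(pos, tile=WARP):
    """Same forces, computed tile by tile over the N x N interaction space."""
    n = len(pos)
    out = [[0.0] * 3 for _ in range(n)]
    n_tiles = 0
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            n_tiles += 1
            rows = min(tile, n - i0)
            local = [[0.0] * 3 for _ in range(rows)]  # per-tile scratch
            for i in range(i0, i0 + rows):
                for j in range(j0, min(j0 + tile, n)):
                    if i != j:
                        f = pair_force(pos[i], pos[j])
                        for k in range(3):
                            local[i - i0][k] += f[k]
            # One write-back per row per tile (an atomic add on a GPU).
            for li in range(rows):
                for k in range(3):
                    out[i0 + li][k] += local[li][k]
    return out, n_tiles
```

For 70 particles this yields a 3 x 3 grid of tiles, so the number of independent work units grows with N² rather than N.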

FMM-azing Adventures Beyond PME in Molecular Dynamics Simulations

Ivo Kabadshow, Arijus Lengvenis, Ioannis Lilikakis, Holger Dachsel

Forschungszentrum Jülich, Germany

Molecular dynamics (MD) simulations have relied heavily on high-performance computing (HPC) resources for decades. For long-range Coulomb interactions—the de facto standard in MD—the Particle-Mesh Ewald (PME) method remains dominant. With its near-optimal O(N log N) runtime complexity, PME enables efficient time steps on the millisecond scale and is widely supported across CPU and GPU architectures. However, its O(p²) communication overhead, stemming from internal Fast Fourier Transform (FFT) operations, poses a significant bottleneck.

In this work, we explore a viable alternative: the Fast Multipole Method (FMM), a long-range solver with linear runtime and communication complexity, making it a compelling candidate to overcome current performance limitations. Historically, energy conservation errors in tree-based FMM approaches have constrained its adoption. Here, we present recent advances that mitigate these challenges, including:

Reduced energy drift through refined numerical formulations,
Improved numerical stability at lower precision levels, and
Lowered accuracy complexity without sacrificing fidelity.

We benchmark FMM against PME-based simulations and discuss scenarios where FMM may emerge as the superior choice, particularly in large-scale or communication-bound systems.

Performance Evaluation and Optimization of an MPS-Based CFD Solver on GPUs

Junya Onishi¹, Ayato Takii², Sangwon Kim¹, Younghwa Cho³, Makoto Tsubokura¹,²

¹RIKEN Center for Computational Science, Japan; ²Kobe University, Japan; ³Hokkaido University, Japan

The growing disparity between computational throughput and memory bandwidth has become a major bottleneck in large-scale computational fluid dynamics (CFD). As grid resolution increases, the volume of data movement often dominates runtime, motivating the exploration of alternative data representations that can reduce memory traffic. In this work, we investigate Matrix Product States (MPS), a tensor-network representation originally developed in quantum physics, as a compressed representation for CFD variables and evaluate its performance on modern GPUs.

We implement a three-dimensional incompressible Navier–Stokes solver in which all flow variables are stored and updated entirely in MPS form. The solver is discretized using a finite-volume method on structured grids with a fractional-step time integration scheme. To analyze performance characteristics, we focus on the core MPS operations required for the solver, particularly the MatVec operation with matrix product operators that encode CFD stencil operations. GPU-oriented optimizations, including loop-order tuning, tiling strategies, and bond-dimension–aware execution policies, are systematically investigated using Kokkos-based implementations.

Our results demonstrate that MPS-based representations can substantially reduce memory requirements while enabling large-scale simulations, including a 1024³ problem on a single GPU. Performance profiling reveals that runtime is dominated by bandwidth-sensitive tensor contractions and that kernel efficiency strongly depends on loop ordering and tiling parameters. Appropriate tuning significantly improves GPU utilization and mitigates bottlenecks caused by irregular bond dimensions.
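
To give a feel for the memory argument, the following Python sketch counts stored entries for an MPS with a capped bond dimension. The encoding is an assumption, since the abstract does not state one: a 1024³ field is treated as 2³⁰ values mapped to 30 sites of physical dimension 2, and the cap of 64 is an arbitrary example value. The function names are hypothetical.

```python
def mps_bond_dims(n_sites: int, phys_dim: int, max_bond: int) -> list:
    """Bond dimensions of an MPS whose bonds are capped at max_bond."""
    dims = [1]  # open boundary on the left
    for k in range(1, n_sites):
        exact = min(phys_dim ** k, phys_dim ** (n_sites - k))
        dims.append(min(exact, max_bond))
    dims.append(1)  # open boundary on the right
    return dims


def mps_num_entries(n_sites: int, phys_dim: int, max_bond: int) -> int:
    """Total number of stored tensor entries across all MPS cores."""
    dims = mps_bond_dims(n_sites, phys_dim, max_bond)
    return sum(dims[k] * phys_dim * dims[k + 1] for k in range(n_sites))


full_entries = 1024 ** 3                     # dense grid, one value per cell
mps_entries = mps_num_entries(30, 2, 64)     # assumed encoding and cap
compression_ratio = full_entries / mps_entries
```

Under these assumptions the MPS stores 158,376 entries instead of roughly 1.07 billion. The achievable bond dimension depends on the flow field, so the ratio for a real simulation may differ considerably.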

These findings highlight that MPS-based CFD should be treated not only as a compression technique but also as a performance engineering problem. The study provides practical optimization strategies and a performance-oriented perspective for advancing tensor-network–based CFD solvers on emerging GPU architectures.

Lessons Learned from Power-Saving Operations on Fugaku: Operational Analysis and Benchmark Validation

Masaaki Terai¹, Eiji Nagata², Yoshitaka Furutani², Fumichika Sueyasu², Shin'ichi Miura¹

¹RIKEN R-CCS, Japan; ²Fujitsu Ltd.

The A64FX chip in Fugaku provides power control mechanisms known as power knobs, including adjustable CPU frequency at 2.0/2.2 GHz, eco mode that disables one of two FP pipelines, and core retention. By combining these features, we defined boost-eco mode, which operates at 2.2 GHz with a single FP pipeline, and adopted it as the system-wide default in March 2025.

In a previous report, we described the phased implementation of these power-saving features from 2021 to 2025, achieving a 29.4% reduction in average node power consumption. Through clustering analysis of production job records, we found that boost-eco mode benefits the majority of workloads, approximately 77%, but compute-intensive jobs, 19.6% of the total, show degradation in elapsed time and Energy Delay Product (EDP).

The degradation of compute-intensive jobs is an expected consequence of disabling one FP pipeline, but our previous analysis relied solely on operational statistics. To supplement this with baseline data at the microarchitectural level, we conducted controlled benchmark experiments using DGEMM and STREAM with Fujitsu's CPU performance analysis profiler. For DGEMM, disabling one of two FP pipelines causes 16.3% performance loss. The 16.2% power reduction is entirely offset by longer execution time, resulting in virtually no energy savings and 19.5% EDP degradation. For STREAM, FP pipelines are largely idle regardless of mode, so disabling one has minimal impact at 3.7% performance loss. The power reduction translates into 7.9% energy savings and 4.4% EDP improvement.
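
The reported figures can be cross-checked with a few lines of arithmetic. The sketch below assumes that "performance loss" means a drop in throughput, so elapsed time grows by 1 / (1 - loss), that energy is average power times elapsed time, and that EDP is energy times elapsed time. These conventions are not spelled out in the abstract, and the function names are illustrative.

```python
def relative_time(perf_loss: float) -> float:
    """Elapsed time relative to baseline for a given throughput loss."""
    return 1.0 / (1.0 - perf_loss)


def relative_energy(perf_loss: float, power_reduction: float) -> float:
    """Energy relative to baseline: relative power times relative time."""
    return (1.0 - power_reduction) * relative_time(perf_loss)


def relative_edp(rel_energy: float, rel_time: float) -> float:
    """Energy Delay Product relative to baseline."""
    return rel_energy * rel_time


# DGEMM: 16.3% performance loss, 16.2% power reduction (from the text)
dgemm_time = relative_time(0.163)
dgemm_energy = relative_energy(0.163, 0.162)
dgemm_edp = relative_edp(dgemm_energy, dgemm_time)

# STREAM: 3.7% performance loss, 7.9% energy savings (from the text)
stream_time = relative_time(0.037)
stream_energy = 1.0 - 0.079
stream_edp = relative_edp(stream_energy, stream_time)
```

Under these assumptions the DGEMM energy ratio stays within about 0.1% of the baseline and its EDP is about 19.6% worse, which agrees with the reported values up to rounding. For STREAM the EDP ratio is about 0.956, in line with the reported 4.4% improvement.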

These results provide a straightforward microarchitectural explanation for the patterns observed in four years of operational data. For workloads bottlenecked by FP computation, eco mode removes the resource they need; for memory-bound workloads, it removes a resource they are barely utilizing, yielding a net energy benefit.

These findings point toward a workload-aware approach for future operations, where eco mode could be selectively applied based on job characteristics identified at submission time, improving overall system energy efficiency without penalizing compute-intensive applications. As future work, we plan to extend our evaluation using the NAS Parallel Benchmarks (NPB) and representative mini-applications to characterize eco mode behavior across a broader range of computational patterns, including mixed compute/memory workloads and communication-intensive codes.

Practical FHE Inference with Parameter Study on GPU

Shuxin Zheng, Thomas Spendlhofer, Hugo Sanz-González, Antonio J. Peña

Barcelona Supercomputing Center, Spain

Fully homomorphic encryption (FHE) enables neural network inference on encrypted data, but CKKS-based approaches remain computationally expensive and challenging to deploy, with existing GPU implementations often focusing on micro-benchmarks or single-operation performance rather than end-to-end behavior of realistic CNN models under practical cryptographic constraints. We present a fully automated compilation and execution pipeline for privacy-preserving CNN inference on GPUs without manual cryptographic engineering. Our approach leverages the ORION framework to compile standard PyTorch models into CKKS-executable computation graphs, which we then execute on an enhanced GPU backend (HEonGPU) extended with unified-memory management and additional homomorphic operators. This end-to-end automation enables systematic evaluation of multiple standard CNN architectures—including ResNet-20, ResNet-34, and variants with both ReLU and SiLU activations—under realistic security levels (128-bit and 192-bit). Through systematic exploration of CKKS parameter configurations on ResNet-20, we identify settings that balance accuracy, latency, and security guarantees, and demonstrate that these configurations transfer effectively to deeper models. Our GPU implementation achieves up to 30× speedup over CPU baselines while maintaining cleartext-level accuracy. We further provide detailed performance characterization through operation-level runtime decomposition and layer-wise precision analysis, revealing where numerical errors accumulate and quantifying the precision requirements of encrypted CNNs via controlled bootstrapping experiments. Our results demonstrate that moderate CKKS configurations, coupled with system-level GPU optimizations, suffice for practical encrypted inference across a range of CNN architectures, bridging the gap between cryptographic theory and deployable privacy-preserving machine learning systems.

DPA-CCL: Offloading Collective Communications to Data Path Accelerators

Muhammad Usman, Mariano Benito, Sergio Iserte, Antonio Peña

Barcelona Supercomputing Center, Spain

Recent SmartNICs include programmable hardware such as FPGAs, CPU cores, or an array of packet-processing cores.

NVIDIA ConnectX devices have one such array of packet-processing cores.

An intuitive application for these cores is offloading collective communication algorithms to them.

Performance of Quantum Autoencoders in Time-Series Anomaly Detection

Lourens van Niekerk

Forschungszentrum Jülich, Germany

Anomaly detection plays a critical role across diverse domains, including medical risk assessment, environmental monitoring, and financial fraud detection. These applications commonly rely on time-series data, where subtle deviations can carry significant meaning. Despite the demonstrated effectiveness of autoencoders in identifying such anomalies, a rigorous and systematic benchmark for quantum autoencoders in time-series contexts is still lacking. We therefore present a comprehensive benchmarking study that evaluates multiple quantum autoencoder models against state-of-the-art classical models, demonstrating their performance and potential advantages in time-series anomaly detection tasks.

Malleability in hybrid quantum-classical programs: adaptive classical resource allocation across iterative workflows

Íñigo Aréjula Aísa, Sergio Iserte, Petter Sandås, Antonio J. Peña

Barcelona Supercomputing Center, Spain

Dynamic resource management (DMR) offers a promising path to converge High-Performance Computing (HPC) and Quantum Computing (QC) by enabling hybrid applications to adapt their resource usage at runtime. In this work, DMR is integrated with malleable MPI applications to dynamically resize the set of allocated classical resources according to the current phase of a hybrid HPC-QC workflow. During classical phases, the application can expand to exploit multiple nodes, while in quantum phases it shrinks, releasing unused classical resources while waiting for quantum execution. This phase-aware adaptation reduces idle time on HPC nodes and improves overall system utilization in scenarios where quantum resources are scarce and accessed as accelerators. The proposed approach targets transparent integration with existing batch schedulers and MPI codes, paving the way for more efficient execution of hybrid workloads and making HPC-QC convergence practical from the resource management perspective.
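
The utilization argument can be illustrated with a back-of-the-envelope model. The sketch below is not part of DMR or any scheduler API; the phase durations and node counts are made-up example values, the function names are hypothetical, and reconfiguration overhead is ignored.

```python
def static_node_seconds(phases) -> int:
    """Node-seconds held when the peak allocation is kept for the whole run."""
    peak = max(nodes for _, _, nodes in phases)
    total_time = sum(seconds for _, seconds, _ in phases)
    return peak * total_time


def malleable_node_seconds(phases) -> int:
    """Node-seconds held when the job resizes to each phase's needs."""
    return sum(seconds * nodes for _, seconds, nodes in phases)


# (phase kind, duration in seconds, classical nodes needed), example values
workflow = [
    ("classical", 120, 8),
    ("quantum", 300, 1),
    ("classical", 120, 8),
    ("quantum", 300, 1),
]

held_static = static_node_seconds(workflow)
held_malleable = malleable_node_seconds(workflow)
released = held_static - held_malleable
```

In this example the static job holds 6720 node-seconds while the malleable one holds 2520, so 4200 node-seconds become available to other jobs during the quantum phases.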

18th JLESC Workshop

19–21 May 2026

Jülich, Germany