I. INTRODUCTION
Molecular dynamics (MD) simulations are essential tools for modeling the physical
behavior of materials at the atomic and molecular scale. In particular, the semiconductor
device industry depends on accurate predictions of atomic-level interactions, as the
devices are aggressively scaled to deep submicrometer technologies. MD simulations
play a critical role in understanding phenomena such as thin film growth [2], secondary ion mass spectrometry (SIMS) [3], and the examination of the physical properties of nanotechnological devices [4], contributing to the development of new materials and device designs.
One of the key components for accurate modeling of these atomic-level interactions
is to use appropriate interatomic potentials, such as the Tersoff potential [5]. The Tersoff potential is a widely used empirical potential specifically designed
for simulating covalently bonded materials including silicon and carbon, which are
fundamental components of semiconductor devices. It incorporates bond order and angular dependence,
enabling realistic simulation of bond formation and breaking, as well as capturing
mechanical and thermal properties of materials [6]. By utilizing the Tersoff potential, MD simulations can more precisely predict the
behavior of semiconductor materials under various conditions, which is critical for
device design and analysis.
To perform MD simulations, the forces between multiple atoms must be computed. These
forces are derived from the gradients of interatomic potentials, thereby effectively
capturing the interactions between atoms. Once the forces are determined, the evolution
of the system over time is computed by solving Newton's equations of motion. One of
the popular approaches for the motion equation in the MD simulation is the Verlet
[7] algorithm due to its simplicity and numerical stability, and it efficiently integrates
the equations of motion by updating atomic positions and velocities based on the calculated
forces. This allows for precise modeling of atomic trajectories, enabling simulations
to capture the dynamic behavior of materials at the atomic level accurately.
However, MD simulations are computationally intensive, requiring large-scale data
processing, which leads to a fundamental issue of performance degradation. LAMMPS experiences performance degradation in message passing interface (MPI) [8]-based parallel computation due to the overhead of inter-chip communication. For example,
when using 2 MPI tasks with 256K atoms, the forward communication data for each process
involves 22,282 atoms. Under the same conditions, increasing the number of atoms to
512K results in 46,066 atoms involved in forward communication per process. If this
data is transferred via PCIe, which has a bandwidth of 8 GB/s as shown in Fig. 2(a), the communication time is 16,711.5 ns for 256K atoms and 34,549.5 ns for 512K atoms. As the volume of simulation data increases, these communication overheads grow, increasing the overall computation time and energy consumption and thus reducing the efficiency of large-scale simulations.
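For reference, these latency figures are consistent with the assumption (ours, for illustration) that each communicated atom carries three FP16 coordinates, i.e., 6 bytes:
$$\frac{22{,}282 \times 6\ \text{B}}{8\ \text{GB/s}} = 16{,}711.5\ \text{ns}, \qquad \frac{46{,}066 \times 6\ \text{B}}{8\ \text{GB/s}} = 34{,}549.5\ \text{ns}.$$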
To address this problem, various software packages for parallelization, such as OPENMP and KOKKOS, have been proposed [1]. OPENMP supports parallel processing through multithreading on shared-memory systems,
and KOKKOS provides flexible parallelization strategies for efficient computation
across different hardware platforms. While these approaches are effective at improving
computational performance, they have limitations in addressing the communication overhead of multi-core and multi-chip data transfers. In this study, we propose a compressed molecular dynamics (CMD) architecture to reduce communication overhead in large-scale MD simulations, thereby improving system-level energy efficiency. Our contributions are as follows.
1) We analyze the data communication patterns of the floating-point (FP) number representation, specifically the sign and exponent fields.
2) For the most frequent patterns, we use a compressed data format with reduced bit-width for the sign and exponent fields.
3) Based on these methods, we introduce a CMD architecture that minimizes energy consumption.
4) Finally, to the best of our knowledge, our approach is the first to reduce the amount of data communicated in fast, parallel MD computation.
II. PRELIMINARIES
1. Verlet Algorithm
MD simulations rely on computational methods that numerically solve Newton's equations
of motion to predict the trajectories of atoms over time. Accurate force calculations
are essential in these simulations, as they determine how atoms interact within a
system.
The Verlet algorithm is one of the most widely used time integration methods in MD
simulations, monitoring the dynamic behavior of the system by updating the positions
and velocities of atoms. In this algorithm, the new position of each atom is calculated
based on its previous and current positions, as well as its acceleration. To obtain this acceleration, the forces acting on the atom must be accurately calculated, which requires considering its interactions with neighboring atoms.
The core of the force calculation is to collect the list of neighboring atoms around each atom and then evaluate their interactions. To do this, a cutoff radius is used, which serves as a distance threshold that determines which atoms are considered for interactions. Each atom identifies its neighbors within the cutoff radius based on its current position and then calculates the forces using the distance and directional information between them.
Through this procedure, atoms interact in the following ways:
1) Neighbor list generation: Each atom must identify its neighboring atoms within
the cutoff radius based on its position. To improve the indexing performance, spatial
decomposition techniques (e.g., cell lists or grid methods) are used to reduce the
search range.
2) Position information exchange: In a parallel computing environment, atoms may be
distributed across different processors or nodes. If a neighboring atom within the
cutoff radius is included in a different processor, communication between processors
is necessary to retrieve the position information of that atom.
3) Force calculation and update: After receiving the positions of the neighboring
atoms, each atom calculates the forces resulting from their interactions. The calculated
forces determine the acceleration of the atom, which is then used to update its new
position and velocity. In Fig. 1, the labels 0, 1, 2, 3, and 4 indicate the MPI process ranks. The blue and red regions
represent the cut-off distance areas for each process. Fig. 1 illustrates the communication of atomic information, referred to as "ghost" atoms
in LAMMPS, between Process 0 and neighboring Processes 1 to 4, and the computation
of forces resulting from the interactions between atoms. A ghost atom is not a physical
atom, but a copied atom used to handle boundary conditions, playing a crucial role
in accurately calculating interactions at the boundary region. Process 0 sends the
position or force information of atoms within the cut-off distance to Process 2. Subsequently,
Process 0 receives atomic information within the cut-off distance from Process 2.
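For illustration, a minimal C++/MPI sketch of this forward (ghost-atom) exchange between two neighboring processes is given below. It assumes a one-dimensional domain split and uses illustrative names (forward_comm, Atom); it is a conceptual sketch of the communication pattern, not the LAMMPS implementation.

// Conceptual sketch of forward (ghost-atom) communication with one neighbor
// rank, assuming a one-dimensional domain split along x.
#include <cmath>
#include <mpi.h>
#include <vector>

struct Atom { double x, y, z; };

void forward_comm(const std::vector<Atom>& local, double boundary_x,
                  double cutoff, int neighbor_rank, MPI_Comm comm,
                  std::vector<Atom>& ghosts) {
    // Pack positions of atoms lying within the cutoff distance of the boundary.
    std::vector<double> send;
    for (const Atom& a : local)
        if (std::fabs(a.x - boundary_x) < cutoff) {
            send.push_back(a.x); send.push_back(a.y); send.push_back(a.z);
        }

    // Exchange the element counts, then the packed positions.
    int send_n = static_cast<int>(send.size()), recv_n = 0;
    MPI_Sendrecv(&send_n, 1, MPI_INT, neighbor_rank, 0,
                 &recv_n, 1, MPI_INT, neighbor_rank, 0, comm, MPI_STATUS_IGNORE);

    std::vector<double> recv(recv_n);
    MPI_Sendrecv(send.data(), send_n, MPI_DOUBLE, neighbor_rank, 1,
                 recv.data(), recv_n, MPI_DOUBLE, neighbor_rank, 1,
                 comm, MPI_STATUS_IGNORE);

    // Unpack the received positions as ghost atoms for boundary force evaluation.
    for (int i = 0; i + 2 < recv_n; i += 3)
        ghosts.push_back({recv[i], recv[i + 1], recv[i + 2]});
}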
Fig. 1. Overview of MPI communication for parallel MD computation.
Fig. 2. Dataflow of (a) baseline off-chip communication and (b) our FP data compression
approach. MD chip: Conventional molecular dynamics simulation chip, CMD chip: Compressed
molecular dynamics simulation chip.
2. Tersoff Potential
The forces involved in this calculation are generally derived from interatomic potentials,
which are mathematical functions that describe the potential energy between atoms
based on their positions. Among these potentials, the Tersoff potential [5] is particularly significant for modeling covalently bonded materials such as silicon
and carbon, which are fundamental to semiconductor devices. The total energy E of
the system is calculated as follows [9]:
$$E = \frac{1}{2}\sum_{i}\sum_{j \neq i} f_C(r_{ij})\left[\,f_R(r_{ij}) + b_{ij}\,f_A(r_{ij})\right].$$
In this equation, $r_{ij}$ represents the distance between particle pair $(i,j)$.
The term $f_R(r_{ij})$ is the repulsive function, and $f_A(r_{ij})$ is the attractive
function, which is multiplied by the bond order parameter $b_{ij}$. The total energy
is computed as the summation of the above terms, where the result is further multiplied
by the cutoff function $f_c(r_{ij})$, ensuring that interactions beyond a certain
distance are neglected.
The repulsive function represents the short-range repulsive forces between atoms.
It is defined as
$$f_R(r_{ij}) = A\,e^{-\lambda_1 r_{ij}},$$
where $A$ is a two-body interaction parameter representing the strength of the repulsive
interaction, and ${\lambda }_1$ is a decay constant that determines how quickly the
repulsive force decreases with distance. The attractive function represents the attractive
forces driving bond formation between atoms. It is given by
$$f_A(r_{ij}) = -B\,e^{-\lambda_2 r_{ij}},$$
where $B$ is a two-body interaction parameter that determines the strength of the
attractive interaction, and ${\lambda }_2$ is a decay constant for the attractive
potential.
The cutoff function smoothly reduces interactions to zero beyond a certain distance, since explicitly evaluating all long-range pairs would make the calculation intractable in MD simulations [9]. It ensures that only nearby atoms within a specified cutoff radius $R$ contribute significantly to the total energy.
The cutoff function is defined as
$$f_C(r_{ij}) = \begin{cases} 1, & r_{ij} < R - D, \\ \dfrac{1}{2} - \dfrac{1}{2}\sin\!\left(\dfrac{\pi}{2}\,\dfrac{r_{ij}-R}{D}\right), & R - D \leq r_{ij} \leq R + D, \\ 0, & r_{ij} > R + D, \end{cases}$$
where $D$ is a smoothing parameter that ensures continuity of the potential and its
first derivative at the cutoff boundary.
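For concreteness, a short C++ sketch of these two-body terms is given below, using the parameter names defined above (the struct and function names are illustrative, and actual parameter values for silicon or carbon would be taken from [5]):

// Sketch of the two-body Tersoff terms f_R, f_A and the smooth cutoff f_C.
#include <cmath>

struct TersoffParams { double A, B, lambda1, lambda2, R, D; };

const double kPi = 3.14159265358979323846;

double f_R(double r, const TersoffParams& p) { return  p.A * std::exp(-p.lambda1 * r); }
double f_A(double r, const TersoffParams& p) { return -p.B * std::exp(-p.lambda2 * r); }

double f_C(double r, const TersoffParams& p) {
    if (r < p.R - p.D) return 1.0;   // fully counted
    if (r > p.R + p.D) return 0.0;   // neglected beyond the cutoff
    return 0.5 - 0.5 * std::sin(0.5 * kPi * (r - p.R) / p.D);  // smooth transition
}

// Pairwise energy contribution for a given bond-order parameter b_ij.
double pair_energy(double r, double b_ij, const TersoffParams& p) {
    return f_C(r, p) * (f_R(r, p) + b_ij * f_A(r, p));
}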
III. PROPOSED COMPRESSED MOLECULAR DYNAMICS (CMD) ARCHITECTURE
In the previous section, we observed that it is necessary to exchange positional information
of atoms with different processors or nodes during Verlet communication in a parallel
computing environment. In MPI-based parallelized multi-chip computation on large-scale simulation data, the energy consumption of inter-chip communication is significant. To mitigate this problem, we propose a CMD architecture in this section. The key idea behind our work is to count the sign and exponent cases of the communicated atoms and encode the most frequent cases into a 1-bit code.
1. Motivation: Expensive Board-to-Board Communication
Fig. 2 shows the comparison of board-to-board communication between the baseline and our
work. For simplicity, we assume MPI-based parallelization with two
processors, referred to as board 0 and board 1. For inter-chip communication, board
0 reads the position information of atoms from DRAM of board 0 and transmits this
data to board 1 via the data bus (PCI-Express). Simultaneously, board 1 reads the
position information of atoms from DRAM of board 1 and sends it to board 0 over the
same data bus (PCI-Express). Upon receiving the position data, board 0 writes the
data from board 1 into DRAM of board 0, while board 1 similarly writes the data received
from board 0 into the DRAM of board 1. This process involves a massive movement of positional information across the DRAM interfaces and the data bus.
As shown in Fig. 2(a), the bandwidth of the DRAM is much larger than the bandwidth of PCIe, resulting in
a bottleneck at the PCIe interface. Additionally, the energy consumption per bit for
PCIe is higher than that of DRAM, making the board-to-board communication particularly
critical in terms of energy efficiency.
To address this issue, as shown in Fig. 2(b), we reduce the data width of the atom information sent to the other processor to
reduce energy consumption. The sign and exponent cases of the read data from the DRAM
of board 0 are counted, and the most frequently used case is compressed into a 1-bit
representation. The compressed data is then transmitted to board 1 via the data bus,
and board 1 decompresses the received data and writes it to the DRAM of board 1. Board
1 performs the same operations in parallel with board 0 (The method will be explained
in the following section).
2. Case Count
Fig. 3 illustrates the process of case counting with an example. In step 1 of the case count
process, when atom data in a conventional floating-point (FP) format is input, the
sign and exponent of the input are used to analyze the frequency of the sign and exponent
values. If the sign and exponent fields match an entry in the case count SRAM, the count value at the corresponding address is increased by 1. If an incremented count (count $+ 1$) that does not belong to the top entries (top1/top2) exceeds the current second-highest count value (top2), the top2 entry is replaced with the new value (count $+ 1$). Similarly, if top2 exceeds the current highest count value (top1), top2 is promoted to top1 and the previous top1 becomes top2. This process is repeated for all atoms involved in communication.
Once the counts for all atoms have been completed, as shown in step 2 of Fig. 3, a lookup table is generated, containing the count values for the frequent sign and
exponent cases of the input. The two cases with the highest count values are then
passed to the compressor. For the Verlet algorithm, atomic positions and velocities
are updated using the Initial_integrate() function. In this step, the sign and exponent
cases of the position values are counted. Therefore, no additional iteration over
the data is required for counting, and the associated latency overhead is negligible.
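For illustration, a software analogue of the case-count stage is sketched below. It assumes FP16 inputs whose upper 6 bits (sign and exponent) index a 64-entry counter, mirroring the case_count SRAM; the names are illustrative, not the hardware implementation:

// Sketch of the case-count stage: count sign+exponent patterns of FP16 words
// and track the two most frequent ones (top1/top2), mirroring the case_count SRAM.
#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>

struct TopCases { uint16_t m1 = 0, m2 = 0; };   // most / second-most frequent keys

TopCases count_cases(const uint16_t* data, std::size_t n) {
    std::array<uint32_t, 64> count{};           // 1 sign bit + 5 exponent bits -> 64 entries
    TopCases top;
    uint32_t c1 = 0, c2 = 0;                    // counts held by top1 / top2

    for (std::size_t i = 0; i < n; ++i) {
        uint16_t key = static_cast<uint16_t>(data[i] >> 10);  // upper 6 bits: sign + exponent
        uint32_t c = ++count[key];
        if (key == top.m1) {
            c1 = c;
        } else if (key == top.m2 || c > c2) {   // update or displace the top2 entry
            top.m2 = key; c2 = c;
            if (c2 > c1) { std::swap(top.m1, top.m2); std::swap(c1, c2); }  // promote to top1
        }
    }
    return top;                                 // passed to the compressor as M1/M2
}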
Fig. 3. The detailed illustration of the proposed compression approach. For the compressed
floating-point format, 2 stages including case count and compressor are used.
3. Compressor
When atom data in a conventional FP format is input, the sign and exponent of the
input are evaluated to determine whether they correspond to one of the two cases identified
in the previous case counting procedure. If the input belongs to a non-frequent case,
the data is sent to the other board in the vanilla FP format without compression. However, if the input matches one of the two most frequent cases, its sign and exponent fields are compressed into a 1-bit code.
As shown in Fig. 3, the decimal values in the conventional FP format and the proposed most frequently
used case (MFC) format remain unchanged. This indicates that compression does not
affect the accuracy of the data.
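A matching sketch of the compressor for a single FP16 element is shown below, under the assumption that a compressed symbol is a 1-bit case code followed by the 10-bit mantissa (11 bits total), with the per-element flag later forming the compression map of Section III-4:

// Sketch of the MFC compressor for one FP16 element: the 1-bit case code
// (1 = M1, 0 = M2) is concatenated with the 10-bit mantissa (11 bits total);
// non-frequent cases pass through as the original 16-bit word.
#include <cstdint>

struct Symbol {
    uint16_t bits;        // payload (11 valid bits if compressed, 16 otherwise)
    uint8_t  width;       // 11 or 16
    bool     compressed;  // contributes the per-element bit of the compression map
};

Symbol compress_fp16(uint16_t word, uint16_t m1, uint16_t m2) {
    uint16_t key      = static_cast<uint16_t>(word >> 10);   // sign + exponent (6 bits)
    uint16_t mantissa = static_cast<uint16_t>(word & 0x03FF);

    if (key == m1) return { static_cast<uint16_t>((1u << 10) | mantissa), 11, true };
    if (key == m2) return { mantissa, 11, true };             // case code 0 + mantissa
    return { word, 16, false };                               // vanilla FP16, uncompressed
}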
Fig. 4. Distribution of sign and exponent patterns in FP16 communication data during
the simulation of 512K atoms with 2 MPI tasks. (a) Forward communication. (b) Reverse
communication.
Fig. 4 illustrates the distribution of sign and exponent patterns in FP16 communication
data when simulating 512K atoms using 2 MPI tasks.
Fig. 4(a) shows the pattern distribution of the 1-bit sign and 5-bit exponent during forward
communication, while Fig. 4(b) presents the distribution during reverse communication. When using two patterns for
compression, 70% of the total communication data can be compressed. As the number
of patterns increases, the portion of data that can be compressed also increases.
However, as the number of target patterns expands from 2 to 4 to 8, the compressed code grows by 1 bit with each doubling. As a result, the overhead grows faster than
the portion of compressible data. Therefore, we chose to use only two patterns for
compression.
To analyze this, we performed a simulation with 512K FP16-format atoms using 2 MPI
tasks. In the baseline approach that transfers all data without compression, 270K
16-bit data points were transmitted. When applying the 2-pattern compression technique,
210K data points were transmitted as 11-bit compressed data and 60K data points were
transmitted as uncompressed 16-bit data. Based on the PCIe energy consumption of 7.5 pJ per bit, the proposed compression method reduced the total energy consumption from 32.4 µJ to 24.5 µJ. As a result, we confirmed that our method achieved a 23.6% reduction in energy consumption compared to the baseline approach. Under the same simulation conditions, compression using 4 patterns results in the transmission of 250K data points as 12-bit data and 20K data points as uncompressed 16-bit data. In this case, the total energy consumption is 24.9 µJ. Consequently, the 4-pattern scheme achieves a 22.6% reduction in energy consumption compared to the baseline.
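These totals follow directly from the stated bit counts and the per-bit PCIe energy:
$$E_{\text{base}} = 270\text{K} \times 16\ \text{bit} \times 7.5\ \text{pJ/bit} \approx 32.4\ \mu\text{J},$$
$$E_{\text{2-pattern}} = (210\text{K} \times 11 + 60\text{K} \times 16)\ \text{bit} \times 7.5\ \text{pJ/bit} \approx 24.5\ \mu\text{J},$$
$$E_{\text{4-pattern}} = (250\text{K} \times 12 + 20\text{K} \times 16)\ \text{bit} \times 7.5\ \text{pJ/bit} \approx 24.9\ \mu\text{J}.$$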
4. Verlet Algorithm with Compression
Algorithm 1 presents the pseudo-code for the Verlet algorithm in the proposed method.
In each timestep, the Initial_integrate() function updates the position and velocity
of the atoms. The count value for the frequent sign and exponent case of the updated
position (Updated PSE case) is read. The read value is incremented by 1 and written
back, and this process is repeated for the entire amount of communicated atoms (EA).
After the counting is completed, the position (P) is input for communication. If the
PSE case matches the most frequently used case (M1), the PSE is compressed into 1-bit
as 1 and concatenated with the mantissa of the position (PM). If the PSE case matches
the second most frequently used case (M2), the PSE is also compressed into 1-bit as
0 and concatenated with PM in the same manner. If the PSE does not match M1 or M2,
no compression is performed. Once all compression is completed, off-chip communication
occurs with another processor to exchange the position (P) values of ghost atoms within
the cut-off distance using the Forward_comm() function.
In the Force_compute() step, forces (F) for all atoms are computed based on $f_R\left(r_{ij}\right)$
and $f_A\left(r_{ij}\right)$, as described in Section II. The force (F) is processed
in the same manner as the previous steps, with counting and compression repeated for
EA. The compressed force data is then exchanged with another processor through off-chip
communication using the Reverse_comm() function. Finally, the Final_integrate() function
updates the velocity of the atoms. This procedure is repeated for timestep iteration
(N).
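For illustration, the per-timestep flow of Algorithm 1 can be summarized by the following C++ sketch; the empty bodies are placeholders marking where each unit of this section acts, not the actual implementation:

// Structural sketch of the compressed Verlet timestep loop of Algorithm 1.
#include <cstdint>
#include <vector>

struct System { std::vector<uint16_t> p, v, f; };   // FP16 words of P, V, F components

static uint16_t M1 = 0, M2 = 0;                     // most frequent sign/exponent cases

static void initial_integrate(System&) { /* update P and V; count PSE cases, set M1/M2 */ }
static void force_compute(System&)     { /* Tersoff forces; count FSE cases, set M1/M2 */ }
static void final_integrate(System&)   { /* final velocity update */ }
static void forward_comm(const std::vector<uint16_t>&) { /* off-chip ghost-position exchange */ }
static void reverse_comm(const std::vector<uint16_t>&) { /* off-chip ghost-force exchange */ }

// Placeholder: in the real flow this emits 11-bit or 16-bit symbols (Section III-3).
static std::vector<uint16_t> compress_all(const std::vector<uint16_t>& d,
                                          uint16_t /*m1*/, uint16_t /*m2*/) { return d; }

void run(System& sys, int N) {
    for (int step = 0; step < N; ++step) {
        initial_integrate(sys);                     // case counting happens during this update
        forward_comm(compress_all(sys.p, M1, M2));  // compressed positions of ghost atoms
        force_compute(sys);
        reverse_comm(compress_all(sys.f, M1, M2));  // compressed forces of ghost atoms
        final_integrate(sys);
    }
}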
To clearly distinguish between compressed data and uncompressed FP16 values during
decoding, we employ a compression map. This map is implemented as a 1-bit stream,
introducing an overhead of 6.25% over the original FP16 data. As shown in Fig. 4(a), if 71.3% of the total elements are compressed, the average number of bits per element
is reduced to 13.4 bits. Therefore, the proposed compression method reduces the communication
volume by 16% and the overhead introduced by the compression map becomes negligible.
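The 13.4-bit figure follows from the compressed fraction and the 1-bit map: $0.713 \times 11 + 0.287 \times 16 + 1 \approx 13.4$ bits per element, i.e., roughly 84% of the original 16-bit width.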
5. Update Unit
The position update unit performs a function equivalent to Initial_integrate() in
the Verlet algorithm. In the position update unit, Newton's second law $F=ma$ is used
to update the velocity $V$ and position P based on the input force F and velocity
$V$. Two parameters are used in this process: $dtfm$ and $dtv$. The $dtfm$ parameter scales the force by the timestep divided by the atomic mass, so that $F \cdot dtfm$ corresponds to the acceleration $F/M$ integrated over the timestep, i.e., a velocity increment. The $dtv$ parameter is the timestep itself, so that $V \cdot dtv$ corresponds to the velocity integrated over the timestep, i.e., a position increment. The value of input $F$, multiplied by the $dtfm$ parameter,
is added to the velocity $V$, while the value of input $V$, multiplied by the $dtv$
parameter, is added to the position $P$. This update is performed for all atoms, and
once communication and the force computation are completed, the velocity update unit
is triggered.
The velocity update unit performs a function equivalent to Final_integrate() in the
Verlet algorithm. In this unit, the velocity $V$ is updated based on the input force $F$. The input force $F$, multiplied by the $dtfm$ parameter, is added to the velocity $V$.
Similar to the position update unit, updates are performed for all atoms.
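Under this reading of dtfm and dtv, both update units can be summarized by the following minimal C++ sketch (a conceptual model of the dataflow, not the RTL):

// Sketch of the position and velocity update units.
// dtfm ~ timestep / mass (force-to-velocity factor), dtv ~ timestep (velocity-to-position factor).
#include <cstddef>
#include <vector>

struct State { std::vector<double> x, v, f; double dtfm, dtv; };

// Position update unit (Initial_integrate): v += dtfm * f; x += dtv * v.
void position_update(State& s) {
    for (std::size_t i = 0; i < s.x.size(); ++i) {
        s.v[i] += s.dtfm * s.f[i];   // force scaled by dtfm -> velocity increment
        s.x[i] += s.dtv  * s.v[i];   // velocity scaled by dtv -> position increment
    }
}

// Velocity update unit (Final_integrate): v += dtfm * f with the recomputed forces.
void velocity_update(State& s) {
    for (std::size_t i = 0; i < s.v.size(); ++i)
        s.v[i] += s.dtfm * s.f[i];
}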
6. Compute Unit
The Compute Unit contains four calculation units: the repulsive unit, zeta unit, force_zeta
unit, and attractive unit, each performing functions equivalent to the pair_tersoff()
in the Verlet algorithm. The Compute Unit is responsible for calculating forces for
both two-atom and three-atom interactions. The unit iterates over the neighbor list
of all atoms. It receives the positions of atom $i$ and atom $j_1$, computes the distance
$r$ between them, and activates the repulsive unit only if the distance is less than
the cutoff distance, generating a repulsive force value through the repulsive unit.
The process is repeated for each neighboring atom $j_1$ of atom $i$. At each iteration,
the force value from the repulsive interaction is subtracted from the existing force
for atom $j_1$.
Once the force generation for neighboring atom $j_1$ is complete, the positions of
atom $j_2$ and atom $k_1$ are received. The distance between atom $i$ and atom $j_2$
is calculated, and if it is less than the cutoff distance, the distance between atom
$i$ and atom $k_1$ is also computed. If this distance is also below the cutoff distance,
the zeta unit is triggered, accumulating the zeta value for the $ij$ pair. Zeta is
a parameter used to adjust the bond-order.
After repeating the zeta unit process, the force_zeta unit is executed to generate
force values. Finally, the positions of a new atom $k_2$ are received, and the attractive
unit is activated if the distance between atoms is below the cutoff distance.
During the repetitions for $k_2$ iterations, the forces from the attractive unit for
atoms $i$, $j_2$, and $k_2$ are accumulated. After the loop for $k_2$ ends, the forces
for atom $j_2$ are accumulated during the repetitions for $j_2$ iterations. Finally,
once the loop for $j_2$ is complete, the forces for atom $i$ are accumulated during
the repetitions for $i$ iterations. The accumulated or subtracted forces up to this
point are written back to the global buffer at the end of each iteration.
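The loop structure described above can be summarized by the following structural C++ sketch. The *_stub functions stand in for the actual Tersoff expressions evaluated by each unit, and forces are treated as per-atom scalars for brevity; it is a sketch of the iteration order, not the hardware implementation:

// Structural sketch of the compute unit loops: two-body repulsive pass, then
// three-body zeta accumulation, force_zeta, and attractive passes.
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { double x, y, z; };

static double dist(const Vec3& a, const Vec3& b) {
    double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}
// Placeholders for the repulsive, zeta, force_zeta, and attractive units.
static double repulsive_stub(double)                  { return 0.0; }
static double zeta_stub(double, double)               { return 0.0; }
static double force_zeta_stub(double, double)         { return 0.0; }
static double attractive_stub(double, double, double) { return 0.0; }

void compute_forces(const std::vector<Vec3>& pos,
                    const std::vector<std::vector<int>>& neigh,
                    double cutoff, std::vector<double>& force) {
    for (std::size_t i = 0; i < pos.size(); ++i) {
        // Two-body pass over neighbors j1: repulsive unit.
        for (int j1 : neigh[i]) {
            double r = dist(pos[i], pos[j1]);
            if (r < cutoff) force[j1] -= repulsive_stub(r);     // subtracted, as in the text
        }
        // Three-body pass over neighbors j2 and k: zeta, force_zeta, attractive units.
        for (int j2 : neigh[i]) {
            double r_ij = dist(pos[i], pos[j2]);
            if (r_ij >= cutoff) continue;
            double zeta = 0.0;
            for (int k1 : neigh[i]) {                           // zeta unit: accumulate bond order
                if (k1 == j2) continue;
                double r_ik = dist(pos[i], pos[k1]);
                if (r_ik < cutoff) zeta += zeta_stub(r_ij, r_ik);
            }
            force[i] += force_zeta_stub(r_ij, zeta);            // force_zeta unit
            for (int k2 : neigh[i]) {                           // attractive unit
                if (k2 == j2) continue;
                double r_ik = dist(pos[i], pos[k2]);
                if (r_ik >= cutoff) continue;
                double fa = attractive_stub(r_ij, r_ik, zeta);
                force[i] += fa; force[j2] += fa; force[k2] += fa;  // accumulate for i, j2, k2
            }
        }
        // Accumulated/subtracted forces are written back to the global buffer here.
    }
}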
7. Hardware Design
Fig. 5 illustrates the operation of the main components in the CMD architecture and the
top-level structure with MPI based parallelization. The top-level architecture consists
of multiple boards (K) connected by a data bus (PCIe). In this stage, we assume that
each MPI process is dedicated to 1 board. Each board is composed of DRAM and a CMD
chip.
The CMD chip includes an integrate unit, a compute unit, a communication unit, 9 global buffers of 128 KB each, and a global controller. The 9 global buffers consist of 3 position buffers (PB), 3 velocity buffers (VB), and 3 force buffers (FB). The 3 buffers of each type (PB/VB/FB) correspond to the $x$, $y$, and $z$ dimensions, respectively.
These global buffers distribute the required data to the integrate unit, compute unit,
and communication unit.
The integrate unit comprises a velocity update unit and a position update unit. The compute unit contains the repulsive, attractive, force_zeta, and zeta units. Additionally, the CMD architecture includes a case_counter and an 8 KB case_count SRAM.
The communication unit consists of a compression unit, a decompression unit, and MPI_receive
and MPI_send units used for off-chip communication. The position and velocity information
are received from the global buffer and used to update the system. As mentioned in
Section III-2 and III-4, the counting process for the cases is conducted. The case
counter then stores the sign and exponent information along with the count value in
the case count SRAM. As explained in Section III-3, the most frequently used case
is sent to the compression unit, and the communicated atom data is received from the
global buffer for compression. The compressed data is sent to another processor via
MPI_send unit, and the atom information from the other processor is received through
MPI_receive unit. The information received is then decompressed and sent to the global
buffer.
The force compute unit receives input from the global buffer and performs force calculations; the resulting force data is then counted for force cases by the case_counter.
Similar to the position compression, the most frequent case of force is sent to the
compression unit, and the communication force data from the global buffer is compressed.
The off-chip communication of the force data with another processor is performed via
MPI_send and MPI_receive unit, and the processed data is stored in the global buffer.
The force and velocity information stored in the global buffer are transmitted to
the velocity update unit, where the received force and velocity are used to update
the velocity. The updated velocity is then written back to the global buffer.
Fig. 5. Top-level architecture of our proposed CMD approach. The number of CMD chips/boards can be changed depending on the target MPI parallelism.
Fig. 6. The energy consumption comparison between the baseline and the proposed architecture
across various conditions.
IV. RESULTS
1. Experimental Setup
In this section, we evaluate the performance of the CMD architecture in terms of energy
consumption during data communication over the PCIe protocol. We set the vanilla communication
method in LAMMPS as the baseline and compared it with the proposed compact communication
scheme. Subsequently, we varied the total number of atoms, floating-point format,
and MPI tasks for the evaluation. In this design, we used the PCIe 4.0 protocol with 4 lanes, with the maximum bandwidth set to 8 GB/s and energy per bit of 7.5 pJ/bit.
As the DRAM model, we used DDR4, with a maximum bandwidth of 25.6 GB/s and energy per bit of 5.1 pJ/bit [10]. The CMD architecture was designed using Verilog HDL, and gate-level synthesis was
performed using Synopsys Design Compiler. The area of top-level architecture was obtained
through PrimeTime.
2. Results
1) Normalized Communication Energy
In FP64 with 2 MPI configuration, the proposed compression scheme reduces the board-to-board
communication volume by 11.64% in the 256K atom case. For lower-bit formats, the energy efficiency becomes higher because the mantissa occupies a smaller fraction of the word relative to the sign and exponent fields.
As a result, the energy consumption is reduced by 16.3% compared to the baseline. Under the same conditions, the 16-bit floating-point format shows even lower energy consumption than the 32-bit format, with a 23.2% reduction. This result indicates that reducing the bit width of the floating-point format leads to lower energy consumption.
When comparing the energy consumption of the proposed communication method with a
total of 256K atoms using the floating-point 16-bit format, 4 MPI tasks achieve a 23.67% energy reduction compared to the baseline, which is more efficient than 2 MPI tasks. Under the same conditions, using 8 MPI tasks results in a 24.6% energy reduction compared to the baseline, leading to further energy savings.
2) Area breakdown
Fig. 7 represents the area breakdown of the CMD architecture. The memory components consist
of input SRAM and count_SRAM, while the logic components include force_compute, initial_integrate,
final_integrate, and the extra compressor and case_counter modules for the compressed
MPI communication. The input SRAM and force_compute module occupy 97.93% of the total
area. The area overhead, including the count_SRAM, compressor, and case_counter, accounts for 0.43% of the total area. The proposed CMD architecture can reduce energy consumption by up to 24.6% with negligible area overhead.
Fig. 7. The area breakdown of the proposed CMD architecture.