Loop Fusion
Numerical algorithms and optimization methods are typically composed of numerous consecutive sparse matrix computations. For example, in iterative solvers such as Krylov methods, sparse kernels that apply a preconditioner or update the residual are repeatedly executed inside and between iterations of the solver. Optimizing such kernels separately introduces synchronization points and load imbalance between threads. It also misses opportunities for data reuse between two sparse computations that could be realized if they were optimized together. Sympiler uses sparse fusion to fuse sparse kernels, improving locality and thread-level parallelism. Sparse fusion consists of a transformation and an inspector.
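For concreteness, the sketch below (an illustrative example, not Sympiler code) shows two consecutive CSR sparse matrix-vector products where the second kernel consumes the vector produced by the first. This producer-consumer sharing of x is exactly the kind of data reuse that fusing the two loops can exploit.

// Illustrative only: two consecutive sparse kernels, z = A*(A*y), where the
// intermediate x = A*y is produced by kernel 1 and consumed by kernel 2.
#include <vector>

// CSR storage for an n-by-n sparse matrix.
struct CSR {
  int n;
  std::vector<int> rowPtr, col;
  std::vector<double> val;
};

// out = A * in; one outer iteration per row of A.
void spmv(const CSR &A, const double *in, double *out) {
  for (int i = 0; i < A.n; ++i) {
    double sum = 0.0;
    for (int k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k)
      sum += A.val[k] * in[A.col[k]];
    out[i] = sum;
  }
}

int main() {
  // A = [[2,0,1],[0,3,0],[1,0,2]] in CSR form.
  CSR A{3, {0, 2, 3, 5}, {0, 2, 1, 0, 2}, {2, 1, 3, 1, 2}};
  std::vector<double> y{1, 1, 1}, x(3), z(3);
  spmv(A, y.data(), x.data());  // kernel 1: writes x
  spmv(A, x.data(), z.data());  // kernel 2: reads x while it is still hot
  return 0;
}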
Transformation
To generate the fused code, the fusion transformation is applied to the initial AST at compile time, producing two variants of the fused code, shown below: separated and interleaved. At runtime, the fused code uses a reuse ratio (reuse_ratio), which measures how much data the two loops share, to select the variant appropriate for the specific input. The fusion flag (FusedSchedule.fusion) in the guard of the two fused variants is set to False if the inspector determines that fusion is not profitable. The separated variant is selected when the reuse ratio is smaller than one; in this variant, the iterations of each loop run consecutively without checking the loop type. The interleaved variant is chosen when the reuse ratio is greater than or equal to one; in this variant, iterations of the two loops run interleaved, so the variant checks the loop type on every iteration.
/// The input AST to the fusion transformation
Fuse: for (I1) {                       // loop 1
  ...
  for (In)
    x[h(I1,...,In)] = a*y[g(I1,...,In)];
}
Fuse: for (J1) {                       // loop 2
  ...
  for (Jm)
    z[h'(J1,...,Jm)] = a*x[g'(J1,...,Jm)];
}
/// Separated variant of the fusion transformation
if (FusedSchedule.fusion && reuse_ratio < 1) {
  for (every s-partition s) {
    #pragma omp parallel for
    for (every w-partition w) {
      for (v ∈ FusedSchedule[s][w].L1) {   // loop 1
        ...
        for (In)
          x[h(v,...,In)] = a*y[g(v,...,In)];
      }
      for (v ∈ FusedSchedule[s][w].L2) {   // loop 2
        ...
        for (Jm)
          z[h'(v,...,Jm)] = a*x[g'(v,...,Jm)];
      }
    }
  }
}
/// Interleaved variant of the fusion transformation
if (FusedSchedule.fusion && reuse_ratio >= 1) {
  for (every s-partition s) {
    #pragma omp parallel for
    for (every w-partition w) {
      for (v ∈ FusedSchedule[s][w]) {
        if (v.type == L1) {                // loop 1
          for (In)
            x[h(v.id,...,In)] = a*y[g(v.id,...,In)];
        } else {                           // loop 2
          for (Jm)
            z[h'(v.id,...,Jm)] = a*x[g'(v.id,...,Jm)];
        }
      }
    }
  }
}
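The variants above index a FusedSchedule by an s-partition and a w-partition. One possible concrete layout is sketched below in C++; the names and fields are assumptions for illustration, not Sympiler's actual types.

// Hypothetical layout of the fused schedule produced by the inspector;
// names are illustrative, not Sympiler's actual API.
#include <vector>

enum class LoopId { L1, L2 };      // which input loop an iteration comes from

struct Iteration {
  int id;                          // outermost iteration index (v.id above)
  LoopId type;                     // checked per iteration when interleaved
};

struct WPartition {
  std::vector<int> L1, L2;         // iteration lists for the separated variant
  std::vector<Iteration> mixed;    // typed iteration list for the interleaved variant
};

struct FusedSchedule {
  bool fusion;                                  // set by the inspector
  std::vector<std::vector<WPartition>> parts;   // parts[s][w]: w-partitions within an
                                                // s-partition run in parallel; consecutive
                                                // s-partitions are separated by a barrier
};

With such a layout, the separated variant walks parts[s][w].L1 and then parts[s][w].L2, while the interleaved variant walks mixed and branches on each iteration's type.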
Inspector
Sparse fusion uses the multi-sparse DAG partitioning (MSP) algorithm to create an efficient fused partitioning, which is then used to schedule iterations of the fused code. MSP partitions the vertices of the DAGs of the two input kernels to create load-balanced parallel workloads for all cores while improving locality within each thread. The inputs to MSP are the dependency matrix between the sparse methods, the DAG of each method, and a reuse ratio. Sparse fusion analyzes the code of both methods, available from the AST, to generate the inspector components that create these inputs. MSP then proceeds in three steps: (1) vertex partitioning and partition pairing, which aggregates iterations without joining the DAGs of the input kernels; (2) merging and slack vertex assignment, which reduces synchronization and balances workloads; and (3) packing, which improves locality.
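As a simplified stand-in for the kind of DAG analysis that step (1) builds on (an assumption for illustration; MSP's actual partitioning and pairing are more involved), the sketch below computes the level sets of one kernel's DAG. Vertices in the same level are mutually independent, so each level can be split into load-balanced w-partitions across cores.

// Simplified illustration, not the MSP algorithm itself: group the vertices
// of a DAG into wavefronts (level sets) with a Kahn-style topological sweep.
#include <algorithm>
#include <queue>
#include <vector>

// DAG in CSR-like form: edges[ptr[v] .. ptr[v+1]) lists the successors of v.
std::vector<std::vector<int>> levelSets(int n, const std::vector<int> &ptr,
                                        const std::vector<int> &edges) {
  std::vector<int> indeg(n, 0);
  for (int e : edges) indeg[e]++;          // count incoming edges
  std::vector<int> level(n, 0);
  std::queue<int> ready;
  for (int v = 0; v < n; ++v)
    if (indeg[v] == 0) ready.push(v);      // source vertices sit in level 0
  int maxLevel = 0;
  while (!ready.empty()) {
    int v = ready.front(); ready.pop();
    maxLevel = std::max(maxLevel, level[v]);
    for (int k = ptr[v]; k < ptr[v + 1]; ++k) {
      int u = edges[k];
      level[u] = std::max(level[u], level[v] + 1);  // u runs after v
      if (--indeg[u] == 0) ready.push(u);
    }
  }
  std::vector<std::vector<int>> sets(maxLevel + 1);
  for (int v = 0; v < n; ++v) sets[level[v]].push_back(v);
  return sets;
}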
A detailed explanation of sparse fusion is provided in the references.