Block-Tridiagonal Solver Extension to SUNDIALS

Solver Concept

For some engineering problems the resulting Jacobian matrix has a block-tridiagonal structure. While such problems can be solved with the generic band solver included in SUNDIALS, a more efficient implementation is proposed here. This is work in progress, so please be prepared to encounter bugs and problems.

Implementation Details

The block-tridiagonal solver, named BTridiag, is implemented both as a DlsMat structure variant and as a plain C function variant. The memory layout is band-wise: the lower block band L, followed by the main band M and the upper band U. L[0] and U[n-1] are used as additional blocks in the first and last block rows (see Tech. Rep. UCID-30150, "Solution of Block-Tridiagonal Systems of Linear Algebraic Equations", Hindmarsh 1977).

The solver is based on the original band solver implementation (essentially a copy, with functions renamed and their bodies substituted).

The implemented functions are the typical create, copy, and scale functions of DlsMat structures and, most importantly, LU factorization and backsolve.

Mathematical Considerations

The LU factorization is an in-place, block-matrix variant of Crout's method.
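For reference, the block recurrences behind this factorization can be sketched as follows (the in-place storage details of the actual implementation may differ). With blocks \$L_i\$, \$M_i\$, \$U_i\$ on the lower, main, and upper band, the factorization overwrites the main-diagonal blocks with

```latex
\Gamma_1 = M_1, \qquad
\Gamma_i = M_i - L_i\,\Gamma_{i-1}^{-1}\,U_{i-1}, \quad i = 2,\dots,n,
```

and the backsolve then runs a forward sweep \$y_1 = b_1\$, \$y_i = b_i - L_i\,\Gamma_{i-1}^{-1}\,y_{i-1}\$ followed by a backward sweep \$x_n = \Gamma_n^{-1} y_n\$, \$x_i = \Gamma_i^{-1}(y_i - U_i\,x_{i+1})\$.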

Two options are implemented: a pivoting version and a non-pivoting version. The routine selects the appropriate code path based on the pivoting array pointer passed to the function: if a NULL pointer is passed, the non-pivoting algorithms are used.
Pivoting is done only at block level, not in global scope.

Special Algorithms

Besides the generic algorithm, the 1x1 version (a plain tridiagonal matrix) is always specialized as a non-pivoting version.

Currently, specialized versions (with manual loop unrolling) exist for matrix dimensions 2x2, 3x3, and 4x4.

Implementation Variants

Besides the pivoting and non-pivoting versions of LU and backsolve, two alternatives exist regarding the internal addressing of memory blocks.

Variant 1

The addresses of the bands are stored in a vector:

```
A->cols[0]  /* points to start of lower block diagonal */
A->cols[1]  /* points to start of main block diagonal */
A->cols[2]  /* points to start of upper block diagonal */

/* or alternatively the realtype** version */
a[0]  /* points to start of lower block diagonal */
a[1]  /* points to start of main block diagonal */
a[2]  /* points to start of upper block diagonal */
```

Extra memory needed for these address pointers: 3 realtype* pointers.

Within the LU factorization and backsolution algorithms, the offsets to the individual blocks are computed, for example:

```
/* blockmemsize is the memory consumption of a dense matrix block */
blockmemsize = blockdim*blockdim;
/* block 6 (index 5) in the lower band diagonal */
realtype * block = a[0] + 5*blockmemsize;
```

Variant 2

Addresses of all blocks are stored in the A->cols[] array (or a[] array, respectively).

```
A->cols[0]    /* points to start of lower block diagonal */
A->cols[n]    /* points to start of main block diagonal */
A->cols[2*n]  /* points to start of upper block diagonal */

/* or alternatively the realtype** version */
a[0]    /* points to start of lower block diagonal */
a[n]    /* points to start of main block diagonal */
a[2*n]  /* points to start of upper block diagonal */
```

Extra memory needed for these address pointers: 3*n realtype* pointers.

Within the LU factorization and backsolution algorithms, the individual blocks can be accessed directly, for example:

```
/* block 6 (index 5) in the lower band diagonal */
realtype * block = a[5];
```

Performance Comparison

Performance is compared to the current band-matrix implementation within SUNDIALS. Two test cases are used: one with matrices that require pivoting, and one with matrices that do not.

Pivoting Version

The test creates block-tridiagonal matrices with randomized content and small main-diagonal elements, which ensures that both the BTridiag and Band routines need to do at least some pivoting work. The number of block rows \$n\$ is kept constant (n=1000), and the block dimension \$m\$ is increased gradually.

Each operation is repeated for a number of cycles. The time needed to restore the original matrices and right-hand-side vectors is benchmarked beforehand and subtracted.

Results: LU Factorization

Results obtained on an Intel Core i7-2820QM 2.3 GHz, Windows 7 Professional, Visual Studio 2010 Express.

Cubic complexity with respect to \$m\$ dominates the LU routine; therefore the times are scaled by \$m^3\$. The special implementation for 1x1 (no pivoting) explains the large performance improvement there. Generally, the block-based version outperforms the band implementation.

Regarding Variants 1 and 2, the difference in block addressing is only noticeable for small block dimensions, since only there the offset computation adds measurable overhead. For block dimensions of 4x4 and up, the difference in computation time is negligible, yet Variant 2 is always slightly faster (and its code is more concise).

Results: Backsolution

Results obtained on an Intel Core i7-2820QM 2.3 GHz, Windows 7 Professional, Visual Studio 2010 Express.

Quadratic complexity with respect to \$m\$ dominates the backsolve routine; therefore the times are scaled by \$m^2\$.
Again, the special 1x1 implementation (tridiagonal matrix, no pivoting) yields much better performance for the block-tridiagonal solver. For block dimensions 2x2 and 3x3, though, the band implementation performs better (probably because band matrix columns fit into a few cache lines).

Regarding Variants 1 and 2, the difference in block addressing is again only noticeable for small block dimensions, where the offset computation adds measurable overhead. For block dimensions of 4x4 and up, the difference in computation time is negligible.

Speedup Ratios

The long-term trend of both routines is best seen in a speedup ratio plot. Results were again obtained on an Intel Core i7-2820QM 2.3 GHz, Windows 7 Professional, Visual Studio 2010 Express:

Effect of Special Implementations for Lower Block-Dimensions

For 2x2, 3x3, and 4x4 blocks, specialized versions of the LU factorization and backsolve routines have been written, with the matrix-matrix and matrix-vector multiplication loops unrolled. The following diagram compares the generic routines (with only the 1x1 tridiagonal special case) to the special implementations for 1x1 through 4x4.

Adding the Block-Tridiagonal Solver to the SUNDIALS 2.4 Release

Library Extension

The simplest way is to add the files in the archive sundials_btridiag_extension.zip to the SUNDIALS sources in the respective directories.

The files

sundials_direct.h
sundials_direct.c

are modified versions of the original files, while

sundials_btridiag.h
sundials_btridiag.c

are new files.

Also, the updated makefiles CMakeLists.txt and Makefile.in need to be placed in the respective directories. Since the archive already contains the directory structure, simply extracting it into the SUNDIALS source root directory should place all files in the correct paths.

Example

The archive sundials_btridiag_examples.zip contains a functionality test and demonstration program, as well as a direct performance comparison between the band and block-tridiagonal solvers.

Integration of the Block-Tridiagonal Linear Solver Package into SUNDIALS Solvers

Currently, the integration has only been done for CVODE. You need to download the archive code_btridiag_integration.zip.

The files

cvode_direct.h
cvode_direct.c

are modified versions of the original files, while

cvode_btridiag.h
cvode_direct_impl.h
cvode_btridiag.c

are new files.

Also, the updated makefiles CMakeLists.txt and Makefile.in need to be placed in the respective directories. Since the archive already contains the directory structure, simply extracting it into the SUNDIALS source root directory should place all files in the correct paths.