Different ways to add entries to Tpetra sparse matrix

Lesson topics

This lesson explains how to add entries to a Tpetra sparse matrix (Tpetra::CrsMatrix). It explains how you can choose any of various ways to add entries to the matrix. Some are more flexible but less efficient, and others are more efficient but less flexible. In general, knowing the sparse matrix structure in advance will reduce both peak memory usage and total run time.

This lesson does not currently include code examples. The section Exercise at the end suggests modifying the example Tpetra Lesson 03: Power method in order to learn about the different ways to add or change the entries of a sparse matrix.

Filling a sparse matrix

We call adding entries to a sparse matrix filling it, and call the general procedure "sparse matrix fill" or just fill. Often people have finite element assembly in mind when they use this expression, but fill here means adding entries for any reason. For example, you might want to

solve an optimization problem with a set of linear constraints, so you would add one row to the sparse matrix for each constraint; or
represent a weighted graph as a sparse matrix, so that each row of the matrix represents a vertex's adjacency list and corresponding edge weights.

Also, by extension, "fill" might refer to modifying existing entries of a sparse matrix.

Rowwise access, not rowwise distribution

CrsMatrix inherits from the RowMatrix interface; it "is a" RowMatrix. Some people have the misconception that "RowMatrix" (and therefore CrsMatrix) refers to a rowwise distribution of the sparse matrix over parallel processes. It does not. CrsMatrix supports an arbitrary "two-dimensional" distribution of rows and columns over processes. The term instead refers to rowwise access to the data. That is, the methods in this interface and in CrsMatrix let you add or access entries on each process by row.

This distinction matters because two or more processes might share entries in a row. Asking for "all the entries in a row" on a particular process only accesses the entries owned by that process, which is not necessarily all the entries in a row.

Whether adding entries or modifying existing ones, you may always do so for any number of entries in the row, such that their columns are owned by the calling process. You should always modify as many entries with one method call as possible, in order to amortize function call and data access overhead.

Phases of fill

A sparse matrix may be in any of three different states:

Post construction, before first call to fillComplete()
fillComplete() has been called after construction or resumeFill()
resumeFill() has been called

In States 1 and 3, you are allowed to set or modify values in the matrix. If the matrix was not created with a constant graph (see below), and if there is room in the row (see "static profile" discussion below), you are allowed to insert entries into the matrix.

We call a matrix in State 2 "fill complete." In State 2, you are not allowed to modify the matrix at all. (This is different than Epetra_CrsMatrix, which allows modifications to the values (but not the structure) of a fill complete matrix.)

Calling fillComplete() does a lot of work. In general, it requires distributed-memory communication as well as local rearrangement of data. The intent is that you should exploit this work for as long as possible before calling resumeFill(). You can do so by performing multiple sparse matrix-vector multiplies or triangular solves.

Trade-offs between flexibility and performance

In all the examples below, we assume the following "preamble" of code:

using Teuchos::RCP;
using Teuchos::rcp;
// Your scalar type; the type of sparse matrix entries. e.g., double.
typedef ST ...; 
// Your local ordinal type; the signed integer type 
// used to store local sparse matrix indices.  e.g., int.
typedef LO ...; 
// Your global ordinal type; the signed integer type 
// used to index the matrix globally, over all processes.
// e.g., int, long, ptrdif_t, int64_t, ...
typedef GO ...; 
// The Node type.  e.g., Kokkos::DefaultNode::DefaultNodeType, 
// defined in KokkosCompat_DefaultNode.hpp.
typedef NT ...; 

Allocation with a size hint

CrsMatrix lets you make trade-offs between flexibility and performance. All of these trade-offs revolve around how much you know about the structure of the sparse matrix – that is, its graph – in advance. In the most general case, all you know is which process owns which row – the row Map.

RCP<const Map<LO, GO, NT> > rowMap = ...; // whatever row Map you want

RCP<CrsMatrix<ST, LO, GO, NT> > A = rcp (new CrsMatrix<ST, LO, GO, NT> (rowMap ,sizehint));

This code snippet says that you want to create a sparse matrix with the given row distribution with a certain approximate number of entries per row. It also says that you intend each process to own all the entries in a given row. If you want to allow a more general "2-D" distribution, you must also tell the constructor which process owns which columns of the matrix. You do so by supplying a "column Map":

RCP<const Map<LO, GO, NT> > rowMap = ...; // whatever row Map you want
RCP<const Map<LO, GO, NT> > colMap = ...; // whatever column Map you want
RCP<CrsMatrix<ST, LO, GO, NT> > A = rcp (new CrsMatrix<ST, LO, GO, NT> (rowMap, colMap, sizehint));

All the examples that follow allow you to supply a column Map if you which, so we omit this for brevity.

CrsMatrix lets you make trade-offs between flexibility and performance. All of these trade-offs revolve around how much you know about the structure of the sparse matrix – that is, its graph – in advance. In the most general case, all you know is which process owns which row – the row Map — and an approximate number of nonzeros per row. You may do so either with a single number, which is an expected upper bound on the number of entries in any one row, or with an upper bound for each row of the matrix. The typical use case is a single upper bound, which we illustrate below.

RCP<const Map<LO, GO, NT> > rowMap = ...; // whatever row Map you want
const size_t maxEntriesPerRow = 3;
RCP<CrsMatrix<ST, LO, GO, NT> > A = 
  rcp (new CrsMatrix<ST, LO, GO, NT> (rowMap, maxEntriesPerRow));

Giving an upper bound on the number of entries per row allows the matrix to preallocate space for each row, which is especially critical on architectures where memory allocation is relatively expensive (e.g. GPUs).

Constant graph

The most efficient but least flexible fill method is to create the CrsMatrix with a constant graph. That is, you create a CrsGraph separately, fill it, call its fillComplete() method, then pass the graph to the CrsMatrix constructor. This completely constrains the structure of the CrsMatrix. You may only set or modify values in the matrix, not the structure. (This means you may not call insertGlobalValues() or insertLocalValues(), only the replace*Values(), sumInto*Values, scale(), and setAllToScalar() methods.

using Tpetra::CrsGraph;
// Note that the CrsMatrix constructor takes a "const" graph,
// meaning that the CrsMatrix is not allowed to change the
// graph.  You shouldn't either.  This is why we use an
// RCP<const CrsGraph> ("pointer to const graph") below.
// The graph must be fill complete.
RCP<const CrsGraph<LO, GO, NT> > graph = ...; 
// The graph comes with the row and column Maps already built in.
RCP<CrsMatrix<ST, LO, GO, NT> > A = rcp (new CrsMatrix<ST, LO, GO, NT> (graph));

Filling the CrsGraph works much like filling a CrsMatrix, except that you only specify structure, not values. Furthermore, the same size hint applies to CrsGraph.

Local vs. global indices

Please see Tpetra Lesson 02: Map and Vector for a detailed explanation of global and local indices.

In the sparse matrix's final state, it represents column indices as local indices. This is faster for operations like sparse matrix-vector multiply. It saves storage as well, especially if sizeof(LocalOrdinal) < sizeof(GlobalOrdinal). However, accessing or inserting entries by local index is only allowed under certain conditions. In particular, the matrix must have a column Map, which tells it how to convert between global and local indices for the column indices of the matrix. If you didn't create the matrix with a precomputed column Map (either in its constructor, or in the constant graph), it has to compute the column Map on its own. It does so in fillComplete().

If the matrix doesn't have a column Map yet, you must use global indices, since local indices don't exist yet. However, if you can use local indices, you should. For global indices, the matrix must go to the Map to look up the corresponding local index for every global index. Local indices only require an array access.

Remember that if column indices are local, then they are counted with respect to the column Map. Local row and column indices may not necessarily be the same. For example, the matrix entry at position (i_local, i_local), where the first i_local is the local row index and the second i_local the local column index, may not necessarily be a diagonal entry of the matrix. However, the matrix entry at global position (i_global, i_global) is always a diagonal entry of the matrix.

Exercise

Start with the Tpetra Lesson 03: Power method example, which uses global indices. Try all of the different fill techniques described in this lesson. If you like, you may also try cyclic boundary conditions instead of Dirichlet boundary conditions in the 1-D Poisson discretization. (This makes the number of entries per row the same for all rows.)