If A is a matrix with dimensions m × r (m rows and r columns) and B is a matrix with dimensions r × n, then the product matrix C is defined and has dimensions m × n. The entries of C are given by:

(1)
$$ \begin{align} c_{ij} = \sum_{k=1}^r a_{ik} b_{kj} \end{align} $$

C

If a language does not have support for two dimensional arrays, there are a few ways to implement them using one dimensional arrays.

One could create an array of pointers to arrays, as in the sketch below. It is faster, however, to store all the entries in a contiguous region of memory.
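A minimal sketch of the array-of-pointers approach (the alloc_rows and free_rows names are invented for illustration):

#include <stdio.h>
#include <stdlib.h>

/* allocate a rows x cols matrix as an array of pointers to rows */
double **
alloc_rows(size_t rows, size_t cols) {
  double **A = malloc(rows * sizeof(double *));
  size_t r;

  for (r = 0; r < rows; ++r)
    A[r] = calloc(cols, sizeof(double));  /* each row is a separate allocation */

  return A;
}

void
free_rows(double **A, size_t rows) {
  size_t r;

  for (r = 0; r < rows; ++r)
    free(A[r]);
  free(A);
}

int
main(int argc, char **argv) {
  double **A = alloc_rows(2, 3);

  A[1][2] = 42.0;               /* rows need not be adjacent in memory */
  printf("%9.3f\n", A[1][2]);
  free_rows(A, 2);

  return(0);
}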

Row major order is a way of storing a 2D array in a 1D array in which each row is contiguous. It is the layout C uses for its built-in two dimensional arrays, and it is the layout used in the C code below.

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* index of the (row, col) entry of a rows x cols matrix stored in row major order */
static inline size_t
matrix_idx(size_t rows, size_t cols, size_t row, size_t col) {
  return col + row * cols;
}

double*
matrix_multiply(size_t rows_A, size_t cols_A, double *A,
                size_t rows_B, size_t cols_B, double *B) {
  assert(cols_A == rows_B);

  size_t rows_C = rows_A, cols_C = cols_B;
  size_t r, c, i;
  double *C = (double *)calloc(rows_C * cols_C, sizeof(double));

  for (r = 0; r < rows_A; ++r) {
    for (c = 0; c < cols_B; ++c) {
      C[matrix_idx(rows_C, cols_C, r, c)] = 0.0;
      for (i = 0; i < cols_A; ++i) {
        C[matrix_idx(rows_C, cols_C, r, c)]
          += A[matrix_idx(rows_A, cols_A, r, i)]
          * B[matrix_idx(rows_B, cols_B, i, c)];
      }
    }
  }

  return C;
}

void
matrix_print(size_t rows_A, size_t cols_A, double* A) {
  size_t r, c;

  for (r = 0; r < rows_A; ++r) {
    for (c = 0; c < cols_A; ++c) {
      printf("%9.3f ", A[matrix_idx(rows_A, cols_A, r, c)]);
    }
    printf("\n");
  }
}

int
main(int argc, char **argv) {
  double A[4][3] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}, {10, 11, 12}};
  double B[3][2] = {{6, 5}, {4, 3}, {2, 1}};
  double *C = matrix_multiply(4, 3, (double *)A, 3, 2, (double *)B);

  printf("# of entries in A: %lu\n", sizeof(A) / sizeof(A[0][0]));
  printf("# of rows of A: %lu\n", sizeof(A) / sizeof(A[0]));
  printf("# of columns of A: %lu\n", sizeof(A[0]) / sizeof(A[0][0]));

  matrix_print(4, 2, C);
  free(C);

  return(0);
}

A two dimensional array is a valid C type, and we can use the sizeof operator to get the number of rows and columns. Hard-coding the matrix dimensions into the code is inconvenient, however. We don't want to define a multiplication function for every possible shape of the multiplicands, and if the end user controls the dimensions, this isn't even possible.

Note that we cast double[M][N] to double * and not to double **. A double ** would be the type for the less efficient array of pointers to arrays.

Setting the dimensions dynamically is necessary for flexibility, but passing the matrix dimensions around separately is error prone. Worse yet, such errors cause garbage output or segmentation faults. We can minimize them by defining a matrix type which records its own dimensions.

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
  size_t rows;
  size_t columns;
  double *entries;
} matrix_t;

/* index of the (row, col) entry of A, whose entries are stored in row major order */
static inline size_t
matrix_idx(matrix_t *A, size_t row, size_t col) {
  return col + row * A->columns;
}

matrix_t *
matrix_alloc(size_t rows, size_t cols) {
  matrix_t *A = malloc(sizeof(matrix_t));
  A->rows = rows;
  A->columns = cols;
  A->entries = calloc(rows * cols, sizeof(double));

  return A;
}

void
matrix_free(matrix_t *A) {
  free(A->entries);
  free(A);
}

matrix_t *
matrix_multiply(matrix_t *A, matrix_t *B) {
  assert(A->columns == B->rows);

  matrix_t *C = matrix_alloc(A->rows, B->columns);
  size_t r, c, i;

  for (r = 0; r < A->rows; ++r) {
    for (c = 0; c < B->columns; ++c) {
      C->entries[matrix_idx(C, r, c)] = 0.0;
      for (i = 0; i < A->columns; ++i) {
        C->entries[matrix_idx(C, r, c)]
          += A->entries[matrix_idx(A, r, i)]
          * B->entries[matrix_idx(B, i, c)];
      }
    }
  }

  return C;
}

void
matrix_print(matrix_t *A) {
  size_t r, c;

  for (r = 0; r < A->rows; ++r) {
    for (c = 0; c < A->columns; ++c) {
      printf("%9.3f ", A->entries[matrix_idx(A, r, c)]);
    }
    printf("\n");
  }
}

int
main(int argc, char **argv) {
  double A_entries[4][3] = {{1, 2, 3}, {4, 5, 6},
                            {7, 8, 9}, {10, 11, 12}};
  matrix_t *A = matrix_alloc(4, 3);
  memcpy(A->entries, A_entries, 4 * 3 * sizeof(double));

  double B_entries[3][2] = {{6, 5}, {4, 3}, {2, 1}};
  matrix_t *B = matrix_alloc(3, 2);
  memcpy(B->entries, B_entries, 3 * 2 * sizeof(double));

  matrix_t *C = matrix_multiply(A, B);

  matrix_print(C);
  matrix_free(A);
  matrix_free(B);
  matrix_free(C);

  return(0);
}

Perhaps one would notice an additional performance improvement if the matrix dimensions were stored contiguously with the entries. A struct containing a pointer doesn't give that layout, but a C99 flexible array member does.
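A minimal sketch of that variant, assuming a C99 compiler (the fam_matrix_t name and allocator are invented for illustration):

#include <stdio.h>
#include <stdlib.h>

typedef struct {
  size_t rows;
  size_t columns;
  double entries[];   /* flexible array member: the entries follow the header */
} fam_matrix_t;

fam_matrix_t *
fam_matrix_alloc(size_t rows, size_t cols) {
  fam_matrix_t *A = calloc(1, sizeof(fam_matrix_t) + rows * cols * sizeof(double));

  A->rows = rows;
  A->columns = cols;

  return A;   /* a single allocation; release it with free(A) */
}

int
main(int argc, char **argv) {
  fam_matrix_t *A = fam_matrix_alloc(4, 3);

  A->entries[2 + 1 * A->columns] = 6.0;   /* row 1, column 2, row major order */
  printf("%9.3f\n", A->entries[2 + 1 * A->columns]);
  free(A);

  return(0);
}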

Fortran

Fortran has a built-in matrix multiplication function, matmul, and two dimensional arrays can be declared with dimensions chosen at run time.

subroutine print_matrix(m, n, A)
  integer, intent(in) :: m, n
  real, intent(in) :: A(m, n)
  integer :: i, j

  do i = 1, m
     do j = 1, n
        write(*, "(F9.3) ", advance="no") A(i, j)
     enddo
     write(*, *)
  enddo
end subroutine

program hello
  real :: A(4, 3) = &
    reshape((/ 1, 4, 7, 10, 2, 5, 8, 11, 3, 6, 9, 12 /), &
    (/ 4, 3 /))
  real :: B(3, 2) = &
    reshape((/ 6, 4, 2, 5, 3, 1 /), &
    (/ 3, 2 /))
  real :: C(size(A, 1), size(B, 2))

  C = matmul(A, B)
  call print_matrix(size(C, 1), size(C, 2), C)
end program hello

It isn't necessary to pass the dimensions of the matrices to matmul, and a print_matrix subroutine can likewise be written without dimension arguments by declaring its argument as an assumed-shape array and calling size on it, provided the caller has an explicit interface (for example, by placing the subroutine in a module). It is a compilation error, at least under some circumstances, to call matmul(A, B) when the number of columns of A doesn't match the number of rows of B.

Other Languages

There are a host of languages in which it is more convenient to do matrix multiplication than either C or Fortran:

matrix × vector
Matlab:

A = [1 2 3; 4 5 6]
v = [1; 2; 3]

A * v

R:

A = matrix(c(1, 2, 3, 4, 5, 6),
  nrow=2, byrow=T)
v = c(1, 2, 3)

A %*% v

NumPy:

A = numpy.array(
  [[1, 2, 3], [4, 5, 6]])
v = numpy.array([1, 2, 3])

numpy.dot(A, v)

Mathematica:

A = {{1, 2, 3}, {4, 5, 6}}
v = {1, 2, 3}

Dot[A, v]

Pari/GP:

A = [1, 2, 3; 4, 5, 6]
v = [1, 2, 3]~

A * v

APL:

A ← 2 3 ⍴ 1 2 3 4 5 6
v ← 1 2 3

A +.× v

In Matlab, vectors are either row vectors or column vectors; that is, they are the same as 1 × n or n × 1 matrices. If we intend to perform matrix by vector multiplication, we must use a column vector; multiplying by a row vector on the right causes a run time error. Also, note that scalars are identical to 1 × 1 matrices.

R does not have syntax for a matrix literal; a one dimensional vector of data is provided to the matrix constructor, which reshapes it according to the specified number of rows or columns. By default the reshaping assumes column major order; the byrow=T parameter changes it to row major order. In R, a vector is a distinct data type from a 1 × n or an n × 1 matrix. We can multiply a vector by a matrix with the vector on either the right or the left of the matrix, without needing to transpose it. A scalar, on the other hand, is a vector of length 1.

matrix × matrix
Matlab:

A = [1 2; 3 4; 5 6]
B = [8 7 6 5; 4 3 2 1]

A * B

R:

A = matrix(c(1, 2, 3, 4, 5, 6),
  nrow=3, byrow=T)
B = matrix(
  c(8, 7, 6, 5, 4, 3, 2, 1),
  nrow=2, byrow=T)

A %*% B

NumPy:

A = numpy.array(
  [[1, 2], [3, 4], [5, 6]])
B = numpy.array(
  [[8, 7, 6, 5], [4, 3, 2, 1]])

numpy.dot(A, B)

Mathematica:

A = {{1, 2}, {3, 4}, {5, 6}}
B = {{8, 7, 6, 5}, {4, 3, 2, 1}}

Dot[A, B]

Pari/GP:

A = [1, 2; 3, 4; 5, 6]
B = [8, 7, 6, 5; 4, 3, 2, 1]

A * B

APL:

A ← 3 2 ⍴ 1 2 3 4 5 6
B ← 2 4 ⍴ 8 7 6 5 4 3 2 1

A +.× B

In NumPy, vectors are one dimensional entities which are distinct from n × 1 or 1 × n matrices.

In Mathematica, a vector is a simple list and a matrix is a nested list. Thus vectors are distinct from n × 1 or 1 × n matrices.

transpose and dot product
Matlab:

A = [1 2; 3 4; 5 6]
A'

% works for row and col. vecs:
dot([1; 1; 1], [2; 2; 2])
dot([1 1 1], [2 2 2])
dot([1 1 1], [2; 2; 2])

R:

A = matrix(c(1, 2, 3, 4, 5, 6),
  nrow=3, byrow=T)
t(A)

c(1, 1, 1) %*% c(2, 2, 2)

NumPy:

A = numpy.array(
  [[1, 2], [3, 4], [5, 6]])
A.transpose()

numpy.dot(
  numpy.array([1, 1, 1]),
  numpy.array([2, 2, 2]))

Mathematica:

A = {{1, 2}, {3, 4}, {5, 6}}
Transpose[A]

Dot[{1, 1, 1}, {2, 2, 2}]

Pari/GP:

A = [1, 2; 3, 4; 5, 6]
A~

[1, 1, 1] * [2, 2, 2]~

APL:

A ← 2 3 ⍴ 1 2 3 4 5 6
⍉ A

1 1 1 +.× 2 2 2

Pari/GP has a vector type and a matrix type. Unlike the other languages described here, it has no support for arrays of three or more dimensions.

dimensions, rows, and columns
Matlab:

A = [1 2; 3 4; 5 6]

ndims(A)
size(A, 1)
size(A, 2)

R:

A = matrix(c(1, 2, 3, 4, 5, 6),
  nrow=3, byrow=T)

length(dim(A))
dim(A)[1]
dim(A)[2]

NumPy:

A = numpy.array(
  [[1, 2], [3, 4], [5, 6]])

len(A.shape)
A.shape[0]
A.shape[1]

Mathematica:

A = {{1, 2}, {3, 4}, {5, 6}}

Length[Dimensions[A]]
Dimensions[A][[1]]
Dimensions[A][[2]]

Pari/GP:

A = [1, 2; 3, 4; 5, 6]

\\ t_VEC has 1 dim;
\\ t_MAT has 2 dims:

matsize(A)[1]
matsize(A)[2]

APL:

A ← 2 3 ⍴ 1 2 3 4 5 6

⍴ ⍴ A
(⍴ A)[1]
(⍴ A)[2]

GPU

Map Reduce

Matrix multiplication can be expressed as a map reduce job.

First consider the case of a matrix with entries m_ij and a vector with entries v_j. The matrix is partitioned among the mappers, which is easy to do if each entry is represented as a triple (i, j, m_ij). Each mapper reads the entire vector, and for each matrix entry it encounters, it emits the key-value pair (i, m_ij × v_j). The reducer then sums all the m_ij × v_j values associated with a given i to get the i-th entry of the output vector.

If the vector is too large to keep in memory, the matrix should be partitioned into vertical strips. Each strip contains only triples (i, j, m_ij) where j falls within a small range, so the mapper processing a strip only needs to read the corresponding entries v_j.
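Here is a single-process sketch of the basic scheme (ignoring the striping), with the map and reduce steps written as ordinary C functions; there is no actual map reduce framework, and the type and function names are invented for illustration:

#include <stdio.h>

typedef struct { size_t i, j; double m_ij; } triple_t;  /* a matrix entry (i, j, m_ij) */
typedef struct { size_t key; double value; } pair_t;    /* an emitted pair (i, m_ij * v_j) */

/* map: turn one matrix triple into the key-value pair (i, m_ij * v_j) */
static pair_t
map_entry(triple_t t, const double *v) {
  pair_t p = { t.i, t.m_ij * v[t.j] };
  return p;
}

/* reduce: sum the values that share the key i into out[i] */
static void
reduce_pairs(const pair_t *pairs, size_t n_pairs, double *out) {
  size_t k;

  for (k = 0; k < n_pairs; ++k)
    out[pairs[k].key] += pairs[k].value;
}

int
main(int argc, char **argv) {
  /* the matrix {{1, 2, 3}, {4, 5, 6}} as triples and the vector (1, 2, 3) */
  triple_t M[] = { {0, 0, 1}, {0, 1, 2}, {0, 2, 3}, {1, 0, 4}, {1, 1, 5}, {1, 2, 6} };
  double v[] = { 1, 2, 3 };
  size_t n = sizeof(M) / sizeof(M[0]);
  pair_t pairs[sizeof(M) / sizeof(M[0])];
  double out[2] = { 0, 0 };
  size_t k;

  for (k = 0; k < n; ++k)         /* the "map" phase */
    pairs[k] = map_entry(M[k], v);
  reduce_pairs(pairs, n, out);    /* the "reduce" phase */

  printf("%9.3f %9.3f\n", out[0], out[1]);   /* prints 14.000 32.000 */

  return(0);
}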

matrix multiplication: two job method

Now consider the multiplication of two matrices with entries m_ij and n_jk.

We can have mappers create tuples (j, (M, i, m_ij)) and (j, (N, k, n_jk)) from the two matrices, where M and N are identifiers indicating which matrix the tuple came from. For each key j, the reducer pairs the tuples from the two matrices and creates tuples of the form ((i, k), m_ij × n_jk). These are fed to a second map reduce job in which the mapper is an identity no-op and the reducer sums the values m_ij × n_jk for the key (i, k) to get the value of the product matrix at row i and column k.
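A single-process sketch of the two job method on the 3 × 2 and 2 × 4 example matrices used earlier: the outer loop over the join key j plays the role of the first job's reducer, emitting a product for every (i, k) pair it can form, and the accumulation into P plays the role of the second job's summing reducer (again, no real framework is involved):

#include <stdio.h>

#define ROWS_M 3   /* i ranges over the rows of M */
#define COLS_M 2   /* j, the join key, ranges over the columns of M = rows of N */
#define COLS_N 4   /* k ranges over the columns of N */

int
main(int argc, char **argv) {
  double M[ROWS_M][COLS_M] = {{1, 2}, {3, 4}, {5, 6}};
  double N[COLS_M][COLS_N] = {{8, 7, 6, 5}, {4, 3, 2, 1}};
  double P[ROWS_M][COLS_N] = {{0}};
  size_t i, j, k;

  /* first job's reducer: for each join key j, pair every (M, i, m_ij) with
     every (N, k, n_jk) and emit ((i, k), m_ij * n_jk);
     second job's reducer: sum the emitted values for each key (i, k) */
  for (j = 0; j < COLS_M; ++j)
    for (i = 0; i < ROWS_M; ++i)
      for (k = 0; k < COLS_N; ++k)
        P[i][k] += M[i][j] * N[j][k];

  for (i = 0; i < ROWS_M; ++i) {
    for (k = 0; k < COLS_N; ++k)
      printf("%9.3f ", P[i][k]);
    printf("\n");
  }

  return(0);
}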

matrix multiplication: one job method

One Job method and the number of reducers vs size of reducer trade-off

Open MP and Open MPI

Broadcasting and Cycling

Serialization

Inner Product

Outer Product

Linear Transformation

Coordinate Transformation

Adjacency Matrix

Strassen Multiplication

Suppose that X and Y are two n × n matrices where n is a power of two. There is a subcubic algorithm for multiplying the matrices. Divide the matrices X and Y into quadrants labeled as follows:

(2)
$$ \begin{align} X = \begin{bmatrix} A & B \\ C & D \end{bmatrix} \\ Y = \begin{bmatrix} E & F \\ G & H \end{bmatrix} \end{align} $$

Define the following matrices in terms of the quadrants:

(3)
$$ \begin{align} P_1 &= A(F-H) \\ P_2 &= (A+B)H \\ P_3 &= (C+D)E \\ P_4 &= D(G-E) \\ P_5 &= (A+D)(E+H) \\ P_6 &= (B-D)(G+H) \\ P_7 &= (A-C)(E+F) \end{align} $$

One can easily verify that the following two methods for computing the product of X and Y yield the same answer:

(4)
$$ \begin{align} X\cdot Y = \begin{bmatrix} AE + BG & AF + BH \\ CE + DG & CF + DH \end{bmatrix} = \begin{bmatrix} P_5 + P_4 - P_2 + P_6 & P_1 + P_2 \\ P_3 + P_4 & P_1 + P_5 - P_3 - P_7 \end{bmatrix} \end{align} $$

The first form uses 8 multiplications of half-size matrices, but the second uses only 7, which is why the recursive algorithm has subcubic performance asymptotically.
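As a check, expanding the top-left entry of the second form recovers the top-left entry of the first; the other three entries can be verified the same way:

(5)
$$ \begin{align} P_5 + P_4 - P_2 + P_6 &= (A+D)(E+H) + D(G-E) - (A+B)H + (B-D)(G+H) \\ &= AE + AH + DE + DH + DG - DE - AH - BH + BG + BH - DG - DH \\ &= AE + BG \end{align} $$

Counting the 7 half-size multiplications and the quadratic work for the matrix additions gives the recurrence for the running time:

(6)
$$ \begin{align} T(n) = 7\,T(n/2) + O(n^2) \qquad\Rightarrow\qquad T(n) = O\left(n^{\log_2 7}\right) \approx O\left(n^{2.81}\right) \end{align} $$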

Low Rank Matrix

Suppose that A is an m × n matrix and x is an n vector.

Computing the product Ax requires mn multiplications.

However, if we can factor A = BC where B is an m × r matrix and C is an r × n matrix, then computing Ax = (BC)x as B(Cx) uses only mr + nr = r(m + n) multiplications. If r is small relative to m and n, this is much faster. However, since rank(A) = rank(BC) ≤ min(rank(B), rank(C)) ≤ r, not all matrices can be factored so that r is small. Even for a high rank matrix A, we might be able to find an approximate factorization with small r.
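As a concrete illustration, take m = n = 1000 and r = 10:

(7)
$$ \begin{align} mn = 1000 \times 1000 = 10^6 \qquad\text{vs.}\qquad r(m + n) = 10 \times 2000 = 2 \times 10^4 \end{align} $$

The factored form uses about 2% of the multiplications of the direct product.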