Optimality & Convergence of Conjugate Gradients: Minimizing Quadratic Functions

These study notes cover the optimality and convergence properties of the conjugate gradient (CG) algorithm, an iterative process for minimizing quadratic functions. CG is a popular method for solving large-scale symmetric positive definite linear systems and can be interpreted as an optimization algorithm. The notes show how the CG iteration minimizes the quadratic function φ(x) over the Krylov subspace K_n at each step, and how the choice of the search direction p_n ensures that a one-dimensional line minimization actually minimizes the function over all of K_n. They also discuss the analogy between the CG iteration and the Lanczos iteration, and the connection between Krylov subspace iteration and polynomials of matrices. The rate of convergence of the CG iteration is determined by the location of the spectrum of A, and two theorems describe the convergence behavior for matrices with a small number of distinct eigenvalues or with large 2-norm condition numbers.

CSE 275 Matrix Computation
Ming-Hsuan Yang
Electrical Engineering and Computer Science
University of California at Merced
Merced, CA 95344
http://faculty.ucmerced.edu/mhyang
Lecture 21

Overview

Conjugate gradient
Convergence rate of conjugate gradient
Preconditioning

Optimality of conjugate gradients (cont’d)

It implies that ‖e‖_A^2 = e_n^T A e_n + (Δx)^T A (Δx)
Only the second term depends on Δx, and since A is positive definite, that term is greater than or equal to 0
The second term is 0 if and only if Δx = 0, i.e., x = x_n
Thus ‖e‖_A is minimal if and only if x = x_n, as claimed
The monotonicity property is a consequence of the inclusion K_n ⊆ K_{n+1}
Since K_n is a subspace of IR^m of dimension n as long as convergence has not yet been achieved, convergence must be achieved in at most m steps
That is, each step of the conjugate direction method cuts down the error term component by component
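A minimal numerical check of these two properties (monotone decrease of ‖e_n‖_A, and convergence within m steps) on a small random symmetric positive definite matrix; the helper function, sizes, and seed below are my own illustration, not part of the lecture:

```python
import numpy as np

def cg(A, b, n_steps):
    """Plain conjugate gradients, recording every iterate x_0, ..., x_n."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    iterates = [x.copy()]
    for _ in range(n_steps):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
        iterates.append(x.copy())
    return iterates

rng = np.random.default_rng(0)
m = 8
B = rng.standard_normal((m, m))
A = B @ B.T + m * np.eye(m)              # symmetric positive definite
b = rng.standard_normal(m)
x_star = np.linalg.solve(A, b)

for n, x_n in enumerate(cg(A, b, m)):
    e_n = x_star - x_n
    print(n, np.sqrt(e_n @ A @ e_n))     # A-norm of the error: decreases monotonically
```

In exact arithmetic the final A-norm would be exactly zero; in floating point it only falls to roundoff level, which anticipates the caveat on the next slide.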

Optimality of conjugate gradients (cont’d)

The guarantee that the CG iteration converges in at most m steps is void in floating point arithmetic
For arbitrary matrices A on a real computer, no decisive reduction in ‖e_n‖_A will necessarily be observed at all when n = m
In practice, however, CG is used not for arbitrary matrices but for matrices whose spectra are well behaved (partially due to preconditioning), so that convergence to a desired accuracy is achieved for n ≪ m
The theoretical exact convergence at n = m has no relevance to this use of the CG iteration in scientific computing

Conjugate gradients as an optimization algorithm (cont’d)

We cannot use ‖e‖_A or ‖e‖_A^2, as neither can be evaluated without knowing x_*
On the other hand, given A, b, and x ∈ IR^m, the quantity

φ(x) = (1/2) x^T A x − x^T b

can certainly be evaluated. In fact,

‖e_n‖_A^2 = e_n^T A e_n = (x_* − x_n)^T A (x_* − x_n)
          = x_n^T A x_n − 2 x_n^T A x_* + x_*^T A x_*
          = x_n^T A x_n − 2 x_n^T b + x_*^T b
          = 2 φ(x_n) + constant

Like ‖e‖_A^2, φ must achieve its minimum uniquely at x = x_*
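A quick check of the identity ‖e_n‖_A^2 = 2φ(x_n) + constant, where the constant is x_*^T b; the random matrix and candidate points are my own illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 6
B = rng.standard_normal((m, m))
A = B @ B.T + m * np.eye(m)        # symmetric positive definite
b = rng.standard_normal(m)
x_star = np.linalg.solve(A, b)

phi = lambda x: 0.5 * x @ A @ x - x @ b

# ||e_n||_A^2 differs from 2*phi(x_n) only by the constant x_*^T b,
# so both quantities are minimized at exactly the same point x_n = x_*.
for _ in range(3):
    x_n = rng.standard_normal(m)   # an arbitrary candidate iterate
    e_n = x_star - x_n
    print(e_n @ A @ e_n, 2 * phi(x_n) + x_star @ b)   # agree to rounding error
```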

Conjugate gradients as an optimization algorithm (cont’d)

The CG iteration can be interpreted as an iterative process for minimizing the quadratic function φ(x) of x ∈ IR^m
At each step, an iterate x_n = x_{n−1} + α_n p_{n−1} is computed that minimizes φ(x) over all x in the one-dimensional space x_{n−1} + ⟨p_{n−1}⟩
It can be readily confirmed that the formula

α_n = (r_{n−1}^T r_{n−1}) / (p_{n−1}^T A p_{n−1})

ensures that α_n is optimal among all step lengths α
What makes the CG iteration remarkable is the choice of the search direction p_{n−1}, which has the special property that minimizing φ(x) over x_{n−1} + ⟨p_{n−1}⟩ actually minimizes it over all of K_n
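A sketch confirming that the formula above gives the exact one-dimensional minimizer of φ along the search direction (the two expressions agree because p_{n−1}^T r_{n−1} = r_{n−1}^T r_{n−1}), and that φ(x_n) indeed decreases; the test problem is my own:

```python
import numpy as np

rng = np.random.default_rng(2)
m = 10
B = rng.standard_normal((m, m))
A = B @ B.T + m * np.eye(m)        # symmetric positive definite
b = rng.standard_normal(m)

phi = lambda x: 0.5 * x @ A @ x - x @ b

x = np.zeros(m)
r = b - A @ x
p = r.copy()
for n in range(1, 6):
    Ap = A @ p
    alpha_cg = (r @ r) / (p @ Ap)      # the CG formula for alpha_n
    alpha_line = (p @ r) / (p @ Ap)    # exact minimizer of phi along x + alpha*p
    x = x + alpha_cg * p
    r_new = r - alpha_cg * Ap
    p = r_new + ((r_new @ r_new) / (r @ r)) * p
    r = r_new
    # alpha_cg == alpha_line since p_{n-1}^T r_{n-1} = r_{n-1}^T r_{n-1};
    # phi(x_n) decreases monotonically.
    print(n, alpha_cg, alpha_line, phi(x))
```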

Conjugate gradients and polynomial approximation

Connection between Krylov subspace iteration and polynomials of matrices
The Arnoldi and Lanczos iterations solve the Arnoldi/Lanczos approximation problem:
Find p^n ∈ P^n such that ‖p^n(A)b‖ = minimum
The GMRES iteration solves the GMRES approximation problem:
Find p_n ∈ P_n such that ‖p_n(A)b‖ = minimum
For CG, the appropriate approximation problem involves the A-norm of the error:
Find p_n ∈ P_n such that ‖p_n(A)e_0‖_A = minimum    (1)
where e_0 denotes the initial error, e_0 = x_* − x_0 = x_*, and P_n is again defined as in GMRES (i.e., the set of polynomials p of degree ≤ n with p(0) = 1)

Rate of CG convergence

Theorem

If the CG iteration has not already converged before step n (i.e., r_{n−1} ≠ 0), then the approximation problem (1) has a unique solution p_n ∈ P_n, and the iterate x_n has error e_n = p_n(A)e_0 for this same polynomial p_n. Consequently, we have

‖e_n‖_A / ‖e_0‖_A = inf_{p ∈ P_n} ‖p(A)e_0‖_A / ‖e_0‖_A ≤ inf_{p ∈ P_n} max_{λ ∈ Λ(A)} |p(λ)|

where Λ(A) denotes the spectrum of A

From the theorem in the last lecture, it follows that e_n = p(A)e_0 for some p ∈ P_n
The equality above is a consequence of this and of monotonic convergence
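A numerical illustration of the claim e_n = p_n(A)e_0 with p_n(0) = 1: equivalently, e_n − e_0 must lie in span{A e_0, …, A^n e_0}, which can be checked by least squares. The matrix, sizes, and check below are my own sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
m = 12
B = rng.standard_normal((m, m))
A = B @ B.T + m * np.eye(m)            # symmetric positive definite
b = rng.standard_normal(m)
x_star = np.linalg.solve(A, b)

# Run a few CG steps from x_0 = 0, keeping the errors e_n = x_* - x_n (so e_0 = x_*).
x = np.zeros(m); r = b.copy(); p = r.copy()
errors = [x_star - x]
for _ in range(4):
    Ap = A @ p
    alpha = (r @ r) / (p @ Ap)
    x = x + alpha * p
    r_new = r - alpha * Ap
    p = r_new + ((r_new @ r_new) / (r @ r)) * p
    r = r_new
    errors.append(x_star - x)

# If e_n = p_n(A) e_0 with p_n(0) = 1, then e_n - e_0 must lie in
# span{A e_0, A^2 e_0, ..., A^n e_0}.  Check that by least squares.
e0 = errors[0]
for n, en in enumerate(errors[1:], start=1):
    basis = np.column_stack([np.linalg.matrix_power(A, j) @ e0 for j in range(1, n + 1)])
    c, *_ = np.linalg.lstsq(basis, en - e0, rcond=None)
    print(n, np.linalg.norm(basis @ c - (en - e0)))   # essentially zero
```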

Rate of CG convergence (cont’d)

First, we suppose that the eigenvalues are perfectly clustered but assume nothing about the locations of these clusters

Theorem

If A has only n distinct eigenvalues, then the CG iteration converges in at most n steps

This is a corollary of (1), since a polynomial p(x) = ∏_{j=1}^{n} (1 − x/λ_j) ∈ P_n exists that is zero at any specified set of n points {λ_j} (see the sketch below)
At the other extreme, suppose we know nothing about any clustering of the eigenvalues but only that their distances from the origin vary by at most a factor κ ≥ 1
In other words, suppose we know only the 2-norm condition number κ = λ_max/λ_min, where λ_max and λ_min are the extreme eigenvalues of A
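As promised above, a minimal check of the distinct-eigenvalue corollary: build an SPD matrix with only three distinct eigenvalues and watch the A-norm of the error hit roundoff level by step 3. The construction, sizes, and seed are my own illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
m = 50
# An SPD matrix with only 3 distinct eigenvalues: 1, 4 and 9.
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
A = Q @ np.diag(np.repeat([1.0, 4.0, 9.0], [20, 15, 15])) @ Q.T
b = rng.standard_normal(m)
x_star = np.linalg.solve(A, b)

x = np.zeros(m); r = b.copy(); p = r.copy()
for n in range(1, 6):
    Ap = A @ p
    alpha = (r @ r) / (p @ Ap)
    x = x + alpha * p
    r_new = r - alpha * Ap
    p = r_new + ((r_new @ r_new) / (r @ r)) * p
    r = r_new
    e = x_star - x
    print(n, np.sqrt(e @ A @ e))   # the A-norm error drops to roundoff level by step 3
```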

Rate of CG convergence (cont’d)

Theorem

Let the CG iteration be applied to a symmetric positive definite matrix problem Ax = b, where A has 2-norm condition number κ. Then the A-norms of the errors satisfy

‖e_n‖_A / ‖e_0‖_A ≤ 2 / [ ((√κ + 1)/(√κ − 1))^n + ((√κ + 1)/(√κ − 1))^{−n} ] ≤ 2 ((√κ − 1)/(√κ + 1))^n

Since (√κ − 1)/(√κ + 1) ≈ 1 − 2/√κ as κ → ∞, this implies that if κ is large but not too large, convergence to a specified tolerance can be expected in O(√κ) iterations
This is only an upper bound, and convergence may be faster for special right-hand sides or if the spectrum is clustered
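A sketch comparing the observed ratio ‖e_n‖_A/‖e_0‖_A against the bound 2((√κ − 1)/(√κ + 1))^n for a matrix with a prescribed spectrum; the eigenvalue distribution, sizes, and seed are my own choices:

```python
import numpy as np

rng = np.random.default_rng(5)
m = 200
kappa = 400.0
# An SPD matrix with eigenvalues spread over [1, kappa], so its 2-norm condition number is kappa.
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
A = Q @ np.diag(np.linspace(1.0, kappa, m)) @ Q.T
b = rng.standard_normal(m)
x_star = np.linalg.solve(A, b)
a_norm = lambda e: np.sqrt(e @ A @ e)

x = np.zeros(m); r = b.copy(); p = r.copy()
e0 = a_norm(x_star - x)
rho = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
for n in range(1, 41):
    Ap = A @ p
    alpha = (r @ r) / (p @ Ap)
    x = x + alpha * p
    r_new = r - alpha * Ap
    p = r_new + ((r_new @ r_new) / (r @ r)) * p
    r = r_new
    if n % 10 == 0:
        # observed error ratio vs. the theorem's bound 2*rho^n
        print(n, a_norm(x_star - x) / e0, 2 * rho**n)
```

With the eigenvalues spread uniformly over [1, κ], the observed ratio stays below (and roughly tracks) the bound; a clustered spectrum would beat it by a wide margin.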

Example: CG convergence (cont’d)

For τ = 0.01, A has 3,092 nonzero entries and κ ≈ 1.06, and CG convergence takes place in 9 steps
For τ = 0.05, A has 13,062 nonzero entries with κ ≈ 1.83, and convergence takes place in 19 steps
For τ = 0.1, A has 25,526 nonzero entries with κ ≈ 10.3, and the process converges in 20 steps
For τ = 0.2, with 50,834 nonzero entries, there is no convergence at all
For this example, CG beats Cholesky factorization by a factor of about 700 in terms of operation counts

Preconditioning

The convergence of a matrix iteration depends on the properties of the matrix: the eigenvalues, the singular values, or sometimes other information
In many cases, the problem of interest can be transformed so that the properties of the matrix are improved drastically
The process of preconditioning is essential to most successful applications of iterative methods

Preconditioning for Ax = b (cont’d)

Two extreme cases:
If M = A, then (4) is the same as (2), and nothing has been gained
If M = I, then (3) is the same as (2), and the preconditioner is trivial
Between these two extremes lie the useful preconditioners:
structured enough that (4) can be solved quickly
but close enough to A in some sense that an iteration for (3) converges more quickly than an iteration for (2)
What does it mean for M to be "close enough" to A?
If the eigenvalues of M^{−1}A are close to 1 and ‖M^{−1}A − I‖_2 is small, then any of the iterations we have discussed can be expected to converge quickly
However, preconditioners that do not satisfy such a strong condition may also perform well
A simple rule of thumb: a preconditioner M is good if M^{−1}A is not too far from normal and its eigenvalues are clustered
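Since the numbered equations refer to an earlier slide, here is a generic sketch of preconditioned CG with the simplest choice M = diag(A) (Jacobi preconditioning); the function, test matrix, tolerance, and iteration cap are my own assumptions:

```python
import numpy as np

def pcg(A, b, M_inv, tol=1e-10, max_iter=2000):
    """Preconditioned CG; the preconditioner is supplied as a function M_inv(r) ~ M^{-1} r."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv(r)
    p = z.copy()
    rz = r @ z
    for n in range(1, max_iter + 1):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            return x, n
        z = M_inv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_iter

rng = np.random.default_rng(6)
m = 100
# An SPD matrix with a badly scaled diagonal, so the Jacobi preconditioner M = diag(A) helps.
D = np.diag(10.0 ** rng.uniform(0.0, 1.5, m))
B = rng.standard_normal((m, m))
A = D @ (B @ B.T / m + np.eye(m)) @ D
b = rng.standard_normal(m)

_, n_plain = pcg(A, b, M_inv=lambda r: r)                 # M = I: no preconditioning
_, n_jacobi = pcg(A, b, M_inv=lambda r: r / np.diag(A))   # M = diag(A)
print(n_plain, n_jacobi)   # the Jacobi-preconditioned run needs far fewer iterations
```

On this deliberately badly scaled example, the Jacobi-preconditioned run typically converges in a few dozen iterations, while plain CG needs hundreds or may even hit the iteration cap.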

Left, right and Hermitian preconditioners

What we have described may be more precisely termed a left preconditioner
Another idea is to transform Ax = b into AM^{−1}y = b with x = M^{−1}y, in which case M is called a right preconditioner
If A is Hermitian positive definite, then it is usual to preserve this property in preconditioning
Suppose M is also Hermitian positive definite, with M = CC^* for some C; then (2) is equivalent to

[C^{−1} A C^{−*}] C^* x = C^{−1} b

The matrix in brackets is Hermitian positive definite, so this equation can be solved by conjugate gradients or related iterations
At the same time, since C^{−1} A C^{−*} is similar to C^{−*} C^{−1} A = M^{−1} A, it is enough to examine the eigenvalues of the non-Hermitian matrix M^{−1} A to investigate convergence
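A small check of this similarity in the real symmetric case (so the conjugate transpose reduces to the transpose), using a Cholesky factor C of M; the random matrices are my own example:

```python
import numpy as np

rng = np.random.default_rng(7)
m = 8
B = rng.standard_normal((m, m))
A = B @ B.T + m * np.eye(m)          # symmetric positive definite A
Bm = rng.standard_normal((m, m))
M = Bm @ Bm.T + m * np.eye(m)        # a symmetric positive definite preconditioner M

C = np.linalg.cholesky(M)            # M = C C^T  (real case, so C^* = C^T)
C_inv = np.linalg.inv(C)

sym_form = C_inv @ A @ C_inv.T       # C^{-1} A C^{-T}: symmetric positive definite
nonsym_form = np.linalg.solve(M, A)  # M^{-1} A: not symmetric, but similar to sym_form

print(np.linalg.eigvalsh(sym_form))                  # eigenvalues, ascending
print(np.sort(np.linalg.eigvals(nonsym_form).real))  # the same spectrum, up to roundoff
```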