













Abstract. An anomaly is something that in some unspecified way stands out from its background, and the goal of anomaly testing is to determine which samples in a population stand out the most. This chapter presents a conceptual discussion of anomalousness, along with a survey of anomaly detection algorithms and approaches, primarily in the context of hyperspectral imaging. In this context, the problem can be roughly described as target detection with unknown targets. Because the targets, however tangible they may be, are unknown, the main technical challenge in anomaly testing is to characterize the background. Further, because anomalies are rare, it is the characterization not of a full probability distribution but of its periphery that most matters. One seeks not a generative but a discriminative model, an envelope of unremarkability, the outer limits of what is normal. Beyond those outer limits, hic sunt anomalias.
I. Introduction
   I-A. Anomaly testing as triage
   I-B. Anomalies drawn from a uniform distribution
      I-B1. Nonuniform distributions of anomalousness
   I-C. Anomalies as pixels in spectral imagery
      I-C1. Global and local anomaly detectors
      I-C2. Regression framework
II. Evaluation
III. Periphery
IV. Subspace
V. Kernels
   V-A. Kernel density estimation
   V-B. Feature space interpretation: the “kernel trick”
VI. Change
   VI-A. Subtraction-based approaches to anomalous change detection
   VI-B. Distribution-based approaches to anomalous change detection
   VI-C. Further comments on anomalous change detection
VII. Conclusion
Appears as: Chapter 19 in Statistical Methods for Materials Science: The Data Science of Microstructure Characterization, J. P. Simmons, C. A. Bouman, M. De Graef, and L. F. Drummy, Jr., eds. (CRC Press, 2018). ISBN 9781498738200.
Traditionally, anomalies are defined in a negative way, not by what they are but by what they are not: they are data samples that are not like the rest of the data. “There is not an unambiguous way to define an anomaly,” one review notes, and then goes on to ambiguously define it as “an observation that deviates in some way from the background clutter” [1]. Anomalies are defined “without reference to target signatures or target subspaces” and “with reference to a model of the background” [2]. Indeed, as Ashton [3] remarks, “the basis of an anomaly detection system is accurate background characterization.” This exposition will concentrate on anomaly detection in the context of imagery, with particular emphasis on hyperspectral imagery (in which each pixel encodes not the usual red, green, and blue of visible images, but a spectrum of radiances over a range of wavelengths that often includes upwards of a hundred spectral channels). Testing for anomalies is an exercise that has application in a variety of scenarios, however. Since anomalies are deviations from what is normal, particularly in situations where the nature of that deviation is not predictable or well characterized, anomaly detection has been used in a variety of fault detection contexts [4]–[8]. What we are calling anomaly testing here is essentially the same as what the machine learning community calls “novelty detection” [9]–[11] or “one-class classification” [12], [13].
A. Anomaly testing as triage
Although we may have difficulty defining anomalies, the reason we seek them is that (being rare and unlike most of the data) they are potentially interesting and possibly meaningful. We have to acknowledge that “interesting” is even harder to define than “anomalous,” but in this exposition we will maintain the distinction, partly because it separates the mystical “by definition undefined” [14] aspect of the interesting-data detection problem into two components, which correspond to the two boxes in Fig. 1. We can indeed define “anomalous,” but will leave “interesting” and “meaningful” to be domain-specific concepts. From this point of view, anomaly detection is a kind of triage. If anomalies can be defined and detected in a relatively generic way, then experts in the specific area of application can decide which anomalies are interesting or meaningful. Here the goal of anomaly detection is to reduce the quantity of incoming data to a level that can be handled by the more expensive downstream analysis. It is this later analysis that judges which of the anomalous items are in fact meaningful for the application at hand. This judgment can be very complicated and domain-specific, and can involve human acumen and intuition. What makes anomaly detection useful as a concept is that the anomaly detection module has more generic goals, and is consequently more amenable to formal mathematical analysis.
B. Anomalies drawn from a uniform distribution
Anomalies are rare, and where we expect to find them is in the far tails of the background distribution pb(x). We can express “anomalousness” as varying inversely with this density function, and can derive this expression in two distinct ways, each providing its own insight. In the first and most direct approach, we make an explicit generative model for anomalies, and say that they are samples drawn from a uniform distribution. This is a simple statement, but it is in some ways revolutionary; in contrast to conventional wisdom [1]–[3], we are defining anomalies directly, without respect to the background distribution. To distinguish these anomalies from the background, we treat the detection as a hypothesis testing problem. The null hypothesis
We remark that the first approach can also accommodate additive anomalies. Here, the numerator of the likelihood ratio is a Bayes factor, but it still evaluates to a constant independent of x:

L(x) = P(x = z + t) / P(x = z) = [ ∫ pb(x − t) u(t) dt ] / pb(x) = [ c ∫ pb(z) dz ] / pb(x) = c′′ / pb(x)
We again obtain the result that anomalousness varies inversely with pb(x), the probability density function of the background. Contours of anomalousness will be level curves of the background density functions. For a Gaussian distribution, these contours are ellipsoids of constant Ma- halanobis distance [18], with larger distances corresponding to smaller densities and greater anomalousness; we can therefore use Mahalanobis distance as a measure of anomalousness
A(x) = (x − μ)^T R^-1 (x − μ)    (4)
where μ is a vector-valued mean and R is a covariance matrix. The Mahalanobis distance is the basis of the Reed-Xiaoli (RX) detector [19]–[21]. Although RX, as originally introduced [19], refers specifically to multispectral imagery, and in fact is a local anomaly detector, the term “RX” is often used as a shorthand for Mahalanobis distance based anomaly detection.
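As a concrete sketch, Eq. (4) can be implemented in a few lines of numpy; this assumes a “global” background model in which μ and R are estimated from the image itself, and the function name `rx_anomaly` is ours, not from the references.

```python
import numpy as np

def rx_anomaly(X):
    """Mahalanobis-distance (RX) anomalousness for each row of X.

    X : (N, d) array of N pixels with d spectral channels.
    Returns an (N,) array of A(x) = (x - mu)^T R^{-1} (x - mu), Eq. (4),
    using the sample mean and covariance of X itself as the background
    model (i.e., a global RX detector).
    """
    mu = X.mean(axis=0)
    R = np.cov(X, rowvar=False)           # (d, d) sample covariance
    Rinv = np.linalg.inv(R)
    diff = X - mu                          # (N, d) deviations from the mean
    # per-row quadratic form diff_i^T Rinv diff_i
    return np.einsum('ij,jk,ik->i', diff, Rinv, diff)
```

For a Gaussian background, thresholding A(x) at a chi-squared quantile (with d degrees of freedom) gives an approximate handle on the false alarm rate.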
C. Anomalies as pixels in spectral imagery
Traditional statistical analysis treats data as a set of discrete samples that are drawn from a common distribution. Because each pixel in a hyperspectral image contains so much information (a many-channel spectrum of reflectances or radiances), one can often quite profitably treat the pixels as independent and identically distributed. It is as if the image were a “bag of pixels.” But however spectrally informative individual pixels are, they comprise an image, and the spatial structure in an image provides further leverage for characterizing the background and discovering anomalies.
Hyperspectral imagery provides a rich and irregular data set, with complex spatial and spectral structure. And the more accurately we can model this cluttered background, the better our detection performance. Simple models can be very effective, but the mismatch between simple models and the complicated nature of real data has driven research toward the development of more complex models [25].
R̂ = (1 − α)Rs + αRo (5)
where typically α ≪ 1. In the simplest case, Ro is just a multiple of the identity matrix [28], [29] (choosing the multiple so that Ro has trace equal to that of Rs ensures that α is dimensionless). An argument can be made for shrinkage against the diagonal matrix [30], [31], an approach that is generalized in the sparse matrix transform [32], [33]. Caefer et al. [34] recommended a quasi-local estimator that combines local and global covariance estimators by using local eigenvalues with global eigenvectors.

The idea of segmentation is to replace the moving window with a static segment of similar pixels that surround the pixel of interest in a more irregular way. Here the image is partitioned into distinct segments of (usually contiguous) pixels, and a pixel’s anomalousness is based on the mean and covariance of the pixels in the segment to which the pixel belongs [34], [35]. This sometimes leads to extra false alarms on the boundaries of the segments, and one way to deal with this problem is with overlapping segments [36]. The estimation of covariance in the local annulus can be corrupted by one or a few outliers² and

² Outliers and anomalies are essentially the same thing, and we make no formal distinction between them. But informally we think of anomalies as rare nuggets deserving of further analysis, while outliers are nuisance samples that contaminate the data of interest.
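The shrinkage of Eq. (5) with an identity-matrix target can be sketched directly; the function name `shrinkage_covariance` and the default value of α are our own illustrative choices, not from the references.

```python
import numpy as np

def shrinkage_covariance(X, alpha=0.05):
    """Shrink the sample covariance toward a scaled identity, per Eq. (5):
    R_hat = (1 - alpha) * R_s + alpha * R_o,
    with R_o = (trace(R_s)/d) * I so that R_o has the same trace as R_s
    and alpha is dimensionless.
    """
    Rs = np.cov(X, rowvar=False)
    d = Rs.shape[0]
    Ro = (np.trace(Rs) / d) * np.eye(d)   # identity target, trace-matched
    return (1 - alpha) * Rs + alpha * Ro
```

One practical payoff: even when the sample covariance is singular (fewer samples than spectral channels), the shrunken estimate is invertible and can be used in Eq. (4).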
Fig. 2. Three anomaly detectors with the same false alarm rate (Pfa = 0.001). The contour marks the boundary between what is normal (inside) and what is anomalous (outside). If anomalies are presumed to be uniformly distributed, then the anomaly detector with smallest volume will have the fewest missed detections.
Plotting volume (or log-volume, which is often more convenient, especially in high dimensions) against false alarm rate provides a ROC-like curve that characterizes the anomaly detector’s performance [51], [52]. For a covariance matrix, the volume is proportional to the determinant of the matrix. For the local anomaly detectors described in Section I-C1, one can still use a global covariance based on the difference between measured and estimated (e.g., by local mean) values at each pixel, and the smaller that covariance, the better the estimator. A natural choice, from a signal processing perspective, is the total variance of that difference,

∑_n (xn − x̂n)^T (xn − x̂n)    (6)

which corresponds to the trace of the covariance matrix. Smaller values of this variance imply that x̂ is closer to x, but Hasson et al. [45] point out that, in terms of target and anomaly detection performance, closer is not necessarily better. When, instead of a global covariance estimator, we use a separate covariance for each pixel, based on the local neighborhood of that pixel, the situation is more complicated. It is clear that the volumes of the individual covariances should be small, but it is not obvious how best to combine them. In Bachega et al. [53], it is argued, more on practical than theoretical grounds, that an average of the log volume is a good choice.
For anomaly detection, low false alarm rates are imperative. So the challenge is to characterize the background density in regions where the data are sparse; that is, on the periphery (or “tail”) of the distribution. Unfortunately, traditional density estimation methods, especially parametric estimators (e.g., Gaussian), are dominated by the high-density core. And it bears pointing out that “robust” estimation methods (e.g., [37], [54], [55]) achieve their robustness by paying even less attention to the periphery.

Robustness to outliers can be achieved by essentially removing the outliers from the data set. This direct approach is taken by the MCD (Minimum Covariance Determinant) of Rousseeuw et al. [37], [55]. For a data set of N samples, the idea is to take a core subset H of h < N samples and to compute the mean and covariance from just the samples in H, ignoring the rest.

                            All data samples    Subset of h samples
  Sample covariance matrix    Mahalanobis/RX          MCD
  Minimum volume ellipsoid        MVEE               MVEE-h

Fig. 3. Four algorithms for estimating ellipsoidal contours, arranged along two axes: robustness to outliers increases from left to right, and attention to the periphery increases from top to bottom. All four algorithms seek ellipsoidal contours for the data, and all four can be expressed with an equation of the form A(x) = (x − μ)^T R^-1 (x − μ). The top two algorithms use the sample mean and sample covariance to estimate μ and R, respectively; the bottom two seek a minimum volume ellipsoid that strictly encloses the data. The left two algorithms use all of the data in the training set; the right two use a subset H that includes almost all of the data. Note that the MVEE-h algorithm [41] is both robust to outliers and sensitive to data on the periphery of the distribution.

The formal aim is to choose the subset so as to minimize the volume of the ellipsoid corresponding to the sample covariance. Specifically,
min_H det(R)   where   R = (1/h) ∑_{xn ∈ H} (xn − μ)(xn − μ)^T,
               μ = (1/h) ∑_{xn ∈ H} xn,
               and #{H} ≥ h    (7)
As stated, this is an NP-hard problem, but an iterative approach can be employed to find an approximate optimum. Given an initial set of core samples H, we can compute μ and R as the sample mean and covariance of the core set. With this μ and R, we can use Eq. (4) to compute A(x) for all of the samples. Taking the h samples with smallest A(x) values yields a new core set H′. This process can be iterated, and is guaranteed to converge, though it is not guaranteed to converge to the global optimum defined in Eq. (7). Various tricks can be used both to speed up the iterations and to achieve lower minima [55].

Where MCD concentrates on identifying the core, the Minimum Volume Enclosing Ellipsoid (MVEE) algorithm concentrates on the periphery of the data. In contrast to Eq. (7), the aim is to optimize

min_{μ,R} det(R)   where   (xn − μ)^T R^-1 (xn − μ) ≤ 1 for all n    (8)
Unlike the optimization in Eq. (7), the optimization here is convex and can be efficiently performed, using Khachiyan’s algorithm [56], possibly including some of the further improvements that have since been suggested [57], [58]. Although robustness against outliers and sensitivity to the periphery are seemingly opposite requirements, practical anomaly detection actually wants both. For data sets in which a very small number of samples are truly outliers (or are truly anomalies), we do not want to include these samples in our characterization of the background. But absent these outliers, we do want to identify where the tail of the background distribution is, and that requires attention to the samples on the periphery [51]. Fig. 3 illustrates this tension between attention to the periphery and robustness to outliers by showing four algorithms lined up along two axes.
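The iterative approximation to MCD described above can be sketched compactly; this is a minimal version of the “C-step” idea (random initialization, no restarts or speed-up tricks), and the function name `mcd_cstep` is ours.

```python
import numpy as np

def mcd_cstep(X, h, n_iter=50, seed=0):
    """Approximate Minimum Covariance Determinant by iterated C-steps.

    Start from a random core set H of h samples; alternately
    (1) fit mean/covariance to H, and (2) take as the new H the h
    samples with the smallest Mahalanobis distance A(x) of Eq. (4).
    det(R) is non-increasing under this update, so the loop converges,
    though only to a local optimum of Eq. (7).
    """
    rng = np.random.default_rng(seed)
    N = len(X)
    H = rng.choice(N, size=h, replace=False)
    for _ in range(n_iter):
        mu = X[H].mean(axis=0)
        R = np.cov(X[H], rowvar=False)
        diff = X - mu
        A = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(R), diff)
        H_new = np.argsort(A)[:h]          # keep the h most "normal" samples
        if set(H_new) == set(H):
            break                          # core set stable: converged
        H = H_new
    return mu, R, np.sort(H)
```

In practice one would use several random restarts (and the speed-ups of [55]) to reduce the chance of a poor local optimum.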
Another approach based on projection to a lower dimensional space was proposed by Kwon et al. [70]; here the projection operator is based on eigenvalues of a matrix that is the difference of two covariance matrices, one computed from an inner window (centered at the pixel under test in a moving window scenario) and one from an outer window (an annulus that surrounds the inner window and provides local context).
A. Kernel density estimation
Given that the aim of anomaly detection is to estimate the background distribution pb(x), one of the most straightforward estimators is the kernel density estimator, or Parzen windows [71] estimator:
pb(x) = (1/N) ∑_{n=1}^N κ(x, xn),    (10)
where the sum is over all points in the data set, and where κ is a kernel function that is integrable and is everywhere non-negative. A popular choice is the Gaussian radial basis kernel,
κ(x, xi) = (2π)^{-d/2} σ^{-d} exp( −‖x − xi‖^2 / (2σ^2) )    (11)
Eq. (11) requires the user to choose a “bandwidth” σ that characterizes, in some sense, the range of influence of each point. Since density can vary widely over a distribution, variable and data-adaptive bandwidth schemes have been proposed [72], [73]. In the limit as bandwidth goes to zero, the anomalousness at x is dominated by the κ(x, xi) associated with the xi that is closest to x. Indeed, the anomalousness in that case is equivalent to that distance. An anomaly detector based on distance to the nearest point has been proposed [74], though with an additional step that uses a graph-based approach to eliminate a small fraction (typically 5%) of the points to be used as xi. An updated variant was later proposed [75] that included normalization, subsampling, and a distance defined by the average of the distances to the third, fourth, and fifth nearest points.
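A direct implementation of anomalousness based on Eqs. (10) and (11) is straightforward; here we use the negative log of the Parzen density as the score (a monotone transform, so the ranking of pixels is unchanged), and the function name `parzen_anomaly` is ours.

```python
import numpy as np

def parzen_anomaly(X_bg, X_test, sigma=1.0):
    """Anomalousness from a Parzen-window density estimate, Eq. (10),
    with the Gaussian kernel of Eq. (11).

    X_bg   : (N, d) background samples.
    X_test : (M, d) samples to be scored.
    Returns -log pb(x) for each test sample: larger values mean lower
    estimated background density, i.e., more anomalous.
    """
    d = X_bg.shape[1]
    # (M, N) matrix of squared distances between test and background points
    d2 = ((X_test[:, None, :] - X_bg[None, :, :]) ** 2).sum(-1)
    norm = (2 * np.pi * sigma ** 2) ** (d / 2)
    p = np.exp(-d2 / (2 * sigma ** 2)).mean(axis=1) / norm
    return -np.log(p + 1e-300)    # guard against log(0) far from the data
```

The brute-force distance matrix is O(MN); for large images one would subsample the background or use a spatial index, as several of the cited variants do.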
B. Feature space interpretation: the “kernel trick”
A particularly fruitful (if initially counter-intuitive) interpretation of kernel functions is as dot products in a (usually higher-dimensional) feature space. Let φ(x) be a function that maps x to some feature space. Typically φ is nonlinear, and the map is to a feature space that is of higher dimension than x. Dot products in this feature space can be expressed as (again, typically nonlinear) functions of the values in the original data space. That is:

κ(r, s) = φ(r)^T φ(s).    (12)
The “kernel trick” is the observation that even though the function φ and the feature space are presumed to “exist” in some abstract mathematical sense, we do not actually need to use φ, as long as we have the kernel function κ. A popular choice is the Gaussian kernel
κ(r, s) = exp( −‖r − s‖^2 / (2σ^2) )    (13)
but many options are available. Polynomial kernels, for example, are of the form κ(r, s) = (c + r^T s)^d for some polynomial degree d. More general radial-basis kernels are scalar functions of the scalar value ‖r − s‖^2; functions that are more heavy-tailed than the Gaussian have been proposed for this purpose [76]. This enables us to re-derive the Parzen window detector from a different point of view. Given our data, {x1, ..., xN}, we first map to the feature space: {φ(x1), ..., φ(xN)}. In this feature space we define the centroid
μφ = (1/N) ∑_{n=1}^N φ(xn)    (14)
and we define anomalousness as distance to the centroid in this feature space.
A(x) = ‖φ(x) − μφ‖^2 = (φ(x) − μφ)^T (φ(x) − μφ) = φ(x)^T φ(x) − 2 φ(x)^T μφ + μφ^T μφ    (15)
We observe that the first term φ(x)^T φ(x) = κ(x, x) is constant for radial basis kernels, that the third term is also constant, and that
φ(x)^T μφ = (1/N) ∑_{n=1}^N φ(x)^T φ(xn) = (1/N) ∑_{n=1}^N κ(x, xn).    (16)
This leads to
A(x) = constant − (1/N) ∑_{n=1}^N κ(x, xn),    (17)

which is a negative monotonic transform of the density estimator (1/N) ∑_{n=1}^N κ(x, xn), and therefore equivalent to anomaly detection based on Parzen windows density estimation. The power of kernels in this case is that a seemingly trivial anomaly detector (Euclidean distance to the centroid of the data) in feature space maps back to a more complex data-adaptive anomaly detector in the data space. The power of this feature-space interpretation of kernels is that it enables us to derive other expressions for anomaly detection, starting with very simple models in feature space that are then mapped back to more sophisticated data-adaptive anomaly detectors in the data space. For instance, instead of Euclidean distance to the centroid μφ, consider a more periphery-respecting model that uses an adaptive center aφ that is adjusted to minimize the radius of the sphere that encloses all of the data (see Fig. 4). That is,³
min_{r,aφ} r^2   subject to: ‖φ(xn) − aφ‖^2 ≤ r^2 for all n    (18)

or more generally, that mostly encloses the data:

min_{r,aφ,ξ} r^2 + c ∑_n ξn    (19)
subject to: ‖φ(xn) − aφ‖^2 ≤ r^2 + ξn    (20)
and: ξn ≥ 0,    (21)

³ Another way of expressing the centroid μφ is as the solution to the minimization of the average squared radius: μφ = argmin_μ ∑_n ‖φ(xn) − μ‖^2; by comparison, we can say aφ is the solution to the minimization of the maximum squared radius: aφ = argmin_a max_n ‖φ(xn) − a‖^2. We can interpret Eq. (19) as the minimization of a “soft” maximum.
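To make the geometry of Eq. (18) concrete, here is a sketch of the hard-margin case in the input space (i.e., taking φ to be the identity map), using the simple Badoiu–Clarkson iteration for the minimum enclosing ball; the function name and iteration count are our own choices. The full kernelized, soft-margin problem of Eqs. (19)–(21) is a quadratic program and is usually solved with an off-the-shelf solver.

```python
import numpy as np

def min_enclosing_ball(X, n_iter=2000):
    """Approximate minimum enclosing ball of the rows of X (Eq. (18)
    with phi the identity map), via the Badoiu-Clarkson iteration:
    repeatedly step the center toward the current farthest point,
    with a decaying step size. Returns (center a, radius r).
    """
    a = X[0].astype(float)
    for i in range(1, n_iter + 1):
        far = X[np.argmax(((X - a) ** 2).sum(axis=1))]  # farthest point
        a += (far - a) / (i + 1)                        # decaying step
    r = np.sqrt(((X - a) ** 2).sum(axis=1).max())
    return a, r
```

Points outside the ball (‖x − a‖ > r) would then be declared anomalous.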
Nasrabadi [80], the pseudoinverse was taken. The effect of the pseudoinverse is to project data (in the feature space) to the in-sample data plane, but this projection can be problematic for anomaly detection [81]. Anomalies are different from the rest of the data, and this difference will be suppressed by projection back into the in-sample data plane. Indeed, another kernelization that can be effective is the kernel subspace anomaly detector [82], [83]. Here, principal components analysis is performed in the feature space, and a subspace is defined that includes the first few principal components. Anomalousness is defined in terms of the distance to this subspace.
VI. CHANGE

For the anomalous change detection problem, the aim is to find interesting differences between two images, taken of the same scene, but at different times and typically under different viewing conditions [84]. There will be some differences that are pervasive – e.g., differences due to overall calibration, contrast, illumination, look-angle, focus, spatial misregistration, atmospheric or even seasonal changes – but there may also be changes that occur in only a few pixels. These rare changes potentially indicate that something truly changed in the scene, and the idea is to use anomaly detection to find them. But our interest is in pixel pairs whose change is unusual, not so much in unusual pixels that are “similarly unusual” in both images. Informally speaking, we want to learn the “patterns” of these pervasive differences, and then the changes that do not fit the patterns are identified as anomalous.

An important precursor to anomalous change detection is the co-registration of the two images. We say that images are registered if corresponding pixels in the two images correspond to the same position in the scene. Registering imagery is a nontrivial task, yet misregistration is one of the main confounds in change detection [85]–[88]. In what follows, let x and y refer to corresponding pixels in the two images.
A. Subtraction-based approaches to anomalous change detection
The most straightforward way to look for changes in a pair of images is to subtract them, e = y − x, and then to restrict analysis to the difference image e [89]. Simple subtraction, although it has the advantage of being simple, has the disadvantage that it folds in pervasive differences along with the anomalous changes. Most anomalous change detection algorithms are based on subtracting images, but involve transforming the images to make them more similar. For instance, the chronochrome [90] seeks a linear transform of the first image to make it as similar as possible (in a least squares sense) to the second image. That is, it seeks L so that ‖y − Lx‖^2, averaged over the whole image, is minimized. To simplify notation, we will assume means have been subtracted from x and y, and define the covariance matrices X = ⟨xx^T⟩, Y = ⟨yy^T⟩, and C = ⟨yx^T⟩. The linear transform that minimizes the least squares fit of y to Lx is given by L = CX^-1. Now the subtraction that is performed is e = y − Lx, and this reduces the effect of pervasive differences on e while still “letting through” the anomalous changes. Note that there is an asymmetry in the chronochrome; by swapping the roles of x and y, and seeking L′ to minimize ‖x − L′y‖^2, one obtains a different anomalous change detector. Clifton [91] proposed a neural network version of the chronochrome, in which a nonlinear function L(x) is chosen to minimize ‖y − L(x)‖^2, with the aim of even further suppressing the pervasive differences. A more symmetrical approach, which is sometimes called covariance equalization [92], [93] or whitening/de-whitening [94], transforms the data in both images before it subtracts them:
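The chronochrome recipe above (estimate L = CX^-1, form the residual e = y − Lx, then score residuals with RX) can be sketched directly; the function name `chronochrome_residual` is ours, and the pixels are assumed registered and mean-subtracted.

```python
import numpy as np

def chronochrome_residual(X, Y):
    """Chronochrome anomalous change detection (sketch).

    X, Y : (N, d) arrays of corresponding (registered, mean-subtracted)
    pixels from the two images. Computes L = C X^{-1} with C = <y x^T>,
    forms residuals e = y - L x, and scores each pixel by the
    Mahalanobis (RX) anomalousness of its residual.
    """
    N = len(X)
    Cx = X.T @ X / N              # X = <x x^T>
    C = Y.T @ X / N               # C = <y x^T>
    L = C @ np.linalg.inv(Cx)     # least-squares fit of y to L x
    E = Y - X @ L.T               # residuals e = y - L x
    Rinv = np.linalg.inv(np.cov(E, rowvar=False))
    diff = E - E.mean(axis=0)
    return np.einsum('ij,jk,ik->i', diff, Rinv, diff)
```

Swapping the arguments gives the asymmetric counterpart (fit x from y), which is in general a different detector.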
Fig. 5. Four anomalous change detectors, derived from four distinct and explicit definitions of anomalous change. Here x is represented by the horizontal axis and y is the vertical axis. The left panels show the non-anomalous data (sampled from a correlated Gaussian distribution) and the boundaries outside of which are the points that the detectors consider anomalous changes. The panels to the right include samples drawn from pa(x, y), the model for the distribution of anomalous changes. (a,b) RX detector is obtained from “straight” anomaly detection; (c,d) Chronochrome detector optimized for x → y changes; (e,f) Chronochrome optimized for y → x changes; and (g,h) hyperbolic anomalous change detector.
Another choice for pa(x, y) has also been suggested [99]. Here, pa(x, y) = pb(x)pb(y), and x → y and y → x changes are treated equally. The informal interpretation is that unusual changes are pixel pairs (x, y) that are collectively unusual, but individually normal. That is, the x pixel value is typical for the x-image, and the y pixel value is typical for the y-image, but the (x, y) pair is unusual. If we use this model for anomalies in Eq. (26), and take a logarithm, we obtain

log L(x, y) = log pb(x) + log pb(y) − log pb(x, y)    (27)
an expression for anomalousness that looks like the negative mutual information of x and y. In the case of Gaussian pb(x, y), Eq. (27) reduces to a quadratic expression in x and y that has hyperbolic contours (see Fig. 5(g,h)). Experiments with real and simulated anomalous changes in real imagery indicated that this hyperbolic anomalous change detection (HACD) generally outperformed the subtraction-based anomaly detectors [95].

A further advantage of the distribution-based approach is that the distribution needn’t be Gaussian. Indeed, we can take a purely nonparametric view, and treat the problem of distinguishing pervasive differences from anomalous changes as a machine learning classification problem. Steinwart et al. [100] used support vector machines for just this purpose. But a simpler approach, which has also proven effective, is to consider a parametric distribution, but one slightly more general than the Gaussian. The class of elliptically-contoured (EC) distributions are, like the Gaussian, primarily parameterized by a mean vector and covariance matrix, but do not share the sharp exp(−r^2) tail of the Gaussian. Heavy-tailed EC distributions have been suggested for hyperspectral imagery in general [101], and for anomalous change detection in particular [102], [103]. Although EC distributions do not affect the “straight anomaly detector” shown in Fig. 5(a,b), they do generalize the chronochrome and hyperbolic anomalous change detectors in a way that can lead to improved performance [103]. Kernelization of the EC-based change detector has also been shown to be advantageous [104].
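For a Gaussian background, Eq. (27) reduces (up to an additive constant) to the joint Mahalanobis distance of the stacked pixel (x, y) minus the marginal Mahalanobis distances of x and of y, which is what gives HACD its hyperbolic contours. A minimal numpy sketch (the function name `hacd` is ours, with all means and covariances estimated from the data):

```python
import numpy as np

def hacd(X, Y):
    """Hyperbolic anomalous change detection under a Gaussian model.

    For Gaussian pb, Eq. (27) is (up to a constant and a factor of -1/2)
    A(x, y) = m(x, y) - m(x) - m(y), where m(.) denotes Mahalanobis
    distance with respect to the joint or marginal distribution.
    """
    def mahal(Z):
        mu = Z.mean(axis=0)
        Rinv = np.linalg.inv(np.cov(Z, rowvar=False))
        D = Z - mu
        return np.einsum('ij,jk,ik->i', D, Rinv, D)

    XY = np.hstack([X, Y])              # stacked (x, y) pixel pairs
    return mahal(XY) - mahal(X) - mahal(Y)
```

Note the symmetry: x → y and y → x changes are scored identically, unlike the chronochrome.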
C. Further comments on anomalous change detection
The description here treats pixel pairs as independent samples from an unknown distribution. But there is a lot of spatial structure in imagery, and further gains can be made by incorporating spatial aspects along with the spectral [105], [106]. The description here also considers only pairs of images; often there are more than two images, and these algorithms can be extended to that case [107], [108], though this approach may not be optimal for sequences of images (e.g., anomalous activities in video) where the order of the images in the sequence matters. Another issue that arises in remote sensing is that the anomalous targets may be subpixel in extent, which leads to a different optimization problem [109]. Anomalous change detection is a problem that is particularly well matched to remote sensing imagery, and in that context, a variety of practical issues have been discussed [110], [111]. One of the biggest of these issues is misregistration, when the images don’t exactly line up (and they never exactly line up). Although the effects of misregistration can to some extent be learned from the pervasive differences it creates in image pairs, it is still one of the main confounds to change detection [85], [86]. Gains can be made by explicitly adapting the change detection algorithm to be more robust to misregistration error [87], [88].
VII. CONCLUSION

Anomaly detection is seldom a goal in its own right. It is the first step in a search for data samples that are relevant, meaningful, or – in some sense that depends on where the data came
from and what they are being used for – interesting. The mystical “by definition undefined” aspect of anomaly detection mostly derives from the ambiguity of what one means by “interesting,” and this has led to a wide variety of ad hoc anomaly detection algorithms, justified by hand-waving arguments and validated (if at all) by anecdotal performance on imagery with a statistically inadequate number of pre-judged anomalies. By employing a framework in which anomalies are in fact well-defined, as samples drawn from some broad and flat distribution, anomaly detection algorithms can be objectively tested, and improvements can be confidently constructed. Within this framework, many of the tools that have been developed for signal processing, machine learning, and data analytics in general can be brought to bear on the detection of anomalies. These range from the venerable Gaussian distribution to kernels and subspaces (and kernelized subspaces!), and invoke the usual issues of underfitting and overfitting data.

The technical challenge of anomaly detection usually lies not with the anomalies themselves, but with characterizing what can be a complex and highly structured background. Since most anomaly detection scenarios require a low false alarm rate, the modeling effort is concentrated out on the periphery of this background. This is something of a challenge, since the data density is much lower there. The modeling, however, is discriminative, not generative, which means that the aim is not to model the distribution per se, but to find the boundary that separates the non-anomalous data from the anomalies.
REFERENCES
Optical Engineering 46 , 076402 (2007).
on nonparametric density estimation.” Proc. SPIE 8743 , 87431A (2013).