













Abstract. An anomaly is something that in some unspecified way stands out from its background, and the goal of anomaly testing is to determine which samples in a population stand out the most. This chapter presents a conceptual discussion of anomalousness, along with a survey of anomaly detection algorithms and approaches, primarily in the context of hyperspectral imaging. In this context, the problem can be roughly described as target detection with unknown targets. Because the targets, however tangible they may be, are unknown, the main technical challenge in anomaly testing is to characterize the background. Further, because anomalies are rare, it is the characterization not of a full probability distribution but of its periphery that most matters. One seeks not a generative but a discriminative model, an envelope of unremarkability, the outer limits of what is normal. Beyond those outer limits, hic sunt anomalias.
I. Introduction
   I-A. Anomaly testing as triage
   I-B. Anomalies drawn from a uniform distribution
      I-B1. Nonuniform distributions of anomalousness
   I-C. Anomalies as pixels in spectral imagery
      I-C1. Global and local anomaly detectors
      I-C2. Regression framework
II. Evaluation
III. Periphery
IV. Subspace
V. Kernels
   V-A. Kernel density estimation
   V-B. Feature space interpretation: the “kernel trick”
VI. Change
   VI-A. Subtraction-based approaches to anomalous change detection
   VI-B. Distribution-based approaches to anomalous change detection
   VI-C. Further comments on anomalous change detection
VII. Conclusion
Appears as: Chapter 19 in Statistical Methods for Materials Science: The Data Science of Microstructure Characterization, J. P. Simmons, C. A. Bouman, M. De Graef, and L. F. Drummy, Jr., eds. (CRC Press, 2018). ISBN 9781498738200.
Traditionally, anomalies are defined in a negative way, not by what they are but by what they are not: they are data samples that are not like the rest of the data. “There is not an unambiguous way to define an anomaly,” one review notes, and then goes on to ambiguously define it as “an observation that deviates in some way from the background clutter” [1]. Anomalies are defined “without reference to target signatures or target subspaces” and “with reference to a model of the background” [2]. Indeed, as Ashton [3] remarks, “the basis of an anomaly detection system is accurate background characterization.” This exposition will concentrate on anomaly detection in the context of imagery, with particular emphasis on hyperspectral imagery (in which each pixel encodes not the usual red, green, and blue of visible images, but a spectrum of radiances over a range of wavelengths that often includes upwards of a hundred spectral channels). Testing for anomalies is an exercise that has application in a variety of scenarios, however. Since anomalies are deviations from what is normal, particularly in situations where the nature of that deviation is not predictable or well characterized, anomaly detection has been used in a variety of fault detection contexts [4]–[8]. What we are calling anomaly testing here is essentially the same as what the machine learning community calls “novelty detection” [9]–[11] or “one-class classification” [12], [13].
A. Anomaly testing as triage
Although we may have difficulty defining anomalies, the reason we seek them is that (being rare and unlike most of the data) they are potentially interesting and possibly meaningful. We have to acknowledge that “interesting” is even harder to define than “anomalous,” but in this exposition we will maintain the distinction, partly because it separates the mystical “by definition undefined” [14] aspect of the interesting-data detection problem into two components, which correspond to the two boxes in Fig. 1. We can indeed define “anomalous,” but will leave “interesting” and “meaningful” to be domain-specific concepts. From this point of view, anomaly detection is a kind of triage. If anomalies can be defined and detected in a relatively generic way, then experts in the specific area of application can decide which anomalies are interesting or meaningful. Here the goal of anomaly detection is to reduce the quantity of incoming data to a level that can be handled by the more expensive downstream analysis. It is this later analysis that judges which of the anomalous items are in fact meaningful for the application at hand. This judgment can be very complicated and domain-specific, and can involve human acumen and intuition. What makes anomaly detection useful as a concept is that the anomaly detection module has more generic goals, and is consequently more amenable to formal mathematical analysis.
B. Anomalies drawn from a uniform distribution
Anomalies are rare, and where we expect to find them is in the far tails of the background distribution pb(x). We can express “anomalousness” as varying inversely with this density function, and can derive this expression in two distinct ways, each providing its own insight. In the first and most direct approach, we make an explicit generative model for anomalies, and say that they are samples drawn from a uniform distribution. This is a simple statement, but it is in some ways revolutionary; in contrast to conventional wisdom [1]–[3], we are defining anomalies directly, without respect to the background distribution. To distinguish these anomalies from the background, we treat the detection as a hypothesis testing problem. The null hypothesis
We remark that the first approach can also accommodate additive anomalies. Here, the numerator of the likelihood ratio is a Bayes factor, but it still evaluates to a constant independent of x:

L(x) = P(x = z + t) / P(x = z) = [ ∫ pb(x − t) u(t) dt ] / pb(x) = [ c ∫ pb(z) dz ] / pb(x) = c′′ / pb(x)
We again obtain the result that anomalousness varies inversely with pb(x), the probability density function of the background. Contours of anomalousness will be level curves of the background density functions. For a Gaussian distribution, these contours are ellipsoids of constant Ma- halanobis distance [18], with larger distances corresponding to smaller densities and greater anomalousness; we can therefore use Mahalanobis distance as a measure of anomalousness
A(x) = (x − μ)^T R^-1 (x − μ)    (4)
where μ is a vector-valued mean and R is a covariance matrix. The Mahalanobis distance is the basis of the Reed-Xiaoli (RX) detector [19]–[21]. Although RX, as originally introduced [19], refers specifically to multispectral imagery, and in fact is a local anomaly detector, the term “RX” is often used as a shorthand for Mahalanobis distance based anomaly detection.
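As a concrete sketch, Eq. (4) can be implemented in a few lines of numpy; this assumes a “global” background model in which μ and R are estimated from the image itself, and the function name `rx_anomaly` is ours, not from the references.

```python
import numpy as np

def rx_anomaly(X):
    """Mahalanobis-distance (RX) anomalousness for each row of X.

    X : (N, d) array of N pixels with d spectral channels.
    Returns an (N,) array of A(x) = (x - mu)^T R^{-1} (x - mu), Eq. (4),
    using the sample mean and covariance of X itself as the background
    model (i.e., a global RX detector).
    """
    mu = X.mean(axis=0)
    R = np.cov(X, rowvar=False)           # (d, d) sample covariance
    Rinv = np.linalg.inv(R)
    diff = X - mu                          # (N, d) deviations from the mean
    # per-row quadratic form diff_i^T Rinv diff_i
    return np.einsum('ij,jk,ik->i', diff, Rinv, diff)
```

For a Gaussian background, thresholding A(x) at a chi-squared quantile (with d degrees of freedom) gives an approximate handle on the false alarm rate.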
C. Anomalies as pixels in spectral imagery
Traditional statistical analysis treats data as a set of discrete samples that are drawn from a common distribution. Because each pixel in a hyperspectral image contains so much information (a many-channel spectrum of reflectances or radiances), one can often quite profitably treat the pixels as independent and identically distributed. It is as if the image were a “bag of pixels.” But however spectrally informative individual pixels are, they comprise an image, and the spatial structure in an image provides further leverage for characterizing the background and discovering anomalies.
Hyperspectral imagery provides a rich and irregular data set, with complex spatial and spectral structure. And the more accurately we can model this cluttered background, the better our detection performance. Simple models can be very effective, but the mismatch between simple models and the complicated nature of real data has driven research toward the development of more complex models [25].
R̂ = (1 − α)Rs + αRo (5)
where typically α ≪ 1. In the simplest case, Ro is just a multiple of the identity matrix [28], [29] (choosing the multiple so that Ro has trace equal to that of Rs ensures that α is dimensionless). An argument can be made for shrinkage against the diagonal matrix [30], [31], an approach that is generalized in the sparse matrix transform [32], [33]. Caefer et al. [34] recommended a quasi-local estimator that combines local and global covariance estimators by using local eigenvalues with global eigenvectors.

The idea of segmentation is to replace the moving window with a static segment of similar pixels that surround the pixel of interest in a more irregular way. Here the image is partitioned into distinct segments of (usually contiguous) pixels, and a pixel’s anomalousness is based on the mean and covariance of the pixels in the segment to which the pixel belongs [34], [35]. This sometimes leads to extra false alarms on the boundaries of the segments, and one way to deal with this problem is with overlapping segments [36]. The estimation of covariance in the local annulus can be corrupted by one or a few outliers² and

² Outliers and anomalies are essentially the same thing, and we make no formal distinction between them. But informally we think of anomalies as rare nuggets deserving of further analysis, while outliers are nuisance samples that contaminate the data of interest.
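The shrinkage of Eq. (5) with an identity-matrix target can be sketched directly; the function name `shrinkage_covariance` and the default value of α are our own illustrative choices, not from the references.

```python
import numpy as np

def shrinkage_covariance(X, alpha=0.05):
    """Shrink the sample covariance toward a scaled identity, per Eq. (5):
    R_hat = (1 - alpha) * R_s + alpha * R_o,
    with R_o = (trace(R_s)/d) * I so that R_o has the same trace as R_s
    and alpha is dimensionless.
    """
    Rs = np.cov(X, rowvar=False)
    d = Rs.shape[0]
    Ro = (np.trace(Rs) / d) * np.eye(d)   # identity target, trace-matched
    return (1 - alpha) * Rs + alpha * Ro
```

One practical payoff: even when the sample covariance is singular (fewer samples than spectral channels), the shrunken estimate is invertible and can be used in Eq. (4).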
Fig. 2. Three anomaly detectors with the same false alarm rate (Pfa = 0.001). The contour marks the boundary between what is normal (inside) and what is anomalous (outside). If anomalies are presumed to be uniformly distributed, then the anomaly detector with smallest volume will have the fewest missed detections.
Plotting volume (or log-volume, which is often more convenient, especially in high dimensions) against false alarm rate provides a ROC-like curve that characterizes the anomaly detector’s performance [51], [52]. For a covariance matrix, the volume is proportional to the determinant of the matrix. For the local anomaly detectors described in Section I-C1, one can still use a global covariance based on the difference between measured and estimated (e.g., by local mean) values at each pixel, and the smaller that covariance, the better the estimator. A natural choice, from a signal processing perspective, is the total variance of that difference,

∑_n (xn − x̂n)^T (xn − x̂n)    (6)

which corresponds to the trace of the covariance matrix. Smaller values of this variance imply that x̂ is closer to x, but Hasson et al. [45] point out that, in terms of target and anomaly detection performance, closer is not necessarily better. When, instead of a global covariance estimator, we use a separate covariance for each pixel, based on the local neighborhood of that pixel, the situation is more complicated. It is clear that the volumes of the individual covariances should be small, but it is not obvious how best to combine them. In Bachega et al. [53], it is argued, more on practical than theoretical grounds, that an average of the log volume is a good choice.
For anomaly detection, low false alarm rates are imperative. So the challenge is to characterize the background density in regions where the data are sparse; that is, on the periphery (or “tail”) of the distribution. Unfortunately, traditional density estimation methods, especially parametric estimators (e.g., Gaussian), are dominated by the high-density core. And it bears pointing out that “robust” estimation methods (e.g., [37], [54], [55]) achieve their robustness by paying even less attention to the periphery.

Robustness to outliers can be achieved by essentially removing the outliers from the data set. This direct approach is taken by the MCD (Minimum Covariance Determinant) of Rousseeuw et al. [37], [55]. For a data set of N samples, the idea is to take a core subset H of h < N samples and to compute the mean and covariance from just the samples in H, ignoring the rest.

                            All data samples    Subset of h samples
  Sample covariance matrix    Mahalanobis/RX          MCD
  Minimum volume ellipsoid        MVEE               MVEE-h

Fig. 3. Four algorithms for estimating ellipsoidal contours, arranged along two axes: robustness to outliers increases from left to right, and attention to the periphery increases from top to bottom. All four algorithms seek ellipsoidal contours for the data, and all four can be expressed with an equation of the form A(x) = (x − μ)^T R^-1 (x − μ). The top two algorithms use the sample mean and sample covariance to estimate μ and R, respectively; the bottom two seek a minimum volume ellipsoid that strictly encloses the data. The left two algorithms use all of the data in the training set; the right two use a subset H that includes almost all of the data. Note that the MVEE-h algorithm [41] is both robust to outliers and sensitive to data on the periphery of the distribution.

The formal aim is to choose the subset so as to minimize the volume of the ellipsoid corresponding to the sample covariance. Specifically,
min_H det(R)   where   R = (1/h) ∑_{xn ∈ H} (xn − μ)(xn − μ)^T,
               μ = (1/h) ∑_{xn ∈ H} xn,
               and #{H} ≥ h    (7)
As stated, this is an NP-hard problem, but an iterative approach can be employed to find an approximate optimum. Given an initial set of core samples H, we can compute μ and R as the sample mean and covariance of the core set. With this μ and R, we can use Eq. (4) to compute A(x) for all of the samples. Taking the h samples with smallest A(x) values yields a new core set H′. This process can be iterated, and is guaranteed to converge, though it is not guaranteed to converge to the global optimum defined in Eq. (7). Various tricks can be used both to speed up the iterations and to achieve lower minima [55].

Where MCD concentrates on identifying the core, the Minimum Volume Enclosing Ellipsoid (MVEE) algorithm concentrates on the periphery of the data. In contrast to Eq. (7), the aim is to optimize

min_{μ,R} det(R)   where   (xn − μ)^T R^-1 (xn − μ) ≤ 1 for all n    (8)
Unlike the optimization in Eq. (7), the optimization here is convex and can be efficiently performed, using Khachiyan’s algorithm [56], possibly including some of the further improvements that have since been suggested [57], [58]. Although robustness against outliers and sensitivity to the periphery are seemingly opposite requirements, practical anomaly detection actually wants both. For data sets in which a very small number of samples are truly outliers (or are truly anomalies), we do not want to include these samples in our characterization of the background. But absent these outliers, we do want to identify where the tail of the background distribution is, and that requires attention to the samples on the periphery [51]. Fig. 3 illustrates this tension between attention to the periphery and robustness to outliers by showing four algorithms lined up along two axes.
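The iterative approximation to MCD described above can be sketched compactly; this is a minimal version of the “C-step” idea (random initialization, no restarts or speed-up tricks), and the function name `mcd_cstep` is ours.

```python
import numpy as np

def mcd_cstep(X, h, n_iter=50, seed=0):
    """Approximate Minimum Covariance Determinant by iterated C-steps.

    Start from a random core set H of h samples; alternately
    (1) fit mean/covariance to H, and (2) take as the new H the h
    samples with the smallest Mahalanobis distance A(x) of Eq. (4).
    det(R) is non-increasing under this update, so the loop converges,
    though only to a local optimum of Eq. (7).
    """
    rng = np.random.default_rng(seed)
    N = len(X)
    H = rng.choice(N, size=h, replace=False)
    for _ in range(n_iter):
        mu = X[H].mean(axis=0)
        R = np.cov(X[H], rowvar=False)
        diff = X - mu
        A = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(R), diff)
        H_new = np.argsort(A)[:h]          # keep the h most "normal" samples
        if set(H_new) == set(H):
            break                          # core set stable: converged
        H = H_new
    return mu, R, np.sort(H)
```

In practice one would use several random restarts (and the speed-ups of [55]) to reduce the chance of a poor local optimum.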
Another approach based on projection to a lower dimensional space was proposed by Kwon et al. [70]; here the projection operator is based on eigenvalues of a matrix that is the difference of two covariance matrices, one computed from an inner window (centered at the pixel under test in a moving window scenario) and one from an outer window (an annulus that surrounds the inner window and provides local context).
A. Kernel density estimation
Given that the aim of anomaly detection is to estimate the background distribution pb(x), one of the most straightforward estimators is the kernel density estimator, or Parzen windows [71] estimator:
pb(x) = (1/N) ∑_{n=1}^N κ(x, xn),    (10)
where the sum is over all points in the data set, and where κ is a kernel function that is integrable and is everywhere non-negative. A popular choice is the Gaussian radial basis kernel,
κ(x, xi) = (2π)^{-d/2} σ^{-d} exp( −‖x − xi‖^2 / (2σ^2) )    (11)
Eq. (11) requires the user to choose a “bandwidth” σ that characterizes, in some sense, the range of influence of each point. Since density can vary widely over a distribution, variable and data-adaptive bandwidth schemes have been proposed [72], [73]. In the limit as bandwidth goes to zero, the anomalousness at x is dominated by the κ(x, xi) associated with the xi that is closest to x. Indeed, the anomalousness in that case is equivalent to that distance. An anomaly detector based on distance to the nearest point has been proposed [74], though with an additional step that uses a graph-based approach to eliminate a small fraction (typically 5%) of the points to be used as xi. An updated variant was later proposed [75] that included normalization, subsampling, and a distance defined by the average of the distances to the third, fourth, and fifth nearest points.
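A direct implementation of anomalousness based on Eqs. (10) and (11) is straightforward; here we use the negative log of the Parzen density as the score (a monotone transform, so the ranking of pixels is unchanged), and the function name `parzen_anomaly` is ours.

```python
import numpy as np

def parzen_anomaly(X_bg, X_test, sigma=1.0):
    """Anomalousness from a Parzen-window density estimate, Eq. (10),
    with the Gaussian kernel of Eq. (11).

    X_bg   : (N, d) background samples.
    X_test : (M, d) samples to be scored.
    Returns -log pb(x) for each test sample: larger values mean lower
    estimated background density, i.e., more anomalous.
    """
    d = X_bg.shape[1]
    # (M, N) matrix of squared distances between test and background points
    d2 = ((X_test[:, None, :] - X_bg[None, :, :]) ** 2).sum(-1)
    norm = (2 * np.pi * sigma ** 2) ** (d / 2)
    p = np.exp(-d2 / (2 * sigma ** 2)).mean(axis=1) / norm
    return -np.log(p + 1e-300)    # guard against log(0) far from the data
```

The brute-force distance matrix is O(MN); for large images one would subsample the background or use a spatial index, as several of the cited variants do.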
B. Feature space interpretation: the “kernel trick”
A particularly fruitful (if initially counter-intuitive) interpretation of kernel functions is as dot products in a (usually higher-dimensional) feature space. Let φ(x) be a function that maps x to some feature space. Typically φ is nonlinear, and the map is to a feature space that is of higher dimension than x. Dot products in this feature space can be expressed as (again, typically nonlinear) functions of the values in the original data space. That is:

κ(r, s) = φ(r)^T φ(s).    (12)
The “kernel trick” is the observation that even though the function φ and the feature space are presumed to “exist” in some abstract mathematical sense, we do not actually need to use φ, as long as we have the kernel function κ. A popular choice is the Gaussian kernel
κ(r, s) = exp( −‖r − s‖^2 / (2σ^2) )    (13)
but many options are available. Polynomial kernels, for example, are of the form κ(r, s) = (c + r^T s)^d for some polynomial degree d. More general radial-basis kernels are scalar functions of the scalar value ‖r − s‖^2; functions that are more heavy-tailed than the Gaussian have been proposed for this purpose [76]. This enables us to re-derive the Parzen window detector from a different point of view. Given our data, {x1, ..., xN}, we first map to the feature space: {φ(x1), ..., φ(xN)}. In this feature space we define the centroid
μφ = (1/N) ∑_{n=1}^N φ(xn)    (14)
and we define anomalousness as distance to the centroid in this feature space.
A(x) = ‖φ(x) − μφ‖^2 = (φ(x) − μφ)^T (φ(x) − μφ) = φ(x)^T φ(x) − 2 φ(x)^T μφ + μφ^T μφ    (15)
We observe that the first term φ(x)^T φ(x) = κ(x, x) is constant for radial basis kernels, that the third term is also constant, and that
φ(x)^T μφ = (1/N) ∑_{n=1}^N φ(x)^T φ(xn) = (1/N) ∑_{n=1}^N κ(x, xn).    (16)
This leads to
A(x) = constant − (1/N) ∑_{n=1}^N κ(x, xn),    (17)

which is a negative monotonic transform of the density estimator (1/N) ∑_{n=1}^N κ(x, xn), and therefore equivalent to anomaly detection based on Parzen windows density estimation. The power of kernels in this case is that a seemingly trivial anomaly detector (Euclidean distance to the centroid of the data) in feature space maps back to a more complex data-adaptive anomaly detector in the data space. The power of this feature-space interpretation of kernels is that it enables us to derive other expressions for anomaly detection, starting with very simple models in feature space that are then mapped back to more sophisticated data-adaptive anomaly detectors in the data space. For instance, instead of Euclidean distance to the centroid μφ, consider a more periphery-respecting model that uses an adaptive center aφ that is adjusted to minimize the radius of the sphere that encloses all of the data (see Fig. 4). That is,³
min_{r,aφ} r^2   subject to: ‖φ(xn) − aφ‖^2 ≤ r^2 for all n    (18)

or more generally, that mostly encloses the data:

min_{r,aφ,ξ} r^2 + c ∑_n ξn    (19)
subject to: ‖φ(xn) − aφ‖^2 ≤ r^2 + ξn    (20)
and: ξn ≥ 0,    (21)

³ Another way of expressing the centroid μφ is as the solution to the minimization of the average squared radius: μφ = argmin_μ ∑_n ‖φ(xn) − μ‖^2; by comparison, we can say aφ is the solution to the minimization of the maximum squared radius: aφ = argmin_a max_n ‖φ(xn) − a‖^2. We can interpret Eq. (19) as the minimization of a “soft” maximum.
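To make the geometry of Eq. (18) concrete, here is a sketch of the hard-margin case in the input space (i.e., taking φ to be the identity map), using the simple Badoiu–Clarkson iteration for the minimum enclosing ball; the function name and iteration count are our own choices. The full kernelized, soft-margin problem of Eqs. (19)–(21) is a quadratic program and is usually solved with an off-the-shelf solver.

```python
import numpy as np

def min_enclosing_ball(X, n_iter=2000):
    """Approximate minimum enclosing ball of the rows of X (Eq. (18)
    with phi the identity map), via the Badoiu-Clarkson iteration:
    repeatedly step the center toward the current farthest point,
    with a decaying step size. Returns (center a, radius r).
    """
    a = X[0].astype(float)
    for i in range(1, n_iter + 1):
        far = X[np.argmax(((X - a) ** 2).sum(axis=1))]  # farthest point
        a += (far - a) / (i + 1)                        # decaying step
    r = np.sqrt(((X - a) ** 2).sum(axis=1).max())
    return a, r
```

Points outside the ball (‖x − a‖ > r) would then be declared anomalous.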
Nasrabadi [80], the pseudoinverse was taken. The effect of the pseudoinverse is to project data (in the feature space) to the in-sample data plane, but this projection can be problematic for anomaly detection [81]. Anomalies are different from the rest of the data, and this difference will be suppressed by projection back into the in-sample data plane. Indeed, another kernelization that can be effective is the kernel subspace anomaly detector [82], [83]. Here, principal components analysis is performed in the feature space, and a subspace is defined that includes the first few principal components. Anomalousness is defined in terms of the distance to this subspace.
VI. CHANGE

For the anomalous change detection problem, the aim is to find interesting differences between two images, taken of the same scene, but at different times and typically under different viewing conditions [84]. There will be some differences that are pervasive – e.g., differences due to overall calibration, contrast, illumination, look-angle, focus, spatial misregistration, atmospheric or even seasonal changes – but there may also be changes that occur in only a few pixels. These rare changes potentially indicate that something truly changed in the scene, and the idea is to use anomaly detection to find them. But our interest is in pixel pairs whose change is unusual, not so much in unusual pixels that are “similarly unusual” in both images. Informally speaking, we want to learn the “patterns” of these pervasive differences, and then the changes that do not fit the patterns are identified as anomalous.

An important precursor to anomalous change detection is the co-registration of the two images. We say that images are registered if corresponding pixels in the two images correspond to the same position in the scene. Registering imagery is a nontrivial task, yet misregistration is one of the main confounds in change detection [85]–[88]. In what follows, let x and y refer to corresponding pixels in the two images.
A. Subtraction-based approaches to anomalous change detection
The most straightforward way to look for changes in a pair of images is to subtract them, e = y − x, and then to restrict analysis to the difference image e [89]. Simple subtraction, although it has the advantage of being simple, has the disadvantage that it folds in pervasive differences along with the anomalous changes. Most anomalous change detection algorithms are based on subtracting images, but involve transforming the images to make them more similar. For instance, the chronochrome [90] seeks a linear transform of the first image to make it as similar as possible (in a least squares sense) to the second image. That is, it seeks L so that ‖y − Lx‖^2, averaged over the whole image, is minimized. To simplify notation, we will assume means have been subtracted from x and y, and define the covariance matrices X = ⟨xx^T⟩, Y = ⟨yy^T⟩, and C = ⟨yx^T⟩. The linear transform that minimizes the least squares fit of y to Lx is given by L = CX^-1. Now the subtraction that is performed is e = y − Lx, and this reduces the effect of pervasive differences on e while still “letting through” the anomalous changes. Note that there is an asymmetry in the chronochrome; by swapping the roles of x and y, and seeking L′ to minimize ‖x − L′y‖^2, one obtains a different anomalous change detector. Clifton [91] proposed a neural network version of the chronochrome, in which a nonlinear function L(x) is chosen to minimize ‖y − L(x)‖^2, with the aim of even further suppressing the pervasive differences. A more symmetrical approach, which is sometimes called covariance equalization [92], [93] or whitening/de-whitening [94], transforms the data in both images before it subtracts them:
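The chronochrome recipe above (estimate L = CX^-1, form the residual e = y − Lx, then score residuals with RX) can be sketched directly; the function name `chronochrome_residual` is ours, and the pixels are assumed registered and mean-subtracted.

```python
import numpy as np

def chronochrome_residual(X, Y):
    """Chronochrome anomalous change detection (sketch).

    X, Y : (N, d) arrays of corresponding (registered, mean-subtracted)
    pixels from the two images. Computes L = C X^{-1} with C = <y x^T>,
    forms residuals e = y - L x, and scores each pixel by the
    Mahalanobis (RX) anomalousness of its residual.
    """
    N = len(X)
    Cx = X.T @ X / N              # X = <x x^T>
    C = Y.T @ X / N               # C = <y x^T>
    L = C @ np.linalg.inv(Cx)     # least-squares fit of y to L x
    E = Y - X @ L.T               # residuals e = y - L x
    Rinv = np.linalg.inv(np.cov(E, rowvar=False))
    diff = E - E.mean(axis=0)
    return np.einsum('ij,jk,ik->i', diff, Rinv, diff)
```

Swapping the arguments gives the asymmetric counterpart (fit x from y), which is in general a different detector.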
Fig. 5. Four anomalous change detectors, derived from four distinct and explicit definitions of anomalous change. Here x is represented by the horizontal axis and y is the vertical axis. The left panels show the non-anomalous data (sampled from a correlated Gaussian distribution) and the boundaries outside of which are the points that the detectors consider anomalous changes. The panels to the right include samples drawn from pa(x, y), the model for the distribution of anomalous changes. (a,b) RX detector is obtained from “straight” anomaly detection; (c,d) Chronochrome detector optimized for x → y changes; (e,f) Chronochrome optimized for y → x changes; and (g,h) hyperbolic anomalous change detector.
Another choice for pa(x, y) has also been suggested [99]. Here, pa(x, y) = pb(x)pb(y), and x → y and y → x changes are treated equally. The informal interpretation is that unusual changes are pixel pairs (x, y) that are collectively unusual, but individually normal. That is, the x pixel value is typical for the x-image, and the y pixel value is typical for the y-image, but the (x, y) pair is unusual. If we use this model for anomalies in Eq. (26), and take a logarithm, we obtain

log L(x, y) = log pb(x) + log pb(y) − log pb(x, y)    (27)
an expression for anomalousness that looks like the negative mutual information of x and y. In the case of Gaussian pb(x, y), Eq. (27) reduces to a quadratic expression in x and y that has hyperbolic contours (see Fig. 5(g,h)). Experiments with real and simulated anomalous changes in real imagery indicated that this hyperbolic anomalous change detection (HACD) generally outperformed the subtraction-based anomaly detectors [95].

A further advantage of the distribution-based approach is that the distribution needn’t be Gaussian. Indeed, we can take a purely nonparametric view, and treat the problem of distinguishing pervasive differences from anomalous changes as a machine learning classification problem. Steinwart et al. [100] used support vector machines for just this purpose. But a simpler approach, which has also proven effective, is to consider a parametric distribution, but one slightly more general than the Gaussian. The class of elliptically-contoured (EC) distributions are, like the Gaussian, primarily parameterized by a mean vector and covariance matrix, but do not share the sharp exp(−r^2) tail of the Gaussian. Heavy-tailed EC distributions have been suggested for hyperspectral imagery in general [101], and for anomalous change detection in particular [102], [103]. Although EC distributions do not affect the “straight anomaly detector” shown in Fig. 5(a,b), they do generalize the chronochrome and hyperbolic anomalous change detectors in a way that can lead to improved performance [103]. Kernelization of the EC-based change detector has also been shown to be advantageous [104].
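For a Gaussian background, Eq. (27) reduces (up to an additive constant) to the joint Mahalanobis distance of the stacked pixel (x, y) minus the marginal Mahalanobis distances of x and of y, which is what gives HACD its hyperbolic contours. A minimal numpy sketch (the function name `hacd` is ours, with all means and covariances estimated from the data):

```python
import numpy as np

def hacd(X, Y):
    """Hyperbolic anomalous change detection under a Gaussian model.

    For Gaussian pb, Eq. (27) is (up to a constant and a factor of -1/2)
    A(x, y) = m(x, y) - m(x) - m(y), where m(.) denotes Mahalanobis
    distance with respect to the joint or marginal distribution.
    """
    def mahal(Z):
        mu = Z.mean(axis=0)
        Rinv = np.linalg.inv(np.cov(Z, rowvar=False))
        D = Z - mu
        return np.einsum('ij,jk,ik->i', D, Rinv, D)

    XY = np.hstack([X, Y])              # stacked (x, y) pixel pairs
    return mahal(XY) - mahal(X) - mahal(Y)
```

Note the symmetry: x → y and y → x changes are scored identically, unlike the chronochrome.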
C. Further comments on anomalous change detection
The description here treats pixel pairs as independent samples from an unknown distribution. But there is a lot of spatial structure in imagery, and further gains can be made by incorporating spatial aspects along with the spectral [105], [106]. The description here also considers only pairs of images; often there are more than two images, and these algorithms can be extended to that case [107], [108], though this approach may not be optimal for sequences of images (e.g., anomalous activities in video) where the order of the images in the sequence matters. Another issue that arises in remote sensing is that the anomalous targets may be subpixel in extent, which leads to a different optimization problem [109]. Anomalous change detection is a problem that is particularly well matched to remote sensing imagery, and in that context, a variety of practical issues have been discussed [110], [111]. One of the biggest of these issues is misregistration, when the images don’t exactly line up (and they never exactly line up). Although the effects of misregistration can to some extent be learned from the pervasive differences it creates in image pairs, it is still one of the main confounds to change detection [85], [86]. Gains can be made by explicitly adapting the change detection algorithm to be more robust to misregistration error [87], [88].
VII. CONCLUSION

Anomaly detection is seldom a goal in its own right. It is the first step in a search for data samples that are relevant, meaningful, or – in some sense that depends on where the data came
from and what they are being used for – interesting. The mystical “by definition undefined” aspect of anomaly detection mostly derives from the ambiguity of what one means by “interesting,” and this has led to a wide variety of ad hoc anomaly detection algorithms, justified by hand-waving arguments and validated (if at all) by anecdotal performance on imagery with a statistically inadequate number of pre-judged anomalies. By employing a framework in which anomalies are in fact well-defined, as samples drawn from some broad and flat distribution, anomaly detection algorithms can be objectively tested, and improvements can be confidently constructed. Within this framework, many of the tools that have been developed for signal processing, machine learning, and data analytics in general can be brought to bear on the detection of anomalies. These range from the venerable Gaussian distribution to kernels and subspaces (and kernelized subspaces!), and invoke the usual issues of underfitting and overfitting data.

The technical challenge of anomaly detection usually lies not with the anomalies themselves, but with characterizing what can be a complex and highly structured background. Since most anomaly detection scenarios require a low false alarm rate, the modeling effort is concentrated out on the periphery of this background. This is something of a challenge, since the data density is much lower there. The modeling, however, is discriminative, not generative, which means that the aim is not to model the distribution per se, but to find the boundary that separates the non-anomalous data from the anomalies.
REFERENCES
Optical Engineering 46 , 076402 (2007).
on nonparametric density estimation.” Proc. SPIE 8743 , 87431A (2013).