Classification in Networked Data^0:

A toolkit and a univariate case study

Sofus A. Macskassy SOFMAC@FETCH.COM
Fetch Technologies, Inc.
2041 Rosecrans Avenue
El Segundo, CA 90254

Foster Provost FPROVOST@STERN.NYU.EDU
New York University
44 W. 4th Street
New York, NY 10012

Editor:

Abstract

This paper is about classifying entities that are interlinked with entities for which the class is known. After surveying prior work, we present NetKit, a modular toolkit for classification in networked data, and a case study of its application to networked data used in prior machine learning research. NetKit is based on a node-centric framework in which classifiers comprise a local classifier, a relational classifier, and a collective inference procedure. Various existing node-centric relational learning algorithms can be instantiated with appropriate choices for these components, and new combinations of components realize new algorithms. The case study focuses on univariate network classification, for which the only information used is the structure of class linkage in the network (i.e., only links and some class labels). To our knowledge, no work previously has evaluated systematically the power of class linkage alone for classification in machine learning benchmark data sets. The results demonstrate that very simple network-classification models perform quite well—well enough that they should be used regularly as baseline classifiers for studies of learning with networked data. The simplest method (which performs remarkably well) highlights the close correspondence between several existing methods introduced for different purposes—i.e., Gaussian-field classifiers, Hopfield networks, and relational-neighbor classifiers. The case study also shows that there are two sets of techniques that are preferable in different situations, namely when few versus many labels are known initially. We also demonstrate that link selection plays an important role similar to traditional feature selection.

Keywords: relational learning, network learning, collective inference, collective classification, networked data, probabilistic relational models, network analysis, network data

1. Introduction

Networked data contain interconnected entities for which inferences are to be made. For example, web pages are interconnected by hyperlinks, research papers are connected by citations, telephone

  0. Macskassy, S.A. and Provost, F.J., “Classification in Networked Data: A toolkit and a univariate case study,” CeDER Working Paper CeDER-04-08, Stern School of Business, New York University, NY, NY 10012. December 2004. Updated December 2006.

© 2006 Sofus A. Macskassy and Foster Provost.


accounts are linked by calls, possible terrorists are linked by communications. This paper is about within-network classification: entities for which the class is known are linked to entities for which the class must be estimated. For example, telephone accounts previously determined to be fraudulent may be linked, perhaps indirectly, to those for which no assessment yet has been made.

Such networked data present both complications and opportunities for classification and machine learning. The data are patently not independent and identically distributed, which introduces bias to learning and inference procedures (Jensen and Neville, 2002b). The usual careful separation of data into training and test sets is difficult, and more importantly, thinking in terms of separating training and test sets obscures an important facet of the data: entities with known classifications can serve two roles. They act first as training data and subsequently as background knowledge during inference. Relatedly, within-network inference allows models to use specific node identifiers to aid inference (see Section 3.5.3).

Networked data allow collective inference, meaning that various interrelated values can be inferred simultaneously. For example, inference in Markov random fields (MRFs, Dobrushin, 1968; Besag, 1974; Geman and Geman, 1984) uses estimates of a node’s neighbors’ labels to influence the estimation of the node’s label—and vice versa. Within-network inference complicates such procedures by pinning certain values, but also offers opportunities such as the application of network-flow algorithms to inference (see Section 3.5.1). More generally, networked data allow the use of the features of a node’s neighbors, although that must be done with care to avoid greatly increasing estimation variance and thereby error (Jensen et al., 2004).

To our knowledge there previously has been no large-scale, systematic experimental study of machine learning methods for within-network classification. A serious obstacle to undertaking such a study is the scarcity of available tools and source code, making it hard to compare various methodologies and algorithms. A systematic study is further hindered by the fact that many relational learning algorithms can be separated into various sub-components; ideally the relative contributions of the sub-components and alternatives should be assessed. As a main contribution of this paper, we introduce a network learning toolkit (NetKit-SRL) that enables in-depth, component-wise studies of techniques for statistical relational learning and classification with networked data. We abstract prior, published methods into a modular framework on which the toolkit is based.^1

NetKit is interesting for several reasons. First, various systems from prior work can be realized by choosing particular instantiations for the different components. A common platform allows one to compare and contrast the different systems on equal footing. Perhaps more importantly, the modularity of the toolkit broadens the design space of possible systems beyond those that have appeared in prior work, either by mixing and matching the components of the prior systems, or by introducing new alternatives for components.

In the second half of the paper, we use NetKit to conduct a case study of within-network classification in homogeneous, univariate networks, which are important both practically and scientifically (as we discuss in Section 5).
We compare various learning and inference techniques on twelve benchmark data sets from four domains used in prior machine learning research. Beyond illustrating the value of the toolkit, the case study provides systematic evidence that with networked data even univariate classification can be remarkably effective. One implication is that such methods should be used as baselines against which to compare more sophisticated relational learning algorithms.

  1. NetKit-SRL, or NetKit for short, is written in Java 1.5 and is available as open source from http://www.research.rutgers.edu/˜sofmac/NetKit.html


(2006b) explicitly represent and reason with accounts’ local network neighborhoods, for identifying telecommunications fraud. Similarly, networks of relationships between brokers can help in identifying securities fraud (Neville et al., 2005).

For marketing, consumers can be connected into a network based on the products that they buy (or that they rate in a collaborative filtering system), and then network-based techniques can be applied for making product recommendations (Domingos and Richardson, 2001; Huang et al., 2004). If a firm knows actual social-network links between consumers, for example through communications records, statistical, network-based marketing techniques can perform significantly better than traditional targeted marketing based on demographics and prior purchase data (Hill et al., 2006a).

Finally, network classification approaches have seen elegant application to problems that initially do not present themselves as network classification. Section 3.5.1 discusses how for “transductive” inference (Vapnik, 1998a), data points can be linked into a network based on any similarity measure. Thus, any transductive classification problem can be treated as a (within-)network classification problem.

3. Network Classification and Learning

Traditionally, machine learning methods have treated entities as being independent, which makes it possible to infer class membership on an entity-by-entity basis. With networked data, the class membership of one entity may have an influence on the class membership of a related entity. Furthermore, entities not directly linked may be related by chains of links, which suggests that it may be beneficial to infer the class memberships of all entities simultaneously. Collective inferencing in relational data (Jensen et al., 2004) makes simultaneous statistical judgments regarding the values of an attribute or attributes for multiple linked entities for which some attribute values are not known.

3.1 Univariate Collective Inferencing

For the univariate case study presented below, the (single) attribute Xi of a vertex vi, representing the class, can take on some categorical value c ∈ X—for m classes, X = {c1, ..., cm}. We will use c to refer to a non-specified class value.

Given graph G = (V, E, X) where Xi is the (single) attribute of vertex vi ∈ V, and given known values xi of Xi for some subset of vertices V^K, univariate collective inferencing is the process of simultaneously inferring the values xi of Xi for the remaining vertices, V^U = V − V^K, or a probability distribution over those values.

As a shorthand, we will use x^K to denote the (vector of) class values for V^K, and similarly for x^U. Then, G^K = (V, E, x^K) denotes everything that is known about the graph (we do not consider the possibility of unknown edges). Edge eij ∈ E represents the edge between vertices vi and vj, and wij represents the edge weight. For this paper we consider only undirected edges (i.e., wij = wji), if necessary simply ignoring directionality for a particular application. Rather than estimating the full joint probability distribution P(x^U | G^K) explicitly, relational learning often enhances tractability by making a Markov assumption:

P(xi | G) = P(xi | Ni),


where Ni is a set of “neighbors” of vertex vi such that P(xi | Ni) is independent of G − Ni (i.e., P(xi | Ni) = P(xi | G)). For this paper, we make the (“first-order”) assumption that Ni comprises only the immediate neighbors of vi in the graph. As one would expect, and as we will see in Section 5.3.5, this assumption can be violated to a greater or lesser degree based on how edges are defined. Given Ni, a relational model can be used to estimate xi. Note that Ni^U (= Ni ∩ V^U)—the set of neighbors of vi whose values of attribute X are not known—could be non-empty. Therefore, even if the Markov assumption holds, a simple application of the relational model may be insufficient. However, the relational model also may be used to estimate the labels of Ni^U. Further, just as estimates for the labels of Ni^U influence the estimate for xi, xi also influences the estimate of the labels of vj ∈ Ni^U (because edges are undirected, so vj ∈ Ni ⇒ vi ∈ Nj). In order to simultaneously estimate these interdependent values x^U, various collective inference methods can be applied, which we discuss below.

Many of the algorithms developed for within-network classification are heuristic methods without a formal probabilistic semantics (and others are heuristic methods with a formal probabilistic semantics). Nevertheless, let us suppose that at inference time we are presented with a probability distribution structured as a graphical model.^2 In general, there are various inference tasks we might be interested in undertaking (Pearl, 1988). We focus primarily on within-network, univariate classification: the computation of the marginal probability of class membership of a particular node (i.e., the variable represented by the node taking on a particular value), conditioned on knowledge of the class membership of certain other nodes in the network. We also discuss methods for the related problem of computing the maximum a posteriori (MAP) joint labeling for V or V^U.

For the sort of graphs we expect to encounter in the aforementioned applications, such probabilistic inference is quite difficult. As discussed by Wainwright and Jordan (2003), the naive method of marginalizing by summing over all configurations of the remaining variables is intractable even for graphs of modest size; for binary classification with around 400 unknown nodes, the summation involves more terms than atoms in the visible universe. Inference via belief propagation (Pearl, 1988) is applicable only as a heuristic approximation, because directed versions of many network classification graphs will contain cycles. An important alternative to heuristic (“loopy”) belief propagation is the junction-tree algorithm (Cowell et al., 1999), which provides exact solutions for arbitrary graphs. Unfortunately, the computational complexity of the junction-tree algorithm is exponential in the “treewidth” of the junction tree formed by the graph (Wainwright and Jordan, 2003). Since the treewidth is one less than the size of the largest clique, and the junction tree is formed by triangulating the original graph, the complexity is likely to be prohibitive for graphs such as social networks, which can have dense local connectivity and long cycles.
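To make the univariate setup concrete, the following minimal sketch (illustrative names, not NetKit's API) represents a weighted, undirected graph with known labels x^K for a subset of nodes and exposes the first-order neighborhoods Ni used throughout this section.

```python
# A minimal sketch (assumed, illustrative names) of the univariate networked-data setup:
# an undirected weighted graph G = (V, E), known labels x^K for a subset V^K, and
# first-order Markov neighborhoods N_i.
from collections import defaultdict

class UnivariateNetwork:
    def __init__(self):
        self.adj = defaultdict(dict)   # adj[i][j] = w_ij (undirected: w_ij == w_ji)
        self.known = {}                # x^K: node -> class label, for v_i in V^K

    def add_edge(self, i, j, w=1.0):
        self.adj[i][j] = w
        self.adj[j][i] = w

    def neighbors(self, i):
        """N_i: the immediate (first-order) neighbors of node i, with edge weights."""
        return self.adj[i]

    def unknown_nodes(self):
        """V^U = V - V^K: the nodes whose labels must be inferred."""
        return [v for v in self.adj if v not in self.known]

# Example: a 4-node graph with two labeled nodes.
g = UnivariateNetwork()
g.add_edge("a", "b"); g.add_edge("b", "c"); g.add_edge("c", "d", w=2.0)
g.known.update({"a": "pos", "d": "neg"})
print(g.unknown_nodes())       # ['b', 'c']
print(dict(g.neighbors("b")))  # {'a': 1.0, 'c': 1.0}
```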

3.2 A Node-centric Network Learning Framework and Historical Background: Local, Relational, and Collective Inference

A large set of approaches to the problem of network classification can be viewed as “node centric,” in the sense that they focus on a single node at a time. For a couple of reasons, which we elaborate

  2. For this paper, we assume that the structure of the network resulting from the chosen links corresponds at least partially to the structure of the network of probabilistic dependencies. This of course will be more or less true based on the choice of links, as we will see in Section 5.3.5.


approaches. MRFs are used to estimate a joint probability distribution over the free variables of a set of nodes under the first-order Markov assumption that P(xi | G/vi) = P(xi | Ni), where xi is the (estimated) label of vertex vi, G/vi means all nodes in G except vi, and Ni is a neighborhood function returning the neighbors of vi. In a typical image application, nodes in the network are pixels and the labels are image properties such as whether a pixel is part of a vertical or horizontal border. Because of the obvious interdependencies among the nodes in an MRF, computing the joint probability of assignments of labels to the nodes (“configurations”) requires collective inference. Gibbs sampling (Geman and Geman, 1984) was developed for this purpose for restoring degraded images. Geman and Geman enforce that the Gibbs sampler settles to a final state by using simulated annealing where the temperature is dropped slowly until nodes no longer change state. Gibbs sampling is discussed in more detail below.

Two problems with Gibbs sampling (Besag, 1986) are particularly relevant for machine learning applications of network classification. First, prior to Besag’s paper Gibbs sampling typically was used in vision not to compute the final marginal posteriors, as required by many “scoring” applications where the goal is to rank individuals, but rather to get final MAP classifications. Second, Gibbs sampling can be very time consuming, especially for large networks (not to mention the problems detecting convergence in the first place). With his Iterated Conditional Modes (ICM) algorithm, Besag introduced the notion of iterative classification for scene reconstruction. In brief, iterative classification repeatedly classifies labels for vi ∈ V^U, based on the “current” state of the graph, until no vertices change their label. ICM is presented as being efficient and particularly well suited to maximum marginal classification by node (pixel), as opposed to maximum joint classification over all the nodes (the scene).
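The following sketch illustrates iterative classification in the spirit of ICM, under the assumption of a simple weighted-majority relational rule (an illustrative stand-in, not Besag's domain-specific models): each unknown node is repeatedly assigned the label favored by its neighbors until no label changes.

```python
# Illustrative sketch of ICM-style iterative classification: repeatedly assign each unknown
# node the label favored by its currently labeled neighbors until no node changes.
# The weighted-majority relational rule here is an assumed stand-in.
from collections import defaultdict

def iterative_classification(adj, known, default_label, max_iters=100):
    labels = dict(known)                           # start from the known labels x^K
    for v in adj:                                  # initialize V^U with a default prior label
        labels.setdefault(v, default_label)
    for _ in range(max_iters):
        changed = False
        for v in adj:
            if v in known:                         # known labels stay pinned
                continue
            votes = defaultdict(float)
            for u, w in adj[v].items():            # weighted vote over neighbor labels
                votes[labels[u]] += w
            best = max(votes, key=votes.get)
            if best != labels[v]:
                labels[v], changed = best, True
        if not changed:                            # stop when the labeling is stable
            break
    return labels

# Tiny example: two labeled endpoints pull the chain toward their own classes.
adj = {"a": {"b": 1.0}, "b": {"a": 1.0, "c": 1.0},
       "c": {"b": 1.0, "d": 2.0}, "d": {"c": 2.0}}
print(iterative_classification(adj, {"a": "pos", "d": "neg"}, default_label="pos"))
```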

Two other, closely related, collective inference techniques are (loopy) belief propagation (Pearl, 1988) and relaxation labeling (Rosenfeld et al., 1976; Hummel and Zucker, 1983). Loopy belief propagation was introduced above. Relaxation labeling originally was proposed as a class of parallel iterative numerical procedures that use contextual constraints to reduce ambiguities in image analysis; an instance of relaxation labeling is described in detail below. Both methods use the estimated class distributions directly, rather than the hard labelings used by iterative classification. Therefore, one requirement for applying these methods is that the relational classifier, when estimating xi, must be able to use the estimated class distributions of vj ∈ Ni^U.

Graph-cut techniques recently have been used in vision research as an alternative to using Gibbs sampling (Boykov et al., 2001). In essence, these are collective inference procedures, and are the basis of a collection of modern machine learning techniques. However, they do not quite fit in the node-centric framework, so we treat them separately below.

3.3 Node-centric Network Classification Approaches

The node-centric framework allows us to describe several prior systems by how they solve the problems of local classification, relational classification, and collective inference. The components of these systems are the basis for composing methods in NetKit. For classifying web-pages based on the text and (possibly inferred) class labels of neighboring pages, Chakrabarti et al. (1998) combined naive Bayes local and relational classifiers with relaxation labeling for collective inference. In their experiments, performing network classification using the web-pages’ link structure substantially improved classification as compared to using only the local


(text) information. Specifically, considering the text of neighboring pages generally hurt performance, whereas using only the (inferred) class labels improved performance.

The iterated conditional modes procedure (ICM, Besag, 1986) is a node-centric approach where the local and relational classifiers are domain-dependent probabilistic models (based on local attributes and an MRF), and iterative classification is used for collective inference. Iterative classification has been used for collective inference elsewhere as well; for example, Neville and Jensen (2000) use it in combination with naive Bayes for local and relational classification (with a simulated annealing procedure to settle on the final labeling).

We will look in more detail at the procedure known as “link-based classification” (Lu and Getoor, 2003), also introduced for the classification of linked documents (web pages and published manuscripts with an accompanying citation graph). Similarly to the work of Chakrabarti et al. (1998), link-based classification uses the (local) text of the document as well as neighbor labels. More specifically, the relational classifier is a logistic regression model applied to a vector of aggregations of properties of the sets of neighbor labels linked with different types of links (in-, out-, co-links). Various aggregates could be used and are examined by Lu and Getoor (2003), such as the mode (the value of the most often occurring neighbor class), a binary vector with a value of 1 at cell i if there was a neighbor whose class label was ci, and a count vector where cell i contained the number of neighbors belonging to class ci. In their experiments, the count model performed best. They used logistic regression on the local (text) attributes of the instances to initialize the priors for each vertex in their graph and then applied the link-based classifiers as their relational model.

The simplest network classification technique we will consider was introduced to highlight the remarkable amount of “power” for classification present just in the structure of the network, a notion that we will investigate in depth in the case study below. The weighted-vote relational neighbor (wvRN) procedure (Macskassy and Provost, 2003) performs relational classification via a weighted average of the (potentially estimated) class membership scores (“probabilities”) of the node’s neighbors. Collective inference is performed via a relaxation labeling method similar to that used by Chakrabarti et al. (1998). If local attributes such as text are ignored, the node priors can be instantiated with the unconditional marginal class distribution estimated from the training data.

Since wvRN performs so well in the case study below, it is noteworthy to point out its close relationship to Hopfield networks (Hopfield, 1982) and Boltzmann machines (Hinton and Sejnowski, 1986). A Hopfield network is a graph of homogeneous nodes and undirected edges, where each node is a binary threshold unit. Hopfield networks were designed to recover previously seen graph configurations from a partially observed configuration, by repeatedly estimating the states of nodes one at a time. The state of a node is determined by whether or not its input exceeds its threshold, where the input is the weighted sum of the states of its immediate neighbors. wvRN differs in that it retains uncertainty at the nodes rather than assigning each a binary state (also allowing multi-class networks). Learning in Hopfield networks consists of learning the weights of edges and the thresholds of nodes, given one or more input graphs. Given a partially observed graph state, repeatedly applying the node-activation equation node by node provably converges to a stable graph state—the low-energy state of the graph. If the partial input state is “close” to one of the training states, the Hopfield network will converge to that state.
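For illustration, a single Hopfield-style node-activation update might look as follows (assuming a ±1 state coding and zero thresholds, which are not specified above); wvRN's update is analogous but keeps a probability per class instead of collapsing each node to a binary state.

```python
# A small illustration (assumed +/-1 state coding and zero thresholds) of the Hopfield-style
# node-activation update described above: a node's next state is determined by whether the
# weighted sum of its neighbors' states exceeds its threshold.
def hopfield_update(states, adj, thresholds, node):
    total = sum(w * states[u] for u, w in adj[node].items())
    return 1 if total >= thresholds[node] else -1

adj = {"a": {"b": 1.0, "c": 0.5}, "b": {"a": 1.0}, "c": {"a": 0.5}}
states = {"a": -1, "b": 1, "c": 1}          # partially observed configuration
thresholds = {v: 0.0 for v in adj}
states["a"] = hopfield_update(states, adj, thresholds, "a")
print(states["a"])  # 1: the weighted neighbor input (1.0 + 0.5) exceeds the threshold
```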

A Boltzmann machine, like a Hopfield network, is a network of units with an “energy” defined for the network (Hinton and Sejnowski, 1986). Unlike Hopfield networks, Boltzmann machine nodes are stochastic and the machines use simulated annealing to find a stable state. Boltzmann


work classification, but rather, semi-supervised learning in a transductive setting (Vapnik, 1998b). Nevertheless, the methods introduced may have direct application to certain instances of univariate network classification. Specifically, they consider data sets where labels are given for a subset of cases, and classifications are desired for a subset of the rest. Data points are connected into a weighted network, by adding edges (in various ways) based on similarity between cases.

Finding the minimum energy configuration of an MRF, the partition of the nodes that maximizes self-consistency under the constraint that the configuration be consistent with the known labels, is equivalent to finding a minimum cut of the graph (Greig et al., 1989). Following this idea and subsequent work connecting classification to the problem of computing minimum cuts (Kleinberg and Tardos, 1999), Blum and Chawla (2001) investigate how to define weighted edges for a transductive classification problem such that polynomial-time mincut algorithms give optimal solutions to objective functions of interest. For example, they show elegantly how forms of leave-one-out cross-validation error (on the predicted labels) can be minimized for various nearest-neighbor algorithms, including a weighted-voting algorithm. This procedure corresponds to optimizing the consistency of the predictions in particular ways—as Blum and Chawla put it, optimizing the “happiness” of the classification algorithm.

Of course, optimizing the consistency of the labeling may not be ideal. For example, in the case of a highly unbalanced class frequency it is necessary to preprocess the graph to avoid degenerate cuts, for example those cutting off the one positive example (Joachims, 2003). This seeming pathology stems from the basic objective: the minimum of the sum of cut-through edge weights depends directly on the sizes of the cut sets; normalizing for the cut size leads to ratiocut optimization (Dhillon, 2001) constrained by the known labels (Joachims, 2003).

The mincut partition corresponds to the most probable joint labeling of the graph (taking an MRF perspective), whereas as discussed earlier we often would like a per-node (marginal) class-probability estimation (Blum et al., 2004). Unfortunately, in the case we are considering—when some node labels are known in a general graph—there is no known efficient algorithm for determining these estimates. There are several other drawbacks (Blum et al., 2004), including that there may be many minimum cuts for a graph (from which mincut algorithms choose rather arbitrarily), and that the mincut approach does not yield a measure of confidence on the classifications. Blum et al. address these drawbacks by repeatedly adding artificial noise to the edge weights in the induced graph. They then can compute fractional labels for each node corresponding to the frequency of labeling by the various mincut instances. As mentioned above, this method (and the one discussed next) was intended to be applied to an induced graph, which can be designed specifically for the application. Mincut approaches are appropriate for graphs that have at least some small, balanced cuts (whether or not these correspond to the labeled data) (Blum et al., 2004). It is not clear whether methods like this that discard highly unbalanced cuts will be effective for network classification problems such as fraud detection in transaction networks, with extremely unbalanced class distributions.

3.5.2 THE GAUSSIAN-FIELD CLASSIFIER

In the experiments of Blum et al. (2004), their randomized mincut method empirically does not perform as well as a method introduced by Zhu et al. (2003). Therefore, we will revisit this latter method in an experimental comparison following the main case study. Zhu et al. treat the induced network as a Gaussian field (a random field with soft node labels) constrained such that the labeled


nodes maintain their values. The value of the (energy-minimizing) function at each unlabeled node is the weighted average of the function’s values at the neighboring points. Zhu et al. show that this function is a harmonic function and that the solution over the complete graph can be computed using a few matrix operations. The result is a classifier essentially identical to the wvRN classifier (Macskassy and Provost, 2003) discussed above (paired with relaxation labeling), except with a principled semantics and exact inference.^3 The energy function then can be normalized based on desired class posteriors (“class mass normalization”). Zhu et al. also discuss various physical interpretations of this procedure, including random walks, electric networks, and spectral graph theory, that can be intriguing in the context of particular applications. For example, applying the random walk interpretation to a telecommunications network including legitimate and fraudulent accounts, consider starting at an account of interest and walking randomly through the call graph based on the link weights; the node score is the probability that the walk hits a known fraudulent account before hitting a known legitimate account.
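The closed-form harmonic solution can be sketched in a few matrix operations, as described above. This is an illustrative implementation (small dense graph, assumed numpy representation, no isolated nodes), not Zhu et al.'s code.

```python
# A numerical sketch (assumed small dense graph with no isolated nodes) of the harmonic
# solution: each unlabeled node's score is the weighted average of its neighbors' scores,
# with labeled nodes clamped, solved in closed form with a few matrix operations.
import numpy as np

def harmonic_scores(W, labeled_idx, labeled_vals):
    """W: symmetric weight matrix; labeled_vals in [0, 1] (e.g., 1 = positive class)."""
    n = W.shape[0]
    unlabeled_idx = [i for i in range(n) if i not in set(labeled_idx)]
    P = W / W.sum(axis=1, keepdims=True)               # row-normalized transition matrix
    P_UU = P[np.ix_(unlabeled_idx, unlabeled_idx)]
    P_UL = P[np.ix_(unlabeled_idx, labeled_idx)]
    f_L = np.asarray(labeled_vals, dtype=float)
    f_U = np.linalg.solve(np.eye(len(unlabeled_idx)) - P_UU, P_UL @ f_L)
    return dict(zip(unlabeled_idx, f_U))

# Chain 0-1-2-3 with node 0 labeled positive (1.0) and node 3 labeled negative (0.0).
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(harmonic_scores(W, labeled_idx=[0, 3], labeled_vals=[1.0, 0.0]))
# {1: 0.666..., 2: 0.333...}: each score equals the probability that a random walk from
# the node hits the positive labeled node before the negative one, as described above.
```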

3.5.3 USING NODE IDENTIFIERS

As mentioned in the introduction, another unique aspect of within-network classification is that node identifiers, unique symbols for individual nodes, can be used in learning and inference. For example, for suspicion scoring in social networks, the fact that someone met with a particular individual may be informative (e.g., having had repeated meetings with a known terrorist leader). Very little work has incorporated identifiers, because of the obvious difficulty of modeling with very high cardinality categorical attributes. Identifiers (telephone numbers) have been used for fraud detection (Fawcett and Provost, 1997; Cortes et al., 2001; Hill et al., 2006b), but to our knowledge, Perlich and Provost (2006) provide the only comprehensive treatment of the use of identifiers for relational learning.

3.5.4 BEYOND UNIVARIATE CLASSIFICATION

Besides the methods already discussed (e.g., Besag, 1986; Lu and Getoor, 2003; Chakrabarti et al., 1998), several other methods go beyond the homogeneous, univariate case on which this paper focuses. Conditional random fields (CRFs, Lafferty et al., 2001) are random fields where the probability of a node’s label is conditioned not only on the labels of neighbors (as in MRFs), but also on all the observed attribute data. Relational Bayesian networks (RBNs, a.k.a. probabilistic relational models, Koller and Pfeffer, 1998; Friedman et al., 1999; Taskar et al., 2001) extend Bayesian networks (BNs, Pearl, 1988) by taking advantage of the fact that a variable used in one instantiation of a BN may refer to the exact same variable in another BN. For example, consider that the grade of a student depends to some extent upon his professor; this professor is the same for all students in the class. Therefore, rather than building one BN and using it in isolation for each entity, RBNs directly link shared variables, thereby generating one big network of connected entities for which collective inferencing can be performed. Unfortunately, because the BN representation must be acyclic, RBNs cannot model arbitrary relational autocorrelation, such as the homophily that plays a large role in the case study below. However, undirected relational graphical models can model relational autocorrelation. Relational dependency networks (RDNs, Neville and Jensen, 2003, 2004, forthcoming) extend dependency

  3. Experiments show these two procedures to yield almost identical generalization performance, although the matrix-based procedure of Zhu et al. is much slower than the iterative wvRN.


  1. Relational classifier inducer: Given G^K, this module returns a model MR that will estimate xi using vi and Ni. Ideally, MR will estimate a probability distribution over the possible values for xi.
  2. Collective Inferencing: Given a graph G possibly with some xi known, a set of prior estimates for x^U, and a relational model MR, this module applies collective inferencing to estimate x^U.
  3. Weka Wrapper: This module is a wrapper for Weka^4 (Witten and Frank, 2000) and can convert the graph representation of vi into an entity that can either be learned from or be used to estimate xi. NetKit can use a Weka classifier either as a local classifier or as a relational classifier (by using various aggregation methods to summarize the values of attributes in Ni).

The current version of NetKit-SRL, while able to read in heterogeneous graphs, only supports classification in graphs consisting of a single type of node. Algorithms based on expectation maximization are possible to implement through the NetKit collective inference module, by having the collective inference module repeatedly apply a relational classifier to learn a new relational model and then apply the new relational model to G (rather than repeatedly apply the same learned model at every iteration). The rest of this section describes the particular relational classifiers and collective inference methods implemented in NetKit for the case study. First, we describe the four (univariate^5) relational classifiers. Then, we describe the three collective inference methods.
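To convey the node-centric decomposition, here is a minimal sketch of the three component interfaces in Python (hypothetical signatures; NetKit itself is a Java toolkit and its actual API may differ).

```python
# A minimal sketch (hypothetical Python interfaces; NetKit itself is written in Java) of the
# node-centric decomposition: a local classifier supplies priors, a relational classifier
# estimates a node's label from its neighborhood, and a collective inference procedure
# applies the relational model to all unknown nodes simultaneously.
from abc import ABC, abstractmethod

class LocalClassifier(ABC):
    @abstractmethod
    def prior(self, node):
        """Return a dict {class: probability} using only the node's own information."""

class RelationalClassifier(ABC):
    @abstractmethod
    def estimate(self, node, neighbors, current_estimates):
        """Return {class: probability} for `node` from its neighbors' current estimates."""

class CollectiveInference(ABC):
    @abstractmethod
    def infer(self, graph, known_labels, local_model, relational_model):
        """Return {node: {class: probability}} for all nodes with unknown labels."""

# A concrete network classifier is then a choice of the three components, e.g. (class-prior
# local model, weighted-vote relational model, relaxation labeling); new combinations of
# components realize new algorithms.
```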

4.1 Relational Classifiers

All four relational classifiers take advantage of the first-order Markov assumption on the network: only a node’s local neighborhood is necessary for classification. The univariate case renders this assumption particularly restrictive: only the class labels of the local neighbors are necessary. The local network is defined by the user, analogous to the user’s definition of the feature set for propositional learning. Entities whose class labels are not known are either ignored or are assigned a prior probability, depending upon the choice of local classifier.

4.1.1 WEIGHTED-VOTE RELATIONAL NEIGHBOR CLASSIFIER (WVRN)

The case study’s simplest classifier (Macskassy and Provost, 2003)^6 estimates class-membership probabilities by assuming the existence of homophily (see Section 3.4).

Definition. Given vi ∈ V^U, the weighted-vote relational-neighbor classifier (wvRN) estimates P(xi | Ni) as the (weighted) mean of the class-membership probabilities of the entities in Ni:

P(xi = c | Ni) = (1/Z) Σ_{vj ∈ Ni} wi,j · P(xj = c | Nj),

where Z is the usual normalizer. As this is a recursive definition (for undirected graphs, vj ∈ Ni ⇔ vi ∈ Nj), the classifier uses the “current” estimate for P(xj = c | Nj), where the current estimate is defined by the collective inference technique being used.
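A direct transcription of the wvRN estimate into code might look as follows (illustrative sketch; the "current" estimates come from whichever collective inference technique is in use).

```python
# A sketch of the wvRN estimate above (illustrative names): P(x_i = c | N_i) is the
# normalized weighted average of the neighbors' current class-membership estimates.
def wvrn_estimate(node, adj, current):
    """adj[node] = {neighbor: weight}; current[v] = {class: probability} ("current" estimates)."""
    scores = {}
    for v, w in adj[node].items():
        for c, p in current[v].items():
            scores[c] = scores.get(c, 0.0) + w * p
    z = sum(scores.values()) or 1.0               # Z, the usual normalizer
    return {c: s / z for c, s in scores.items()}

adj = {"b": {"a": 1.0, "c": 2.0}}
current = {"a": {"pos": 1.0, "neg": 0.0},          # known label, represented as a point mass
           "c": {"pos": 0.25, "neg": 0.75}}        # current estimate from collective inference
print(wvrn_estimate("b", adj, current))            # {'pos': 0.5, 'neg': 0.5}
```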

  4. We use version 3.4.2. Weka is available at http://www.cs.waikato.ac.nz/~ml/weka/
  5. The open-source NetKit release contains multivariate versions of these classifiers.
  6. Previously called the probabilistic Relational Neighbor classifier (pRN).


4.1.2 CLASS-DISTRIBUTION RELATIONAL NEIGHBOR CLASSIFIER (CDRN)

The simple wvRN assumed that neighboring class labels were likely to be the same. Learning a model of the distribution of neighbor class labels is more flexible, and may lead to better discrimination. Following Perlich and Provost (2003, 2006), and in the spirit of Rocchio’s method (Rocchio, 1971), we define node vi’s class vector CV(vi) to be the vector of summed linkage weights to the various (known) classes, and class c’s reference vector RV(c) to be the average of the class vectors for nodes known to be of class c. Specifically:

CV(vi)k = Σ_{vj ∈ Ni, xj = ck} wi,j,   (1)

where CV(vi)k represents the k-th position in the class vector and ck ∈ X is the k-th class. Based on these class vectors, the reference vectors can then be defined as the normalized vector sum:

RV(c) = (1/|V^K_c|) Σ_{vi ∈ V^K_c} CV(vi),   (2)

where V^K_c = {vi | vi ∈ V^K, xi = c}. For the case study, during training, neighbors in V^U are ignored. For prediction, estimated class membership probabilities are used for neighbors in V^U, and Equation 1 becomes:

CV(vi)k = Σ_{vj ∈ Ni} wi,j · P(xj = ck | Nj)   (3)

Definition. Given vi ∈ V^U, the class-distribution relational-neighbor classifier (cdRN) estimates the probability of class membership, P(xi = c | Ni), by the normalized vector similarity between vi’s class vector and class c’s reference vector:

P (xi = c|Ni) = sim(CV(vi), RV(c)),

where sim(a, b) is any vector similarity function (L1, L2, cosine, etc.), normalized to lie in the range [0, 1]. For the results presented below, we use cosine similarity. As with wvRN, Equation 3 is a recursive definition, and therefore the value of P(xj = c | Nj) is approximated by the “current” estimate as given by the selected collective inference technique.
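The cdRN computation can be sketched as follows (illustrative code, not NetKit's implementation): Equation 3 builds each node's class vector, Equation 2 averages the class vectors of known nodes into per-class reference vectors, and the estimate is the normalized cosine similarity. Because class vectors are non-negative, cosine similarity already lies in [0, 1].

```python
# A sketch of cdRN (illustrative, not NetKit's code): class vectors per Equation 3,
# reference vectors per Equation 2, and cosine similarity for scoring. During training on
# G^K one would restrict to known neighbors, per the text; this sketch uses the current
# estimates for all neighbors, as at prediction time.
import math

def class_vector(node, adj, current, classes):
    # CV(v_i)_k = sum over neighbors of w_ij * P(x_j = c_k | N_j)
    return [sum(w * current[v].get(c, 0.0) for v, w in adj[node].items()) for c in classes]

def reference_vectors(adj, current, known, classes):
    # RV(c): the average class vector of the nodes known to be of class c
    refs = {}
    for c in classes:
        vecs = [class_vector(v, adj, current, classes)
                for v, label in known.items() if label == c]
        refs[c] = [sum(col) / len(vecs) for col in zip(*vecs)] if vecs else [0.0] * len(classes)
    return refs

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def cdrn_estimate(node, adj, current, known, classes):
    cv = class_vector(node, adj, current, classes)
    sims = {c: cosine(cv, rv)
            for c, rv in reference_vectors(adj, current, known, classes).items()}
    z = sum(sims.values()) or 1.0
    return {c: s / z for c, s in sims.items()}     # normalized to yield a distribution
```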

4.1.3 NETWORK-ONLY BAYES CLASSIFIER (NBC)

NetKit’s network-only Bayes classifier (nBC) is based on the algorithm described by Chakrabarti et al. (1998). To start, assume there is a single node vi in V^U. The nBC uses multinomial naive Bayesian classification based on the classes of vi’s neighbors.

P(xi = c | Ni) = P(Ni | c) · P(c) / P(Ni),

where

P(Ni | c) = (1/Z) Π_{vj ∈ Ni} P(xj = x̃j | xi = c)^{wi,j},
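A sketch of the nBC estimate is shown below (assumed data layout; the small smoothing constant is an addition for numerical robustness, not part of the formula above). The product over neighbors is computed in log space.

```python
# A sketch (assumed data layout; smoothing constant is an added safeguard, not part of the
# formula above) of the network-only Bayes estimate: for each candidate class c, multiply
# the class prior by P(x_j = current neighbor label | x_i = c) raised to the edge weight,
# over all neighbors, then normalize. The product is computed in log space.
import math

def nbc_estimate(node, adj, neighbor_labels, cond_prob, class_prior):
    """cond_prob[(c_j, c_i)] = P(x_j = c_j | x_i = c_i), estimated from the known part G^K."""
    log_scores = {}
    for c, prior in class_prior.items():
        s = math.log(prior)
        for v, w in adj[node].items():
            s += w * math.log(cond_prob.get((neighbor_labels[v], c), 1e-6))  # assumed smoothing
        log_scores[c] = s
    m = max(log_scores.values())                     # shift before exponentiating, for stability
    scores = {c: math.exp(s - m) for c, s in log_scores.items()}
    z = sum(scores.values())                         # plays the role of the normalizer
    return {c: s / z for c, s in scores.items()}
```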


  1. Initialize vi ∈ V^U using the local classifier model, ML. Specifically, for vi ∈ V^U:
     (a) ĉi ← ML(vi), where ĉi is a vector of probabilities representing ML’s estimates of P(xi | Ni). We use ĉi(k) to mean the k-th value in ĉi, which represents P(xi = ck | Ni). For the case study, the local classifier model returns the unconditional marginal class distribution estimated from x^K.
     (b) Sample a value cs from ĉi, such that P(cs = ck | ĉi) = ĉi(k).
     (c) Set xi ← cs.
  2. Generate a random ordering, O, of vertices in V^U.
  3. For elements vi ∈ O in order:
     (a) Apply the relational classifier model: ĉi ← MR(vi).
     (b) Sample a value cs from ĉi, such that P(cs = ck | ĉi) = ĉi(k).
     (c) Set xi ← cs.
     Note that when MR is applied to estimate ĉi it uses the “new” labelings from elements 1, ..., (i−1), while using the “current” labelings for elements (i+1), ..., n.
  4. Repeat the prior step 200 times without keeping any statistics. This is known as the burn-in period.
  5. Repeat again for 2000 iterations, counting the number of times each Xi is assigned a particular value c ∈ X. Normalizing these counts forms the final class probability estimates.

Table 3: Pseudo-code for Gibbs sampling.
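The procedure in Table 3 can be sketched compactly as follows (illustrative Python; a weighted neighbor vote stands in for the relational model MR, and the burn-in and iteration counts follow the table).

```python
# A compact sketch of the Gibbs sampling procedure in Table 3 (illustrative Python; a
# weighted neighbor vote stands in for M_R; counts follow the table).
import random
from collections import defaultdict

def gibbs_sample(adj, known, classes, class_prior, burnin=200, iters=2000, seed=0):
    rng = random.Random(seed)
    unknown = [v for v in adj if v not in known]
    labels = dict(known)
    for v in unknown:                                 # step 1: sample initial labels from M_L
        labels[v] = rng.choices(classes, weights=[class_prior[c] for c in classes])[0]
    order = unknown[:]                                # step 2: a single random ordering ("chain")
    rng.shuffle(order)
    counts = {v: defaultdict(int) for v in unknown}
    for it in range(burnin + iters):
        for v in order:                               # step 3: resample each node given neighbors
            votes = [sum(w for u, w in adj[v].items() if labels[u] == c) + 1e-9 for c in classes]
            labels[v] = rng.choices(classes, weights=votes)[0]
        if it >= burnin:                              # steps 4-5: count assignments after burn-in
            for v in unknown:
                counts[v][labels[v]] += 1
    return {v: {c: counts[v][c] / iters for c in classes} for v in unknown}

adj = {"a": {"b": 1.0}, "b": {"a": 1.0, "c": 1.0}, "c": {"b": 1.0, "d": 2.0}, "d": {"c": 2.0}}
print(gibbs_sample(adj, {"a": "pos", "d": "neg"}, ["pos", "neg"], {"pos": 0.5, "neg": 0.5}))
```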

on the application, the goal ideally would be to infer the class labels with either the maximum joint probability or the maximum marginal probability for each node. Alternatively, if estimates of entities’ class-membership probabilities are needed, the collective inference method estimates the marginal probability distribution P(Xi = c | G^K, Λ) for each Xi ∈ x^U and c ∈ X, where Λ stands for the priors returned by the local classifier.

4.2.1 GIBBS SAMPLING

Gibbs sampling (Geman and Geman, 1984) is commonly used for collective inferencing with relational learning systems. The algorithm is straightforward and is shown in Table 3.^8 The values of 200 and 2000 for the burn-in period and the number of iterations are commonly used.^9 Ideally, we would iterate until the estimations converge. Although there are convergence tests for the Gibbs sampler, they are neither robust nor well understood (cf., Gilks et al., 1995), so we simply use a fixed number of iterations.

  8. This instance of Gibbs sampling uses a single random ordering (“chain”), as this is what we used in the case study. In the case study, averaging over 10 chains (the default in NetKit) had no effect on the final accuracies.
  9. As it turns out, in our case study Gibbs sampling invariably reached a seemingly final plateau in fewer than 1000 iterations, and often in fewer than 500.


  1. For vi ∈ V^U, initialize the prior: ĉi^(0) ← ML(vi), where ĉi is defined as in Table 3. For the case study, the local classifier model returns the unconditional marginal class distribution estimated from x^K.
  2. For elements vi ∈ V^U:
     (a) Estimate xi by applying the relational model:

         ĉi^(t+1) ← MR(vi^(t)),   (4)

         where MR(vi^(t)) denotes using the estimates ĉj^(t) for vj ∈ Ni, and t is the iteration count. This has the effect that all predictions are done pseudo-simultaneously based on the state of the graph after iteration t.
  3. Repeat for T iterations, where T = 99 for the case study. ĉ^(T) will comprise the final class probability estimations.

Table 4: Pseudo-code for relaxation labeling.

Notably, because all nodes are assigned a class at every iteration, when Gibbs sampling is used the relational models will always see a fully labeled/classified neighborhood, making prediction straightforward. For example, nBC does not need to compute its Bayesian combination (see Section 4.1.3).

4.2.2 RELAXATION LABELING

The second collective inferencing method used in the study is relaxation labeling, based on the method of Chakrabarti et al. (1998). Rather than treat G as being in a specific labeling “state” at every point (as Gibbs sampling does), relaxation labeling retains the uncertainty, keeping track of the current probability estimations for x^U. The relational model must be able to use these estimations. Further, rather than estimating one node at a time and updating the graph right away, relaxation labeling “freezes” the current estimations so that at step t + 1 all vertices will be updated based on the estimations from step t. The algorithm is shown in Table 4. Preliminary runs showed that relaxation labeling sometimes does not converge, but rather ends up oscillating between two or more graph states.^10 NetKit performs simulated annealing—on each subsequent iteration giving more weight to a node’s own current estimate and less to the influence of its neighbors. The new updating step, replacing Equation 4, is:

ĉi^(t+1) = β^(t+1) · MR(vi^(t)) + (1 − β^(t+1)) · ĉi^(t),

where

β^(0) = k,
β^(t+1) = β^(t) · α,

  10. Such oscillation has been noted elsewhere for closely related methods (Murphy et al., 1999).
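Relaxation labeling with this decay can be sketched as follows (illustrative Python; wvRN stands in for MR, and the constants k and α are given assumed values, since the case-study settings are not restated here).

```python
# A sketch of relaxation labeling with the decay above (illustrative; k and alpha correspond
# to beta^(0) = k and beta^(t+1) = beta^(t) * alpha, with assumed values). wvRN stands in for
# the relational model M_R; all nodes are updated from the frozen estimates of step t.
def relaxation_labeling(adj, known, classes, class_prior, T=99, k=1.0, alpha=0.99):
    unknown = [v for v in adj if v not in known]
    est = {v: ({c: float(known[v] == c) for c in classes} if v in known
               else dict(class_prior)) for v in adj}      # step 1: initialize priors from M_L
    beta = k                                              # beta^(0) = k
    for _ in range(T):
        beta *= alpha                                     # beta^(t+1) = beta^(t) * alpha
        frozen = {v: dict(est[v]) for v in adj}           # estimates from step t, frozen
        for v in unknown:                                 # step 2: pseudo-simultaneous update
            scores = {c: sum(w * frozen[u][c] for u, w in adj[v].items()) for c in classes}
            z = sum(scores.values()) or 1.0
            mr = {c: s / z for c, s in scores.items()}    # wvRN stand-in for M_R(v_i^(t))
            est[v] = {c: beta * mr[c] + (1 - beta) * frozen[v][c] for c in classes}
    return {v: est[v] for v in unknown}

adj = {"a": {"b": 1.0}, "b": {"a": 1.0, "c": 1.0}, "c": {"b": 1.0, "d": 2.0}, "d": {"c": 2.0}}
print(relaxation_labeling(adj, {"a": "pos", "d": "neg"}, ["pos", "neg"], {"pos": 0.5, "neg": 0.5}))
```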


and inference in homogeneous networks, comparing alternative techniques that have not before been compared systematically, if at all. The setting for the case study is simple: For some entities in the network, the value of xi is known; for others it must be estimated.

Univariate classification, albeit a simplification for many domains, is important for several reasons. First, it is a representation that is used in some applications. Above we mentioned fraud detection; as a specific example, a telephone account that calls the same numbers as a known fraudulent account (and hence the accounts are connected through these intermediary numbers) is suspicious (Fawcett and Provost, 1997; Cortes et al., 2001). For phone fraud, univariate network classification often provides alarms with reasonable coverage and remarkably low false-positive rates. Generally speaking, a homogeneous, univariate network is an inexpensive (in terms of data gathering, processing, storage) approximation of many complex networked data problems. Fraud detection applications certainly do have a variety of additional attributes of importance; nevertheless, univariate simplifications are very useful and are used in practice.

The univariate case also is important scientifically. It isolates a primary difference between networked data and non-networked data, facilitating the analysis and comparison of relevant classification and learning methods. One thesis of this study is that there is considerable information inherent in the structure of the networked data and that this information can be readily taken advantage of, using simple models, to estimate the labels of unknown entities. This thesis is tested by isolating this characteristic—namely ignoring any auxiliary attributes and only allowing the use of known class labels—and empirically evaluating how well univariate models perform in this setting on benchmark data sets.

Considering homogeneous networks plays a similar role. Although the domains we consider have obvious representations consisting of multiple entity types and edges (e.g., people and papers for node types and same-author-as and cited-by as edge types in a citation-graph domain), a homogeneous representation is much simpler. In order to assess whether a more complex representation is worthwhile, it is necessary to assess standard techniques on the simpler representation (as we do in this case study). Of course, the way a network is “homogenized” may have a considerable effect on classification performance. We will revisit this below in Section 5.3.6.

5.1 Data

The case study reported in this paper makes use of 12 benchmark data sets from four domains that have been the subject of prior study in machine learning. As this study focuses on networked data, any singleton (disconnected) entities in the data were removed. Therefore, the statistics we present may differ from those reported previously.

5.1.1 IMDB

Networked data from the Internet Movie Database (IMDb)^11 have been used to build models predicting movie success as determined by box-office receipts (Jensen and Neville, 2002a). Following the work of Neville et al. (2003), we focus on movies released in the United States between 1996 and 2001 with the goal of estimating whether the opening weekend box-office receipts “will” exceed $2 million (Neville et al., 2003). Obtaining data from the IMDb web-site, we identified 1169 movies released between 1996 and 2001 that we were able to link up with a revenue classification

  11. http://www.imdb.com


Category         Size
High-revenue      572
Low-revenue       597
Total            1169
Base accuracy    51.07%

Table 6: Class distribution for the IMDb data set.

Category                 Size
Case-based                402
Genetic Algorithms        551
Neural Networks          1064
Probabilistic Methods     529
Reinforcement Learning    335
Rule Learning             230
Theory                    472
Total                    3583
Base accuracy            29.70%

Table 7: Class distribution for the cora data set.

in the database given to us by the authors of the original study. The class distribution of the data set is shown in Table 6.

We link movies if they share a production company, based on observations from previous work (Macskassy and Provost, 2003).^12 The weight of an edge in the resulting graph is the number of production companies two movies have in common. Notably, we ignore the temporal aspect of the movies in this study, simply labeling movies at random for the training set. This can result in a movie in the test set being released earlier than a movie in the training set.
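As an illustration of this linkage rule, the following sketch (with made-up movie and studio names) builds the weighted movie graph from movie-to-production-company assignments.

```python
# An illustrative sketch (made-up movie and company names) of the linkage rule described
# above: two movies are linked if they share a production company, and the edge weight is
# the number of companies they have in common.
from collections import defaultdict
from itertools import combinations

movie_companies = {                       # hypothetical (movie -> set of production companies)
    "MovieA": {"StudioX", "StudioY"},
    "MovieB": {"StudioX"},
    "MovieC": {"StudioX", "StudioY", "StudioZ"},
}

weights = defaultdict(float)
for m1, m2 in combinations(movie_companies, 2):
    shared = movie_companies[m1] & movie_companies[m2]
    if shared:
        weights[(m1, m2)] = float(len(shared))

print(dict(weights))
# {('MovieA', 'MovieB'): 1.0, ('MovieA', 'MovieC'): 2.0, ('MovieB', 'MovieC'): 1.0}
```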

5.1.2 CORA

The cora data set (McCallum et al., 2000) comprises computer science research papers. It includes the full citation graph as well as labels for the topic of each paper (and potentially sub- and sub-sub-topics).^13 Following a prior study (Taskar et al., 2001), we focused on papers within the machine learning topic with the classification task of predicting a paper’s sub-topic (of which there are seven). The class distribution of the data set is shown in Table 7.

Papers can be linked if they share a common author, or if one cites the other. Following prior work (Lu and Getoor, 2003), we link two papers if one cites the other. The weight of an edge would normally be one unless the two papers cite each other (in which case it is two—there can be no other weight for existing edges).

  12. And on a suggestion from David Jensen.
  13. These labels were assigned by a naive Bayes classifier (McCallum et al., 2000).