Table 1: The Co-Training algorithm. In the experiments reported here both h1 and h2 were trained using a naive Bayes
algorithm, and algorithm parameters were set to p = 1, n = 3, k = 30 and u = 75.
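Since the body of Table 1 is not reproduced here, the following minimal Python sketch shows the shape of the loop its caption summarizes, with the stated parameter values as defaults. The helper names (train, prob_positive) and the view encoding are our own scaffolding, not the paper's; treat this as an illustration under those assumptions rather than a definitive implementation.

import random

def co_train(L, U, train, p=1, n=3, k=30, u=75):
    # Sketch of the co-training loop. L: list of labeled examples
    # (x1, x2, y) with y in {0, 1}; U: list of unlabeled pairs (x1, x2).
    # train(examples, view) must return a classifier whose
    # prob_positive(x) estimates P(y = 1 | x) from that single view.
    L, U = list(L), list(U)
    random.shuffle(U)
    pool, U = U[:u], U[u:]                     # working pool U' of size u
    for _ in range(k):
        h1 = train(L, view=0)                  # e.g. page-based view x1
        h2 = train(L, view=1)                  # e.g. hyperlink-based view x2
        for h, view in ((h1, 0), (h2, 1)):
            # Let each classifier self-label its most confident p positive
            # and n negative examples from the pool.
            ranked = sorted(pool, key=lambda ex: h.prob_positive(ex[view]))
            negatives, positives = ranked[:n], ranked[-p:]
            for ex in positives:
                L.append((ex[0], ex[1], 1))
            for ex in negatives:
                L.append((ex[0], ex[1], 0))
            for ex in positives + negatives:
                pool.remove(ex)
        # Replenish U' with 2p + 2n fresh examples drawn from U.
        pool.extend(U[:2 * (p + n)])
        U = U[2 * (p + n):]
    return train(L, view=0), train(L, view=1)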
examples that are more representative of the underlying distribution D that generated U.

Experiments were conducted to determine whether this co-training algorithm could successfully use the unlabeled data to outperform standard supervised training of naive Bayes classifiers. In each experiment, 263 (25%) of the 1051 web pages were first selected at random as a test set. The remaining data was used to generate a labeled set L containing 3 positive and 9 negative examples drawn at random. The remaining examples that were not drawn for L were used as the unlabeled pool U. Five such experiments were conducted using different training/test splits, with co-training parameters set to p = 1, n = 3, k = 30 and u = 75.

To compare co-training to supervised training, we trained naive Bayes classifiers that used only the 12 labeled training examples in L. We trained a hyperlink-based classifier and a page-based classifier, just as for co-training. In addition, we defined a third combined classifier, based on the outputs from the page-based and hyperlink-based classifiers. In keeping with the naive Bayes assumption of conditional independence, this combined classifier computes the probability P(c_j | x) of class c_j given the instance x = (x1, x2) by multiplying the probabilities output by the page-based and hyperlink-based classifiers:

    P(c_j | x) ∝ P(c_j | x1) P(c_j | x2)
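As a concrete illustration, here is a minimal sketch of such a combined classifier (ours, not the paper's); page_probs and link_probs stand for the per-class probability estimates produced by any two single-view models.

import numpy as np

def combined_proba(page_probs, link_probs):
    # Combine per-view class probabilities under the naive Bayes
    # independence assumption: P(c_j | x) is proportional to
    # P(c_j | x1) * P(c_j | x2). Inputs: arrays of shape
    # (n_examples, n_classes). Output rows are renormalized to sum to 1.
    joint = page_probs * link_probs
    return joint / joint.sum(axis=1, keepdims=True)

# Usage with two hypothetical trained models exposing predict_proba,
# e.g. scikit-learn naive Bayes classifiers over page words (x1) and
# hyperlink words (x2):
#   probs = combined_proba(page_clf.predict_proba(X1_test),
#                          link_clf.predict_proba(X2_test))
#   predictions = probs.argmax(axis=1)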
The results of these experiments are summarized in Table 2. Numbers shown there are the test set error rates averaged over the five random train/test splits. The first row of the table shows the error rates for the three classifiers formed by supervised learning; the second row shows the error rates for the classifiers formed by co-training. Note that for this data the default hypothesis that always predicts "negative" achieves an error rate of 22%. Figure 2 gives a plot of error versus number of iterations for one of the five runs.

Notice that for all three types of classifiers (hyperlink-based, page-based, and combined), the co-trained classifier outperforms the classifier formed by supervised training. In fact, the page-based and combined classifiers achieve error rates roughly half those achieved by supervised training. The hyperlink-based classifier is helped less by co-training. This may be because hyperlinks contain fewer words and are less capable of expressing an accurate approximation to the target function.

This experiment involves just one data set and one target function. Further experiments are needed to determine the general behavior of the co-training algorithm, and to determine what exactly is responsible for the pattern of behavior observed. However, these results do indicate that co-training can provide a useful way of taking advantage of unlabeled data.

                        Page-based classifier   Hyperlink-based classifier   Combined classifier
  Supervised training           12.9                      12.4                      11.1
  Co-training                    6.2                      11.6                       5.0

Table 2: Error rate in percent for classifying web pages as course home pages. The top row shows errors when training on only the labeled examples; the bottom row shows errors when co-training, using both labeled and unlabeled examples.
[Figure 2: Error versus number of iterations for one run of the co-training experiment. The plot shows percent error on test data (y-axis, 0 to 0.3) against co-training iterations (x-axis, 0 to 40) for the hyperlink-based, page-based, and default classifiers.]
7 CONCLUSIONS AND OPEN QUESTIONS

We have described a model in which unlabeled data can be used to augment labeled data, based on having two views (x1, x2) of an example that are redundant but not completely correlated. Our theoretical model is clearly an over-simplification of real-world target functions and distributions. In particular, even for the optimal pair of functions f1, f2 ∈ C1 × C2 we would expect to occasionally see inconsistent examples (i.e., examples (x1, x2) such that f1(x1) ≠ f2(x2)). Nonetheless, it provides a way of looking at the notion of the "friendliness" of a distribution (in terms of the components and minimum cuts) and at how unlabeled examples can potentially be used to prune away "incompatible" target concepts to reduce the number of labeled examples needed to learn. It is an open question to what extent the consistency constraints in the model and the mutual independence assumption of Section 5 can be relaxed and still allow provable results on the utility of co-training from unlabeled data. The preliminary experimental results presented suggest that this method of using unlabeled data has a potential for significant benefits in practice, though further studies are clearly needed.

We conjecture that there are many practical learning problems that fit or approximately fit the co-training model. For example, consider the problem of learning to classify segments of television broadcasts [9, 16]. We might be interested, say, in learning to identify televised segments containing the US President. Here X1 could be the set of possible video images, X2 the set of possible audio signals, and X their cross product. Given a small sample of labeled segments, we might learn a weakly predictive recognizer h1 that spots full-frontal images of the president's face, and a recognizer h2 that spots his voice when no background noise is present. We could then use co-training, applied to the large volume of unlabeled television broadcasts, to improve the accuracy of both classifiers. Similar problems exist in many perception learning tasks involving multiple sensors. For example, consider a mobile robot that must learn to recognize open doorways based on a collection of vision (X1), sonar (X2), and laser range (X3) sensors. The important structure in the above problems is that each instance x can be partitioned into subcomponents xi, where the xi are not perfectly correlated, where each xi can in principle be used on its own to make the classification, and where a large volume of unlabeled instances can easily be collected.
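To make that multi-view structure concrete, one hypothetical representation for the television example follows (names are ours); the co_train sketch given earlier would consume instances of this shape, with view 0 the video half and view 1 the audio half.

from typing import NamedTuple, Sequence

class Segment(NamedTuple):
    # A two-view instance: either view could, in principle,
    # determine the label on its own.
    video: Sequence[float]   # x1: features from the video frames
    audio: Sequence[float]   # x2: features from the audio track

# A weak face recognizer h1 trains on segment.video and a weak voice
# recognizer h2 on segment.audio; a large pool of unlabeled Segment
# objects is then cheap to collect from broadcast streams.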
References

[1] V. Castelli and T. M. Cover. On the exponential value of labeled samples. Pattern Recognition Letters, 16:105–111, January 1995.
[2] V. Castelli and T. M. Cover. The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. IEEE Transactions on Information Theory, 42(6):2102–2117, November 1996.
[3] M. Craven, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and C. Y. Quek. Learning to extract symbolic knowledge from the world wide web. Technical report, Carnegie Mellon University, January 1997.
[4] S. E. Decatur. PAC learning with constant-partition classification noise and applications to decision tree induction. In Proceedings of the Fourteenth International Conference on Machine Learning, pages 83–91, July 1997.
[5] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1–38, 1977.
[6] Richard O. Duda and Peter E. Hart. Pattern Classification and Scene Analysis. Wiley, 1973.
[7] Z. Ghahramani and M. I. Jordan. Supervised learning from incomplete data via an EM approach. In Advances in Neural Information Processing Systems (NIPS 6). Morgan Kaufmann, 1994.
[8] S. A. Goldman and M. J. Kearns. On the complexity of teaching. Journal of Computer and System Sciences, 50(1):20–31, February 1995.
[9] A. G. Hauptmann and M. J. Witbrock. Informedia: News-on-demand - multimedia information acquisition and retrieval. In M. Maybury, editor, Intelligent Multimedia Information Retrieval, 1997.
[10] J. Jackson and A. Tomkins. A computational model of teaching. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 319–326. Morgan Kaufmann, 1992.
[11] D. R. Karger. Random sampling in cut, flow, and network design problems. In Proceedings of the Twenty-Sixth Annual ACM Symposium on the Theory of Computing, pages 648–657, May 1994.
[12] D. R. Karger. Random sampling in cut, flow, and network design problems. Journal version draft, 1997.
[13] M. Kearns. Efficient noise-tolerant learning from statistical queries. In Proceedings of the Twenty-Fifth Annual ACM Symposium on Theory of Computing, pages 392–401, 1993.
[14] D. D. Lewis and M. Ringuette. A comparison of two learning algorithms for text categorization. In Third Annual Symposium on Document Analysis and Information Retrieval, pages 81–93, 1994.
[15] Joel Ratsaby and Santosh S. Venkatesh. Learning from a mixture of labeled and unlabeled examples with parametric side information. In Proceedings of the 8th Annual Conference on Computational Learning Theory, pages 412–417. ACM Press, New York, NY, 1995.
[16] M. J. Witbrock and A. G. Hauptmann. Improving acoustic models by watching television. Technical Report CMU-CS-98-110, Carnegie Mellon University, March 19, 1998.
[17] D. Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 189–196, 1995.