2009年新书,非扫描
Contents
List of Figures xiii
List of Tables xix
Introduction xxi
About the Editors xxvii
Contributor List xxix
1 Analysis of Text Patterns Using Kernel Methods 1
Marco Turchi, Alessia Mammone, and Nello Cristianini
1.1 Introduction . . . . . . . . . . . . . . . 1
1.2 General Overview on Kernel Methods . . . . . . . 1
1.2.1 Finding Patterns in Feature Space . . . . . . . . . . . 5
1.2.2 Formal Properties of Kernel Functions . . . . . . . . . 8
1.2.3 Operations on Kernel Functions . . . . . . . . . . . . 10
1.3 Kernels for Text . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.1 Vector SpaceModel . . . . . . . . . . . . . . . . . . . 11
1.3.2 Semantic Kernels . . . . . . . . . . . . . . . . . . . . . 13
1.3.3 String Kernels . . . . . . . . . . . . . . . . . . . . . . 17
1.4 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.5 Conclusion and Further Reading . . . . . . . . . . . . . . . . 22
2 Detection of Bias in Media Outlets with Statistical Learning
Methods 27
Blaz Fortuna, Carolina Galleguillos, and Nello Cristianini
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2 Overview of the Experiments . . . . . . . . . . . . . . . . . . 29
2.3 Data Collection and Preparation . . . . . . . . . . . . . . . . 30
2.3.1 Article Extraction from HTML Pages . . . . . . . . . 31
2.3.2 Data Preparation . . . . . . . . . . . . . . . . . . . . . 31
2.3.3 Detection of Matching News Items . . . . . . . . . . . 32
2.4 News Outlet Identification . . . . . . . . . . . . . . . . . . . . 35
2.5 Topic-Wise Comparison of Term Bias . . . . . . . . . . . . . 38
2.6 News OutletsMap . . . . . . . . . . . . . . . . . . . . . . . . 40
2.6.1 Distance Based on Lexical Choices . . . . . . . . . . . 42
vii
© 2009 by Taylor and Francis Group, LLC
viii
2.6.2 Distance Based on Choice of Topics . . . . . . . . . . 43
2.7 RelatedWork . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.9 Appendix A: Support Vector Machines . . . . . . . . . . . . . 48
2.10 Appendix B: Bag of Words and Vector Space Models . . . . . 48
2.11 Appendix C: Kernel Canonical Correlation Analysis . . . . . 49
2.12 Appendix D: Multidimensional Scaling . . . . . . . . . . . . . 50
3 Collective Classification for Text Classification 51
Galileo Namata, Prithviraj Sen, Mustafa Bilgic, and Lise Getoor
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.2 Collective Classification: Notation and Problem Definition . . 53
3.3 Approximate Inference Algorithms for Approaches Based on
Local Conditional Classifiers . . . . . . . . . . . . . . . . . . . 53
3.3.1 Iterative Classification . . . . . . . . . . . . . . . . . . 54
3.3.2 Gibbs Sampling . . . . . . . . . . . . . . . . . . . . . . 55
3.3.3 Local Classifiers and Further Optimizations . . . . . . 55
3.4 Approximate Inference Algorithms for Approaches Based on
Global Formulations . . . . . . . . . . . . . . . . . . . . . . . 56
3.4.1 Loopy Belief Propagation . . . . . . . . . . . . . . . . 58
3.4.2 Relaxation Labeling via Mean-Field Approach . . . . 59
3.5 Learning the Classifiers . . . . . . . . . . . . . . . . . . . . . 60
3.6 Experimental Comparison . . . . . . . . . . . . . . . . . . . . 60
3.6.1 Features Used . . . . . . . . . . . . . . . . . . . . . . . 60
3.6.2 Real-World Datasets . . . . . . . . . . . . . . . . . . . 60
3.6.3 Practical Issues . . . . . . . . . . . . . . . . . . . . . . 63
3.7 RelatedWork . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.9 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . 66
4 Topic Models 71
David M. Blei and John D. Lafferty
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2 Latent Dirichlet Allocation . . . . . . . . . . . . . . . . . . . 72
4.2.1 Statistical Assumptions . . . . . . . . . . . . . . . . . 73
4.2.2 Exploring a Corpus with the Posterior Distribution . . 75
4.3 Posterior Inference for LDA . . . . . . . . . . . . . . . . . . . 76
4.3.1 Mean Field Variational Inference . . . . . . . . . . . . 78
4.3.2 Practical Considerations . . . . . . . . . . . . . . . . . 81
4.4 Dynamic Topic Models and Correlated Topic Models . . . . . 82
4.4.1 The Correlated Topic Model . . . . . . . . . . . . . . 82
4.4.2 The Dynamic Topic Model . . . . . . . . . . . . . . . 84
4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
© 2009 by Taylor and Francis Group, LLC
ix
5 Nonnegative Matrix and Tensor Factorization for Discussion
Tracking 95
Brett W. Bader, Michael W. Berry, and Amy N. Langville
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.1.1 Extracting Discussions . . . . . . . . . . . . . . . . . . 96
5.1.2 RelatedWork . . . . . . . . . . . . . . . . . . . . . . . 96
5.2 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.3 Tensor Decompositions and Algorithms . . . . . . . . . . . . 98
5.3.1 PARAFAC-ALS . . . . . . . . . . . . . . . . . . . . . 100
5.3.2 Nonnegative Tensor Factorization . . . . . . . . . . . . 100
5.4 Enron Subset . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.4.1 TermWeighting Techniques . . . . . . . . . . . . . . . 103
5.5 Observations and Results . . . . . . . . . . . . . . . . . . . . 105
5.5.1 Nonnegative Tensor Decomposition . . . . . . . . . . . 105
5.5.2 Analysis of Three-Way Tensor . . . . . . . . . . . . . 106
5.5.3 Analysis of Four-Way Tensor . . . . . . . . . . . . . . 108
5.6 Visualizing Results of the NMF Clustering . . . . . . . . . . . 111
5.7 FutureWork . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6 Text Clustering with Mixture of von Mises-Fisher Distributions
121
Arindam Banerjee, Inderjit Dhillon, Joydeep Ghosh, and Suvrit Sra
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.2 RelatedWork . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.3 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
6.3.1 The von Mises-Fisher (vMF) Distribution . . . . . . . 124
6.3.2 Maximum Likelihood Estimates . . . . . . . . . . . . . 125
6.4 EMon aMixture of vMFs (moVMF) . . . . . . . . . . . . . . 126
6.5 Handling High-Dimensional Text Datasets . . . . . . . . . . . 127
6.5.1 Approximating κ . . . . . . . . . . . . . . . . . . . . . 128
6.5.2 Experimental Study of the Approximation . . . . . . . 130
6.6 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.7 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 134
6.7.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . 135
6.7.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . 138
6.7.3 Simulated Datasets . . . . . . . . . . . . . . . . . . . . 138
6.7.4 Classic3 Family of Datasets . . . . . . . . . . . . . . . 140
6.7.5 Yahoo News Dataset . . . . . . . . . . . . . . . . . . . 143
6.7.6 20 Newsgroup Family of Datasets . . . . . . . . . . . . 143
6.7.7 Slashdot Datasets . . . . . . . . . . . . . . . . . . . . 145
6.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
6.9 Conclusions and Future Work . . . . . . . . . . . . . . . . . . 148
© 2009 by Taylor and Francis Group, LLC
x
7 Constrained Partitional Clustering of Text Data: An
Overview 155
Sugato Basu and Ian Davidson
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
7.2 Uses of Constraints . . . . . . . . . . . . . . . . . . . . . . . . 157
7.2.1 Constraint-Based Methods . . . . . . . . . . . . . . . 157
7.2.2 Distance-BasedMethods . . . . . . . . . . . . . . . . . 158
7.3 Text Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . 159
7.3.1 Pre-Processing . . . . . . . . . . . . . . . . . . . . . . 161
7.3.2 DistanceMeasures . . . . . . . . . . . . . . . . . . . . 162
7.4 Partitional Clustering with Constraints . . . . . . . . . . . . 163
7.4.1 COP-KMeans . . . . . . . . . . . . . . . . . . . . . . . 163
7.4.2 Algorithms with Penalties – PKM, CVQE . . . . . . . 164
7.4.3 LCVQE: An Extension to CVQE . . . . . . . . . . . . 167
7.4.4 Probabilistic Penalty – PKM . . . . . . . . . . . . . . 167
7.5 Learning Distance Function with Constraints . . . . . . . . . 168
7.5.1 Generalized Mahalanobis Distance Learning . . . . . . 168
7.5.2 Kernel Distance Functions Using AdaBoost . . . . . . 169
7.6 Satisfying Constraints and Learning Distance Functions . . . 170
7.6.1 Hidden Markov Random Field (HMRF) Model . . . . 170
7.6.2 EMAlgorithm . . . . . . . . . . . . . . . . . . . . . . 173
7.6.3 Improvements to HMRF-KMeans . . . . . . . . . . . 173
7.7 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
7.7.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . 174
7.7.2 Clustering Evaluation . . . . . . . . . . . . . . . . . . 175
7.7.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . 176
7.7.4 Comparison of Distance Functions . . . . . . . . . . . 176
7.7.5 Experimental Results . . . . . . . . . . . . . . . . . . 177
7.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
8 Adaptive Information Filtering 185
Yi Zhang
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
8.2 Standard EvaluationMeasures . . . . . . . . . . . . . . . . . 188
8.3 Standard Retrieval Models and Filtering Approaches . . . . . 190
8.3.1 Existing Retrieval Models . . . . . . . . . . . . . . . . 190
8.3.2 Existing Adaptive Filtering Approaches . . . . . . . . 192
8.4 CollaborativeAdaptive Filtering . . . . . . . . . . . . . . . . 194
8.5 Novelty and Redundancy Detection . . . . . . . . . . . . . . . 196
8.5.1 Set Difference . . . . . . . . . . . . . . . . . . . . . . . 199
8.5.2 Geometric Distance . . . . . . . . . . . . . . . . . . . 199
8.5.3 Distributional Similarity . . . . . . . . . . . . . . . . . 200
8.5.4 Summary of Novelty Detection . . . . . . . . . . . . . 201
8.6 Other Adaptive Filtering Topics . . . . . . . . . . . . . . . . 201
8.6.1 Beyond Bag ofWords . . . . . . . . . . . . . . . . . . 202
© 2009 by Taylor and Francis Group, LLC
xi
8.6.2 Using Implicit Feedback . . . . . . . . . . . . . . . . . 202
8.6.3 Exploration and Exploitation Trade Off . . . . . . . . 203
8.6.4 Evaluation beyond Topical Relevance . . . . . . . . . 203
8.7 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . 204
9 Utility-Based Information Distillation 213
Yiming Yang and Abhimanyu Lad
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
9.1.1 Related Work in Adaptive Filtering (AF) . . . . . . . 213
9.1.2 Related Work in Topic Detection and Tracking (TDT) 214
9.1.3 Limitations of Current Solutions . . . . . . . . . . . . 215
9.2 A Sample Task . . . . . . . . . . . . . . . . . . . . . . . . . . 216
9.3 Technical Cores . . . . . . . . . . . . . . . . . . . . . . . . . . 218
9.3.1 Adaptive Filtering Component . . . . . . . . . . . . . 218
9.3.2 Passage Retrieval Component . . . . . . . . . . . . . . 219
9.3.3 Novelty Detection Component . . . . . . . . . . . . . 220
9.3.4 Anti-Redundant Ranking Component . . . . . . . . . 220
9.4 EvaluationMethodology . . . . . . . . . . . . . . . . . . . . . 221
9.4.1 Answer Keys . . . . . . . . . . . . . . . . . . . . . . . 221
9.4.2 Evaluating the Utility of a Sequence of Ranked Lists . 223
9.5 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
9.6 Experiments and Results . . . . . . . . . . . . . . . . . . . . . 226
9.6.1 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . 226
9.6.2 Experimental Setup . . . . . . . . . . . . . . . . . . . 226
9.6.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . 227
9.7 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . 229
9.8 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . 229
10 Text Search-Enhanced with Types and Entities 233
Soumen Chakrabarti, Sujatha Das, Vijay Krishnan, and Kriti Puniyani
10.1 Entity-Aware Search Architecture . . . . . . . . . . . . . . . . 233
10.1.1 Guessing Answer Types . . . . . . . . . . . . . . . . . 234
10.1.2 Scoring Snippets . . . . . . . . . . . . . . . . . . . . . 235
10.1.3 Efficient Indexing and Query Processing . . . . . . . . 236
10.1.4 Comparison with Prior Work . . . . . . . . . . . . . . 236
10.2 Understanding the Question . . . . . . . . . . . . . . . . . . . 236
10.2.1 Answer Type Clues in Questions . . . . . . . . . . . . 239
10.2.2 Sequential Labeling of Type Clue Spans . . . . . . . . 240
10.2.3 From Type Clue Spans to Answer Types . . . . . . . . 245
10.2.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . 247
10.3 Scoring Potential Answer Snippets . . . . . . . . . . . . . . . 251
10.3.1 A ProximityModel . . . . . . . . . . . . . . . . . . . . 253
10.3.2 Learning the Proximity Scoring Function . . . . . . . 255
10.3.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . 257
10.4 Indexing and Query Processing . . . . . . . . . . . . . . . . . 260
© 2009 by Taylor and Francis Group, LLC
xii
10.4.1 Probability of a Query Atype . . . . . . . . . . . . . . 262
10.4.2 Pre-Generalize and Post-Filter . . . . . . . . . . . . . 262
10.4.3 Atype Subset Index Space Model . . . . . . . . . . . . 265
10.4.4 Query Time BloatModel . . . . . . . . . . . . . . . . 266
10.4.5 Choosing an Atype Subset . . . . . . . . . . . . . . . . 269
10.4.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . 271
10.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
10.5.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . 272
10.5.2 Ongoing and Future Work . . . . . . . . . . . . . . . . 273
© 2009
1