The Machine Learning Pipeline 10
Data 11
Tasks 11
Models 12
Features 13
2. Basic Feature Engineering for Text Data: Flatten and Filter 15
Turning Natural Text into Flat Vectors 15
Bag-of-words 16
Implementing bag-of-words: parsing and tokenization 20
Bag-of-N-Grams 21
Collocation Extraction for Phrase Detection 23
Quick summary 26
Filtering for Cleaner Features 26
Stopwords 26
Frequency-based filtering 27
Stemming 30
Summary 31
3. The Effects of Feature Scaling: From Bag-of-Words to Tf-Idf 33
Tf-Idf: A Simple Twist on Bag-of-Words 33
Feature Scaling 35
Min-max scaling 35
Standardization (variance scaling) 36
L2 normalization 37
Putting it to the Test 38
Creating a classification dataset 39
Implementing tf-idf and feature scaling 40
First try: plain logistic regression 42
Second try: logistic regression with regularization 43
Discussion of results 46
Deep Dive: What is Happening? 47