scikit-learn
With notes from Rudin, Hastie et al. and James et al.
If you want more in-depth explanations, see these materials, but focusing on your courses is plenty sufficient.
ISL (An Introduction to Statistical Learning) by James et al.: Friendlier introduction with plenty of code examples in R or Python, STA521 Textbook. Free Online Course.
ESL (The Elements of Statistical Learning) by Hastie et al.: More theoretical introduction.
IAML (Intuition for the Algorithms of Machine Learning) by Rudin: CS671 Textbook.
sklearn: Very well written tutorials that teach the library and machine learning concepts.
CO: Infinitely useful textbook about optimization.
scikit-learn
Supervised learning: the input includes what the output should look like.
sklearn Chapter 1, IAML Chapter 1.1, ISL Chapter 2
sklearn Chapter 1, IAML Chapter 1.1, ISL Chapter 4
Many supervised methods can be fit by optimizing an objective function! By changing the loss function (\(\ell\)) and the regularization function (\(R^{reg}\)), different models are made with different strengths and weaknesses.
\[ \min_f \sum_i \ell[f(x_i), y_i] + C\, R^{reg}(f)\]
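For a concrete instance (a standard example, not specific to these notes): with squared-error loss and an \(\ell_2\) penalty on the weights this objective is ridge regression, while swapping in an \(\ell_1\) penalty gives the lasso:
\[ \min_w \sum_i (y_i - x_i^\top w)^2 + C\|w\|_2^2 \qquad \text{vs.} \qquad \min_w \sum_i (y_i - x_i^\top w)^2 + C\|w\|_1 \]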
Unsupervised learning: deriving helpful output from the input.
sklearn Chapter 2.3, IAML Chapter 10, ISL Chapter 10, ESL 14.3, Interactive Spectral Clustering
ISL Chapter 10, IAML Chapter 8, ESL 14.5
Non-negative matrix factorization (ESL 14.6): \(\max_{\mathbf W, \mathbf H} \sum_i \sum_j [x_{ij} \log(\mathbf{WH})_{ij} - (\mathbf{WH})_{ij}]\)
Sparse principal components (ESL 14.5.5): \(\max_v \ v^\top (\mathbf X^\top \mathbf X) v, \ \text{s.t.}\ \sum_j |v_j| \leq t, \ \|v\|^2_2 = 1\)
You work for Amazon, and the company has noticed that the 5-star review system is flawed: two reviews may have the same number of stars but express very different sentiments about the product. With all the data on Amazon users and their reviews of products, they ask you to derive more intelligent information from reviews to help them suggest high-quality products.
You work for OpenAI (creators of ChatGPT) and need to classify lots of text data to help train the latest overhyped NLP model. They have several million documents, each of which needs to be classified into a type such as research paper, user review, novel, or transcription.
Picking your technologies or libraries may be important depending on application.
import csv

import numpy as np
import pandas as pd

websites = ["yelp", "imdb", "amazon_cells"]
reviews = pd.DataFrame(columns=['Review', 'Sentiment', 'Website'])
for website in websites:
    # Each file is tab-separated with a review and a 0/1 sentiment label
    data = pd.read_csv(f'./reviews/{website}_labelled.txt',
                       sep='\t',
                       names=["Review", "Sentiment"],
                       quoting=csv.QUOTE_NONE)
    data["Website"] = website
    reviews = pd.concat([reviews, data], axis=0)
reviews.reset_index(inplace=True, drop=True)
print(reviews.head(1))
print("Number of observations, columns:", reviews.shape)
Review Sentiment Website
0 Wow... Loved this place. 1 yelp
Number of observations, columns: (3000, 3)
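Several later snippets use a Cleaned column (lowercased text with punctuation stripped, judging by the printed reviews), but the cleaning step itself is not shown in this section. A minimal sketch of how such a column might be built, as an assumption rather than the original preprocessing:
# Hypothetical cleaning step (assumption): lowercase the text and drop punctuation
reviews["Cleaned"] = (reviews["Review"]
                      .str.lower()
                      .str.replace(r"[^\w\s]", "", regex=True)
                      .str.strip())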
Data is often not in an ideal form for using sklearn. sklearn has a preprocessing and feature extraction module ideal for modifying and creating features!
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
# Learn the vocabulary and build a sparse document-term count matrix
word_counts = count_vect.fit_transform(reviews["Cleaned"])
print(word_counts.shape)
print(list(count_vect.vocabulary_.items())[:10])
(3000, 5377)
[('wow', 5327), ('loved', 2808), ('this', 4738), ('place', 3499), ('crust', 1163), ('is', 2516), ('not', 3192), ('good', 2066), ('tasty', 4658), ('and', 234)]
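The fitted vectorizer can featurize new text with the same vocabulary; a small usage sketch (new_reviews is a made-up example, not data from this section):
# Hypothetical new reviews; words outside the fitted vocabulary are ignored
new_reviews = ["great phone and battery", "the plot was horrible"]
new_counts = count_vect.transform(new_reviews)
print(new_counts.shape)  # (2, 5377): same columns as the training matrix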
ISL Chapter 10.4, ESL Chapter 5.3
We want our model to generalize well to new predictions, so we should split the data into a training set for model building and a test set for final evaluation.
from sklearn.model_selection import train_test_split
# Not necessary for splitting but helpful for joining
# our model results with the original data frame
indices = np.arange(word_counts.shape[0])
# What happens if our test set is too big?
X_train, X_test, y_train, y_test, train_indices, test_indices = train_test_split(
word_counts, reviews["Website"], indices, test_size=0.3, random_state=0)
Logistic regression is a standard baseline for classification problems, so we should start there. We will use an \(\ell_1\) penalty to help with having more features than data points and to promote sparsity.
\[ \hat p(X_i) = \frac{1}{1 + \exp(-X_i w - w_0)}, \quad \min_w\ C \sum_{i=1}^n -[y_i \log(\hat p (X_i)) + (1 - y_i)\log(1 - \hat p(X_i))] + \|w\|_1 \]
from sklearn import linear_model
from sklearn.model_selection import GridSearchCV
logistic = linear_model.LogisticRegression(solver="liblinear", penalty="l1")
# Cross-validate over a log-spaced grid of C values;
# refit=True retrains on the full training set with the best C
cv = GridSearchCV(logistic,
                  [{"C": 2.0**np.arange(-6, 6)}],
                  scoring="accuracy",
                  refit=True)
cv.fit(X_train, y_train)
print(cv.best_params_)
print(cv.best_estimator_.coef_.shape)
print(cv.best_estimator_.classes_)
print((np.count_nonzero(cv.best_estimator_.coef_, axis=1)))
{'C': 4.0}
(3, 5377)
['amazon_cells' 'imdb' 'yelp']
[613 567 565]
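Because refit=True, cv acts as the best estimator and can be scored directly on the held-out test set; a usage sketch (the resulting number is not reproduced here):
# Accuracy of the refit best model on the held-out test set
print("Test accuracy:", cv.score(X_test, y_test))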
IAML Chapter 1.3, ESL Chapter 7, ISL Chapter 5.1, 6
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
log2_vals = np.arange(-10, 32)
coef_nonzero = []
test_accuracies = []
train_accuracies = []
x_axis_vals = []
kf = KFold(n_splits=5)
reindexed_y = np.array(y_train.to_list())
for fold_num, (train_index, test_index) in enumerate(kf.split(X_train)):
    for reg_strength in log2_vals:
        logistic = linear_model.LogisticRegression(
            solver="liblinear", penalty="l1", C=2.0 ** reg_strength)
        logistic.fit(X_train[train_index, :], reindexed_y[train_index])
        # Record accuracy on the held-out fold and on the data used for fitting
        test_accuracies.append(accuracy_score(
            logistic.predict(X_train[test_index]), reindexed_y[test_index]))
        train_accuracies.append(accuracy_score(
            logistic.predict(X_train[train_index]), reindexed_y[train_index]))
        x_axis_vals.append(reg_strength)
        coef_nonzero.append(np.count_nonzero(logistic.coef_, axis=1))
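The figure that belongs here is not reproduced in this text. A minimal sketch of how the collected accuracies could be plotted (assuming seaborn and matplotlib; not necessarily the original plotting code):
import matplotlib.pyplot as plt
import seaborn as sns

# Plot mean accuracy across folds for each regularization strength (assumed reconstruction of the figure)
acc_df = pd.DataFrame({
    "log2_C": x_axis_vals + x_axis_vals,
    "accuracy": test_accuracies + train_accuracies,
    "split": ["validation"] * len(test_accuracies) + ["train"] * len(train_accuracies),
})
sns.lineplot(data=acc_df, x="log2_C", y="accuracy", hue="split")
plt.xlabel("log2(C)")
plt.show()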
# Indices of the 20 largest coefficients per class (argpartition does not sort within the top 20)
most_important_words = np.argpartition(cv.best_estimator_.coef_, -20, axis=1)[:, -20:]
index_to_word = count_vect.get_feature_names_out()
for index, class_ in enumerate(cv.best_estimator_.classes_):
    print(class_)
    for word_index in most_important_words[index]:
        print(index_to_word[word_index],
              round(cv.best_estimator_.coef_[index, word_index], 3),
              end=", ")
    print("\n")
amazon_cells
voice 4.208, trouble 4.394, quality 4.405, works 4.664, reception 4.66, shipping 4.696, case 4.723, drop 5.617, colors 4.776, investment 5.947, headset 5.897, software 5.167, battery 6.153, plug 8.725, phone 8.103, sending 6.142, product 5.76, volume 5.653, ear 5.086, instructions 4.84,
imdb
scenes 4.376, films 4.394, god 4.583, whatever 4.67, movies 4.696, dialogue 4.707, watch 4.775, viewing 4.724, acting 4.847, movie 8.547, characters 5.364, film 8.449, character 6.033, game 5.47, plot 5.775, cast 5.395, ending 5.19, script 4.952, whiny 5.171, theater 6.039,
yelp
atmosphere 4.382, breakfast 4.562, pasta 4.841, pho 4.43, pizza 4.91, chicken 4.519, staff 4.694, chips 4.404, delicious 4.941, mid 4.945, eat 5.81, flavor 5.637, place 5.909, restaurant 5.946, check 5.112, sick 5.26, management 5.175, food 6.439, dish 6.53, meat 6.977,
# Use train dataset if trying to do model choice
probs = cv.predict_proba(X_train)
# "Difficult" reviews: the predicted class probabilities are nearly uniform
difficult_reviews = np.nonzero(np.std(probs, axis=1) < 0.2)
print(reviews.loc[train_indices[difficult_reviews]]["Cleaned"])
# "Uncertain" reviews: no single class gets even 50% probability
uncertain_reviews = np.nonzero(np.max(probs, axis=1) < 0.5)
print(reviews.loc[train_indices[uncertain_reviews]]["Cleaned"])
1837 dont waste your time
2201 all in all i think it was a good investment
1188 nothing at all to recommend
479 i loved it
1234 do not waste your time
2483 you wont regret it
1992 lange had become a great actress
1694 you wont regret it
265 plus its only 8 bucks
1585 not recommended
1407 i couldnt take them seriously
409 total waste of time
1133 all in all a great disappointment
400 this one is simply a disappointment
470 very good though
2783 it was horrible
2146 what a waste of time
1155 horrible
2930 never got it
Name: Cleaned, dtype: object
2037 poor talk time performance
1837 dont waste your time
1234 do not waste your time
1585 not recommended
2907 how stupid is that
400 this one is simply a disappointment
1155 horrible
Name: Cleaned, dtype: object
precision and recall, great for unbalanced classes
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support
train_conf_mat = confusion_matrix(y_train, cv.predict(X_train))
test_conf_mat = confusion_matrix(y_test, cv.predict(X_test))
test_precisions, test_recalls, test_fscores, _ = precision_recall_fscore_support(
y_test, cv.predict(X_test))
train_precisions, train_recalls, train_fscores, _ = precision_recall_fscore_support(
y_train, cv.predict(X_train))
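The same per-class precision, recall, and F-score numbers can also be printed as one table with sklearn's classification_report; a convenience sketch added here, not part of the original code:
from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 on the held-out test set
print(classification_report(y_test, cv.predict(X_test)))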
sklearn Chapter 3.3
prediction_probs = cv.predict_proba(word_counts)
reviews["Prediction"] = cv.predict(word_counts)
for index, website in enumerate(cv.best_estimator_.classes_):
reviews[f"prob_{website}"] = prediction_probs[:, index]
reviews["is_test"] = False
reviews["is_correct"] = reviews["Prediction"] == reviews["Website"]
reviews.loc[y_test.index, "is_test"] = True
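The plotting code below groups test reviews by a Lengths column that is not constructed in this section. A minimal sketch of how such a column might be defined (an assumption; the original may have computed or binned lengths differently):
# Hypothetical definition (assumption): number of words in each cleaned review
reviews["Lengths"] = reviews["Cleaned"].str.split().str.len()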
import seaborn as sns

test_set = reviews.query("is_test")
# Fraction of test reviews of each length that were classified correctly, per website
percent_correct_by_length = test_set.groupby(
    ["Lengths", "is_correct", "Website"]).agg("size") / test_set.groupby(
    ["Lengths", "Website"]).agg("size")
percent_correct_by_length = percent_correct_by_length.to_frame().reset_index()
percent_correct_by_length.rename(columns={0: "Freq"}, inplace=True)
g = sns.FacetGrid(percent_correct_by_length, col="Website", sharex=False)
g.map_dataframe(sns.histplot, x="Lengths",
                hue="is_correct",
                multiple="stack",
                weights="Freq",
                discrete=1)
See Lones 2023 for a great comprehensive overview.
See Intuition for the Algorithms of Machine Learning Chapter 1.1
\[ \min_{\boldsymbol \beta} \|\mathbf y - \mathbf X\boldsymbol \beta\|^2_2 \text{, subject to } \|\boldsymbol \beta\|_0 \leq k\]
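The \(\ell_0\) constraint makes this best-subset problem combinatorial; a standard convex surrogate (a general fact, not specific to these notes) replaces it with an \(\ell_1\) penalty, recovering the lasso:
\[ \min_{\boldsymbol \beta} \|\mathbf y - \mathbf X\boldsymbol \beta\|^2_2 + \lambda \|\boldsymbol \beta\|_1 \]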
DataFrame.corr. Plot it too!
ISL Chapter 3.6
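A minimal sketch of that advice, assuming a generic numeric DataFrame df rather than one defined in this section:
import pandas as pd
import seaborn as sns

# Hypothetical numeric DataFrame (assumption); compute pairwise correlations and plot them
df = pd.DataFrame({"x1": [1, 2, 3, 4], "x2": [2, 4, 6, 8], "x3": [5, 1, 4, 2]})
corr = df.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)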
Being immortal → Everyone is dead and the Sun is about to explode
More covariates for analysis → All my data points are significant
ISL Chapter 6.4.3, ESL Chapter 2.5, 18