Incredibly Fast Rundown of Machine Learning and scikit-learn

With notes from Rudin, Hastie et al. and James et al.

Sam Rosen

References throughout

If you want a deeper dive into this material, see the references cited throughout these notes, but focusing on your courses is more than sufficient.

Outline

  • What is Machine Learning?
  • Types of Machine Learning
    • Supervised
      • Classification
      • Regression
    • Unsupervised
      • Clustering
      • Dimensionality Reduction
  • Typical Machine Learning Pipeline w/ scikit-learn
  • Pitfalls

What is Machine Learning?

Supervised Learning

The input includes what the output should look like

Regression

Classification

Objective Functions

Many supervised methods can be fit by optimizing an objective function! By changing the loss function (\(\ell\)) and the regularization function (\(R^{reg}\)), different models are made with different strengths and weaknesses.

\[ \sum_i \ell[f(x_i), y_i] + C R^{reg}(f)\]

  • Least Squares: \(\min_w \| Xw - y\|^2_2\)
  • Ridge: \(\min_w \| Xw - y\|^2_2 + \alpha \|w\|^2_2\)
  • Lasso: \(\min_w \| Xw - y\|^2_2 + \alpha \|w\|_1\)
  • Elastic-Net: \(\min_w \| Xw - y\|^2_2 + \alpha \rho \|w\|_1 + \alpha(1-\rho)\|w\|^2_2\)
  • Logistic Regression: \(\hat p(X_i) = \frac{1}{1 + \exp(-X_i w - w_0)}, \quad \min_w C \sum_{i=1}^n [-y_i \log(\hat p (X_i)) - (1 - y_i)\log(1 - \hat p(X_i))]\)
  • (C-Support) SVM: \(\min_{w, b, \xi} \frac{1}{2}\|w\|^2_2 + C \sum_{i=1}^{l} \xi_i, \quad \text{s.t. } y_i(w^\top \phi(x_i) + b) \geq 1 - \xi_i, \ \xi_i \geq 0, \ i=1,\dots,l\)
  • Basic 2-layer Neural Network: \(f(\mathbf x) = \sigma\{\mathbf W_3^\top [\sigma(\mathbf W_2^\top\{\sigma[\mathbf W_1^\top \mathbf x + \mathbf b_1]\} + \mathbf b_2)] + b_3 \}\)
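
As a rough illustration of how changing the penalty changes the fit, the sketch below fits four of the linear objectives above with scikit-learn. The synthetic data, the `alpha` values, and all variable names are assumptions made for this example. scikit-learn also scales some of these objectives slightly differently (for instance, its Lasso divides the squared error by 2n), so the `alpha` here is not numerically identical to the one in the formulas.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

# Synthetic data (assumption): only the first 3 of 20 features matter
rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.normal(size=(n, p))
true_w = np.zeros(p)
true_w[:3] = [3.0, -2.0, 1.5]
y = X @ true_w + rng.normal(scale=0.5, size=n)

models = {
    "least squares": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
    "elastic net": ElasticNet(alpha=0.1, l1_ratio=0.5),
}

# Count how many coefficients each model leaves non-zero
nonzero_counts = {}
for name, model in models.items():
    model.fit(X, y)
    nonzero_counts[name] = int(np.count_nonzero(model.coef_))
    print(name, nonzero_counts[name])
```

The squared (\(\ell_2\)) penalty shrinks coefficients but rarely makes them exactly zero, while the \(\ell_1\) penalty tends to zero out features that carry little signal.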

Unsupervised

Deriving helpful output from the input

Clustering

Dimensionality Reduction

More Objective Functions

  • K-means: \(\min_{\mathbf c_1, \dots, \mathbf c_K} \sum_i \min_k[ \operatorname{dist}(\mathbf x_i, \mathbf c_k)]\)
  • Non-negative Matrix Factorization (ESL 14.6): \(\max_{\mathbf W, \mathbf H} \sum_i \sum_j [x_{ij} \log(\mathbf{WH})_{ij} - (\mathbf{WH})_{ij}]\)
  • Sparse Principal Components (ESL 14.5.5): \(\max_v \ v^\top (\mathbf X^\top \mathbf X) v, \ s.t.\ \sum_j |v_j| \leq t, \ \|v\|^2_2 = 1\)
  • Modularity (Undirected graphs): \(\max_\mu \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m}\right)\delta(\mu_i, \mu_j)\)
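
To make the K-means objective concrete, here is a minimal sketch that evaluates it by hand and compares it with the value scikit-learn reports as `inertia_`. It assumes squared Euclidean distance for \(\operatorname{dist}\), which is what scikit-learn's `KMeans` uses; the two-blob data set is made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs (synthetic data, an assumption for this example)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Squared distance from every point to every centre: shape (100, 2)
diffs = X[:, None, :] - kmeans.cluster_centers_[None, :, :]
sq_dists = (diffs ** 2).sum(axis=2)

# Objective: each point contributes its distance to the nearest centre
manual_objective = sq_dists.min(axis=1).sum()
print(manual_objective, kmeans.inertia_)
```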

Design Question 1

You work for Amazon and the company has noticed that the 5-star review system is flawed: two reviews may have the same number of stars but express very different sentiments about the product. Using all of Amazon's data on users and their product reviews, they ask you to derive more intelligent information from reviews to help them suggest high-quality products.

  • Clarification/Assumptions
  • Algorithms
  • Features
  • Evaluation

Design Question 2

You work for OpenAI (creators of ChatGPT), and need to classify lots of text data to help train the latest overhyped NLP model. They have several million documents, each of which needs to be assigned a type. Some types include: research paper, user review, novel, transcription.

  • Clarification/Assumptions
  • Algorithms
  • Features
  • Evaluation

Pipeline

Libraries

Picking your technologies or libraries may be important depending on application.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import csv
import sklearn

# For teaching purposes
pd.options.display.max_columns = None

Input, Output, Data Collection

Review Sentiment Dataset

websites = ["yelp", "imdb", "amazon_cells"]

# Read each site's tab-separated file and tag its rows with the source website
frames = []
for website in websites:
  data = pd.read_csv(f'./reviews/{website}_labelled.txt',
                     sep='\t',
                     names=["Review", "Sentiment"],
                     quoting=csv.QUOTE_NONE)
  data["Website"] = website
  frames.append(data)

# Concatenate once; ignore_index gives a fresh 0..n-1 index
reviews = pd.concat(frames, ignore_index=True)

print(reviews.head(1))
print("Number of observations, columns:", reviews.shape)
                     Review Sentiment Website
0  Wow... Loved this place.         1    yelp
Number of observations, columns: (3000, 3)
  • You need data to do data analysis! This may be collected from web scraping, APIs, data repositories, etc.

Data Cleaning

Data is often not in an ideal form for using sklearn.

# Remove punctuation and make all letters lower case
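# regex=True is required: recent pandas versions treat the pattern as a literal string by default
reviews["Cleaned"] = reviews["Review"].str.replace(r'[^\w\s]', '', regex=True)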
reviews["Cleaned"] = reviews["Cleaned"].str.lower()
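
A quick sanity check of this cleaning step on two made-up strings (the sample text is an assumption, not taken from the dataset). Passing `regex=True` makes pandas treat the pattern as a regular expression rather than a literal string.

```python
import pandas as pd

sample = pd.Series(["Wow... Loved this place.", "Don't waste your time!"])

# Drop everything that is not a word character or whitespace, then lower-case
cleaned = sample.str.replace(r"[^\w\s]", "", regex=True).str.lower()
print(cleaned.tolist())
```

Note that apostrophes are removed too, so "Don't" becomes "dont", which matches the cleaned reviews printed later in these notes.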

Exploratory Data Analysis

reviews["Lengths"] = reviews["Review"].apply(lambda review: len(review.split()))

sns.displot(reviews, x="Lengths", hue="Website", element="step")

Feature Engineering

  • sklearn has a preprocessing and feature extraction module ideal for modifying and creating features!
  • Let’s create a feature for every word in the dataset, describing how often it appears in each review. This is a “bag-of-words” representation.
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()

word_counts = count_vect.fit_transform(reviews["Cleaned"])
print(word_counts.shape)
print(list(count_vect.vocabulary_.items())[:10])
(3000, 5377)
[('wow', 5327), ('loved', 2808), ('this', 4738), ('place', 3499), ('crust', 1163), ('is', 2516), ('not', 3192), ('good', 2066), ('tasty', 4658), ('and', 234)]

Splitting into Training and Testing

We want our model to generalize well to new data, so we should split the data into a training set for building the model and a test set reserved for final evaluation.

from sklearn.model_selection import train_test_split

# Not necessary for splitting but helpful for joining
# our model results with the original data frame
indices = np.arange(word_counts.shape[0])

# What happens if our test set is too big?
X_train, X_test, y_train, y_test, train_indices, test_indices = train_test_split(
   word_counts, reviews["Website"], indices, test_size=0.3, random_state=0)
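
The index trick above can be checked on a toy array. This is a small sketch with made-up data: `train_test_split` shuffles every array passed to it in the same way, so the returned indices point back to the original rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumption): 10 rows, 2 features
X = np.arange(20).reshape(10, 2)
y = np.array(["a"] * 5 + ["b"] * 5)
indices = np.arange(X.shape[0])

X_tr, X_te, y_tr, y_te, idx_tr, idx_te = train_test_split(
    X, y, indices, test_size=0.3, random_state=0)

# The indices recover exactly the rows that landed in each split
print(len(idx_tr), len(idx_te))
print(np.array_equal(X[idx_te], X_te))
```

A larger `test_size` gives a more stable estimate of test performance but leaves fewer rows for fitting, which is the trade-off hinted at in the comment above.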

Parameter Tuning

Logistic Regression is a standard baseline for classification problems, and we should start basic. We will use an \(\ell_1\) penalty to cope with having more features than data points and to promote sparsity.

\[ \hat p(X_i) = \frac{1}{1 + \exp(-X_i w - w_0)}, \quad \min_w C \sum_{i=1}^n [-y_i \log(\hat p (X_i)) - (1 - y_i)\log(1 - \hat p(X_i))] + \|w\|_1 \]

from sklearn import linear_model
from sklearn.model_selection import GridSearchCV
 
logistic = linear_model.LogisticRegression(solver="liblinear", penalty="l1")
cv = GridSearchCV(logistic,
                  [{"C": 2.0**np.arange(-6, 6)}],
                  scoring="accuracy",
                  refit=True)
cv.fit(X_train, y_train)

print(cv.best_params_)
print(cv.best_estimator_.coef_.shape)
print(cv.best_estimator_.classes_)
print((np.count_nonzero(cv.best_estimator_.coef_, axis=1)))
{'C': 4.0}
(3, 5377)
['amazon_cells' 'imdb' 'yelp']
[615 570 563]

Aside: Regularization Strength

from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

log2_vals = np.arange(-10, 32)
coef_nonzero = []
test_accuracies = []
train_accuracies = []
x_axis_vals = []
kf = KFold(n_splits=5)
reindexed_y = np.array(y_train.to_list())
for fold_num, (train_index, test_index) in enumerate(kf.split(X_train)):  
  for reg_strength in log2_vals:
    logistic = linear_model.LogisticRegression(
      solver="liblinear", penalty="l1", C=2.0 ** reg_strength)
    logistic.fit(X_train[train_index, :], reindexed_y[train_index])

    # accuracy_score expects (y_true, y_pred)
    test_accuracies.append(accuracy_score(reindexed_y[test_index], logistic.predict(X_train[test_index])))
    train_accuracies.append(accuracy_score(reindexed_y[train_index], logistic.predict(X_train[train_index])))
    x_axis_vals.append(reg_strength)
    coef_nonzero.append(np.count_nonzero(logistic.coef_, axis=1))
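
The effect being measured here can be reproduced on a smaller scale. The sketch below uses a synthetic binary problem (the data set and the grid of `C` values are assumptions for illustration) and counts the non-zero coefficients as `C` grows. In scikit-learn, `C` is the inverse of the regularization strength, so a small `C` means a strong \(\ell_1\) penalty.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem (assumption)
X, y = make_classification(n_samples=200, n_features=30,
                           n_informative=5, random_state=0)

log2_C = np.arange(-10, 11, 4)   # C = 2^-10, 2^-6, ..., 2^10
nonzero_by_C = []
for exponent in log2_C:
    model = LogisticRegression(solver="liblinear", penalty="l1",
                               C=2.0 ** exponent)
    model.fit(X, y)
    nonzero_by_C.append(int(np.count_nonzero(model.coef_)))

for exponent, count in zip(log2_C, nonzero_by_C):
    print(f"C = 2^{exponent}: {count} non-zero coefficients")
```

With a very small `C` the penalty dominates and every coefficient is driven to zero; as `C` grows, more features enter the model.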

Examine the Fit

most_important_words = np.argpartition(cv.best_estimator_.coef_, -20, axis=1)[:, -20:]

index_to_word = count_vect.get_feature_names_out()

for index, class_ in enumerate(cv.best_estimator_.classes_):
  print(class_)
  for word_index in most_important_words[index]:
    print(index_to_word[word_index], 
          round(cv.best_estimator_.coef_[index, word_index], 3), 
          end=", ")
  print("\n")
amazon_cells
voice 4.208, case 4.723, shipping 4.696, works 4.663, quality 4.406, reception 4.66, trouble 4.394, colors 4.776, sending 6.143, volume 5.653, software 5.167, headset 5.898, instructions 4.841, product 5.759, drop 5.617, investment 5.947, ear 5.086, battery 6.153, plug 8.768, phone 8.102, 

imdb
scenes 4.375, films 4.394, god 4.585, whatever 4.67, dialogue 4.707, viewing 4.724, watch 4.775, movies 4.696, acting 4.847, characters 5.364, film 8.45, theater 6.039, script 4.952, character 6.034, plot 5.775, game 5.47, movie 8.547, whiny 5.171, cast 5.394, ending 5.19, 

yelp
atmosphere 4.382, chips 4.403, pho 4.429, delicious 4.941, staff 4.695, pasta 4.847, chicken 4.521, pizza 4.912, breakfast 4.561, mid 4.946, check 5.11, flavor 5.637, management 5.176, sick 5.26, eat 5.812, place 5.91, meat 6.977, dish 6.53, restaurant 5.945, food 6.44, 
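
The printed words are not in order of coefficient size because `np.argpartition` only guarantees that the last k positions hold the k largest values, not that they are sorted. It also ranks by signed value, so large negative coefficients are not selected. A small sketch with made-up coefficients shows both points, plus one way to sort the result:

```python
import numpy as np

# One row of made-up coefficients (assumption); -3.0 is large in magnitude
coefs = np.array([[0.1, 2.0, -3.0, 0.5, 1.5]])

# Indices of the 2 largest values per row, in no particular order
top2 = np.argpartition(coefs, -2, axis=1)[:, -2:]

# Optional: order those indices by descending coefficient value
top2_values = np.take_along_axis(coefs, top2, axis=1)
order = np.argsort(-top2_values, axis=1)
top2_sorted = np.take_along_axis(top2, order, axis=1)
print(top2_sorted)
```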

Examine some Predictions

# Use train dataset if trying to do model choice
probs = cv.predict_proba(X_train)

difficult_reviews = np.nonzero(np.std(probs, axis=1) < 0.2)
print(reviews.loc[train_indices[difficult_reviews]]["Cleaned"])

uncertain_reviews = np.nonzero(np.max(probs, axis=1) < 0.5)
print(reviews.loc[train_indices[uncertain_reviews]]["Cleaned"])
1837                         dont waste your time  
2201    all in all i think it was a good investment
1188                  nothing at all to recommend  
479                                      i loved it
1234                       do not waste your time  
2483                             you wont regret it
1992             lange had become a great actress  
1694                           you wont regret it  
265                           plus its only 8 bucks
1585                              not recommended  
1407                i couldnt take them seriously  
409                             total waste of time
1133            all in all a great disappointment  
400             this one is simply a disappointment
470                                very good though
2783                                it was horrible
2146                           what a waste of time
1155                                     horrible  
2930                                   never got it
Name: Cleaned, dtype: object
2037             poor talk time performance
1837                 dont waste your time  
1234               do not waste your time  
1585                      not recommended  
2907                     how stupid is that
400     this one is simply a disappointment
1155                             horrible  
Name: Cleaned, dtype: object

Evaluate

  • Accuracy: the simplest metric, the percentage of predictions that are correct
  • Precision: given that I predict class \(x\), the probability that I am correct
  • Recall: given that the true class is \(x\), the probability that I predict \(x\)
  • F-score: the harmonic mean of precision and recall, useful for unbalanced classes
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

train_conf_mat = confusion_matrix(y_train, cv.predict(X_train))
test_conf_mat = confusion_matrix(y_test, cv.predict(X_test))

test_precisions, test_recalls, test_fscores, _ = precision_recall_fscore_support(
  y_test, cv.predict(X_test))

train_precisions, train_recalls, train_fscores, _ = precision_recall_fscore_support(
  y_train, cv.predict(X_train))
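
To connect these metrics to the confusion matrix, here is a minimal worked example on made-up labels (the labels are an assumption for illustration). In scikit-learn's confusion matrix, rows are the true classes and columns are the predicted classes, so precision divides the diagonal by the column sums and recall divides it by the row sums.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

y_true = np.array(["a", "a", "a", "b", "b", "c", "c", "c", "c", "c"])
y_pred = np.array(["a", "a", "b", "b", "c", "c", "c", "c", "a", "c"])

conf = confusion_matrix(y_true, y_pred)   # rows = truth, columns = prediction
true_pos = np.diag(conf)

precision = true_pos / conf.sum(axis=0)   # correct among predicted as each class
recall = true_pos / conf.sum(axis=1)      # correct among truly each class
f_score = 2 * precision * recall / (precision + recall)

# Same quantities from scikit-learn, one value per class
sk_precision, sk_recall, sk_f, _ = precision_recall_fscore_support(y_true, y_pred)
print(conf)
print(precision, recall, f_score)
```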

Does a longer review help?

prediction_probs = cv.predict_proba(word_counts)
reviews["Prediction"] = cv.predict(word_counts)

for index, website in enumerate(cv.best_estimator_.classes_):
  reviews[f"prob_{website}"] = prediction_probs[:, index]

reviews["is_test"] = False
reviews["is_correct"] = reviews["Prediction"] == reviews["Website"]
reviews.loc[y_test.index, "is_test"] = True
test_set = reviews.query("is_test")
percent_correct_by_length = test_set.groupby(
  ["Lengths", "is_correct", "Website"]).agg("size") / test_set.groupby(
    ["Lengths", "Website"]).agg("size")
percent_correct_by_length = percent_correct_by_length.to_frame().reset_index()
percent_correct_by_length.rename(columns={0: "Freq"}, inplace=True)

g = sns.FacetGrid(percent_correct_by_length, col="Website", sharex=False)
g.map_dataframe(sns.histplot, x="Lengths",
    hue="is_correct",
    multiple="stack",
    weights="Freq",
    discrete=1)

Pitfalls

See Lones 2023 for a great comprehensive overview.

Overfitting

  • “All models are wrong, but some are useful” - George Box
  • It’s important for your models to generalize, and not just memorize the training data set
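
Memorization is easy to demonstrate. In the sketch below the labels are pure noise, so there is nothing to learn, yet an unpruned decision tree still fits the training set perfectly. The decision tree and the synthetic data are assumptions chosen for illustration; they are not part of the pipeline above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
y = rng.integers(0, 2, size=400)   # labels are independent of X

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# A fully grown tree can carve out one leaf per training point
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
train_acc = tree.score(X_tr, y_tr)
test_acc = tree.score(X_te, y_te)
print(train_acc, test_acc)
```

Training accuracy is perfect while test accuracy stays near chance, which is why the final evaluation must use data the model has never seen.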

Bad Data / Fitting Artifacts

  • You may need lots of data to produce a good model particularly if data is noisy
  • Sometimes machine learning algorithms can give arbitrary results if an algorithm used is not the right choice for the data (see applying t-SNE to Normal data)

Fitting Issues

  • You can have the perfect model choice for your data, but struggle to fit it!
  • Vanishing Gradient is common with Neural Networks
  • Optimization can be hard (NP-hard, to be precise). Example: Sparse Ridge Regression

\[ \min_{\boldsymbol \beta} \|\mathbf y - \mathbf X\boldsymbol \beta\|^2_2 + \lambda_2 \|\boldsymbol \beta\|^2_2\text{, subject to } \|\boldsymbol \beta\|_0 \leq k\]

  • Difficulty sampling a distribution (see STA602L)
  • Sensitivity to initial values
  • Too many hyperparameters: each added hyperparameter multiplies the number of candidate fits, so the search space grows exponentially
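
The growth in candidate fits can be counted directly with scikit-learn's `ParameterGrid`. In this sketch the parameter names are hypothetical placeholders and each one takes 5 candidate values.

```python
from sklearn.model_selection import ParameterGrid

values_per_param = 5
sizes = {}
for n_params in range(1, 5):
    # Hypothetical hyperparameters param_0, param_1, ... with 5 values each
    grid = {f"param_{i}": list(range(values_per_param))
            for i in range(n_params)}
    sizes[n_params] = len(ParameterGrid(grid))

print(sizes)
```

With cross-validation, each of those combinations is also fit once per fold, so the cost multiplies again.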

Feature Correlation / Collinearity

  • Many machine learning algorithms were developed just to deal with correlated features. Likewise, many algorithms have been proven to fail if the features are too correlated.
  • Check if your features are correlated using DataFrame.corr. Plot it too!
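
A minimal sketch of the correlation check, using a made-up data frame in which `x2` is nearly a copy of `x1` and `x3` is independent (column names and data are assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=500)
df = pd.DataFrame({
    "x1": base,
    "x2": base + rng.normal(scale=0.1, size=500),  # nearly a copy of x1
    "x3": rng.normal(size=500),                    # independent of the others
})

corr = df.corr()   # pairwise Pearson correlations
print(corr.round(2))
```

To plot it, passing `corr` to a heatmap function such as `sns.heatmap` is one common option.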

Curse of Dimensionality

  • Being immortal → Everyone is dead and the Sun is about to explode

  • More covariates for analysis → All my data points are significant

  • Nice write-up
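
One concrete symptom of the curse of dimensionality is that distances concentrate: in high dimensions the nearest and farthest points from a query are almost equally far away. The sketch below measures this on uniform random data (the data, sample sizes, and the "contrast" measure are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
contrast = {}
for dim in (2, 10, 100, 1000):
    points = rng.uniform(size=(500, dim))
    query = rng.uniform(size=dim)
    dists = np.linalg.norm(points - query, axis=1)
    # Relative gap between the farthest and the nearest point
    contrast[dim] = (dists.max() - dists.min()) / dists.min()

for dim, value in contrast.items():
    print(dim, round(float(value), 3))
```

As the contrast shrinks, methods that rely on "nearest" neighbours or distance-based clusters have less and less to work with.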
