scikit-learn
With notes from Rudin, Hastie et al. and James et al.
If you want more in-depth explanations, see these materials, but focusing on your courses is plenty sufficient.
ISL (An Introduction to Statistical Learning) by James et al.: Friendlier introduction with plenty of code examples in R or Python, STA521 Textbook. Free Online Course.
ESL (The Elements of Statistical Learning) by Hastie et al.: More theoretical introduction.
IAML (Intuition for the Algorithms of Machine Learning) by Rudin: CS671 Textbook.
sklearn: Very well written tutorials that teach the library and machine learning concepts.
CO: Infinitely useful textbook about optimization.
scikit-learn
Supervised learning: the input includes what the output should look like.
sklearn Chapter 1, IAML Chapter 1.1, ISL Chapter 2
sklearn Chapter 1, IAML Chapter 1.1, ISL Chapter 4
Many supervised methods can be fit by optimizing an objective function! By changing the loss function (\(\ell\)) and the regularization function (\(R^{reg}\)), different models are made with different strengths and weaknesses.
\[ \min_f \sum_i \ell[f(x_i), y_i] + C\, R^{reg}(f)\]
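For a concrete instance (a standard example, not specific to these notes): with squared-error loss and an \(\ell_2\) penalty on the weights this objective is ridge regression, while swapping in an \(\ell_1\) penalty gives the lasso:
\[ \min_w \sum_i (y_i - x_i^\top w)^2 + C\|w\|_2^2 \qquad \text{vs.} \qquad \min_w \sum_i (y_i - x_i^\top w)^2 + C\|w\|_1 \]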
Unsupervised learning: deriving helpful output from the input.
sklearn Chapter 2.3, IAML Chapter 10, ISL Chapter 10, ESL 14.3, Interactive Spectral Clustering
ISL Chapter 10, IAML Chapter 8, ESL 14.5
Non-negative matrix factorization (ESL 14.6): \(\max_{\mathbf W, \mathbf H} \sum_i \sum_j [x_{ij} \log(\mathbf{WH})_{ij} - (\mathbf{WH})_{ij}]\)
Sparse principal components (ESL 14.5.5): \(\max_v \ v^\top (\mathbf X^\top \mathbf X) v, \ \text{s.t.}\ \sum_j |v_j| \leq t, \ \|v\|^2_2 = 1\)
You work for Amazon, and the company has noticed that the 5-star review system is flawed: two reviews may have the same number of stars but express very different sentiments about the product. With all the data on Amazon users and their reviews of products, they ask you to derive more intelligent information from reviews to help them suggest high-quality products.
You work for OpenAI (creators of ChatGPT) and need to classify lots of text data to help train the latest overhyped NLP model. They have several million documents, each of which needs to be classified into a type such as research paper, user review, novel, or transcription.
Picking your technologies or libraries may be important depending on application.
import csv

import numpy as np
import pandas as pd

websites = ["yelp", "imdb", "amazon_cells"]
reviews = pd.DataFrame(columns=['Review', 'Sentiment', 'Website'])
for website in websites:
    # Each file is tab-separated with a review and a 0/1 sentiment label
    data = pd.read_csv(f'./reviews/{website}_labelled.txt',
                       sep='\t',
                       names=["Review", "Sentiment"],
                       quoting=csv.QUOTE_NONE)
    data["Website"] = website
    reviews = pd.concat([reviews, data], axis=0)
reviews.reset_index(inplace=True, drop=True)
print(reviews.head(1))
print("Number of observations, columns:", reviews.shape)
Review Sentiment Website
0 Wow... Loved this place. 1 yelp
Number of observations, columns: (3000, 3)
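Several later snippets use a Cleaned column (lowercased text with punctuation stripped, judging by the printed reviews), but the cleaning step itself is not shown in this section. A minimal sketch of how such a column might be built, as an assumption rather than the original preprocessing:
# Hypothetical cleaning step (assumption): lowercase the text and drop punctuation
reviews["Cleaned"] = (reviews["Review"]
                      .str.lower()
                      .str.replace(r"[^\w\s]", "", regex=True)
                      .str.strip())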
Data is often not in an ideal form for using sklearn. sklearn has a preprocessing and feature extraction module ideal for modifying and creating features!
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
# Learn the vocabulary and build a sparse document-term count matrix
word_counts = count_vect.fit_transform(reviews["Cleaned"])
print(word_counts.shape)
print(list(count_vect.vocabulary_.items())[:10])
(3000, 5377)
[('wow', 5327), ('loved', 2808), ('this', 4738), ('place', 3499), ('crust', 1163), ('is', 2516), ('not', 3192), ('good', 2066), ('tasty', 4658), ('and', 234)]
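The fitted vectorizer can featurize new text with the same vocabulary; a small usage sketch (new_reviews is a made-up example, not data from this section):
# Hypothetical new reviews; words outside the fitted vocabulary are ignored
new_reviews = ["great phone and battery", "the plot was horrible"]
new_counts = count_vect.transform(new_reviews)
print(new_counts.shape)  # (2, 5377): same columns as the training matrix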
ISL Chapter 10.4, ESL Chapter 5.3
We want our model to generalize well to new predictions, so we should split the data into a training set for model building and a test set for final evaluation.
from sklearn.model_selection import train_test_split
# Not necessary for splitting but helpful for joining
# our model results with the original data frame
indices = np.arange(word_counts.shape[0])
# What happens if our test set is too big?
X_train, X_test, y_train, y_test, train_indices, test_indices = train_test_split(
word_counts, reviews["Website"], indices, test_size=0.3, random_state=0)
Logistic regression is a standard baseline for classification problems, so we should start there. We will use an \(\ell_1\) penalty to help with having more features than data points and to promote sparsity.
\[ \hat p(X_i) = \frac{1}{1 + \exp(-X_i w - w_0)}, \quad \min_w\ C \sum_{i=1}^n -[y_i \log(\hat p (X_i)) + (1 - y_i)\log(1 - \hat p(X_i))] + \|w\|_1 \]
from sklearn import linear_model
from sklearn.model_selection import GridSearchCV
logistic = linear_model.LogisticRegression(solver="liblinear", penalty="l1")
# Cross-validate over a log-spaced grid of C values;
# refit=True retrains on the full training set with the best C
cv = GridSearchCV(logistic,
                  [{"C": 2.0**np.arange(-6, 6)}],
                  scoring="accuracy",
                  refit=True)
cv.fit(X_train, y_train)
print(cv.best_params_)
print(cv.best_estimator_.coef_.shape)
print(cv.best_estimator_.classes_)
print((np.count_nonzero(cv.best_estimator_.coef_, axis=1)))
{'C': 4.0}
(3, 5377)
['amazon_cells' 'imdb' 'yelp']
[613 567 565]
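Because refit=True, cv acts as the best estimator and can be scored directly on the held-out test set; a usage sketch (the resulting number is not reproduced here):
# Accuracy of the refit best model on the held-out test set
print("Test accuracy:", cv.score(X_test, y_test))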
IAML Chapter 1.3, ESL Chapter 7, ISL Chapter 5.1, 6
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
log2_vals = np.arange(-10, 32)
coef_nonzero = []
test_accuracies = []
train_accuracies = []
x_axis_vals = []
kf = KFold(n_splits=5)
reindexed_y = np.array(y_train.to_list())
for fold_num, (train_index, test_index) in enumerate(kf.split(X_train)):
    for reg_strength in log2_vals:
        logistic = linear_model.LogisticRegression(
            solver="liblinear", penalty="l1", C=2.0 ** reg_strength)
        logistic.fit(X_train[train_index, :], reindexed_y[train_index])
        # Record accuracy on the held-out fold and on the data used for fitting
        test_accuracies.append(accuracy_score(
            logistic.predict(X_train[test_index]), reindexed_y[test_index]))
        train_accuracies.append(accuracy_score(
            logistic.predict(X_train[train_index]), reindexed_y[train_index]))
        x_axis_vals.append(reg_strength)
        coef_nonzero.append(np.count_nonzero(logistic.coef_, axis=1))
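The figure that belongs here is not reproduced in this text. A minimal sketch of how the collected accuracies could be plotted (assuming seaborn and matplotlib; not necessarily the original plotting code):
import matplotlib.pyplot as plt
import seaborn as sns

# Plot mean accuracy across folds for each regularization strength (assumed reconstruction of the figure)
acc_df = pd.DataFrame({
    "log2_C": x_axis_vals + x_axis_vals,
    "accuracy": test_accuracies + train_accuracies,
    "split": ["validation"] * len(test_accuracies) + ["train"] * len(train_accuracies),
})
sns.lineplot(data=acc_df, x="log2_C", y="accuracy", hue="split")
plt.xlabel("log2(C)")
plt.show()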
# Indices of the 20 largest coefficients per class (argpartition does not sort within the top 20)
most_important_words = np.argpartition(cv.best_estimator_.coef_, -20, axis=1)[:, -20:]
index_to_word = count_vect.get_feature_names_out()
for index, class_ in enumerate(cv.best_estimator_.classes_):
    print(class_)
    for word_index in most_important_words[index]:
        print(index_to_word[word_index],
              round(cv.best_estimator_.coef_[index, word_index], 3),
              end=", ")
    print("\n")
amazon_cells
voice 4.208, trouble 4.394, quality 4.405, works 4.664, reception 4.66, shipping 4.696, case 4.723, drop 5.617, colors 4.776, investment 5.947, headset 5.897, software 5.167, battery 6.153, plug 8.725, phone 8.103, sending 6.142, product 5.76, volume 5.653, ear 5.086, instructions 4.84,
imdb
scenes 4.376, films 4.394, god 4.583, whatever 4.67, movies 4.696, dialogue 4.707, watch 4.775, viewing 4.724, acting 4.847, movie 8.547, characters 5.364, film 8.449, character 6.033, game 5.47, plot 5.775, cast 5.395, ending 5.19, script 4.952, whiny 5.171, theater 6.039,
yelp
atmosphere 4.382, breakfast 4.562, pasta 4.841, pho 4.43, pizza 4.91, chicken 4.519, staff 4.694, chips 4.404, delicious 4.941, mid 4.945, eat 5.81, flavor 5.637, place 5.909, restaurant 5.946, check 5.112, sick 5.26, management 5.175, food 6.439, dish 6.53, meat 6.977,
# Use train dataset if trying to do model choice
probs = cv.predict_proba(X_train)
# "Difficult" reviews: the predicted class probabilities are nearly uniform
difficult_reviews = np.nonzero(np.std(probs, axis=1) < 0.2)
print(reviews.loc[train_indices[difficult_reviews]]["Cleaned"])
# "Uncertain" reviews: no single class gets even 50% probability
uncertain_reviews = np.nonzero(np.max(probs, axis=1) < 0.5)
print(reviews.loc[train_indices[uncertain_reviews]]["Cleaned"])
1837 dont waste your time
2201 all in all i think it was a good investment
1188 nothing at all to recommend
479 i loved it
1234 do not waste your time
2483 you wont regret it
1992 lange had become a great actress
1694 you wont regret it
265 plus its only 8 bucks
1585 not recommended
1407 i couldnt take them seriously
409 total waste of time
1133 all in all a great disappointment
400 this one is simply a disappointment
470 very good though
2783 it was horrible
2146 what a waste of time
1155 horrible
2930 never got it
Name: Cleaned, dtype: object
2037 poor talk time performance
1837 dont waste your time
1234 do not waste your time
1585 not recommended
2907 how stupid is that
400 this one is simply a disappointment
1155 horrible
Name: Cleaned, dtype: object
precision and recall, great for unbalanced classes
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support
train_conf_mat = confusion_matrix(y_train, cv.predict(X_train))
test_conf_mat = confusion_matrix(y_test, cv.predict(X_test))
test_precisions, test_recalls, test_fscores, _ = precision_recall_fscore_support(
y_test, cv.predict(X_test))
train_precisions, train_recalls, train_fscores, _ = precision_recall_fscore_support(
y_train, cv.predict(X_train))
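The same per-class precision, recall, and F-score numbers can also be printed as one table with sklearn's classification_report; a convenience sketch added here, not part of the original code:
from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 on the held-out test set
print(classification_report(y_test, cv.predict(X_test)))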
sklearn Chapter 3.3
prediction_probs = cv.predict_proba(word_counts)
reviews["Prediction"] = cv.predict(word_counts)
for index, website in enumerate(cv.best_estimator_.classes_):
reviews[f"prob_{website}"] = prediction_probs[:, index]
reviews["is_test"] = False
reviews["is_correct"] = reviews["Prediction"] == reviews["Website"]
reviews.loc[y_test.index, "is_test"] = True
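The plotting code below groups test reviews by a Lengths column that is not constructed in this section. A minimal sketch of how such a column might be defined (an assumption; the original may have computed or binned lengths differently):
# Hypothetical definition (assumption): number of words in each cleaned review
reviews["Lengths"] = reviews["Cleaned"].str.split().str.len()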
import seaborn as sns

test_set = reviews.query("is_test")
# Fraction of test reviews of each length that were classified correctly, per website
percent_correct_by_length = test_set.groupby(
    ["Lengths", "is_correct", "Website"]).agg("size") / test_set.groupby(
    ["Lengths", "Website"]).agg("size")
percent_correct_by_length = percent_correct_by_length.to_frame().reset_index()
percent_correct_by_length.rename(columns={0: "Freq"}, inplace=True)
g = sns.FacetGrid(percent_correct_by_length, col="Website", sharex=False)
g.map_dataframe(sns.histplot, x="Lengths",
                hue="is_correct",
                multiple="stack",
                weights="Freq",
                discrete=1)
See Lones 2023 for a great comprehensive overview.
See Intuition for the Algorithms of Machine Learning Chapter 1.1
\[ \min_{\boldsymbol \beta} \|\mathbf y - \mathbf X\boldsymbol \beta\|^2_2 \text{, subject to } \|\boldsymbol \beta\|_0 \leq k\]
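The \(\ell_0\) constraint makes this best-subset problem combinatorial; a standard convex surrogate (a general fact, not specific to these notes) replaces it with an \(\ell_1\) penalty, recovering the lasso:
\[ \min_{\boldsymbol \beta} \|\mathbf y - \mathbf X\boldsymbol \beta\|^2_2 + \lambda \|\boldsymbol \beta\|_1 \]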
DataFrame.corr. Plot it too!
ISL Chapter 3.6
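A minimal sketch of that advice, assuming a generic numeric DataFrame df rather than one defined in this section:
import pandas as pd
import seaborn as sns

# Hypothetical numeric DataFrame (assumption); compute pairwise correlations and plot them
df = pd.DataFrame({"x1": [1, 2, 3, 4], "x2": [2, 4, 6, 8], "x3": [5, 1, 4, 2]})
corr = df.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)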
Being immortal → Everyone is dead and the Sun is about to explode
More covariates for analysis → All my data points are significant
ISL Chapter 6.4.3, ESL Chapter 2.5, 18