Statistical Machine Learning by using Jupyter Notebook Problem

17 Aug Statistical Machine Learning by using Jupyter Notebook Problem

Posted at 14:28h in Computer Science by

For this HW, we will pretend that we only have 1000 images of the MNIST dataset.

We will begin with what we had in our videos:

#commonly used imports
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

#get data from sklearn
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False)

#get the attributes and labels
X, y = mnist["data"], mnist["target"]

#convert labels from characters to numbers
y = y.astype(np.uint8)

#split into training and test sets
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

Suppose we decide on a Decision Tree Classifier (covered later in Chapter 6). There are two hyperparameters we will fine-tune for now: max_leaf_nodes and min_samples_split. Let’s fine-tune these using the following code.

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

#setup the decision tree
dt_clf = DecisionTreeClassifier(random_state=30)

#the hyperparameters to search through
params = {'max_leaf_nodes': list(range(2, 100)), 'min_samples_split': [3, 4, 5]}

#initialize the GridSearch with 3-fold cross validation
grid_search_cv = GridSearchCV(dt_clf, params, verbose=1, cv=3)

#do the search (but only on the first 1000 images)
grid_search_cv.fit(X_train[:1000], y_train[:1000])
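
To get a feel for how much work this search does, the grid size can be counted directly. The following is a small sketch using only the standard library; the grid is copied from the code above, and the names `cv_folds`, `candidates`, `n_candidates`, and `n_cv_fits` are illustrative.

```python
from itertools import product

# Same grid as in the search above.
params = {'max_leaf_nodes': list(range(2, 100)), 'min_samples_split': [3, 4, 5]}
cv_folds = 3

# Every combination of the two hyperparameters is one candidate.
candidates = list(product(*params.values()))
n_candidates = len(candidates)

# Each candidate is fit once per fold. With the default refit=True,
# GridSearchCV then fits the winning candidate once more on all the data.
n_cv_fits = n_candidates * cv_folds

print(n_candidates, "candidates,", n_cv_fits, "cross-validation fits")
# prints: 294 candidates, 882 cross-validation fits
```

Because verbose=1 is set, the search should report these same counts when it starts.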

What are the best hyperparameters?
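
A fitted GridSearchCV object exposes the winning combination through its `best_params_` and `best_score_` attributes. The sketch below shows how to read them. To run quickly and without a download, it assumes the small 8x8 digits set bundled with scikit-learn as a stand-in for MNIST, and a deliberately tiny grid; `X_small`, `y_small`, `small_params`, and `search` are illustrative names, not part of the assignment code.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled 8x8 digits stand in for the first 1000 MNIST images.
X_small, y_small = load_digits(return_X_y=True)
X_small, y_small = X_small[:500], y_small[:500]

# Assumption: a much smaller grid than the assignment's, only to keep the run short.
small_params = {'max_leaf_nodes': [10, 20, 40], 'min_samples_split': [3, 4, 5]}

search = GridSearchCV(DecisionTreeClassifier(random_state=30), small_params, cv=3)
search.fit(X_small, y_small)

# best_params_ is a dict holding the winning value of each hyperparameter.
print(search.best_params_)
# best_score_ is the mean cross-validated accuracy of that combination.
print(search.best_score_)
```

In the notebook, the same attributes are available on `grid_search_cv` once its `fit` call has finished.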

Now, let’s fit the training data (again pretending to only have the first 1000 images).

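#best_estimator_ was already refit on these 1000 images (refit=True is the default),
#so the explicit fit below only repeats that step
dt_clf = grid_search_cv.best_estimator_
dt_clf.fit(X_train[:1000], y_train[:1000])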

Let’s now obtain cross-validation accuracy scores with this fine-tuned model:

from sklearn.model_selection import cross_val_score
cross_val_score(dt_clf, X_train[:1000], y_train[:1000], cv=3, scoring="accuracy")

How accurate are your three folds?

If we only had 1000 images and wanted more to train on, we could use a technique known as data augmentation. The augmentation we will do here is to shift the images we already have: each of the 1000 images is shifted one pixel up, down, left, and right, which gives four additional images per original to add to our training set. Data augmentation such as this is a very useful technique when there are not enough training instances. If you run the following code, you will see an example of the first image shifted down and left (by five pixels in this example).

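from scipy.ndimage import shift
def shift_image(image, dx, dy):
    #reshape the flat 784-pixel vector into a 28x28 image
    image = image.reshape((28, 28))
    #shift takes offsets in (row, column) order, hence [dy, dx]; vacated pixels are filled with 0
    shifted_image = shift(image, [dy, dx], cval=0, mode="constant")
    #flatten back to a 784-pixel vector
    return shifted_image.reshape([-1])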

image = X_train[0]
shifted_image_down = shift_image(image, 0, 5)
shifted_image_left = shift_image(image, -5, 0)

plt.figure(figsize=(12,3))
plt.subplot(131)
plt.title("Original", fontsize=14)
plt.imshow(image.reshape(28, 28), interpolation="nearest", cmap="Greys")
plt.subplot(132)
plt.title("Shifted down", fontsize=14)
plt.imshow(shifted_image_down.reshape(28, 28), interpolation="nearest", cmap="Greys")
plt.subplot(133)
plt.title("Shifted left", fontsize=14)
plt.imshow(shifted_image_left.reshape(28, 28), interpolation="nearest", cmap="Greys")
plt.show()
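
The argument order in `shift_image` is easy to mix up: SciPy's `shift` takes offsets in array-axis order, so `[dy, dx]` means rows first, then columns, and positive values move the content toward higher indices (down and right). The NumPy-only sketch below mimics an integer shift with zero fill on a tiny array to make that convention visible. `shift_zero_fill` is a hypothetical helper written for this illustration; it is not part of SciPy or the assignment code.

```python
import numpy as np

def shift_zero_fill(image, dx, dy):
    """Hypothetical helper: move a 2-D array dx columns right and dy rows down,
    filling vacated cells with 0 (negative values move left / up)."""
    h, w = image.shape
    out = np.zeros_like(image)
    # A shift as large as the image pushes everything out of frame.
    if abs(dx) >= w or abs(dy) >= h:
        return out
    # Source and destination windows that overlap after the move.
    src_rows, dst_rows = slice(max(0, -dy), h - max(0, dy)), slice(max(0, dy), h - max(0, -dy))
    src_cols, dst_cols = slice(max(0, -dx), w - max(0, dx)), slice(max(0, dx), w - max(0, -dx))
    out[dst_rows, dst_cols] = image[src_rows, src_cols]
    return out

tiny = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])

down = shift_zero_fill(tiny, dx=0, dy=1)   # rows move down, top row becomes 0
left = shift_zero_fill(tiny, dx=-1, dy=0)  # columns move left, right column becomes 0
print(down)
print(left)
```

Here `down` is `[[0, 0, 0], [1, 2, 3], [4, 5, 6]]` and `left` is `[[2, 3, 0], [5, 6, 0], [8, 9, 0]]`, matching the "Shifted down" and "Shifted left" panels in direction.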

To shift all of our images and add them to a training data set, we can use the following code:

X_train_augmented = [image for image in X_train[:1000]]
y_train_augmented = [label for label in y_train[:1000]]

for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
    for image, label in zip(X_train[:1000], y_train[:1000]):
        X_train_augmented.append(shift_image(image, dx, dy))
        y_train_augmented.append(label)

X_train_augmented = np.array(X_train_augmented)
y_train_augmented = np.array(y_train_augmented)

You now have 5000 images in X_train_augmented and 5000 matching labels in y_train_augmented: the 1000 original images plus four shifted copies of each (one pixel up, down, left, and right).

Let’s now fine-tune and fit a decision tree classifier using this augmented training data.

dt_clf = DecisionTreeClassifier(random_state=30)
params = {'max_leaf_nodes': list(range(2, 100)), 'min_samples_split': [3, 4, 5]}
grid_search_cv = GridSearchCV(dt_clf, params, verbose=1, cv=3)

#This will take around 10 minutes to run
grid_search_cv.fit(X_train_augmented, y_train_augmented)

grid_search_cv.best_estimator_

Now, what are the best hyperparameters?

Finally, obtain cross-validation accuracy scores with this model trained on the augmented data.
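
This step mirrors the earlier `cross_val_score` call, with the augmented arrays in place of the first 1000 images. One possible sketch follows. To stay self-contained it makes several assumptions: the 8x8 digits set bundled with scikit-learn stands in for MNIST, a single one-pixel right shift stands in for the four shifts, and a fixed `max_leaf_nodes=30` stands in for the tuned value. The helper `shift_right_one` and all variable names are hypothetical. The final call, which uses `GroupKFold` to keep each original image in the same fold as its shifted copy, is an optional variation and not something the assignment asks for.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled 8x8 digits stand in for MNIST so no download is needed.
X_orig, y_orig = load_digits(return_X_y=True)
X_orig, y_orig = X_orig[:300], y_orig[:300]

def shift_right_one(flat_image, side=8):
    """Hypothetical helper: shift one pixel right, zero-filling the left column."""
    img = flat_image.reshape(side, side)
    out = np.zeros_like(img)
    out[:, 1:] = img[:, :-1]
    return out.reshape(-1)

# Originals followed by one shifted copy of each.
X_aug = np.vstack([X_orig, np.array([shift_right_one(im) for im in X_orig])])
y_aug = np.concatenate([y_orig, y_orig])
# groups[i] is the index of the original image that row i was derived from.
groups = np.concatenate([np.arange(len(X_orig)), np.arange(len(X_orig))])

# Assumption: fixed hyperparameters instead of grid_search_cv.best_estimator_.
clf = DecisionTreeClassifier(max_leaf_nodes=30, random_state=30)

# Same call pattern as earlier in the assignment, now on the augmented arrays.
scores = cross_val_score(clf, X_aug, y_aug, cv=3, scoring="accuracy")
print(scores, scores.mean())

# Optional variation: folds that never separate an image from its shifted copy.
grouped_scores = cross_val_score(clf, X_aug, y_aug, groups=groups,
                                 cv=GroupKFold(n_splits=3), scoring="accuracy")
print(grouped_scores, grouped_scores.mean())
```

In the notebook, the corresponding call would pass `grid_search_cv.best_estimator_`, `X_train_augmented`, and `y_train_augmented` to `cross_val_score`.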

What are the accuracy scores now?

Did the augmented data help?

Think of one downside to using augmented data and comment.

Submit your code in a Jupyter notebook. Include all code: the code examples above and what you write.
Put your answers to the questions above into markdown cells.
Use a single hashmark heading to label the problem.
