Assignment #2

Due on GitHub by 23:59, Wednesday 23 August


The goals of this assignment are:

  1. Coding up back-propagation on the problems we tackled in the previous assignment: Boolean functions and digit recognition. So you should be able to reuse a significant amount of code from that assignment.
  2. Becoming familiar with PyTorch, one of the two most popular software packages for deep learning.

Part 1:

Copy/paste/modify your module from the previous assignment into a new module. This module should provide a class that you instantiate by specifying one extra parameter, h, the number of hidden units. Your train method should take an extra parameter, η (eta), specifying the learning rate, which you can default to 0.5. Use the algorithm at the end of the lecture slides to flesh out the train and test methods.

Once you’ve set up your backprop code, it should be straightforward to copy/paste/modify your code from the previous assignment. Since the point of backprop is to learn functions like XOR, modify your code to train on this one function and report the results. Since we’re using a squashing function rather than a hard threshold, you can simply report the floating-point value of the output (instead of True / False). A good result is one where you get no more than 0.2 for the False values and no less than 0.8 for the True. I was usually able to get results like this using three hidden units, η=0.5, and 10,000 iterations.
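Here is a minimal sketch of the kind of class described above. The class name, weight layout, and update rule are one plausible arrangement (plain sigmoid units trained by gradient descent on squared error), not necessarily the exact algorithm from the lecture slides:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class BackProp:
    """One-hidden-layer sigmoid network; names and layout are illustrative."""

    def __init__(self, n, h):
        self.n, self.h = n, h
        rnd = lambda: random.uniform(-0.5, 0.5)
        # Row j of wh holds the weights from the n inputs (plus a bias in
        # the last slot) into hidden unit j; wo does the same for the output.
        self.wh = [[rnd() for _ in range(n + 1)] for _ in range(h)]
        self.wo = [rnd() for _ in range(h + 1)]

    def forward(self, x):
        hid = [sigmoid(sum(w[i] * xi for i, xi in enumerate(x)) + w[-1])
               for w in self.wh]
        out = sigmoid(sum(self.wo[j] * hid[j] for j in range(self.h))
                      + self.wo[-1])
        return hid, out

    def train(self, patterns, targets, eta=0.5, iterations=10000):
        for _ in range(iterations):
            for x, t in zip(patterns, targets):
                hid, out = self.forward(x)
                # deltas for squared error with sigmoid activations
                d_out = (t - out) * out * (1 - out)
                d_hid = [d_out * self.wo[j] * hid[j] * (1 - hid[j])
                         for j in range(self.h)]
                for j in range(self.h):
                    self.wo[j] += eta * d_out * hid[j]
                self.wo[-1] += eta * d_out
                for j in range(self.h):
                    for i in range(self.n):
                        self.wh[j][i] += eta * d_hid[j] * x[i]
                    self.wh[j][-1] += eta * d_hid[j]

    def test(self, patterns):
        return [self.forward(x)[1] for x in patterns]

random.seed(0)
xor_in = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_out = [0, 1, 1, 0]
net = BackProp(2, 3)
before = sum((t - o) ** 2 for t, o in zip(xor_out, net.test(xor_in)))
net.train(xor_in, xor_out, eta=0.5, iterations=10000)
print([round(o, 3) for o in net.test(xor_in)])
```

Whether a given run clears the 0.2/0.8 criterion depends on the random initial weights; a different seed or more iterations may be needed.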

Once you’ve got your XOR solver working, add two methods to your backprop class: save, to save the current weights, and load, to load in a new set of weights. This will be essential when training larger, slower-learning networks like the ones in the rest of the assignment. You are free to implement these methods however you like, but I suggest using the Python pickling tools you learned about in CSCI 111. (If you’re rusty, take a look at slides 12-20 of Prof. Lambert’s presentation of this topic.)
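A sketch of what save and load might look like using pickle; the attribute names and the filename here are illustrative, and the class is stripped down to just its weights:

```python
import pickle

class BackProp:
    """Stub holding just the weight attributes, to show save/load;
    attribute names and the filename below are illustrative."""

    def __init__(self):
        self.wh = [[0.1, -0.2], [0.3, 0.4]]
        self.wo = [0.5, -0.6]

    def save(self, filename):
        # pickle all the weights as a single tuple
        with open(filename, "wb") as f:
            pickle.dump((self.wh, self.wo), f)

    def load(self, filename):
        with open(filename, "rb") as f:
            self.wh, self.wo = pickle.load(f)

net = BackProp()
net.save("weights.pkl")
net.wh, net.wo = [], []       # wipe the weights...
net.load("weights.pkl")       # ...and restore them from disk
print(net.wh, net.wo)
```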

Part 2: 2-not-2 revisited

Now redo your Part 2 from last time: a 196-input, one-output backprop network that learns to distinguish between 2 and not-2. To get the misses and false positives, you can use a threshold: ideally, you would consider an output below 0.5 as 0 and an output above 0.5 as 1. But I found this threshold too high, missing many of the 2’s.
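One way to count misses and false positives with an adjustable threshold; the function name, the toy outputs, and the threshold values here are all illustrative:

```python
def score(outputs, labels, threshold=0.5):
    """Count misses (a true 2 scored below threshold) and false positives
    (a non-2 scored at or above it). Names and values are illustrative."""
    misses = sum(1 for o, is_two in zip(outputs, labels)
                 if is_two and o < threshold)
    false_pos = sum(1 for o, is_two in zip(outputs, labels)
                    if not is_two and o >= threshold)
    return misses, false_pos

outputs = [0.9, 0.4, 0.1, 0.6]          # network outputs on four test patterns
labels = [True, True, False, False]     # True means the pattern really is a 2
print(score(outputs, labels))           # default 0.5 threshold
print(score(outputs, labels, 0.3))      # a lower threshold rescues the 0.4 "2"
```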

Of course, you’ll have to experiment with a different number of hidden units (and possibly learning rate η) to get something you’re happy with. Unlike the previous part, where you are almost certain to get good results on XOR with enough iterations, the goal here is not to “solve” the classification, but rather to explore the behavior of back-prop on an interesting problem and report your results in a concise and understandable way.

Once you’re satisfied with your results on this part, use your save method to save the trained weights, and add some code at the end to load them, run your tests, and report your results. Once you’ve got this whole script working, comment-out the training part, so that the script simply loads the weights, tests with them, and reports the results. This is how I will test your script.

Part 3: Backprop as full digit classifier

Here we’ll go for the “Full Monty” and try to classify all 10 digits correctly. Use your new backprop class to instantiate, train, and test a 196-input, 10-output network on a one-in-N (“one-hot”) code for the digits. (This is the code at the bottom of each pattern, though it is easy to build yourself if you didn’t read it from the data file.) For testing, you might simply pick the largest of the ten output units as the “winner”.
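The one-hot targets and the winner-take-all test might be sketched like this (helper names are illustrative):

```python
def one_hot(digit, n=10):
    """Target vector with a 1.0 in the digit's slot, 0.0 elsewhere."""
    return [1.0 if i == digit else 0.0 for i in range(n)]

def winner(outputs):
    """Index of the largest output unit -- the network's answer."""
    return max(range(len(outputs)), key=lambda i: outputs[i])

print(one_hot(3))
print(winner([0.1, 0.05, 0.7, 0.2, 0.1, 0.0, 0.3, 0.1, 0.2, 0.1]))
```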

Before you start training for lots of iterations here, I’d get the testing part of your code working: just train for one iteration, then run the tests and produce a 10×10 table (confusion matrix) showing each digit (row) and how many times it was classified as each digit (column). (A perfect solution would have all 250s on the diagonal of this table, but that is an extremely unlikely result.) Again, there’s no “correct” number of hidden units, iterations, or the like. At some point you’ll have to stick with something that works reasonably, and produce a nice table to report your results with it.
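The table might be built and printed along these lines (shown for 3 digits to keep the example small; in the real thing n=10 with 250 tests per digit, and the names are illustrative):

```python
def confusion_matrix(actual, predicted, n=10):
    """table[a][p] counts how often digit a was classified as p."""
    table = [[0] * n for _ in range(n)]
    for a, p in zip(actual, predicted):
        table[a][p] += 1
    return table

def show(table):
    n = len(table)
    print("     " + " ".join("%4d" % d for d in range(n)))
    for d, row in enumerate(table):
        print("%4d " % d + " ".join("%4d" % c for c in row))

# Tiny 3-digit example in place of real test results.
actual = [0, 0, 1, 2, 2, 2]
predicted = [0, 1, 1, 2, 2, 0]
show(confusion_matrix(actual, predicted, n=3))
```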

If you think about the number of weights you’re now training (197∗h + (h+1)∗10 — e.g., 4,150 weights for h = 20), you can see why it will be crucial to get your setup and report working nicely before you spend hours training. As with Part 2, you’ll save the weights once you’re satisfied, then add code to load and test with them, and finally comment-out the training part.

Part 4: Diving into PyTorch

First, follow the instructions I showed you for installing PyTorch on your laptop, if you have one.

Most of your coding for this assignment will be copy/paste/modify — in my experience, the next best thing to writing it yourself from scratch — and often the only practical option!

The author’s GitHub repository for the book contains code in Jupyter notebook (ipynb) form, corresponding to the Chapter 5 section Building the MNIST Classifier in PyTorch (starting on p. 148).  Find that code in the repository and copy/paste it into a file (one section at a time is safest), all the way through the train() and test() calls at the bottom.  With no modification to the code, I was able to get a figure and test results very similar to what’s shown in the notebook: a nice smooth descending error curve and a testing accuracy close to 91%.  I also saw that the train() function saved (pickled) the trained network to a file, which is nice.  I did get some weird Unable to init server … Connection refused messages, as well as something about “Gdk”, on the machines in P413, but that didn’t affect my ability to run the code.  In my experience, these kinds of minor annoyances are pretty common with deep-learning packages and other large open-source software projects that are evolving so rapidly.

Part 5: Pickling 

As in our previous assignment, we want to get into the habit of separating our training and testing code into two separate programs, enabling us to run a trained network on new data.  So, copy/paste your script into two new scripts.  Then edit your three files so that one has just the network class code, one runs the training, and one runs the test.  Running the training script will produce the pickled state-dictionary file as before.  I also found it helpful for train() to print a little message telling the user Saving network in mnist.plt.  To figure out how to load this file into the testing script, I found the online PyTorch documentation useful.  Good coding practice also dictates that you should remove unnecessary imports (e.g., MNIST data imports in the scripts that no longer need them).  At this point you should probably also comment-out the plt plotting code to save yourself the trouble of the plot window popping up.

You’ll probably notice at this point that in both scripts you need to specify the size of the network (in_dim, feature_dim, out_dim).  This isn’t terrible, but it will slow you down when you want to experiment with different network shapes in the next section.  Since this is the MNIST data set, we know that the number of inputs is always 784 (28×28 pixels) and the number of outputs (digit classifications) is 10.  Plus, as in our previous assignment, there’s information in the trained network (params, weights) that enables you to determine the remaining size without having to store or specify it explicitly.  So add a little code to your testing script to extract and use this information.
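One way to recover the sizes, sketched with a hypothetical two-layer classifier (the class and layer names stand in for whatever your own script uses). The key fact is that nn.Linear stores each weight as (out_features, in_features), so all three sizes can be read straight off the saved tensors:

```python
import torch
import torch.nn as nn

# Hypothetical classifier standing in for the book's; the class and layer
# names here are illustrative, not the book's exact code.
class Classifier(nn.Module):
    def __init__(self, in_dim=784, feature_dim=256, out_dim=10):
        super().__init__()
        self.layer1 = nn.Linear(in_dim, feature_dim)
        self.layer2 = nn.Linear(feature_dim, out_dim)

    def forward(self, x):
        return self.layer2(torch.relu(self.layer1(x)))

# Pretend this state dict was just unpickled from the weights file:
state = Classifier(feature_dim=128).state_dict()

# nn.Linear weights have shape (out_features, in_features)
in_dim = state["layer1.weight"].shape[1]
feature_dim = state["layer1.weight"].shape[0]
out_dim = state["layer2.weight"].shape[0]
print(in_dim, feature_dim, out_dim)

net = Classifier(in_dim, feature_dim, out_dim)
net.load_state_dict(state)   # shapes match, so this succeeds
```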

Part 6: Hyper-parameters

91% accuracy seems like an awesome result on a serious data set like MNIST.  Looking at the code, you’ll see what look like pretty arbitrary decisions about the standard hyperparameters (training iterations, hidden units, eta).  In this part we’ll try a little “Goldilocks and the Three Bears” experimentation to see whether the values we’re using are “just right” (i.e., a good tradeoff between training time and testing generalization).

First let’s look at training iterations.  Another name for these is epochs, which you’ll see set to 40 in your script.  We could mess around all day with different values, but the Goldilocks approach suggests simply trying something significantly lower and significantly higher.  So try 20 and 60 instead and note whether you get obviously better or worse testing results.  You’ll put these values into a little plot or table in your writeup, so for the time being you can just write them down or save them in a spreadsheet.

Next let’s play with the size of the hidden layer.  As usual, people use different terminology for the same concept.  Based on the work you did in the pickling part, it should be pretty obvious what variable name corresponds to the hidden-layer size.  Once you’ve got that, modify the code to stop training not after a fixed number of epochs, but instead after a particular loss value. Then see if you can test the claim in our lecture notes about more hidden units providing faster training but worse generalization (testing).  (FWIW, I wasn’t able to get anything consistent here.)
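A toy sketch of the stop-on-loss idea, using made-up data in place of MNIST; the sizes, learning rate, and the 0.1 target loss are all illustrative:

```python
import torch
import torch.nn as nn

# Toy stand-in for the MNIST loop: stop when the loss falls below a target
# instead of after a fixed number of epochs.
torch.manual_seed(0)
x = torch.randn(64, 8)
y = (x.sum(dim=1) > 0).long()          # a linearly separable toy labeling

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.5)
loss_fn = nn.CrossEntropyLoss()

target_loss, max_epochs = 0.1, 1000
losses = []
for epoch in range(max_epochs):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    losses.append(loss.item())
    loss.backward()
    opt.step()
    if losses[-1] < target_loss:       # the early-stopping test
        break
print("stopped after", len(losses), "epochs; final loss", round(losses[-1], 4))
```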

Finally, look for the variable corresponding to the learning rate that we called η.  Again, try something significantly bigger and smaller to see if you get any noticeable difference in training duration and testing accuracy.  And again, your results are valuable (perhaps more so!) even if they don’t match expectations: maybe you can get the network to train faster and get better test results?

Part 7: Confusion Matrix

As before, a confusion matrix is a lot more informative than a simple accuracy rate.  What you’ll do in this part is grab your confusion matrix code from the previous assignment and copy/paste it into your script, then figure out how to reconcile it with the existing code there.

Looking at that code, I saw some obvious candidates, pred and targets.  At first the shapes and contents of those tensors didn’t make sense, but looking at them in more detail, I realized how they were formatted, and how to use them to populate my confusion matrix.  Once I figured that out, I was able to add just two lines to do this!
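Sketched with small stand-in tensors (the real pred and targets will be batch-sized, with 10 columns), the bookkeeping might look like this:

```python
import torch

# Stand-in tensors shaped like the ones in the notebook's test loop:
# pred has one row of class scores per image, targets the true labels.
pred = torch.tensor([[0.1, 0.8, 0.1],
                     [0.7, 0.2, 0.1],
                     [0.2, 0.2, 0.6]])
targets = torch.tensor([1, 0, 0])

confusion = torch.zeros(3, 3, dtype=torch.int64)
winners = pred.argmax(dim=1)            # each row's winning class
for t, w in zip(targets, winners):
    confusion[t, w] += 1                # bump the (actual, predicted) cell
print(confusion)
```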

Part 8: Direct logistic regression (no hidden layer)

As I mentioned in class, the previous time I gave this assignment (in TensorFlow) we found that the logistic regression being used on the output layer of the MNIST network was so powerful that we got good results without needing a hidden layer; i.e., with a single-layer perceptron.  To see how well we can do with an SLP, copy your three scripts to three new ones.  Then modify the code to use only one layer of weights.  Since there is no longer a hidden layer, you’ll have to experiment with the other hyper-params (number of epochs, learning rate) to try to match the performance of your MLP.  Hint: A common trick to get quickly to a good solution is to keep doubling a value (e.g., the number of epochs) until you get something that works.
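A sketch of what the single-layer model might look like (the class name is illustrative). Note how few weights it has: 784∗10 + 10 = 7,850.

```python
import torch
import torch.nn as nn

# A single-layer perceptron for MNIST: one Linear layer straight from the
# 784 pixels to the 10 digit scores.
class SLP(nn.Module):
    def __init__(self, in_dim=784, out_dim=10):
        super().__init__()
        self.layer = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        # raw scores are fine: CrossEntropyLoss applies log-softmax itself
        return self.layer(x)

model = SLP()
scores = model(torch.randn(5, 784))     # a fake batch of five images
print(scores.shape)
print(sum(p.numel() for p in model.parameters()))   # 784*10 + 10 = 7850
```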

Part 9: CPU vs. GPU showdown

As we discussed, a big (supposed) advantage of PyTorch is the ease with which it lets you run your training on the GPU via the CUDA software libraries.  If you have access to a computer with an NVIDIA GPU, you should be able to complete this last part of the assignment; otherwise, feel free to skip it.

Unable to locate anything helpful in our textbooks about PyTorch + CUDA, I found various tutorials online, but none got me to a complete solution.  After much trial and error I came up with the following recipe.  The point of doing it this way is that you’ll automatically run on CUDA if it’s available, and on the ordinary CPU otherwise:

First, at the top of your training script, set a variable to name what device you want to run on, based on whether CUDA is available:

    dev = "cuda:0" if torch.cuda.is_available() else "cpu"
    print("Running on " + dev)

Next, immediately after constructing your classifier, convert it to use the device:

    classifier = classifier.to(dev)

Finally, in the training loop, do the same .to(dev) trick on your data and target tensors.
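Putting the three steps together, here is a minimal end-to-end sketch; the tiny model and single fake batch stand in for your real classifier and data loader:

```python
import torch
import torch.nn as nn

dev = "cuda:0" if torch.cuda.is_available() else "cpu"
print("Running on " + dev)

classifier = nn.Linear(784, 10).to(dev)     # move the model to the device
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(classifier.parameters(), lr=0.1)

batches = [(torch.randn(32, 784), torch.randint(0, 10, (32,)))]
for data, target in batches:
    data, target = data.to(dev), target.to(dev)   # move each batch, too
    opt.zero_grad()
    loss = loss_fn(classifier(data), target)
    loss.backward()
    opt.step()
```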

Once you’ve got this working, it’s time for a head-to-head comparison between GPU and CPU.  First, we’ll time it on the GPU.  From the command line:

    time python3

Do this a few times, noting down the real component of the timing result, in case it varies between trials.  Next, override the automatic device choice, forcing the program to run on the CPU:

    dev = "cpu" # "cuda:0" if torch.cuda.is_available() else "cpu"

To my disappointment,  I found that the GPU ran at the same speed (time) or slower than the CPU!   If you’ve taken a course on parallel computing (or google around a bit on this complaint), you’ll know that the typical explanation is that you’re running a relatively small model and transferring the data to the GPU more frequently than you might need to.  At this point I was happy with my ability to run on the GPU for future work and didn’t try harder to get the GPU to win as expected.  So an excellent extra-credit opportunity would be to modify your model or training code enough to get a noticeable speedup on the GPU.     

What to submit to GitHub

  * A little PDF write-up (a single page should be sufficient; two at most) with a brief description of your PyTorch results in each part, including the confusion matrix.

* Based on
