Assignment #6

Due on Github 23:59 Wednesday 20 September

Objectives

  1. Expand our Deep Reinforcement Learning toolkit to enable continuous control for robotics.
  2. Improve our understanding of Actor/Critic methods
  3. Get some experience using the new IsaacGym DRL / simulator from NVIDIA, if available

Option #1 (for those without a GPU)

If your computer doesn’t have a GPU, find an example of Actor/Critic learning in PyTorch / Gymnasium that you can run on your computer.  As we’ve done before with Taxi, once you have the example running, make sure that you have a training / testing phase; or, even better, a training script that saves the trained network, and a testing script that loads and runs it.   You can submit your program as a Python script or scripts that I should be able to run on my computer.  

If you want to try an example that already works, here is my fork of a repository that uses the PPO algorithm to learn Pendulum-v1.  Running pendulum_train.py will save the actor and critic network weights.  Can you write a pendulum_test.py script to play the game using the actor network?
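
If you get stuck, here is a rough sketch of what such a test script might look like. The Actor class name, its constructor arguments, and the actor.pth filename are assumptions; match them to whatever your pendulum_train.py actually defines and saves:

    import gymnasium as gym
    import torch

    from ppo import Actor   # hypothetical import; use the actor class from your training script

    env = gym.make("Pendulum-v1", render_mode="human")
    actor = Actor(env.observation_space.shape[0], env.action_space.shape[0])
    actor.load_state_dict(torch.load("actor.pth"))   # weights saved by pendulum_train.py
    actor.eval()

    obs, _ = env.reset()
    done = False
    while not done:
        with torch.no_grad():
            action = actor(torch.as_tensor(obs, dtype=torch.float32))  # assumes the actor returns the action directly
        obs, reward, terminated, truncated, _ = env.step(action.numpy())
        done = terminated or truncated
    env.close()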

Option #2 (for those with a GPU)

First, make sure that Isaac Gym is installed on your computer.  I wasn’t able to do the installation on my computer at work, so if you have trouble with this option, I would just switch to Option #1, because I won’t be able to help you with Isaac issues.

Next, install the Isaac Gym environments as follows:

git clone https://github.com/NVIDIA-Omniverse/IsaacGymEnvs

cd IsaacGymEnvs

pip3 install -e .
python3 train.py

That last command will train your ants in real time with rendering, which as we know can slow things down dramatically.   Fortunately Isaac Gym also allows “headless” (no-render) training, as described in the instructions for the repository you just cloned.  Use headless training to train your ants this time; then follow the directions for running the train.py program again to show your training results.  Because headless training is supposed to stop automatically with decent performance, none of the ants should stumble or tip over now.

Being able to train up a swarm of ants in under a minute is pretty cool, but just how much of an improvement are we getting with Isaac Gym over PyTorch by itself (i.e., the td3-learn program from earlier)?  To address this question scientifically, we’d like to compare performance on the same environment.  Looking over the list of environments (tasks) in Isaac Gym, I saw that although Isaac Gym currently has far fewer built-in environments than the ones in the table from OpenAI Gym, both platforms appear to have the familiar Cartpole.  Looking at the code for Isaac Gym’s Cartpole implementation, however, I began to doubt that it was the same as either of the Cartpole environments (v0 or v1) from OpenAI.  Can you spot the difference(s)?  As usual, try to start with the obvious.

What to turn in to github

For option #1 (no GPU), just turn in a script (or two) that I can run.

For option #2 (Isaac Gym), turn in the following: (a) your Ant.pth file; (b) a README.md or little PDF writeup describing how the Isaac Gym cartpole reward and action space differ from the Cartpole-v1 that we used in our previous assignment.

Assignment #3

Due on Github 11:59PM Wednesday 30 August

Goals

The goals of this assignment are:

  1. Get some practice working with convolutional neural networks (CNNs) in PyTorch. As before, we will begin by modifying existing code from the author, instead of attempting to write the whole program from scratch.
  2. Understand state-of-the art recurrent networks by building a Long Short-Term Memory (LSTM) network for part-of-speech (POS) tagging.
  3. Learn how neural nets represent word meanings by building a network for dense word embeddings.
  4. Get a feel for the latest development in Deep Learning — the attention/transformer networks behind ChatGPT — by deploying and modifying a simple PyTorch example.
  5. See how much speedup you can get if you have a GPU-enabled computer. I have put an asterisk (*) next to these parts, to indicate that you don’t have to do them if you don’t have a GPU. In the same vein, you should feel free to reduce the number of training iterations that you run in order to keep the runtime under an hour on your computer. In either case, add a remark in the writeup letting me know what you did!

Part 1: Build and run a CNN

As in the previous assignment, look over the code from the author’s repository. Then begin copy/pasting it piece-by-piece into a single Python script cnn.py that you can run in IDLE, from the command line, or in your favorite IDE.  It should only take a few minutes to copy/paste the code up to the point where you can run the first network and get a final accuracy of around 55.6.  Looking at my code at that point, I found there was a bit that was unused, so I removed it.

Part 2: Train, pickle

Once you can run the code, repeat what we did in the previous assignment: create two separate scripts cnn.py (class definition) and cnn_train.py (training). This time you’ll have to add the pickling code, since it’s not already in the code from the author’s repository.  While you’re at it, copy the line of code from the previous assignment that reports which device (cpu or gpu) the program is using to train the network.

Part 3: Test

Considering all the work we did getting mlp_test.py to work in the last assignment (including the confusion matrix), I found it easier to use that script as the basis for my third new script, cnn_test.py.  Although the initial copy/paste/modify was easy, I immediately ran into some problems with the forward() method.  (If you didn’t have problems, feel free to skip the rest of this paragraph!)  Following my practice of when in doubt, print it out, I discovered the usual suspect I’ve mentioned in class: a mismatch between the shape of the tensor I was passing into the network and the shape the network expected.  I thought about doing a reshape() call on the data tensor, but quickly realized that a simpler solution involved eliminating, rather than adding, a line of code.

Part 4: CPU/GPU Rematch!*

Use your cnn_train.py script to repeat the timing experiment from the previous assignment and report your results (time for training on CPU vs. GPU for the same number of epochs).  In addition to noticing a significant time difference (expected), I found that the GPU and CPU agreed on the loss, but differed noticeably in accuracy.  This puzzled me a bit, because the point of calling torch.manual_seed(0) at the top is to get the same results every time, to help with debugging.  In your writeup, note the CPU-vs-GPU time differences, and see what you can find online about the differences in accuracy.

Part 5: But is it worth it?

Of course, a fancy neural net (CNN) with GPU speedup isn’t really worth much if it doesn’t beat the results (around 91% test accuracy) that we were able to get with the simpler networks (SLP, MLP) in our previous assignment.  To test the advantage of the CNN, let’s step up our game to a much bigger number of training epochs, say, 500.   So to finish up, see what kind of accuracy you can get with your three networks (SLP, MLP, CNN) after such a large number of training epochs — making sure to use the same learning rate for all three. (I recommend bringing along some other work to do while you’re waiting!)  Report the accuracy for each of the three networks in your writeup.

Part 6: LSTM

Read through this tutorial and copy/paste the code into a Python script lstm.py.  Because of the small problem size, it should be possible to run the whole script (including training) in a few seconds on your laptop.

As usual, we’re going to learn more about this model by making some improvements.

First, based on the tutorial’s description of the final output (tag scores), add some code to convert the output to the more human-readable form described in the large comment (DET NOUN VERB DET NOUN).  A nice output would report each word along with the POS label learned by the network, followed by the correct POS label (just as we did all the way back with XOR learning).  For example:

The:   DET (DET)
dog:   NN ( NN)
ate:   V ( V)
the:   DET (DET)
apple: NN ( NN)

Once you’ve got that working, factor the code into a test function test(words, targets), where targets are the desired POS tags, that will run this test for either of the two training examples (The dog ate the apple or Everybody read that book).  At this point you can probably comment-out the print() statements in the code before the training/testing part, to avoid distraction.  Then add the usual code inside your training loop to report the number of epochs at a reasonable interval.
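
Here is a minimal sketch of such a test function, assuming the model, prepare_sequence, word_to_ix, and tag_to_ix names from the tutorial (adjust to whatever your script uses):

    def test(words, targets):
        # invert the tag dictionary so we can map indices back to POS labels
        ix_to_tag = {ix: tag for tag, ix in tag_to_ix.items()}
        with torch.no_grad():
            inputs = prepare_sequence(words, word_to_ix)
            tag_scores = model(inputs)                    # one row of scores per word
            predictions = torch.argmax(tag_scores, dim=1)
        width = max(len(word) for word in words) + 1
        for word, pred, target in zip(words, predictions, targets):
            print((word + ":").ljust(width), ix_to_tag[pred.item()], "(" + target + ")")

    test("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"])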

Now that we’ve got a nice little training/testing script set up, let’s add some more parts of speech (POS) to our problem.  Adjectives seem like a natural place to start: how about The big dog ate the red apple and Everybody read that awesome book?  Feel free to come up with your own vocabulary (bonus points for making me laugh!)

Of course, we haven’t really done an honest training/testing evaluation of our model, because we’re training and testing it on the same data set.  To see how the model works on data it hasn’t seen, try passing a couple of new sentences to your test function, by making new sentences from the existing vocabulary.

Applying some critical thinking to our results so far, you can probably see that they aren’t all that impressive: all we’ve got is a model that classifies data (words) that it’s already seen! Indeed, you could probably do this with a simple logistic-regression or classic perceptron model using a single layer of weights and no recurrent (feedback) connections.  So what’s the big deal about LSTM and other recurrent networks?

Well, as we discussed in lecture, the cool thing about these networks (going back to Elman’s  1990 Simple Recurrent Network) is that they can predict the identity or category of the next item, based on the items seen so far; e.g., you know that the next word after The dog ate the … must be a noun or adjective.  What’s more, you can use this approach to solve “Plato’s Problem” of deducing information about words you have never heard: when you hear The dog ate the knish, you can immediately infer that knish is a noun (and probably something edible!) So, to finish up Part 1 of the assignment, figure out how to add a new word to the vocabulary, run the training on the same sentences as before (i.e., sentences not containing that new word), and see what your test function does on a sentence containing the new word.
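
One way to do this (a sketch, assuming the word_to_ix dictionary and LSTMTagger class from the tutorial) is simply to reserve an index for the new word before the model is constructed, so the embedding table has room for it, and then test on a sentence that uses it:

    # reserve an embedding slot for a word that never appears in the training sentences
    word_to_ix["knish"] = len(word_to_ix)   # do this before constructing LSTMTagger

    # ... train on the original sentences as before, then:
    test("The dog ate the knish".split(), ["DET", "NN", "V", "DET", "NN"])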

Part 7: Word Embeddings

As in the previous part, read the PyTorch tutorial on Word Embeddings, copy/pasting the code into a script embedding.py.  And again, comment-out the annoying print statements, except for the final one reporting the embedded vector value (tensor) for one of the words.  Next, replace that final print with a loop that prints each vocabulary word followed by its vector, without the distracting tensor .. grad_fn wrapping.  After “a little help from my friends” on StackOverflow, I was able to use numpy formatting to get an attractive printout like this, sorted in alphabetical order:

'This   [+2.295 +0.676 +1.714 -1.794 -1.521 +0.918 -0.549 -0.347 +0.473 -0.429]
And     [+0.487 -0.309 -3.014 -1.247 +1.349 +0.269 -1.128 -0.601 +1.837 -1.071]

youth's [+2.414 +1.021 -0.44  -1.734 -1.026 +0.521 -0.453 -0.126 -0.588 +2.119]

Note that because of the simple way the text was split (via the default whitespace delimiter), we’re getting bogus punctuation included with some words (like the quotation mark in 'This) — which, as Shakespeare might say, doth vex me somewise!   Pythonistas have lots of tricks for getting rid of punctuation, but for the current project I think it’s simply easier to either (a) not worry about the problem, or (b) edit the sonnet fragment a bit to eliminate punctuation and upper-case letters.  I chose (b), which gave me a final vocabulary size of 86 lower-case words.
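
For reference, here is a sketch of the kind of loop that can produce output like the printout above; it assumes the model.embeddings layer and word_to_ix dictionary from the tutorial, and uses numpy’s print options to get the fixed-width, signed formatting:

    import numpy as np

    np.set_printoptions(precision=3, floatmode="fixed", sign="+", suppress=True)
    width = max(len(word) for word in word_to_ix) + 1
    for word in sorted(word_to_ix):
        vector = model.embeddings(torch.tensor([word_to_ix[word]])).detach().numpy().flatten()
        print(word.ljust(width), vector)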

Now that we’ve got a nice little word-embedding program with sensible output, let’s see whether we can understand the embeddings (vectors) that it’s giving us.  As we saw in the lecture on the Simple Recurrent Net (slides 13-16), a clever way of doing this is to build a distance matrix, then run Hierarchical Cluster Analysis on the distance matrix to build a dendrogram (tree diagram) to visualize the semantic structure encoded in the embeddings.

Fortunately, there are now powerful tools that can do both of these steps for you automatically. This page has a schweet example.  Once I got this example running, I printed out the tiny dataset used for the clustering and saw that it was simply a list of 2D vectors (represented as tuples).  After puzzling over how to go from that kind of 2D data to our ten-dimensional embedding vectors, I figured I’d just put the vectors into a big list and use them as the data.  Sure enough, it worked!  A little more googling revealed how to use my Shakespeare vocabulary as the labels for the dendrogram, and then how to rotate the dendrogram so that the labels appeared on the left rather than at the bottom, for greater readability.
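
Here is a hedged sketch of that whole pipeline using scipy; the embedding extraction assumes the same model.embeddings and word_to_ix names as above:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, linkage

    words = sorted(word_to_ix)
    vectors = np.array([model.embeddings(torch.tensor([word_to_ix[w]])).detach().numpy().flatten()
                        for w in words])

    Z = linkage(vectors, method="ward")               # hierarchical clustering of the embedding vectors
    dendrogram(Z, labels=words, orientation="left")   # word labels on the left for readability
    plt.tight_layout()
    plt.show()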

After all that work, I found my dendrogram results somewhat disappointing: with so many words, it was impossible to read the whole plot clearly, and when I tried to zoom in on it, I found it difficult to discern the kinds of word-class patterns that Elman got.  As is often the case in science (especially data science), your results can be sensitive not only to the algorithm you use, but also to the data!  In other words, if I could go back to the simpler, artificially-generated sentences used by Elman, I might see some kind of pattern in my own embedding results.

As usual, googling a bit for RANDOM SENTENCE GENERATOR PYTHON, I found a simple solution on StackOverflow. Even better, I realized that instead of using random numbers, I could simply enumerate every possible sentence of the form ADJ NOUN VERB ADVERB (e.g., adorable rabbit runs occasionally) via a quadruply-nested cascade of for loops.  This solution had the advantage of a very small vocabulary (20 words) and a much larger data set (5⁴ = 625 four-word sentences) than the original sonnet fragment.  So, for the final part of the assignment, I added a little code to my embedding.py to generate these simple sentences and use them as the training set.  By decreasing the context size and number of embedding dimensions, I was able to get reasonable (not perfect!) dendrogram results after 100 epochs.  Try that, see what you get, and include your dendrogram picture in your writeup.
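
Here is a sketch of the generator I have in mind, with a made-up 20-word vocabulary (five words per category); enumerating every combination gives the 5⁴ = 625 sentences mentioned above:

    # hypothetical five-word lists; substitute your own vocabulary
    ADJS    = ["adorable", "big", "sleepy", "red", "awesome"]
    NOUNS   = ["rabbit", "dog", "cat", "student", "professor"]
    VERBS   = ["runs", "eats", "sleeps", "reads", "jumps"]
    ADVERBS = ["occasionally", "quickly", "slowly", "happily", "never"]

    sentences = []
    for adj in ADJS:
        for noun in NOUNS:
            for verb in VERBS:
                for adverb in ADVERBS:
                    sentences.append([adj, noun, verb, adverb])

    print(len(sentences))   # 5**4 = 625 four-word sentences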

Part 8: Attention / Transformers

Here is the code to copy/paste/modify into your initial transformer.py script.  This time the PyTorch folks did a nice job formatting the training reports — including time info!  Unfortunately they did not include a more detailed test case (like our confusion matrices from the previous assignments); hence, as before, we have an opportunity to explore further.

First, as before, let’s do the easy thing and see how much value we get from the GPU*, by finding the device = ... code, commenting it out to force CPU, and then running a trial with and without CUDA.  As before, if you run the code with time /usr/bin/python3 instead of just /usr/bin/python3, you can get a nice overall time summary at the end, to include in your writeup.

Next, let’s see what this model is actually learning!  I found the tutorial description pretty minimal, so as usual I started printing things out and exiting before the training started.  By printing out the size (len) of various data variables in the code, I quickly got confirmation that this is indeed a model of the English language.  (Hint: take a look at this statistic).  The sizes of the training and testing sets then made sense too.  Make sure to note these three sizes and report them in your writeup, with a brief explanation.  Also comment in your writeup on a new “one weird trick” you can see in the training report!

Now that we’ve got a good sense of what kind of data this model is using and how long it takes to train, let’s do the usual thing and break it up into training and testing scripts.  Unfortunately the code saves the pickled model best_model_params.pt in some weird temporary directory that I couldn’t locate, so the first modification I made was to force it to save that file in the current directory, with a helpful message about saving the file, as we did in the previous assignments.  Once you’ve got the model saved, comment briefly in your writeup on the number of parameters (floating-point weights) it appears to contain, assuming the standard four-byte floating-point encoding.
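
A quick, hedged sanity check (assuming the tutorial saved the weights with torch.save(model.state_dict(), 'best_model_params.pt')): count the parameters in the state dictionary and compare with the file size divided by four. The file will be a bit larger than the raw weights because of pickling overhead.

    import os
    import torch

    state = torch.load("best_model_params.pt", map_location="cpu")
    n_params = sum(tensor.numel() for tensor in state.values())
    print("parameters in state dict:", n_params)
    print("file size / 4 bytes:     ", os.path.getsize("best_model_params.pt") // 4)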

So now it’s time to split up our code the usual way: transformer.py (class definition), transformer_train.py (train model and save it), and transformer_test.py (load model and test it).  Because of the way the original transformer script mixes up global variables and function parameters, this step can take a while to get right, but at the end you’ll have a standalone test script that you can use to try out your Pre-Trained transformer: the PT part of ChatGPT!

As it stands, the evaluate() function used by the training and testing scripts doesn’t report anything interesting; it just returns the loss value.  So to get a better idea of what the network is actually doing, I copy/pasted the evaluate() function into my test script to create a new function report(), which I then modified to report the actual input and target words.  As mentioned in the (confusing to me!) tutorial instructions, the job of the get_batch() function used in evaluate() is to make a target sequence out of the input sequence by shifting the input sequence by one position — the same trick as in Elman’s original 1990 sequence-learning model.

To verify this claim for the actual vocabulary in the data, it took a little bit of experimental printing to see that the data (input) and targets were both of size 35×10, but that the targets had been reshaped to 1×350.  Once I figured that out, I was able to reverse-engineer the vocabulary object to extract the words corresponding to each word index, and then write some code to report the data words followed by the target words. Hint: as usual, type() will tell you the type of a variable, after which you can look up its methods in the online documentation.  For the first iteration of the report() loop I got this output (abbreviated here for simplicity), showing that the inputs and targets had the expected relationship:

= next either imagery and her = . was hitting
robert day blunt and n death boston seneca proscribed the
<unk> it ( clear @-@ as celtics asked . slow ...

robert day blunt and n death boston seneca proscribed the
<unk> it ( clear @-@ as celtics asked . slow
= joined <unk> , <unk> it = <unk> these @-@ ...
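
In case it helps, here is a hedged sketch of the kind of decoding loop that can produce a report like this; it assumes the get_batch() function and the torchtext vocab object from the tutorial, which (as I recall) provides get_itos() for the index-to-word table. If yours doesn’t, type() and the online docs will point you to the equivalent.

    itos = vocab.get_itos()                    # list mapping word index -> word string

    data, targets = get_batch(test_data, 0)    # data is 35x10; targets are flattened to length 350
    targets = targets.reshape(data.size())     # un-flatten so the columns line up with data
    for col in range(data.size(1)):            # one column per parallel batch stream
        print(" ".join(itos[int(ix)] for ix in data[:, col]))
        print(" ".join(itos[int(ix)] for ix in targets[:, col]))
        print()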

Looking at this data, I still couldn’t make any sense of the individual lines: WTF is = next either imagery and her = was hitting  … supposed to mean?!  As  a final effort at understanding this complicated model, I managed to find the URL for the WikiText-2 dataset zipfile, hidden in the PyTorch source code.  Downloading and unzipping this dataset and looking through the testing part, I solved the final mystery!  In your writeup, briefly comment on what you find when you do this; i.e., how is the code representing the actual text? Hint: Add a line print(targets) inside your training loop. Can you explain what the targets mean?

What to submit to Github

As usual, your PDF writeup will be the main part, plus your Python scripts to preserve your work.

Assignment #2

Due on Github 23:59 Wednesday 23 August

Goal

The goals of this assignment are:

  1. Coding up back-propagation on the problems we tackled in the previous assignment: Boolean functions and digit recognition. So you should be able to reuse a significant amount of code from that assignment.
  2. Becoming familiar with  PyTorch, one of the two most popular software packages for deep learning.

Part 1: backprop.py

Copy / paste / modify your perceptron.py module into a new module backprop.py. This module should provide a class that you instantiate by specifying one extra parameter, h, the number of hidden units. Your train method should take an extra parameter, η (eta), specifying the learning rate, which you can default to 0.5. Use the algorithm at the end of the lecture slides to flesh out the train and test methods.

Once you’ve set up your backprop code, it should be straightforward to copy/paste/modify your part1.py from the previous assignment. Since the point of backprop is to learn functions like XOR, modify your code to train on this one function and report the results. Since we’re using a squashing function rather than a hard threshold, you can simply report the floating-point value of the output (instead of True / False). A good result is one where you get no more than 0.2 for the False values and no less than 0.8 for the True. I was usually able to get results like this using three hidden units, η=0.5, and 10,000 iterations.

Once you’ve got your XOR solver working, add two methods to your backprop class: save, to save the current weights, and load, to load in a new set of weights. This will be essential when training larger, slower-learning networks like the ones in the rest of the assignment. You are free to implement these methods however you like, but I suggest using the Python pickling tools you learned about in CSCI 111. (If you’re rusty, take a look at slides 12-20 of Prof. Lambert’s presentation of this topic.)
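
A minimal sketch of the two methods, assuming your weights live in two attributes (here called self.wih and self.who for the input-to-hidden and hidden-to-output matrices; use whatever names your class actually has):

    import pickle

    # inside your backprop class:

    def save(self, filename):
        '''Pickle the current weights to a file.'''
        with open(filename, 'wb') as f:
            pickle.dump((self.wih, self.who), f)
        print('Saved weights to ' + filename)

    def load(self, filename):
        '''Load previously-saved weights from a file.'''
        with open(filename, 'rb') as f:
            self.wih, self.who = pickle.load(f)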

Part 2: 2-not-2 revisited

Now redo your Part 2 from last time: a 196-input, one-output backprop network that learns to distinguish between 2 and not-2. To get the misses and false positives, you can use a threshold. Ideally, you could consider an output below 0.5 as 0 and above 0.5 as 1. But I found this threshold too high, missing many of the 2’s.

Of course, you’ll have to experiment with a different number of hidden units (and possibly learning rate η) to get something you’re happy with. Unlike the previous part, where you are almost certain to get good results on XOR with enough iterations, the goal here is not to “solve” the classification, but rather to explore the behavior of back-prop on an interesting problem and report your results in a concise and understandable way.

Once you’re satisfied with your results on this part, use your save method to save the trained weights, and add some code at the end to load them, run your tests, and report your results. Once you’ve got this whole part2.py script working, comment-out the training part, so that the script simply loads the weights, tests with them, and reports the results. This is how I will test your script.

Part 3: Backprop as full digit classifier

Here we’ll go for the “Full Monty” and try to classify all 10 digits correctly. Use your new backprop class to instantiate, train, and test a 196-input, 10-output network on a one-in-N (“one-hot”) code for the digits. (This is the code at the bottom of each pattern, though it is easy to build yourself if you didn’t read it from the data file.) For testing, you might simply pick the largest of the ten output units as the “winner”.

Before you start training for lots of iterations here, I’d get your testing part of your part3.py code working: just train for one iteration, then run the tests and produce a 10×10 table (confusion matrix) showing each digit (row) and how many times it was classified as each digit (column). (A perfect solution would have all 250s on the diagonal of this table, but that is an extremely unlikely result.) Again, there’s no “correct” number of hidden units, iterations, or the like. At some point you’ll have to stick with something that works reasonably, and produce a nice table to report your results with it.
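
Here is a hedged sketch of the bookkeeping for that table; the test-pattern arrays and the net.test() call are assumptions, so substitute your own names from this part:

    import numpy as np

    confusion = np.zeros((10, 10), dtype=int)
    for pattern, target in zip(test_patterns, test_targets):   # hypothetical test arrays
        outputs = net.test(pattern)          # ten output activations for this pattern
        predicted = int(np.argmax(outputs))  # winner-take-all: largest output unit
        actual = int(np.argmax(target))      # one-hot target: index of the 1
        confusion[actual][predicted] += 1

    print(confusion)   # rows = actual digit, columns = classification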

If you think about the number of weights you’re now training (197∗h + (h+1)∗10), you can see why it will be crucial to get your setup and report working nicely before you spend hours training. As with Part 2, you’ll save the weights once you’re satisfied, then add code to load and test with them, and finally comment-out the training part.

Part 4: Diving into PyTorch

First, follow the instructions I showed you about using pytorch.org for installing PyTorch on your laptop if you have one.

Most of your coding for this assignment will be copy/paste/modify — in my experience, the next best thing to writing it yourself from scratch — and often the only practical option!

The author’s Github repository for the book contains code in Jupyter notebook (ipynb) form, corresponding to the Chapter 5 section Building the MNIST Classifier in PyTorch (starting on p. 148).  Find that code in the repository and copy/paste it into a file mlp.py (one section at a time is safest), all the way through the train() and test() calls at the bottom. With no modification to the code, I was able to get a figure and test results very similar to what’s shown in the notebook: a nice smooth descending error curve and a testing accuracy close to 91%.  I also saw that the train() function saved (pickled) the trained network to the file mnist.pt, which is nice.  I did get some weird Unable to init server … Connection refused messages, as well as something about “Gdk” on the machines in P413, but that didn’t affect my ability to run the code. In my experience, these kinds of minor annoyances are pretty common with Deep Learning packages and other large open-source software projects that are evolving so rapidly.

Part 5: Pickling 

As in our previous assignment, we want to get into the habit of separating our training and testing code into two separate programs, enabling us to run a trained network on new data.  So, copy/paste your mlp.py script into mlp_train.py and mlp_test.py.  Then edit your three files so that mlp.py has just the network class code, mlp_train.py runs the training, and mlp_test.py runs the test.  Running mlp_train.py will produce the mnist.pt (pickled state dictionary) file as before.  I also found it helpful for train() to print a little message telling the user Saving network in mnist.pt. To figure out how to load this file into mlp_test.py, I found this documentation useful.  Good coding practice also dictates that you should remove unnecessary imports (e.g., the MNIST data imports from mlp.py and mlp_test.py).  At this point you should probably also comment-out the plt plotting code to save yourself the trouble of the plot window popping up.

You’ll probably notice at this point that in both scripts you need to specify the size of the network (in_dim, feature_dim, out_dim).  This isn’t terrible, but it will slow you down when you want to experiment with different network shapes in the next section.  Since this is the MNIST data set, we know that the number of inputs is always 784 (28×28 pixels) and the number of outputs (digit classifications) is 10.  Plus, as in our previous assignment, there’s information in the trained network (params, weights) that enables you to determine the other size without having to store or specify it explicitly.  So add a little code to mlp_test.py to extract and use this information.
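
Here is one hedged way to do it, assuming the network was pickled with torch.save(classifier.state_dict(), 'mnist.pt'), that the class name and constructor arguments match your mlp.py, and that the first entry in the state dictionary is the first layer’s weight matrix (as it is for a plain stack of nn.Linear layers):

    import torch
    from mlp import Classifier      # hypothetical class name; use the one from your mlp.py

    state = torch.load('mnist.pt')
    first_weights = next(iter(state.values()))       # weight matrix of the first layer
    feature_dim = first_weights.shape[0]             # its rows = number of hidden units
    classifier = Classifier(784, feature_dim, 10)    # 784 inputs and 10 outputs are fixed for MNIST
    classifier.load_state_dict(state)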

Part 6: Hyper-parameters

91% accuracy seems like an awesome result on a serious data set like MNIST.  Looking at the code, you’ll see what look like pretty arbitrary decisions about the standard hyperparameters (training iterations, hidden units, eta).  In this part we’ll try a little “Goldilocks and the Three Bears” experimentation to see whether the values we’re using are “just right” (i.e., a good tradeoff between training time and testing generalization).

First let’s look at training iterations.  Another name for these is epochs, which you’ll see set to 40 in your mlp_train.py script.  We could mess around all day with different values, but the Goldilocks approach suggests simply trying something significantly lower and significantly higher.  So try 20 and 60 instead and note whether you get obviously better or worse testing results.  You’ll put these values into a little plot or table in your writeup, so for the time being you can just write them down or save them in a spreadsheet.

Next let’s play with the size of the hidden layer.  As usual, people use different terminology for the same concept.  Based on the work you did in the pickling part, it should be pretty obvious which variable name corresponds to the hidden-layer size.  Once you’ve got that, modify the code to stop training not after a fixed number of epochs, but instead after reaching a particular loss value. Then see if you can test the claim in our lecture notes about more hidden units providing faster training but worse generalization (testing).  (FWIW, I wasn’t able to get anything consistent here.)

Finally, look for the variable corresponding to the learning rate that we called η.  Again, try something significantly bigger and smaller to see if you get any noticeable difference in training duration and testing accuracy.  And again, your results are valuable (perhaps more so!) even if they don’t match expectations: maybe you can get the network to train faster and get better test results?

Part 7: Confusion Matrix

As before, a confusion matrix is a lot more informative than a simple accuracy rate.  What you’ll do on this part is grab your confusion matrix code from the previous assignment and copy/paste it into your mlp_test.py script, then figure out how to reconcile it with the existing code in mlp_test.py.

Looking at that code, I saw some obvious candidates, pred and targets.  At first the shapes and contents of those tensors didn’t make sense, but looking at them in more detail, I realized how they were formatted, and how to use them to populate my confusion matrix.  Once I figured that out, I was able to add just two lines to do this!
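
Something along these lines would do it (a sketch, assuming pred and targets have been reduced to 1-D tensors of digit labels for the test batch, and confusion is a 10×10 array of zeros like the one from the previous assignment):

    for p, t in zip(pred, targets):
        confusion[t.item()][p.item()] += 1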

Part 8: Direct logistic regression (no Hidden Layer)

As I mentioned in class, the previous time I gave this assignment (in TensorFlow) we found that the logistic regression being used on the output layer of the MNIST network was so powerful that we got good results without needing a hidden layer; i.e., with a single-layer perceptron.  To see how well we can do with an SLP, copy your three scripts to three new scripts slp.py, slp_train.py, and slp_test.py.  Then modify the code to use only one layer of weights.  Since there is no longer a hidden layer, you’ll have to experiment with the other hyper-params (number of epochs, learning rate) to try and match the performance of your MLP.  Hint: a common trick to get quickly to a good solution is to keep doubling the size of a value (e.g., epochs) until you get something that works.
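
A minimal sketch of what the single-layer class might look like; the structure mirrors the MLP from the book, but the base-class details and loss function are assumptions, so keep whatever your mlp.py already uses:

    import torch.nn as nn

    class SLP(nn.Module):
        def __init__(self, in_dim=784, out_dim=10):
            super().__init__()
            self.layer = nn.Linear(in_dim, out_dim)   # one layer of weights, no hidden layer

        def forward(self, x):
            return self.layer(x)   # let the loss function handle the softmax / log-softmax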

Part 9: CPU vs. GPU showdown

As we discussed, a big (supposed) advantage of PyTorch is the ease with which it lets you run your training on the GPU via the CUDA software libraries.  If you have access to a computer with an NVIDIA GPU, you should be able to complete this last part of the assignment; otherwise, feel free to skip it.

Unable to locate anything helpful in our textbooks about PyTorch + CUDA, I found various tutorials online, but none got me to a complete solution.  After much trial and error I came up with the following recipe.  The point of doing it this way is that you’ll automatically run on CUDA if it’s available, and on the ordinary CPU otherwise:

First, at the top of your training script, set a variable to name what device you want to run on, based on whether CUDA is available:

dev = "cuda:0" if torch.cuda.is_available() else "cpu"
print("Running on " + dev)

Next, immediately after constructing your classifier, convert it to use the device:

classifier = classifier.to(dev)

Finally, in the training loop, do the same  .to(dev) trick on your data and target tensors.
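
Concretely, the change inside the loop looks something like this (the loader and variable names are assumptions; use whatever your mlp_train.py already calls them):

    for data, targets in train_loader:
        data = data.to(dev)         # move the input batch to the chosen device
        targets = targets.to(dev)   # ...and the labels, so they match the classifier
        optimizer.zero_grad()
        loss = loss_fn(classifier(data), targets)
        loss.backward()
        optimizer.step()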

Once you’ve got this working, it’s time for a head-to-head comparison between GPU and CPU.  First, we’ll time it on the GPU.  From the command line:

    time python3 mlp_train.py

Do this a few times, noting down the real component of the timing result, in case it varies between trials.  Next,  override  the automatic device choice, forcing the program to run on the CPU:

    dev = "cpu" # "cuda:0" if torch.cuda.is_available() else "cpu"

To my disappointment,  I found that the GPU ran at the same speed (time) or slower than the CPU!   If you’ve taken a course on parallel computing (or google around a bit on this complaint), you’ll know that the typical explanation is that you’re running a relatively small model and transferring the data to the GPU more frequently than you might need to.  At this point I was happy with my ability to run on the GPU for future work and didn’t try harder to get the GPU to win as expected.  So an excellent extra-credit opportunity would be to modify your model or training code enough to get a noticeable speedup on the GPU.     

What to submit to Github

  1. backprop.py
  2. A little PDF write-up (a single page should be sufficient; two at most) with a brief description of your PyTorch results in each part, including the confusion matrix.

* Based on https://www.cs.colorado.edu/~mozer/Teaching/syllabi/DeepLearning2015/assignments/assignment3.html