Assignment #3
Due on Github 11:59PM Wednesday 30 August
The goals of this assignment are:
- Get some practice working with convolutional neural networks (CNNs) in PyTorch. As before, we will begin by modifying existing code from the author, instead of attempting to write the whole program from scratch.
- Understand state-of-the-art recurrent networks by building a Long Short-Term Memory (LSTM) network for part-of-speech (POS) tagging.
- Learn how neural nets represent word meanings by building a network for dense word embeddings.
- Get a feel for the latest development in Deep Learning (the attention/transformer networks behind ChatGPT) by deploying and modifying a simple PyTorch example.
- See how much speedup you can get if you have a GPU-enabled computer. I have put an asterisk (*) next to these parts, to indicate that you don’t have to do them if you don’t have a GPU. In the same vein, you should feel free to reduce the number of training iterations that you run in order to keep the runtime under an hour on your computer. In either case, add a remark in the writeup letting me know what you did!
Part 1: Build and run a CNN
As in the previous assignment, look over the code from the author’s repository. Then begin copy/pasting piece-by-piece into a single Python script cnn.py that you can run in IDLE, on the command line, or in your favorite IDE. It should only take a few minutes to copy/paste the code up to the point where you can run the first network and get a final accuracy of around 55.6. Looking at my code at that point, I found there was a bit that was unused, so I removed it.
Part 2: Train, pickle
Once you can run the code, repeat what we did in the previous assignment: create two separate scripts cnn.py (class definition) and cnn_train.py (training). This time you’ll have to add the pickling code, since it’s not already in the code from the author’s repository. While you’re at it, copy the line of code from the previous assignment that reports which device (CPU or GPU) the program is using to train the network.
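If you need a reminder of what pickling looks like, here is a minimal sketch using only the standard pickle module. The function names save_model and load_model, the file name, and the dictionary standing in for the trained network are all made up for illustration; the code from the previous assignment may well be organized differently.

```python
import os
import pickle
import tempfile

def save_model(model, filename):
    """Pickle any Python object (here, a stand-in for the trained network)."""
    with open(filename, "wb") as f:
        pickle.dump(model, f)
    print("Saved model to " + filename)

def load_model(filename):
    """Read back whatever object was pickled into filename."""
    with open(filename, "rb") as f:
        return pickle.load(f)

# Hypothetical stand-in for a trained network: just a dictionary of weights
fake_model = {"conv1": [0.1, -0.2], "fc": [0.3]}
path = os.path.join(tempfile.gettempdir(), "cnn_demo.pkl")
save_model(fake_model, path)
restored = load_model(path)
print(restored)
```

With a real network object, the script that unpickles it has to be able to import the class definition, which is one good reason for keeping the class in its own cnn.py.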
Part 3: Test
Considering all the work we did getting mlp_test.py to work in the last assignment (including the confusion matrix), I found it easier to use that script as the basis for my third new script, cnn_test.py. Although the initial copy/paste/modify was easy, I immediately ran into some problems with the forward() method. (If you didn’t have problems, feel free to skip the rest of this paragraph!) Following my practice of “when in doubt, print it out,” I discovered the usual suspect I’ve mentioned in class: a mismatch in the shape of the tensor I was passing into the network versus what the network expected. I thought about doing a reshape() call on the data tensor, but quickly realized that a simpler solution involved eliminating, rather than adding, a line of code.
Part 4: CPU/GPU Rematch!*
Use your cnn_train.py script to repeat the timing experiment from the previous assignment and report your results (time for training on CPU vs. GPU for the same number of epochs). In addition to noticing a significant time difference (expected), I found that the GPU and CPU agreed on the loss, but differed noticeably in accuracy. This puzzled me a bit, because the point of calling torch.random.seed(0) at the top is to get the same results every time, to help with debugging. In your writeup, note the CPU-vs-GPU time differences, and see what you can find online about the differences in accuracy.
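If you would rather time the training from inside the script, here is one way to sketch it with the standard time module. The helper timed and the function fake_training are hypothetical names; fake_training just burns some CPU cycles in place of your real training loop.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) using a wall-clock timer."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def fake_training(epochs):
    # Hypothetical stand-in for the real training loop in cnn_train.py
    total = 0
    for epoch in range(epochs):
        total += sum(i * i for i in range(1000))
    return total

result, seconds = timed(fake_training, 10)
print("Training took %.3f seconds" % seconds)
```

One caveat when timing on a GPU: CUDA work is launched asynchronously, so a timer stopped right after the last call can under-report. Calling torch.cuda.synchronize() before reading the clock is the usual remedy.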
Part 5: But is it worth it?
Of course, a fancy neural net (CNN) with GPU speedup isn’t really worth much if it doesn’t beat the results (around 91% test accuracy) that we were able to get with the simpler networks (SLP, MLP) in our previous assignment. To test the advantage of the CNN, let’s step up our game to a much bigger number of training epochs, say, 500. So to finish up, see what kind of accuracy you can get with your three networks (SLP, MLP, CNN) after such a large number of training epochs — making sure to use the same learning rate for all three. (I recommend bringing along some other work to do while you’re waiting!) Report the accuracy for each of the three networks in your writeup.
Part 6: LSTM
Read through this tutorial and copy/paste the code into a Python script lstm.py. Because of the small problem size, it should be possible to run the whole script (including training) in a few seconds on your laptop.
As usual, we’re going to learn more about this model by making some improvements.
First, based on the tutorial’s description of the final output (tag scores), add some code to convert the output to the more human-readable form described in the large comment (DET NOUN VERB DET NOUN). A nice output would report each word along with the POS label learned by the network, followed by the correct POS label (just as we did all the way back with XOR learning). For example:
The: DET (DET)
dog: NN ( NN)
ate: V ( V)
the: DET (DET)
apple: NN ( NN)
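The idea behind a report like the one above is to take, for each word, the tag with the highest score. Here is a minimal sketch in plain Python, with lists in place of tensors. The scores are invented for illustration (they are not real network output), report_tags is a hypothetical name, and the tag numbering (DET=0, NN=1, V=2) is assumed to match the tutorial's.

```python
def argmax(scores):
    """Index of the largest value in a list of scores."""
    return max(range(len(scores)), key=lambda i: scores[i])

def report_tags(words, tag_scores, targets, ix_to_tag):
    """Build one 'word: predicted (target)' line per word."""
    lines = []
    for word, scores, target in zip(words, tag_scores, targets):
        predicted = ix_to_tag[argmax(scores)]
        lines.append("%s: %s (%s)" % (word, predicted, target))
    return lines

ix_to_tag = {0: "DET", 1: "NN", 2: "V"}   # assumed tag numbering
words = "The dog ate the apple".split()
tag_scores = [                            # made-up scores, one row per word
    [-0.1, -2.5, -3.9],
    [-3.2, -0.2, -2.4],
    [-4.0, -3.1, -0.1],
    [-0.2, -2.8, -3.3],
    [-3.5, -0.1, -3.0],
]
targets = ["DET", "NN", "V", "DET", "NN"]
for line in report_tags(words, tag_scores, targets, ix_to_tag):
    print(line)
```

With a real PyTorch tensor of scores, calling argmax(dim=1) on the tensor does the same job as the hand-written argmax here.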
Once you’ve got that working, factor the code into a test function, where targets are the desired POS tags, that will run this test for either of the two training examples (The dog ate the apple or Everybody read that book). At this point you can probably comment out the print() statements in the code before the training/testing part, to avoid distraction. Then add the usual code inside your training loop to report the number of epochs at a reasonable interval.
Now that we’ve got a nice little training/testing script set up, let’s add some more parts of speech (POS) to our problem. Adjectives seem like a natural place to start. How about The big dog ate the red apple and Everybody read that awesome book? Feel free to come up with your own vocabulary (bonus points for making me laugh!).
Of course, we haven’t really done an honest training/testing evaluation of our model, because we’re training and testing it on the same data set. To see how the model works on data it hasn’t seen, try passing a couple of new sentences to your test function, by making new sentences from the existing vocabulary.
Applying some critical thinking to our results so far, you can probably see that they aren’t all that impressive: all we’ve got is a model that classifies data (words) that it’s already seen! Indeed, you could probably do this with a simple logistic-regression or classic perceptron model using a single layer of weights and no recurrent (feedback) connections. So what’s the big deal about LSTM and other recurrent networks?
Well, as we discussed in lecture, the cool thing about these networks (going back to Elman’s 1990 Simple Recurrent Network) is that they can predict the identity or category of the next item, based on the items seen so far; e.g., you know that the next word after The dog ate the … must be a noun or adjective. What’s more, you can use this approach to solve “Plato’s Problem” of deducing information about words you have never heard: when you hear The dog ate the knish, you can immediately infer that knish is a noun (and probably something edible!). So, to finish up this part of the assignment, figure out how to add a new word to the vocabulary, run the training on the same sentences as before (i.e., sentences not containing that new word), and see what your test function does on a sentence containing the new word.
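One way to think about the new-word step: the word needs an index in the vocabulary even though it never appears in a training sentence. Here is a minimal sketch, assuming a word-to-index dictionary like the one the tutorial builds; build_vocab and its extra_words parameter are hypothetical names.

```python
def build_vocab(sentences, extra_words=()):
    """Assign each distinct word an index, then add any unseen extra words."""
    word_to_ix = {}
    for sentence in sentences:
        for word in sentence:
            if word not in word_to_ix:
                word_to_ix[word] = len(word_to_ix)
    for word in extra_words:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)
    return word_to_ix

training_sentences = [
    "The dog ate the apple".split(),
    "Everybody read that book".split(),
]
# "knish" gets an index but never shows up in the training sentences
word_to_ix = build_vocab(training_sentences, extra_words=["knish"])
test_sentence = "The dog ate the knish".split()
test_indices = [word_to_ix[w] for w in test_sentence]
print(word_to_ix)
print(test_indices)
```

The embedding layer then has to be sized from len(word_to_ix), so that the new word gets a vector of its own, albeit one that training never touches.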
Part 7: Word Embeddings
As in the previous part, read the PyTorch tutorial on Word Embeddings, copy/pasting the code into a script embedding.py. And again, comment out the annoying print statements, except for the final one reporting the embedded vector value (tensor) for one of the words. Next, replace that final print with a loop that prints each vocabulary word followed by its vector, without the distracting tensor .. grad_fn wrapping. After “a little help from my friends” on StackOverflow, I was able to use numpy formatting to get an attractive printout like this, sorted in alphabetical order:
'This [+2.295 +0.676 +1.714 -1.794 -1.521 +0.918 -0.549 -0.347 +0.473 -0.429]
And [+0.487 -0.309 -3.014 -1.247 +1.349 +0.269 -1.128 -0.601 +1.837 -1.071]
youth's [+2.414 +1.021 -0.44 -1.734 -1.026 +0.521 -0.453 -0.126 -0.588 +2.119]
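A printout in this style does not strictly require numpy. The sketch below uses plain string formatting instead (so it is not the numpy approach mentioned above), with a hypothetical helper format_vector and two vectors shortened to three components from the sample printout.

```python
def format_vector(vec):
    """Format a list of floats with explicit signs and three decimals."""
    return "[" + " ".join("%+.3f" % x for x in vec) + "]"

# First three components of two vectors from the printout above
embeddings = {
    "youth's": [2.414, 1.021, -0.44],
    "And": [0.487, -0.309, -3.014],
}
for word in sorted(embeddings):
    print(word, format_vector(embeddings[word]))
```

Unlike the numpy printout, this pads every value to three decimals (-0.440 rather than -0.44). To get a plain list out of an embedding tensor in the first place, detach() followed by tolist() should do it.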
Note that because of the simple way that the text was split (via the default whitespace delimiter), we’re getting bogus punctuation included with some words (like the quotation mark in 'This), which as Shakespeare might say, doth vex me somewise! Pythonistas have lots of tricks for getting rid of punctuation, but for the current project I think it’s simply easier to either (a) not worry about the problem, or (b) edit the sonnet fragment a bit to eliminate punctuation and upper-case letters. I chose (b), which gave me a final vocabulary size of 86 lower-case words.
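For the curious, here is one of those Pythonista tricks, sketched with the standard string module as an alternative to editing the text by hand. The function name clean_words is made up, and only the first two lines of the sonnet are used.

```python
import string

def clean_words(text):
    """Lower-case the text, delete all ASCII punctuation, split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

fragment = ("When forty winters shall besiege thy brow,\n"
            "And dig deep trenches in thy beauty's field,")
words = clean_words(fragment)
print(words)
print(len(set(words)), "distinct words")
```

Note the side effect: apostrophes count as punctuation too, so beauty's turns into beautys. Whether that matters is a judgment call, and the resulting vocabulary may not match one produced by hand-editing.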
Now that we’ve got a nice little word-embedding program with sensible output, let’s see whether we can understand the embeddings (vectors) that it’s giving us. As we saw in the lecture on the Simple Recurrent Net (slides 13-16), a clever way of doing this is to build a distance matrix, then run Hierarchical Cluster Analysis on the distance matrix to build a dendrogram (tree diagram) to visualize the semantic structure encoded in the embeddings.
Fortunately, there are now powerful tools that can do both of these steps for you automatically. This page has a schweet example. Once I got this example running, I printed out the tiny dataset used for the clustering and saw that it was simply a list of 2D vectors (represented as tuples). After puzzling over how to go from that kind of 2D data to our ten-dimensional embedding vectors, I figured I’d just put the vectors into a big list and use them as the data. Sure enough, it worked! A little more googling revealed how to use my Shakespeare vocabulary as the labels for the dendrogram, and then how to rotate the dendrogram so that the labels appeared on the left rather than at the bottom, for greater readability.
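To see what the distance-matrix step amounts to, here is a minimal hand-rolled sketch using Euclidean distance. The labels and 2D vectors are invented for illustration, and distance_matrix is a hypothetical name. The same function works unchanged for ten-dimensional vectors.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_matrix(vectors):
    """Square matrix whose [i][j] entry is the distance between vectors i and j."""
    return [[euclidean(u, v) for v in vectors] for u in vectors]

labels = ["dog", "cat", "apple"]                 # made-up labels
vectors = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]   # made-up 2D "embeddings"
matrix = distance_matrix(vectors)
for label, row in zip(labels, matrix):
    print(label, ["%.1f" % d for d in row])
```

Many clustering libraries (SciPy's linkage function, for example) will compute these pairwise distances for you when handed the raw vectors, so in practice you may never need to build the matrix yourself.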
After all that work, I found my dendrogram results somewhat disappointing: with so many words, it was impossible to read the whole plot clearly, and when I tried to zoom in on it, I found it difficult to discern the kinds of word-class patterns that Elman got. As is often the case in science (esp. data science), your results can be sensitive to not only the algorithm you use, but also the data! In other words, if I could go back to the simpler, artificially-generated sentences used by Elman, I might see some kind of pattern in my own embedding results.
As usual, googling a bit for RANDOM SENTENCE GENERATOR PYTHON, I found a simple solution on StackOverflow. Even better, I realized that instead of using random numbers, I could simply enumerate every possible sentence of the form ADJ NOUN VERB ADVERB (e.g., adorable rabbit runs occasionally), by a quadruply-nested cascade of for loops. This solution had the advantage of having a very small vocabulary size (20 words) and a much, much larger data set (5^4 = 625 four-word sentences) than the original sonnet fragment. So, for the final part of the assignment, I added a little code to my embedding.py to generate these simple sentences and use them as the training set. By decreasing the context size and number of embedding dimensions, I was able to get reasonable (not perfect!) dendrogram results after 100 epochs. Try that, see what you get, and include your dendrogram picture in your writeup.
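The quadruply-nested loop can be sketched as follows. Apart from the words in the example sentence above, the word lists are placeholders, assuming five words per category; feel free to substitute your own.

```python
# Placeholder vocabulary: five words in each of four categories (20 words total)
adjectives = ["adorable", "clueless", "dirty", "odd", "stupid"]
nouns = ["puppy", "car", "rabbit", "girl", "monkey"]
verbs = ["runs", "hits", "jumps", "drives", "barfs"]
adverbs = ["crazily", "dutifully", "foolishly", "merrily", "occasionally"]

sentences = []
for adj in adjectives:
    for noun in nouns:
        for verb in verbs:
            for adv in adverbs:
                sentences.append([adj, noun, verb, adv])

vocab = set(adjectives + nouns + verbs + adverbs)
print(len(vocab), "words,", len(sentences), "sentences")
print(" ".join(sentences[0]))
```

The standard library's itertools.product(adjectives, nouns, verbs, adverbs) yields the same combinations in the same order, if four levels of indentation offend you.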
Part 8: Attention / Transformers
Here is the code to copy/paste/modify into your initial transformer.py script. This time the PyTorch folks did a nice job formatting the training reports, including time info! Unfortunately they did not include a more detailed test case (like our confusion matrices from the previous assignments); hence, as before, we have an opportunity to explore further.
First, as before, let’s do the easy thing and see how much value we get from the GPU*, by finding the device = ... code, commenting it out to force CPU, and then running a trial with and without CUDA. As before, if you run the code with time /usr/bin/python3 instead of just /usr/bin/python3, you can get a nice overall time summary at the end, to include in your writeup.
Next, let’s see what this model is actually learning! I found the tutorial description pretty minimal, so as usual I started printing things out and exiting before the training started. By printing out the size (len) of various data variables in the code, I quickly got a confirmation that this is indeed a model of the English language. (Hint: take a look at this statistic.) The sizes of the training and testing sets then made sense too. Make sure to note these three sizes and report them in your writeup, with a brief explanation. Also comment in your writeup on a new “one weird trick” you can see in the training report!
Now that we’ve got a good sense of what kind of data this model is using and how long it takes to train, let’s do the usual thing and break it up into training and testing scripts. Unfortunately the code saves the pickled model best_model_params.pt in some weird temporary directory that I couldn’t locate, so the first modification I made was to force it to save that file in the current directory, with a helpful message about saving the file, as we did in the previous assignments. Once you’ve got the model saved, comment briefly in your writeup on the number of parameters (floating-point weights) it appears to contain, assuming the standard four-byte floating-point encoding.
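The parameter estimate is just file size divided by four. Here is a minimal sketch with a hypothetical helper approx_param_count, demonstrated on a throwaway 4000-byte file rather than the real best_model_params.pt.

```python
import os
import tempfile

def approx_param_count(filename, bytes_per_param=4):
    """Estimate the number of weights from file size, assuming 4-byte floats."""
    return os.path.getsize(filename) // bytes_per_param

# Demo on a dummy file of 4000 zero bytes (stand-in for the saved model)
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(bytes(4000))
    demo_path = f.name

estimate = approx_param_count(demo_path)
os.remove(demo_path)
print("Roughly", estimate, "parameters")
```

Treat the result as an upper bound: a saved model file also holds parameter names and other bookkeeping, so the true count will be somewhat smaller.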
So now it’s time to split up our code the usual way: transformer.py (class definition), transformer_train.py (train model and save it), and transformer_test.py (load model and test it). Because of the way that the original transformer script mixes up global variables and function parameters, this step can take a while to get right, but at the end you’ll have a standalone test script that you can use to try out your Pre-Trained transformer: the PT part of ChatGPT!
As it stands, the evaluate() function used by the training and testing scripts doesn’t report anything interesting; it just returns the loss value. So to get a better idea of what the network is actually doing, I copy/pasted the evaluate() function into my test script to create a new function report(), which I then modified to report the actual input and target words. As mentioned in the (confusing to me!) tutorial instructions, the job of the get_batch() function used in evaluate() is to make a target sequence out of the input sequence by shifting the input sequence by one position, the same trick as in Elman’s original 1990 sequence-learning model.
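The shift-by-one trick is easier to see on a single short sequence. The sketch below is a simplified, one-column version in plain Python, not the tutorial's actual get_batch() (which works on a 2D tensor of word indices); the name get_batch_1d is made up.

```python
def get_batch_1d(tokens, i, bptt):
    """Return (data, target) where target is data shifted one position ahead."""
    seq_len = min(bptt, len(tokens) - 1 - i)   # leave room for the final target
    data = tokens[i:i + seq_len]
    target = tokens[i + 1:i + 1 + seq_len]
    return data, target

tokens = "the dog ate the apple".split()
data, target = get_batch_1d(tokens, 0, 3)
print(data)    # ['the', 'dog', 'ate']
print(target)  # ['dog', 'ate', 'the']
```

Each target word is simply the word that follows the corresponding input word, so the network is being trained to predict what comes next.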
To verify this claim for the actual vocabulary in the data, it took a little bit of experimental printing, to see that the data (input) and targets were both of size 35×10, but that the targets had been reshaped to 1×350. Once I figured that out, I was able to reverse-engineer the vocabulary object to extract the words corresponding to each word index, and then write some code to report the data words followed by the target words. Hint: as usual, type() will tell you the type of a variable, after which you can look up its methods in the online documentation. For the first iteration of the report() loop I got this output (abbreviated here for simplicity), showing that the inputs and targets had the expected relationship:
= next either imagery and her = . was hitting
robert day blunt and n death boston seneca proscribed the
<unk> it ( clear @-@ as celtics asked . slow ...
robert day blunt and n death boston seneca proscribed the
<unk> it ( clear @-@ as celtics asked . slow
= joined <unk> , <unk> it = <unk> these @-@ ...
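The relationship visible in the output above (each row of targets is the next row of data) can be reproduced on a toy scale. In this sketch the 4×2 grid of indices and the five-word list itos are invented, standing in for the real 35×10 tensors and for whatever the vocabulary object provides; all function names are hypothetical.

```python
def flatten_rows(grid):
    """Row-major flattening, like reshaping a 2D tensor into one long row."""
    return [x for row in grid for x in row]

def unflatten(flat, ncols):
    """Cut a flat list back into rows of ncols items each."""
    return [flat[i:i + ncols] for i in range(0, len(flat), ncols)]

def to_words(indices, itos):
    """Look up the word for each index."""
    return [itos[i] for i in indices]

itos = ["<unk>", "the", "dog", "ate", "apple"]   # made-up index-to-word list
source = [[1, 2], [2, 3], [3, 1], [1, 4]]        # each column is its own text stream
data = source[0:3]
targets_flat = flatten_rows(source[1:4])         # what the model is asked to predict
target_rows = unflatten(targets_flat, 2)

for row in data:
    print(" ".join(to_words(row, itos)))
print("...")
for row in target_rows:
    print(" ".join(to_words(row, itos)))
```

Reading down a column rather than across a row gives you running text (here, the dog ate the in the first column), which is a useful thing to keep in mind when staring at the real output.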
Looking at this data, I still couldn’t make any sense of the individual lines: WTF is = next either imagery and her = was hitting … supposed to mean?! As a final effort at understanding this complicated model, I managed to find the URL for the WikiText-2 dataset zipfile, hidden in the PyTorch source code. Downloading and unzipping this dataset and looking through the testing part, I solved the final mystery! In your writeup, briefly comment on what you find when you do this; i.e., how is the code representing the actual text? Hint: Add a line print(targets) inside your training loop. Can you explain what the targets mean?
What to submit to Github
As usual, your PDF writeup will be the main part, plus your Python scripts to preserve your work.