315-ps6

CSCI 315 Assignment #6

Due on GitHub 11:59PM Friday 14 April

Goals

    1. Understand state-of-the-art recurrent networks by building a Long Short-Term Memory (LSTM) network for part-of-speech (POS) tagging.
    2. Learn how neural nets represent word meanings by building a network for dense word embeddings.
    3. Get a feel for the latest development in Deep Learning — the attention/transformer networks behind ChatGPT — by deploying and modifying a simple PyTorch example.

Part 1: LSTM

Read through this tutorial and copy/paste the code into a Python script lstm.py.  Because of the small problem size, it should be possible to run the whole script (including training) in a few seconds on your laptop.

As usual, we’re going to learn more about this model by making some improvements.

First, based on the tutorial’s description of the final output (tag scores), add some code to convert the output to the more human-readable form described in the large comment (DET NOUN VERB DET NOUN).  A nice output would report each word along with the POS label predicted by the network, followed by the correct POS label (just as we did all the way back with XOR learning).  For example:

The:   DET (DET)
dog:   NN ( NN)
ate:   V ( V)
the:   DET (DET)
apple: NN ( NN)

Once you’ve got that working, factor the code into a test function test(words, targets), where targets are the desired POS tags, that will run this test for either of the two training examples (The dog ate the apple or Everybody read that book).  At this point you can probably comment out the print() statements in the code before the training/testing part, to avoid distraction.  Then add the usual code inside your training loop to report the number of epochs at a reasonable interval.
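Here is a minimal sketch of what such a test function might look like, assuming the tutorial’s model, word_to_ix, tag_to_ix, and prepare_sequence() are already in scope (the exact formatting is up to you):

# Sketch of a test function; assumes the tutorial's model, word_to_ix,
# tag_to_ix, and prepare_sequence() are defined as in the tutorial.
ix_to_tag = {ix: tag for tag, ix in tag_to_ix.items()}

def test(words, targets):
    with torch.no_grad():
        inputs = prepare_sequence(words, word_to_ix)
        tag_scores = model(inputs)                     # one row of log-probs per word
        predictions = torch.argmax(tag_scores, dim=1)  # most likely tag index per word
        for word, pred, target in zip(words, predictions, targets):
            print('%-8s %4s (%4s)' % (word + ':', ix_to_tag[pred.item()], target))

test('The dog ate the apple'.split(), ['DET', 'NN', 'V', 'DET', 'NN'])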

Now that we’ve got a nice little training/testing script set up, let’s add some more parts of speech (POS) to our problem.  Adjectives seem like a natural place to start: how about The big dog ate the red apple and Everybody read that awesome book?  Feel free to come up with your own vocabulary (bonus points for making me laugh!)

Of course, we haven’t really done an honest training/testing evaluation of our model, because we’re training and testing it on the same data set.  To see how the model works on data it hasn’t seen, try passing a couple of new sentences to your test function, by making new sentences from the existing vocabulary.

Applying some critical thinking to our results so far, you can probably see that they aren’t all that impressive: all we’ve got is a model that classifies data (words) that it’s already seen! Indeed, you could probably do this with a simple logistic-regression or classic perceptron model using a single layer of weights and no recurrent (feedback) connections.  So what’s the big deal about LSTM and other recurrent networks?

Well, as we discussed in lecture, the cool thing about these networks (going back to Elman’s 1990 Simple Recurrent Network) is that they can predict the identity or category of the next item, based on the items seen so far; e.g., you know that the next word after The dog ate the … must be a noun or adjective.  What’s more, you can use this approach to solve “Plato’s Problem” of deducing information about words you have never heard: when you hear The dog ate the knish, you can immediately infer that knish is a noun (and probably something edible!) So, to finish up Part 1 of the assignment, figure out how to add a new word to the vocabulary, run the training on the same sentences as before (i.e., sentences not containing that new word), and see what your test function does on a sentence containing the new word.
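For example (a hypothetical sketch — knish is just a stand-in for whatever new word you choose), you might register the new word before the model is built, train on the original sentences only, and then test:

# Hypothetical sketch: give the unseen word an index before building the model,
# but leave the training sentences themselves unchanged.
word_to_ix['knish'] = len(word_to_ix)

# ... build and train the model on the original sentences as before ...

test('The dog ate the knish'.split(), ['DET', 'NN', 'V', 'DET', 'NN'])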

Part 2: Word Embeddings

As in the previous part, read the PyTorch tutorial on Word Embeddings, copy/pasting the code into a script embedding.py.  And again, comment out the annoying print statements, except for the final one reporting the embedded vector value (tensor) for one of the words.  Next, replace that final print with a loop that prints each vocabulary word followed by its vector, without the distracting tensor(...) / grad_fn wrapping.  After “a little help from my friends” on StackOverflow, I was able to use numpy formatting to get an attractive printout like this, sorted in alphabetical order:

'This   [+2.295 +0.676 +1.714 -1.794 -1.521 +0.918 -0.549 -0.347 +0.473 -0.429]
And     [+0.487 -0.309 -3.014 -1.247 +1.349 +0.269 -1.128 -0.601 +1.837 -1.071]

youth's [+2.414 +1.021 -0.44  -1.734 -1.026 +0.521 -0.453 -0.126 -0.588 +2.119]
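Here is one way such a printout might be produced — a sketch assuming the tutorial’s model and word_to_ix; the numpy print options are just one formatting choice:

import numpy as np

# Signed, three-decimal formatting for float entries in printed arrays.
np.set_printoptions(formatter={'float': '{:+.3f}'.format})

for word in sorted(word_to_ix):
    vec = model.embeddings.weight[word_to_ix[word]].detach().numpy()
    print('%-10s %s' % (word, vec))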

Note that because of the simple way that the text was split (via the default whitespace delimiter), we’re getting bogus punctuation included with some words (like the quotation mark in 'This) — which, as Shakespeare might say, doth vex me somewise!  Pythonistas have lots of tricks for getting rid of punctuation, but for the current project I think it’s simply easier to either (a) not worry about the problem, or (b) edit the sonnet fragment a bit to eliminate punctuation and upper-case letters.  I chose option (b), which gave me a final vocabulary size of 86 lower-case words.

Now that we’ve got a nice little word-embedding program with sensible output, let’s see whether we can understand the embeddings (vectors) that it’s giving us.  As we saw in the lecture on the Simple Recurrent Net (slides 13-16), a clever way of doing this is to build a distance matrix, then run Hierarchical Cluster Analysis on the distance matrix to build a dendrogram (tree diagram) to visualize the semantic structure encoded in the embeddings.

Fortunately, there are now powerful tools that can do both of these steps for you automatically. This page has a schweet example.  Once I got this example running, I printed out the tiny dataset used for the clustering and saw that it was simply a list of 2D vectors (represented as tuples).  After puzzling over how to go from that kind of 2D data to our ten-dimensional embedding vectors, I figured I’d just put the vectors into a big list and use them as the data.  Sure enough, it worked!  A little more googling revealed how to use my Shakespeare vocabulary as the labels for the dendrogram, and then how to rotate the dendrogram so that the labels appeared on the left rather than at the bottom, for greater readability.
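The core of that recipe ends up being just a few lines — a sketch, assuming vectors is the list of embedding vectors from the printout above and labels is the matching list of vocabulary words:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# vectors: one embedding vector per word; labels: the matching words
Z = linkage(vectors, method='ward')                # hierarchical cluster analysis
dendrogram(Z, labels=labels, orientation='right')  # rotated; try 'left'/'right' to put labels where you want them
plt.tight_layout()
plt.show()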

After all that work, I found my dendrogram results somewhat disappointing: with so many words, it was impossible to read the whole plot clearly, and when I tried to zoom in on it, I found it difficult to discern the kinds of word-class patterns that Elman got.  As is often the case in science (especially data science), your results can be sensitive not only to the algorithm you use, but also to the data!  In other words, if I could go back to the simpler, artificially-generated sentences used by Elman, I might see some kind of pattern in my own embedding results.

As usual, googling a bit for RANDOM SENTENCE GENERATOR PYTHON, I found a simple solution on StackOverflow. Even better, I realized that instead of using random numbers, I could simply enumerate every possible sentence of the form ADJ NOUN VERB ADVERB (e.g., adorable rabbit runs occasionally) with a quadruply-nested cascade of for loops.  This solution had the advantage of a very small vocabulary size (20 words) and a much, much larger data set (5⁴ = 625 four-word sentences) than the original sonnet fragment.  So, for the final part of the assignment, I added a little code to my embedding.py to generate these simple sentences and use them as the training set.  By decreasing the context size and number of embedding dimensions, I was able to get reasonable (not perfect!) dendrogram results after 100 epochs.  Try that, see what you get, and include your dendrogram picture in your writeup.
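The generator itself is just a quadruply-nested loop over four small word lists — a sketch, where the particular words are placeholders for whatever vocabulary you invent, and test_sentence is the token list the tutorial code trains on:

# Sketch of the exhaustive sentence generator: every ADJ NOUN VERB ADVERB combination.
adjectives = ['adorable', 'hungry', 'sleepy', 'clumsy', 'furious']
nouns      = ['rabbit', 'dog', 'cat', 'student', 'professor']
verbs      = ['runs', 'eats', 'sleeps', 'sings', 'codes']
adverbs    = ['occasionally', 'quickly', 'quietly', 'loudly', 'never']

sentences = []
for adj in adjectives:
    for noun in nouns:
        for verb in verbs:
            for adv in adverbs:
                sentences.append(' '.join([adj, noun, verb, adv]))

# 5 * 5 * 5 * 5 = 625 sentences over a 20-word vocabulary; join them into one
# token sequence to replace the sonnet fragment as the training text.
test_sentence = ' '.join(sentences).split()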

Part 3: Attention / Transformers

I was unable to do this compute-intensive part on my computer at home, so I suspect you will have to do it on the lab machines.  As before, I’ve installed all the necessary Python packages beforehand, so you can ignore the instructions telling you to install a particular package.  Please let me know, however, if something appears to be missing on the machine you’re using!

Here is the code to copy/paste/modify into your initial transformer.py script.  This time the PyTorch folks did a nice job formatting the training reports — including time info!  Unfortunately they did not include a more detailed test case (like our confusion matrices from the previous assignments); hence, as before, we have an opportunity to explore further.

First, as before, let’s do the easy thing and see how much value we get from the GPU, by finding the device = ... code, changing it to force the CPU, and then running a trial with and without CUDA.  As before, if you run the code with time /usr/bin/python3 instead of just /usr/bin/python3, you can get a nice overall time summary at the end, to include in your writeup.
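For the CPU run, the change amounts to one line (a sketch; the original line in your copy of the tutorial may differ slightly from the comment shown here):

# Original line (approximately): uses the GPU when one is available.
# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Forced to the CPU for the timing comparison:
device = torch.device('cpu')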

Next, let’s see what this model is actually learning!  I found the tutorial description pretty minimal, so as usual I started printing things out and exiting before the training started.  By printing out the size (len) of various data variables in the code, I quickly got confirmation that this is indeed a model of the English language.  (Hint: take a look at this statistic.)  The sizes of the training and testing sets then made sense too.  Make sure to note these three sizes and report them in your writeup, with a brief explanation.  Also comment in your writeup on a new “one weird trick” you can see in the training report!
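One way to do that kind of probing — a sketch, where the variable names assume the tutorial’s vocab, train_data, and test_data — is to print a few sizes and bail out before training starts:

import sys

# Print a few data sizes and exit before any training happens.
print('vocabulary entries:', len(vocab))
print('train_data shape:  ', train_data.shape)
print('test_data shape:   ', test_data.shape)
sys.exit(0)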

Now that we’ve got a good sense of what kind of data this model is using and how long it takes to train, let’s do the usual thing and break it up into training and testing scripts.  Unfortunately the code saves the pickled model best_model_params.pt in some weird temporary directory that I couldn’t locate, so the first modification I made was to force it to save that file in the current directory, with a helpful message about saving the file, as we did in the previous assignments.  Once you’ve got the model saved, comment briefly in your writeup on the number of parameters (floating-point weights) it appears to contain, assuming the standard four-byte floating-point encoding.
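My change looked roughly like this (a sketch; the tutorial builds its path with tempfile, so the only real changes are the path and the message):

# Save the best parameters in the current directory instead of a temp directory.
best_model_params_path = 'best_model_params.pt'
torch.save(model.state_dict(), best_model_params_path)
print('Saved best model parameters to', best_model_params_path)

With the file in hand, dividing its size in bytes by four gives a rough estimate of the weight count, since the pickle’s header and metadata are small compared to the weights themselves.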

So now it’s time to split up our code the usual way: transformer.py (class definition), transformer_train.py (train the model and save it), and transformer_test.py (load the model and test it).  Because of the way that the original transformer script mixes up global variables and function parameters, this step can take a while to get right, but at the end you’ll have a standalone test script that you can use to try out your Pre-Trained transformer: the PT part of ChatGPT!

As it stands, the evaluate() function used by the training and testing scripts doesn’t report anything interesting; it just returns the loss value.  So to get a better idea of what the network is actually doing, I copy/pasted the evaluate() function into my test script to create a new function report(), which I then modified to report the actual input and target words.  As mentioned in the (confusing to me!) tutorial instructions, the job of the get_batch() function used in evaluate() is to make a target sequence out of the input sequence by shifting the input sequence by one position — the same trick as in Elman’s original 1990 sequence-learning model.
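The tutorial’s get_batch() looks roughly like this (paraphrased sketch), and the one-position shift is visible in the slicing:

def get_batch(source, i):
    # source: the batchified data tensor; i: starting row of this batch;
    # bptt (35 in the tutorial) is the maximum sequence length.
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i + seq_len]                        # input sequence
    target = source[i + 1:i + 1 + seq_len].reshape(-1)  # same rows shifted by one, flattened
    return data, target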

To verify this claim for the actual vocabulary in the data, it took a little bit of experimental printing to see that the data (input) and targets were both of size 35×10, but that the targets had been reshaped to 1×350.  Once I figured that out, I was able to reverse-engineer the vocabulary object to extract the word corresponding to each word index, and then write some code to report the data words followed by the target words (a sketch of one way to do this appears after the sample output below). Hint: as usual, type() will tell you the type of a variable, after which you can look up its methods in the online documentation.  For the first iteration of the report() loop I got this output (abbreviated here for simplicity), showing that the inputs and targets had the expected relationship:

= next either imagery and her = . was hitting
robert day blunt and n death boston seneca proscribed the
<unk> it ( clear @-@ as celtics asked . slow ...

robert day blunt and n death boston seneca proscribed the
<unk> it ( clear @-@ as celtics asked . slow
= joined <unk> , <unk> it = <unk> these @-@ ...
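Here is a sketch of the kind of report() that produced the output above, assuming a torchtext vocab object like the tutorial’s (get_itos() gives the index-to-word list):

itos = vocab.get_itos()   # index -> word

def report(eval_data):
    data, targets = get_batch(eval_data, 0)        # first batch: data is 35x10
    for row in data:                               # each row: 10 word indices
        print(' '.join(itos[ix] for ix in row.tolist()))
    print()
    for row in targets.view(-1, data.size(1)):     # re-fold the flattened 1x350 targets
        print(' '.join(itos[ix] for ix in row.tolist()))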

Looking at this data, I still couldn’t make any sense of the individual lines: WTF is = next either imagery and her = was hitting … supposed to mean?!  As a final effort at understanding this complicated model, I managed to find the URL for the WikiText-2 dataset zipfile, hidden in the PyTorch source code.  Downloading and unzipping this dataset and looking through the testing part, I solved the final mystery!  In your writeup, briefly comment on what you find when you do this; i.e., how is the code representing the actual text?

What to submit to GitHub

As usual, your PDF writeup will be the main part, plus the Python scripts to preserve your work.

Extra-Credit Opportunities

The tutorials for the first two parts have an exercise at the bottom.   Although I did not attempt these myself, you should feel free to try one or both for some extra credit.

For me, the remaining mystery of the attention / transformer exercise was the output produced in our evaluate() and report() functions.  This output appears to be logits (net inputs to the softmax layer), so an interesting task would be to see whether we can convert them to word indices and see what the network is actually outputting.  I’m continuing to work on that!