CSCI 316 Problem Set #2

Assignment #5

Due on GitHub by 23:59, Wednesday 13 September

Balance is everything, riding out time …  – John Gardner  (1933 – 1982), Grendel


    1. Understand Deep Q-Learning by coding it up in Python.
    2. Dive deeper into Gymnasium, with a more challenging, dynamic environment.


As in our simple Q-Learning assignment, we'll be modifying an existing code repository to make it more general, which will help us understand how the algorithm works. As before, we're going to end up with a Python class in one file and a unit test in another. Also as before, we'll be using Gymnasium, but now we'll get a chance to step up our game into deep reinforcement learning by revisiting PyTorch.

Getting started

This tutorial has all the code needed to train on CartPole-v1 with Deep Q-Learning.  As usual, once we've got the script running, we're going to make it more general and reusable for future projects.  For this stage I just saved the tutorial script to a single file and ran it as-is.
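The replay buffer is the piece of the tutorial you'll reuse most. Roughly, it stores recent transitions in a bounded deque and samples uniformly from them; the sketch below follows the tutorial's naming, but treat it as an outline rather than a verbatim copy:

```python
import random
from collections import deque, namedtuple

# One step of experience: (state, action, next_state, reward)
Transition = namedtuple('Transition', ('state', 'action', 'next_state', 'reward'))

class ReplayMemory:
    """Fixed-size buffer of past transitions, sampled uniformly during training."""

    def __init__(self, capacity):
        # When full, the oldest transitions fall off the left end
        self.memory = deque([], maxlen=capacity)

    def push(self, *args):
        """Save a transition."""
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        """Draw a random minibatch for a gradient step."""
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```

Sampling uniformly from old experience (rather than training only on the latest step) is what breaks the correlation between consecutive frames and keeps DQL stable.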

Making a reusable DQL class

At this point you should be able to separate your code into two files: one containing the DQL class with a train method, and another containing the code that uses this class to learn CartPole.  As in our Taxi assignment, the class file should not contain any specific references to the CartPole environment.  It would also make sense to turn the hyper-parameters (BATCH_SIZE, GAMMA, …) into default parameters of your train method.  I also found it simpler to remove the device-related code and always run on the CPU, since this network is small enough to train for 600 episodes in a reasonable amount of time.  Finally, in your training loop, make sure to report the current episode number periodically (say, every 10 episodes), along with the summed reward for that episode, so you can track your progress; for example:

Episode 000 / 600: total reward = 20.000000
Episode 010 / 600: total reward = 35.000000
Episode 020 / 600: total reward = 9.000000
Episode 030 / 600: total reward = 13.000000
Episode 040 / 600: total reward = 14.000000
Episode 050 / 600: total reward = 12.000000
Episode 060 / 600: total reward = 10.000000
Episode 070 / 600: total reward = 17.000000
Episode 080 / 600: total reward = 15.000000
Episode 090 / 600: total reward = 21.000000
Episode 100 / 600: total reward = 23.000000
Episode 110 / 600: total reward = 52.000000
Episode 120 / 600: total reward = 65.000000
Episode 130 / 600: total reward = 127.000000
Episode 140 / 600: total reward = 151.000000
Episode 150 / 600: total reward = 227.000000
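One f-string reproduces log lines in the style shown above (the exact format is my own choice, not something the assignment mandates):

```python
def progress_line(episode, num_episodes, total_reward):
    """Format one training-progress line.

    Zero-pads the episode number to three digits and prints the
    episode's summed reward with six decimal places.
    """
    return f"Episode {episode:03d} / {num_episodes}: total reward = {total_reward:f}"
```

In the training loop you'd call this only when `episode % 10 == 0`, so the console stays readable over 600 episodes.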

As in the previous assignment, your train method should return an array of total rewards per episode, so your main program can plot it after training.  Here, for example, is a successful training session that I was lucky to get after several attempts.  Note the catastrophic forgetting that occurs once the network has achieved the highest possible reward:
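Raw per-episode rewards are noisy, so before plotting the array your train method returns, you may want to smooth it with a simple running mean (the window size here is an arbitrary choice of mine):

```python
def running_mean(rewards, window=10):
    """Average each reward with up to window-1 of its predecessors.

    Returns a list the same length as the input, so the smoothed curve
    lines up with the raw one when both are plotted.
    """
    smoothed = []
    for i in range(len(rewards)):
        chunk = rewards[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

The smoothed list can be passed straight to matplotlib's `plt.plot`; catastrophic forgetting shows up as a sudden cliff in the curve even after smoothing.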

Adding a testing (play) feature

Once you’ve got the training script running, add a play method to your DQL class to play the game (with rendering).  Hint: To get the action from the network, I first had to convert the state to a PyTorch tensor; then, I had to convert the resulting action-values tensor back into a NumPy array in order to call argmax on it.
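Concretely, the conversion round-trip described in the hint looked something like this for me (the function and variable names are mine, not part of the tutorial):

```python
import numpy as np
import torch

def select_action(policy_net, state):
    # env.reset()/env.step() hand back NumPy arrays; the network wants a
    # batched float tensor, hence the dtype conversion and unsqueeze(0)
    state_t = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
    with torch.no_grad():  # no gradients needed when just playing
        q_values = policy_net(state_t)
    # Back to NumPy so argmax yields a plain int that env.step() accepts
    return int(np.argmax(q_values.squeeze(0).numpy()))
```

Inside play, you'd call this once per frame in a loop that steps the environment until termination.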

What to submit to GitHub

  • Your two Python files: the one containing your DQL class and the one that trains and plays CartPole.
  • Scripts for any additional environments you can solve for extra credit!