CSCI 316 Problem Set #3

Assignment #6

Due on GitHub by 23:59, Wednesday 20 September

The sweep of the pendulum had increased in extent by nearly a yard. As a natural consequence, its velocity was also much greater. 

– E.A. Poe, The Pit and the Pendulum

Do you want ants? Because that’s how you get ants! – Archer

Objectives

    1. Expand our Deep Reinforcement Learning toolkit to enable continuous control for robotics.
    2. Improve our understanding of Actor/Critic methods.
    3. Get some experience using NVIDIA's new Isaac Gym DRL simulator.

Running on a workstation in the Advanced Lab (Parmly 413)

Because of the computational power needed for DRL, you will want to do this assignment on a computer with a GPU.  Here's what you need to do to get started:

git clone https://github.com/simondlevy/AC-Gym

cd AC-Gym

/usr/bin/python3 a2c-learn.py --maxtime 300

The program may give you a little warning about doing something sub-optimal, and then it will begin iterating over episodes using the A2C learning algorithm on the Inverted Pendulum problem (Pendulum-v1).  The reward will start as a pretty big negative number (around -2000) and move closer to zero.  (Remember, reward is just the negative of cost, so a reward that is less negative means the cost is shrinking, which is exactly what we expect from gradient descent.)  Setting the maxtime option to 300 (seconds) will cause the program to stop after five minutes.
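
To make the reward-versus-cost relationship concrete, here's a tiny sketch of what (to the best of my knowledge) the per-step pendulum reward looks like: the negative of a quadratic cost on the pole's angle, its angular velocity, and the torque you apply.

# Illustration of "reward = -cost" for the pendulum task.  To the best of
# my knowledge this mirrors the per-step reward used by Gym's Pendulum-v1.
import math

def pendulum_reward(theta, theta_dot, torque):
    # theta = 0 means upright, so the cost vanishes only when the pole is
    # balanced upright, not spinning, and using no effort.
    cost = theta ** 2 + 0.1 * theta_dot ** 2 + 0.001 * torque ** 2
    return -cost

print(pendulum_reward(math.pi, 8.0, 2.0))  # hanging down, swinging hard: about -16.3 per step
print(pendulum_reward(0.0, 0.0, 0.0))      # balanced upright: 0.0

Summed over the roughly 200 steps in an episode, per-step costs like these are why the early episodes land a couple of thousand below zero.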

Plotting the results

Once you get a best-reward report, there will also be two new directories, models and runs, with the runs directory containing a .CSV file that you can double-click to open in LibreOffice.  (If you prefer, you can just email it to yourself and open it in your favorite spreadsheet program on your laptop, like Excel or Numbers.)  This plot should be the first figure in your writeup.  I found that A2C did pretty poorly on this task, hovering around -1200 for the reward.  For the cleanest comparison, skip the ac-plot program and instead make a nice plot in a spreadsheet, so you can put both runs in the same figure.
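
If you'd rather script the plot than point-and-click in a spreadsheet, a few lines of matplotlib will do it; the filenames and column names below are placeholders, so check the header of your own .CSV files first.

# Sketch: plot both reward curves from the runs/ CSVs in a single figure.
# Filenames and column names here are guesses -- check your own CSV header.
import pandas as pd
import matplotlib.pyplot as plt

a2c = pd.read_csv('runs/a2c-run.csv')   # hypothetical filename
td3 = pd.read_csv('runs/td3-run.csv')   # hypothetical filename

plt.plot(a2c['episode'], a2c['reward'], label='A2C')
plt.plot(td3['episode'], td3['reward'], label='TD3')
plt.xlabel('Episode')
plt.ylabel('Reward')
plt.legend()
plt.savefig('pendulum-comparison.png')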

Using a better algorithm

Fortunately, the repository you cloned provides a variety of Actor/Critic algorithms.  In my experience, the best one (on this pendulum problem, at least) is TD-3.  So repeat your five-minute experiment above, this time with td3-learn.py, and plot the results as before.
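
If you're curious why TD-3 (twin-delayed DDPG) tends to do so much better here, the heart of it is how the critic target is computed: two target critics whose minimum is used, plus clipped noise on the target action.  Here's a generic PyTorch-style sketch of that target, not the exact code in the repo:

# Schematic TD3 critic target (a generic sketch, not AC-Gym's implementation).
import torch

def td3_target(reward, next_state, done, target_actor, target_q1, target_q2,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=2.0):
    with torch.no_grad():
        # "Target-policy smoothing": perturb the target action with clipped noise.
        next_action = target_actor(next_state)
        noise = (torch.randn_like(next_action) * noise_std).clamp(-noise_clip, noise_clip)
        next_action = (next_action + noise).clamp(-max_action, max_action)  # 2.0 ~ pendulum torque limit
        # "Clipped double-Q": the smaller of the two critics curbs overestimation.
        q_next = torch.min(target_q1(next_state, next_action),
                           target_q2(next_state, next_action))
        return reward + gamma * (1.0 - done) * q_next

The third trick, updating the actor less often than the critics, is where the "delayed" in the name comes from.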

Playing back your results

Given how much better TD-3 does on the pendulum task (versus A2C), it's probably only worth running a test on the network you built using TD-3.  To do that, you can use the .dat file created in your models directory; for example:

/usr/bin/python3 ac-test.py models/td3-Pendulum-v1-00156.851.dat

showed me a little movie of the pendulum swinging to its upright position.
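
If you want a mental model of what ac-test.py is doing, it's essentially loading the saved network and running one rendered episode with it.  The sketch below is conceptual only: the .dat contents and the exact gym API version are assumptions, so treat ac-test.py itself as the authority.

# Conceptual playback loop (the real loading logic lives in ac-test.py;
# the .dat contents and gym API version here are assumptions).
import gym
import torch

net = torch.load('models/td3-Pendulum-v1-00156.851.dat')  # saved actor network
env = gym.make('Pendulum-v1')

state = env.reset()
done = False
while not done:
    env.render()   # draws the little pendulum movie
    with torch.no_grad():
        action = net(torch.as_tensor(state, dtype=torch.float32)).numpy()
    state, reward, done, _ = env.step(action)
env.close()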

Ants!

To finish up our exploration of cutting-edge DRL technology, and get ready for our ant robot projects, let’s use the Isaac Gym software to train some robotic ants.  To get started:

git clone https://github.com/NVIDIA-Omniverse/IsaacGymEnvs

cd IsaacGymEnvs/isaacgymenvs

/usr/bin/python3 train.py

That last command will train your ants in real time with rendering, which, as we know, can slow things down dramatically.  Fortunately, Isaac Gym also allows “headless” (no-render) training, as described in the instructions for the repository you just cloned.  This time, use headless training to train your ants; then follow the repository's directions for running the train.py program again to show your training results.  Because headless training is supposed to stop automatically once performance is decent, none of the ants should stumble or tip over this time.
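
The exact flags are Hydra-style key=value overrides documented in the repository's README; the last time I looked, headless training and playback looked roughly like the commands below, but defer to the README (and to whatever checkpoint path actually shows up under runs/) if they differ.

/usr/bin/python3 train.py headless=True

/usr/bin/python3 train.py test=True checkpoint=runs/Ant/nn/Ant.pth num_envs=64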

Comparing apples to apples

Being able to train up a swarm of ants in under a minute is pretty cool, but just how much of an improvement are we getting with Isaac Gym over PyTorch by itself (i.e., the td3-learn program from earlier)?  To address this question scientifically, we'd like to compare performance on the same environment.  Looking over the list of environments (tasks) in Isaac Gym, I found that although Isaac Gym currently has far fewer built-in environments than the ones in the table from OpenAI Gym, both platforms appear to have the familiar Cartpole.  Looking at the code for Isaac Gym's Cartpole implementation, however, I began to doubt that it was the same as either of the Cartpole environments (v0 or v1) from OpenAI.  Can you spot the difference(s)?  As usual, start with the obvious.
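
A quick way to pin down the OpenAI side of the comparison is to print the CartPole-v1 spaces directly (a short sketch using the classic gym API) and then hold them up against what you find in Isaac Gym's cartpole task file.

# Print CartPole-v1's spaces for comparison with Isaac Gym's Cartpole task.
import gym

env = gym.make('CartPole-v1')
print('observation space:', env.observation_space)       # 4-dimensional Box: cart position/velocity, pole angle/velocity
print('action space:     ', env.action_space)            # Discrete(2): push left or right
print('episode length:   ', env.spec.max_episode_steps)  # 500 for v1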

What to turn in to github

Since we didn’t write any code for this assignment, all I expect is:

  • A little (one-page) PDF write-up sketching your results. Be sure to include:
    • the reward plots from your first and second experiments, making sure to use the same Y axis limits for both plots (ideally, you can just plot them in a single plot with a legend).
    • a brief discussion of the Cartpole code from Isaac Gym that reveals that (1) it's not using gym as we did; and (2) details of how its reward and action space differ from the Cartpole-v1 that we used.
  • Your td3-Pendulum-v1-….dat file from the pendulum exercise
  • Your Ant.pth file