Updates / FAQs

  • 2017-04-21 First draft


In this optional project you will implement a solution to the Reinforcement Learning problem with continuous multivariate state. In an earlier project, you solved this problem using Q-Learning with one-dimensional discrete states. We will test your implementation in the context of a robot navigation problem. The form of the navigation problems is exactly the same as in the earlier project, except that instead of a single-dimensional discrete state, your learner will receive a multidimensional continuous state.

We will not review all the details of the problem here because they are available in the earlier project description.

We will provide code that automates testing of your learner in the navigation problem. Overall, your tasks for this project include:

  • Code a continuous reinforcement learner CRLearner
  • Test/debug the learner in navigation problems

For this assignment we will test only your code (there is no report component).

Template and Data

The template has not yet been added to GitHub, but you can get a zip file version of it here:

  • Update your local mc3_p5 directory using github.
  • Implement the CRLearner class in mc3_p5/
  • To test your CRLearner, run python from the mc3_p5/ directory.
  • Note that example navigation problems are provided in the mc3_p5/testworlds directory.

The worlds beginning with “vr” as in “vr_01_005.csv” are intended for use as test cases for this project.

Part 1: Implement CRLearner

Your CRLearner class should be implemented in the file It should implement EXACTLY the API defined below. DO NOT import any modules besides those allowed below. Your class should implement the following methods:

Details on the input arguments to the constructor:

  • num_dimensions: integer, the number of continuous dimensions in the state
  • num_actions: integer, the number of actions available.
  • verbose: boolean, True if the learner is allowed to print debugging output.

query(s_prime, r) is the core method of the CRLearner. It should keep track of the last state s and the last action a, then use the new information s_prime and r to update its internal model or policy. The learning instance, or experience tuple, is <s, a, s_prime, r>. query() should return an integer, which is the next action to take. Details on the arguments:

  • s_prime: a one-dimensional ndarray containing num_dimensions elements. Each element corresponds to one dimension of the state.
  • r: float, a real valued immediate reward.

querysetstate(s) is a special version of the query method that sets the state to s and returns an integer action according to the same rules as query().

Here’s an example of the API in use:

import CRLearner as cr
import numpy as np

learner = cr.CRLearner(num_dimensions = 2, 
    num_actions = 4, verbose = False)

s = np.asarray((0.4, 0.45)) # our initial state

a = learner.querysetstate(s) # action for state s

s_prime = np.asarray((0.42, 0.45)) # the new state we end up in after taking action a in state s

r = 0.0 # reward for taking action a in state s

next_action = learner.query(s_prime, r)
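
To make the API concrete, here is a minimal sketch of a class that satisfies it. It is not a reference solution: it uses Q-learning with one linear value function per action, which is likely too weak for worlds with obstacles and only illustrates how s, a, s_prime and r flow through the methods. The constructor signature is inferred from the example call above; alpha, gamma and epsilon are assumed hyperparameters that are not part of the specification.

```python
import numpy as np


class CRLearner(object):
    """Sketch only: Q-learning with a linear value function per action."""

    def __init__(self, num_dimensions=2, num_actions=4, verbose=False,
                 alpha=0.1, gamma=0.9, epsilon=0.1):
        # alpha, gamma and epsilon are assumed hyperparameters (not in the spec)
        self.num_actions = num_actions
        self.verbose = verbose
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        # one weight vector per action; the extra column is a bias term
        self.w = np.zeros((num_actions, num_dimensions + 1))
        self.s = None  # features of the last state
        self.a = 0     # last action

    def _features(self, s):
        # flatten the state and append a constant 1.0 for the bias weight
        return np.append(np.asarray(s, dtype=float).ravel(), 1.0)

    def _choose(self, phi):
        # epsilon-greedy action selection over the estimated action values
        if np.random.rand() < self.epsilon:
            return int(np.random.randint(self.num_actions))
        return int(np.argmax(self.w.dot(phi)))

    def querysetstate(self, s):
        # set the state and pick an action, without any learning update
        self.s = self._features(s)
        self.a = self._choose(self.s)
        return self.a

    def query(self, s_prime, r):
        # update from the experience tuple <s, a, s_prime, r>, then act
        phi_prime = self._features(s_prime)
        target = r + self.gamma * np.max(self.w.dot(phi_prime))
        td_error = target - self.w[self.a].dot(self.s)
        self.w[self.a] += self.alpha * td_error * self.s
        self.s = phi_prime
        self.a = self._choose(phi_prime)
        return self.a
```

A richer feature map or a small neural network could replace `_features` and the weight matrix without changing the public methods.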

Part 2: Navigation Problem Test Cases

We will test your CRLearner with a navigation problem as follows. Note that your CRLearner does not need to be coded specially for this task. In fact the code doesn’t need to know anything about it. The code necessary to test your learner with this navigation task is implemented in for you.

The navigation task takes place in a square grid world that measures 1.0 units by 1.0 units. The location of the robot is the “state” and it will be provided to you as a 1 by 2 ndarray where the first element represents the X location and the second element represents the Y location. The particular environment is expressed in a CSV file of integers, where the value in each position is interpreted as follows:

  • 0: blank space.
  • 1: an obstacle.
  • 2: the starting location for the robot.
  • 3: the goal location.
  • 5: quicksand.

An example navigation problem (world01.csv) is shown below. Following python conventions, [0.0, 0.0] is upper left, or northwest corner, and [1.0, 1.0] is the lower right or southeast corner. Rows are north/south, columns are east/west.


In this example the robot will be started at the bottom center, and must navigate to the top left. Note that a wall of obstacles blocks its path, and there is some quicksand along the left side. The objective is for the robot to learn how to navigate from the starting location to the goal with the highest total reward. We define the reward for each step as:

  • -1 if the robot moves to an empty or blank space, or attempts to move into a wall
  • -100 if the robot moves to a quicksand space
  • 1 if the robot moves to the goal space
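
The cell codes, coordinate convention and rewards described above can be sketched as follows. This is only an illustration of how a test harness might work, not the provided testing code. It assumes that a location maps to a grid cell by scaling each coordinate by the grid size, and that the starting cell (code 2) is rewarded like a blank space. The names cell_at and step_reward are hypothetical.

```python
import numpy as np


def cell_at(world, location):
    """Return the integer code of the grid cell containing location = [x, y]."""
    n_rows, n_cols = world.shape
    x, y = np.asarray(location, dtype=float).ravel()
    # x runs west to east (columns), y runs north to south (rows);
    # clamp so that a coordinate of exactly 1.0 falls in the last cell
    col = min(int(x * n_cols), n_cols - 1)
    row = min(int(y * n_rows), n_rows - 1)
    return int(world[row, col])


def step_reward(cell_value):
    """Reward for ending a step in a cell with the given code."""
    if cell_value == 3:   # goal
        return 1.0
    if cell_value == 5:   # quicksand
        return -100.0
    return -1.0           # blank space, or a blocked move into an obstacle


# tiny 2x2 world: goal in the northwest, quicksand southwest, start southeast
world = np.array([[3, 0],
                  [5, 2]])
reward_at_goal = step_reward(cell_at(world, [0.1, 0.1]))
```

Because the learner only ever sees the continuous location, the same code can be tested against worlds of any resolution.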

Overall, we will assess the performance of a policy as the average reward it incurs to travel from the start to the goal (higher reward is better). We assess a learner in terms of the reward it converges to over a given number of training iterations (trips from start to goal).

Important note: the problem includes random actions and sensor noise. So, for example, if your learner responds with a “move north” action, there is some probability that the robot will actually move in a different direction. For this reason, the “wise” learner develops policies that keep the robot well away from quicksand. We map this problem to a reinforcement learning problem as follows:

  • State: The state is the location of the robot, expressed as a 2 element vector.
  • Actions: There are 4 possible actions, 0: move north, 1: move east, 2: move south, 3: move west.
  • R: The reward is as described above.
  • T: The transition matrix can be inferred from the CSV map and the actions.

Note that R and T are not known by or available to the learner. The testing code will test your code as follows (pseudo code):

Instantiate the learner with the constructor CRLearner()
s = initial_location
a = querysetstate(s)
s_prime = new location according to action a
r = -1.0
while not converged:
    a = query(s_prime, r)
    s_prime = new location according to action a
    if s_prime == goal:
        r = +1
        s_prime = start location
    else if s_prime == quicksand:
        r = -100
    else:
        r = -1

A few things to note about this code: The learner always receives a reward of -1.0 (or -100.0) until it reaches the goal, when it receives a reward of +1.0. As soon as the robot reaches the goal, it is immediately returned to the starting location.
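
For local debugging, one trip from start to goal can be sketched as a small runnable loop. This is an assumption-laden stand-in for the provided testing code, not a copy of it: step_fn is a hypothetical environment function, AlwaysEast is a placeholder learner, and the toy corridor world has no obstacles, quicksand or noise.

```python
import numpy as np


def run_trip(learner, start, step_fn, max_steps=10000):
    """Run one trip from start to goal and return the total reward collected.

    step_fn(s, a) is a hypothetical environment function returning
    (new_location, reward, reached_goal).
    """
    start = np.asarray(start, dtype=float)
    a = learner.querysetstate(start)
    s = start
    total = 0.0
    for _ in range(max_steps):
        s_prime, r, reached_goal = step_fn(s, a)
        total += r
        if reached_goal:
            s_prime = start  # the robot is returned to the start
        a = learner.query(s_prime, r)  # the learner sees s_prime and r
        s = s_prime
        if reached_goal:
            break
    return total


class AlwaysEast(object):
    """Stand-in learner that ignores its inputs and always moves east."""

    def querysetstate(self, s):
        return 1

    def query(self, s_prime, r):
        return 1


def corridor_step(s, a):
    """Toy noise-free world: each east move adds 0.25 to x; goal at x >= 1."""
    s_new = s + np.array([0.25, 0.0]) if a == 1 else s
    at_goal = s_new[0] >= 1.0
    return s_new, (1.0 if at_goal else -1.0), at_goal


total = run_trip(AlwaysEast(), [0.0, 0.5], corridor_step)
```

Swapping AlwaysEast for your CRLearner and calling run_trip repeatedly gives a rough picture of how the total reward per trip changes as the learner gains experience.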

Part 3: Implement author() Method (0%)

You should implement a method called author() that returns your Georgia Tech user ID as a string. This is the ID you use to log into t-square. It is not your 9 digit student number. Here is an example of how you might implement author() within a learner object:

class CRLearner(object):
    def author(self):
        return 'tb34' # replace tb34 with your Georgia Tech username.

And here’s an example of how it could be called from a testing program:

# create a learner and print its author
learner = cr.CRLearner(num_dimensions = 2,
    num_actions = 4, verbose = False) # create a CRLearner
print(learner.author())

Check the template code for examples. We are adding those to the repo now, but they might not be there if you check right away. Implementing this method correctly does not earn any points, but there will be a penalty for not implementing it.

Contents of Report

There is no report component of this assignment. However, if you would like to impress us with your Machine Learning prowess, you are invited to submit a succinct report.

Hints & resources

The main difference between this problem and the earlier one is that you must deal with continuous state. Deep Q-Learning is one approach to this problem. You are welcome also to consider other solutions if you like. Here are some links to Deep Q-Learning approaches:

What to turn in

Turn your project in via t-square. All the code necessary to run your learner must be submitted. We will call only your methods in CRLearner following the specification described above. You are allowed to access/use library code, but it must be submitted and run as .py files. If you do use code that was not written by you, you must include comments providing proper credit and citations.

  • Your CRLearner as
  • Other python files as necessary to support your learner.


Only your CRLearner class will be tested.

  • The code for the learner must reflect an effort to create a continuous state learner (not a repackaged discrete state learner like Q-Learning).
  • We will create a number of groups of test cases, where each group reflects essentially the same navigation problem but with progressively higher resolution, e.g., multiple square worlds of different sizes: 5×5, 10×10, 100×100, 1000×1000, etc. Your learner will not know the dimensions of the world it is in.
  • We will test your learner against N (value of N to be determined later) test worlds with 500 iterations in each world. One “iteration” means your robot reaches the goal one time, or the simulation times out. Your CRLearner retains its state, and then we allow it to navigate to the goal again, over and over, 500 times.
  • Benchmark: We do not have a reference solution for this problem. We will instead use the best student’s submission as the benchmark. We will select a number of test cases that the benchmark can solve, then use those as the cases we test other submissions against. We will take the median reward of the benchmark across all of those 500 iterations.
  • Your score: For each world we will take the median reward your solution achieves across all 500 iterations.
  • For a test to be successful, your learner should find a total reward >= 1.5 x the benchmark.
  • There will be 10 test cases; each test case is worth 9.0 points.
  • Is the author() method correctly implemented? (-100% if not)
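
As a worked illustration of the pass criterion, note that per-trip rewards in these worlds are usually negative, so 1.5 x the benchmark is a more negative (easier) threshold than the benchmark itself. The sketch below assumes the comparison is made on the median per-trip reward for each world. The function name and all numbers are hypothetical.

```python
import numpy as np


def passes_benchmark(trip_rewards, benchmark_median, factor=1.5):
    """True if the median per-trip reward is at least factor x the benchmark."""
    return bool(np.median(trip_rewards) >= factor * benchmark_median)


benchmark = -20.0                            # hypothetical benchmark median
good = [-40.0, -25.0, -24.0, -22.0, -30.0]   # median is -25
poor = [-60.0, -35.0, -34.0, -33.0, -50.0]   # median is -35

good_passes = passes_benchmark(good, benchmark)   # -25 >= -30
poor_passes = passes_benchmark(poor, benchmark)   # -35 >= -30 is false
```

With a benchmark median of -20, the threshold is -30, so the first learner passes and the second does not.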

Required, Allowed & Prohibited

Required:

  • Your project must be coded in Python 2.7.x.
  • Your code must run on one of the university-provided computers (e.g.
  • All code required to run the learner must be submitted. We will not debug your code.
  • All code in must be written by you.

Allowed:

  • You can develop your code on your personal machine, but it must also run successfully on one of the university-provided machines or virtual images.
  • Your code may use standard Python libraries.
  • You may use the NumPy, SciPy, matplotlib and Pandas libraries. Be sure you are using the correct versions.
  • You may reuse sections of code (up to 5 lines) that you collected from other students or the internet.
  • You may use code provided by the instructor, or allowed by the instructor to be shared.
  • You may reuse code from the internet that you include as support files (it must be credited and cited).

Prohibited:

  • Any libraries not listed in the “allowed” section above.