Project 7: Q-Learning Robot Documentation

QLearner.py

class QLearner.QLearner(num_states=100, num_actions=4, alpha=0.2, gamma=0.9, rar=0.5, radr=0.99, dyna=0, verbose=False)

This is a Q learner object.

Parameters

num_states (int) – The number of states to consider.
num_actions (int) – The number of actions available.
alpha (float) – The learning rate used in the update rule. Should range between 0.0 and 1.0 with 0.2 as a typical value.
gamma (float) – The discount rate used in the update rule. Should range between 0.0 and 1.0 with 0.9 as a typical value.
rar (float) – Random action rate: the probability of selecting a random action at each step. Should range between 0.0 (no random actions) and 1.0 (always random action) with 0.5 as a typical value.
radr (float) – Random action decay rate: after each update, rar = rar * radr. Ranges between 0.0 (immediate decay to 0) and 1.0 (no decay). Typically 0.99.
dyna (int) – The number of Dyna updates for each regular update. When Dyna is used, 200 is a typical value.
verbose (bool) – If “verbose” is True, your code can print out information for debugging.
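
The decay rule for rar can be checked with a few lines of arithmetic. The sketch below only illustrates the formula rar = rar * radr given above; the helper name rar_after is hypothetical and is not part of QLearner.py.

```python
def rar_after(rar, radr, num_updates):
    """Return the random action rate after num_updates decays (hypothetical helper)."""
    for _ in range(num_updates):
        rar = rar * radr  # one decay per update, as in the parameter description
    return rar


# With the defaults rar=0.5 and radr=0.99, the rate shrinks geometrically.
rar_100 = rar_after(0.5, 0.99, 100)
print(round(rar_100, 3))  # about 0.183
```

With these defaults, roughly 18% of actions are still random after 100 updates, so exploration fades gradually rather than stopping at once.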

query(s_prime, r)

Update the Q-table and return an action

Parameters

s_prime (int) – The new state
r (float) – The reward received for the previous action

Returns

The selected action

Return type

int
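
The update step inside query can be sketched as follows, assuming the standard tabular Q-learning update with alpha as the learning rate and gamma as the discount rate. The function name q_update and the use of a NumPy array for the table are assumptions made for illustration, not the required implementation.

```python
import numpy as np


def q_update(Q, s, a, s_prime, r, alpha=0.2, gamma=0.9):
    """Apply one tabular Q-learning update in place (illustrative helper)."""
    best_next = np.max(Q[s_prime])  # value of the greedy action in the new state
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * best_next)
    return Q


Q = np.zeros((100, 4))  # num_states x num_actions, matching the defaults
Q[1, 2] = 10.0          # pretend state 1 already has a valuable action
q_update(Q, s=0, a=3, s_prime=1, r=-1.0)
# Q[0, 3] is now approximately 0.8 * 0 + 0.2 * (-1 + 0.9 * 10) = 1.6
```

Here s and a stand for the state and action remembered from the previous call, which is why query itself only needs s_prime and r.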

querysetstate(s)

Update the state without updating the Q-table
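
A method like querysetstate could record the new state and pick an action without writing to the table. The sketch below shows one possible action-selection step, assuming rar is used as the probability of a random action (as described for the constructor). The name choose_action and the rng argument are hypothetical.

```python
import numpy as np


def choose_action(Q, s, rar, rng):
    """Pick a random action with probability rar, else the greedy one.

    Reads the Q-table but never modifies it.
    """
    if rng.random() < rar:
        return int(rng.integers(Q.shape[1]))  # explore: uniform random action
    return int(np.argmax(Q[s]))               # exploit: best known action


rng = np.random.default_rng(0)
Q = np.zeros((100, 4))
Q[5, 2] = 1.0
action = choose_action(Q, s=5, rar=0.0, rng=rng)  # rar=0.0 is always greedy
```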

discretize(pos)

Convert the location to a single integer
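
One common way to turn a location into a single integer is row-major indexing. The sketch below assumes pos is a (row, column) pair and that the map is 10 columns wide, which would be consistent with the default num_states=100; both are assumptions, and the name discretize_sketch is hypothetical.

```python
def discretize_sketch(pos, num_cols=10):
    """Map a (row, col) pair to one integer using row-major order (assumed layout)."""
    row, col = pos
    return row * num_cols + col


state = discretize_sketch((3, 4))  # row 3, column 4 on a 10-column map
```

Under these assumptions every cell of a 10 x 10 map gets a distinct state number between 0 and 99.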

getgoalpos(data)

Find where the goal is in the map

getrobotpos(data)

Find where the robot is in the map
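
Both getgoalpos and getrobotpos amount to searching the map for a cell with a particular value. A minimal sketch, assuming the map is a 2D NumPy array of integer codes: the specific codes (2 for the robot, 3 for the goal) and the helper name find_cell are assumptions for illustration only.

```python
import numpy as np

ROBOT, GOAL = 2, 3  # hypothetical cell codes; check the actual map format


def find_cell(data, code):
    """Return the (row, col) of the first cell equal to code."""
    matches = np.argwhere(data == code)
    if len(matches) == 0:
        raise ValueError(f"no cell with code {code} in the map")
    row, col = matches[0]
    return int(row), int(col)


grid = np.array([[0, 0, 3],
                 [2, 0, 0]])
robot_pos = find_cell(grid, ROBOT)
goal_pos = find_cell(grid, GOAL)
```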

movebot(data, oldpos, a)

Move the robot and report the reward

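Parameters

data (array) – 2D array that stores the map
oldpos (tuple(int, int)) – the position of the robot before the move
a (int) – the action to take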

Returns

The new position of the robot and the reward

Return type

tuple(int, int), int
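
A move-and-reward step of this kind can be sketched as below. Everything specific here is an assumption made for illustration: the action encoding (0 north, 1 east, 2 south, 3 west), the cell codes, the reward values, and the rule that a blocked move leaves the robot in place. The name movebot_sketch is hypothetical.

```python
import numpy as np

# Hypothetical encodings, chosen only for this sketch.
MOVES = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}  # north, east, south, west
OBSTACLE, GOAL = 1, 3
STEP_REWARD, GOAL_REWARD = -1, 1


def movebot_sketch(data, oldpos, a):
    """Return (newpos, reward) for taking action a from oldpos."""
    d_row, d_col = MOVES[a]
    row, col = oldpos[0] + d_row, oldpos[1] + d_col
    inside = 0 <= row < data.shape[0] and 0 <= col < data.shape[1]
    if not inside or data[row, col] == OBSTACLE:
        row, col = oldpos  # blocked moves leave the robot where it was
    reward = GOAL_REWARD if data[row, col] == GOAL else STEP_REWARD
    return (row, col), reward


grid = np.array([[0, 1],
                 [0, 3]])
newpos, reward = movebot_sketch(grid, (1, 0), 1)  # step east onto the goal
```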

printmap(data)

Print out the map

test(map, epochs, learner, verbose)

Function to test the code

Parameters

map (array) – 2D array that stores the map
epochs (int) – each epoch involves one trip to the goal
learner (QLearner) – the qlearner object
verbose (bool) – If “verbose” is True, your code can print out information for debugging. If verbose is False, your code should not generate ANY output. When we test your code, verbose will be False.

Returns

The total reward

Return type

np.float64
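
The way querysetstate and query fit into an epoch loop can be sketched with a toy environment. This is not the project's test function: the 1-D corridor, the reward values, the step cap, and the StubLearner class (a stand-in that exposes the same two method names but learns nothing) are all assumptions made for illustration.

```python
import random


class StubLearner:
    """Hypothetical stand-in with the same two method names as QLearner."""

    def __init__(self, num_actions=2, seed=0):
        self.num_actions = num_actions
        self.rng = random.Random(seed)

    def querysetstate(self, s):
        return self.rng.randrange(self.num_actions)

    def query(self, s_prime, r):
        return self.rng.randrange(self.num_actions)


def run_epochs(length, epochs, learner, max_steps=1000):
    """Run trips along a 1-D corridor (toy environment) and sum the rewards."""
    total_reward = 0.0
    for _ in range(epochs):
        pos = 0
        action = learner.querysetstate(pos)  # set the start state, no learning
        for _ in range(max_steps):
            # action 1 moves right, anything else moves left; walls clamp the move
            pos = max(0, min(length - 1, pos + (1 if action == 1 else -1)))
            r = 1.0 if pos == length - 1 else -1.0
            total_reward += r
            action = learner.query(pos, r)  # learner sees the reward, picks next action
            if pos == length - 1:
                break  # one epoch is one trip to the goal
    return total_reward


total = run_epochs(length=5, epochs=3, learner=StubLearner())
```

The pattern to note is that querysetstate is called once at the start of each trip, and query is called after every move with the new state and the reward just received.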