lspi.policy module

LSPI Policy class used for learning and executing policy.

class lspi.policy.Policy(basis, discount=1.0, explore=0.0, weights=None, tie_breaking_strategy=2)[source]

Bases: object

Represents LSPI policy. Used for sampling, learning, and executing.

The policy class includes an exploration value which controls the probability of performing a random action instead of the best action according to the policy. This can be useful during sampling.

It also includes the discount factor \(\gamma\), the number of possible actions, and the basis function used for this policy.

Parameters:
  • basis (BasisFunction) – The basis function used to compute \(\phi\), which is used to select the best action according to the policy.
  • discount (float, optional) – The discount factor \(\gamma\). Defaults to 1.0 which is valid for finite horizon problems.
  • explore (float, optional) – Probability of executing a random action instead of the best action according to the policy. Defaults to 0, which means no exploration.
  • weights (numpy.array or None) – The weight vector which is dotted with the \(\phi\) vector from basis to produce the approximate Q value. When None is passed in, the weight vector is initialized with random weights.
  • tie_breaking_strategy (Policy.TieBreakingStrategy value) – The strategy to use if a tie occurs when selecting the best action. See the lspi.policy.Policy.TieBreakingStrategy class description for what the different options are.
Raises:
  • ValueError – If discount is < 0 or > 1
  • ValueError – If explore is < 0 or > 1
  • ValueError – If weights are not None and the number of dimensions does not match the size of the basis function.
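
A minimal construction sketch, assuming the FakeBasis helper from lspi.basis_functions; any BasisFunction subclass can stand in for it:

    from lspi.basis_functions import FakeBasis  # assumed helper; any BasisFunction works
    from lspi.policy import Policy

    policy = Policy(FakeBasis(3),   # basis built for 3 possible actions
                    discount=0.9,   # gamma in [0, 1]
                    explore=0.1,    # 10% chance of a random action in select_action
                    weights=None)   # None -> weight vector initialized randomly
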
class TieBreakingStrategy[source]

Bases: object

Strategy for breaking a tie between actions in the policy.

FirstWins:
In the event of a tie, the first action encountered with that value is returned.
LastWins:
In the event of a tie, the last action encountered with that value is returned.
RandomWins:
In the event of a tie, a random action among those with that value is returned.
FirstWins = 0
LastWins = 1
RandomWins = 2
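
A short sketch of overriding the default tie-breaking behaviour at construction time; FakeBasis is an assumed stand-in for any BasisFunction:

    from lspi.basis_functions import FakeBasis  # assumed basis helper
    from lspi.policy import Policy

    # Return the lowest action index whenever several actions tie for the maximal Q value.
    policy = Policy(FakeBasis(3),
                    tie_breaking_strategy=Policy.TieBreakingStrategy.FirstWins)
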
Policy.__copy__()[source]

Return a copy of this policy with a deep copy of the weights.
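
A small illustration of the copy semantics; the weights attribute name is assumed from the constructor parameter, and FakeBasis is an assumed stand-in basis:

    import copy

    from lspi.basis_functions import FakeBasis  # assumed basis helper
    from lspi.policy import Policy

    policy = Policy(FakeBasis(3))
    policy_copy = copy.copy(policy)                      # invokes Policy.__copy__
    policy_copy.weights[0] += 1.0                        # weights were deep copied...
    assert policy.weights[0] != policy_copy.weights[0]   # ...so the original is unchanged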

Policy.best_action(state)[source]

Select the best action according to the policy.

This calculates argmax_a Q(state, a). In other words, it returns the action that maximizes the Q value for this state.

Parameters:
  • state (numpy.array) – State vector.
  • tie_breaking_strategy (TieBreakingStrategy value) – In the event of a tie, specifies which action the policy should return. (Defaults to random.)
Returns:

Action index

Return type:

int

Raises:

ValueError – If state’s dimensions do not match the basis function’s expectations.
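
A hedged usage sketch; the one-element state vector and the FakeBasis basis are assumptions, and the state length must match whatever basis you actually use:

    import numpy as np

    from lspi.basis_functions import FakeBasis  # assumed basis helper
    from lspi.policy import Policy

    policy = Policy(FakeBasis(3))
    state = np.array([0.5])              # length must satisfy the basis function
    action = policy.best_action(state)   # argmax over Q(state, a), ties broken per strategy
    assert 0 <= action < policy.num_actions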

Policy.calc_q_value(state, action)[source]

Calculate the Q function for the given state action pair.

Parameters:
  • state (numpy.array) – State vector that the Q value is being calculated for. This is the s in Q(s, a).
  • action (int) – Action index that the Q value is being calculated for. This is the a in Q(s, a).
Returns:

The Q value for the state action pair

Return type:

float

Raises:
  • ValueError – If state’s dimensions do not conform to the basis function’s expectations.
  • ValueError – If action is outside the range of valid action indexes.
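
Conceptually, the approximate Q value is the policy weight vector dotted with the basis features \(\phi(s, a)\). The sketch below restates that relationship; the weights and basis attribute names and the basis.evaluate call are assumptions drawn from the constructor documentation, and FakeBasis is an assumed stand-in basis:

    import numpy as np

    from lspi.basis_functions import FakeBasis  # assumed basis helper
    from lspi.policy import Policy

    policy = Policy(FakeBasis(3))
    state = np.array([0.5])

    q = policy.calc_q_value(state, 1)

    # Roughly the documented computation: Q(s, a) ~= weights . phi(s, a)
    q_manual = policy.weights.dot(policy.basis.evaluate(state, 1))
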
Policy.num_actions

Return number of possible actions.

This number should always match the value stored in basis.num_actions.

Returns: Number of possible actions, in the range [1, \(\infty\)).
Return type: int
Policy.select_action(state)[source]

Select either the best action or a random action, depending on the exploration probability.

If a uniformly drawn random number is below the explore value, a random action is picked; otherwise the best action according to the basis and policy weights is returned.

Parameters: state (numpy.array) – State vector.
Returns: Action index.
Return type: int
Raises: ValueError – If state’s dimensions do not match the basis function’s expectations.
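
The behaviour is epsilon-greedy. Below is a minimal restatement of the documented rule, not the library's exact implementation; the explore attribute name and the FakeBasis basis are assumptions:

    import random

    import numpy as np

    from lspi.basis_functions import FakeBasis  # assumed basis helper
    from lspi.policy import Policy

    policy = Policy(FakeBasis(3), explore=0.1)
    state = np.array([0.5])

    # Illustrative restatement of select_action's documented behaviour.
    if random.random() < policy.explore:                    # explore attribute assumed
        action = random.randint(0, policy.num_actions - 1)  # random action index
    else:
        action = policy.best_action(state)                  # greedy action

    # In practice, simply call:
    action = policy.select_action(state)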