lspi.lspi module¶
Contains main interface to LSPI algorithm.
-
lspi.lspi.
learn
(data, initial_policy, solver, epsilon=1e-05, max_iterations=10)[source]¶ Find the optimal policy for the specified data.
Parameters: - data – Generally a list of samples, however, the type of data does not matter so long as the specified solver can handle it in its solve routine. For example when doing model based learning one might pass in a model instead of sample data
- initial_policy (Policy) – Starting policy. A copy of this policy will be made at the start of the method. This means that the provided initial policy will be preserved.
- solver (Solver) – A subclass of the Solver abstract base class. This class must implement the solve method. Examples of solvers might be steepest descent or any other linear system of equation matrix solver. This is basically going to be implementations of the LSTDQ algorithm.
- epsilon (float) – The threshold of the change in policy weights. Determines if the policy has converged. When the L2-norm of the change in weights is less than this value the policy is considered converged
- max_iterations (int) – The maximum number of iterations to run before giving up on convergence. The change in policy weights are not guaranteed to ever go below epsilon. To prevent an infinite loop this parameter must be specified.
Returns: The converged policy. If the policy does not converge by max_iterations then this will be the last iteration’s policy.
Return type: Raises: ValueError
– If epsilon is <= 0ValueError
– If max_iteration <= 0