lspi.domains module

Contains example domains that LSPI works on.
class lspi.domains.ChainDomain(num_states=10, reward_location=0, failure_probability=0.1)

    Bases: lspi.domains.Domain

    Chain domain from the LSPI paper.

    A very simple MDP, used to test LSPI methods and demonstrate the interface. The state space is a series of discrete nodes in a chain. There are two actions: Left and Right. These actions fail with a configurable probability. When an action fails, the opposite action is performed. In other words, if left is the action applied but it fails, the agent will actually move right (assuming it is not in the rightmost state).

    The default reward for any action in a state is 0. There are 2 special states that give a +1 reward for entering. The two special states can be configured to appear at the ends of the chain, in the middle, or in the middle of each half of the state space.

    Parameters:
        num_states (int) – Number of states in the chain. Must be at least 4. Defaults to 10 states.
        reward_location (ChainDomain.RewardLocation) – Location of the states with +1 rewards.
        failure_probability (float) – The probability that the applied action will fail. Must be in the range [0, 1].
class ChainDomain.RewardLocation

    Bases: object

    Location of states giving +1 reward in the chain.

    Ends:
        Rewards will be given at the ends of the chain.
    Middle:
        Rewards will be given at the middle two states of the chain.
    HalfMiddles:
        Rewards will be given at the middle two states of each half of the chain.

    Ends = 0
    Middle = 1
    HalfMiddles = 2
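    A minimal construction sketch showing how a reward location is selected, assuming the package is importable as lspi:

        >>> from lspi.domains import ChainDomain
        >>> domain = ChainDomain()  # defaults: 10 states, rewards at the ends (0), 10% failures
        >>> middle_domain = ChainDomain(num_states=20,
        ...                             reward_location=ChainDomain.RewardLocation.Middle,
        ...                             failure_probability=0.3)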
ChainDomain.action_name(action)

    Return string representation of actions.

    0:
        left
    1:
        right

    Returns:
        String representation of the action.
    Return type:
        str
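    For example (the output follows from the mapping above):

        >>> domain = ChainDomain()
        >>> [domain.action_name(a) for a in range(domain.num_actions())]
        ['left', 'right']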
ChainDomain.apply_action(action)

    Apply the action to the chain.

    If left is applied then the occupied state index will decrease by 1, unless the agent is already at index 0, in which case the state will not change.

    If right is applied then the occupied state index will increase by 1, unless the agent is already at index num_states-1, in which case the state will not change.

    The reward is determined by the reward location specified when constructing the domain.

    If failure_probability is greater than 0 then there is a chance for the left and right actions to fail. If the left action fails then the agent will move right. Similarly, if the right action fails then the agent will move left.

    Parameters:
        action (int) – Action index. Must be in the range [0, num_actions()).
    Returns:
        The sample for the applied action.
    Return type:
        sample.Sample
    Raises:
        ValueError – If the action index is outside of the range [0, num_actions()).
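    A short sketch of collecting transition samples under a uniform random policy, as would typically be fed to an LSPI learner; the reward attribute on the returned samples is an assumption based on the description above:

        >>> import numpy as np
        >>> domain = ChainDomain(num_states=10)
        >>> samples = []
        >>> for _ in range(1000):
        ...     action = np.random.randint(domain.num_actions())
        ...     samples.append(domain.apply_action(action))
        >>> samples[0].reward in (0, 1)  # assumed attribute; rewards are 0 or +1
        True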
ChainDomain.current_state()

    Return the current state of the domain.

    Returns:
        The current state as a 1D numpy vector of type int.
    Return type:
        numpy.array
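    For example, assuming the one-hot state encoding described under reset() below, the occupied chain position can be recovered with numpy.argmax:

        >>> import numpy as np
        >>> state = domain.current_state()
        >>> position = int(np.argmax(state))  # index of the single 1 entry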
ChainDomain.num_actions()

    Return number of actions.

    The chain domain has 2 actions.

    Returns:
        Number of actions.
    Return type:
        int
ChainDomain.reset(initial_state=None)

    Reset the domain to its initial state or to a specified state.

    If the state is unspecified then a random state will be generated, just as when constructing the domain from scratch.

    The state must be the same size as the original state. State values can be either 0 or 1. There must be one and only one location that contains a value of 1. Whatever numpy array type is used, it will be converted to an integer numpy array.

    Parameters:
        initial_state (numpy.array) – The state to set the simulator to. If None then set to a random state.
    Raises:
        ValueError – If the initial state's shape does not match (num_states, ). In other words, the initial state must be a 1D numpy array with the same length as the existing state.
        ValueError – If no part of the state has a value of 1, or if there are multiple parts of the state with a value of 1.
        ValueError – If there are values in the state other than 0 or 1.
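    A sketch of resetting to a specific position, following the one-hot constraints above (num_states=10 assumed to match the default constructor):

        >>> import numpy as np
        >>> initial = np.zeros(10)  # any numpy dtype; converted to int internally
        >>> initial[3] = 1          # exactly one entry may be 1
        >>> domain.reset(initial)
        >>> domain.reset()          # None: reset to a random state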
class lspi.domains.Domain

    Bases: object

    ABC for domains.

    Minimum interface for a reinforcement learning domain.
action_name(action)

    Return a string representation of the action.

    Parameters:
        action (int) – The action index. This number should be in the range [0, num_actions()).
    Returns:
        String representation of the action index.
    Return type:
        str
apply_action(action)

    Apply action and return a sample.

    Parameters:
        action (int) – The action index to apply. This should be a number in the range [0, num_actions()).
    Returns:
        Sample containing the previous state, the action applied, the received reward and the resulting state.
    Return type:
        sample.Sample
current_state()

    Return the current state of the domain.

    Returns:
        The current state of the environment expressed as a numpy array of the individual state variables.
    Return type:
        numpy.array
num_actions()

    Return number of possible actions for the given domain.

    Actions are indexed from 0 to num_actions - 1.

    Returns:
        Number of possible actions.
    Return type:
        int
reset(initial_state=None)

    Reset the simulator to initial conditions.

    Parameters:
        initial_state (numpy.array) – Optionally specify the state to reset to. If None then the domain should use its default initial set of states. The type will generally be a numpy.array, but a subclass may accept other types.
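To make the interface concrete, here is a minimal sketch of a custom domain. The TwoStateDomain name is invented for illustration, and the Sample constructor arguments (state, action, reward, next_state) are assumed from the apply_action description above rather than taken from the library source:

    import numpy as np

    from lspi.domains import Domain
    from lspi.sample import Sample  # assumed import path for sample.Sample

    class TwoStateDomain(Domain):
        """Hypothetical domain with two states; action 1 toggles, action 0 stays."""

        def __init__(self):
            self._state = np.array([0])

        def num_actions(self):
            return 2

        def current_state(self):
            return self._state.copy()

        def action_name(self, action):
            return ['stay', 'toggle'][action]

        def apply_action(self, action):
            if not 0 <= action < self.num_actions():
                raise ValueError('action index out of range')
            prev_state = self.current_state()
            if action == 1:
                self._state = 1 - self._state  # toggle between 0 and 1
            reward = 1.0 if self._state[0] == 1 else 0.0  # reward for occupying state 1
            return Sample(prev_state, action, reward, self.current_state())

        def reset(self, initial_state=None):
            if initial_state is None:
                initial_state = np.array([np.random.randint(2)])
            self._state = initial_state.astype(int)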