This class implements the basic DQN methodology, i.e. More...
Public Member Functions | |
def | __init__ (self, pytorch.nn.Module target_model, pytorch.nn.Module policy_model, pytorch.optim.Optimizer optimizer, Union[LRScheduler, None] lr_scheduler, LossFunction loss_function, float gamma, float epsilon, float min_epsilon, float epsilon_decay_rate, int epsilon_decay_frequency, int memory_buffer_size, int target_model_update_rate, int policy_model_update_rate, int backup_frequency, float lr_threshold, int batch_size, int num_actions, str save_path, int bootstrap_rounds=1, str device="cpu", Optional[Dict[str, Any]] prioritization_params=None, float force_terminal_state_selection_prob=0.0, float tau=1.0, Union[int, str] apply_norm=-1, Union[int, List[str]] apply_norm_to=-1, float eps_for_norm=5e-12, int p_for_norm=2, int dim_for_norm=0, Optional[float] max_grad_norm=None, float grad_norm_p=2.0) |
None | load (self, Optional[str] custom_name_suffix=None) |
This method loads the target_model, policy_model, optimizer, lr_scheduler and agent_states from the supplied save_path argument in the DQN Agent class' constructor (also called init). More... | |
int | policy (self, Union[ndarray, pytorch.Tensor, List[float]] state_current) |
The policy for the agent. More... | |
None | save (self, Optional[str] custom_name_suffix=None) |
This method saves the target_model, policy_model, optimizer, lr_scheduler and agent_states in the supplied save_path argument in the DQN Agent class' constructor (also called init). More... | |
int | train (self, Union[pytorch.Tensor, np.ndarray, List[Union[float, int]]] state_current, Union[pytorch.Tensor, np.ndarray, List[Union[float, int]]] state_next, Union[int, float] reward, Union[int, float] action, Union[bool, int] done, Optional[Union[pytorch.Tensor, np.ndarray, float]] priority=1.0, Optional[Union[pytorch.Tensor, np.ndarray, float]] probability=1.0, Optional[Union[pytorch.Tensor, np.ndarray, float]] weight=1.0) |
Public Member Functions inherited from rlpack.utils.base.agent.Agent | |
Dict[str, Any] | __getstate__ (self) |
To get the agent's current state (dict of attributes). More... | |
def | __init__ (self) |
The class initializer. More... | |
None | __setstate__ (self, Dict[str, Any] state) |
To load the agent's current state (dict of attributes). More... | |
None | load (self, *args, **kwargs) |
Load method for the agent. More... | |
Any | policy (self, *args, **kwargs) |
Policy method for the agent. More... | |
None | save (self, *args, **kwargs) |
Save method for the agent. More... | |
Any | train (self, *args, **kwargs) |
Training method for the agent. More... | |
Data Fields | |
apply_norm | |
The input apply_norm argument; indicating the normalisation to be used. More... | |
apply_norm_to | |
The input apply_norm_to argument; indicating the quantity to normalise. More... | |
backup_frequency | |
The input model backup frequency in terms of timesteps. More... | |
batch_size | |
The batch size to be used when training policy model. More... | |
bootstrap_rounds | |
The input bootstrap rounds. More... | |
device | |
The input device argument; indicating the device name. More... | |
dim_for_norm | |
The input dim_for_norm argument; indicating dimension along which we wish to normalise. More... | |
eps_for_norm | |
The input eps_for_norm argument; indicating epsilon to be used for normalisation. More... | |
epsilon | |
The input exploration factor. More... | |
epsilon_decay_frequency | |
The input epsilon decay frequency in terms of timesteps. More... | |
epsilon_decay_rate | |
The input epsilon decay rate. More... | |
force_terminal_state_selection_prob | |
The input force_terminal_state_selection_prob . More... | |
gamma | |
The input discounting factor. More... | |
grad_norm_p | |
The input grad_norm_p; indicating the p-value for p-normalisation used in gradient clipping. More... | |
loss_function | |
The input loss function. More... | |
lr_scheduler | |
The input optional LR Scheduler (this can be None). More... | |
lr_threshold | |
The input LR Threshold. More... | |
max_grad_norm | |
The input max_grad_norm; indicating the maximum gradient norm for gradient clipping. More... | |
memory | |
The instance of rlpack._C.memory.Memory used for Replay buffer. More... | |
memory_buffer_size | |
The input argument memory_buffer_size ; indicating the buffer size used. More... | |
min_epsilon | |
The input minimum exploration factor after decays. More... | |
num_actions | |
The input number of actions. More... | |
optimizer | |
The input optimizer wrapped with policy_model parameters. More... | |
p_for_norm | |
The input p_for_norm argument; indicating p-value for p-normalisation. More... | |
policy_model | |
The input policy model. More... | |
policy_model_update_rate | |
The input argument policy_model_update_rate ; indicating the update rate of policy model. More... | |
prioritization_params | |
The input prioritization parameters. More... | |
save_path | |
The input save path for backing up agent models. More... | |
step_counter | |
The step counter; counting the total timesteps done so far up to memory_buffer_size. More... | |
target_model | |
The input target model. More... | |
target_model_update_rate | |
The input argument target_model_update_rate ; indicating the update rate of target model. More... | |
tau | |
The input tau ; indicating the soft update used to update target_model parameters. More... | |
Data Fields inherited from rlpack.utils.base.agent.Agent | |
loss | |
The list of losses accumulated after each backward call. More... | |
save_path | |
The path to save agent states and models. More... | |
Private Member Functions | |
def | _anneal_alpha (self) |
def | _anneal_beta (self) |
None | _apply_prioritization_strategy (self, pytorch.Tensor td_value, pytorch.Tensor random_indices) |
Void protected method that applies the relevant prioritization strategy for the DQN. More... | |
None | _decay_epsilon (self) |
Protected method to decay epsilon. More... | |
None | _grad_mean_reduction (self) |
Performs mean reduction and assigns the policy model's parameters the mean-reduced gradients. More... | |
int | _infer_action (self, pytorch.Tensor state_current, bool call_from_policy=True) |
Helper method to support action inference from the policy model. More... | |
Tuple[ pytorch.Tensor, pytorch.Tensor, pytorch.Tensor, pytorch.Tensor, pytorch.Tensor, pytorch.Tensor, pytorch.Tensor, pytorch.Tensor, pytorch.Tensor,] | _load_random_experiences (self) |
This method loads random transitions from memory. More... | |
pytorch.Tensor | _temporal_difference (self, pytorch.Tensor rewards, pytorch.Tensor q_values, pytorch.Tensor dones) |
This method computes the temporal difference for given transitions. More... | |
None | _train_policy_model (self) |
Protected method of the class to train the policy model. More... | |
None | _update_target_model (self) |
Protected method of the class to update the target model. More... | |
Private Attributes | |
__prioritization_strategy_code | |
The prioritization strategy code. More... | |
_grad_accumulator | |
The list of gradients from each backward call. More... | |
_normalization | |
The normalisation tool to be used for agent. More... | |
This class implements the basic DQN methodology, i.e.
DQN without prioritization. This class also acts as a base class for other DQN variants, all of which override the method _apply_prioritization_strategy
to implement their prioritization strategy.
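A minimal usage sketch follows. It is illustrative only: the network architecture and hyperparameter values are assumptions, torch is imported directly (the documentation refers to it as pytorch), and standard PyTorch objects are assumed to be accepted for the model, optimizer, scheduler and loss arguments. The argument names follow the constructor signature documented below.

```python
# Illustrative sketch (not from the library's examples): constructing a DqnAgent for a
# hypothetical environment with a 4-dimensional state and 2 actions.
import torch  # referred to as `pytorch` throughout this documentation

from rlpack.dqn.dqn_agent import DqnAgent


class QNetwork(torch.nn.Module):
    """A small fully connected Q-network (illustrative)."""

    def __init__(self, num_states: int = 4, num_actions: int = 2):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(num_states, 64),
            torch.nn.ReLU(),
            torch.nn.Linear(64, num_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


policy_model = QNetwork()
target_model = QNetwork()
target_model.load_state_dict(policy_model.state_dict())  # start with identical weights

optimizer = torch.optim.Adam(policy_model.parameters(), lr=1e-3)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.99)
loss_function = torch.nn.HuberLoss()

agent = DqnAgent(
    target_model=target_model,
    policy_model=policy_model,
    optimizer=optimizer,
    lr_scheduler=lr_scheduler,
    loss_function=loss_function,
    gamma=0.99,
    epsilon=1.0,
    min_epsilon=0.01,
    epsilon_decay_rate=0.995,
    epsilon_decay_frequency=64,
    memory_buffer_size=16384,
    target_model_update_rate=128,
    policy_model_update_rate=4,
    backup_frequency=1000,
    lr_threshold=1e-5,
    batch_size=64,
    num_actions=2,
    save_path="./models",
)
```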
def rlpack.dqn.dqn_agent.DqnAgent.__init__ | ( | self, | |
pytorch.nn.Module | target_model, | ||
pytorch.nn.Module | policy_model, | ||
pytorch.optim.Optimizer | optimizer, | ||
Union[LRScheduler, None] | lr_scheduler, | ||
LossFunction | loss_function, | ||
float | gamma, | ||
float | epsilon, | ||
float | min_epsilon, | ||
float | epsilon_decay_rate, | ||
int | epsilon_decay_frequency, | ||
int | memory_buffer_size, | ||
int | target_model_update_rate, | ||
int | policy_model_update_rate, | ||
int | backup_frequency, | ||
float | lr_threshold, | ||
int | batch_size, | ||
int | num_actions, | ||
str | save_path, | ||
int | bootstrap_rounds = 1, | ||
str | device = "cpu", | ||
Optional[Dict[str, Any]] | prioritization_params = None, | ||
float | force_terminal_state_selection_prob = 0.0, | ||
float | tau = 1.0, | ||
Union[int, str] | apply_norm = -1, | ||
Union[int, List[str]] | apply_norm_to = -1, | ||
float | eps_for_norm = 5e-12, | ||
int | p_for_norm = 2, | ||
int | dim_for_norm = 0, | ||
Optional[float] | max_grad_norm = None, | ||
float | grad_norm_p = 2.0 | ||
) |
target_model | nn.Module: The target network for the DQN model. This is the network whose weights are frozen. |
policy_model | nn.Module: The policy network for DQN model. This is the network which is trained. |
optimizer | optim.Optimizer: The optimizer wrapped with policy model's parameters. |
lr_scheduler | Union[LRScheduler, None]: The PyTorch LR Scheduler with wrapped optimizer. |
loss_function | LossFunction: The loss function from PyTorch's nn module. Initialized instance must be passed. |
gamma | float: The gamma value for agent. |
epsilon | float: The initial epsilon for the agent. |
min_epsilon | float: The minimum epsilon for the agent. Once this value is reached, it is maintained for all further episodes. |
epsilon_decay_rate | float: The decay multiplier to decay the epsilon. |
epsilon_decay_frequency | int: The number of timesteps after which the epsilon is decayed. |
memory_buffer_size | int: The buffer size of memory; or replay buffer for DQN. |
target_model_update_rate | int: The timesteps after which the target model's weights are updated with the policy model's weights (weights are weighted as per tau; see below). |
policy_model_update_rate | int: The timesteps after which policy model is trained. This involves backpropagation through the policy network. |
backup_frequency | int: The timesteps after which models are backed up. This will also save optimizer, lr_scheduler and agent_states (epsilon at the time of saving, and memory). |
lr_threshold | float: The LR threshold; once reached, the LR scheduler is not called further. |
batch_size | int: The batch size used for inference through target_model and training through policy_model. |
num_actions | int: Number of actions for the environment. |
save_path | str: The save path for models (target_model and policy_model), optimizer, lr_scheduler and agent_states. |
bootstrap_rounds | int: The number of rounds for which gradients are accumulated before calling the optimizer step. Gradients are mean-reduced for bootstrap_rounds > 1. Default: 1. |
device | str: The device on which models are run. Default: "cpu". |
prioritization_params | Optional[Dict[str, Any]]: The parameters for prioritization in prioritized memory (or replay buffer). Default: None. |
force_terminal_state_selection_prob | float: The probability for forcefully selecting a terminal state in a batch. Default: 0.0. |
tau | float: The weight for the soft update of weights from policy_model to target_model, done by the formula target_weight = tau * policy_weight + (1 - tau) * target_weight. Default: 1.0. |
apply_norm | Union[int, str]: The code to select the normalization procedure to be applied on selected quantities (selected by apply_norm_to; see below). A string can also be passed directly as per the accepted keys. Refer to the Notes below for the accepted values. Default: -1. |
apply_norm_to | Union[int, List[str]]: The code to select the quantities to which normalization is to be applied. A list of quantities can also be passed directly as per the accepted keys. Refer to the Notes below for the accepted values. Default: -1. |
eps_for_norm | float: Epsilon value for normalization (for numerical stability). Used for min-max normalization and standardized normalization. Default: 5e-12. |
p_for_norm | int: The p value for p-normalization. Default: 2 (L2 norm). |
dim_for_norm | int: The dimension across which normalization is to be performed. Default: 0. |
max_grad_norm | Optional[float]: The max norm of gradients for gradient clipping. Default: None. |
grad_norm_p | float: The p-value for p-normalization of gradients. Default: 2.0. |
Notes
The accepted values for apply_norm (when passed directly as strings) are:
- "none": no normalization.
- "min_max": min-max normalization.
- "standardize": standardization.
- "p_norm": p-normalization.
The accepted values for apply_norm_to (when passed directly as lists of quantity names) are:
- ["none"]
- ["states"]
- ["rewards"]
- ["td"]
- ["states", "rewards"]
- ["states", "td"]
If a valid max_grad_norm is passed, gradient clipping takes place; otherwise the gradient clipping step is skipped. If the max_grad_norm value is invalid, an error will be raised from PyTorch (see the sketch below).
Reimplemented from rlpack.utils.base.agent.Agent.
Reimplemented in rlpack.dqn.dqn_proportional_prioritization_agent.DqnProportionalPrioritizationAgent, and rlpack.dqn.dqn_rank_based_prioritization_agent.DqnRankBasedPrioritizationAgent.
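The clipping described in the Notes is conceptually equivalent to PyTorch's clip_grad_norm_ utility; a minimal sketch, assuming the agent's policy_model, max_grad_norm and grad_norm_p are in scope (the library's internal call may differ):

```python
# Conceptual sketch of the gradient clipping described in the Notes; the library's
# internals may differ. Assumes policy_model, max_grad_norm and grad_norm_p are in scope.
import torch

if max_grad_norm is not None:
    # Clip the policy model's gradients to max_grad_norm, measured with the p-norm
    # given by grad_norm_p, before the optimizer step.
    torch.nn.utils.clip_grad_norm_(
        policy_model.parameters(), max_norm=max_grad_norm, norm_type=grad_norm_p
    )
```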
def rlpack.dqn.dqn_agent.DqnAgent._anneal_alpha | ( | self | ) |
private |
def rlpack.dqn.dqn_agent.DqnAgent._anneal_beta | ( | self | ) |
private |
None rlpack.dqn.dqn_agent.DqnAgent._apply_prioritization_strategy | ( | self, | |
pytorch.Tensor | td_value, | ||
pytorch.Tensor | random_indices | ||
) |
private |
Void protected method that applies the relevant prioritization strategy for the DQN.
td_value | pytorch.Tensor: The computed TD value. |
random_indices | pytorch.Tensor: The indices of the randomly sampled transitions. |
Reimplemented in rlpack.dqn.dqn_proportional_prioritization_agent.DqnProportionalPrioritizationAgent, and rlpack.dqn.dqn_rank_based_prioritization_agent.DqnRankBasedPrioritizationAgent.
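The prioritized variants listed above supply their own implementation of this hook. Below is a purely illustrative sketch of what such an override could look like; the memory.update_priorities call is a hypothetical helper used only for this sketch, not a documented RLPack API:

```python
# Purely illustrative: a proportional-prioritization style override of the hook.
# `self.memory.update_priorities` is a hypothetical helper used only for this sketch.
import torch

from rlpack.dqn.dqn_agent import DqnAgent


class MyPrioritizedDqnAgent(DqnAgent):
    def _apply_prioritization_strategy(
        self, td_value: torch.Tensor, random_indices: torch.Tensor
    ) -> None:
        # Priority proportional to the magnitude of the TD error (plus a small
        # constant so no transition ends up with zero sampling probability).
        new_priorities = td_value.abs().flatten() + 1e-6
        self.memory.update_priorities(random_indices, new_priorities)  # hypothetical API
```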
None rlpack.dqn.dqn_agent.DqnAgent._decay_epsilon | ( | self | ) |
private |
Protected method to decay epsilon.
This method is called every epsilon_decay_frequency
timesteps and decays the epsilon by epsilon_decay_rate
, both supplied in DqnAgent class' constructor.
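A minimal sketch of the decay described above, assuming a multiplicative decay clipped at min_epsilon (the library's exact rule may differ); the variables correspond to the constructor arguments and agent attributes:

```python
# Sketch of the epsilon decay described above (assumed multiplicative form, clipped
# at min_epsilon; not necessarily the library's exact code).
if step_counter % epsilon_decay_frequency == 0:
    epsilon = max(min_epsilon, epsilon * epsilon_decay_rate)
```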
None rlpack.dqn.dqn_agent.DqnAgent._grad_mean_reduction | ( | self | ) |
private |
Performs mean reduction and assigns the policy model's parameters the mean-reduced gradients.
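A sketch of the mean reduction performed over the gradients accumulated across bootstrap_rounds backward calls; here grad_accumulator is assumed to be a plain list of per-round gradient lists rather than the library's GradAccumulator object:

```python
# Illustrative sketch: average the gradients collected over `bootstrap_rounds`
# backward calls and write the mean back into the policy model's parameters.
# `grad_accumulator` is assumed to be a list of per-round lists of gradient tensors.
import torch


def grad_mean_reduction(policy_model: torch.nn.Module, grad_accumulator: list) -> None:
    for param_index, param in enumerate(policy_model.parameters()):
        # Stack this parameter's gradient from every accumulated round and average.
        stacked = torch.stack([round_grads[param_index] for round_grads in grad_accumulator])
        param.grad = stacked.mean(dim=0)
```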
int rlpack.dqn.dqn_agent.DqnAgent._infer_action | ( | self, | |
pytorch.Tensor | state_current, | ||
bool | call_from_policy = True | ||
) |
private |
Helper method to support action inference from the policy model.
state_current | pytorch.Tensor: The current state of the agent in the environment. |
call_from_policy | bool: The flag indicating whether the method is being called from the DqnAgent.policy method or not. Default: True. |
Tuple[pytorch.Tensor, pytorch.Tensor, pytorch.Tensor, pytorch.Tensor, pytorch.Tensor, pytorch.Tensor, pytorch.Tensor, pytorch.Tensor, pytorch.Tensor] rlpack.dqn.dqn_agent.DqnAgent._load_random_experiences | ( | self | ) |
private |
This method loads random transitions from memory.
Each batch may also include forced terminal states if force_terminal_state_selection_prob > 0 was supplied in the DqnAgent constructor; e.g. if force_terminal_state_selection_prob = 0.1, approximately 1 in 10 batches will have at least one terminal state forced by the loader.
pytorch.Tensor rlpack.dqn.dqn_agent.DqnAgent._temporal_difference | ( | self, | |
pytorch.Tensor | rewards, | ||
pytorch.Tensor | q_values, | ||
pytorch.Tensor | dones | ||
) |
private |
This method computes the temporal difference for given transitions.
rewards | pytorch.Tensor: The sampled batch of rewards. |
q_values | pytorch.Tensor: The q-values inferred from target_model. |
dones | pytorch.Tensor: The done values for each transition in the batch. |
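For reference, the standard DQN temporal-difference target is sketched below, assuming q_values holds the target network's outputs for the next states; the library's exact implementation (including any normalization applied to the TD value) may differ:

```python
# Standard DQN TD target as a sketch:
#   td_target = reward + gamma * max_a Q_target(s', a) * (1 - done)
import torch


def temporal_difference(
    rewards: torch.Tensor, q_values: torch.Tensor, dones: torch.Tensor, gamma: float
) -> torch.Tensor:
    # Best next-state value according to the (frozen) target model.
    max_next_q = q_values.max(dim=-1).values
    # Terminal transitions (done == 1) contribute only their immediate reward.
    return rewards + gamma * max_next_q * (1.0 - dones)
```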
None rlpack.dqn.dqn_agent.DqnAgent._train_policy_model | ( | self | ) |
private |
Protected method of the class to train the policy model.
This method is called every policy_model_update_rate
timesteps supplied in the DqnAgent class constructor. This method will load random samples from memory (the number of samples depends on the batch_size
supplied in the DqnAgent constructor) and train the policy_model.
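An illustrative outline of one such update, reusing the names from the instantiation sketch near the top of this page; normalization, prioritization and gradient accumulation are omitted, and sample_batch is a hypothetical stand-in for drawing batch_size transitions from the replay memory:

```python
# Illustrative outline of a single policy-model update; `sample_batch` is a
# hypothetical stand-in for drawing `batch_size` transitions from the replay memory.
import torch

gamma = 0.99  # discount factor, as passed to the constructor
states, actions, rewards, next_states, dones = sample_batch()

# TD targets from the frozen target model.
with torch.no_grad():
    next_q_values = target_model(next_states)
td_targets = rewards + gamma * next_q_values.max(dim=-1).values * (1.0 - dones)

# Q-values predicted by the policy model for the actions actually taken.
q_predicted = policy_model(states).gather(1, actions.long().unsqueeze(-1)).squeeze(-1)
loss = loss_function(q_predicted, td_targets)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```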
None rlpack.dqn.dqn_agent.DqnAgent._update_target_model | ( | self | ) |
private |
Protected method of the class to update the target model.
This method is called every target_model_update_rate
timesteps supplied in the DqnAgent class constructor.
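The update follows the soft (Polyak) rule given by tau in the constructor, target_weight = tau * policy_weight + (1 - tau) * target_weight; a minimal sketch (the library's internals may differ):

```python
# Sketch of the soft target update:
#   target_weight = tau * policy_weight + (1 - tau) * target_weight
import torch


@torch.no_grad()
def update_target_model(
    target_model: torch.nn.Module, policy_model: torch.nn.Module, tau: float
) -> None:
    for target_param, policy_param in zip(
        target_model.parameters(), policy_model.parameters()
    ):
        # In-place weighted blend of the two parameter sets.
        target_param.mul_(1.0 - tau).add_(policy_param, alpha=tau)
```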
None rlpack.dqn.dqn_agent.DqnAgent.load | ( | self, | |
Optional[str] | custom_name_suffix = None | ||
) |
This method loads the target_model, policy_model, optimizer, lr_scheduler and agent_states from the supplied save_path
argument in the DQN Agent class' constructor (also called init).
custom_name_suffix | Optional[str]: If supplied, additional suffix is added to names of target_model, policy_model, optimizer and lr_scheduler. Useful to load the best model by a custom suffix supplied for evaluation. Default: None |
Reimplemented from rlpack.utils.base.agent.Agent.
int rlpack.dqn.dqn_agent.DqnAgent.policy | ( | self, | |
Union[ndarray, pytorch.Tensor, List[float]] | state_current | ||
) |
The policy for the agent.
This runs the inference on policy model with state_current
and uses q-values to obtain the best action.
state_current | Union[ndarray, pytorch.Tensor, List[float]]: The current state agent is in. |
Reimplemented from rlpack.utils.base.agent.Agent.
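A usage sketch of the policy inside an environment loop; env is a hypothetical Gym-style environment and is not part of RLPack:

```python
# Usage sketch: querying the agent's policy (env is a hypothetical Gym-style environment).
state_current = env.reset()
action = agent.policy(state_current)   # int action chosen from the policy model's q-values
state_next, reward, done, info = env.step(action)
```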
None rlpack.dqn.dqn_agent.DqnAgent.save | ( | self, | |
Optional[str] | custom_name_suffix = None | ||
) |
This method saves the target_model, policy_model, optimizer, lr_scheduler and agent_states in the supplied save_path
argument in the DQN Agent class' constructor (also called init).
agent_states includes current memory and epsilon values in a dictionary.
custom_name_suffix | Optional[str]: If supplied, additional suffix is added to names of target_model, policy_model, optimizer and lr_scheduler. Useful to save best model by a custom suffix supplied during a train run. Default: None |
Reimplemented from rlpack.utils.base.agent.Agent.
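A usage sketch of saving and later restoring the agent with a custom suffix (the suffix value is illustrative):

```python
# Usage sketch: back up the agent under a custom suffix, then restore it later.
agent.save(custom_name_suffix="_best")   # writes target_model, policy_model, optimizer,
                                         # lr_scheduler and agent_states to save_path
agent.load(custom_name_suffix="_best")   # restores the same objects from save_path
```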
int rlpack.dqn.dqn_agent.DqnAgent.train | ( | self, | |
Union[pytorch.Tensor, np.ndarray, List[Union[float, int]]] | state_current, | ||
Union[pytorch.Tensor, np.ndarray, List[Union[float, int]]] | state_next, | ||
Union[int, float] | reward, | ||
Union[int, float] | action, | ||
Union[bool, int] | done, | ||
Optional[Union[pytorch.Tensor, np.ndarray, float]] | priority = 1.0, | ||
Optional[Union[pytorch.Tensor, np.ndarray, float]] | probability = 1.0, | ||
Optional[Union[pytorch.Tensor, np.ndarray, float]] | weight = 1.0 | ||
) |
state_current | Union[pytorch.Tensor, np.ndarray, List[Union[float, int]]]: The current state in the environment. |
state_next | Union[pytorch.Tensor, np.ndarray, List[Union[float, int]]]: The next state returned by the environment. |
reward | Union[int, float]: Reward obtained by performing the action for the transition. |
action | Union[int, float]: Action taken for the transition. |
done | Union[bool, int]: Indicates whether the episode has terminated or not. |
priority | Optional[Union[pytorch.Tensor, np.ndarray, float]]: The priority of the transition (for prioritized replay memory). Default: 1.0 |
probability | Optional[Union[pytorch.Tensor, np.ndarray, float]]: The probability of the transition (for prioritized replay memory). Default: 1.0 |
weight | Optional[Union[pytorch.Tensor, np.ndarray, float]]: The importance sampling weight of the transition (for prioritized replay memory). Default: 1.0 |
Returns int: the action to be taken for state_next.
Reimplemented from rlpack.utils.base.agent.Agent.
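A usage sketch of a per-timestep training loop; env and num_episodes are hypothetical, and priority, probability and weight keep their defaults:

```python
# Usage sketch: per-timestep training loop (env and num_episodes are hypothetical).
for episode in range(num_episodes):
    state_current = env.reset()
    done = False
    while not done:
        action = agent.policy(state_current)
        state_next, reward, done, info = env.step(action)
        # Stores the transition, and trains/updates models at the configured rates.
        # train() also returns an int action (see the return description above),
        # which could be used for the next step instead of a separate policy() call.
        agent.train(state_current, state_next, reward, action, done)
        state_current = state_next
```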
rlpack.dqn.dqn_agent.DqnAgent.__prioritization_strategy_code |
private |
The prioritization strategy code.
rlpack.dqn.dqn_agent.DqnAgent._grad_accumulator |
private |
The list of gradients from each backward call.
This is only used when bootstrap_rounds > 1 and is cleared after each bootstrap round. It is the rlpack._C.grad_accumulator.GradAccumulator object used for gradient accumulation.
rlpack.dqn.dqn_agent.DqnAgent._normalization |
private |
The normalisation tool to be used for agent.
An instance of rlpack.utils.normalization.Normalization.
rlpack.dqn.dqn_agent.DqnAgent.apply_norm |
The input apply_norm
argument; indicating the normalisation to be used.
rlpack.dqn.dqn_agent.DqnAgent.apply_norm_to |
The input apply_norm_to
argument; indicating the quantity to normalise.
rlpack.dqn.dqn_agent.DqnAgent.backup_frequency |
The input model backup frequency in terms of timesteps.
rlpack.dqn.dqn_agent.DqnAgent.batch_size |
The batch size to be used when training policy model.
A corresponding number of samples is drawn from memory as per the prioritization strategy.
rlpack.dqn.dqn_agent.DqnAgent.bootstrap_rounds |
The input bootstrap rounds.
rlpack.dqn.dqn_agent.DqnAgent.device |
The input device
argument; indicating the device name.
rlpack.dqn.dqn_agent.DqnAgent.dim_for_norm |
The input dim_for_norm
argument; indicating dimension along which we wish to normalise.
rlpack.dqn.dqn_agent.DqnAgent.eps_for_norm |
The input eps_for_norm
argument; indicating epsilon to be used for normalisation.
rlpack.dqn.dqn_agent.DqnAgent.epsilon |
The input exploration factor.
rlpack.dqn.dqn_agent.DqnAgent.epsilon_decay_frequency |
The input epsilon decay frequency in terms of timesteps.
rlpack.dqn.dqn_agent.DqnAgent.epsilon_decay_rate |
The input epsilon decay rate.
rlpack.dqn.dqn_agent.DqnAgent.force_terminal_state_selection_prob |
The input force_terminal_state_selection_prob
.
This indicates the probability to force at least one terminal state sample in a batch.
rlpack.dqn.dqn_agent.DqnAgent.gamma |
The input discounting factor.
rlpack.dqn.dqn_agent.DqnAgent.grad_norm_p |
The input grad_norm_p
; indicating the p-value for p-normalisation used in gradient clipping.
rlpack.dqn.dqn_agent.DqnAgent.loss_function |
The input loss function.
rlpack.dqn.dqn_agent.DqnAgent.lr_scheduler |
The input optional LR Scheduler (this can be None).
rlpack.dqn.dqn_agent.DqnAgent.lr_threshold |
The input LR Threshold.
rlpack.dqn.dqn_agent.DqnAgent.max_grad_norm |
The input max_grad_norm
; indicating the maximum gradient norm for gradient clipping.
rlpack.dqn.dqn_agent.DqnAgent.memory |
The instance of rlpack._C.memory.Memory used for Replay buffer.
rlpack.dqn.dqn_agent.DqnAgent.memory_buffer_size |
The input argument memory_buffer_size
; indicating the buffer size used.
rlpack.dqn.dqn_agent.DqnAgent.min_epsilon |
The input minimum exploration factor after decays.
rlpack.dqn.dqn_agent.DqnAgent.num_actions |
The input number of actions.
rlpack.dqn.dqn_agent.DqnAgent.optimizer |
The input optimizer wrapped with policy_model parameters.
rlpack.dqn.dqn_agent.DqnAgent.p_for_norm |
The input p_for_norm
argument; indicating p-value for p-normalisation.
rlpack.dqn.dqn_agent.DqnAgent.policy_model |
The input policy model.
rlpack.dqn.dqn_agent.DqnAgent.policy_model_update_rate |
The input argument policy_model_update_rate
; indicating the update rate of policy model.
The optimizer step is called every policy_model_update_rate timesteps.
rlpack.dqn.dqn_agent.DqnAgent.prioritization_params |
The input prioritization parameters.
rlpack.dqn.dqn_agent.DqnAgent.save_path |
The input save path for backing up agent models.
rlpack.dqn.dqn_agent.DqnAgent.step_counter |
The step counter; counting the total timesteps done so far up to memory_buffer_size.
rlpack.dqn.dqn_agent.DqnAgent.target_model |
The input target model.
This model's parameters are frozen.
rlpack.dqn.dqn_agent.DqnAgent.target_model_update_rate |
The input argument target_model_update_rate
; indicating the update rate of target model.
A soft copy of parameters takes place from policy_model to target_model as per this update rate.
rlpack.dqn.dqn_agent.DqnAgent.tau |
The input tau
; indicating the soft update used to update target_model parameters.