This is a helper class that selects the correct variant of the DQN implementation based on the prioritization strategy determined by the argument prioritization_params.
Public Member Functions
def | __new__ (cls, pytorch.nn.Module target_model, pytorch.nn.Module policy_model, pytorch.optim.Optimizer optimizer, Union[LRScheduler, None] lr_scheduler, LossFunction loss_function, float gamma, float epsilon, float min_epsilon, float epsilon_decay_rate, int epsilon_decay_frequency, int memory_buffer_size, int target_model_update_rate, int policy_model_update_rate, int backup_frequency, float lr_threshold, int batch_size, int num_actions, str save_path, int bootstrap_rounds=1, str device="cpu", Optional[Dict[str, Any]] prioritization_params=None, float force_terminal_state_selection_prob=0.0, float tau=1.0, int apply_norm=-1, int apply_norm_to=-1, float eps_for_norm=5e-12, int p_for_norm=2, int dim_for_norm=0, Optional[float] max_grad_norm=None, float grad_norm_p=2.0) |
Static Private Member Functions
float | __anneal_alpha_default_fn (float alpha, float alpha_annealing_factor) |
Private method to anneal the alpha parameter for importance sampling weights.
float | __anneal_beta_default_fn (float beta, float beta_annealing_factor) |
Private method to anneal the beta parameter for importance sampling weights.
Dict[str, Any] | __process_prioritization_params (Dict[str, Any] prioritization_params, int prioritization_strategy_code, Callable[[float, float], float] anneal_alpha_default_fn, Callable[[float, float], float] anneal_beta_default_fn, int batch_size) |
Private method to process the prioritization parameters.
def rlpack.dqn.dqn.Dqn.__anneal_alpha_default_fn ( float alpha, float alpha_annealing_factor ) [static, private]
Private method to anneal the alpha parameter for importance sampling weights.
This will be called every alpha_annealing_frequency steps. alpha_annealing_frequency is a key to be passed in the dictionary prioritization_params argument of the DqnAgent class' constructor. This method is called by default to anneal alpha. If alpha_annealing_frequency is not passed in prioritization_params, the annealing of alpha will not take place. This method uses another value, alpha_annealing_factor, which must also be passed in prioritization_params. alpha_annealing_factor is typically below 1, so that alpha is slowly annealed towards 0 or min_alpha. A minimal sketch is given after the parameter list below.
alpha | float: The input alpha value to anneal. |
alpha_annealing_factor | float: The annealing factor to be used to anneal alpha. |
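Based on the description above, the default alpha annealing appears to simply scale alpha by the annealing factor on each call; the sketch below assumes multiplicative scaling, with any clamping to min_alpha left to the caller, and is not the verbatim rlpack implementation.

    def anneal_alpha_default_fn(alpha: float, alpha_annealing_factor: float) -> float:
        # Multiply alpha by a factor typically below 1 so that alpha is slowly
        # annealed towards 0 (or towards min_alpha, if the caller clamps the result).
        return alpha * alpha_annealing_factor

    # Hypothetical usage: invoked once every alpha_annealing_frequency steps.
    alpha = anneal_alpha_default_fn(alpha=0.6, alpha_annealing_factor=0.95)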
def rlpack.dqn.dqn.Dqn.__anneal_beta_default_fn ( float beta, float beta_annealing_factor ) [static, private]
Private method to anneal the beta parameter for importance sampling weights.
This will be called every beta_annealing_frequency steps. beta_annealing_frequency is a key to be passed in the dictionary prioritization_params argument of the DqnAgent class' constructor. If beta_annealing_frequency is not passed in prioritization_params, the annealing of beta will not take place. This method uses another value, beta_annealing_factor, which must also be passed in prioritization_params. beta_annealing_factor is typically above 1, so that beta is slowly annealed towards 1 or max_beta. A minimal sketch is given after the parameter list below.
beta | float: The input beta value to anneal. |
beta_annealing_factor | float: The annealing factor to be used to anneal beta. |
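Similarly, a minimal sketch of the default beta annealing, assuming simple multiplicative scaling with any clamping to max_beta left to the caller:

    def anneal_beta_default_fn(beta: float, beta_annealing_factor: float) -> float:
        # Multiply beta by a factor typically above 1 so that beta is slowly
        # annealed towards 1 (or towards max_beta, if the caller clamps the result).
        return beta * beta_annealing_factor

    # Hypothetical usage: invoked once every beta_annealing_frequency steps.
    beta = anneal_beta_default_fn(beta=0.4, beta_annealing_factor=1.05)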
def rlpack.dqn.dqn.Dqn.__new__ (
        cls,
        pytorch.nn.Module target_model,
        pytorch.nn.Module policy_model,
        pytorch.optim.Optimizer optimizer,
        Union[LRScheduler, None] lr_scheduler,
        LossFunction loss_function,
        float gamma,
        float epsilon,
        float min_epsilon,
        float epsilon_decay_rate,
        int epsilon_decay_frequency,
        int memory_buffer_size,
        int target_model_update_rate,
        int policy_model_update_rate,
        int backup_frequency,
        float lr_threshold,
        int batch_size,
        int num_actions,
        str save_path,
        int bootstrap_rounds = 1,
        str device = "cpu",
        Optional[Dict[str, Any]] prioritization_params = None,
        float force_terminal_state_selection_prob = 0.0,
        float tau = 1.0,
        int apply_norm = -1,
        int apply_norm_to = -1,
        float eps_for_norm = 5e-12,
        int p_for_norm = 2,
        int dim_for_norm = 0,
        Optional[float] max_grad_norm = None,
        float grad_norm_p = 2.0
)
target_model | nn.Module: The target network for DQN model. This is the network whose weights are frozen. |
policy_model | nn.Module: The policy network for DQN model. This is the network which is trained. |
optimizer | optim.Optimizer: The optimizer wrapped with policy model's parameters. |
lr_scheduler | Union[LRScheduler, None]: The PyTorch LR Scheduler with wrapped optimizer. |
loss_function | LossFunction: The loss function from PyTorch's nn module. Initialized instance must be passed. |
gamma | float: The gamma value for agent. |
epsilon | float: The initial epsilon for the agent. |
min_epsilon | float: The minimum epsilon for the agent. Once this value is reached, it is maintained for all further episodes. |
epsilon_decay_rate | float: The decay multiplier to decay the epsilon. |
epsilon_decay_frequency | int: The number of timesteps after which the epsilon is decayed. |
memory_buffer_size | int: The buffer size of memory (or replay buffer) for DQN. |
target_model_update_rate | int: The timesteps after which the target model's weights are updated with the policy model's weights (weights are weighted as per tau; see below). |
policy_model_update_rate | int: The timesteps after which policy model is trained. This involves backpropagation through the policy network. |
backup_frequency | int: The timesteps after which models are backed up. This will also save the optimizer, lr_scheduler and agent_states (epsilon at the time of saving, and memory). |
lr_threshold | float: The threshold LR; once it is reached, the LR scheduler is not called further. |
batch_size | int: The batch size used for inference through target_model and training through policy_model. |
num_actions | int: Number of actions for the environment. |
save_path | str: The save path for models: target_model and policy_model, optimizer, lr_scheduler and agent_states. |
bootstrap_rounds | int: The number of rounds for which gradients are accumulated before calling the optimizer step. Gradients are mean-reduced for bootstrap_rounds > 1. Default: 1. |
device | str: The device on which models are run. Default: "cpu". |
prioritization_params | Optional[Dict[str, Any]]: The parameters for prioritization in prioritized memory (or replay buffer). Default: None. |
force_terminal_state_selection_prob | float: The probability for forcefully selecting a terminal state in a batch. Default: 0.0. |
tau | float: The weight for the weighted update of weights from policy_model to target_model. This is done by the formula target_weight = tau * policy_weight + (1 - tau) * target_weight (see the sketch after this parameter list). Default: 1.0. |
apply_norm | Union[int, str]: The code to select the normalization procedure to be applied to selected quantities (selected by apply_norm_to; see below). A string can also be passed directly as per the accepted keys. Refer to the Notes below for the accepted values. Default: -1. |
apply_norm_to | Union[int, List[str]]: The code to select the quantities to which normalization is to be applied. A list of quantities can also be passed directly as per the accepted keys. Refer to the Notes below for the accepted values. Default: -1. |
eps_for_norm | float: Epsilon value for normalization (for numeric stability); used for min-max normalization and standardized normalization. Default: 5e-12. |
p_for_norm | int: The p value for p-normalization. Default: 2 (L2 norm). |
dim_for_norm | int: The dimension across which normalization is to be performed. Default: 0. |
max_grad_norm | Optional[float]: The maximum norm of gradients for gradient clipping. Default: None. |
grad_norm_p | float: The p value for p-normalization of gradients. Default: 2.0. |
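The tau-weighted update mentioned for target_model_update_rate and tau above is a standard soft update of the target network. The sketch below only illustrates the formula; the function name is hypothetical and not part of rlpack's API.

    import torch
    from torch import nn

    def soft_update(target_model: nn.Module, policy_model: nn.Module, tau: float) -> None:
        # Parameter-wise: target_weight = tau * policy_weight + (1 - tau) * target_weight.
        # With the default tau = 1.0 this reduces to a hard copy of the policy weights.
        with torch.no_grad():
            for target_param, policy_param in zip(
                target_model.parameters(), policy_model.parameters()
            ):
                target_param.copy_(tau * policy_param + (1.0 - tau) * target_param)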
Notes
For prioritization_params, when None (the default) is passed, prioritized memory is not used. To use prioritized memory, pass a dictionary with keys alpha and beta. You can also pass alpha_decay_rate and beta_decay_rate additionally. An example dictionary is sketched after the list below.
The codes for prioritization strategies are:
- uniform
- proportional
- rank-based
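An example prioritization_params dictionary, using only the keys mentioned on this page (example values are arbitrary; the exact set of accepted keys should be verified against the rlpack documentation):

    prioritization_params = {
        "alpha": 0.6,                       # mandatory for prioritized memory
        "beta": 0.4,                        # mandatory for prioritized memory
        "alpha_annealing_frequency": 1024,  # optional: anneal alpha every 1024 steps
        "alpha_annealing_factor": 0.95,     # optional: typically below 1
        "beta_annealing_frequency": 1024,   # optional: anneal beta every 1024 steps
        "beta_annealing_factor": 1.05,      # optional: typically above 1
    }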
The codes for apply_norm are given as follows:
- "none"
- "min_max"
- "standardize"
- "p_norm"

The codes for apply_norm_to are given as follows:
- ["none"]
- ["states"]
- ["rewards"]
- ["td"]
- ["states", "rewards"]
- ["states", "td"]
If a valid max_grad_norm is passed, gradient clipping takes place; otherwise the gradient clipping step is skipped. If the max_grad_norm value is invalid, an error will be raised from PyTorch.
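For orientation, a hedged sketch of constructing the agent is given below. The Q-network, import path and hyperparameter values are placeholders chosen only for illustration; consult the rlpack examples for the exact objects expected.

    import torch
    from torch import nn

    from rlpack.dqn.dqn import Dqn  # import path assumed from the fully-qualified name above

    # Minimal placeholder Q-network, used only to make the example self-contained.
    class QNetwork(nn.Module):
        def __init__(self, num_actions: int):
            super().__init__()
            self.layers = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, num_actions))

        def forward(self, x):
            return self.layers(x)

    policy_net = QNetwork(num_actions=2)
    target_net = QNetwork(num_actions=2)

    agent = Dqn(
        target_model=target_net,
        policy_model=policy_net,
        optimizer=torch.optim.Adam(policy_net.parameters(), lr=1e-3),
        lr_scheduler=None,
        loss_function=nn.MSELoss(),  # initialized instance, as required
        gamma=0.99,
        epsilon=1.0,
        min_epsilon=0.01,
        epsilon_decay_rate=0.99,
        epsilon_decay_frequency=1024,
        memory_buffer_size=32768,
        target_model_update_rate=1000,
        policy_model_update_rate=4,
        backup_frequency=10000,
        lr_threshold=1e-6,
        batch_size=64,
        num_actions=2,
        save_path="./models",
        prioritization_params={"alpha": 0.6, "beta": 0.4},  # None (default) keeps uniform replay
    )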
def rlpack.dqn.dqn.Dqn.__process_prioritization_params ( Dict[str, Any] prioritization_params, int prioritization_strategy_code, Callable[[float, float], float] anneal_alpha_default_fn, Callable[[float, float], float] anneal_beta_default_fn, int batch_size ) [static, private]
Private method to process the prioritization parameters.
This includes sanity checks and the loading of default values for mandatory parameters. An illustrative sketch follows the parameter list below.
prioritization_params | Dict[str, Any]: The prioritization parameters for when we use prioritized memory. |
prioritization_strategy_code | int: The prioritization code corresponding to the given prioritization strategy string. |
anneal_alpha_default_fn | Callable[[float, float], float]: The default annealing function for alpha. |
anneal_beta_default_fn | Callable[[float, float], float]: The default annealing function for beta. |
batch_size | int: The requested batch size; used in rank-based prioritization to determine the number of segments. |
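Purely as an illustration of the processing described above (this is not the rlpack implementation; the key names anneal_alpha_fn, anneal_beta_fn and num_segments, and the rank-based code, are hypothetical):

    from typing import Any, Callable, Dict

    RANK_BASED_CODE = 2  # hypothetical numeric code for the rank-based strategy

    def process_prioritization_params(
        prioritization_params: Dict[str, Any],
        prioritization_strategy_code: int,
        anneal_alpha_default_fn: Callable[[float, float], float],
        anneal_beta_default_fn: Callable[[float, float], float],
        batch_size: int,
    ) -> Dict[str, Any]:
        # Copy so the caller's dictionary is not mutated.
        processed = dict(prioritization_params)
        # Sanity check: mandatory keys for prioritized memory.
        for key in ("alpha", "beta"):
            if key not in processed:
                raise KeyError(f"prioritization_params must contain the key '{key}'")
        # Install default annealing functions when none were supplied.
        processed.setdefault("anneal_alpha_fn", anneal_alpha_default_fn)
        processed.setdefault("anneal_beta_fn", anneal_beta_default_fn)
        # Rank-based prioritization uses the batch size to determine the number of segments.
        if prioritization_strategy_code == RANK_BASED_CODE:
            processed.setdefault("num_segments", batch_size)
        return processed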