The A2C class implements the synchronous Actor-Critic method. More...
Public Member Functions | |
def | __init__ (self, pytorch.nn.Module policy_model, pytorch.optim.Optimizer optimizer, Union[LRScheduler, None] lr_scheduler, LossFunction loss_function, Distribution distribution, float gamma, float entropy_coefficient, float state_value_coefficient, float lr_threshold, Union[int, List[Union[int, List[int]]]] action_space, int backup_frequency, str save_path, int bootstrap_rounds=1, str device="cpu", Union[int, str] apply_norm=-1, Union[int, List[str]] apply_norm_to=-1, float eps_for_norm=5e-12, int p_for_norm=2, int dim_for_norm=0, Optional[float] max_grad_norm=None, float grad_norm_p=2.0, Optional[Tuple[float, Callable[[float, bool, int], float]]] variance=None) |
None | load (self, Optional[str] custom_name_suffix=None) |
This method loads the policy_model, optimizer, lr_scheduler and agent_states from the save_path argument supplied in the agent class' constructor (also called __init__). More... | |
Union[int, np.ndarray] | policy (self, Union[pytorch.Tensor, np.ndarray, List[Union[float, int]]] state_current, **kwargs) |
The policy method to evaluate the agent. More... | |
None | save (self, Optional[str] custom_name_suffix=None) |
This method saves the policy_model, optimizer, lr_scheduler and agent_states to the save_path argument supplied in the agent class' constructor (also called __init__). More... | |
Union[int, np.ndarray] | train (self, Union[pytorch.Tensor, np.ndarray, List[Union[float, int]]] state_current, Union[int, float] reward, Union[bool, int] done, **kwargs) |
The train method to train the agent and underlying policy model. More... | |
Public Member Functions inherited from rlpack.utils.base.agent.Agent | |
Dict[str, Any] | __getstate__ (self) |
To get the agent's current state (dict of attributes). More... | |
def | __init__ (self) |
The class initializer. More... | |
None | __setstate__ (self, Dict[str, Any] state) |
To load the agent's current state (dict of attributes). More... | |
None | load (self, *args, **kwargs) |
Load method for the agent. More... | |
Any | policy (self, *args, **kwargs) |
Policy method for the agent. More... | |
None | save (self, *args, **kwargs) |
Save method for the agent. More... | |
Any | train (self, *args, **kwargs) |
Training method for the agent. More... | |
Data Fields | |
action_log_probabilities | |
The list of log probabilities of the actions sampled from the action distribution at each timestep. More... | |
action_space | |
The input number of actions. More... | |
apply_norm | |
The input apply_norm argument; indicating the normalisation to be used. More... | |
apply_norm_to | |
The input apply_norm_to argument; indicating the quantity to normalise. More... | |
backup_frequency | |
The input model backup frequency in terms of timesteps. More... | |
bootstrap_rounds | |
The input bootstrap rounds. More... | |
device | |
The input device argument; indicating the device name. More... | |
dim_for_norm | |
The input dim_for_norm argument; indicating dimension along which we wish to normalise. More... | |
distribution | |
The input distribution object. More... | |
entropies | |
The list of entropies from each timestep. More... | |
entropy_coefficient | |
The input entropy coefficient. More... | |
eps_for_norm | |
The input eps_for_norm argument; indicating epsilon to be used for normalisation. More... | |
gamma | |
The input discounting factor. More... | |
grad_norm_p | |
The input grad_norm_p; indicating the p-value for p-normalisation used in gradient clipping. More... | |
is_continuous_action_space | |
Flag indicating if action space is continuous or discrete. More... | |
loss_function | |
The input loss function. More... | |
lr_scheduler | |
The input optional LR Scheduler (this can be None). More... | |
lr_threshold | |
The input LR Threshold. More... | |
max_grad_norm | |
The input max_grad_norm; indicating the maximum gradient norm for gradient clipping. More... | |
optimizer | |
The input optimizer wrapped with policy_model parameters. More... | |
p_for_norm | |
The input p_for_norm argument; indicating p-value for p-normalisation. More... | |
policy_model | |
The input policy model moved to desired device. More... | |
rewards | |
The list of rewards from each timestep. More... | |
save_path | |
The input save path for backing up agent models. More... | |
state_value_coefficient | |
The input state value coefficient. More... | |
states_current_values | |
The list of state values at each timestep. This is cleared after each episode. More... | |
step_counter | |
The step counter; counting the total timesteps done so far. More... | |
variance_decay_fn | |
The variance decay method. More... | |
variance_value | |
The current variance value. More... | |
Data Fields inherited from rlpack.utils.base.agent.Agent | |
loss | |
The list of losses accumulated after each backward call. More... | |
save_path | |
The path to save agent states and models. More... | |
Private Member Functions | |
None | _call_to_save (self) |
Method calling the save method when required. More... | |
None | _call_to_train_policy_model (self, Union[bool, int] done) |
Protected method to train the policy model. More... | |
None | _clear (self) |
Protected void method to clear the lists of rewards, action_log_probs and state_values. More... | |
pytorch.Tensor | _compute_advantage (self, pytorch.Tensor returns, pytorch.Tensor state_current_values) |
Computes the advantage from returns and state values. More... | |
pytorch.Tensor | _compute_loss (self) |
Method to compute total loss (from actor and critic). More... | |
pytorch.Tensor | _compute_returns (self) |
Computes the discounted returns iteratively. More... | |
Distribution | _create_action_distribution (self, pytorch.Tensor action_values) |
Protected static method to create distributions from action logits. More... | |
None | _grad_mean_reduction (self) |
Performs mean reduction and assigns the policy model's parameter the mean reduced gradients. More... | |
None | _run_optimizer (self, loss) |
Protected void method to train the model or accumulate the gradients for training. More... | |
Private Attributes | |
_grad_accumulator | |
The list of gradients from each backward call. More... | |
_normalization | |
The normalisation tool to be used for agent. More... | |
_operate_with_variance | |
The boolean flag indicating if variance operations are to be used. More... | |
The A2C class implements the synchronous Actor-Critic method.
def rlpack.actor_critic.a2c.A2C.__init__(
    self,
    pytorch.nn.Module policy_model,
    pytorch.optim.Optimizer optimizer,
    Union[LRScheduler, None] lr_scheduler,
    LossFunction loss_function,
    Distribution distribution,
    float gamma,
    float entropy_coefficient,
    float state_value_coefficient,
    float lr_threshold,
    Union[int, List[Union[int, List[int]]]] action_space,
    int backup_frequency,
    str save_path,
    int bootstrap_rounds = 1,
    str device = "cpu",
    Union[int, str] apply_norm = -1,
    Union[int, List[str]] apply_norm_to = -1,
    float eps_for_norm = 5e-12,
    int p_for_norm = 2,
    int dim_for_norm = 0,
    Optional[float] max_grad_norm = None,
    float grad_norm_p = 2.0,
    Optional[Tuple[float, Callable[[float, bool, int], float]]] variance = None
)
policy_model | pytorch.nn.Module: The policy model to be used. Policy model must return a tuple of action logits and state values. |
optimizer | pytorch.optim.Optimizer: The optimizer to be used for policy model. Optimizer must be initialized and wrapped with policy model parameters. |
lr_scheduler | Union[LRScheduler, None]: The LR Scheduler to be used to decay the learning rate. LR Scheduler must be initialized and wrapped with passed optimizer. |
loss_function | LossFunction: A PyTorch loss function. |
distribution | dist_math.distribution.Distribution: The PyTorch distribution to be used to sample actions in the action space (see action_space). |
gamma | float: The discounting factor for rewards. |
entropy_coefficient | float: The coefficient to be used for entropy in policy loss computation. |
state_value_coefficient | float: The coefficient to be used for state value in final loss computation. |
lr_threshold | float: The LR threshold; once it is reached, the LR scheduler is not called further. |
action_space | Union[int, List[Union[int, List[int]]]]: The action space of the environment. If discrete action set is used, number of actions can be passed. If continuous action space is used, a list must be passed with first element representing the output features from model, second representing the shape of action to be sampled. |
backup_frequency | int: The timesteps after which policy model, optimizer states and lr scheduler states are backed up. |
save_path | str: The path where policy model, optimizer states and lr scheduler states are to be saved. |
bootstrap_rounds | int: The number of rounds for which gradients are accumulated before the optimizer step is called. Gradients are mean-reduced when bootstrap_rounds > 1. Default: 1. |
device | str: The device on which models are run. Default: "cpu". |
apply_norm | Union[int, str]: The code selecting the normalization procedure to be applied to the selected quantities (selected by apply_norm_to; see below). A string can also be passed directly as per the accepted keys. Refer to the Notes below for the accepted values. Default: -1. |
apply_norm_to | Union[int, List[str]]: The code selecting the quantities to which normalization is to be applied. A list of quantities can also be passed directly as per the accepted keys. Refer to the Notes below for the accepted values. Default: -1. |
eps_for_norm | float: Epsilon value for numerical stability in normalization; used by min-max normalization and standardization. Default: 5e-12. |
p_for_norm | int: The p value for p-normalization. Default: 2; L2 Norm. |
dim_for_norm | int: The dimension across which normalization is to be performed. Default: 0. |
max_grad_norm | Optional[float]: The max norm for gradients for gradient clipping. Default: None |
grad_norm_p | float: The p-value for p-normalization of gradients. Default: 2.0 |
variance | Optional[Tuple[float, Callable[[float, bool, int], float]]]: A tuple of the initial variance used to sample actions in a continuous action space and a method used to decay it. The passed method must have the signature Callable[[float, bool, int], float]: the first argument is the current variance value, the second is the boolean done flag indicating whether the state is terminal, and the third is the timestep; it returns the updated variance value. Default: None |
Notes
The accepted values for apply_norm are given below (each also maps to an integer code; the default -1 corresponds to "none"):
- "none"
- "min_max"
- "standardize"
- "p_norm"
The accepted values for apply_norm_to are given below (each also maps to an integer code; the default -1 corresponds to ["none"]):
- ["none"]
- ["states"]
- ["rewards"]
- ["advantage"]
- ["states", "rewards"]
- ["states", "advantage"]
If a valid max_grad_norm is passed, gradient clipping takes place; otherwise the gradient clipping step is skipped. If the max_grad_norm value is invalid, an error will be raised from PyTorch.
Reimplemented from rlpack.utils.base.agent.Agent.
Reimplemented in rlpack.actor_critic.a3c.A3C.
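Below is a minimal construction sketch based on the signature and parameter descriptions above. The TwoHeadedNet module, the Adam / MSELoss / Categorical choices, and importing torch under the pytorch alias used throughout this page are illustrative assumptions, not requirements of the library; the only documented constraint is that the policy model returns a tuple of action logits and state values.

```python
import torch as pytorch  # the `pytorch` alias used in the signatures above

from rlpack.actor_critic.a2c import A2C


class TwoHeadedNet(pytorch.nn.Module):
    """Hypothetical policy model returning (action_logits, state_value)."""

    def __init__(self, in_features: int, num_actions: int):
        super().__init__()
        self.body = pytorch.nn.Sequential(pytorch.nn.Linear(in_features, 64), pytorch.nn.Tanh())
        self.actor_head = pytorch.nn.Linear(64, num_actions)
        self.critic_head = pytorch.nn.Linear(64, 1)

    def forward(self, x):
        features = self.body(x)
        return self.actor_head(features), self.critic_head(features)


policy_model = TwoHeadedNet(in_features=4, num_actions=2)
optimizer = pytorch.optim.Adam(policy_model.parameters(), lr=1e-3)

agent = A2C(
    policy_model=policy_model,
    optimizer=optimizer,
    lr_scheduler=None,
    loss_function=pytorch.nn.MSELoss(),              # assumed critic (state value) loss
    distribution=pytorch.distributions.Categorical,  # assumed distribution for a discrete action set
    gamma=0.99,
    entropy_coefficient=0.01,
    state_value_coefficient=0.5,
    lr_threshold=1e-6,
    action_space=2,                                  # discrete: number of actions
    backup_frequency=1000,
    save_path="./models/a2c",
)
```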
|
private |
Method calling the save method when required.
This method is to be overridden by asynchronous implementations.
Reimplemented in rlpack.actor_critic.a3c.A3C.
|
private |
Protected method to train the policy model.
If the done flag is True, this will compute the loss and run the optimizer. This method is meant to periodically check whether the episode has terminated and to train the policy model if it has.
done | Union[bool, int]: Flag indicating if episode has terminated or not |
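A conceptual sketch of the check described above; the helper names mirror the private methods documented on this page, but the body is an illustration rather than the library source.

```python
def _call_to_train_policy_model(self, done):
    # Illustration only: train at episode boundaries.
    if done:
        loss = self._compute_loss()  # combined actor / critic / entropy objective
        self._run_optimizer(loss)    # step immediately or accumulate gradients
```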
|
private |
Protected void method to clear the lists of rewards, action_log_probs and state_values.
|
private |
Computes the advantage from returns and state values.
returns | pytorch.Tensor: The discounted returns; computed from _compute_returns method |
state_current_values | pytorch.Tensor: The corresponding state values |
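As a standalone sketch (not the method body itself), the advantage is the element-wise difference between the two documented inputs; any normalisation configured via apply_norm_to=["advantage"] would be applied on top.

```python
import torch as pytorch  # the `pytorch` alias used in the signatures above


def compute_advantage(returns: pytorch.Tensor, state_current_values: pytorch.Tensor) -> pytorch.Tensor:
    # Advantage = discounted returns minus the critic's state-value estimates.
    return returns - state_current_values
```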
|
private |
Method to compute total loss (from actor and critic).
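A hedged sketch of the standard A2C objective that the entropy_coefficient and state_value_coefficient arguments suggest; rlpack's exact reductions and normalisation steps are not reproduced here.

```python
import torch as pytorch  # the `pytorch` alias used in the signatures above


def compute_total_loss(action_log_probabilities, advantages, returns,
                       states_current_values, entropies, loss_function,
                       state_value_coefficient, entropy_coefficient):
    # Actor term: increase log-probabilities of actions with positive advantage.
    policy_loss = (-action_log_probabilities * advantages.detach()).mean()
    # Critic term: regress state values towards the discounted returns.
    value_loss = loss_function(states_current_values, returns)
    # Entropy bonus encourages exploration; subtracting it lowers the loss for higher entropy.
    entropy_bonus = entropies.mean()
    return policy_loss + state_value_coefficient * value_loss - entropy_coefficient * entropy_bonus
```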
|
private |
Computes the discounted returns iteratively.
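A minimal sketch of the iterative computation, assuming no bootstrapping from a final state value.

```python
from typing import List

import torch as pytorch  # the `pytorch` alias used in the signatures above


def compute_returns(rewards: List[float], gamma: float) -> pytorch.Tensor:
    returns, running_return = [], 0.0
    for reward in reversed(rewards):  # iterate backwards from the last timestep
        running_return = reward + gamma * running_return
        returns.insert(0, running_return)
    return pytorch.tensor(returns)
```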
|
private |
Protected static method to create distributions from action logits.
action_values | pytorch.Tensor: The action values from policy model |
|
private |
Performs mean reduction and assigns the policy model's parameter the mean reduced gradients.
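A sketch of the mean reduction described above, assuming the gradients from each bootstrap round are available as plain per-parameter tensors rather than through the GradAccumulator object.

```python
import torch as pytorch  # the `pytorch` alias used in the signatures above


def grad_mean_reduction(policy_model: pytorch.nn.Module, accumulated_gradients) -> None:
    # accumulated_gradients: one list of per-parameter gradient tensors per bootstrap round.
    for index, parameter in enumerate(policy_model.parameters()):
        stacked = pytorch.stack([gradients[index] for gradients in accumulated_gradients])
        parameter.grad = stacked.mean(dim=0)  # assign the mean-reduced gradient
```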
|
private |
Protected void method to train the model or accumulate the gradients for training.
Reimplemented in rlpack.actor_critic.a3c.A3C.
None rlpack.actor_critic.a2c.A2C.load(self, Optional[str] custom_name_suffix = None)
This method loads the policy_model, optimizer, lr_scheduler and agent_states from the save_path argument supplied in the agent class' constructor (also called __init__).
custom_name_suffix | Optional[str]: If supplied, the suffix is appended to the names of policy_model, optimizer and lr_scheduler. Useful to load the best model by a custom suffix supplied for evaluation. Default: None |
Reimplemented from rlpack.utils.base.agent.Agent.
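For example, assuming models were previously saved with a matching suffix (the "_best" value is illustrative, not defined by rlpack):

```python
agent.load(custom_name_suffix="_best")  # loads the *_best model and optimizer states from save_path
```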
Union[int, np.ndarray] rlpack.actor_critic.a2c.A2C.policy(self, Union[pytorch.Tensor, np.ndarray, List[Union[float, int]]] state_current, **kwargs)
The policy method to evaluate the agent.
This runs in pure inference mode.
state_current | Union[pytorch.Tensor, np.ndarray, List[Union[float, int]]]: The current state returned from the gym environment |
kwargs | Other keyword arguments |
Reimplemented from rlpack.utils.base.agent.Agent.
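An evaluation sketch around this method; gymnasium and "CartPole-v1" are assumptions about the surrounding setup, not part of rlpack.

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
state_current, _ = env.reset()
done = False
while not done:
    action = agent.policy(state_current)  # pure inference, no gradient updates
    state_current, _, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
```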
None rlpack.actor_critic.a2c.A2C.save(self, Optional[str] custom_name_suffix = None)
This method saves the policy_model, optimizer, lr_scheduler and agent_states to the save_path argument supplied in the agent class' constructor (also called __init__).
custom_name_suffix | Optional[str]: If supplied, the suffix is appended to the names of policy_model, optimizer and lr_scheduler. Useful to save the best model by a custom suffix supplied during a training run. Default: None |
Reimplemented from rlpack.utils.base.agent.Agent.
Union[int, np.ndarray] rlpack.actor_critic.a2c.A2C.train(self, Union[pytorch.Tensor, np.ndarray, List[Union[float, int]]] state_current, Union[int, float] reward, Union[bool, int] done, **kwargs)
The train method to train the agent and underlying policy model.
state_current | Union[pytorch.Tensor, np.ndarray, List[Union[float, int]]]: The current state returned from the gym environment |
reward | Union[int, float]: The reward returned from previous action |
done | Union[bool, int]: Flag indicating if episode has terminated or not |
kwargs | Other keyword arguments. |
Reimplemented from rlpack.utils.base.agent.Agent.
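A training-loop sketch around this method; gymnasium, "CartPole-v1" and the episode count are assumptions about the surrounding setup, not part of rlpack.

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
for episode in range(500):
    state_current, _ = env.reset()
    reward, done = 0.0, False
    while not done:
        action = agent.train(state_current=state_current, reward=reward, done=done)
        state_current, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
    # A final call with done=True lets the agent train the policy model for this episode.
    agent.train(state_current=state_current, reward=reward, done=done)
```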
|
private |
The list of gradients from each backward call.
This is only used when bootstrap_rounds > 1 and is cleared after each bootstrap round. An rlpack._C.grad_accumulator.GradAccumulator object is used for gradient accumulation.
|
private |
The normalisation tool to be used for agent.
An instance of rlpack.utils.normalization.Normalization.
|
private |
The boolean flag indicating if variance operations are to be used.
rlpack.actor_critic.a2c.A2C.action_log_probabilities |
The list of log probabilities of the actions sampled from the action distribution at each timestep.
This is cleared after each episode.
rlpack.actor_critic.a2c.A2C.action_space |
The input number of actions.
rlpack.actor_critic.a2c.A2C.apply_norm |
The input apply_norm
argument; indicating the normalisation to be used.
rlpack.actor_critic.a2c.A2C.apply_norm_to |
The input apply_norm_to
argument; indicating the quantity to normalise.
rlpack.actor_critic.a2c.A2C.backup_frequency |
The input model backup frequency in terms of timesteps.
rlpack.actor_critic.a2c.A2C.bootstrap_rounds |
The input bootstrap rounds.
rlpack.actor_critic.a2c.A2C.device |
The input device
argument; indicating the device name.
rlpack.actor_critic.a2c.A2C.dim_for_norm |
The input dim_for_norm
argument; indicating dimension along which we wish to normalise.
rlpack.actor_critic.a2c.A2C.distribution |
The input distribution object.
rlpack.actor_critic.a2c.A2C.entropies |
The list of entropies from each timestep.
This is cleared after each episode.
rlpack.actor_critic.a2c.A2C.entropy_coefficient |
The input entropy coefficient.
rlpack.actor_critic.a2c.A2C.eps_for_norm |
The input eps_for_norm
argument; indicating epsilon to be used for normalisation.
rlpack.actor_critic.a2c.A2C.gamma |
The input discounting factor.
rlpack.actor_critic.a2c.A2C.grad_norm_p |
The input grad_norm_p
; indicating the p-value for p-normalisation used in gradient clipping.
rlpack.actor_critic.a2c.A2C.is_continuous_action_space |
Flag indicating if action space is continuous or discrete.
rlpack.actor_critic.a2c.A2C.loss_function |
The input loss function.
rlpack.actor_critic.a2c.A2C.lr_scheduler |
The input optional LR Scheduler (this can be None).
rlpack.actor_critic.a2c.A2C.lr_threshold |
The input LR Threshold.
rlpack.actor_critic.a2c.A2C.max_grad_norm |
The input max_grad_norm
; indicating the maximum gradient norm for gradient clipping.
rlpack.actor_critic.a2c.A2C.optimizer |
The input optimizer wrapped with policy_model parameters.
rlpack.actor_critic.a2c.A2C.p_for_norm |
The input p_for_norm
argument; indicating p-value for p-normalisation.
rlpack.actor_critic.a2c.A2C.policy_model |
The input policy model moved to desired device.
rlpack.actor_critic.a2c.A2C.rewards |
The list of rewards from each timestep.
This is cleared after each episode.
rlpack.actor_critic.a2c.A2C.save_path |
The input save path for backing up agent models.
rlpack.actor_critic.a2c.A2C.state_value_coefficient |
The input state value coefficient.
rlpack.actor_critic.a2c.A2C.states_current_values |
The list of state values at each timestep. This is cleared after each episode.
rlpack.actor_critic.a2c.A2C.step_counter |
The step counter; counting the total timesteps done so far.
rlpack.actor_critic.a2c.A2C.variance_decay_fn |
The variance decay method.
This will be None if variance
argument was not passed.
rlpack.actor_critic.a2c.A2C.variance_value |
The current variance value.
This will be None if variance
argument was not passed.
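As an illustration of the variance tuple for a continuous action space: the initial value 1.0 and the multiplicative decay below are assumptions; only the callable's (variance, done, timestep) -> variance contract comes from the constructor documentation above.

```python
def decay_variance(variance: float, done: bool, timestep: int) -> float:
    # Decay at episode boundaries, with a floor of 0.05.
    return max(variance * 0.99, 0.05) if done else variance


variance = (1.0, decay_variance)  # passed as the `variance` constructor argument
```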