RLPack
 
rlpack.actor_critic.a2c.A2C Class Reference

The A2C class implements the synchronous Actor-Critic method. More...

Inheritance diagram for rlpack.actor_critic.a2c.A2C (diagram omitted).
Collaboration diagram for rlpack.actor_critic.a2c.A2C (diagram omitted).

Public Member Functions

def __init__ (self, pytorch.nn.Module policy_model, pytorch.optim.Optimizer optimizer, Union[LRScheduler, None] lr_scheduler, LossFunction loss_function, Distribution distribution, float gamma, float entropy_coefficient, float state_value_coefficient, float lr_threshold, Union[int, List[Union[int, List[int]]]] action_space, int backup_frequency, str save_path, int bootstrap_rounds=1, str device="cpu", Union[int, str] apply_norm=-1, Union[int, List[str]] apply_norm_to=-1, float eps_for_norm=5e-12, int p_for_norm=2, int dim_for_norm=0, Optional[float] max_grad_norm=None, float grad_norm_p=2.0, Optional[Tuple[float, Callable[[float, bool, int], float]]] variance=None)
 
None load (self, Optional[str] custom_name_suffix=None)
 This method loads the policy_model, optimizer, lr_scheduler and agent states from the save_path supplied in the A2C class' constructor (__init__). More...
 
Union[int, np.ndarray] policy (self, Union[pytorch.Tensor, np.ndarray, List[Union[float, int]]] state_current, **kwargs)
 The policy method to evaluate the agent. More...
 
None save (self, Optional[str] custom_name_suffix=None)
 This method saves the policy_model, optimizer, lr_scheduler and agent states to the save_path supplied in the A2C class' constructor (__init__). More...
 
Union[int, np.ndarray] train (self, Union[pytorch.Tensor, np.ndarray, List[Union[float, int]]] state_current, Union[int, float] reward, Union[bool, int] done, **kwargs)
 The train method to train the agent and underlying policy model. More...
 
- Public Member Functions inherited from rlpack.utils.base.agent.Agent
Dict[str, Any] __getstate__ (self)
 To get the agent's current state (dict of attributes). More...
 
def __init__ (self)
 The class initializer. More...
 
None __setstate__ (self, Dict[str, Any] state)
 To load the agent's current state (dict of attributes). More...
 
None load (self, *args, **kwargs)
 Load method for the agent. More...
 
Any policy (self, *args, **kwargs)
 Policy method for the agent. More...
 
None save (self, *args, **kwargs)
 Save method for the agent. More...
 
Any train (self, *args, **kwargs)
 Training method for the agent. More...
 

Data Fields

 action_log_probabilities
 The list of log-probabilities of the actions sampled at each timestep from the action distribution. More...
 
 action_space
 The input action space; for a discrete action set this is the number of actions. More...
 
 apply_norm
 The input apply_norm argument; indicating the normalisation to be used. More...
 
 apply_norm_to
 The input apply_norm_to argument; indicating the quantity to normalise. More...
 
 backup_frequency
 The input model backup frequency in terms of timesteps. More...
 
 bootstrap_rounds
 The input bootstrap rounds. More...
 
 device
 The input device argument; indicating the device name. More...
 
 dim_for_norm
 The input dim_for_norm argument; indicating dimension along which we wish to normalise. More...
 
 distribution
 The input distribution object. More...
 
 entropies
 The list of entropies from each timestep. More...
 
 entropy_coefficient
 The input entropy coefficient. More...
 
 eps_for_norm
 The input eps_for_norm argument; indicating epsilon to be used for normalisation. More...
 
 gamma
 The input discounting factor. More...
 
 grad_norm_p
 The input grad_norm_p; indicating the p-value for p-normalisation for gradient clippings. More...
 
 is_continuous_action_space
 Flag indicating if action space is continuous or discrete. More...
 
 loss_function
 The input loss function. More...
 
 lr_scheduler
 The input optional LR Scheduler (this can be None). More...
 
 lr_threshold
 The input LR Threshold. More...
 
 max_grad_norm
 The input max_grad_norm; indicating the maximum gradient norm for gradient clippings. More...
 
 optimizer
 The input optimizer wrapped with policy_model parameters. More...
 
 p_for_norm
 The input p_for_norm argument; indicating p-value for p-normalisation. More...
 
 policy_model
 The input policy model moved to desired device. More...
 
 rewards
 The list of rewards from each timestep. More...
 
 save_path
 The input save path for backing up agent models. More...
 
 state_value_coefficient
 The input state value coefficient. More...
 
 states_current_values
 The list of state values at each timestep. This is cleared after each episode. More...
 
 step_counter
 The step counter; counting the total timesteps done so far. More...
 
 variance_decay_fn
 The variance decay method. More...
 
 variance_value
 The current variance value. More...
 
- Data Fields inherited from rlpack.utils.base.agent.Agent
 loss
 The list of losses accumulated after each backward call. More...
 
 save_path
 The path to save agent states and models. More...
 

Private Member Functions

None _call_to_save (self)
 Method calling the save method when required. More...
 
None _call_to_train_policy_model (self, Union[bool, int] done)
 Protected method to train the policy model. More...
 
None _clear (self)
 Protected void method to clear the lists of rewards, action_log_probs and state_values. More...
 
pytorch.Tensor _compute_advantage (self, pytorch.Tensor returns, pytorch.Tensor state_current_values)
 Computes the advantage from returns and state values. More...
 
pytorch.Tensor _compute_loss (self)
 Method to compute total loss (from actor and critic). More...
 
pytorch.Tensor _compute_returns (self)
 Computes the discounted returns iteratively. More...
 
Distribution _create_action_distribution (self, pytorch.Tensor action_values)
 Protected static method to create distributions from action logits. More...
 
None _grad_mean_reduction (self)
 Performs mean reduction and assigns the policy model's parameter the mean reduced gradients. More...
 
None _run_optimizer (self, loss)
 Protected void method to train the model or accumulate the gradients for training. More...
 

Private Attributes

 _grad_accumulator
 The list of gradients from each backward call. More...
 
 _normalization
 The normalisation tool to be used for agent. More...
 
 _operate_with_variance
 The boolean flag indicating if variance operations are to be used. More...
 

Detailed Description

The A2C class implements the synchronous Actor-Critic method.

Constructor & Destructor Documentation

◆ __init__()

def rlpack.actor_critic.a2c.A2C.__init__ (   self,
pytorch.nn.Module  policy_model,
pytorch.optim.Optimizer  optimizer,
Union[LRScheduler, None]  lr_scheduler,
LossFunction  loss_function,
Distribution  distribution,
float  gamma,
float  entropy_coefficient,
float  state_value_coefficient,
float  lr_threshold,
Union[int, List[Union[int, List[int]]]]  action_space,
int  backup_frequency,
str  save_path,
int   bootstrap_rounds = 1,
str   device = "cpu",
Union[int, str]   apply_norm = -1,
Union[int, List[str]]   apply_norm_to = -1,
float   eps_for_norm = 5e-12,
int   p_for_norm = 2,
int   dim_for_norm = 0,
Optional[float]   max_grad_norm = None,
float   grad_norm_p = 2.0,
Optional[Tuple[float, Callable[[float, bool, int], float]]]   variance = None 
)
Parameters
policy_model (pytorch.nn.Module): The policy model to be used. The policy model must return a tuple of action logits and state values.
optimizer (pytorch.optim.Optimizer): The optimizer to be used for the policy model. The optimizer must be initialized and wrapped with the policy model parameters.
lr_scheduler (Union[LRScheduler, None]): The LR scheduler to be used to decay the learning rate. The LR scheduler must be initialized and wrapped with the passed optimizer.
loss_function (LossFunction): A PyTorch loss function.
distribution (dist_math.distribution.Distribution): The PyTorch distribution to be used to sample actions in the action space (see action_space).
gamma (float): The discounting factor for rewards.
entropy_coefficient (float): The coefficient to be used for entropy in the policy loss computation.
state_value_coefficient (float): The coefficient to be used for the state value in the final loss computation.
lr_threshold (float): The learning-rate threshold; once it is reached, the LR scheduler is not called further.
action_space (Union[int, List[Union[int, List[int]]]]): The action space of the environment. If a discrete action set is used, the number of actions can be passed. If a continuous action space is used, a list must be passed whose first element is the number of output features from the model and whose second element is the shape of the action to be sampled.
backup_frequency (int): The number of timesteps after which the policy model, optimizer states and LR scheduler states are backed up.
save_path (str): The path where the policy model, optimizer states and LR scheduler states are to be saved.
bootstrap_rounds (int): The number of rounds for which gradients are accumulated before the optimizer step is called. Gradients are mean-reduced when bootstrap_rounds > 1. Default: 1.
device (str): The device on which models are run. Default: "cpu".
apply_norm (Union[int, str]): The code selecting the normalization procedure to be applied to the quantities selected by apply_norm_to (see below). A string can also be passed directly, as per the accepted keys. Refer to the Notes below for accepted values. Default: -1.
apply_norm_to (Union[int, List[str]]): The code selecting the quantity to which normalization is to be applied. A list of quantities can also be passed directly, as per the accepted keys. Refer to the Notes below for accepted values. Default: -1.
eps_for_norm (float): Epsilon value for normalization, included for numeric stability; used for min-max normalization and standardization. Default: 5e-12.
p_for_norm (int): The p value for p-normalization. Default: 2 (L2 norm).
dim_for_norm (int): The dimension across which normalization is to be performed. Default: 0.
max_grad_norm (Optional[float]): The max norm for gradient clipping. Default: None.
grad_norm_p (float): The p-value for p-normalization of gradients. Default: 2.0.
variance (Optional[Tuple[float, Callable[[float, bool, int], float]]]): A tuple of the initial variance used to sample actions in a continuous action space and a method used to decay it. The passed method must have the signature Callable[[float, bool, int], float]: the first argument is the current variance value, the second is the boolean done flag indicating whether the state is terminal, and the third is the timestep; it returns the updated variance value. Default: None.

Notes

The codes for apply_norm are given as follows:

  • No Normalization: -1; ("none")
  • Min-Max Normalization: 0; ("min_max")
  • Standardization: 1; ("standardize")
  • P-Normalization: 2; ("p_norm")

The codes for apply_norm_to are given as follows:

  • No Normalization: -1; (["none"])
  • On States only: 0; (["states"])
  • On Rewards only: 1; (["rewards"])
  • On TD value only: 2; (["advantage"])
  • On States and Rewards: 3; (["states", "rewards"])
  • On States and TD: 4; (["states", "advantage"])

If a valid max_grad_norm is passed, gradient clipping takes place; otherwise the gradient clipping step is skipped. If the max_grad_norm value is invalid, an error will be raised from PyTorch.

Reimplemented from rlpack.utils.base.agent.Agent.

Reimplemented in rlpack.actor_critic.a3c.A3C.
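
For illustration, below is a minimal sketch of constructing an A2C agent for a discrete action space, following the signature and parameter descriptions above. The two-headed network, the use of torch.distributions.Categorical as the distribution, the instantiated MSELoss as loss_function, and all hyperparameter values are assumptions made for this example, not defaults or requirements of RLPack.

    import torch
    from torch import nn
    from torch.distributions import Categorical

    from rlpack.actor_critic.a2c import A2C


    class TwoHeadedPolicy(nn.Module):
        """Toy policy model; must return a tuple of (action logits, state value)."""

        def __init__(self, in_features: int, num_actions: int):
            super().__init__()
            self.backbone = nn.Sequential(nn.Linear(in_features, 64), nn.Tanh())
            self.actor_head = nn.Linear(64, num_actions)
            self.critic_head = nn.Linear(64, 1)

        def forward(self, x: torch.Tensor):
            features = self.backbone(x)
            return self.actor_head(features), self.critic_head(features)


    policy_model = TwoHeadedPolicy(in_features=4, num_actions=2)
    optimizer = torch.optim.Adam(policy_model.parameters(), lr=1e-3)
    lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1_000, gamma=0.99)

    agent = A2C(
        policy_model=policy_model,
        optimizer=optimizer,                # already wrapped with policy_model parameters
        lr_scheduler=lr_scheduler,          # may also be None
        loss_function=nn.MSELoss(),         # assumed: an instantiated PyTorch loss
        distribution=Categorical,           # assumed: the torch distribution used to sample discrete actions
        gamma=0.99,
        entropy_coefficient=0.01,
        state_value_coefficient=0.5,
        lr_threshold=1e-5,
        action_space=2,                     # discrete action set: number of actions
        backup_frequency=10_000,
        save_path="./models/a2c",
        bootstrap_rounds=1,
        device="cpu",
        apply_norm="none",                  # string keys as listed in the Notes above
        apply_norm_to=["none"],
    )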

Member Function Documentation

◆ _call_to_save()

None rlpack.actor_critic.a2c.A2C._call_to_save (   self)
private

Method calling the save method when required.

This method is to be overridden by asynchronous methods.

Reimplemented in rlpack.actor_critic.a3c.A3C.

◆ _call_to_train_policy_model()

None rlpack.actor_critic.a2c.A2C._call_to_train_policy_model (   self,
Union[bool, int]  done 
)
private

Protected method to train the policy model.

If the done flag is True, this will compute the loss and run the optimizer. This method is meant to periodically check whether the episode has terminated and to train the policy model when it has.

Parameters
done (Union[bool, int]): Flag indicating whether the episode has terminated or not.
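
A rough sketch of the control flow this implies (not the library's actual code), assuming the loss computation, optimization and buffer clearing are delegated to the other protected methods documented on this page:

    def _call_to_train_policy_model(self, done):
        # Train only at episode boundaries; `done` may arrive as a bool or an int.
        if bool(done):
            loss = self._compute_loss()   # combined actor/critic (and entropy) loss
            self._run_optimizer(loss)     # step now, or accumulate if bootstrap_rounds > 1
            self._clear()                 # reset the per-episode buffers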

◆ _clear()

None rlpack.actor_critic.a2c.A2C._clear (   self)
private

Protected void method to clear the lists of rewards, action_log_probs and state_values.

◆ _compute_advantage()

pytorch.Tensor rlpack.actor_critic.a2c.A2C._compute_advantage (   self,
pytorch.Tensor  returns,
pytorch.Tensor   state_current_values 
)
private

Computes the advantage from returns and state values.

Parameters
returns (pytorch.Tensor): The discounted returns, computed by the _compute_returns method.
state_current_values (pytorch.Tensor): The corresponding state values.
Returns
pytorch.Tensor: The advantage for the given returns and state values
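
Conceptually, the advantage is the element-wise difference between the discounted returns and the critic's state-value estimates; a minimal sketch (not the library's implementation) is shown below. Whether the state values are detached here, and whether the configured normalisation is applied to the result when apply_norm_to includes "advantage", are implementation details not fixed by this page.

    import torch

    def compute_advantage(returns: torch.Tensor, state_current_values: torch.Tensor) -> torch.Tensor:
        # A(s_t) = G_t - V(s_t); detaching the critic values is a common choice that
        # keeps the actor term from backpropagating through the value head.
        return returns - state_current_values.detach()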

◆ _compute_loss()

pytorch.Tensor rlpack.actor_critic.a2c.A2C._compute_loss (   self)
private

Method to compute total loss (from actor and critic).

Returns
pytorch.Tensor: The loss tensor.
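
A hedged sketch of how the documented coefficients typically combine into the total loss; the exact sign conventions and reductions used inside rlpack are not specified on this page. The actor term uses the stored log-probabilities and advantages, the critic term applies the supplied loss_function between state values and returns scaled by state_value_coefficient, and the entropy bonus is weighted by entropy_coefficient.

    import torch

    def compute_total_loss(log_probs: torch.Tensor, advantages: torch.Tensor,
                           returns: torch.Tensor, state_values: torch.Tensor,
                           entropies: torch.Tensor, loss_function,
                           state_value_coefficient: float, entropy_coefficient: float) -> torch.Tensor:
        # Actor (policy-gradient) term: raise the log-probability of high-advantage actions.
        policy_loss = -(log_probs * advantages).mean()
        # Critic term: regress state values towards the discounted returns with the supplied loss.
        value_loss = loss_function(state_values, returns)
        # Entropy bonus: subtracted so that minimizing the total loss maximizes entropy.
        return policy_loss + state_value_coefficient * value_loss - entropy_coefficient * entropies.mean()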

◆ _compute_returns()

pytorch.Tensor rlpack.actor_critic.a2c.A2C._compute_returns (   self)
private

Computes the discounted returns iteratively.

Returns
pytorch.Tensor: The discounted returns
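
The iterative computation referred to here is the standard backward recursion G_t = r_t + gamma * G_{t+1}. A minimal sketch (not the library's code), using a plain rewards list and the gamma attribute:

    import torch

    def compute_returns(rewards: list, gamma: float) -> torch.Tensor:
        # Walk the episode backwards, accumulating the discounted return at each step.
        returns, running_return = [], 0.0
        for reward in reversed(rewards):
            running_return = reward + gamma * running_return
            returns.append(running_return)
        returns.reverse()
        return torch.tensor(returns, dtype=torch.float32)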

◆ _create_action_distribution()

Distribution rlpack.actor_critic.a2c.A2C._create_action_distribution (   self,
pytorch.Tensor  action_values 
)
private

Protected static method to create distributions from action logits.

Parameters
action_values (pytorch.Tensor): The action values from the policy model.
Returns
Distribution: A Distribution object initialized with given action logits
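
A sketch of what creating the distribution amounts to, assuming torch.distributions classes are used. The branch on is_continuous_action_space and the use of variance_value as the scale of the continuous distribution are assumptions based on the attributes documented on this page, not rlpack's exact code.

    import torch
    from torch.distributions import Categorical, Normal

    def create_action_distribution(distribution_class, action_values, is_continuous, variance_value=None):
        if not is_continuous:
            # Discrete: treat action_values as logits over the action set (e.g. Categorical).
            return distribution_class(logits=action_values)
        # Continuous: action_values act as the mean; the (possibly decayed) variance
        # supplies the scale of the distribution (e.g. Normal).
        scale = torch.full_like(action_values, float(variance_value)).sqrt()
        return distribution_class(loc=action_values, scale=scale)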

◆ _grad_mean_reduction()

None rlpack.actor_critic.a2c.A2C._grad_mean_reduction (   self)
private

Performs mean reduction and assigns the policy model's parameter the mean reduced gradients.
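
In spirit, the reduction averages the accumulated gradients parameter-wise and writes them back onto the policy model before the optimizer step. The sketch below treats the accumulator as a plain list of per-round gradient lists, which is an assumption for illustration; rlpack uses an rlpack._C.grad_accumulator.GradAccumulator object.

    import torch

    def grad_mean_reduction(policy_model: torch.nn.Module, accumulated_grads: list) -> None:
        # accumulated_grads: one entry per bootstrap round, each a list of per-parameter gradients.
        for param_index, parameter in enumerate(policy_model.parameters()):
            per_round = [round_grads[param_index] for round_grads in accumulated_grads]
            parameter.grad = torch.stack(per_round).mean(dim=0)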

◆ _run_optimizer()

None rlpack.actor_critic.a2c.A2C._run_optimizer (   self,
  loss 
)
private

Protected void method to train the model or accumulate the gradients for training.

  • If bootstrap_rounds is passed as 1 (default), the model is trained each time the method is called.
  • If bootstrap_rounds > 1, the gradients are accumulated in grad_accumulator and the model is trained via the _train_models method.

Reimplemented in rlpack.actor_critic.a3c.A3C.
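
A hedged sketch of the accumulate-or-step behaviour described by the two bullets above, treating the gradient accumulator as a plain list for illustration; the gradient clipping with max_grad_norm and the lr_threshold check on the scheduler are folded in as assumptions consistent with the constructor documentation, not as rlpack's literal code.

    import torch

    def run_optimizer(self, loss):
        self.optimizer.zero_grad()
        loss.backward()
        if self.bootstrap_rounds > 1:
            # Accumulate this round's gradients; defer the update until enough rounds are gathered.
            self._grad_accumulator.append([p.grad.clone() for p in self.policy_model.parameters()])
            if len(self._grad_accumulator) < self.bootstrap_rounds:
                return
            self._grad_mean_reduction()     # write mean-reduced gradients back to the parameters
            self._grad_accumulator.clear()
        if self.max_grad_norm is not None:
            torch.nn.utils.clip_grad_norm_(self.policy_model.parameters(),
                                           self.max_grad_norm, norm_type=self.grad_norm_p)
        self.optimizer.step()
        if self.lr_scheduler is not None and self.optimizer.param_groups[0]["lr"] > self.lr_threshold:
            self.lr_scheduler.step()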

◆ load()

None rlpack.actor_critic.a2c.A2C.load (   self,
Optional[str]   custom_name_suffix = None 
)

This method loads the policy_model, optimizer, lr_scheduler and agent states from the save_path supplied in the A2C class' constructor (__init__).

Parameters
custom_name_suffix (Optional[str]): If supplied, an additional suffix is added to the names of the policy_model, optimizer and lr_scheduler. Useful for loading the best model by a custom suffix supplied for evaluation. Default: None.

Reimplemented from rlpack.utils.base.agent.Agent.

◆ policy()

Union[int, np.ndarray] rlpack.actor_critic.a2c.A2C.policy (   self,
Union[pytorch.Tensor, np.ndarray, List[Union[float, int]]]  state_current,
**  kwargs 
)

The policy method to evaluate the agent.

This runs in pure inference mode.

Parameters
state_current (Union[pytorch.Tensor, np.ndarray, List[Union[float, int]]]): The current state returned from the gym environment.
kwargs: Other keyword arguments.
Returns
Union[int, np.ndarray]: The action to be taken.

Reimplemented from rlpack.utils.base.agent.Agent.
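
For example, evaluation after training could look like the sketch below, reusing the agent constructed in the constructor example above and assuming a Gymnasium-style CartPole environment; the "_best" suffix is a hypothetical name chosen when the checkpoint was saved.

    import gymnasium as gym

    env = gym.make("CartPole-v1")
    agent.load(custom_name_suffix="_best")      # hypothetical suffix chosen at save time

    observation, _ = env.reset(seed=0)
    episode_return, done = 0.0, False
    while not done:
        action = agent.policy(observation)      # pure inference; no gradient updates
        observation, reward, terminated, truncated, _ = env.step(action)
        episode_return += reward
        done = terminated or truncated
    print(f"Evaluation return: {episode_return}")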

◆ save()

None rlpack.actor_critic.a2c.A2C.save (   self,
Optional[str]   custom_name_suffix = None 
)

This method saves the policy_model, optimizer, lr_scheduler and agent states to the save_path supplied in the A2C class' constructor (__init__).

The agent states are stored as a dictionary of the agent's attributes.

Parameters
custom_name_suffix (Optional[str]): If supplied, an additional suffix is added to the names of the policy_model, optimizer and lr_scheduler. Useful for saving the best model with a custom suffix during a training run. Default: None.

Reimplemented from rlpack.utils.base.agent.Agent.
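
For instance, alongside the automatic backups written every backup_frequency timesteps, an explicit checkpoint can be tagged with an arbitrary suffix (the name "_best" below is only an example):

    # Tag the current best-performing model; load() can later be called with the same suffix.
    agent.save(custom_name_suffix="_best")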

◆ train()

Union[int, np.ndarray] rlpack.actor_critic.a2c.A2C.train (   self,
Union[pytorch.Tensor, np.ndarray, List[Union[float, int]]]  state_current,
Union[int, float]  reward,
Union[bool, int]  done,
**  kwargs 
)

The train method to train the agent and underlying policy model.

Parameters
state_current (Union[pytorch.Tensor, np.ndarray, List[Union[float, int]]]): The current state returned from the gym environment.
reward (Union[int, float]): The reward returned for the previous action.
done (Union[bool, int]): Flag indicating whether the episode has terminated or not.
kwargs: Other keyword arguments.
Returns
Union[int, np.ndarray]: The action to be taken.

Reimplemented from rlpack.utils.base.agent.Agent.
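
Putting it together, a training loop might look like the sketch below, assuming a Gymnasium-style environment and the agent constructed in the constructor example above. The reward/done bookkeeping (feeding the outcome of the previous action with each call, plus one final call for the terminal transition) is an assumption about the intended calling convention, not something this page states explicitly.

    import gymnasium as gym

    env = gym.make("CartPole-v1")

    for episode in range(500):
        observation, _ = env.reset()
        reward, done = 0.0, False
        while not done:
            # `reward` and `done` describe the outcome of the previous action;
            # the call returns the next action and trains the model at episode end.
            action = agent.train(state_current=observation, reward=reward, done=done)
            observation, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
        # Feed the terminal transition so the episode is closed out and training runs.
        agent.train(state_current=observation, reward=reward, done=done)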

Field Documentation

◆ _grad_accumulator

rlpack.actor_critic.a2c.A2C._grad_accumulator
private

The list of gradients from each backward call.

This is only used when bootstrap_rounds > 1 and is cleared after each bootstrap round. It is an rlpack._C.grad_accumulator.GradAccumulator object used for gradient accumulation.

◆ _normalization

rlpack.actor_critic.a2c.A2C._normalization
private

The normalisation tool to be used for agent.

An instance of rlpack.utils.normalization.Normalization.

◆ _operate_with_variance

rlpack.actor_critic.a2c.A2C._operate_with_variance
private

The boolean flag indicating if variance operations are to be used.

◆ action_log_probabilities

rlpack.actor_critic.a2c.A2C.action_log_probabilities

The list of log-probabilities of the actions sampled at each timestep from the action distribution.

This is cleared after each episode.

◆ action_space

rlpack.actor_critic.a2c.A2C.action_space

The input action space; for a discrete action set this is the number of actions.

◆ apply_norm

rlpack.actor_critic.a2c.A2C.apply_norm

The input apply_norm argument; indicating the normalisation to be used.

◆ apply_norm_to

rlpack.actor_critic.a2c.A2C.apply_norm_to

The input apply_norm_to argument; indicating the quantity to normalise.

◆ backup_frequency

rlpack.actor_critic.a2c.A2C.backup_frequency

The input model backup frequency in terms of timesteps.

◆ bootstrap_rounds

rlpack.actor_critic.a2c.A2C.bootstrap_rounds

The input bootstrap rounds.

◆ device

rlpack.actor_critic.a2c.A2C.device

The input device argument; indicating the device name.

◆ dim_for_norm

rlpack.actor_critic.a2c.A2C.dim_for_norm

The input dim_for_norm argument; indicating dimension along which we wish to normalise.

◆ distribution

rlpack.actor_critic.a2c.A2C.distribution

The input distribution object.

◆ entropies

rlpack.actor_critic.a2c.A2C.entropies

The list of entropies from each timestep.

This is cleared after each episode.

◆ entropy_coefficient

rlpack.actor_critic.a2c.A2C.entropy_coefficient

The input entropy coefficient.

◆ eps_for_norm

rlpack.actor_critic.a2c.A2C.eps_for_norm

The input eps_for_norm argument; indicating epsilon to be used for normalisation.

◆ gamma

rlpack.actor_critic.a2c.A2C.gamma

The input discounting factor.

◆ grad_norm_p

rlpack.actor_critic.a2c.A2C.grad_norm_p

The input grad_norm_p; indicating the p-value for p-normalisation for gradient clippings.

◆ is_continuous_action_space

rlpack.actor_critic.a2c.A2C.is_continuous_action_space

Flag indicating if action space is continuous or discrete.

◆ loss_function

rlpack.actor_critic.a2c.A2C.loss_function

The input loss function.

◆ lr_scheduler

rlpack.actor_critic.a2c.A2C.lr_scheduler

The input optional LR Scheduler (this can be None).

◆ lr_threshold

rlpack.actor_critic.a2c.A2C.lr_threshold

The input LR Threshold.

◆ max_grad_norm

rlpack.actor_critic.a2c.A2C.max_grad_norm

The input max_grad_norm; indicating the maximum gradient norm for gradient clippings.

◆ optimizer

rlpack.actor_critic.a2c.A2C.optimizer

The input optimizer wrapped with policy_model parameters.

◆ p_for_norm

rlpack.actor_critic.a2c.A2C.p_for_norm

The input p_for_norm argument; indicating p-value for p-normalisation.

◆ policy_model

rlpack.actor_critic.a2c.A2C.policy_model

The input policy model moved to desired device.

◆ rewards

rlpack.actor_critic.a2c.A2C.rewards

The list of rewards from each timestep.

This is cleared after each episode.

◆ save_path

rlpack.actor_critic.a2c.A2C.save_path

The input save path for backing up agent models.

◆ state_value_coefficient

rlpack.actor_critic.a2c.A2C.state_value_coefficient

The input state value coefficient.

◆ states_current_values

rlpack.actor_critic.a2c.A2C.states_current_values

The list of state values at each timestep. This is cleared after each episode.


◆ step_counter

rlpack.actor_critic.a2c.A2C.step_counter

The step counter; counting the total timesteps done so far.

◆ variance_decay_fn

rlpack.actor_critic.a2c.A2C.variance_decay_fn

The variance decay method.

This will be None if the variance argument was not passed.

◆ variance_value

rlpack.actor_critic.a2c.A2C.variance_value

The current variance value.

This will be None if the variance argument was not passed.