fsrl.policy¶

Base¶

class fsrl.policy.BasePolicy(actor: ~torch.nn.modules.module.Module, critics: ~torch.nn.modules.module.Module | ~typing.List[~torch.nn.modules.module.Module], dist_fn: ~typing.Type[~torch.distributions.distribution.Distribution] | None = None, logger: ~fsrl.utils.logger.base_logger.BaseLogger = <fsrl.utils.logger.base_logger.BaseLogger object>, gamma: float = 0.99, max_batchsize: int | None = 99999, reward_normalization: bool = False, deterministic_eval: bool = True, action_scaling: bool = True, action_bound_method: str = 'clip', observation_space: ~gymnasium.spaces.space.Space | None = None, action_space: ~gymnasium.spaces.space.Space | None = None, lr_scheduler: ~torch.optim.lr_scheduler.LambdaLR | ~tianshou.utils.lr_scheduler.MultipleLRSchedulers | None = None)[source]¶

Bases: ABC, Module

The base class for safe RL policy.

The base class follows a similar structure as Tianshou. All of the policy classes must inherit BasePolicy.

A policy class typically has the following parts:

__init__(): initialize the policy, including copying the target network and so on;
forward(): compute action with given observation;
process_fn(): pre-process data from the replay buffer (this function can interact with the replay buffer);
learn(): update policy with a given batch of data;
post_process_fn(): update the replay buffer from the learning process (e.g., a prioritized replay buffer needs to update the weight);
update(): the main interface for training, i.e., process_fn -> learn -> post_process_fn.

Most policies need a neural network to predict the action and an optimizer to optimize the policy. The rules for self-defined networks are:

1. Input: observation “obs” (may be a numpy.ndarray, a torch.Tensor, a dict or any other type), hidden state “state” (for RNN usage), and other information “info” provided by the environment.
2. Output: some “logits”, the next hidden state “state”, and the intermediate result during the policy forwarding procedure “policy”. The “logits” could be a tuple instead of a torch.Tensor, depending on how the policy processes the network output. For example, in PPO, the return of the network might be (mu, sigma), state for a Gaussian policy. The “policy” can be a Batch of torch.Tensor or other things, which will be stored in the replay buffer and can be accessed in the policy update process (e.g. in “policy.learn()”, the “batch.policy” is what you need).

Since BasePolicy inherits torch.nn.Module, you can use BasePolicy almost the same as torch.nn.Module, for instance, loading and saving the model:

torch.save(policy.state_dict(), "policy.pth")
policy.load_state_dict(torch.load("policy.pth"))

Parameters:

actor (torch.nn.Module) – the actor network.
critics (Union[nn.Module, List[nn.Module]]) – the critic network(s). (s -> V(s))
dist_fn (Type[torch.distributions.Distribution]) – distribution class for stochastic policy to sample the action. Default to None.
logger (BaseLogger) – the logger instance for logging training information. Default to DummyLogger.
gamma (float) – the discounting factor for cost and reward, should be in [0, 1]. Default to 0.99.
max_batchsize (int) – the maximum size of the batch when computing GAE, depends on the size of available memory and the memory cost of the model; should be as large as possible within the memory constraint. Default to 99999.
reward_normalization (bool) – normalize estimated values to have std close to 1, also normalize the advantage to Normal(0, 1). Default to False.
deterministic_eval – whether to use deterministic action instead of stochastic action sampled by the policy. Default to True.
action_scaling – whether to map actions from range [-1, 1] to range [action_spaces.low, action_spaces.high]. Default to True.
action_bound_method – method to bound action to range [-1, 1]. Default to “clip”.
observation_space – environment’s observation space. Default to None.
action_space – environment’s action space. Default to None.
lr_scheduler – learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None.

forward(batch: Batch, state: dict | Batch | ndarray | None = None, **kwargs: Any) → Batch[source]¶

Compute action over the given batch data.

Returns:

A Batch which MUST have the following keys:

act a numpy.ndarray or a torch.Tensor, the action over given batch data.
state a dict, a numpy.ndarray or a torch.Tensor, the internal state of the policy, None as default.

Other keys are user-defined based on the algorithm. For example,

# for stochastic policy
return Batch(logits=..., act=..., state=None, dist=...)

where

logits is the network's raw output;
dist is the action distribution;
state is the hidden state.

The keyword policy is reserved and the corresponding data will be stored into the replay buffer. For instance,

# some code
return Batch(..., policy=Batch(log_prob=dist.log_prob(act)))
# and in the sampled data batch, you can directly use
# batch.policy.log_prob to get your data.

Note

In continuous action space, you should do another step “map_action” to get the real action:

act = policy(batch).act  # doesn't map to the target action range
act = policy.map_action(act)

pre_update_fn(**kwarg: Any) → Any[source]¶

Pre-process the policy or data before updating the policy.

This function is called after each data collection in trainer() and could be used to update the Lagrangian multiplier.

post_update_fn(**kwarg: Any) → Any[source]¶

Post-process the policy or data after updating the policy.

This function is called in trainer() and could be used to sync the weights or old variables.

exploration_noise(act: ndarray | Batch, batch: Batch) → ndarray | Batch[source]¶

Modify the action from policy.forward with exploration noise.

Parameters:

act – a data batch or numpy.ndarray which is the action taken by policy.forward.
batch – the input batch for policy.forward, kept for advanced usage.

Returns:

Action in the same form as the input “act” but with added exploration noise.

soft_update(tgt: Module, src: Module, tau: float) → None[source]¶: Softly update the parameters of target module towards the parameters of source module.
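
As a rough illustration of what a soft (Polyak) update does, the sketch below blends plain Python lists instead of torch parameters. The helper name and the list representation are assumptions made for this example only; it is not the library's implementation.

```python
def soft_update(target, source, tau):
    """Return target values moved a fraction `tau` of the way toward source."""
    # tau = 1 copies the source outright; tau = 0 leaves the target untouched.
    return [tau * s + (1.0 - tau) * t for t, s in zip(target, source)]


target = [0.0, 10.0]
source = [1.0, 0.0]
target = soft_update(target, source, tau=0.05)
print(target)
```

With a small tau such as the documented default of 0.05, the target network trails the source slowly, which is the usual reason for using a target network at all.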

map_action(act: Batch | ndarray) → Batch | ndarray[source]¶

Map raw network output to action range in gym’s env.action_space.

This function is called in collect() and only affects the action sent to the env. The remapped action will not be stored in the buffer and thus can be viewed as a part of the env (a black box action transformation).

Action mapping includes 2 standard procedures: bounding and scaling. The bounding procedure expects the original action range to be (-inf, inf) and maps it to [-1, 1], while the scaling procedure expects the original action range to be (-1, 1) and maps it to [action_space.low, action_space.high]. The bounding procedure is applied first.

Parameters:

act – a data batch or numpy.ndarray which is the action taken by policy.forward.

Returns:

Action in the same form as the input “act” but remapped to the target action space.
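
The two procedures can be sketched for a list of per-dimension actions as follows. This is a minimal sketch assuming "clip" bounding and a linear rescaling from [-1, 1] to [low, high]; the function signature is hypothetical and is not the method's actual signature.

```python
def map_action(act, low, high, bound_method="clip", scaling=True):
    """Bound each raw action to [-1, 1], then scale it to [low, high]."""
    mapped = []
    for a, lo, hi in zip(act, low, high):
        if bound_method == "clip":
            # Bounding is applied first.
            a = max(-1.0, min(1.0, a))
        if scaling:
            # Linear map: -1 -> lo, +1 -> hi.
            a = lo + (hi - lo) * (a + 1.0) / 2.0
        mapped.append(a)
    return mapped


env_act = map_action([-3.0, 0.0, 0.5], low=[0.0, 0.0, -2.0], high=[10.0, 10.0, 2.0])
print(env_act)  # [0.0, 5.0, 1.0]
```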

map_action_inverse(act: Batch | List | ndarray) → Batch | List | ndarray[source]¶

Inverse operation to map_action().

This function is called in collect() for random initial steps. It scales [action_space.low, action_space.high] to the value ranges of policy.forward.

Parameters:

act – a data batch, list or numpy.ndarray which is the action taken by gym.spaces.Box.sample().

Returns:

The remapped action.
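
To make the inverse relationship concrete, the sketch below pairs a linear scaling with its inverse and checks the round trip. Both helper names are hypothetical, and only the scaling step is inverted here (clipping has no exact inverse).

```python
def scale_to_env(act, low, high):
    """Map actions from [-1, 1] to [low, high]."""
    return [lo + (hi - lo) * (a + 1.0) / 2.0 for a, lo, hi in zip(act, low, high)]


def scale_to_policy(act, low, high):
    """Map actions from [low, high] back to [-1, 1]."""
    return [(a - lo) * 2.0 / (hi - lo) - 1.0 for a, lo, hi in zip(act, low, high)]


low, high = [0.0, -2.0], [10.0, 2.0]
sampled = [2.5, -2.0]  # e.g. a random action drawn in env units
policy_act = scale_to_policy(sampled, low, high)
print(policy_act)                           # [-0.5, -1.0]
print(scale_to_env(policy_act, low, high))  # [2.5, -2.0]
```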

process_fn(batch: Batch, buffer: ReplayBuffer, indices: ndarray) → Batch[source]¶

Pre-process the data from the provided replay buffer.

Used in update(). Check out here for more information.

abstract learn(batch: Batch, **kwargs: Any) → Dict[str, Any][source]¶

Update policy with a given batch of data.

Returns:

A dict, including the data needed to be logged (e.g., loss).

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to this for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operation with torch tensors will amplify this error.

post_process_fn(batch: Batch, buffer: ReplayBuffer, indices: ndarray) → None[source]¶

Post-process the data from the provided replay buffer.

Typical usage is to update the sampling weight in prioritized experience replay. Used in update().

update(sample_size: int, buffer: ReplayBuffer | None, **kwargs: Any) → Dict[str, Any][source]¶

Update the policy network and replay buffer.

It includes 3 function steps: process_fn, learn, and post_process_fn. In addition, this function will change the value of self.updating: it will be False before this function and will be True when executing update().

Parameters:

sample_size (int) – 0 means it will extract all the data from the buffer, otherwise it will sample a batch with given sample_size.
buffer (ReplayBuffer) – the corresponding replay buffer.

Returns:

No return because all the info should be stored in the logger.
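
The order of the three steps and the toggling of the updating flag can be illustrated with a toy class. Everything below is a simplified stand-in: the class, the list used as a buffer, and the slice used in place of random sampling are assumptions for illustration, not the library's code.

```python
class ToyPolicy:
    """Toy stand-in that only records the order of the update steps."""

    def __init__(self):
        self.updating = False
        self.calls = []

    def process_fn(self, batch):
        self.calls.append("process_fn")
        return batch

    def learn(self, batch):
        self.calls.append("learn")
        return {"loss": sum(batch) / len(batch)}

    def post_process_fn(self, batch):
        self.calls.append("post_process_fn")

    def update(self, sample_size, buffer):
        if buffer is None:
            return {}
        # sample_size == 0 means "use everything"; a slice stands in for sampling.
        batch = buffer if sample_size == 0 else buffer[:sample_size]
        self.updating = True
        batch = self.process_fn(batch)
        result = self.learn(batch)
        self.post_process_fn(batch)
        self.updating = False
        return result


policy = ToyPolicy()
result = policy.update(2, [1.0, 3.0, 5.0])
print(policy.calls, result)
```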

static value_mask(buffer: ReplayBuffer, indices: ndarray) → ndarray[source]¶

Value mask determines whether the obs_next of buffer[indices] is valid.

For instance, “obs_next” after a “done” flag is usually considered invalid: its q/advantage value can provide meaningless (even misleading) information and should be set to 0 by hand. But if the “done” flag is generated because of a time limit on the episode length (info[“TimeLimit.truncated”] is set to True in gym’s settings), “obs_next” will instead be valid. The value mask is typically used to assist in calculating the correct q/advantage value.

Parameters:

buffer (ReplayBuffer) – the corresponding replay buffer.
indices (numpy.ndarray) – indices of replay buffer whose “obs_next” will be judged.

Returns:

A bool type numpy.ndarray with the same shape as indices. “True” means “obs_next” of that buffer[indices] is valid.
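
The rule described above can be sketched with plain Python lists. The function below takes the done and truncated flags directly rather than a buffer and indices, which is a simplification made for this example.

```python
def value_mask(done, truncated):
    """True where obs_next is still valid for bootstrapping."""
    # Valid if the episode did not end, or if it ended only due to a time limit.
    return [(not d) or t for d, t in zip(done, truncated)]


mask = value_mask(done=[False, True, True], truncated=[False, False, True])
print(mask)  # [True, False, True]
```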

static get_metrics(batch: Batch)[source]¶

compute_gae_returns(batch: Batch, buffer: ReplayBuffer, indices: ndarray, gae_lambda: float = 0.95) → Batch[source]¶

Compute Generalized Advantage Estimation (GAE) returns.

This function takes in a data batch, a data buffer, an array of indices, a GAE lambda value, and computes the GAE returns for each critic. It returns a Batch object with the result stored in batch.values, batch.rets, and batch.advs as torch.Tensors.

Parameters:

batch (Batch) – A data batch.
buffer (ReplayBuffer) – The data buffer.
indices (ndarray) – An array of indices.
gae_lambda (float) – The GAE lambda value. Should be in [0, 1]. Defaults to 0.95.

Returns:

Batch object with the result stored in batch.values, batch.rets, and batch.advs as torch.Tensors.
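
For readers unfamiliar with GAE, the recursion can be sketched for a single trajectory and a single critic in plain Python. The function name, argument layout and done handling are assumptions for illustration; the method itself operates on Batch and ReplayBuffer objects and repeats the computation for each critic.

```python
def gae(rewards, values, next_values, dones, gamma=0.99, gae_lambda=0.95):
    """Return (advantages, returns) for one trajectory."""
    advs = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; no bootstrap from next_values after termination.
        delta = rewards[t] + gamma * next_values[t] * not_done - values[t]
        # Exponentially weighted sum of TD errors, reset at episode ends.
        running = delta + gamma * gae_lambda * not_done * running
        advs[t] = running
    rets = [a + v for a, v in zip(advs, values)]
    return advs, rets


advs, rets = gae(
    rewards=[1.0, 1.0], values=[0.0, 0.0], next_values=[0.0, 0.0],
    dones=[False, True], gamma=0.5, gae_lambda=1.0,
)
print(advs, rets)  # [1.5, 1.0] [1.5, 1.0]
```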

compute_nstep_returns(batch: Batch, buffer: ReplayBuffer, indice: ndarray, target_q_fn: Callable[[ReplayBuffer, ndarray], Tensor], n_step: int = 1) → Batch[source]¶

Compute n-step return for Q-learning targets.

\[G_t = \sum_{i = t}^{t + n - 1} \gamma^{i - t}(1 - d_i)r_i + \gamma^n (1 - d_{t + n}) Q_{\mathrm{target}}(s_{t + n})\]

where \(\gamma\) is the discount factor, \(\gamma \in [0, 1]\), \(d_t\) is the done flag of step \(t\).

Parameters:

batch (Batch) – a data batch, which is equal to buffer[indice].
buffer (ReplayBuffer) – the data buffer.
indice (ndarray) – the sampled batch indices in the buffer.
target_q_fn (function) – a function which computes the target Q value of “obs_next” given the data buffer and wanted indices.
n_step (int) – the number of estimation steps, should be an int greater than 0. Default to 1.

Returns:

a Batch. The result will be stored in batch.returns as a torch.Tensor with the same shape as target_q_fn’s return tensor.
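
A single n-step target can be sketched as below. The helper is hypothetical and uses one common convention for the done flags: the reward of the terminating step is still counted and the bootstrap term is dropped once an episode ends inside the window. The indexing of the done mask may therefore differ from the formula above.

```python
def nstep_return(rewards, dones, q_target_next, gamma=0.99):
    """n-step target from rewards/dones of steps t..t+n-1 and Q_target(s_{t+n})."""
    g = 0.0
    discount = 1.0
    for r, d in zip(rewards, dones):
        g += discount * r
        discount *= gamma
        if d:
            # Episode ended inside the window: no bootstrap term.
            return g
    return g + discount * q_target_next


full = nstep_return([1.0, 1.0], [False, False], q_target_next=10.0, gamma=0.5)
cut = nstep_return([1.0, 1.0], [True, False], q_target_next=10.0, gamma=0.5)
print(full, cut)  # 4.0 1.0
```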

training: bool¶

Lagrangian Base¶

class fsrl.policy.LagrangianPolicy(actor: ~torch.nn.modules.module.Module, critics: ~torch.nn.modules.module.Module | ~typing.List[~torch.nn.modules.module.Module], dist_fn: ~typing.Type[~torch.distributions.distribution.Distribution], logger: ~fsrl.utils.logger.base_logger.BaseLogger = <fsrl.utils.logger.base_logger.BaseLogger object>, use_lagrangian: bool = True, lagrangian_pid: ~typing.Tuple = (0.05, 0.0005, 0.1), cost_limit: ~typing.List | float = inf, rescaling: bool = True, gamma: float = 0.99, max_batchsize: int = 99999, reward_normalization: bool = False, deterministic_eval: bool = True, action_scaling: bool = True, action_bound_method: str = 'clip', observation_space: ~gymnasium.spaces.space.Space | None = None, action_space: ~gymnasium.spaces.space.Space | None = None, lr_scheduler: ~torch.optim.lr_scheduler.LambdaLR | ~tianshou.utils.lr_scheduler.MultipleLRSchedulers | None = None)[source]¶

Bases: BasePolicy

Implementation of PID Lagrangian-based method.

Parameters:

actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)
critics (List[torch.nn.Module]) – a list of critic networks. (s -> V(s))
dist_fn (Type[torch.distributions.Distribution]) – distribution class for computing the action.
logger (BaseLogger) – dummy logger for logging events.
use_lagrangian (bool) – whether to use Lagrangian method. Default to True.
lagrangian_pid (list) – list of PID constants for Lagrangian multiplier. Default to [0.05, 0.0005, 0.1].
cost_limit (float) – cost limit for the Lagrangian method. Default to np.inf.
rescaling (bool) – whether to use the rescaling trick for the Lagrangian multiplier, see Alg. 1 in http://proceedings.mlr.press/v119/stooke20a/stooke20a.pdf
gamma (float) – the discounting factor for cost and reward, should be in [0, 1]. Default to 0.99.
max_batchsize (int) – the maximum size of the batch when computing GAE, depends on the size of available memory and the memory cost of the model; should be as large as possible within the memory constraint. Default to 99999.
reward_normalization (bool) – normalize estimated values to have std close to 1. Default to False.
deterministic_eval (bool) – whether to use deterministic action instead of stochastic action sampled by the policy. Default to True.
action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_spaces.low, action_spaces.high]. Default to True.
action_bound_method (str) – method to bound action to range [-1, 1]. can be either “clip” (for simply clipping the action) or empty string for no bounding. Default to “clip”.
observation_space (gym.Space) – environment’s observation space. Default to None.
action_space (gym.Space) – environment’s action space. Default to None.
lr_scheduler – learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None.

See also

Please refer to BasePolicy for more detailed explanation.

pre_update_fn(stats_train: Dict, **kwarg) → None[source]¶

Pre-process the policy or data before updating the policy.

This function is called after each data collection in trainer() and could be used to update the Lagrangian multiplier.

update_cost_limit(cost_limit: float) → None[source]¶

Update the cost limit threshold.

Parameters:

cost_limit (float) – new cost threshold.

update_lagrangian(cost_values: List | float) → None[source]¶

Update the Lagrangian multiplier before updating the policy.

Parameters:

cost_values (Union[List, float]) – the estimated cost values to be kept under the target thresholds. It could be a list (multiple constraints) or a scalar value.
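
A PID-controlled multiplier for a single constraint can be sketched as follows, loosely following the PID Lagrangian paper linked in the parameter list above. The class is hypothetical; details such as smoothing or multi-constraint handling in the library may differ.

```python
class PIDLagrangian:
    """Toy PID controller for one Lagrangian multiplier."""

    def __init__(self, kp=0.05, ki=0.0005, kd=0.1, cost_limit=10.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.cost_limit = cost_limit
        self.integral = 0.0
        self.prev_cost = 0.0
        self.multiplier = 0.0

    def update(self, cost):
        error = cost - self.cost_limit                      # constraint violation
        self.integral = max(0.0, self.integral + error)     # clamped integral term
        derivative = max(0.0, cost - self.prev_cost)        # react only to rising cost
        self.prev_cost = cost
        raw = self.kp * error + self.ki * self.integral + self.kd * derivative
        self.multiplier = max(0.0, raw)                     # multiplier stays non-negative
        return self.multiplier


pid = PIDLagrangian(kp=1.0, ki=0.5, kd=0.0, cost_limit=10.0)
first = pid.update(12.0)   # cost above the limit
second = pid.update(8.0)   # cost back under the limit
print(first, second)  # 3.0 0.0
```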

get_extra_state()[source]¶

Save the lagrangian optimizer’s parameters.

This function is called by policy.state_dict(), see https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.get_extra_state

set_extra_state(state)[source]¶

Load the lagrangian optimizer’s parameters.

This function is called from load_state_dict() to handle any extra state found within the state_dict.

safety_loss(values: List) → Tuple[tensor, dict][source]¶

Compute the safety loss based on Lagrangian and return the scaling factor.

Parameters:

values (list) – the cost values to be constrained. They will be multiplied with the Lagrangian multipliers.

Returns:

tuple[torch.tensor, dict] – the total safety loss and a dictionary of info (including the rescaling factor, lagrangian, safety loss etc.)
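
One way to read this is sketched below with plain floats: each cost value is weighted by its multiplier, and the rescaling factor 1 / (1 + sum of multipliers) keeps the combined objective on a comparable scale as the multipliers grow. How the factor is combined with the reward objective is an assumption here, and both function names are hypothetical.

```python
def safety_loss(cost_values, multipliers, rescaling=True):
    """Return (weighted cost loss, info dict with the rescaling factor)."""
    loss = sum(lam * v for lam, v in zip(multipliers, cost_values))
    factor = 1.0 / (1.0 + sum(multipliers)) if rescaling else 1.0
    return loss, {"rescaling": factor, "lagrangian": list(multipliers)}


def total_actor_loss(reward_loss, cost_values, multipliers):
    """Assumed combination: rescale the sum of reward loss and safety loss."""
    loss_safety, info = safety_loss(cost_values, multipliers)
    return info["rescaling"] * (reward_loss + loss_safety)


loss, info = safety_loss([2.0], [1.0])
total = total_actor_loss(-4.0, [2.0], [1.0])
print(loss, info["rescaling"], total)  # 2.0 0.5 -1.0
```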

CVPO¶

class fsrl.policy.CVPO(actor: ~torch.nn.modules.module.Module, critics: ~torch.nn.modules.module.Module | ~typing.List[~torch.nn.modules.module.Module], actor_optim: ~torch.optim.optimizer.Optimizer, critic_optim: ~torch.optim.optimizer.Optimizer, action_space: ~gymnasium.spaces.space.Space, dist_fn: ~typing.Type[~torch.distributions.distribution.Distribution], max_episode_steps: int, logger: ~fsrl.utils.logger.base_logger.BaseLogger | None = <fsrl.utils.logger.base_logger.BaseLogger object>, cost_limit: ~typing.List | float = inf, tau: float = 0.05, gamma: float = 0.99, n_step: int = 2, estep_iter_num: int = 1, estep_kl: float = 0.02, estep_dual_max: float = 20, estep_dual_lr: float = 0.02, sample_act_num: int = 16, mstep_iter_num: int = 1, mstep_kl_mu: float = 0.005, mstep_kl_std: float = 0.0005, mstep_dual_max: float = 0.5, mstep_dual_lr: float = 0.1, deterministic_eval: bool = True, action_scaling: bool = True, action_bound_method: str = 'clip', lr_scheduler: ~torch.optim.lr_scheduler.LambdaLR | None = None)[source]¶

Bases: BasePolicy

Implementation of the Constrained Variational Policy Optimization (CVPO).

For more details, please refer to https://arxiv.org/abs/2201.11927.

Parameters:

actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)
critics (Union[nn.Module, List[nn.Module]]) – the critic network(s). (s -> V(s))
actor_optim (torch.optim.Optimizer) – the optimizer for the actor network.
critic_optim (torch.optim.Optimizer) – the optimizer for the critic network(s).
action_space (gym.Space) – the action space of the environment.
dist_fn (Type[torch.distributions.Distribution]) – the probability distribution function for sampling actions.
max_episode_steps (int) – the maximum number of steps per episode for computing the step-wise qc threshold.
logger (Optional[BaseLogger]) – the logger instance for logging training information. (default=DummyLogger)
cost_limit (Union[List, float]) – the constraint limit(s) for the optimization. (default=np.inf)
tau (float) – target smoothing coefficient for soft update of target networks. (default=0.05)
gamma (float) – the discount factor for future rewards. (default=0.99)
n_step (int) – number of steps for multi-step learning. (default=2)
estep_iter_num (int) – the number of iterations for the E-step. (default=1)
estep_kl (float) – the KL divergence threshold for the E-step. (default=0.02)
estep_dual_max (float) – the maximum value for the dual variable in the E-step. (default=20)
estep_dual_lr (float) – the learning rate for the dual variable in the E-step. (default=0.02)
sample_act_num (int) – the number of actions to sample for the E-step. (default=16)
mstep_iter_num (int) – the number of iterations for the M-step. (default=1)
mstep_kl_mu (float) – the KL divergence threshold for the M-step (mean). (default=0.005)
mstep_kl_std (float) – the KL divergence threshold for the M-step (standard deviation). (default=0.0005)
mstep_dual_max (float) – the maximum value for the dual variable in the M-step. (default=0.5)
mstep_dual_lr (float) – the learning rate for the dual variable in the M-step. (default=0.1)
deterministic_eval (bool) – whether to use deterministic action selection during evaluation. (default=True)
action_scaling (bool) – whether to scale the actions according to the action space bounds. (default=True)
action_bound_method (str) – the method for handling actions that exceed the action space bounds (“clip” or other custom methods). (default=”clip”)
lr_scheduler (Optional[torch.optim.lr_scheduler.LambdaLR]) – learning rate scheduler for the optimizer.

See also

Please refer to BasePolicy for more detailed hyperparameter explanations and usage.

update_cost_limit(cost_limit: float)[source]¶

Update the cost limit threshold.

Parameters:

cost_limit (float) – new cost threshold.

pre_update_fn(**kwarg: Any) → Any[source]¶: Init the mstep optimizer and dual variables.

post_update_fn(**kwarg: Any) → Any[source]¶: Update the old actor network.

sync_weight() → None[source]¶: Soft-update the weight for the target network.

static gaussian_kl(mu_old: Tensor, std_old: Tensor, mu: Tensor, std: Tensor) → Tuple[Tensor, Tensor][source]¶

Decoupled KL between two multivariate Gaussians with diagonal covariance.

See https://arxiv.org/pdf/1812.02256.pdf Sec. 4.2.1 for details.

kl_mu = KL( pi(mu_old, std_old) || pi(mu, std_old) )
kl_std = KL( pi(mu_old, std_old) || pi(mu_old, std) )

Parameters:

mu_old – (B, n)
mu – (B, n)
std_old – (B, n)
std – (B, n)

Returns:

kl_mu, kl_std: scalar mean and covariance terms of the KL
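
The two terms have closed forms for diagonal Gaussians, which the sketch below evaluates on nested Python lists of shape (B, n) and averages over the batch. It is an illustrative plain-Python version under that averaging assumption, not the library's tensor implementation.

```python
import math


def gaussian_kl(mu_old, std_old, mu, std):
    """Decoupled KL terms (mean part, std part), averaged over the batch."""
    batch = len(mu_old)
    kl_mu = 0.0
    kl_std = 0.0
    for m0, s0, m1, s1 in zip(mu_old, std_old, mu, std):
        for a, b, c, d in zip(m0, s0, m1, s1):
            # Mean term: only the mean changes, the old std is kept.
            kl_mu += 0.5 * ((c - a) / b) ** 2
            # Std term: only the std changes, the old mean is kept.
            kl_std += 0.5 * ((b / d) ** 2 - 1.0 + 2.0 * math.log(d / b))
    return kl_mu / batch, kl_std / batch


kl_mu, kl_std = gaussian_kl(
    mu_old=[[0.0, 0.0]], std_old=[[1.0, 1.0]],
    mu=[[1.0, 0.0]], std=[[1.0, 2.0]],
)
print(kl_mu, kl_std)
```

Both terms are zero when the new and old distributions coincide, which makes them usable as separate trust-region constraints on the mean and the standard deviation.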

get_extra_state()[source]¶

Save the dual variables and their optimizers.

This function is called by policy.state_dict(), see https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.get_extra_state

set_extra_state(state)[source]¶

Load the dual variables and their optimizers.

This function is called from load_state_dict() to handle any extra state found within the state_dict.

DDPG-Lagrangian¶

class fsrl.policy.DDPGLagrangian(actor: ~torch.nn.modules.module.Module, critics: ~torch.nn.modules.module.Module | ~typing.List[~torch.nn.modules.module.Module], actor_optim: ~torch.optim.optimizer.Optimizer | None, critic_optim: ~torch.optim.optimizer.Optimizer | None, logger: ~fsrl.utils.logger.base_logger.BaseLogger = <fsrl.utils.logger.base_logger.BaseLogger object>, tau: float = 0.05, exploration_noise: ~tianshou.exploration.random.BaseNoise | None = <tianshou.exploration.random.GaussianNoise object>, n_step: int = 2, use_lagrangian: bool = True, lagrangian_pid: ~typing.Tuple = (0.05, 0.0005, 0.1), cost_limit: ~typing.List | float = inf, rescaling: bool = True, gamma: float = 0.99, reward_normalization: bool = False, deterministic_eval: bool = True, action_scaling: bool = True, action_bound_method: str = 'clip', observation_space: ~gymnasium.spaces.space.Space | None = None, action_space: ~gymnasium.spaces.space.Space | None = None, lr_scheduler: ~torch.optim.lr_scheduler.LambdaLR | None = None)[source]¶

Bases: LagrangianPolicy

The Deep Deterministic Policy Gradient (DDPG) with PID Lagrangian.

For more details, please refer to https://arxiv.org/abs/1509.02971 (DDPG) and https://arxiv.org/abs/2007.03964 (PID Lagrangian).

Parameters:

actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)
critics (Union[nn.Module, List[nn.Module]]) – the critic network(s). (s -> V(s))
actor_optim (Optional[torch.optim.Optimizer]) – the optimizer for the actor network. Default is None.
critic_optim (Optional[torch.optim.Optimizer]) – the optimizer for the critic network(s). Default is None.
logger (BaseLogger) – the logger instance for logging training information. Default is DummyLogger.
tau (float) – the soft update coefficient for updating target networks. Default is 0.05.
exploration_noise (Optional[BaseNoise]) – the noise instance for exploration. Default is GaussianNoise(sigma=0.1).
n_step (int) – the number of steps for multi-step bootstrap targets. Default is 2.
use_lagrangian (bool) – whether to use the Lagrangian constraint optimization. Default is True.
lagrangian_pid (List) – the PID coefficients for the Lagrangian constraint optimization. Default is [0.05, 0.0005, 0.1].
cost_limit (Union[List, float]) – the constraint limit(s) for the Lagrangian optimization. Default is np.inf.
rescaling (bool) – whether to rescale the Lagrangian multiplier. Default is True.
gamma (float) – the discount factor for future rewards. Default is 0.99.
reward_normalization (bool) – normalize rewards if True. Default is False.
deterministic_eval (bool) – whether to use deterministic action selection during evaluation. Default is True.
action_scaling (bool) – whether to scale the actions according to the action space bounds. Default is True.
action_bound_method (str) – the method for handling actions that exceed the action space bounds (“clip” or other custom methods). Default is “clip”.
observation_space (Optional[gym.Space]) – the observation space of the environment. Default is None.
action_space (Optional[gym.Space]) – the action space of the environment. Default is None.
lr_scheduler (Optional[torch.optim.lr_scheduler.LambdaLR]) – learning rate scheduler for the optimizer. Default is None.

See also

Please refer to BasePolicy and LagrangianPolicy for more detailed hyperparameter explanations and usage.

set_exp_noise(noise: BaseNoise | None) → None[source]¶: Set the exploration noise.

sync_weight() → None[source]¶: Soft-update the weight for the target network.

exploration_noise(act: ndarray | Batch, batch: Batch) → ndarray | Batch[source]¶

Modify the action from policy.forward with exploration noise.

Parameters:

act – a data batch or numpy.ndarray which is the action taken by policy.forward.
batch – the input batch for policy.forward, kept for advanced usage.

Returns:

Action in the same form as the input “act” but with added exploration noise.
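
As a rough picture of Gaussian exploration noise with sigma = 0.1 (the documented default noise for this class), the sketch below perturbs a list of action values using the standard library. The helper and its arguments are hypothetical and do not reproduce the BaseNoise interface.

```python
import random


def add_gaussian_noise(act, sigma=0.1, rng=None):
    """Return a copy of `act` with zero-mean Gaussian noise added per dimension."""
    if rng is None:
        rng = random.Random()
    return [a + rng.gauss(0.0, sigma) for a in act]


rng = random.Random(0)  # seeded only to make the example repeatable
noisy = add_gaussian_noise([0.0, 0.5, -0.5], sigma=0.1, rng=rng)
print(noisy)
```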

SAC-Lagrangian¶

class fsrl.policy.SACLagrangian(actor: ~torch.nn.modules.module.Module, critics: ~torch.nn.modules.module.Module | ~typing.List[~torch.nn.modules.module.Module], actor_optim: ~torch.optim.optimizer.Optimizer | None, critic_optim: ~torch.optim.optimizer.Optimizer | None, logger: ~fsrl.utils.logger.base_logger.BaseLogger = <fsrl.utils.logger.base_logger.BaseLogger object>, alpha: float | ~typing.Tuple[float, ~torch.Tensor, ~torch.optim.optimizer.Optimizer] = 0.005, tau: float = 0.05, exploration_noise: ~tianshou.exploration.random.BaseNoise | None = None, n_step: int = 2, use_lagrangian: bool = True, lagrangian_pid: ~typing.Tuple = (0.05, 0.0005, 0.1), cost_limit: ~typing.List | float = inf, rescaling: bool = True, gamma: float = 0.99, reward_normalization: bool = False, deterministic_eval: bool = True, action_scaling: bool = True, action_bound_method: str = 'clip', observation_space: ~gymnasium.spaces.space.Space | None = None, action_space: ~gymnasium.spaces.space.Space | None = None, lr_scheduler: ~torch.optim.lr_scheduler.LambdaLR | None = None)[source]¶

Bases: LagrangianPolicy

Implementation of the Soft Actor-Critic (SAC) with PID Lagrangian.

For more details, please refer to https://arxiv.org/abs/1801.01290 (SAC) and https://arxiv.org/abs/2007.03964 (PID Lagrangian).

Parameters:

actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)
critics (Union[nn.Module, List[nn.Module]]) – the critic network(s). (s -> V(s))
actor_optim (Optional[torch.optim.Optimizer]) – the optimizer for the actor network.
critic_optim (Optional[torch.optim.Optimizer]) – the optimizer for the critic network(s).
logger (BaseLogger) – the logger instance for logging training information. (default: DummyLogger)
alpha (Union[float, Tuple[float, torch.Tensor, torch.optim.Optimizer]]) – initial temperature for entropy regularization. If a tuple (target_entropy, log_alpha, alpha_optim) is provided, then alpha is automatically tuned. (default: 0.005)
tau (float) – target smoothing coefficient for soft update of target networks. (default: 0.05)
exploration_noise (Optional[BaseNoise]) – the exploration noise. (default: None)
n_step (int) – number of steps for multi-step learning. (default: 2)
use_lagrangian (bool) – whether to use the Lagrangian constraint optimization. (default: True)
lagrangian_pid (List) – the PID coefficients for the Lagrangian constraint optimization. (default: [0.05, 0.0005, 0.1])
cost_limit (Union[List, float]) – the constraint limit(s) for the Lagrangian optimization. (default: np.inf)
rescaling (bool) – whether to use the rescaling trick for the Lagrangian multiplier, see Alg. 1 in http://proceedings.mlr.press/v119/stooke20a/stooke20a.pdf
gamma (float) – the discount factor for future rewards. (default: 0.99)
reward_normalization (bool) – normalize rewards if True. (default: False)
deterministic_eval (bool) – whether to use deterministic action selection during evaluation. (default: True)
action_scaling (bool) – whether to scale the actions according to the action space bounds. (default: True)
action_bound_method (str) – the method for handling actions that exceed the action space bounds (“clip” or other custom methods). (default: “clip”)
observation_space (Optional[gym.Space]) – the observation space of the environment. (default: None)
action_space (Optional[gym.Space]) – the action space of the environment. (default: None)
lr_scheduler (Optional[torch.optim.lr_scheduler.LambdaLR]) – learning rate scheduler for the optimizer. (default: None)

See also

Please refer to BasePolicy and LagrangianPolicy for more detailed hyperparameter explanations and usage.

set_exp_noise(noise: BaseNoise | None) → None[source]¶: Set the exploration noise.

sync_weight() → None[source]¶: Soft-update the weight for the target network.

exploration_noise(act: ndarray | Batch, batch: Batch) → ndarray | Batch[source]¶

Modify the action from policy.forward with exploration noise.

Parameters:

act – a data batch or numpy.ndarray which is the action taken by policy.forward.
batch – the input batch for policy.forward, kept for advanced usage.

Returns:

Action in the same form as the input “act” but with added exploration noise.

TRPO-Lagrangian¶

class fsrl.policy.TRPOLagrangian(actor: ~torch.nn.modules.module.Module, critics: ~torch.nn.modules.module.Module | ~typing.List[~torch.nn.modules.module.Module], optim: ~torch.optim.optimizer.Optimizer, dist_fn: ~typing.Type[~torch.distributions.distribution.Distribution], logger: ~fsrl.utils.logger.base_logger.BaseLogger = <fsrl.utils.logger.base_logger.BaseLogger object>, target_kl: float = 0.001, backtrack_coeff: float = 0.8, max_backtracks: int = 10, optim_critic_iters: int = 5, gae_lambda: float = 0.95, advantage_normalization: bool = True, use_lagrangian: bool = True, lagrangian_pid: ~typing.Tuple = (0.05, 0.0005, 0.1), cost_limit: ~typing.List | float = inf, rescaling: bool = True, gamma: float = 0.99, max_batchsize: int = 99999, reward_normalization: bool = False, deterministic_eval: bool = True, action_scaling: bool = True, action_bound_method: str = 'clip', observation_space: ~gymnasium.spaces.space.Space | None = None, action_space: ~gymnasium.spaces.space.Space | None = None, lr_scheduler: ~torch.optim.lr_scheduler.LambdaLR | None = None)[source]¶

Bases: LagrangianPolicy

Implementation of the Trust Region Policy Optimization (TRPO) with PID Lagrangian.

For more details, please refer to https://arxiv.org/abs/1502.05477 (TRPO) and https://arxiv.org/abs/2007.03964 (PID Lagrangian).

Parameters:

actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)
critics (Union[nn.Module, List[nn.Module]]) – the critic network(s). (s -> V(s))
optim (torch.optim.Optimizer) – the optimizer for actor and critic network.
dist_fn (Type[torch.distributions.Distribution]) – the distribution function for the policy.
logger (BaseLogger) – the logger instance for logging training information.
target_kl (float) – the target KL divergence for the line search (default: 0.001).
backtrack_coeff (float) – the coefficient for backtracking during the line search (default: 0.8).
max_backtracks (int) – the maximum number of backtracks allowed during the line search (default: 10).
optim_critic_iters (int) – the number of optimization iterations for the critic network (default: 5).
gae_lambda (float) – the GAE lambda value (default: 0.95).
advantage_normalization (bool) – whether to normalize advantage (default: True).
use_lagrangian (bool) – whether to use the Lagrangian constraint optimization (default: True).
lagrangian_pid (List) – the PID coefficients for the Lagrangian constraint optimization (default: [0.05, 0.0005, 0.1]).
cost_limit (Union[List, float]) – the constraint limit(s) for the Lagrangian optimization (default: np.inf).
rescaling (bool) – whether to use the rescaling trick for the Lagrangian multiplier; see Alg. 1 in http://proceedings.mlr.press/v119/stooke20a/stooke20a.pdf (default: True).
gamma (float) – the discount factor for future rewards (default: 0.99).
max_batchsize (int) – the maximum batch size when computing GAE; it depends on the available memory and the memory cost of the model, and should be as large as possible within the memory constraint (default: 99999).
reward_normalization (bool) – whether to normalize rewards (default: False).
deterministic_eval (bool) – whether to use deterministic action selection during evaluation (default: True).
action_scaling (bool) – whether to scale the actions according to the action space bounds (default: True).
action_bound_method (str) – the method for handling actions that exceed the action space bounds (“clip” or other custom methods) (default: “clip”).
observation_space (Optional[gym.Space]) – the observation space of the environment (default: None).
action_space (Optional[gym.Space]) – the action space of the environment (default: None).
lr_scheduler (Optional[torch.optim.lr_scheduler.LambdaLR]) – learning rate scheduler for the optimizer (default: None).

See also

Please refer to BasePolicy and LagrangianPolicy for more detailed hyperparameter explanations and usage.
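The way lagrangian_pid, cost_limit, and rescaling interact can be sketched in plain Python. This is a minimal illustration of the PID multiplier update described in the PID Lagrangian paper, not FSRL's internal code. The names PIDMultiplier and rescaled_advantage are hypothetical, the tuple is assumed to be ordered (Kp, Ki, Kd), and the rescaling is assumed to divide the combined advantage by 1 + multiplier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PIDMultiplier:
    """Hypothetical helper (not an FSRL class): PID-controlled Lagrange multiplier."""

    kp: float = 0.05    # proportional gain (assumed first entry of lagrangian_pid)
    ki: float = 0.0005  # integral gain (assumed second entry)
    kd: float = 0.1     # derivative gain (assumed third entry)
    integral: float = 0.0
    prev_cost: Optional[float] = None
    value: float = 0.0

    def step(self, episode_cost: float, cost_limit: float) -> float:
        # Positive error means the constraint is currently violated.
        error = episode_cost - cost_limit
        # Accumulate the error, projected so the integral never goes negative.
        self.integral = max(0.0, self.integral + error)
        # The derivative term reacts only to rising cost; it is zero on the first call.
        if self.prev_cost is None:
            derivative = 0.0
        else:
            derivative = max(0.0, episode_cost - self.prev_cost)
        self.prev_cost = episode_cost
        # The multiplier itself is also projected to be non-negative.
        self.value = max(
            0.0, self.kp * error + self.ki * self.integral + self.kd * derivative
        )
        return self.value


def rescaled_advantage(adv_reward: float, adv_cost: float, multiplier: float) -> float:
    """Penalize the reward advantage by cost, then divide by (1 + multiplier)."""
    return (adv_reward - multiplier * adv_cost) / (1.0 + multiplier)


pid = PIDMultiplier()
lam_safe = pid.step(episode_cost=5.0, cost_limit=10.0)     # cost below the limit
lam_unsafe = pid.step(episode_cost=30.0, cost_limit=10.0)  # cost above the limit
```

In this sketch the multiplier stays at zero while the episode cost is below cost_limit and becomes positive once the limit is exceeded. With the default cost_limit of inf the error is always negative, so the multiplier never leaves zero.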

PPO-Lagrangian

class fsrl.policy.PPOLagrangian(actor: torch.nn.modules.module.Module, critics: torch.nn.modules.module.Module | typing.List[torch.nn.modules.module.Module], optim: torch.optim.optimizer.Optimizer, dist_fn: typing.Type[torch.distributions.distribution.Distribution], logger: fsrl.utils.logger.base_logger.BaseLogger = <fsrl.utils.logger.base_logger.BaseLogger object>, target_kl: float = 0.02, vf_coef: float = 0.25, max_grad_norm: float | None = None, gae_lambda: float = 0.95, eps_clip: float = 0.2, dual_clip: float | None = None, value_clip: bool = False, advantage_normalization: bool = True, recompute_advantage: bool = False, use_lagrangian: bool = True, lagrangian_pid: typing.Tuple = (0.05, 0.0005, 0.1), cost_limit: typing.List | float = inf, rescaling: bool = True, gamma: float = 0.99, max_batchsize: int = 99999, reward_normalization: bool = False, deterministic_eval: bool = True, action_scaling: bool = True, action_bound_method: str = 'clip', observation_space: gymnasium.spaces.space.Space | None = None, action_space: gymnasium.spaces.space.Space | None = None, lr_scheduler: torch.optim.lr_scheduler.LambdaLR | None = None)

Bases: LagrangianPolicy

Implementation of the Proximal Policy Optimization (PPO) with PID Lagrangian.

For more details, please refer to https://arxiv.org/abs/1707.06347 (PPO) and https://arxiv.org/abs/2007.03964 (PID Lagrangian).

Parameters:

actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)
critics (Union[nn.Module, List[nn.Module]]) – the critic network(s). (s -> V(s))
optim (torch.optim.Optimizer) – the optimizer for the actor and critic networks.
dist_fn (Type[torch.distributions.Distribution]) – the distribution class for action sampling.
logger (BaseLogger) – the logger instance for logging training information. Default is BaseLogger().
target_kl (float) – the target KL divergence for the PPO update. Default is 0.02.
vf_coef (float) – the value function coefficient for the loss function. Default is 0.25.
max_grad_norm (Optional[float]) – the maximum gradient norm for gradient clipping (None for no clipping). Default is None.
gae_lambda (float) – the Generalized Advantage Estimation (GAE) parameter. Default is 0.95.
eps_clip (float) – the PPO clipping parameter for the policy update. Default is 0.2.
dual_clip (Optional[float]) – the PPO dual clipping parameter (None for no dual clipping). Default is None.
value_clip (bool) – whether to clip the value function update. Default is False.
advantage_normalization (bool) – whether to normalize the advantages. Default is True.
recompute_advantage (bool) – whether to recompute the advantages during the optimization process. Default is False.
use_lagrangian (bool) – whether to use the Lagrangian constraint optimization. Default is True.
lagrangian_pid (Tuple) – the PID coefficients for the Lagrangian constraint optimization. Default is (0.05, 0.0005, 0.1).
cost_limit (Union[List, float]) – the constraint limit(s) for the Lagrangian optimization. Default is np.inf.
rescaling (bool) – whether to use the rescaling trick for the Lagrangian multiplier; see Alg. 1 in http://proceedings.mlr.press/v119/stooke20a/stooke20a.pdf. Default is True.
gamma (float) – the discount factor for future rewards. Default is 0.99.
max_batchsize (int) – the maximum batch size when computing GAE; it depends on the available memory and the memory cost of the model, and should be as large as possible within the memory constraint. Default is 99999.
reward_normalization (bool) – whether to normalize the rewards. Default is False.
deterministic_eval (bool) – whether to use deterministic actions during evaluation. Default is True.
action_scaling (bool) – whether to scale actions based on the action space. Default is True.
action_bound_method (str) – the method used to handle out-of-bound actions. Default is “clip”.
observation_space (Optional[gym.Space]) – the observation space of the environment. Default is None.
action_space (Optional[gym.Space]) – the action space of the environment. Default is None.
lr_scheduler (Optional[torch.optim.lr_scheduler.LambdaLR]) – the learning rate scheduler. Default is None.

See also

Please refer to BasePolicy and LagrangianPolicy for more detailed hyperparameter explanations and usage.
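The roles of eps_clip and dual_clip can be illustrated per sample. The following is a minimal sketch of the standard PPO clipped surrogate with the optional dual-clip lower bound for negative advantages, written in plain Python for readability. The function name clipped_surrogate is hypothetical and this is not FSRL's implementation, which operates on batched tensors and also mixes in the cost advantage.

```python
from typing import Optional


def clipped_surrogate(
    ratio: float,
    adv: float,
    eps_clip: float = 0.2,
    dual_clip: Optional[float] = None,
) -> float:
    """Per-sample PPO objective (to be maximized); hypothetical helper."""
    # Clip the probability ratio to [1 - eps_clip, 1 + eps_clip].
    clipped_ratio = min(max(ratio, 1.0 - eps_clip), 1.0 + eps_clip)
    # Take the pessimistic (smaller) of the unclipped and clipped terms.
    objective = min(ratio * adv, clipped_ratio * adv)
    # Dual clip: for negative advantages, bound the objective from below
    # by dual_clip * adv so a very large ratio cannot dominate the update.
    if dual_clip is not None and adv < 0:
        objective = max(objective, dual_clip * adv)
    return objective


gain_capped = clipped_surrogate(ratio=1.5, adv=1.0)                     # clipped at 1 + eps_clip
loss_unbounded = clipped_surrogate(ratio=10.0, adv=-1.0)                # no dual clip
loss_bounded = clipped_surrogate(ratio=10.0, adv=-1.0, dual_clip=3.0)   # bounded below
```

With dual_clip left at None, a sample with a large ratio and a negative advantage contributes an unbounded penalty; setting dual_clip to a value greater than 1 caps that contribution.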

FOCOPS

class fsrl.policy.FOCOPS(actor: torch.nn.modules.module.Module, critics: torch.nn.modules.module.Module | typing.List[torch.nn.modules.module.Module], actor_optim: torch.optim.optimizer.Optimizer, critic_optim: torch.optim.optimizer.Optimizer, dist_fn: typing.Type[torch.distributions.distribution.Distribution], logger: fsrl.utils.logger.base_logger.BaseLogger = <fsrl.utils.logger.base_logger.BaseLogger object>, cost_limit: float = 10, nu: float | typing.Tuple[float, float, torch.Tensor] = 0.01, l2_reg: float = 0.001, delta: float = 0.02, eta: float = 0.02, tem_lambda: float = 0.95, gae_lambda: float = 0.95, max_grad_norm: float | None = 0.5, advantage_normalization: bool = True, recompute_advantage: bool = False, gamma: float = 0.99, max_batchsize: int = 99999, reward_normalization: bool = False, deterministic_eval: bool = True, action_scaling: bool = True, action_bound_method: str = 'clip', observation_space: gymnasium.spaces.space.Space | None = None, action_space: gymnasium.spaces.space.Space | None = None, lr_scheduler: torch.optim.lr_scheduler.LambdaLR | None = None)

Bases: BasePolicy

Implementation of the First Order Constrained Optimization in Policy Space (FOCOPS).

For more details, please refer to https://arxiv.org/pdf/2002.06506.pdf.

Parameters:

actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)
critics (Union[nn.Module, List[nn.Module]]) – the critic network(s). (s -> V(s))
actor_optim (torch.optim.Optimizer) – the optimizer for the actor network.
critic_optim (torch.optim.Optimizer) – the optimizer for the critic network(s).
dist_fn (Type[torch.distributions.Distribution]) – the probability distribution function for sampling actions.
logger (BaseLogger) – the logger instance for logging training information.
cost_limit (float) – the constraint limit for the optimization. Default value is 10.
nu (Union[float, Tuple[float, float, torch.Tensor]]) – cost coefficient. Default value is 0.01.
l2_reg (float) – L2 regularization rate. Default value is 1e-3.
delta (float) – early stop KL bound. Default value is 0.02.
eta (float) – KL bound for indicator function. Default value is 0.02.
tem_lambda (float) – inverse temperature lambda. Default value is 0.95.
gae_lambda (float) – GAE (Generalized Advantage Estimation) lambda for advantage computation. Default value is 0.95.
max_grad_norm (Optional[float]) – maximum gradient norm for gradient clipping, if specified. Default value is 0.5.
advantage_normalization (bool) – normalize advantage if True. Default value is True.
recompute_advantage (bool) – recompute advantage using the updated value function. Default value is False.
gamma (float) – the discount factor for future rewards. Default value is 0.99.
max_batchsize (int) – maximum batch size for the optimization. Default value is 99999.
reward_normalization (bool) – normalize the rewards if True. Default value is False.
deterministic_eval (bool) – whether to use deterministic action selection during evaluation. Default value is True.
action_scaling (bool) – whether to scale the actions according to the action space bounds. Default value is True.
action_bound_method (str) – the method for handling actions that exceed the action space bounds (“clip” or other custom methods). Default value is “clip”.
observation_space (Optional[gym.Space]) – the observation space of the environment. Default value is None.
action_space (Optional[gym.Space]) – the action space of the environment. Default value is None.
lr_scheduler (Optional[torch.optim.lr_scheduler.LambdaLR]) – learning rate scheduler for the optimizer. Default value is None.

See also

Please refer to BasePolicy for more detailed hyperparameter explanations and usage.
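To make the roles of nu, tem_lambda, eta, and cost_limit more concrete, here is a minimal per-sample sketch of the first-order update described in the FOCOPS paper. It is an illustration under assumptions, not FSRL's code: the function names are hypothetical, eta is used as the indicator bound as the parameter description above suggests, and the step size lr and upper bound nu_max in the nu update are illustrative values that do not correspond to documented FSRL parameters.

```python
def focops_sample_loss(
    kl: float,
    ratio: float,
    adv: float,
    cost_adv: float,
    nu: float = 0.01,
    tem_lambda: float = 0.95,
    eta: float = 0.02,
) -> float:
    """Per-sample FOCOPS policy loss (to be minimized); hypothetical helper."""
    # Indicator function: samples whose KL exceeds the bound contribute nothing.
    if kl > eta:
        return 0.0
    # KL penalty minus the cost-penalized advantage, weighted by 1 / tem_lambda.
    return kl - (1.0 / tem_lambda) * ratio * (adv - nu * cost_adv)


def update_nu(
    nu: float,
    episode_cost: float,
    cost_limit: float,
    lr: float = 0.01,      # assumed step size, for illustration only
    nu_max: float = 2.0,   # assumed upper bound, for illustration only
) -> float:
    """Projected gradient step on the cost coefficient nu; hypothetical helper."""
    nu = nu - lr * (cost_limit - episode_cost)
    # Project back onto the interval [0, nu_max].
    return min(max(nu, 0.0), nu_max)


masked = focops_sample_loss(kl=0.05, ratio=1.0, adv=1.0, cost_adv=0.0)
nu_up = update_nu(nu=0.01, episode_cost=20.0, cost_limit=10.0)   # cost above limit
nu_down = update_nu(nu=0.01, episode_cost=0.0, cost_limit=10.0)  # cost below limit
```

In this sketch nu grows when the episode cost exceeds cost_limit, which increases the weight of the cost advantage in the policy loss, and it is clipped at zero when the policy is comfortably within the limit.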

pre_update_fn(stats_train: Dict, **kwarg) → Any

Pre-process the policy or data before updating the policy.

This function is called after each data collection in trainer() and could be used to update the Lagrangian multiplier.

update_cost_limit(cost_limit: float) → None

Update the cost limit threshold.

Parameters: cost_limit (float) – the new cost threshold.

nu_loss(batch: Batch)

CPO

class fsrl.policy.CPO(actor: torch.nn.modules.module.Module, critics: torch.nn.modules.module.Module | typing.List[torch.nn.modules.module.Module], optim: torch.optim.optimizer.Optimizer, dist_fn: typing.Type[torch.distributions.distribution.Distribution], logger: fsrl.utils.logger.base_logger.BaseLogger = <fsrl.utils.logger.base_logger.BaseLogger object>, target_kl: float = 0.01, backtrack_coeff: float = 0.8, damping_coeff: float = 0.1, max_backtracks: int = 10, optim_critic_iters: int = 20, l2_reg: float = 0.001, gae_lambda: float = 0.95, advantage_normalization: bool = True, cost_limit: typing.List | float = inf, gamma: float = 0.99, max_batchsize: int = 99999, reward_normalization: bool = False, deterministic_eval: bool = True, action_scaling: bool = True, action_bound_method: str = 'clip', observation_space: gymnasium.spaces.space.Space | None = None, action_space: gymnasium.spaces.space.Space | None = None, lr_scheduler: torch.optim.lr_scheduler.LambdaLR | None = None)

Bases: BasePolicy

Implementation of the Constrained Policy Optimization (CPO).

For more details, please refer to https://arxiv.org/abs/1705.10528.

Parameters:

actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)
critics (Union[nn.Module, List[nn.Module]]) – the critic network(s). (s -> V(s))
optim (torch.optim.Optimizer) – the optimizer for the actor and critic networks.
dist_fn (Type[torch.distributions.Distribution]) – the distribution function for the policy.
logger (BaseLogger) – the logger instance for logging training information.
target_kl (float) – the target KL divergence for the line search. (default: 0.01)
backtrack_coeff (float) – the coefficient for backtracking during the line search. (default: 0.8)
damping_coeff (float) – the damping coefficient for the Fisher matrix. (default: 0.1)
max_backtracks (int) – the maximum number of backtracks allowed during the line search. (default: 10)
optim_critic_iters (int) – the number of optimization iterations for the critic network. (default: 20)
l2_reg (float) – the L2 regularization coefficient for the critic network. (default: 0.001)
gae_lambda (float) – the GAE lambda value. (default: 0.95)
advantage_normalization (bool) – normalize advantage if True. (default: True)
cost_limit (Union[List, float]) – the constraint limit(s) for the constrained optimization. (default: np.inf)
gamma (float) – the discount factor for future rewards. (default: 0.99)
max_batchsize (int) – the maximum batch size for updating the policy. (default: 99999)
reward_normalization (bool) – normalize rewards if True. (default: False)
deterministic_eval (bool) – whether to use deterministic action selection during evaluation. (default: True)
action_scaling (bool) – whether to scale the actions according to the action space bounds. (default: True)
action_bound_method (str) – the method for handling actions that exceed the action space bounds (“clip” or other custom methods). (default: “clip”)
observation_space (Optional[gym.Space]) – the observation space of the environment. (default: None)
action_space (Optional[gym.Space]) – the action space of the environment. (default: None)
lr_scheduler (Optional[torch.optim.lr_scheduler.LambdaLR]) – learning rate scheduler for the optimizer. (default: None)

See also

Please refer to BasePolicy for more detailed hyperparameter explanations and usage.
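The line-search parameters target_kl, backtrack_coeff, and max_backtracks follow the usual trust-region pattern: shrink a proposed step geometrically until it stays inside the KL bound and improves the surrogate objective. The sketch below shows that pattern on a one-dimensional toy problem. The function backtracking_line_search and the toy surrogate and KL functions are hypothetical, and the sketch omits the cost-constraint check that CPO's line search additionally performs.

```python
from typing import Callable


def backtracking_line_search(
    theta: float,
    full_step: float,
    surrogate: Callable[[float], float],
    kl: Callable[[float, float], float],
    target_kl: float = 0.01,
    backtrack_coeff: float = 0.8,
    max_backtracks: int = 10,
) -> float:
    """Return the first shrunken step that satisfies both conditions; hypothetical helper."""
    baseline = surrogate(theta)
    for i in range(max_backtracks):
        # Shrink the step by backtrack_coeff ** i (i = 0 tries the full step).
        candidate = theta + (backtrack_coeff ** i) * full_step
        within_trust_region = kl(theta, candidate) <= target_kl
        improves = surrogate(candidate) > baseline
        if within_trust_region and improves:
            return candidate
    # No acceptable step was found; keep the old parameters.
    return theta


def toy_surrogate(x: float) -> float:
    # Toy objective with its maximum at x = 1.
    return -((x - 1.0) ** 2)


def toy_kl(old: float, new: float) -> float:
    # KL divergence between unit-variance Gaussians with means old and new.
    return 0.5 * (new - old) ** 2


accepted = backtracking_line_search(0.0, 1.0, toy_surrogate, toy_kl)
rejected = backtracking_line_search(0.0, 1.0, toy_surrogate, toy_kl, max_backtracks=5)
```

In this toy setting the default target_kl of 0.01 is only satisfied after nine shrink steps, so limiting max_backtracks to 5 leaves the parameters unchanged. A smaller backtrack_coeff reaches the trust region in fewer iterations at the cost of coarser step sizes.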

pre_update_fn(stats_train: Dict, **kwarg) → Any

Pre-process the policy or data before updating the policy.

This function is called after each data collection in trainer() and could be used to update the Lagrangian multiplier.

update_cost_limit(cost_limit: float) → None

Update the cost limit threshold.

Parameters: cost_limit (float) – the new cost threshold.