fsrl.agent¶

The default MLP agent package.

class fsrl.agent.BaseAgent(*args, **kwargs)[source]¶

Bases: ABC

The base class for a default agent.

An agent class should have the following parts:

__init__(): initialize the agent, including the policy, networks, optimizers, and so on;
learn(): start training given the learning parameters;
evaluate(): evaluate the agent over multiple episodes;
state_dict: the agent state dictionary that can be saved as checkpoints.

Example of usage:

# initialize the CVPO agent
agent = CVPOAgent(env, other_algo_params)
# train multiple epochs
agent.learn(training_envs, other_training_params)

# test after the training is finished
agent.evaluate(testing_envs)

# test with agent's state_dict
agent.evaluate(testing_envs, agent.state_dict)

All of the agent classes must inherit BaseAgent.
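The parts listed above can be sketched without the library as a plain abstract base class. This is an illustrative stand-in, not FSRL's implementation: `SketchBaseAgent`, `ConstantAgent`, and the per-episode dictionary format are hypothetical names chosen for this example, and averaging the statistics over episodes is an assumption.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple


class SketchBaseAgent(ABC):
    """Hypothetical stand-in mirroring the parts listed for BaseAgent."""

    name = "SketchBaseAgent"

    def __init__(self) -> None:
        # A real agent would build its policy, networks, and optimizers here.
        self._params: Dict[str, float] = {}

    @abstractmethod
    def learn(self, *args, **kwargs) -> None:
        """Train the policy; every concrete agent must implement this."""

    def evaluate(self, episodes: List[Dict[str, float]]) -> Tuple[float, float, float]:
        # Assumption: report the mean reward, length, and cost over episodes.
        n = len(episodes)
        reward = sum(ep["reward"] for ep in episodes) / n
        length = sum(ep["length"] for ep in episodes) / n
        cost = sum(ep["cost"] for ep in episodes) / n
        return reward, length, cost

    @property
    def state_dict(self) -> Dict[str, float]:
        # A copy of the parameters, suitable for saving as a checkpoint.
        return dict(self._params)


class ConstantAgent(SketchBaseAgent):
    name = "ConstantAgent"

    def learn(self, *args, **kwargs) -> None:
        # Toy "training": record how many epochs were requested.
        self._params["epochs_trained"] = float(kwargs.get("epoch", 0))


agent = ConstantAgent()
agent.learn(epoch=3)
stats = agent.evaluate([
    {"reward": 10.0, "length": 100, "cost": 2.0},
    {"reward": 20.0, "length": 200, "cost": 4.0},
])
```

Because `learn()` is abstract, instantiating `SketchBaseAgent` directly raises `TypeError`, which is the same mechanism that forces concrete agents to supply their own training loop.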

name = 'BaseAgent'¶

abstract learn(*args, **kwargs) → None[source]¶: Train the policy on a set of training environments.

evaluate(test_envs: Env | BaseVectorEnv, state_dict: dict | None = None, eval_episodes: int = 10, render: bool = False, train_mode: bool = False) → Tuple[float, float, float][source]¶

Evaluate the policy on a set of test environments.

Parameters:

test_envs (Union[gym.Env, BaseVectorEnv]) – A single environment or a vectorized environment to evaluate the policy on.
state_dict (Optional[dict]) – An optional dictionary containing the state parameters of the agent to be evaluated, defaults to None
eval_episodes (int) – The number of episodes to evaluate, defaults to 10
render (bool) – Whether to render the environment during evaluation, defaults to False
train_mode (bool) – Whether to set the policy to training mode during evaluation, defaults to False

Return Tuple:

rewards, episode lengths, and constraint costs obtained during evaluation.

property state_dict¶: Return the policy’s state_dict.

class fsrl.agent.OffpolicyAgent[source]¶

Bases: BaseAgent

The base class for an off-policy agent.

The learn() function is customized to work with the off-policy trainer. See BaseAgent for more details.

name = 'OffpolicyAgent'¶

learn(train_envs: Env | BaseVectorEnv, test_envs: Env | BaseVectorEnv | None = None, epoch: int = 300, episode_per_collect: int = 5, step_per_epoch: int = 3000, update_per_step: float = 0.1, buffer_size: int = 100000, testing_num: int = 2, batch_size: int = 256, reward_threshold: float = 450, save_interval: int = 4, resume: bool = False, save_ckpt: bool = True, verbose: bool = True, show_progress: bool = True) → None[source]¶

Train the policy on a set of training environments.

Parameters:

train_envs (Union[gym.Env, BaseVectorEnv]) – A single environment or a vectorized environment to train the policy on.
test_envs (Optional[Union[gym.Env, BaseVectorEnv]]) – A single environment or a vectorized environment to evaluate the policy on, defaults to None.
epoch (int) – The number of training epochs, defaults to 300.
episode_per_collect (int) – The number of episodes to collect before each policy update, defaults to 5.
step_per_epoch (int) – The number of environment steps per epoch, defaults to 3000.
update_per_step (float) – The ratio of policy updates to environment steps, defaults to 0.1.
buffer_size (int) – The maximum size of the replay buffer, defaults to 100000.
testing_num (int) – The number of episodes to use for evaluation, defaults to 2.
batch_size (int) – The batch size for each policy update, defaults to 256.
reward_threshold (float) – The reward threshold for early stopping, defaults to 450.
save_interval (int) – The interval (in epochs) for saving the policy model, defaults to 4.
resume (bool) – Whether to resume training from the last checkpoint, defaults to False.
save_ckpt (bool) – Whether to save the policy model, defaults to True.
verbose (bool) – Whether to print progress information during training, defaults to True.
show_progress (bool) – Whether to show the tqdm training progress bar, defaults to True
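To make the update_per_step ratio concrete, the arithmetic can be sketched as below. This is a minimal illustration, not the trainer's code: the helper name `updates_for_collect`, the rounding rule, and the 200-step episode length are assumptions.

```python
def updates_for_collect(n_collected_steps: int, update_per_step: float) -> int:
    """Number of gradient updates to run after collecting some steps.

    Assumption: the product is rounded to the nearest integer.
    """
    return round(update_per_step * n_collected_steps)


# With the documented defaults, one epoch of 3000 steps at ratio 0.1:
updates_per_epoch = updates_for_collect(3000, 0.1)
# A single collect of 5 episodes lasting 200 steps each (hypothetical length):
updates_per_collect = updates_for_collect(5 * 200, 0.1)
```

Under these assumptions the defaults give roughly 300 policy updates per epoch, each drawing batch_size transitions from the replay buffer.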

policy: BasePolicy¶

class fsrl.agent.OnpolicyAgent[source]¶

Bases: BaseAgent

The base class for an on-policy agent.

The learn() function is customized to work with the on-policy trainer. See BaseAgent for more details.

name = 'OnpolicyAgent'¶

learn(train_envs: Env | BaseVectorEnv, test_envs: Env | BaseVectorEnv | None = None, epoch: int = 300, episode_per_collect: int = 20, step_per_epoch: int = 10000, repeat_per_collect: int = 4, buffer_size: int = 100000, testing_num: int = 2, batch_size: int = 512, reward_threshold: float = 450, save_interval: int = 4, resume: bool = False, save_ckpt: bool = True, verbose: bool = True, show_progress: bool = True) → None[source]¶

Train the policy on a set of training environments.

Parameters:

train_envs (Union[gym.Env, BaseVectorEnv]) – A single environment or a vectorized environment to train the policy on.
test_envs (Union[gym.Env, BaseVectorEnv]) – A single environment or a vectorized environment to evaluate the policy on, defaults to None.
epoch (int) – The number of training epochs, defaults to 300
episode_per_collect (int) – The number of episodes collected per data collection, defaults to 20
step_per_epoch (int) – The number of steps per training epoch, defaults to 10000
repeat_per_collect (int) – The number of repeats of policy update for one episode collection, defaults to 4
buffer_size (int) – The size of the replay buffer, defaults to 100000
testing_num (int) – The number of episodes to evaluate during testing, defaults to 2
batch_size (int) – The batch size for training, defaults to 99999 for TRPOLagAgent and CPOAgent, and to 512 for the others
reward_threshold (float) – The threshold for stopping training when the mean reward exceeds it, defaults to 450
save_interval (int) – The number of epochs to save the policy, defaults to 4
resume (bool) – Whether to resume training from the saved checkpoint, defaults to False
save_ckpt (bool) – Whether to save the policy model, defaults to True
verbose (bool) – Whether to print the training information, defaults to True
show_progress (bool) – Whether to show the tqdm training progress bar, defaults to True
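The interaction between repeat_per_collect and batch_size can be sketched with a little arithmetic. This is an illustration under stated assumptions rather than the trainer's code: the helper `gradient_steps_per_collect` is hypothetical, it assumes each repeat sweeps all collected samples once in minibatches, and the 300-step episode length is made up.

```python
import math


def gradient_steps_per_collect(n_samples: int, batch_size: int,
                               repeat_per_collect: int) -> int:
    """Minibatch updates after one collect, assuming one full sweep per repeat."""
    minibatches_per_pass = math.ceil(n_samples / batch_size)
    return repeat_per_collect * minibatches_per_pass


# Hypothetical collect of 20 episodes x 300 steps = 6000 samples.
default_steps = gradient_steps_per_collect(6000, batch_size=512, repeat_per_collect=4)
# With batch_size=99999 each pass uses the whole collect as a single batch.
full_batch_steps = gradient_steps_per_collect(6000, batch_size=99999, repeat_per_collect=4)
```

A very large batch_size therefore amounts to full-batch updates, which fits trust-region methods that compute one search direction per collect.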

policy: BasePolicy¶

class fsrl.agent.CVPOAgent(env: ~gymnasium.core.Env, logger: ~fsrl.utils.logger.base_logger.BaseLogger = <fsrl.utils.logger.base_logger.BaseLogger object>, cost_limit: float = 10, device: str = 'cpu', thread: int = 4, seed: int = 10, estep_iter_num: int = 1, estep_kl: float = 0.02, estep_dual_max: float = 20, estep_dual_lr: float = 0.02, sample_act_num: int = 16, mstep_iter_num: int = 1, mstep_kl_mu: float = 0.005, mstep_kl_std: float = 0.0005, mstep_dual_max: float = 0.5, mstep_dual_lr: float = 0.1, actor_lr: float = 0.0005, critic_lr: float = 0.001, gamma: float = 0.98, n_step: int = 2, tau: float = 0.05, hidden_sizes: ~typing.Tuple[int, ...] = (128, 128), double_critic: bool = False, conditioned_sigma: bool = True, unbounded: bool = False, last_layer_scale: bool = False, deterministic_eval: bool = True, action_scaling: bool = True, action_bound_method: str = 'clip', lr_scheduler: ~torch.optim.lr_scheduler.LambdaLR | None = None)[source]¶

Bases: OffpolicyAgent

Constrained Variational Policy Optimization (CVPO) agent.

For more details, please refer to https://arxiv.org/abs/2201.11927.

Parameters:

env (gym.Env) – The Gym environment to train the agent on.
logger (BaseLogger) – The logger to use for the agent (default is a dummy BaseLogger instance).
cost_limit (float) – The cost limit of the task (default is 10).
device (str) – The device to use for training (default is ‘cpu’).
thread (int) – The number of threads to use for training when using the CPU (default is 4).
seed (int) – The random seed to use for training (default is 10).
estep_iter_num (int) – the number of iterations for the E-step. (default=1)
estep_kl (float) – the KL divergence threshold for the E-step. (default=0.02)
estep_dual_max (float) – the maximum value for the dual variable in the E-step. (default=20)
estep_dual_lr (float) – the learning rate for the dual variable in the E-step. (default=0.02)
sample_act_num (int) – the number of actions to sample for the E-step. (default=16)
mstep_iter_num (int) – the number of iterations for the M-step. (default=1)
mstep_kl_mu (float) – the KL divergence threshold for the M-step (mean). (default=0.005)
mstep_kl_std (float) – the KL divergence threshold for the M-step (standard deviation). (default=0.0005)
mstep_dual_max (float) – the maximum value for the dual variable in the M-step. (default=0.5)
mstep_dual_lr (float) – the learning rate for the dual variable in the M-step. (default=0.1)
actor_lr (float) – The learning rate of the actor network (default is 5e-4).
critic_lr (float) – The learning rate of the critic network (default is 1e-3).
gamma (float) – The discount factor (default is 0.98).
n_step (int) – The number of steps to look ahead when computing returns (default is 2).
tau (float) – The critics soft update coefficient (default is 0.05).
hidden_sizes (Tuple[int, ...]) – The sizes of the hidden layers in the actor and critic networks (default is (128, 128)).
double_critic (bool) – Whether to use two critic networks instead of one (default is False).
conditioned_sigma (bool) – Whether the variance of the Gaussian policy is conditioned on the state (default is True).
unbounded (bool) – Whether to use an unbounded output layer for the actor network (default is False).
last_layer_scale (bool) – Whether to scale the last layer of the actor network (default is False).
deterministic_eval (bool) – Whether to use a deterministic policy during evaluation (default is True).
action_scaling (bool) – Whether to scale actions by the maximum action value (default is True).
action_bound_method (str) – The method to use for action bounds (‘clip’ or ‘tanh’) (default is ‘clip’).
lr_scheduler (Optional[torch.optim.lr_scheduler.LambdaLR]) – The learning rate scheduler (default is None).
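The tau parameter above is a soft (Polyak) update coefficient. A minimal sketch of that rule on plain lists of floats follows; the function name `soft_update` and the toy weights are hypothetical, and a real agent would apply the same formula to network parameter tensors.

```python
from typing import List


def soft_update(target: List[float], online: List[float], tau: float) -> List[float]:
    """Polyak averaging: target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]


target_weights = [0.0, 1.0]
online_weights = [1.0, 1.0]
for _ in range(3):
    # Each call moves the target 5% of the remaining distance to the online weights.
    target_weights = soft_update(target_weights, online_weights, tau=0.05)
```

With tau = 0.05 the gap between target and online weights shrinks by a factor of 0.95 per update, so smaller values of tau give a slower-moving, more stable target critic.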

See also

Please refer to BaseAgent and OffpolicyAgent for more details of usage.

name = 'CVPOAgent'¶

policy: BasePolicy¶

class fsrl.agent.SACLagAgent(env: ~gymnasium.core.Env, logger: ~fsrl.utils.logger.base_logger.BaseLogger = <fsrl.utils.logger.base_logger.BaseLogger object>, cost_limit: float = 10, device: str = 'cpu', thread: int = 4, seed: int = 10, actor_lr: float = 0.0005, critic_lr: float = 0.001, hidden_sizes: ~typing.Tuple[int, ...] = (128, 128), auto_alpha: bool = True, alpha_lr: float = 0.0003, alpha: float = 0.002, tau: float = 0.05, n_step: int = 2, use_lagrangian: bool = True, lagrangian_pid: ~typing.Tuple[float, ...] = (0.05, 0.0005, 0.1), rescaling: bool = True, gamma: float = 0.99, conditioned_sigma: bool = True, unbounded: bool = True, last_layer_scale: bool = False, deterministic_eval: bool = False, action_scaling: bool = True, action_bound_method: str = 'clip', lr_scheduler: ~torch.optim.lr_scheduler.LambdaLR | None = None)[source]¶

Bases: OffpolicyAgent

Soft Actor-Critic (SAC) with PID Lagrangian agent.

For more details, please refer to https://arxiv.org/abs/1801.01290 (SAC) and https://arxiv.org/abs/2007.03964 (PID Lagrangian).

Parameters:

env (gym.Env) – The environment to train and evaluate the agent on.
logger (BaseLogger) – A logger instance to log training and evaluation statistics, default to a dummy logger.
cost_limit (float) – the constraint limit(s) for the Lagrangian optimization. (default: 10)
device (str) – The device to use for training and inference, default to “cpu”.
thread (int) – The number of threads to use for training, ignored if device is “cuda”, default to 4.
seed (int) – The random seed for reproducibility, default to 10.
actor_lr (float) – The learning rate of the actor network (default: 5e-4).
critic_lr (float) – The learning rate of the critic network (default: 1e-3).
hidden_sizes (Tuple[int, ...]) – The sizes of the hidden layers for the policy and value networks, default to (128, 128).
auto_alpha (bool) – whether to automatically tune “alpha”, the temperature. (default: True)
alpha_lr (float) – the learning rate of learning “alpha” if auto_alpha is True. (default: 3e-4)
alpha (float) – initial temperature for entropy regularization. (default: 0.002)
tau (float) – target smoothing coefficient for soft update of target networks. (default: 0.05)
n_step (int) – number of steps for multi-step learning. (default: 2)
use_lagrangian (bool) – whether to use the Lagrangian constraint optimization. (default: True)
lagrangian_pid (Tuple[float, ...]) – the PID coefficients for the Lagrangian constraint optimization. (default: (0.05, 0.0005, 0.1))
rescaling (bool) – whether to use the rescaling trick for the Lagrangian multiplier, see Alg. 1 in http://proceedings.mlr.press/v119/stooke20a/stooke20a.pdf (default: True)
gamma (float) – the discount factor for future rewards. (default: 0.99)
conditioned_sigma (bool) – Whether the variance of the Gaussian policy is conditioned on the state (default: True).
unbounded (bool) – Whether the action space is unbounded. (default: True)
last_layer_scale (bool) – Whether to scale the last layer output for the policy network. (default: False)
deterministic_eval (bool) – whether to use deterministic action selection during evaluation. (default: False)
action_scaling (bool) – whether to scale the actions according to the action space bounds. (default: True)
action_bound_method (str) – the method for handling actions that exceed the action space bounds (“clip” or other custom methods). (default: “clip”)
lr_scheduler (Optional[torch.optim.lr_scheduler.LambdaLR]) – learning rate scheduler for the optimizer. (default: None)
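The role of lagrangian_pid and cost_limit can be illustrated with a small PID controller for the Lagrange multiplier, in the spirit of the PID Lagrangian paper cited above. This is a sketch, not the library's optimizer: the class `PIDLagrangianSketch` is hypothetical, the tuple is assumed to be ordered (Kp, Ki, Kd), and the rescaling trick is omitted.

```python
from typing import Tuple


class PIDLagrangianSketch:
    """Hypothetical PID update of a Lagrange multiplier for a cost constraint."""

    def __init__(self, pid: Tuple[float, float, float], cost_limit: float) -> None:
        self.kp, self.ki, self.kd = pid  # assumed order: (Kp, Ki, Kd)
        self.cost_limit = cost_limit
        self.integral = 0.0
        self.prev_cost = 0.0

    def update(self, episode_cost: float) -> float:
        error = episode_cost - self.cost_limit                # proportional term
        self.integral = max(0.0, self.integral + error)       # integral, kept non-negative
        derivative = max(0.0, episode_cost - self.prev_cost)  # reacts only to rising cost
        self.prev_cost = episode_cost
        # The multiplier is projected onto the non-negative reals.
        return max(0.0, self.kp * error + self.ki * self.integral + self.kd * derivative)


controller = PIDLagrangianSketch(pid=(0.05, 0.0005, 0.1), cost_limit=10.0)
lam_violating = controller.update(30.0)  # cost above the limit
lam_safe = controller.update(5.0)        # cost below the limit
```

The multiplier grows while the measured cost exceeds cost_limit and falls back toward zero once the constraint is satisfied; the policy loss then trades reward against cost using that multiplier.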

See also

Please refer to BaseAgent and OffpolicyAgent for more details of usage.

name = 'SACLagAgent'¶

policy: BasePolicy¶

class fsrl.agent.DDPGLagAgent(env: ~gymnasium.core.Env, logger: ~fsrl.utils.logger.base_logger.BaseLogger = <fsrl.utils.logger.base_logger.BaseLogger object>, cost_limit: float = 10, device: str = 'cpu', thread: int = 4, seed: int = 10, actor_lr: float = 0.0001, critic_lr: float = 0.001, hidden_sizes: ~typing.Tuple[int, ...] = (128, 128), tau: float = 0.005, exploration_noise: float = 0.1, n_step: int = 3, use_lagrangian: bool = True, lagrangian_pid: ~typing.Tuple[float, ...] = (0.5, 0.001, 0.1), rescaling: bool = True, gamma: float = 0.99, deterministic_eval: bool = True, action_scaling: bool = True, action_bound_method: str = 'clip', lr_scheduler: ~torch.optim.lr_scheduler.LambdaLR | None = None)[source]¶

Bases: OffpolicyAgent

Deep Deterministic Policy Gradient (DDPG) with PID Lagrangian agent.

For more details, please refer to https://arxiv.org/abs/1509.02971 (DDPG) and https://arxiv.org/abs/2007.03964 (PID Lagrangian).

Parameters:

env (gym.Env) – The environment to train and evaluate the agent on.
logger (BaseLogger) – A logger instance to log training and evaluation statistics, default to a dummy logger.
cost_limit (float) – The maximum constraint cost allowed, default to 10.
device (str) – The device to use for training and inference, default to “cpu”.
thread (int) – The number of threads to use for training, ignored if device is “cuda”, default to 4.
seed (int) – The random seed for reproducibility, default to 10.
actor_lr (float) – The learning rate of the actor network (default is 1e-4).
critic_lr (float) – The learning rate of the critic network (default is 1e-3).
hidden_sizes (Tuple[int, ...]) – The sizes of the hidden layers in the actor and critic networks (default is (128, 128)).
tau (float) – the soft update coefficient for updating target networks. Default is 0.005.
exploration_noise (float) – the scale (sigma) of the Gaussian exploration noise. Default is 0.1.
n_step (int) – the number of steps for multi-step bootstrap targets. Default is 3.
use_lagrangian (bool) – whether to use the Lagrangian constraint optimization. Default is True.
lagrangian_pid (Tuple[float, ...]) – the PID coefficients for the Lagrangian constraint optimization. Default is (0.5, 0.001, 0.1).
rescaling (bool) – whether to use the rescaling trick for the Lagrangian multiplier, see Alg. 1 in http://proceedings.mlr.press/v119/stooke20a/stooke20a.pdf. Default is True.
gamma (float) – the discount factor for future rewards. Default is 0.99.
deterministic_eval (bool) – whether to use deterministic action selection during evaluation. Default is True.
action_scaling (bool) – whether to scale the actions according to the action space bounds. Default is True.
action_bound_method (str) – the method for handling actions that exceed the action space bounds (“clip” or other custom methods). Default is “clip”.
lr_scheduler (Optional[torch.optim.lr_scheduler.LambdaLR]) – learning rate scheduler for the optimizer. Default is None.
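The n_step and gamma parameters describe a multi-step bootstrap target. A minimal sketch of that quantity is given below; the function name `n_step_target` is hypothetical, the trajectory is assumed not to terminate within the n steps, and gamma = 0.5 is used only so the result is easy to check by hand.

```python
from typing import Sequence


def n_step_target(rewards: Sequence[float], bootstrap_value: float, gamma: float) -> float:
    """Sum of discounted rewards over n steps plus the discounted bootstrap value."""
    n = len(rewards)
    discounted = sum((gamma ** i) * r for i, r in enumerate(rewards))
    return discounted + (gamma ** n) * bootstrap_value


# Three unit rewards followed by a critic estimate of 10.0:
# 1 + 0.5 + 0.25 + 0.125 * 10 = 3.0
target = n_step_target([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5)
```

Larger n relies more on observed rewards and less on the critic's estimate, at the price of higher variance in the target.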

See also

Please refer to BaseAgent and OffpolicyAgent for more details of usage.

name = 'DDPGLagAgent'¶

policy: BasePolicy¶

class fsrl.agent.PPOLagAgent(env: ~gymnasium.core.Env, logger: ~fsrl.utils.logger.base_logger.BaseLogger = <fsrl.utils.logger.base_logger.BaseLogger object>, cost_limit: float = 10, device: str = 'cpu', thread: int = 4, seed: int = 10, lr: float = 0.0005, hidden_sizes: ~typing.Tuple[int, ...] = (128, 128), unbounded: bool = False, last_layer_scale: bool = False, target_kl: float = 0.02, vf_coef: float = 0.25, max_grad_norm: float | None = None, gae_lambda: float = 0.95, eps_clip: float = 0.2, dual_clip: float | None = None, value_clip: bool = False, advantage_normalization: bool = True, recompute_advantage: bool = False, use_lagrangian: bool = True, lagrangian_pid: ~typing.Tuple = (0.05, 0.0005, 0.1), rescaling: bool = True, gamma: float = 0.99, max_batchsize: int = 99999, reward_normalization: bool = False, deterministic_eval: bool = True, action_scaling: bool = True, action_bound_method: str = 'clip', lr_scheduler: ~torch.optim.lr_scheduler.LambdaLR | None = None)[source]¶

Bases: OnpolicyAgent

Proximal Policy Optimization (PPO) with PID Lagrangian agent.

For more details, please refer to https://arxiv.org/abs/1707.06347 (PPO) and https://arxiv.org/abs/2007.03964 (PID Lagrangian).

Parameters:

env (gym.Env) – The environment to train and evaluate the agent on.
logger (BaseLogger) – A logger instance to log training and evaluation statistics, default to a dummy logger.
cost_limit (float) – the constraint limit(s) for the Lagrangian optimization. Default is 10.
device (str) – The device to use for training and inference, default to “cpu”.
thread (int) – The number of threads to use for training, ignored if device is “cuda”, default to 4.
seed (int) – The random seed for reproducibility, default to 10.
lr (float) – The learning rate, default to 5e-4.
hidden_sizes (Tuple[int, ...]) – The sizes of the hidden layers for the policy and value networks, default to (128, 128).
unbounded (bool) – Whether the action space is unbounded, default to False.
last_layer_scale (bool) – Whether to scale the last layer output for the policy network, default to False.
target_kl (float) – the target KL divergence for the PPO update. Default is 0.02.
vf_coef (float) – the value function coefficient for the loss function. Default is 0.25.
max_grad_norm (Optional[float]) – the maximum gradient norm for gradient clipping (None for no clipping). Default is None.
gae_lambda (float) – the Generalized Advantage Estimation (GAE) parameter. Default is 0.95.
eps_clip (float) – the PPO clipping parameter for the policy update. Default is 0.2.
dual_clip (Optional[float]) – the PPO dual clipping parameter (None for no dual clipping). Default is None.
value_clip (bool) – whether to clip the value function update. Default is False.
advantage_normalization (bool) – whether to normalize the advantages. Default is True.
recompute_advantage (bool) – whether to recompute the advantages during the optimization process. Default is False.
use_lagrangian (bool) – whether to use the Lagrangian constraint optimization. Default is True.
lagrangian_pid (Tuple) – the PID coefficients for the Lagrangian constraint optimization. Default is (0.05, 0.0005, 0.1).
rescaling (bool) – whether to use the rescaling trick for the Lagrangian multiplier, see Alg. 1 in http://proceedings.mlr.press/v119/stooke20a/stooke20a.pdf. Default is True.
gamma (float) – the discount factor for future rewards. Default is 0.99.
max_batchsize (int) – the maximum size of the batch when computing GAE, depends on the size of available memory and the memory cost of the model; should be as large as possible within the memory constraint. Default to 99999.
reward_normalization (bool) – whether to normalize the rewards. Default is False.
deterministic_eval (bool) – whether to use deterministic actions during evaluation. Default is True.
action_scaling (bool) – whether to scale actions based on the action space. Default is True.
action_bound_method (str) – the method used to handle out-of-bound actions. Default is “clip”.
lr_scheduler (Optional[torch.optim.lr_scheduler.LambdaLR]) – the learning rate scheduler. Default is None.
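The effect of eps_clip can be shown on a single sample of the PPO clipped surrogate objective. The sketch below is illustrative only: `clipped_surrogate` is a hypothetical helper, and it leaves out dual clipping, the value loss, and the Lagrangian cost term.

```python
def clipped_surrogate(ratio: float, advantage: float, eps_clip: float = 0.2) -> float:
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps_clip, min(1.0 + eps_clip, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: the gain is capped at (1 + eps_clip) * A = 2.4.
gain_capped = clipped_surrogate(ratio=1.5, advantage=2.0)
# Negative advantage: the unclipped (more pessimistic) value -3.0 is kept.
loss_kept = clipped_surrogate(ratio=1.5, advantage=-2.0)
```

Taking the minimum removes the incentive to push the probability ratio outside the interval [1 - eps_clip, 1 + eps_clip] when doing so would only increase the objective.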

See also

Please refer to BaseAgent and OnpolicyAgent for more details of usage.

name = 'PPOLagAgent'¶

policy: BasePolicy¶

class fsrl.agent.TRPOLagAgent(env: ~gymnasium.core.Env, logger: ~fsrl.utils.logger.base_logger.BaseLogger = <fsrl.utils.logger.base_logger.BaseLogger object>, cost_limit: float = 10, device: str = 'cpu', thread: int = 4, seed: int = 10, lr: float = 0.0005, hidden_sizes: ~typing.Tuple[int, ...] = (128, 128), unbounded: bool = False, last_layer_scale: bool = False, target_kl: float = 0.001, backtrack_coeff: float = 0.8, max_backtracks: int = 10, optim_critic_iters: int = 20, gae_lambda: float = 0.95, advantage_normalization: bool = True, use_lagrangian: bool = True, lagrangian_pid: ~typing.Tuple = (0.05, 0.0005, 0.1), rescaling: bool = True, gamma: float = 0.99, max_batchsize: int = 99999, reward_normalization: bool = False, deterministic_eval: bool = True, action_scaling: bool = True, action_bound_method: str = 'clip', lr_scheduler: ~torch.optim.lr_scheduler.LambdaLR | None = None)[source]¶

Bases: OnpolicyAgent

Trust Region Policy Optimization (TRPO) with PID Lagrangian agent.

For more details, please refer to https://arxiv.org/abs/1502.05477 (TRPO) and https://arxiv.org/abs/2007.03964 (PID Lagrangian).

Parameters:

env (gym.Env) – The environment to train and evaluate the agent on.
logger (BaseLogger) – A logger instance to log training and evaluation statistics, default to a dummy logger.
cost_limit (float) – the constraint limit(s) for the Lagrangian optimization (default: 10).
device (str) – The device to use for training and inference, default to “cpu”.
thread (int) – The number of threads to use for training, ignored if device is “cuda”, default to 4.
seed (int) – The random seed for reproducibility, default to 10.
lr (float) – The learning rate, default to 5e-4.
target_kl (float) – the target KL divergence for the line search (default: 0.001).
hidden_sizes (Tuple[int, ...]) – The sizes of the hidden layers for the policy and value networks, default to (128, 128).
unbounded (bool) – Whether the action space is unbounded, default to False.
last_layer_scale (bool) – Whether to scale the last layer output for the policy network, default to False.
backtrack_coeff (float) – the coefficient for backtracking during the line search (default: 0.8).
max_backtracks (int) – the maximum number of backtracks allowed during the line search (default: 10).
optim_critic_iters (int) – the number of optimization iterations for the critic network (default: 20).
gae_lambda (float) – the GAE lambda value (default: 0.95).
advantage_normalization (bool) – whether to normalize advantage (default: True).
use_lagrangian (bool) – whether to use the Lagrangian constraint optimization (default: True).
lagrangian_pid (Tuple) – the PID coefficients for the Lagrangian constraint optimization (default: (0.05, 0.0005, 0.1)).
rescaling (bool) – whether to use the rescaling trick for the Lagrangian multiplier, see Alg. 1 in http://proceedings.mlr.press/v119/stooke20a/stooke20a.pdf (default: True).
gamma (float) – the discount factor for future rewards (default: 0.99).
max_batchsize (int) – the maximum size of the batch when computing GAE, depends on the size of available memory and the memory cost of the model; should be as large as possible within the memory constraint. Default to 99999.
reward_normalization (bool) – whether to normalize rewards (default: False).
deterministic_eval (bool) – whether to use deterministic action selection during evaluation (default: True).
action_scaling (bool) – whether to scale the actions according to the action space bounds (default: True).
action_bound_method (str) – the method for handling actions that exceed the action space bounds (“clip” or other custom methods) (default: “clip”).
lr_scheduler (Optional[torch.optim.lr_scheduler.LambdaLR]) – learning rate scheduler for the optimizer (default: None).
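The parameters target_kl, backtrack_coeff, and max_backtracks describe a backtracking line search. A generic sketch on a toy one-dimensional problem follows; the function `backtracking_line_search`, its callback arguments, and the quadratic KL model are hypothetical and are not taken from the library.

```python
from typing import Callable, Optional


def backtracking_line_search(
    full_step: float,
    kl_of_step: Callable[[float], float],
    improvement_of_step: Callable[[float], float],
    target_kl: float = 0.001,
    backtrack_coeff: float = 0.8,
    max_backtracks: int = 10,
) -> Optional[float]:
    """Shrink the step geometrically until it meets the KL bound and improves."""
    for i in range(max_backtracks):
        step = (backtrack_coeff ** i) * full_step
        if kl_of_step(step) <= target_kl and improvement_of_step(step) > 0:
            return step
    return None  # no acceptable step found; keep the old policy


# Toy problem: KL grows quadratically with the step, improvement is linear.
accepted = backtracking_line_search(
    full_step=0.1,
    kl_of_step=lambda s: 0.5 * s * s,
    improvement_of_step=lambda s: s,
)
```

In this toy case the first four candidates (0.1, 0.08, 0.064, 0.0512) violate the KL bound and the fifth, 0.8^4 * 0.1 = 0.04096, is accepted.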

See also

Please refer to BaseAgent and OnpolicyAgent for more details of usage.

name = 'TRPOLagAgent'¶

policy: BasePolicy¶

learn(train_envs: Env | BaseVectorEnv, test_envs: Env | BaseVectorEnv | None = None, epoch: int = 300, episode_per_collect: int = 20, step_per_epoch: int = 10000, repeat_per_collect: int = 4, buffer_size: int = 100000, testing_num: int = 2, batch_size: int = 99999, reward_threshold: float = 450, save_interval: int = 4, resume: bool = False, save_ckpt: bool = True, verbose: bool = True, show_progress: bool = True) → None[source]¶: See OnpolicyAgent.learn() for details. The signature differs only in the default batch_size (99999 instead of 512).

class fsrl.agent.FOCOPSAgent(env: ~gymnasium.core.Env, logger: ~fsrl.utils.logger.base_logger.BaseLogger = <fsrl.utils.logger.base_logger.BaseLogger object>, cost_limit: float = 10, device: str = 'cpu', thread: int = 4, seed: int = 10, actor_lr: float = 0.0005, critic_lr: float = 0.001, hidden_sizes: ~typing.Tuple[int, ...] = (128, 128), unbounded: bool = False, last_layer_scale: bool = False, auto_nu: bool = True, nu: float = 0.01, nu_max: float = 2.0, nu_lr: float = 0.01, l2_reg: float = 0.001, delta: float = 0.02, eta: float = 0.02, tem_lambda: float = 0.95, gae_lambda: float = 0.95, max_grad_norm: float | None = 0.5, advantage_normalization: bool = True, recompute_advantage: bool = False, gamma: float = 0.99, max_batchsize: int = 100000, reward_normalization: bool = False, deterministic_eval: bool = True, action_scaling: bool = True, action_bound_method: str = 'clip', lr_scheduler: ~torch.optim.lr_scheduler.LambdaLR | None = None)[source]¶

Bases: OnpolicyAgent

First Order Constrained Optimization in Policy Space (FOCOPS) agent.

For more details, please refer to https://arxiv.org/pdf/2002.06506.pdf

Parameters:

env (gym.Env) – The environment to train and evaluate the agent on.
logger (BaseLogger) – A logger instance to log training and evaluation statistics, default to a dummy logger.
cost_limit (float) – the constraint threshold. Default value is 10.
device (str) – The device to use for training and inference, default to “cpu”.
thread (int) – The number of threads to use for training, ignored if device is “cuda”, default to 4.
seed (int) – The random seed for reproducibility, default to 10.
actor_lr (float) – the learning rate of the actor network, default to 5e-4.
critic_lr (float) – the learning rate of the critic network, default to 1e-3.
hidden_sizes (Tuple[int, ...]) – The sizes of the hidden layers for the policy and value networks, default to (128, 128).
unbounded (bool) – Whether the action space is unbounded, default to False.
last_layer_scale (bool) – whether to scale the last layer output for the policy network, default to False.
auto_nu (bool) – whether to automatically tune “nu”, the cost coefficient. Default value is True.
nu (Union[float, Tuple[float, float, torch.Tensor]]) – cost coefficient. It can also be a tuple representing [nu_max, nu_lr, nu]. Default value is 0.01.
nu_max (float) – the max value of the cost coefficient if auto_nu is True. Default value is 2.
nu_lr (float) – the learning rate of nu if auto_nu is True. Default value is 0.01.
l2_reg (float) – L2 regularization rate. Default value is 1e-3.
delta (float) – early stop KL bound. Default value is 0.02.
eta (float) – KL bound for indicator function. Default value is 0.02.
tem_lambda (float) – inverse temperature lambda. Default value is 0.95.
gae_lambda (float) – GAE (Generalized Advantage Estimation) lambda for advantage computation. Default value is 0.95.
max_grad_norm (Optional[float]) – maximum gradient norm for gradient clipping, if specified. Default value is 0.5.
advantage_normalization (bool) – normalize advantage if True. Default value is True.
recompute_advantage (bool) – recompute advantage using the updated value function. Default value is False.
gamma (float) – the discount factor for future rewards. Default value is 0.99.
max_batchsize (int) – maximum batch size for the optimization. Default value is 100000.
reward_normalization (bool) – normalize the rewards if True. Default value is False.
deterministic_eval (bool) – whether to use deterministic action selection during evaluation. Default value is True.
action_scaling (bool) – whether to scale the actions according to the action space bounds. Default value is True.
action_bound_method (str) – the method for handling actions that exceed the action space bounds (“clip” or other custom methods). Default value is “clip”.
lr_scheduler (Optional[torch.optim.lr_scheduler.LambdaLR]) – learning rate scheduler for the optimizer. Default value is None.
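When auto_nu is enabled, the cost coefficient nu is adapted using nu_lr and bounded by nu_max. The projected update described in the FOCOPS paper can be sketched as below; `update_nu` is a hypothetical helper and `mean_cost` stands for the estimated episode cost of the current policy.

```python
def update_nu(nu: float, mean_cost: float, cost_limit: float,
              nu_lr: float = 0.01, nu_max: float = 2.0) -> float:
    """Projected step: nu <- clip(nu - nu_lr * (cost_limit - mean_cost), 0, nu_max)."""
    nu = nu - nu_lr * (cost_limit - mean_cost)
    return min(nu_max, max(0.0, nu))


nu_up = update_nu(nu=0.01, mean_cost=30.0, cost_limit=10.0)    # violation raises nu
nu_floor = update_nu(nu=0.01, mean_cost=0.0, cost_limit=10.0)  # slack pushes nu to 0
nu_cap = update_nu(nu=1.9, mean_cost=60.0, cost_limit=10.0)    # capped at nu_max
```

The clipping keeps nu inside [0, nu_max], so a long run of constraint violations cannot make the cost term dominate the update without bound.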

See also

Please refer to BaseAgent and OnpolicyAgent for more details of usage.

name = 'FOCOPSAgent'¶

policy: BasePolicy¶

class fsrl.agent.CPOAgent(env: ~gymnasium.core.Env, logger: ~fsrl.utils.logger.base_logger.BaseLogger = <fsrl.utils.logger.base_logger.BaseLogger object>, cost_limit: float = 10, device: str = 'cpu', thread: int = 4, seed: int = 10, lr: float = 0.001, hidden_sizes: ~typing.Tuple[int, ...] = (128, 128), unbounded: bool = False, last_layer_scale: bool = False, target_kl: float = 0.01, backtrack_coeff: float = 0.8, damping_coeff: float = 0.1, max_backtracks: int = 10, optim_critic_iters: int = 10, l2_reg: float = 0.001, gae_lambda: float = 0.95, advantage_normalization: bool = True, gamma: float = 0.99, max_batchsize: int = 99999, reward_normalization: bool = False, deterministic_eval: bool = True, action_scaling: bool = True, action_bound_method: str = 'clip', lr_scheduler: ~torch.optim.lr_scheduler.LambdaLR | None = None)[source]¶

Bases: OnpolicyAgent

Constrained Policy Optimization (CPO) agent.

Parameters:

env (gym.Env) – The environment to train and evaluate the agent on.
logger (BaseLogger) – A logger instance to log training and evaluation statistics, default to a dummy logger.
cost_limit (float) – The maximum constraint cost allowed, default to 10.
device (str) – The device to use for training and inference, default to “cpu”.
thread (int) – The number of threads to use for training, ignored if device is “cuda”, default to 4.
seed (int) – The random seed for reproducibility, default to 10.
lr (float) – The learning rate, default to 1e-3.
hidden_sizes (Tuple[int, ...]) – The sizes of the hidden layers for the policy and value networks, default to (128, 128).
unbounded (bool) – Whether the action space is unbounded, default to False.
last_layer_scale (bool) – Whether to scale the last layer output for the policy network, default to False.
target_kl (float) – The target KL divergence for the policy update, default to 0.01.
backtrack_coeff (float) – The coefficient for backtracking, default to 0.8.
damping_coeff (float) – The coefficient for the damping, default to 0.1.
max_backtracks (int) – The maximum number of backtracking steps, default to 10.
optim_critic_iters (int) – The number of iterations to optimize the critic, default to 10.
l2_reg (float) – The L2 regularization coefficient, default to 0.001.
gae_lambda (float) – The lambda parameter for generalized advantage estimation, default to 0.95.
advantage_normalization (bool) – Whether to normalize advantages, default to True.
gamma (float) – The discount factor for future rewards and costs, default to 0.99.
max_batchsize (int) – The maximum batch size for computing advantages etc, default to 99999.
reward_normalization (bool) – Whether to normalize rewards, default to False.
deterministic_eval (bool) – Whether to use deterministic actions during evaluation, default to True.
action_scaling (bool) – Whether to scale the action space, default to True.
action_bound_method (str) – The method to bound actions (“clip” or “tanh”), default to “clip”.
lr_scheduler (Optional[torch.optim.lr_scheduler.LambdaLR]) – A learning rate scheduler, default to None.
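The gae_lambda and gamma parameters control Generalized Advantage Estimation. A minimal sketch for a single trajectory is shown below; `gae_advantages` is a hypothetical helper that ignores early termination and batching (max_batchsize), and in a constrained setting the same recursion would be applied to costs as well as rewards.

```python
from typing import List, Sequence


def gae_advantages(rewards: Sequence[float], values: Sequence[float],
                   last_value: float, gamma: float, gae_lambda: float) -> List[float]:
    """Generalized Advantage Estimation for one trajectory without early termination."""
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * gae_lambda * running       # discounted sum of residuals
        advantages[t] = running
        next_value = values[t]
    return advantages


# With gamma = 1 and lambda = 1, GAE reduces to the return minus the value estimate.
adv = gae_advantages([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], last_value=0.0,
                     gamma=1.0, gae_lambda=1.0)
```

Setting gae_lambda below 1 down-weights residuals from later steps, which lowers variance at the cost of some bias.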

See also

Please refer to BaseAgent and OnpolicyAgent for more details of usage.

name = 'CPOAgent'¶

policy: BasePolicy¶

learn(train_envs: Env | BaseVectorEnv, test_envs: Env | BaseVectorEnv | None = None, epoch: int = 300, episode_per_collect: int = 20, step_per_epoch: int = 10000, repeat_per_collect: int = 4, buffer_size: int = 100000, testing_num: int = 2, batch_size: int = 99999, reward_threshold: float = 450, save_interval: int = 4, resume: bool = False, save_ckpt: bool = True, verbose: bool = True, show_progress: bool = True) → None[source]¶: See OnpolicyAgent.learn() for details. The signature differs only in the default batch_size (99999 instead of 512).