fsrl.agent

The default MLP agent package.

class fsrl.agent.BaseAgent(*args, **kwargs)

Bases: ABC

The base class for a default agent.

An agent class should have the following parts:

  • __init__(): initialize the agent, including the policy, networks, optimizers, and so on;

  • learn(): start training given the learning parameters;

  • evaluate(): evaluate the agent over multiple episodes;

  • state_dict: the agent state dictionary that can be saved as a checkpoint.

Example of usage:

# initialize the CVPO agent
agent = CVPOAgent(env, other_algo_params)

# train multiple epochs
agent.learn(training_envs, other_training_params)

# test after the training is finished
agent.evaluate(testing_envs)

# test with agent's state_dict
agent.evaluate(testing_envs, agent.state_dict)

All of the agent classes must inherit BaseAgent.
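The four parts listed above can be illustrated with a small, dependency-free sketch. ToyBaseAgent and ToyAgent are hypothetical names invented for this illustration; they only mirror the shape of the interface and are not FSRL's implementation.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional, Tuple


class ToyBaseAgent(ABC):
    """Hypothetical stand-in that mirrors the documented interface."""

    name = "ToyBaseAgent"

    def __init__(self) -> None:
        # A real agent would build its policy, networks, and optimizers here.
        self._params: Dict[str, float] = {"weight": 0.0}

    @abstractmethod
    def learn(self, *args, **kwargs) -> None:
        """Train the policy; every subclass must override this."""

    def evaluate(
        self, episodes: int = 10, state_dict: Optional[dict] = None
    ) -> Tuple[float, float, float]:
        # Optionally restore parameters from a checkpoint before evaluating.
        if state_dict is not None:
            self._params = dict(state_dict)
        # Placeholder statistics: (reward, episode length, constraint cost).
        return self._params["weight"], float(episodes), 0.0

    @property
    def state_dict(self) -> dict:
        # Return a copy so that callers can store it as a checkpoint.
        return dict(self._params)


class ToyAgent(ToyBaseAgent):
    name = "ToyAgent"

    def learn(self, steps: int = 1) -> None:
        # Pretend that each training step nudges a single parameter.
        self._params["weight"] += 0.5 * steps


agent = ToyAgent()
agent.learn(steps=2)
checkpoint = agent.state_dict            # snapshot that could be saved to disk
reward, length, cost = ToyAgent().evaluate(episodes=3, state_dict=checkpoint)
```

Because learn() is abstract, instantiating ToyBaseAgent directly raises a TypeError, which is the same mechanism that forces concrete agents to provide their own training loop.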

name = 'BaseAgent'

abstract learn(*args, **kwargs) → None

Train the policy on a set of training environments.

evaluate(test_envs: Env | BaseVectorEnv, state_dict: dict | None = None, eval_episodes: int = 10, render: bool = False, train_mode: bool = False) → Tuple[float, float, float]

Evaluate the policy on a set of test environments.

Parameters:
  • test_envs (Union[gym.Env, BaseVectorEnv]) – A single environment or a vectorized environment to evaluate the policy on.

  • state_dict (Optional[dict]) – An optional dictionary containing the state parameters of the agent to be evaluated, defaults to None

  • eval_episodes (int) – The number of episodes to evaluate, defaults to 10

  • render (bool) – Whether to render the environment during evaluation, defaults to False

  • train_mode (bool) – Whether to set the policy to training mode during evaluation, defaults to False

Returns:

A tuple of the rewards, episode lengths, and constraint costs obtained during evaluation.

property state_dict

Return the policy’s state_dict.

class fsrl.agent.OffpolicyAgent

Bases: BaseAgent

The base class for an off-policy agent.

The learn() function is customized to work with the off-policy trainer. See BaseAgent for more details.

name = 'OffpolicyAgent'

learn(train_envs: Env | BaseVectorEnv, test_envs: Env | BaseVectorEnv | None = None, epoch: int = 300, episode_per_collect: int = 5, step_per_epoch: int = 3000, update_per_step: float = 0.1, buffer_size: int = 100000, testing_num: int = 2, batch_size: int = 256, reward_threshold: float = 450, save_interval: int = 4, resume: bool = False, save_ckpt: bool = True, verbose: bool = True, show_progress: bool = True) → None

Train the policy on a set of training environments.

Parameters:
  • train_envs (Union[gym.Env, BaseVectorEnv]) – A single environment or a vectorized environment to train the policy on.

  • test_envs (Optional[Union[gym.Env, BaseVectorEnv]]) – A single environment or a vectorized environment to evaluate the policy on, defaults to None.

  • epoch (int) – The number of training epochs, defaults to 300.

  • episode_per_collect (int) – The number of episodes to collect before each policy update, defaults to 5.

  • step_per_epoch (int) – The number of environment steps per epoch, defaults to 3000.

  • update_per_step (float) – The ratio of policy updates to environment steps, defaults to 0.1.

  • buffer_size (int) – The maximum size of the replay buffer, defaults to 100000.

  • testing_num (int) – The number of episodes to use for evaluation, defaults to 2.

  • batch_size (int) – The batch size for each policy update, defaults to 256.

  • reward_threshold (float) – The reward threshold for early stopping, defaults to 450.

  • save_interval (int) – The interval (in epochs) for saving the policy model, defaults to 4.

  • resume (bool) – Whether to resume training from the last checkpoint, defaults to False.

  • save_ckpt (bool) – Whether to save the policy model, defaults to True.

  • verbose (bool) – Whether to print progress information during training, defaults to True.

  • show_progress (bool) – Whether to show the tqdm training progress bar, defaults to True
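To make the interplay of step_per_epoch, update_per_step, and batch_size concrete, the following sketch works through the arithmetic for the documented defaults. It assumes update_per_step acts as a plain ratio of policy updates to environment steps, as described above; the helper gradient_updates is hypothetical, and the trainer's actual scheduling may differ.

```python
def gradient_updates(env_steps: int, update_per_step: float) -> int:
    """Policy updates implied by a number of collected environment steps."""
    return round(update_per_step * env_steps)


# Documented defaults: step_per_epoch=3000, update_per_step=0.1, batch_size=256.
updates_per_epoch = gradient_updates(env_steps=3000, update_per_step=0.1)

# Each update draws one batch from the replay buffer.
samples_per_epoch = updates_per_epoch * 256

print(updates_per_epoch, samples_per_epoch)
```

Under this reading, one default epoch performs about 300 updates and replays 76800 sampled transitions, so each collected transition is reused many times.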

policy: BasePolicy

class fsrl.agent.OnpolicyAgent

Bases: BaseAgent

The base class for an on-policy agent.

The learn() function is customized to work with the on-policy trainer. See BaseAgent for more details.

name = 'OnpolicyAgent'

learn(train_envs: Env | BaseVectorEnv, test_envs: Env | BaseVectorEnv | None = None, epoch: int = 300, episode_per_collect: int = 20, step_per_epoch: int = 10000, repeat_per_collect: int = 4, buffer_size: int = 100000, testing_num: int = 2, batch_size: int = 512, reward_threshold: float = 450, save_interval: int = 4, resume: bool = False, save_ckpt: bool = True, verbose: bool = True, show_progress: bool = True) → None

Train the policy on a set of training environments.

Parameters:
  • train_envs (Union[gym.Env, BaseVectorEnv]) – A single environment or a vectorized environment to train the policy on.

  • test_envs (Optional[Union[gym.Env, BaseVectorEnv]]) – A single environment or a vectorized environment to evaluate the policy on, defaults to None.

  • epoch (int) – The number of training epochs, defaults to 300

  • episode_per_collect (int) – The number of episodes collected per data collection, defaults to 20

  • step_per_epoch (int) – The number of steps per training epoch, defaults to 10000

  • repeat_per_collect (int) – The number of repeats of policy update for one episode collection, defaults to 4

  • buffer_size (int) – The size of the replay buffer, defaults to 100000

  • testing_num (int) – The number of episodes to evaluate during testing, defaults to 2

  • batch_size (int) – The batch size for training, defaults to 512; TRPOLagAgent and CPOAgent override the default with 99999

  • reward_threshold (float) – The threshold for stopping training when the mean reward exceeds it, defaults to 450

  • save_interval (int) – The interval (in epochs) at which the policy is saved, defaults to 4

  • resume (bool) – Whether to resume training from the saved checkpoint, defaults to False

  • save_ckpt (bool) – Whether to save the policy model, defaults to True

  • verbose (bool) – Whether to print the training information, defaults to True

  • show_progress (bool) – Whether to show the tqdm training progress bar, defaults to True
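The effect of batch_size and repeat_per_collect can be sketched with a little arithmetic. The sketch assumes each repeat makes one full pass over the collected samples in minibatches of batch_size; the helper name and the figure of 4000 collected samples are hypothetical.

```python
import math


def minibatch_updates(num_samples: int, batch_size: int, repeat_per_collect: int) -> int:
    """Number of minibatch updates after one round of data collection."""
    batches_per_pass = math.ceil(num_samples / batch_size)
    return repeat_per_collect * batches_per_pass


# Hypothetical collection of 4000 transitions (e.g. 20 episodes of 200 steps).
default_updates = minibatch_updates(4000, batch_size=512, repeat_per_collect=4)

# With batch_size=99999 every pass is a single full-batch update.
full_batch_updates = minibatch_updates(4000, batch_size=99999, repeat_per_collect=4)

print(default_updates, full_batch_updates)
```

Under these assumptions the very large batch size used by TRPOLagAgent and CPOAgent amounts to full-batch updates, which is a natural fit for methods that solve a trust-region step over the whole collected batch.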

policy: BasePolicy

class fsrl.agent.CVPOAgent(env: Env, logger: BaseLogger = <fsrl.utils.logger.base_logger.BaseLogger object>, cost_limit: float = 10, device: str = 'cpu', thread: int = 4, seed: int = 10, estep_iter_num: int = 1, estep_kl: float = 0.02, estep_dual_max: float = 20, estep_dual_lr: float = 0.02, sample_act_num: int = 16, mstep_iter_num: int = 1, mstep_kl_mu: float = 0.005, mstep_kl_std: float = 0.0005, mstep_dual_max: float = 0.5, mstep_dual_lr: float = 0.1, actor_lr: float = 0.0005, critic_lr: float = 0.001, gamma: float = 0.98, n_step: int = 2, tau: float = 0.05, hidden_sizes: Tuple[int, ...] = (128, 128), double_critic: bool = False, conditioned_sigma: bool = True, unbounded: bool = False, last_layer_scale: bool = False, deterministic_eval: bool = True, action_scaling: bool = True, action_bound_method: str = 'clip', lr_scheduler: LambdaLR | None = None)

Bases: OffpolicyAgent

Constrained Variational Policy Optimization (CVPO) agent.

For more details, please refer to https://arxiv.org/abs/2201.11927.

Parameters:
  • env (gym.Env) – The Gym environment to train the agent on.

  • logger (BaseLogger) – The logger to use for the agent (default is a dummy BaseLogger instance).

  • cost_limit (float) – The cost limit of the task (default is 10).

  • device (str) – The device to use for training (default is ‘cpu’).

  • thread (int) – The number of threads to use for training when using the CPU (default is 4).

  • seed (int) – The random seed to use for training (default is 10).

  • estep_iter_num (int) – the number of iterations for the E-step. (default=1)

  • estep_kl (float) – the KL divergence threshold for the E-step. (default=0.02)

  • estep_dual_max (float) – the maximum value for the dual variable in the E-step. (default=20)

  • estep_dual_lr (float) – the learning rate for the dual variable in the E-step. (default=0.02)

  • sample_act_num (int) – the number of actions to sample for the E-step. (default=16)

  • mstep_iter_num (int) – the number of iterations for the M-step. (default=1)

  • mstep_kl_mu (float) – the KL divergence threshold for the M-step (mean). (default=0.005)

  • mstep_kl_std (float) – the KL divergence threshold for the M-step (standard deviation). (default=0.0005)

  • mstep_dual_max (float) – the maximum value for the dual variable in the M-step. (default=0.5)

  • mstep_dual_lr (float) – the learning rate for the dual variable in the M-step. (default=0.1)

  • actor_lr (float) – The learning rate of the actor network (default is 5e-4).

  • critic_lr (float) – The learning rate of the critic network (default is 1e-3).

  • gamma (float) – The discount factor (default is 0.98).

  • n_step (int) – The number of steps to look ahead when computing returns (default is 2).

  • tau (float) – The critics soft update coefficient (default is 0.05).

  • hidden_sizes (Tuple[int, ...]) – The sizes of the hidden layers in the actor and critic networks (default is (128, 128)).

  • double_critic (bool) – Whether to use two critic networks instead of one (default is False).

  • conditioned_sigma (bool) – Whether the variance of the Gaussian policy is conditioned on the state (default is True).

  • unbounded (bool) – Whether to use an unbounded output layer for the actor network (default is False).

  • last_layer_scale (bool) – Whether to scale the last layer of the actor network (default is False).

  • deterministic_eval (bool) – Whether to use a deterministic policy during evaluation (default is True).

  • action_scaling (bool) – Whether to scale actions by the maximum action value (default is True).

  • action_bound_method (str) – The method to use for action bounds (‘clip’ or ‘tanh’) (default is ‘clip’).

  • lr_scheduler (Optional[torch.optim.lr_scheduler.LambdaLR]) – The learning rate scheduler (default is None).

See also

Please refer to BaseAgent and OffpolicyAgent for more details of usage.
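The tau parameter above is described as a soft update coefficient. A minimal sketch of the usual soft (Polyak) update rule, target ← (1 − tau) · target + tau · source, is given below on plain Python lists; the function name and the toy weights are illustrative assumptions, and the library applies the rule to network parameters rather than lists.

```python
from typing import List


def soft_update(target: List[float], source: List[float], tau: float) -> List[float]:
    """Move each target weight a fraction tau of the way toward the source weight."""
    return [(1.0 - tau) * t + tau * s for t, s in zip(target, source)]


target_weights = [0.0, 1.0]
online_weights = [1.0, 1.0]

# tau=0.05 matches the documented default for CVPOAgent.
target_weights = soft_update(target_weights, online_weights, tau=0.05)
print(target_weights)
```

A small tau makes the target network trail the online network slowly, which tends to stabilize bootstrapped critic targets at the price of slower propagation of new value estimates.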

name = 'CVPOAgent'

policy: BasePolicy

class fsrl.agent.SACLagAgent(env: Env, logger: BaseLogger = <fsrl.utils.logger.base_logger.BaseLogger object>, cost_limit: float = 10, device: str = 'cpu', thread: int = 4, seed: int = 10, actor_lr: float = 0.0005, critic_lr: float = 0.001, hidden_sizes: Tuple[int, ...] = (128, 128), auto_alpha: bool = True, alpha_lr: float = 0.0003, alpha: float = 0.002, tau: float = 0.05, n_step: int = 2, use_lagrangian: bool = True, lagrangian_pid: Tuple[float, ...] = (0.05, 0.0005, 0.1), rescaling: bool = True, gamma: float = 0.99, conditioned_sigma: bool = True, unbounded: bool = True, last_layer_scale: bool = False, deterministic_eval: bool = False, action_scaling: bool = True, action_bound_method: str = 'clip', lr_scheduler: LambdaLR | None = None)

Bases: OffpolicyAgent

Soft Actor-Critic (SAC) with PID Lagrangian agent.

For more details, please refer to https://arxiv.org/abs/1801.01290 (SAC) and https://arxiv.org/abs/2007.03964 (PID Lagrangian).

Parameters:
  • env (gym.Env) – The environment to train and evaluate the agent on.

  • logger (BaseLogger) – A logger instance to log training and evaluation statistics, default to a dummy logger.

  • cost_limit (float) – the constraint limit(s) for the Lagrangian optimization. (default: 10)

  • device (str) – The device to use for training and inference, default to “cpu”.

  • thread (int) – The number of threads to use for training, ignored if device is “cuda”, default to 4.

  • seed (int) – The random seed for reproducibility, default to 10.

  • actor_lr (float) – The learning rate of the actor network (default: 5e-4).

  • critic_lr (float) – The learning rate of the critic network (default: 1e-3).

  • hidden_sizes (Tuple[int, ...]) – The sizes of the hidden layers for the policy and value networks, default to (128, 128).

  • auto_alpha (bool) – whether to automatically tune “alpha”, the temperature. (default: True)

  • alpha_lr (float) – the learning rate of learning “alpha” if auto_alpha is True. (default: 3e-4)

  • alpha (float) – initial temperature for entropy regularization. (default: 0.002)

  • tau (float) – target smoothing coefficient for soft update of target networks. (default: 0.05)

  • n_step (int) – number of steps for multi-step learning. (default: 2)

  • use_lagrangian (bool) – whether to use the Lagrangian constraint optimization. (default: True)

  • lagrangian_pid (Tuple[float, ...]) – the PID coefficients for the Lagrangian constraint optimization. (default: (0.05, 0.0005, 0.1))

  • rescaling (bool) – whether to use the rescaling trick for the Lagrangian multiplier, see Alg. 1 in http://proceedings.mlr.press/v119/stooke20a/stooke20a.pdf (default: True)

  • gamma (float) – the discount factor for future rewards. (default: 0.99)

  • conditioned_sigma (bool) – Whether the variance of the Gaussian policy is conditioned on the state (default: True).

  • unbounded (bool) – Whether the action space is unbounded. (default: True)

  • last_layer_scale (bool) – Whether to scale the last layer output for the policy network. (default: False)

  • deterministic_eval (bool) – whether to use deterministic action selection during evaluation. (default: False)

  • action_scaling (bool) – whether to scale the actions according to the action space bounds. (default: True)

  • action_bound_method (str) – the method for handling actions that exceed the action space bounds (“clip” or other custom methods). (default: “clip”)

  • lr_scheduler (Optional[torch.optim.lr_scheduler.LambdaLR]) – learning rate scheduler for the optimizer. (default: None)

See also

Please refer to BaseAgent and OffpolicyAgent for more details of usage.
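The lagrangian_pid and rescaling parameters refer to the PID Lagrangian method of the paper cited above. The sketch below follows the update described in that paper (proportional, integral, and derivative terms on the constraint violation, each projected to be non-negative where required); the class PIDLagrangian and its attribute names are hypothetical, and FSRL's own optimizer may differ in detail.

```python
from typing import Tuple


class PIDLagrangian:
    """Hypothetical helper sketching a PID-controlled Lagrange multiplier."""

    def __init__(self, pid: Tuple[float, float, float], cost_limit: float) -> None:
        self.kp, self.ki, self.kd = pid
        self.cost_limit = cost_limit
        self.integral = 0.0
        self.prev_cost = 0.0
        self.multiplier = 0.0

    def update(self, episode_cost: float) -> float:
        # Proportional term: current constraint violation.
        error = episode_cost - self.cost_limit
        # Integral term: accumulated violation, kept non-negative.
        self.integral = max(0.0, self.integral + error)
        # Derivative term: only increases in cost are penalized.
        derivative = max(0.0, episode_cost - self.prev_cost)
        self.prev_cost = episode_cost
        self.multiplier = max(
            0.0, self.kp * error + self.ki * self.integral + self.kd * derivative
        )
        return self.multiplier

    def rescaled_weights(self) -> Tuple[float, float]:
        # Rescaling trick: weight reward by 1/(1+lambda) and cost by lambda/(1+lambda).
        lam = self.multiplier
        return 1.0 / (1.0 + lam), lam / (1.0 + lam)


controller = PIDLagrangian(pid=(0.05, 0.0005, 0.1), cost_limit=10.0)
lam_violating = controller.update(episode_cost=30.0)  # cost above the limit
lam_safe = controller.update(episode_cost=5.0)        # cost below the limit
```

With these inputs the multiplier rises while the episode cost exceeds cost_limit and drops back to zero once the cost falls below it. The rescaled weights always sum to one, which keeps the magnitude of the combined objective bounded as the multiplier grows.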

name = 'SACLagAgent'

policy: BasePolicy

class fsrl.agent.DDPGLagAgent(env: Env, logger: BaseLogger = <fsrl.utils.logger.base_logger.BaseLogger object>, cost_limit: float = 10, device: str = 'cpu', thread: int = 4, seed: int = 10, actor_lr: float = 0.0001, critic_lr: float = 0.001, hidden_sizes: Tuple[int, ...] = (128, 128), tau: float = 0.005, exploration_noise: float = 0.1, n_step: int = 3, use_lagrangian: bool = True, lagrangian_pid: Tuple[float, ...] = (0.5, 0.001, 0.1), rescaling: bool = True, gamma: float = 0.99, deterministic_eval: bool = True, action_scaling: bool = True, action_bound_method: str = 'clip', lr_scheduler: LambdaLR | None = None)

Bases: OffpolicyAgent

Deep Deterministic Policy Gradient (DDPG) with PID Lagrangian agent.

For more details, please refer to https://arxiv.org/abs/1509.02971 (DDPG) and https://arxiv.org/abs/2007.03964 (PID Lagrangian).

Parameters:
  • env (gym.Env) – The environment to train and evaluate the agent on.

  • logger (BaseLogger) – A logger instance to log training and evaluation statistics, default to a dummy logger.

  • cost_limit (float) – The maximum constraint cost allowed, default to 10.

  • device (str) – The device to use for training and inference, default to “cpu”.

  • thread (int) – The number of threads to use for training, ignored if device is “cuda”, default to 4.

  • seed (int) – The random seed for reproducibility, default to 10.

  • actor_lr (float) – The learning rate of the actor network (default is 1e-4).

  • critic_lr (float) – The learning rate of the critic network (default is 1e-3).

  • hidden_sizes (Tuple[int, ...]) – The sizes of the hidden layers in the actor and critic networks (default is (128, 128)).

  • tau (float) – the soft update coefficient for updating target networks. Default is 0.005.

  • exploration_noise (float) – the standard deviation (sigma) of the Gaussian noise used for exploration. Default is 0.1.

  • n_step (int) – the number of steps for multi-step bootstrap targets. Default is 3.

  • use_lagrangian (bool) – whether to use the Lagrangian constraint optimization. Default is True.

  • lagrangian_pid (Tuple[float, ...]) – the PID coefficients for the Lagrangian constraint optimization. Default is (0.5, 0.001, 0.1).

  • rescaling (bool) – whether to use the rescaling trick for the Lagrangian multiplier, see Alg. 1 in http://proceedings.mlr.press/v119/stooke20a/stooke20a.pdf. Default is True.

  • gamma (float) – the discount factor for future rewards. Default is 0.99.

  • deterministic_eval (bool) – whether to use deterministic action selection during evaluation. Default is True.

  • action_scaling (bool) – whether to scale the actions according to the action space bounds. Default is True.

  • action_bound_method (str) – the method for handling actions that exceed the action space bounds (“clip” or other custom methods). Default is “clip”.

  • lr_scheduler (Optional[torch.optim.lr_scheduler.LambdaLR]) – learning rate scheduler for the optimizer. Default is None.

See also

Please refer to BaseAgent and OffpolicyAgent for more details of usage.
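The n_step and gamma parameters control multi-step bootstrap targets. As a rough sketch, an n-step target sums the discounted rewards of the next n steps and then bootstraps from a value estimate of the state reached. The helper below is hypothetical and ignores episode termination, which a real implementation has to handle.

```python
from typing import Sequence


def n_step_target(rewards: Sequence[float], bootstrap_value: float, gamma: float) -> float:
    """Discounted sum of len(rewards) rewards plus a discounted bootstrap value."""
    target = bootstrap_value
    # Fold from the last reward backwards: target = r + gamma * target.
    for reward in reversed(rewards):
        target = reward + gamma * target
    return target


# Three rewards correspond to n_step=3 in the DDPGLagAgent signature.
target = n_step_target([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.99)
print(round(target, 5))
```

Larger n_step values rely more on observed rewards and less on the critic's estimate, which can speed up learning but makes the target more sensitive to off-policy data.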

name = 'DDPGLagAgent'

policy: BasePolicy

class fsrl.agent.PPOLagAgent(env: Env, logger: BaseLogger = <fsrl.utils.logger.base_logger.BaseLogger object>, cost_limit: float = 10, device: str = 'cpu', thread: int = 4, seed: int = 10, lr: float = 0.0005, hidden_sizes: Tuple[int, ...] = (128, 128), unbounded: bool = False, last_layer_scale: bool = False, target_kl: float = 0.02, vf_coef: float = 0.25, max_grad_norm: float | None = None, gae_lambda: float = 0.95, eps_clip: float = 0.2, dual_clip: float | None = None, value_clip: bool = False, advantage_normalization: bool = True, recompute_advantage: bool = False, use_lagrangian: bool = True, lagrangian_pid: Tuple = (0.05, 0.0005, 0.1), rescaling: bool = True, gamma: float = 0.99, max_batchsize: int = 99999, reward_normalization: bool = False, deterministic_eval: bool = True, action_scaling: bool = True, action_bound_method: str = 'clip', lr_scheduler: LambdaLR | None = None)

Bases: OnpolicyAgent

Proximal Policy Optimization (PPO) with PID Lagrangian agent.

For more details, please refer to https://arxiv.org/abs/1707.06347 (PPO) and https://arxiv.org/abs/2007.03964 (PID Lagrangian).

Parameters:
  • env (gym.Env) – The environment to train and evaluate the agent on.

  • logger (BaseLogger) – A logger instance to log training and evaluation statistics, default to a dummy logger.

  • cost_limit (float) – the constraint limit(s) for the Lagrangian optimization. Default is 10.

  • device (str) – The device to use for training and inference, default to “cpu”.

  • thread (int) – The number of threads to use for training, ignored if device is “cuda”, default to 4.

  • seed (int) – The random seed for reproducibility, default to 10.

  • lr (float) – The learning rate, default to 5e-4.

  • hidden_sizes (Tuple[int, ...]) – The sizes of the hidden layers for the policy and value networks, default to (128, 128).

  • unbounded (bool) – Whether the action space is unbounded, default to False.

  • last_layer_scale (bool) – Whether to scale the last layer output for the policy network, default to False.

  • target_kl (float) – the target KL divergence for the PPO update. Default is 0.02.

  • vf_coef (float) – the value function coefficient for the loss function. Default is 0.25.

  • max_grad_norm (Optional[float]) – the maximum gradient norm for gradient clipping (None for no clipping). Default is None.

  • gae_lambda (float) – the Generalized Advantage Estimation (GAE) parameter. Default is 0.95.

  • eps_clip (float) – the PPO clipping parameter for the policy update. Default is 0.2.

  • dual_clip (Optional[float]) – the PPO dual clipping parameter (None for no dual clipping). Default is None.

  • value_clip (bool) – whether to clip the value function update. Default is False.

  • advantage_normalization (bool) – whether to normalize the advantages. Default is True.

  • recompute_advantage (bool) – whether to recompute the advantages during the optimization process. Default is False.

  • use_lagrangian (bool) – whether to use the Lagrangian constraint optimization. Default is True.

  • lagrangian_pid (Tuple) – the PID coefficients for the Lagrangian constraint optimization. Default is (0.05, 0.0005, 0.1).

  • rescaling (bool) – whether to use the rescaling trick for the Lagrangian multiplier, see Alg. 1 in http://proceedings.mlr.press/v119/stooke20a/stooke20a.pdf. Default is True.

  • gamma (float) – the discount factor for future rewards. Default is 0.99.

  • max_batchsize (int) – the maximum size of the batch when computing GAE, depends on the size of available memory and the memory cost of the model; should be as large as possible within the memory constraint. Default to 99999.

  • reward_normalization (bool) – whether to normalize the rewards. Default is False.

  • deterministic_eval (bool) – whether to use deterministic actions during evaluation. Default is True.

  • action_scaling (bool) – whether to scale actions based on the action space. Default is True.

  • action_bound_method (str) – the method used to handle out-of-bound actions. Default is “clip”.

  • lr_scheduler (Optional[torch.optim.lr_scheduler.LambdaLR]) – the learning rate scheduler. Default is None.

See also

Please refer to BaseAgent and OnpolicyAgent for more details of usage.
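The eps_clip parameter is the clipping range of the PPO surrogate objective from the PPO paper cited above. The sketch below evaluates that objective for a single sample; the function name is hypothetical, and the sketch leaves out dual clipping, value clipping, and the Lagrangian cost term.

```python
def clipped_surrogate(ratio: float, advantage: float, eps_clip: float = 0.2) -> float:
    """PPO clipped objective for one sample: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped_ratio = min(max(ratio, 1.0 - eps_clip), 1.0 + eps_clip)
    return min(ratio * advantage, clipped_ratio * advantage)


# Large ratio with positive advantage: the gain is capped at (1 + eps_clip) * A.
capped_gain = clipped_surrogate(ratio=1.5, advantage=2.0)

# Small ratio with negative advantage: the pessimistic (clipped) value is used.
pessimistic = clipped_surrogate(ratio=0.5, advantage=-1.0)

# Ratio inside the clipping range: the objective is just ratio * advantage.
unclipped = clipped_surrogate(ratio=1.1, advantage=1.0)
```

Taking the minimum removes any incentive to move the probability ratio outside [1 − eps_clip, 1 + eps_clip] in the direction that would increase the objective.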

name = 'PPOLagAgent'

policy: BasePolicy

class fsrl.agent.TRPOLagAgent(env: Env, logger: BaseLogger = <fsrl.utils.logger.base_logger.BaseLogger object>, cost_limit: float = 10, device: str = 'cpu', thread: int = 4, seed: int = 10, lr: float = 0.0005, hidden_sizes: Tuple[int, ...] = (128, 128), unbounded: bool = False, last_layer_scale: bool = False, target_kl: float = 0.001, backtrack_coeff: float = 0.8, max_backtracks: int = 10, optim_critic_iters: int = 20, gae_lambda: float = 0.95, advantage_normalization: bool = True, use_lagrangian: bool = True, lagrangian_pid: Tuple = (0.05, 0.0005, 0.1), rescaling: bool = True, gamma: float = 0.99, max_batchsize: int = 99999, reward_normalization: bool = False, deterministic_eval: bool = True, action_scaling: bool = True, action_bound_method: str = 'clip', lr_scheduler: LambdaLR | None = None)

Bases: OnpolicyAgent

Trust Region Policy Optimization (TRPO) with PID Lagrangian agent.

For more details, please refer to https://arxiv.org/abs/1502.05477 (TRPO) and https://arxiv.org/abs/2007.03964 (PID Lagrangian).

Parameters:
  • env (gym.Env) – The environment to train and evaluate the agent on.

  • logger (BaseLogger) – A logger instance to log training and evaluation statistics, default to a dummy logger.

  • cost_limit (float) – the constraint limit(s) for the Lagrangian optimization (default: 10).

  • device (str) – The device to use for training and inference, default to “cpu”.

  • thread (int) – The number of threads to use for training, ignored if device is “cuda”, default to 4.

  • seed (int) – The random seed for reproducibility, default to 10.

  • lr (float) – The learning rate, default to 5e-4.

  • target_kl (float) – the target KL divergence for the line search (default: 0.001).

  • hidden_sizes (Tuple[int, ...]) – The sizes of the hidden layers for the policy and value networks, default to (128, 128).

  • unbounded (bool) – Whether the action space is unbounded, default to False.

  • last_layer_scale (bool) – Whether to scale the last layer output for the policy network, default to False.

  • backtrack_coeff (float) – the coefficient for backtracking during the line search (default: 0.8).

  • max_backtracks (int) – the maximum number of backtracks allowed during the line search (default: 10).

  • optim_critic_iters (int) – the number of optimization iterations for the critic network (default: 20).

  • gae_lambda (float) – the GAE lambda value (default: 0.95).

  • advantage_normalization (bool) – whether to normalize advantage (default: True).

  • use_lagrangian (bool) – whether to use the Lagrangian constraint optimization (default: True).

  • lagrangian_pid (Tuple) – the PID coefficients for the Lagrangian constraint optimization (default: (0.05, 0.0005, 0.1)).

  • rescaling (bool) – whether to use the rescaling trick for the Lagrangian multiplier, see Alg. 1 in http://proceedings.mlr.press/v119/stooke20a/stooke20a.pdf (default: True).

  • gamma (float) – the discount factor for future rewards (default: 0.99).

  • max_batchsize (int) – the maximum size of the batch when computing GAE, depends on the size of available memory and the memory cost of the model; should be as large as possible within the memory constraint. Default to 99999.

  • reward_normalization (bool) – whether to normalize rewards (default: False).

  • deterministic_eval (bool) – whether to use deterministic action selection during evaluation (default: True).

  • action_scaling (bool) – whether to scale the actions according to the action space bounds (default: True).

  • action_bound_method (str) – the method for handling actions that exceed the action space bounds (“clip” or other custom methods) (default: “clip”).

  • lr_scheduler (Optional[torch.optim.lr_scheduler.LambdaLR]) – learning rate scheduler for the optimizer (default: None).

See also

Please refer to BaseAgent and OnpolicyAgent for more details of usage.
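The target_kl, backtrack_coeff, and max_backtracks parameters describe a backtracking line search. The sketch below shrinks a proposed step by backtrack_coeff until a KL estimate falls within target_kl, giving up after max_backtracks attempts. The function name and the quadratic KL model are assumptions for illustration; a full TRPO line search typically also checks that the surrogate objective improves.

```python
from typing import Callable, Optional


def backtracking_step(
    kl_of_step: Callable[[float], float],
    full_step: float,
    target_kl: float = 0.001,
    backtrack_coeff: float = 0.8,
    max_backtracks: int = 10,
) -> Optional[float]:
    """Return the largest scaled step that satisfies the KL bound, or None."""
    for i in range(max_backtracks):
        step = (backtrack_coeff ** i) * full_step
        if kl_of_step(step) <= target_kl:
            return step
    return None  # no acceptable step found; the policy is left unchanged


def quadratic_kl(step: float) -> float:
    # Toy model of the KL divergence as a quadratic function of the step size.
    return 0.5 * step ** 2


accepted = backtracking_step(quadratic_kl, full_step=0.1)
rejected = backtracking_step(quadratic_kl, full_step=10.0)
```

In this toy setup the step of 0.1 is accepted after four shrinkages, while the step of 10.0 cannot be shrunk enough within ten attempts and is rejected.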

name = 'TRPOLagAgent'

policy: BasePolicy

learn(train_envs: Env | BaseVectorEnv, test_envs: Env | BaseVectorEnv | None = None, epoch: int = 300, episode_per_collect: int = 20, step_per_epoch: int = 10000, repeat_per_collect: int = 4, buffer_size: int = 100000, testing_num: int = 2, batch_size: int = 99999, reward_threshold: float = 450, save_interval: int = 4, resume: bool = False, save_ckpt: bool = True, verbose: bool = True, show_progress: bool = True) → None

See OnpolicyAgent.learn() for details.

class fsrl.agent.FOCOPSAgent(env: Env, logger: BaseLogger = <fsrl.utils.logger.base_logger.BaseLogger object>, cost_limit: float = 10, device: str = 'cpu', thread: int = 4, seed: int = 10, actor_lr: float = 0.0005, critic_lr: float = 0.001, hidden_sizes: Tuple[int, ...] = (128, 128), unbounded: bool = False, last_layer_scale: bool = False, auto_nu: bool = True, nu: float = 0.01, nu_max: float = 2.0, nu_lr: float = 0.01, l2_reg: float = 0.001, delta: float = 0.02, eta: float = 0.02, tem_lambda: float = 0.95, gae_lambda: float = 0.95, max_grad_norm: float | None = 0.5, advantage_normalization: bool = True, recompute_advantage: bool = False, gamma: float = 0.99, max_batchsize: int = 100000, reward_normalization: bool = False, deterministic_eval: bool = True, action_scaling: bool = True, action_bound_method: str = 'clip', lr_scheduler: LambdaLR | None = None)

Bases: OnpolicyAgent

First Order Constrained Optimization in Policy Space (FOCOPS) agent.

For more details, please refer to https://arxiv.org/pdf/2002.06506.pdf.

Parameters:
  • env (gym.Env) – The environment to train and evaluate the agent on.

  • logger (BaseLogger) – A logger instance to log training and evaluation statistics, default to a dummy logger.

  • cost_limit (float) – the constraint threshold. Default value is 10.

  • device (str) – The device to use for training and inference, default to “cpu”.

  • thread (int) – The number of threads to use for training, ignored if device is “cuda”, default to 4.

  • seed (int) – The random seed for reproducibility, default to 10.

  • actor_lr (float) – the learning rate of the actor network, default to 5e-4.

  • critic_lr (float) – the learning rate of the critic network, default to 1e-3.

  • hidden_sizes (Tuple[int, ...]) – The sizes of the hidden layers for the policy and value networks, default to (128, 128).

  • unbounded (bool) – Whether the action space is unbounded, default to False.

  • last_layer_scale (bool) – whether to scale the last layer output for the policy network, default to False.

  • auto_nu (bool) – whether to automatically tune “nu”, the cost coefficient. Default value is True.

  • nu (Union[float, Tuple[float, float, torch.Tensor]]) – cost coefficient. It can also be a tuple representing [nu_max, nu_lr, nu]. Default value is 0.01.

  • nu_max (float) – the max value of the cost coefficient if auto_nu is True. Default value is 2.

  • nu_lr (float) – the learning rate of nu if auto_nu is True. Default value is 0.01.

  • l2_reg (float) – L2 regularization rate. Default value is 1e-3.

  • delta (float) – early stop KL bound. Default value is 0.02.

  • eta (float) – KL bound for indicator function. Default value is 0.02.

  • tem_lambda (float) – inverse temperature lambda. Default value is 0.95.

  • gae_lambda (float) – GAE (Generalized Advantage Estimation) lambda for advantage computation. Default value is 0.95.

  • max_grad_norm (Optional[float]) – maximum gradient norm for gradient clipping, if specified. Default value is 0.5.

  • advantage_normalization (bool) – normalize advantage if True. Default value is True.

  • recompute_advantage (bool) – recompute advantage using the updated value function. Default value is False.

  • gamma (float) – the discount factor for future rewards. Default value is 0.99.

  • max_batchsize (int) – maximum batch size for the optimization. Default value is 100000.

  • reward_normalization (bool) – normalize the rewards if True. Default value is False.

  • deterministic_eval (bool) – whether to use deterministic action selection during evaluation. Default value is True.

  • action_scaling (bool) – whether to scale the actions according to the action space bounds. Default value is True.

  • action_bound_method (str) – the method for handling actions that exceed the action space bounds (“clip” or other custom methods). Default value is “clip”.

  • lr_scheduler (Optional[torch.optim.lr_scheduler.LambdaLR]) – learning rate scheduler for the optimizer. Default value is None.

See also

Please refer to BaseAgent and OnpolicyAgent for more details of usage.
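The auto_nu, nu, nu_max, and nu_lr parameters describe an automatically tuned cost coefficient. One way to read them, following the projected update in the FOCOPS paper cited above, is sketched below: nu moves in proportion to the constraint violation and is then clipped to [0, nu_max]. The function name is hypothetical, and the library's update may differ in detail.

```python
def update_nu(
    nu: float,
    mean_cost: float,
    cost_limit: float = 10.0,
    nu_lr: float = 0.01,
    nu_max: float = 2.0,
) -> float:
    """One projected gradient step on the cost coefficient nu."""
    nu = nu + nu_lr * (mean_cost - cost_limit)
    # Project back onto the allowed interval [0, nu_max].
    return min(max(nu, 0.0), nu_max)


raised = update_nu(nu=0.01, mean_cost=30.0)    # violation: nu increases
lowered = update_nu(nu=0.01, mean_cost=5.0)    # satisfied: nu is clipped at 0
saturated = update_nu(nu=1.9, mean_cost=100.0) # large violation: nu hits nu_max
```

The upper bound nu_max keeps the cost term from overwhelming the reward term when the policy is far from feasible.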

name = 'FOCOPSAgent'

policy: BasePolicy

class fsrl.agent.CPOAgent(env: Env, logger: BaseLogger = <fsrl.utils.logger.base_logger.BaseLogger object>, cost_limit: float = 10, device: str = 'cpu', thread: int = 4, seed: int = 10, lr: float = 0.001, hidden_sizes: Tuple[int, ...] = (128, 128), unbounded: bool = False, last_layer_scale: bool = False, target_kl: float = 0.01, backtrack_coeff: float = 0.8, damping_coeff: float = 0.1, max_backtracks: int = 10, optim_critic_iters: int = 10, l2_reg: float = 0.001, gae_lambda: float = 0.95, advantage_normalization: bool = True, gamma: float = 0.99, max_batchsize: int = 99999, reward_normalization: bool = False, deterministic_eval: bool = True, action_scaling: bool = True, action_bound_method: str = 'clip', lr_scheduler: LambdaLR | None = None)

Bases: OnpolicyAgent

A CPO (Constrained Policy Optimization) agent.

Parameters:
  • env (gym.Env) – The environment to train and evaluate the agent on.

  • logger (BaseLogger) – A logger instance to log training and evaluation statistics, default to a dummy logger.

  • cost_limit (float) – The maximum constraint cost allowed, default to 10.

  • device (str) – The device to use for training and inference, default to “cpu”.

  • thread (int) – The number of threads to use for training, ignored if device is “cuda”, default to 4.

  • seed (int) – The random seed for reproducibility, default to 10.

  • lr (float) – The learning rate, default to 1e-3.

  • hidden_sizes (Tuple[int, ...]) – The sizes of the hidden layers for the policy and value networks, default to (128, 128).

  • unbounded (bool) – Whether the action space is unbounded, default to False.

  • last_layer_scale (bool) – Whether to scale the last layer output for the policy network, default to False.

  • target_kl (float) – The target KL divergence for the policy update, default to 0.01.

  • backtrack_coeff (float) – The coefficient for backtracking, default to 0.8.

  • damping_coeff (float) – The coefficient for the damping, default to 0.1.

  • max_backtracks (int) – The maximum number of backtracking steps, default to 10.

  • optim_critic_iters (int) – The number of iterations to optimize the critic, default to 10.

  • l2_reg (float) – The L2 regularization coefficient, default to 0.001.

  • gae_lambda (float) – The lambda parameter for generalized advantage estimation, default to 0.95.

  • advantage_normalization (bool) – Whether to normalize advantages, default to True.

  • gamma (float) – The discount factor for future rewards and costs, default to 0.99.

  • max_batchsize (int) – The maximum batch size for computing advantages etc, default to 99999.

  • reward_normalization (bool) – Whether to normalize rewards, default to False.

  • deterministic_eval (bool) – Whether to use deterministic actions during evaluation, default to True.

  • action_scaling (bool) – Whether to scale the action space, default to True.

  • action_bound_method (str) – The method to bound actions (“clip” or “tanh”), default to “clip”.

  • lr_scheduler (Optional[torch.optim.lr_scheduler.LambdaLR]) – A learning rate scheduler, default to None.

See also

Please refer to BaseAgent and OnpolicyAgent for more details of usage.
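The gae_lambda and gamma parameters feed generalized advantage estimation (GAE). The recursion can be sketched on a short trajectory as follows; the helper is hypothetical, works on plain lists, and ignores episode boundaries. Since gamma is described as discounting both rewards and costs, the same recursion could presumably be applied to a cost signal with a cost critic.

```python
from typing import List, Sequence


def gae_advantages(
    rewards: Sequence[float],
    values: Sequence[float],
    last_value: float,
    gamma: float = 0.99,
    gae_lambda: float = 0.95,
) -> List[float]:
    """Compute GAE advantages backwards over a single trajectory."""
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * gae_lambda * running
        advantages[t] = running
        next_value = values[t]
    return advantages


advantages = gae_advantages(rewards=[1.0, 1.0], values=[0.0, 0.0], last_value=0.0)
print(advantages)
```

Setting gae_lambda to 0 reduces the estimate to the one-step TD error, while a value of 1 recovers the discounted Monte Carlo return minus the value baseline.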

name = 'CPOAgent'

policy: BasePolicy

learn(train_envs: Env | BaseVectorEnv, test_envs: Env | BaseVectorEnv | None = None, epoch: int = 300, episode_per_collect: int = 20, step_per_epoch: int = 10000, repeat_per_collect: int = 4, buffer_size: int = 100000, testing_num: int = 2, batch_size: int = 99999, reward_threshold: float = 450, save_interval: int = 4, resume: bool = False, save_ckpt: bool = True, verbose: bool = True, show_progress: bool = True) → None

See OnpolicyAgent.learn() for details.