Ppo

Bases: HParams

Attributes (a hedged example container is sketched after this list):

- Whether to anneal the learning rate linearly to 0 at the end of training.
- Number of environment frames to train for.
- PPO clip parameter.
- Whether to clip the value loss in the PPO loss.
- Entropy coefficient in the total loss.
- Lambda parameter of the TD(lambda) return.
- Starting learning rate.
- Maximum gradient norm for clipping.
- Whether to normalise the advantages in the PPO loss.
- Number of parallel environments to run.
- Number of optimisation epochs over each batch of collected experience.
- Number of minibatches to split each batch into for training.
- Number of steps to run in each environment per update.
- Value function coefficient in the total loss.
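
For concreteness, here is a minimal sketch of what such a hyperparameter container could look like as a Python dataclass. All field names and default values below are illustrative assumptions chosen to match the descriptions above, not this library's actual API; the real class derives from HParams, and its field names and defaults may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ppo:
    # Field names and defaults are illustrative; only the meaning of each
    # field is taken from the attribute descriptions above. Defaults follow
    # common PPO settings and are assumptions, not the library's values.
    annealing: bool = True            # anneal the learning rate linearly to 0
    budget: int = 10_000_000          # environment frames to train for
    clip_ratio: float = 0.2           # PPO clip parameter (epsilon)
    clip_value_loss: bool = True      # clip the value loss in the PPO loss
    entropy_coefficient: float = 0.01 # entropy term weight in the total loss
    lambda_: float = 0.95             # lambda of the TD(lambda) return
    learning_rate: float = 2.5e-4     # starting learning rate
    max_grad_norm: float = 0.5        # gradient-norm clipping threshold
    normalise_advantages: bool = True # normalise advantages in the PPO loss
    n_actors: int = 8                 # parallel environments to run
    n_epochs: int = 4                 # optimisation epochs per collected batch
    n_minibatches: int = 4            # minibatches each batch is split into
    n_steps: int = 128                # steps per environment per update
    value_coefficient: float = 0.5    # value function weight in the total loss
```

Taken together, these settings determine the shape of each update: every rollout collects n_actors * n_steps frames (8 * 128 = 1024 with the illustrative defaults), which are split into n_minibatches minibatches of 256 frames each and iterated over for n_epochs epochs before the next rollout begins.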