In the context of reinforcement learning (RL), the term “batch size” carries a slightly different nuance than in supervised learning, but it still refers to a collection of samples processed together. Let’s break this down in more detail.
In supervised learning, the batch size refers to the number of samples (data points) processed before the model’s parameters are updated. These samples typically come from a labeled dataset (i.e., inputs paired with correct outputs). Each sample represents an independent observation.
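To make the supervised case concrete, here is a minimal PyTorch sketch (the dataset, model, and batch size of 32 are illustrative assumptions, not taken from any particular project): with batch_size=32, the optimizer performs one parameter update for every 32 labeled samples.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Hypothetical labeled dataset: 1,000 input-output pairs.
X = torch.randn(1000, 8)
y = torch.randn(1000, 1)

model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = torch.nn.MSELoss()

# batch_size=32: 32 labeled samples are processed per parameter update.
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

for inputs, targets in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()  # one update per batch of 32 independent samples
```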
In reinforcement learning, the concept of a “sample” is a bit more complex. Unlike supervised learning, where each sample is typically a fixed input-output pair, in RL a sample generally refers to an experience or trajectory gathered by interacting with the environment. This could include:
- the current state the agent observed,
- the action it took,
- the reward it received, and
- the next state it transitioned to (often along with a flag indicating whether the episode ended).
These individual experiences are typically stored in a replay buffer (in methods like DQN) or collected as part of a trajectory in policy gradient methods.
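As a rough sketch of that storage (the capacity, tuple layout, and class name are illustrative assumptions rather than any specific library’s API), a replay buffer can be as simple as a bounded queue of transitions:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity=100_000):
        # Oldest experiences are evicted automatically once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniformly sample a batch of stored transitions.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```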
In reinforcement learning, the batch size typically refers to the number of experience samples processed together during training. However, the exact interpretation varies slightly depending on the RL algorithm:
In algorithms like Deep Q-Networks (DQN), experiences are collected as the agent interacts with the environment. These experiences are often stored in a replay buffer.
During training, the agent samples a batch of these experiences (say, 32 or 64 samples) from the replay buffer to compute updates to the Q-network.
So here, batch size refers to the number of (state, action, reward, next state) tuples sampled from the replay buffer for each gradient update.
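A hedged sketch of such an update, reusing the ReplayBuffer sketch above and assuming PyTorch with hypothetical q_net and target_net modules, might look like this:

```python
import numpy as np
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, buffer, batch_size=32, gamma=0.99):
    # Draw a batch of (state, action, reward, next_state, done) tuples
    # from the replay buffer; batch_size controls how many per gradient step.
    states, actions, rewards, next_states, dones = buffer.sample(batch_size)

    states = torch.as_tensor(np.asarray(states), dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.int64).unsqueeze(1)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    next_states = torch.as_tensor(np.asarray(next_states), dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)

    # Q-values of the actions that were actually taken, for the whole batch at once.
    q_values = q_net(states).gather(1, actions).squeeze(1)

    # Bootstrapped targets: r + gamma * max_a' Q_target(s', a'), with the
    # bootstrap term zeroed out at terminal states.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    loss = F.mse_loss(q_values, targets)  # one gradient step over the whole batch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Sampling uniformly from the buffer also breaks the temporal correlation between consecutive experiences, which is one reason DQN updates on batches of stored transitions rather than on the most recent step alone.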
In policy gradient methods, an agent collects multiple trajectories (sequences of experiences) by interacting with the environment. After a set number of trajectories (or steps), a batch of these trajectories is used to update the policy.
The batch size in this context can refer to the number of trajectories or the number of timesteps across all collected trajectories that are used for the policy update.
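As an illustration of the trajectory-based interpretation, the sketch below (assuming a Gymnasium-style discrete-action environment and a hypothetical policy network; the helper names collect_batch and reinforce_update are made up for this example) gathers a fixed number of trajectories and then performs a single REINFORCE-style update over every timestep in that batch:

```python
import torch

def collect_batch(env, policy, batch_trajectories=8, gamma=0.99):
    """Roll out several trajectories; the batch is all of their timesteps."""
    log_probs, returns = [], []
    for _ in range(batch_trajectories):
        obs, _ = env.reset()
        ep_log_probs, ep_rewards, done = [], [], False
        while not done:
            logits = policy(torch.as_tensor(obs, dtype=torch.float32))
            dist = torch.distributions.Categorical(logits=logits)
            action = dist.sample()
            ep_log_probs.append(dist.log_prob(action))
            obs, reward, terminated, truncated, _ = env.step(action.item())
            ep_rewards.append(float(reward))
            done = terminated or truncated
        # Discounted return-to-go for each timestep of this trajectory.
        g, ep_returns = 0.0, []
        for r in reversed(ep_rewards):
            g = r + gamma * g
            ep_returns.append(g)
        ep_returns.reverse()
        log_probs.extend(ep_log_probs)
        returns.extend(ep_returns)
    return torch.stack(log_probs), torch.as_tensor(returns, dtype=torch.float32)

def reinforce_update(optimizer, log_probs, returns):
    # One policy-gradient step over every timestep collected in the batch.
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # normalize for stability
    loss = -(log_probs * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Here the “batch size” could be reported either as 8 trajectories or as the total number of timesteps those trajectories contain, which is exactly the ambiguity described above.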
Actor-critic methods similarly process batches of trajectories or timesteps at once before computing gradient updates to the actor and critic networks.
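A compact, hedged sketch of one such batched update (an A2C-flavored illustration with hypothetical actor and critic modules; the 0.5 weight on the critic loss is an arbitrary choice):

```python
import torch
import torch.nn.functional as F

def actor_critic_update(actor, critic, optimizer, states, actions, returns):
    """One update over a batch of timesteps gathered from several rollouts."""
    values = critic(states).squeeze(-1)       # value estimate per timestep
    advantages = returns - values.detach()    # how much better than predicted

    dist = torch.distributions.Categorical(logits=actor(states))
    actor_loss = -(dist.log_prob(actions) * advantages).mean()
    critic_loss = F.mse_loss(values, returns)

    loss = actor_loss + 0.5 * critic_loss     # joint step over the whole batch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```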