Over the past few months, I realized that none of the papers proposing agents trained with end-to-end RL were actually end-to-end. They separate the perception training from the planning, so they should not be classified as end-to-end.
Based on the work from the previous months, I think we can be the first to propose an agent that can drive in complex urban environments trained with end-to-end RL.
The common approach is to use large image encoders (ResNet, Inception) and train the perception module separately on a huge dataset. The compact image representation is then used as the RL state, and the RL algorithm is trained on top of it.
To avoid training the perception separately, we have to reduce the size of the image encoder, because the RL loss provides only a weak learning signal.
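To make this concrete, here is a minimal sketch of what a small encoder trained purely from the RL loss could look like; the layer sizes and feature dimension are illustrative assumptions, not our exact architecture.

```python
# Minimal sketch of a small image encoder trained end-to-end from the RL loss,
# in contrast to a large pretrained ResNet/Inception. Layer sizes and the
# feature dimension are illustrative assumptions, not our exact architecture.
import torch
import torch.nn as nn

class SmallEncoder(nn.Module):
    def __init__(self, in_channels: int = 3, feature_dim: int = 50):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.LazyLinear(feature_dim)  # infers the flattened conv output size

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs: (B, C, H, W) image batch, values in [0, 1]
        x = self.convs(obs)
        return torch.tanh(self.head(x.flatten(start_dim=1)))
```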
Based on the results from November 2022, we can conclude that reducing the size of the encoder is not enough, and therefore additional training techniques are required.
The rest of this post covers the techniques that were tested, as well as the main conclusions.
Deep Deterministic Policy Gradient (DDPG) with augmented data
Paper: Mastering Visual Continuous Control: Improved Data-Augmented Reinforcement Learning
This paper replaces Soft Actor-Critic (SAC) with the Deep Deterministic Policy Gradient (DDPG) RL algorithm, and also changes the exploration behaviour.
In this approach, exploration is added by sampling actions from a Gaussian distribution centred on the policy's mean, with a sigma value that decays linearly over time. This sigma schedule gives us much more control over exploration, which is the main limitation of SAC.
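As a minimal sketch of this scheduled exploration (the schedule values below are assumptions, not the paper's or our exact numbers):

```python
# Minimal sketch of exploration with a linearly decaying Gaussian noise scale,
# in the spirit of DrQ-v2. The schedule (1.0 -> 0.1 over 100k steps) is an
# illustrative assumption, not the exact values used.
import torch

def noise_std(step: int, start: float = 1.0, end: float = 0.1, decay_steps: int = 100_000) -> float:
    """Linearly anneal the exploration standard deviation."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)

def select_action(policy_mean: torch.Tensor, step: int) -> torch.Tensor:
    """Sample an action from a Gaussian centred on the deterministic policy output."""
    std = noise_std(step)
    action = policy_mean + std * torch.randn_like(policy_mean)
    return action.clamp(-1.0, 1.0)  # assumes actions normalized to [-1, 1]
```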
Changing from SAC to DDPG led to faster and more effective learning:
[Video: SAC]
[Video: DDPG]
Based on the videos, the next problem to solve is intersections.
Waypoint encoder as a neural network
Instead of giving a navigation command to the agent, we can give it the next 10 waypoints from the global planner and encode that information into the state using a neural network:
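A minimal sketch of such a waypoint encoder, assuming each waypoint is an (x, y) position in the ego-vehicle frame and using a small MLP with illustrative sizes:

```python
# Minimal sketch of a waypoint encoder: the next 10 (x, y) waypoints from the
# global planner are flattened and mapped to a compact feature vector that can
# be concatenated with the image features. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class WaypointEncoder(nn.Module):
    def __init__(self, num_waypoints: int = 10, out_dim: int = 32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_waypoints * 2, 64), nn.ReLU(),
            nn.Linear(64, out_dim), nn.ReLU(),
        )

    def forward(self, waypoints: torch.Tensor) -> torch.Tensor:
        # waypoints: (B, num_waypoints, 2) positions in the ego-vehicle frame
        return self.mlp(waypoints.flatten(start_dim=1))

# The RL state is then e.g. torch.cat([image_features, waypoint_features], dim=-1)
```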
Traffic Lights
The main problem now is traffic lights. A traffic light occupies only about 25 pixels in the image, and the agent struggles to associate those pixels with the penalty received for driving through a red light.
To force the latent representation of the image to contain traffic-light features, a traffic-light classification decoder was created:
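A minimal sketch of this auxiliary decoder; the label set and layer sizes are assumptions for illustration:

```python
# Minimal sketch of an auxiliary traffic-light classifier on top of the image
# latent. The label set and layer sizes are assumptions; its cross-entropy loss
# is backpropagated into the encoder so the latent keeps traffic-light features.
import torch
import torch.nn as nn

TL_CLASSES = ["none", "green", "yellow", "red"]  # assumed label set

class TrafficLightDecoder(nn.Module):
    def __init__(self, latent_dim: int = 50, num_classes: int = len(TL_CLASSES)):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, latent: torch.Tensor) -> torch.Tensor:
        return self.classifier(latent)  # logits over traffic-light classes

# Auxiliary loss added to the encoder/critic update:
# tl_loss = nn.functional.cross_entropy(decoder(latent), tl_labels)
```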
Although this helped, it wasn't sufficient on its own. After some failed experiments, I found that a learning rate of 1e-4, as is common in DDPG, wasn't the optimal choice.
Based on my experiments, representation learning in autonomous driving poses additional challenges compared to Atari games or humanoid tasks. Because of this, the encoder's learning rate should be 1e-3 in order to reach meaningful representations as soon as possible.
Since the encoder and the critic are updated through the same loss, they should have the same learning rate. However, I'm still not sure about the learning rate of the actor, and I'm currently running experiments to figure this out.
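In code, this boils down to separate optimizers, something like the sketch below (the actor's 1e-4 is a placeholder, not a conclusion):

```python
# Sketch of the optimizer setup implied above: the encoder and critic share the
# 1e-3 learning rate because they are updated through the same loss, while the
# actor's learning rate is still an open question (1e-4 here is a placeholder).
import itertools
import torch

def build_optimizers(encoder: torch.nn.Module, critic: torch.nn.Module, actor: torch.nn.Module):
    encoder_critic_opt = torch.optim.Adam(
        itertools.chain(encoder.parameters(), critic.parameters()), lr=1e-3
    )
    actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)  # still under investigation
    return encoder_critic_opt, actor_opt
```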
I also implemented a method to regularize the encoder's gradients (Stabilizing Off-Policy Deep Reinforcement Learning from Pixels).
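For reference, the core idea of that regularizer is to smooth the encoder's feature maps by mixing each spatial location with a randomly shifted neighbour; the sketch below captures that idea only, with a fixed shift magnitude instead of the paper's adaptive one.

```python
# Rough sketch of local feature mixing over conv feature maps, the idea behind
# the regularizer in "Stabilizing Off-Policy Deep RL from Pixels". This is NOT
# the paper's exact layer: the real method adapts the shift magnitude, while
# here `max_shift` is a fixed, hypothetical hyperparameter.
import torch
import torch.nn.functional as F

def local_feature_mixing(features: torch.Tensor, max_shift: float = 1.0) -> torch.Tensor:
    """Resample each spatial location at a randomly perturbed position.

    features: (B, C, H, W) encoder feature maps.
    max_shift: maximum perturbation in pixels (assumed value).
    """
    b, c, h, w = features.shape
    # Base sampling grid in normalized [-1, 1] coordinates (x, y order).
    ys = torch.linspace(-1.0, 1.0, h, device=features.device)
    xs = torch.linspace(-1.0, 1.0, w, device=features.device)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    grid = torch.stack((grid_x, grid_y), dim=-1).expand(b, h, w, 2)
    # Random per-location offsets, converted from pixels to normalized units.
    noise = (torch.rand(b, h, w, 2, device=features.device) * 2 - 1) * max_shift
    noise[..., 0] *= 2.0 / max(w - 1, 1)
    noise[..., 1] *= 2.0 / max(h - 1, 1)
    # Bilinear resampling blends each feature with its local neighbourhood.
    return F.grid_sample(features, grid + noise, mode="bilinear",
                         padding_mode="border", align_corners=True)
```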
The combination of all these features led to the following results:
Failed attempts
Here is a list of the approaches that I implemented and that didn't work out.
- Using image encoders with attention: paper
- Using vision transformers as image encoders: paper
- Bisimulation Learning: paper
- Critic with dropouts: paper
- Using supervised learning on the predicted speed
What's Next?
Train our approach in an environment with vehicles and pedestrians. Also train all the end-to-end RL approaches to compare against: SAC, SAC+AE, CURL, DRQ, DRQV2, ALIX.
In addition, implement the following new features:
- LSTMs to process temporal information
- Waypoint encoder using convolutional layers
- Prioritized experience replay
- Autoencoder
- Contrastive Learning
- Decoder to predict the distance to the next vehicle