Deep Deterministic Policy Gradient

Prerequisite review

DQN

  • loss function:
    $L = \left(r + \gamma \max_{a'} Q(s', a', w^{-}) - Q(s, a, w)\right)^2$
    Deep Q-Learning uses a Q-network to approximate the Q function in place of a huge state-action table. During training, the target values are computed from the Q-network with old (frozen) weights $w^{-}$ and used to update the current weights $w$. This looks very similar to supervised learning, except that in DQN the targets themselves shift as training proceeds.

  • algorithm
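The loss above can be made concrete with a toy numpy sketch. The linear Q-function, weight shapes, and the sampled transition below are all illustrative placeholders (not from the original notes); the point is only that the target $y$ uses the frozen weights $w^{-}$ while the prediction uses the current weights $w$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear Q-function: q_values(W, s) = W @ s gives one value per action.
n_states, n_actions, gamma = 4, 2, 0.99
W = rng.normal(size=(n_actions, n_states))   # current weights w
W_old = W.copy()                             # frozen weights w^-

def q_values(W, s):
    """Q(s, a) for every action a under weights W."""
    return W @ s

# One transition (s, a, r, s') as it might come from the replay memory.
s, a, r, s_next = rng.normal(size=n_states), 1, 0.5, rng.normal(size=n_states)

# The target uses the frozen weights w^-; only Q(s, a, w) is being fit.
y = r + gamma * np.max(q_values(W_old, s_next))
loss = (y - q_values(W, s)[a]) ** 2
print(loss)
```

In the real algorithm `W_old` is refreshed from `W` only every few thousand steps, which is what keeps the regression targets quasi-stationary.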

A2C

  • loss

  • algorithm
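Since the notes leave the A2C loss blank, here is a minimal numpy sketch of the standard advantage actor-critic objectives: a TD-error estimate of the advantage, a policy-gradient actor loss, and a squared-error critic loss. The linear actor/critic and all weights are hypothetical placeholders.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.99
theta = rng.normal(size=(n_actions, n_states))   # actor parameters (logits)
w = rng.normal(size=n_states)                    # critic parameters (V(s))

s, a, r, s_next = rng.normal(size=n_states), 2, 1.0, rng.normal(size=n_states)

pi = softmax(theta @ s)                  # pi(.|s), a sampled-action policy
v, v_next = w @ s, w @ s_next            # V(s), V(s')

advantage = r + gamma * v_next - v       # TD error as an estimate of A(s, a)
actor_loss = -np.log(pi[a]) * advantage  # policy-gradient term
critic_loss = advantage ** 2             # value-regression term
print(actor_loss, critic_loss)
```

Note the actor here samples actions from `pi`, which is exactly the point of contrast with DDPG's deterministic actor below.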

DDPG

A naive analysis may start from decomposing the name into three phrases: Deep + Deterministic + Policy Gradient.

Deep

In fact, in DQN we have two sets of weights, i.e. two networks: a target network with old weights used to predict the future Q value, and an online network that is being updated. We also perform memory (experience) replay. DDPG inherits both of these tricks.
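The replay part can be sketched as a small fixed-size memory with uniform sampling (class name and dummy transitions are my own illustration):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size memory of (s, a, r, s_next, done) transitions."""
    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)  # old transitions fall off

    def push(self, transition):
        self.memory.append(transition)

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation of trajectories.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)

buffer = ReplayBuffer(capacity=1000)
for t in range(100):
    buffer.push((t, 0, 0.0, t + 1, False))   # dummy transitions
batch = buffer.sample(32)
```

Training on these decorrelated minibatches, together with the slow-moving target network, is what makes the "deep" part stable.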

Deterministic

The actor outputs a deterministic action $a = \mu(s)$ rather than sampling one from a probability distribution.
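Formally, for a deterministic policy $\mu_\theta$, the deterministic policy gradient theorem (Silver et al.) replaces the expectation over actions with the chain rule through the critic:

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{s \sim \rho^{\mu}}\!\left[
      \nabla_\theta \mu_\theta(s)\,
      \nabla_a Q(s, a)\big|_{a = \mu_\theta(s)}
    \right]
```

Because no action distribution is sampled, exploration must instead come from noise added to $\mu_\theta(s)$ when acting in the environment.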

Policy Gradient

  • actor
    Policy Gradient

  • critic
    Value-based

  • algorithm
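To fill in the algorithm bullet, here is a minimal sketch of one DDPG update step. To keep it self-contained I use linear stand-ins for the actor and critic (so the gradients can be written by hand); every weight, learning rate, and the function name are illustrative assumptions, not the original notes' code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a = 3, 2
gamma, tau, lr_q, lr_mu = 0.99, 0.005, 1e-2, 1e-3

# Linear stand-ins for the deep networks:
#   critic Q(s, a) = w_s @ s + w_a @ a,   actor mu(s) = M @ s
w_s, w_a = rng.normal(size=n_s), rng.normal(size=n_a)
M = rng.normal(size=(n_a, n_s))
w_s_t, w_a_t, M_t = w_s.copy(), w_a.copy(), M.copy()   # target networks

def ddpg_step(s, a, r, s_next):
    global w_s, w_a, M, w_s_t, w_a_t, M_t
    # 1) Critic: regress Q(s, a) toward the bootstrapped target y,
    #    computed entirely with the slow-moving target networks.
    a_next = M_t @ s_next
    y = r + gamma * (w_s_t @ s_next + w_a_t @ a_next)
    td_error = y - (w_s @ s + w_a @ a)
    w_s = w_s + lr_q * td_error * s      # descent on 0.5 * td_error^2
    w_a = w_a + lr_q * td_error * a
    # 2) Actor: ascend grad_a Q(s,a)|_{a=mu(s)} * grad_theta mu(s).
    #    For this linear critic grad_a Q = w_a, so grad_M J = outer(w_a, s).
    M = M + lr_mu * np.outer(w_a, s)
    # 3) Soft update: theta' <- tau * theta + (1 - tau) * theta'
    w_s_t = tau * w_s + (1 - tau) * w_s_t
    w_a_t = tau * w_a + (1 - tau) * w_a_t
    M_t = tau * M + (1 - tau) * M_t

s = rng.normal(size=n_s)
a = M @ s + 0.1 * rng.normal(size=n_a)   # exploration noise on mu(s)
ddpg_step(s, a, r=1.0, s_next=rng.normal(size=n_s))
```

The three numbered steps are exactly the combination described above: DQN-style target networks and replay for the critic (the "deep" part), a deterministic actor, and a policy-gradient update through the critic.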