Deep Deterministic Policy Gradient

Prerequisite review

DQN

  • loss function:
    $L = \left(r + \gamma \max_{a'} Q(s', a', w^{-}) - Q(s, a, w)\right)^2$
    Deep Q-Learning uses a Q-network to approximate the Q function in place of a huge state-action table. During training, the target values are computed from the Q-network with old (frozen) weights $w^{-}$ and used to update the current weights $w$. This looks very similar to supervised learning, except that in DQN the targets themselves shift as training proceeds.

  • algorithm
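The loss above can be made concrete with a toy numpy sketch. The linear Q-function, weight shapes, and the sampled transition below are all illustrative placeholders (not from the original notes); the point is only that the target $y$ uses the frozen weights $w^{-}$ while the prediction uses the current weights $w$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear Q-function: q_values(W, s) = W @ s gives one value per action.
n_states, n_actions, gamma = 4, 2, 0.99
W = rng.normal(size=(n_actions, n_states))   # current weights w
W_old = W.copy()                             # frozen weights w^-

def q_values(W, s):
    """Q(s, a) for every action a under weights W."""
    return W @ s

# One transition (s, a, r, s') as it might come from the replay memory.
s, a, r, s_next = rng.normal(size=n_states), 1, 0.5, rng.normal(size=n_states)

# The target uses the frozen weights w^-; only Q(s, a, w) is being fit.
y = r + gamma * np.max(q_values(W_old, s_next))
loss = (y - q_values(W, s)[a]) ** 2
print(loss)
```

In the real algorithm `W_old` is refreshed from `W` only every few thousand steps, which is what keeps the regression targets quasi-stationary.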

A2C

  • loss

  • algorithm
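Since the notes leave the A2C loss blank, here is a minimal numpy sketch of the standard advantage actor-critic objectives: a TD-error estimate of the advantage, a policy-gradient actor loss, and a squared-error critic loss. The linear actor/critic and all weights are hypothetical placeholders.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.99
theta = rng.normal(size=(n_actions, n_states))   # actor parameters (logits)
w = rng.normal(size=n_states)                    # critic parameters (V(s))

s, a, r, s_next = rng.normal(size=n_states), 2, 1.0, rng.normal(size=n_states)

pi = softmax(theta @ s)                  # pi(.|s), a sampled-action policy
v, v_next = w @ s, w @ s_next            # V(s), V(s')

advantage = r + gamma * v_next - v       # TD error as an estimate of A(s, a)
actor_loss = -np.log(pi[a]) * advantage  # policy-gradient term
critic_loss = advantage ** 2             # value-regression term
print(actor_loss, critic_loss)
```

Note the actor here samples actions from `pi`, which is exactly the point of contrast with DDPG's deterministic actor below.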

DDPG

A naive analysis may start from decomposing the name into three phrases: Deep + Deterministic + Policy Gradient.

Deep

In fact, in DQN we have two sets of weights, i.e. two networks: a target network with old weights used to predict the future Q value, and an online network that is being updated. We also perform memory (experience) replay. DDPG inherits both of these tricks.
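The replay part can be sketched as a small fixed-size memory with uniform sampling (class name and dummy transitions are my own illustration):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size memory of (s, a, r, s_next, done) transitions."""
    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)  # old transitions fall off

    def push(self, transition):
        self.memory.append(transition)

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation of trajectories.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)

buffer = ReplayBuffer(capacity=1000)
for t in range(100):
    buffer.push((t, 0, 0.0, t + 1, False))   # dummy transitions
batch = buffer.sample(32)
```

Training on these decorrelated minibatches, together with the slow-moving target network, is what makes the "deep" part stable.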

Deterministic

The actor outputs a deterministic action $a = \mu(s)$ rather than sampling one from a probability distribution.
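Formally, for a deterministic policy $\mu_\theta$, the deterministic policy gradient theorem (Silver et al.) replaces the expectation over actions with the chain rule through the critic:

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{s \sim \rho^{\mu}}\!\left[
      \nabla_\theta \mu_\theta(s)\,
      \nabla_a Q(s, a)\big|_{a = \mu_\theta(s)}
    \right]
```

Because no action distribution is sampled, exploration must instead come from noise added to $\mu_\theta(s)$ when acting in the environment.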

Policy Gradient

  • actor
    Policy Gradient

  • critic
    Value-based

  • algorithm
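To fill in the algorithm bullet, here is a minimal sketch of one DDPG update step. To keep it self-contained I use linear stand-ins for the actor and critic (so the gradients can be written by hand); every weight, learning rate, and the function name are illustrative assumptions, not the original notes' code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a = 3, 2
gamma, tau, lr_q, lr_mu = 0.99, 0.005, 1e-2, 1e-3

# Linear stand-ins for the deep networks:
#   critic Q(s, a) = w_s @ s + w_a @ a,   actor mu(s) = M @ s
w_s, w_a = rng.normal(size=n_s), rng.normal(size=n_a)
M = rng.normal(size=(n_a, n_s))
w_s_t, w_a_t, M_t = w_s.copy(), w_a.copy(), M.copy()   # target networks

def ddpg_step(s, a, r, s_next):
    global w_s, w_a, M, w_s_t, w_a_t, M_t
    # 1) Critic: regress Q(s, a) toward the bootstrapped target y,
    #    computed entirely with the slow-moving target networks.
    a_next = M_t @ s_next
    y = r + gamma * (w_s_t @ s_next + w_a_t @ a_next)
    td_error = y - (w_s @ s + w_a @ a)
    w_s = w_s + lr_q * td_error * s      # descent on 0.5 * td_error^2
    w_a = w_a + lr_q * td_error * a
    # 2) Actor: ascend grad_a Q(s,a)|_{a=mu(s)} * grad_theta mu(s).
    #    For this linear critic grad_a Q = w_a, so grad_M J = outer(w_a, s).
    M = M + lr_mu * np.outer(w_a, s)
    # 3) Soft update: theta' <- tau * theta + (1 - tau) * theta'
    w_s_t = tau * w_s + (1 - tau) * w_s_t
    w_a_t = tau * w_a + (1 - tau) * w_a_t
    M_t = tau * M + (1 - tau) * M_t

s = rng.normal(size=n_s)
a = M @ s + 0.1 * rng.normal(size=n_a)   # exploration noise on mu(s)
ddpg_step(s, a, r=1.0, s_next=rng.normal(size=n_s))
```

The three numbered steps are exactly the combination described above: DQN-style target networks and replay for the critic (the "deep" part), a deterministic actor, and a policy-gradient update through the critic.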