
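This is Hugo's deep learning tutorial, and it is worth working through slowly. Many of the CMU 10707 slides are based on it (Ruslan said so himself), so if you cannot find the CMU 10707 videos, this tutorial is an excellent substitute.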

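Here I want to collect the derivations of the most basic neural network formulas and work through them once more myself, so that I feel confident about them: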


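  • layer pre-activation for $k > 0$ $(h^{(0)}(x) = x)$
    $a^{(k)}(x) = b^{(k)} + W^{(k)}h^{(k - 1)}(x)$
    An easy way to remember this without mixing things up: k is the index of the layer of neurons, $h^{(k - 1)}(x)$ is the output of the previous layer, and the weights $W^{(k)}$ and bias $b^{(k)}$ belong to layer k. The input comes from the previous layer's output, but the parameters still count as part of this layer.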
  • hidden layer activation (k from 1 to L)
    $h^{(k)}(x) = g(a^{(k)}(x))$

  • output layer activation $k = L + 1$:
    $h^{(L + 1)}(x) = o(a^{(L + 1)}(x)) = f(x)$
    Unlike the hidden layers, this is the final layer, and its activation function usually differs from the one used in the hidden layers; for classification it is typically softmax.
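The three formulas above (pre-activation, hidden activation, output activation) can be sketched as one forward pass. This is only a minimal sketch: the layer sizes, the choice of tanh for $g$, and the names `forward`, `weights`, `biases` are assumptions made for illustration, not something fixed by the tutorial.

```python
import numpy as np

def softmax(a):
    # Subtracting max(a) avoids overflow and does not change the result.
    e = np.exp(a - np.max(a))
    return e / e.sum()

def forward(x, weights, biases, g=np.tanh):
    """Return f(x) = h^{(L+1)}(x) for a fully connected network.

    weights[k-1] and biases[k-1] hold W^{(k)} and b^{(k)} for k = 1..L+1.
    """
    h = x                                   # h^{(0)}(x) = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = b + W @ h                       # a^{(k)} = b^{(k)} + W^{(k)} h^{(k-1)}
        h = g(a)                            # h^{(k)} = g(a^{(k)})
    a_out = biases[-1] + weights[-1] @ h    # a^{(L+1)}
    return softmax(a_out)                   # output activation o(.)

# Arbitrary example: 4 inputs, one hidden layer of 5 units, 3 classes.
rng = np.random.default_rng(0)
sizes = [4, 5, 3]
weights = [rng.normal(size=(n_out, n_in)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]
f_x = forward(rng.normal(size=4), weights, biases)
```

The output `f_x` is a vector of class probabilities: every entry is positive and the entries sum to 1.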

  • softmax activation function at the output:
    $o(a) = softmax(a) = [\frac{exp(a_1)}{\sum_c exp(a_c)} … \frac{exp(a_C)}{\sum_c exp(a_c)}]^T$

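    • reason 1: we want the effect of a feature on the probability to be multiplicative
    • reason 2: the objective for multi-class classification is usually chosen to be cross-entropy, … (to be filled in after finishing the whole derivation)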
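A small numerical sketch of the softmax formula and of the "multiplicative" remark: the ratio of two class probabilities depends only on the difference of their pre-activations, $exp(a_i - a_j)$. The example values are arbitrary, and the max-subtraction is a standard numerical-stability trick rather than part of the formula above.

```python
import numpy as np

def softmax(a):
    a = np.asarray(a, dtype=float)
    e = np.exp(a - a.max())   # shifting every a_c by a constant leaves softmax unchanged
    return e / e.sum()

a = np.array([1.0, 2.0, 0.5])   # arbitrary pre-activations
p = softmax(a)

# Odds between class 1 and class 0 equal exp(a[1] - a[0]):
# adding d to a[1] multiplies these odds by exp(d).
ratio = p[1] / p[0]
```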
  • activation function:

    • sigmoid:

      • formula:
        $\sigma(x) = \frac{1}{1 + e^{-x}}$, $\sigma'(x) = \frac{e^x}{(1 + e^x)^2} = (1 - \sigma(x)) \sigma(x)$
      • shortcomings:
        • vanishing gradients
        • output is not zero-centered (not symmetric about the origin)
        • exp is time-consuming to compute
    • tanh:

      • formula:
        $tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = \frac{2}{1 + e^{-2x}} - 1 = 2 \sigma(2x) - 1$, $tanh'(x) = 1 - tanh^2(x)$
      • it seems to mainly fix the problem of the output not being symmetric about the origin
    • relu:

      • formula: $f(x) = max(0, x)$
      • advantages:
      • disadvantages:
    • leaky relu:

      • formula: $f(x) = max(\epsilon x, x)$
      • advantages:
    • maxout:

      • formula: a generalization of relu and leaky relu: $f(x) = max(w_1^T x + b_1, w_2^T x + b_2)$
      • advantages:
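The activation functions listed above can be written down directly. This is a sketch under assumptions: the function names and the default `eps=0.01` for leaky relu are choices made here, not values given in the notes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)              # sigma'(x) = (1 - sigma(x)) sigma(x)

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2      # tanh'(x) = 1 - tanh(x)^2

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, eps=0.01):
    # max(eps * x, x) gives slope eps for x < 0 when 0 < eps < 1.
    return np.maximum(eps * x, x)

def maxout(x, w1, b1, w2, b2):
    # Maximum of two affine functions of x; relu is the case w1 = 0, b1 = 0, w2 = 1, b2 = 0.
    return np.maximum(w1 @ x + b1, w2 @ x + b2)
```

For example, `np.tanh(x)` and `2 * sigmoid(2 * x) - 1` agree for any `x`, which is the identity from the tanh formula.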

loss function:

  • stochastic gradient descent (SGD):

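    • initialize $\theta$ ($\theta = \{W^{(1)}, b^{(1)}, …, W^{(L + 1)}, b^{(L + 1)}\}$)
    • algorithm:
      for N iterations (each iteration is one epoch):
        for each training example $(x_{(t)}, y_{(t)})$:
          $\delta = -\nabla_{\theta}l(f(x_{(t)}; \theta), y_{(t)}) - \lambda\nabla_{\theta}\Omega(\theta)$
          $\theta \leftarrow \theta + \alpha \delta$
      (here $\Omega(\theta)$ denotes the regularization term, e.g. L1 or L2, and $\alpha$ is the learning rate)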
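    • pros and cons of SGD
      • disadvantages:
        • choosing a suitable learning rate is hard: the same learning rate is used for every parameter update. For sparse data or features, we may want larger updates for rarely occurring features and smaller updates for frequent ones, which plain SGD cannot do
        • noisier than BGD
    • comparison with batch gradient descent (BGD)
      $\theta \leftarrow \theta + \alpha\frac{1}{m}\sum_{i}(y_i - f(x_i;\theta))x_i$ (MSE, for a linear model $f(x;\theta) = \theta^T x$)

      • disadvantage: when m is large, training is slow
      • advantage: more stable than SGD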
    • mini-batch GD:
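      a compromise between the two, much like n-step TD sits between TD and Monte Carlo in reinforcement learning

      • advantages:
        • gives a more accurate estimate of the average loss than a single example
        • can leverage matrix operations, and each update costs less than in BGD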
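SGD, mini-batch GD and BGD differ only in how many examples go into each update, so one function with a `batch_size` argument can sketch all three. The example below assumes a linear model with squared error and noise-free synthetic data; the function name, learning rate and epoch count are illustrative choices.

```python
import numpy as np

def minibatch_gd(X, y, batch_size, alpha=0.1, epochs=200, seed=0):
    """Gradient descent on the MSE of a linear model f(x; theta) = theta^T x.

    batch_size=1 is SGD, batch_size=len(X) is BGD, anything between is mini-batch GD.
    """
    rng = np.random.default_rng(seed)
    m, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(m)                  # reshuffle every epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            residual = y[idx] - X[idx] @ theta      # y_i - f(x_i; theta)
            theta = theta + alpha * X[idx].T @ residual / len(idx)
    return theta

# Synthetic, noise-free data generated from a known parameter vector.
rng = np.random.default_rng(1)
X = rng.normal(size=(64, 2))
true_theta = np.array([2.0, -1.0])
y = X @ true_theta

theta_sgd = minibatch_gd(X, y, batch_size=1)
theta_mb = minibatch_gd(X, y, batch_size=16)
theta_bgd = minibatch_gd(X, y, batch_size=64)
```

On this easy problem all three settings recover `true_theta`; the differences the notes describe (noise versus cost per update) show up on larger or noisier data.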
  • what does the neural network estimate?
    $f(x)_{c} = P(y=c|x)$, where c is the class.

  • what to optimize?
    maximize the log likelihood, i.e. minimize the negative log likelihood of $P(y_i=c|x_i)$ given $(x_i, y_i)$
    cross-entropy between p and q (p is the one-hot target, q is the predicted distribution $P(y=c|x)$)
    $l(f(x), y) = -\sum_c 1_{(y=c)} \log f(x)_c = -\log f(x)_y$
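A quick check that the cross-entropy with a one-hot target reduces to the negative log likelihood of the correct class. The probability vector below is a made-up example, and the function names are chosen here for illustration.

```python
import numpy as np

def nll(f_x, y):
    # -log f(x)_y: negative log probability assigned to the true class y
    return -np.log(f_x[y])

def cross_entropy(p, q):
    # -sum_c p_c log q_c, with p the target and q the predicted distribution
    return -np.sum(p * np.log(q))

f_x = np.array([0.7, 0.2, 0.1])   # hypothetical network output for 3 classes
y = 0                             # true class
one_hot = np.eye(3)[y]            # e(y)
```

Here `nll(f_x, y)` and `cross_entropy(one_hot, f_x)` give the same number, because the one-hot vector zeroes out every term except the one for class `y`.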

loss gradient output:

loss gradient at output

  • partial derivative:
    $\frac{\partial -\log f(x)_y}{\partial f(x)_c} = \frac{-1_{(y = c)}}{f(x)_y}$

  • gradient:
    $\nabla_{f(x)} -\log f(x)_y = \frac{-1}{f(x)_y} [1_{(y=0)}, …, 1_{(y=C-1)}]^T = \frac{-e(y)}{f(x)_y}$

loss gradient at output pre-activation

  • partial derivative:
    $\frac{\partial -\log f(x)_y}{\partial a^{(L+1)}(x)_c} = -(1_{(y = c)} - f(x)_c)$

  • gradient:
    $\nabla_{a^{(L+1)}(x)} -\log f(x)_y = -(e(y) - f(x))$

  • proof:
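One possible way to fill in the proof, assuming the output activation is softmax and writing $a$ for $a^{(L+1)}(x)$; this is a sketch of the standard argument, not necessarily the derivation given in the tutorial.

```latex
\begin{align*}
-\log f(x)_y
  &= -\log \frac{\exp(a_y)}{\sum_{c'} \exp(a_{c'})}
   = -a_y + \log \sum_{c'} \exp(a_{c'}) \\
\frac{\partial}{\partial a_c} \bigl(-\log f(x)_y\bigr)
  &= -1_{(y=c)} + \frac{\exp(a_c)}{\sum_{c'} \exp(a_{c'})} \\
  &= -1_{(y=c)} + f(x)_c
   = -\bigl(1_{(y=c)} - f(x)_c\bigr)
\end{align*}
```

Stacking these partial derivatives over all classes $c$ gives the gradient $-(e(y) - f(x))$.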


Compute output gradient (before activation)

$\nabla_{a^{(L+1)}(x)} -logf(x)_y \leftarrow - (e(y)-f(x))$

for k from L + 1 to 1

  • compute gradients of hidden layer parameters
    $\nabla_{W^{(k)}} -\log f(x)_y \leftarrow (\nabla_{a^{(k)}(x)} -\log f(x)_y) \, h^{(k-1)}(x)^T$
    $\nabla_{b^{(k)}} -\log f(x)_y \leftarrow \nabla_{a^{(k)}(x)} -\log f(x)_y$
  • compute gradient of hidden layer below
    $\nabla_{h^{(k-1)}(x)} -\log f(x)_y \leftarrow W^{(k)T} (\nabla_{a^{(k)}(x)} -\log f(x)_y)$
  • compute gradient of hidden layer below (before activation)
    $\nabla_{a^{(k-1)}(x)} -\log f(x)_y \leftarrow (\nabla_{h^{(k-1)}(x)} -\log f(x)_y) \odot […, g'(a^{(k-1)}(x)_j), …]$
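The loop above can be sketched in NumPy as follows. The sketch assumes tanh hidden units and a softmax output; the function names and the tiny network at the end are illustrative. Python lists are 0-indexed, so `Ws[k]` stores $W^{(k+1)}$ and `hs[k]` stores $h^{(k)}(x)$.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def forward(x, Ws, bs):
    hs, pre = [x], []                    # hs[k] = h^{(k)}, pre[k] = a^{(k+1)}
    for k, (W, b) in enumerate(zip(Ws, bs)):
        a = b + W @ hs[-1]
        pre.append(a)
        hs.append(softmax(a) if k == len(Ws) - 1 else np.tanh(a))
    return hs, pre

def backward(x, y, Ws, bs):
    hs, pre = forward(x, Ws, bs)
    grad_a = hs[-1].copy()
    grad_a[y] -= 1.0                     # output gradient: -(e(y) - f(x))
    grads_W, grads_b = [None] * len(Ws), [None] * len(Ws)
    for k in reversed(range(len(Ws))):   # from the output layer down to the first
        grads_W[k] = np.outer(grad_a, hs[k])    # grad_a times h^T of the layer below
        grads_b[k] = grad_a
        if k > 0:
            grad_h = Ws[k].T @ grad_a                           # hidden layer below
            grad_a = grad_h * (1.0 - np.tanh(pre[k - 1]) ** 2)  # times g'(a), g = tanh
    return grads_W, grads_b

def loss(x, y, Ws, bs):
    return -np.log(forward(x, Ws, bs)[0][-1][y])

# Tiny arbitrary network: 3 inputs, 4 hidden units, 2 classes.
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
bs = [rng.normal(size=4), rng.normal(size=2)]
x, y = rng.normal(size=3), 1
grads_W, grads_b = backward(x, y, Ws, bs)
```

A useful sanity check for any hand-written backward pass is to compare one entry of `grads_W` with a finite-difference estimate computed from `loss`.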


  • L1 & L2 regularization

    • L2 $\frac{\lambda}{2m} \sum |w|^2$
      L2 regularization is also known as weight decay as it forces the weights to decay towards zero (but not exactly zero).
    • L1 $\frac{\lambda}{2m} \sum |w|$
      Unlike L2, the weights may be reduced to exactly zero here, which gives a sparse solution. Hence, it is very useful when we are trying to compress our model. Otherwise, we usually prefer L2 over it.
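A sketch of the two penalties and their (sub)gradients, using the usual definitions: L2 penalizes squared weights and L1 penalizes absolute values. The $\frac{\lambda}{2m}$ scaling follows the formulas in these notes; the function names are made up for illustration.

```python
import numpy as np

def l2_penalty(w, lam, m):
    return lam / (2 * m) * np.sum(w ** 2)

def l2_grad(w, lam, m):
    # Proportional to w: large weights shrink faster, small ones approach zero slowly.
    return lam / m * w

def l1_penalty(w, lam, m):
    return lam / (2 * m) * np.sum(np.abs(w))

def l1_grad(w, lam, m):
    # Subgradient: constant magnitude regardless of |w|, which can push weights to exactly 0.
    return lam / (2 * m) * np.sign(w)

w = np.array([1.0, -2.0, 0.0])
```

The difference in the gradients is the reason for the behavior described above: the L2 pull vanishes as a weight approaches zero, while the L1 pull does not.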
  • Dropout
    So what does dropout do? At every iteration, it randomly selects some nodes and removes them along with all of their incoming and outgoing connections as shown below.

    So each iteration has a different set of nodes and this results in a different set of outputs. It can also be thought of as an ensemble technique in machine learning.

    Ensemble models usually generalize better than a single model because averaging many different models reduces variance. Similarly, a network trained with dropout usually generalizes better than the same network trained without it.

    The probability with which each node is dropped is the hyperparameter of dropout. As seen in the image above, dropout can be applied to both the hidden layers as well as the input layer.
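A minimal sketch of dropout applied to one layer's activations. It uses the common "inverted dropout" variant, which rescales the kept units during training so that nothing needs to change at test time; that rescaling is an assumption added here and is not described in the text above, and the names `dropout` and `drop_prob` are illustrative.

```python
import numpy as np

def dropout(h, drop_prob, rng, training=True):
    """Randomly zero each unit of h with probability drop_prob (training only)."""
    if not training or drop_prob == 0.0:
        return h
    keep_mask = rng.random(h.shape) >= drop_prob   # True where the unit is kept
    # Inverted dropout (assumption): divide by the keep probability so the
    # expected activation matches the no-dropout case.
    return h * keep_mask / (1.0 - drop_prob)

rng = np.random.default_rng(0)
h = np.ones(10000)
h_train = dropout(h, drop_prob=0.5, rng=rng)
h_test = dropout(h, drop_prob=0.5, rng=rng, training=False)
```

A fresh mask is drawn on every call, so each training iteration effectively trains a different thinned network.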

  • Batch Normalization

    • idea
      since training benefits from normalized input data, why not also normalize the inputs of the hidden layers, to reduce internal covariate shift?
    • denormalization
      to avoid normalization restricting what a layer can represent, learnable denormalization (scale and shift) parameters let the network adjust the normalized values
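A sketch of the training-time forward pass of batch normalization for a mini-batch of pre-activations. The names `gamma` and `beta` for the denormalization (scale and shift) parameters and the small constant `eps` are conventional choices assumed here; running statistics for test time are left out.

```python
import numpy as np

def batch_norm(a, gamma, beta, eps=1e-5):
    """a has shape (batch_size, n_units); statistics are taken over the batch axis."""
    mu = a.mean(axis=0)
    var = a.var(axis=0)
    a_hat = (a - mu) / np.sqrt(var + eps)   # normalize: roughly zero mean, unit variance
    return gamma * a_hat + beta             # denormalize with learnable scale and shift

rng = np.random.default_rng(0)
a = rng.normal(loc=3.0, scale=2.0, size=(128, 4))
out = batch_norm(a, gamma=np.ones(4), beta=np.zeros(4))

# With gamma = sqrt(var + eps) and beta = mean, the layer reproduces its input,
# so normalization does not remove anything the network could represent before.
recovered = batch_norm(a, gamma=np.sqrt(a.var(axis=0) + 1e-5), beta=a.mean(axis=0))
```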

Implementation of a simple neural network
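As a starting point, here is a minimal sketch that puts the pieces from these notes together: one tanh hidden layer, a softmax output, the negative log likelihood loss, and plain SGD. The class name `SimpleNet`, the layer sizes, the learning rate and the synthetic two-blob dataset are all assumptions made for illustration.

```python
import numpy as np

class SimpleNet:
    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.1, size=(n_hidden, n_in))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(scale=0.1, size=(n_out, n_hidden))
        self.b2 = np.zeros(n_out)

    def forward(self, x):
        h = np.tanh(self.b1 + self.W1 @ x)      # hidden layer activation
        a = self.b2 + self.W2 @ h               # output pre-activation
        e = np.exp(a - a.max())
        return h, e / e.sum()                   # softmax output f(x)

    def loss(self, X, Y):
        # average negative log likelihood over a dataset
        return np.mean([-np.log(self.forward(x)[1][y]) for x, y in zip(X, Y)])

    def sgd_step(self, x, y, alpha):
        h, f = self.forward(x)
        grad_a2 = f.copy()
        grad_a2[y] -= 1.0                                   # -(e(y) - f(x))
        grad_a1 = (self.W2.T @ grad_a2) * (1.0 - h ** 2)    # back through tanh
        self.W2 -= alpha * np.outer(grad_a2, h)
        self.b2 -= alpha * grad_a2
        self.W1 -= alpha * np.outer(grad_a1, x)
        self.b1 -= alpha * grad_a1

# Synthetic data: two Gaussian blobs, one per class.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))])
Y = np.array([0] * 50 + [1] * 50)

net = SimpleNet(n_in=2, n_hidden=8, n_out=2)
loss_before = net.loss(X, Y)
for epoch in range(20):
    for t in rng.permutation(len(X)):           # one pass over the data per epoch
        net.sgd_step(X[t], Y[t], alpha=0.05)
loss_after = net.loss(X, Y)
```

The training loss should drop noticeably from its initial value of roughly $\log 2$; regularization, dropout and batch normalization are deliberately left out to keep the sketch short.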