Getting Started with PyTorch

Posted on 2018-06-20 | In tech

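I have been reading domain adaptation code lately and need to learn PyTorch, so I went through Morvan Python's (莫烦python) code and organized it myself:

NumPy or Torch

Torch calls itself the NumPy of the neural network world, because it can place the tensors it creates on a GPU to accelerate computation (provided you have a suitable GPU), just as NumPy runs array computations efficiently on the CPU. So for neural networks, data in Torch tensor form is the natural choice, much like tensors in TensorFlow.

Of course, we are still very attached to NumPy because we are so used to it. Torch accommodates this and is designed to interoperate well with NumPy. For example, this is how to convert freely between a NumPy array and a Torch tensor: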

import torch
import numpy as np

np_data = np.arange(6).reshape((2, 3))
torch_data = torch.from_numpy(np_data)
tensor2array = torch_data.numpy()
print(
    '\nnumpy array:', np_data,          # [[0 1 2], [3 4 5]]
    '\ntorch tensor:', torch_data,      # 0 1 2 \n 3 4 5 [torch.LongTensor of 2 x 3]
    '\ntensor to array:', tensor2array,  # [[0 1 2], [3 4 5]]
)

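Math operations in Torch

Basic operations

Tensor operations in Torch closely mirror those on NumPy arrays, so we look at them side by side. For the many other useful operators in Torch, the API documentation is the place to go.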

# abs: absolute value
data = [-1, -2, 1, 2]
tensor = torch.FloatTensor(data)
print(
    '\nabs',
    '\nnumpy: ', np.abs(data),          # [1 2 1 2]
    '\ntorch: ', torch.abs(tensor)      # [1 2 1 2]
)

# sin: sine
print(
    '\nsin',
    '\nnumpy: ', np.sin(data),      # [-0.84147098 -0.90929743  0.84147098  0.90929743]
    '\ntorch: ', torch.sin(tensor)  # [-0.8415 -0.9093  0.8415  0.9093]
)

# mean
print(
    '\nmean',
    '\nnumpy: ', np.mean(data),         # 0.0
    '\ntorch: ', torch.mean(tensor)     # 0.0
)

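Matrix operations

Beyond elementwise operations, matrix operations are the most important part of neural networks, so here is matrix multiplication. Note that the example includes one form that works in NumPy but not in Torch.

# matrix multiplication
data = [[1,2], [3,4]]
tensor = torch.FloatTensor(data)  # convert to a 32-bit floating point tensor
# correct method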
print(
    '\nmatrix multiplication (matmul)',
    '\nnumpy: ', np.matmul(data, data),     # [[7, 10], [15, 22]]
    '\ntorch: ', torch.mm(tensor, tensor)   # [[7, 10], [15, 22]]
)

# !!!!  The following is the WRONG way !!!!
data = np.array(data)
print(
    '\nmatrix multiplication (dot)',
    '\nnumpy: ', data.dot(data),        # [[7, 10], [15, 22]], works in NumPy
    '\ntorch: ', tensor.dot(tensor)     # old Torch flattened both to [1,2,3,4].dot([1,2,3,4]) = 30.0
)

In newer versions (>= 0.3.0), tensor.dot() only accepts 1-D tensors, so the call above no longer works on a 2-D tensor and has to change.

tensor.dot(tensor)     # old behavior: [1,2,3,4].dot([1,2,3,4]) = 30.0

# becomes
torch.dot(tensor.view(-1), tensor.view(-1))   # flatten to 1-D first; the result is 30.0

Variable

What is a Variable

This feels the same as in TensorFlow: a Variable is a container for a value that changes, and that value is a tensor. A Variable here may also be a bit like a placeholder in TensorFlow.

import torch
from torch.autograd import Variable

tensor = torch.FloatTensor(([1, 2], [3, 4]))
# requires_grad: whether this Variable takes part in backpropagation, i.e. whether its gradient is computed
variable = Variable(tensor, requires_grad=True)

print(tensor)
"""
 1  2
 3  4
[torch.FloatTensor of size 2x2]
"""

print(variable)
"""
Variable containing:
 1  2
 3  4
[torch.FloatTensor of size 2x2]
"""

Variable computation and gradients

Let us compare a computation on a tensor with the same computation on a Variable.

t_out = torch.mean(tensor*tensor)       # x^2
v_out = torch.mean(variable*variable)   # x^2
print(t_out)
print(v_out)    # 7.5

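At this point there is no visible difference between a Variable and a plain tensor. As in TensorFlow, though, a Variable that takes part in a computation is building a computational graph behind the scenes: every computation step (node) is linked together, and when the error is backpropagated, the gradients of all Variables are computed in one pass. A plain tensor cannot do this; it is only a value.

v_out = torch.mean(variable*variable) is one computation step added to the graph, and it contributes when the error is backpropagated. Here is an example:

v_out.backward()    # backpropagate the error from v_out

# The next two lines are optional background; the point is that a Variable is part of the computational graph and can carry gradients.
# v_out = 1/4 * sum(variable*variable) is the computation step for v_out in the graph
# so the gradient is d(v_out)/d(variable) = 1/4*2*variable = variable/2

print(variable.grad)    # gradient of the original Variable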
'''
 0.5000  1.0000
 1.5000  2.0000
'''
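
Since PyTorch 0.4, `Variable` has been merged into `Tensor`, so the same gradient can be reproduced without the wrapper. The following is a minimal sketch, assuming PyTorch 0.4 or later is installed; the names `x` and `out` are mine and do not come from the tutorial.

```python
import torch

# Assumes PyTorch >= 0.4, where a plain tensor can track gradients itself.
x = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)

out = torch.mean(x * x)   # out = 1/4 * sum(x^2)
out.backward()            # fills x.grad with d(out)/dx = x / 2

print(out.item())         # 7.5
print(x.grad)             # same values as x / 2
```

The old `Variable(tensor, requires_grad=True)` form still runs on these versions, but it simply returns a tensor.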

Getting the data out of a Variable

print(variable) outputs data in Variable form, which often cannot be used directly (for example, for plotting with plt), so we convert it to tensor form first.

print(variable)     #  Variable form
"""
Variable containing:
 1  2
 3  4
[torch.FloatTensor of size 2x2]
"""

print(variable.data)    # tensor form
"""
 1  2
 3  4
[torch.FloatTensor of size 2x2]
"""

print(variable.data.numpy())    # numpy form
"""
[[ 1.  2.]
 [ 3.  4.]]
"""

Activation

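What is an activation function

Why do we need nonlinear functions?

With only linear layers, any number of stacked layers is equivalent to a single linear layer.
With nonlinearities, the network can fit a much wider range of functions.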
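
The claim about stacked linear layers can be checked directly. This is a small NumPy sketch with made-up weights (all values below are hypothetical): two layers without an activation collapse into one linear layer, and inserting a ReLU breaks the equivalence.

```python
import numpy as np

# Hypothetical toy data and weights, chosen only for illustration.
x = np.array([[1.0, -2.0], [0.5, 3.0]])
W1, b1 = np.array([[1.0, -1.0], [2.0, 0.5]]), np.array([0.0, 1.0])
W2, b2 = np.array([[1.0], [-1.0]]), np.array([0.5])

# Two "layers" with no activation function in between.
two_layers = (x @ W1 + b1) @ W2 + b2

# The same map written as a single linear layer.
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = x @ W + b
print(np.allclose(two_layers, one_layer))   # True

# With a ReLU in between, no single layer (W, b) built this way matches.
with_relu = np.maximum(x @ W1 + b1, 0) @ W2 + b2
print(np.allclose(with_relu, one_layer))    # False
```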

Activation functions in Torch

Torch has many activation functions, but the ones used most often are relu, sigmoid, tanh and softplus. Let us see what each looks like.

import torch
import torch.nn.functional as F     # activation functions live here
from torch.autograd import Variable

# generate some fake data to plot
x = torch.linspace(-5, 5, 200)  # x data (tensor), shape=(200,)
x = Variable(x)

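Next, generate the outputs of the different activation functions:

x_np = x.data.numpy()   # convert to a numpy array for plotting

# several commonly used activation functions
y_relu = F.relu(x).data.numpy()
y_sigmoid = torch.sigmoid(x).data.numpy()   # F.sigmoid is deprecated in newer versions
y_tanh = torch.tanh(x).data.numpy()         # F.tanh is deprecated in newer versions
y_softplus = F.softplus(x).data.numpy()
# y_softmax = F.softmax(x)  softmax is a special case: it outputs class probabilities and is used for classification, so it is not plotted here

Then plot them; the plotting code follows:

import matplotlib.pyplot as plt  # Python's visualization module; see Morvan's tutorial (https://morvanzhou.github.io/tutorials/data-manipulation/plt/)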

plt.figure(1, figsize=(8, 6))
plt.subplot(221)
plt.plot(x_np, y_relu, c='red', label='relu')
plt.ylim((-1, 5))
plt.legend(loc='best')

plt.subplot(222)
plt.plot(x_np, y_sigmoid, c='red', label='sigmoid')
plt.ylim((-0.2, 1.2))
plt.legend(loc='best')

plt.subplot(223)
plt.plot(x_np, y_tanh, c='red', label='tanh')
plt.ylim((-1.2, 1.2))
plt.legend(loc='best')

plt.subplot(224)
plt.plot(x_np, y_softplus, c='red', label='softplus')
plt.ylim((-0.2, 6))
plt.legend(loc='best')

plt.show()

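Building a Neural Network

Deep Learning Notes

Posted on 2018-06-17 | In deep learning

Reinforcement learning, deep learning and transfer learning all belong to machine learning, but lumping them together on the blog felt messy, so I am splitting them out and organizing them gradually.
I will mainly organize material from CMU 10707 Introduction to Deep Learning, and will also refer to 11785 Introduction to Deep Learning.

The goal is to work through it again in my own words. I used to fail interviews on exactly these topics and hope not to again; it should also help my current research. After all, without a solid foundation everything above it shakes.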

DNN

Posted on 2018-06-17 | In deep learning

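Introduction

This is Hugo's deep learning tutorial series, which I need to work through gradually. Many of the CMU 10707 slides are also based on it (as Russlan himself said), so if you cannot find the CMU 10707 videos, these are an excellent substitute.

Here I want to write out the derivation of the most basic neural network formulas, deriving them once more myself to be sure of them:

Basic structure of the network:

layer pre-activation for $k > 0$ ($h^{(0)}(x) = x$)
$a^{(k)}(x) = b^{(k)} + W^{(k)}h^{(k - 1)}(x)$
A way to remember this without mixing things up: $k$ is the layer index, $h^{(k - 1)}(x)$ is the output of the previous layer, and the weights $W^{(k)}$ and bias $b^{(k)}$ belong to layer $k$. The input comes from the previous layer, but the parameters still count as this layer's.
hidden layer activation ($k$ from 1 to $L$)
$h^{(k)}(x) = g(a^{(k)}(x))$
This is the output of layer $k$; the layer's weights and bias affect the rest of the network only through this output.
output layer activation ($k = L + 1$):
$h^{(L + 1)}(x) = o(a^{(L + 1)}(x)) = f(x)$
The difference from a hidden layer is that this is the last layer, and its activation function usually differs from that of the hidden layers; for classification it is typically softmax.
softmax activation function at the output:
$o(a) = softmax(a) = [\frac{exp(a_1)}{\sum_c exp(a_c)} \dots \frac{exp(a_C)}{\sum_c exp(a_c)}]^T$
How should we understand softmax, and why is it the usual choice for multi-class problems rather than something else?
There is some discussion of this on Zhihu; I quote the answer by 王赟 (Wang Yun):
- Reason one: we want features to affect the probabilities multiplicatively
- Reason two: the objective for multi-class classification is usually cross-entropy, … (to be filled in after finishing the full derivation)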
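
To make the softmax formula concrete, here is a minimal NumPy sketch. Subtracting the maximum before exponentiating is a standard numerical-stability trick that I am adding; it is not part of the notes above and does not change the result.

```python
import numpy as np

def softmax(a):
    """Softmax over the last axis, shifted by max(a) for numerical stability."""
    shifted = a - np.max(a, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

a = np.array([1.0, 2.0, 3.0])   # hypothetical output pre-activations
p = softmax(a)
print(p, p.sum())               # probabilities that sum to 1

# Adding a constant to every pre-activation leaves the output unchanged,
# while adding 1 to a single a_c multiplies that class's unnormalized
# score exp(a_c) by e: the multiplicative effect mentioned above.
print(np.allclose(softmax(a + 100.0), p))
```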
activation function:
- sigmoid:
  - formula:
    $\sigma(x) = \frac{1}{1 + e^{-x}}$, $\sigma'(x) = \frac{e^x}{(1 + e^x)^2} = (1 - \sigma(x)) \sigma(x)$
  - shortcomings:
    - vanishing gradient
    - output is not zero-centered
    - time consuming to compute exp
- tanh:
  - formula:
    $tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = \frac{2}{1 + e^{-2x}} - 1 = 2 \sigma(2x) - 1$, $tanh'(x) = 1 - tanh^2(x)$
  - it mainly fixes the zero-centering problem: its output is symmetric about the origin
- relu:
  - formula: $f(x) = max(0, x)$
  - advantages:
    ReLU sets the output of some neurons to 0, which makes the network sparse, reduces the interdependence between parameters, and mitigates overfitting
    cheap to compute
  - shortcomings:
    some neurons can die
- leaky relu:
  - formula: $f(x) = max(\epsilon x, x)$
  - advantages:
    fixes the dying-neuron problem
- maxout:
  - formula: a generalization of relu and leaky relu: $f(x) = max(w_1^T x + b_1, w_2^T x + b_2)$
  - advantages:
    simple to compute, does not die, does not saturate
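
The identities $\sigma'(x) = \sigma(x)(1 - \sigma(x))$, $tanh(x) = 2\sigma(2x) - 1$ and $tanh'(x) = 1 - tanh^2(x)$ are easy to get wrong by hand, so here is a quick numerical check. It is a sketch using central finite differences; the grid and step size are arbitrary choices of mine.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3, 3, 13)   # arbitrary test points
eps = 1e-5                   # finite-difference step

# Central finite differences as a reference for the derivatives.
num_dsigmoid = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
num_dtanh = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)

ok_sigmoid = np.allclose(num_dsigmoid, sigmoid(x) * (1 - sigmoid(x)), atol=1e-8)
ok_tanh = np.allclose(num_dtanh, 1 - np.tanh(x) ** 2, atol=1e-8)
ok_relation = np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1)
print(ok_sigmoid, ok_tanh, ok_relation)
```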

loss function:

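stochastic gradient descent (SGD):
Stochastic gradient descent is the most basic gradient descent method.
- initialize $\theta$ ($\theta = \{W^{(1)}, b^{(1)}, \dots, W^{(L + 1)}, b^{(L + 1)}\}$)
- algorithm:
  for N iterations: (one pass over all training examples is one epoch)
```
for each training example $(x_{(t)}, y_{(t)})$
    $\delta = -\nabla_{\theta}l(f(x_{(t)}; \theta), y_{(t)}) - \lambda\nabla_{\theta}\Omega(\theta)$
    $\theta \leftarrow \theta + \alpha \delta$
```
- pros and cons of SGD:
  - shortcomings:
    - choosing a suitable learning rate is hard, and the same learning rate is applied to every parameter. With sparse data or features we may want larger updates for rarely seen features and smaller updates for frequent ones, which plain SGD cannot provide
    - noisier than BGD
- comparison with batch gradient descent (BGD):
  "batch" means the gradient is computed over all examples at once, as the formula shows:
  $\theta \leftarrow \theta + \frac{\alpha}{m}\sum_{i}(y_i - f(x_i;\theta))x_i$ (MSE loss, linear $f$)
  - shortcoming: training is slow when m is large
  - advantage: more stable than SGD
- mini-batch GD:
  a compromise between the two, like n-step TD sitting between TD and Monte Carlo in reinforcement learning
  - advantages:
    - gives a more accurate estimate of the average loss than a single example
    - can leverage matrix operations, and each update costs less than in BGD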
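
The three variants differ only in how many examples go into each update. Below is a hedged NumPy sketch on a toy linear regression with MSE loss. The data, learning rate and epoch count are hypothetical choices of mine, and the function `train` is an illustration rather than code from the course.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200
X = rng.normal(size=(m, 3))                      # hypothetical inputs
true_theta = np.array([2.0, -1.0, 0.5])          # hypothetical ground truth
y = X @ true_theta + 0.01 * rng.normal(size=m)

def train(batch_size, epochs=100, lr=0.1):
    """Gradient descent on MSE; batch_size=1 is SGD, batch_size=m is BGD."""
    theta = np.zeros(3)
    for _ in range(epochs):
        order = rng.permutation(m)               # reshuffle every epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            residual = y[idx] - X[idx] @ theta
            # theta <- theta + lr * mean((y_i - f(x_i)) * x_i) over the batch
            theta = theta + lr * residual @ X[idx] / len(idx)
    return theta

results = {name: train(bs) for name, bs in [("SGD", 1), ("mini-batch", 32), ("BGD", m)]}
for name, theta in results.items():
    print(name, theta)
```

All three settings should land near `true_theta` on this easy problem; the difference shows up in the number of updates per epoch and in how noisy each update is.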

what does the neural network estimate?
$f(x)_{c} = P(y=c|x)$, where $c$ indexes the class.
what to optimize?
maximize the log likelihood, i.e. minimize the negative log likelihood $-\log P(y_i=c|x_i)$, given $(x_i, y_i)$
cross-entropy between p and q (p the one-hot target, q the predicted distribution P(y=c|x)):
$l(f(x), y) = -\sum_c 1_{(y=c)} \log f(x)_c = - \log f(x)_y$

loss gradient output:

loss gradient at output

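partial derivative:
$\frac{\partial - \log f(x)_y}{\partial f(x)_c} = \frac{-1_{(y = c)}}{f(x)_y}$
This is nonzero only when $c = y$: the cross-entropy uses a one-hot target, so only the component $f(x)_y$ appears in the loss.
gradient:
Stacking the partial derivatives gives the gradient:
$\nabla_{f(x)} -\log f(x)_y= \frac{-1}{f(x)_y} [1_{(y=0)} \dots 1_{(y=C-1)}]^T = \frac{-e(y)}{f(x)_y}$

loss gradient at output pre-activation

partial derivative:
Again in scalar form first:
$\frac{\partial - \log f(x)_y}{\partial a^{(L+1)}(x)_c} = -(1_{(y = c)} - f(x)_c)$
Unlike the previous case, every component is nonzero here: the softmax normalization couples all pre-activations, so for $c \neq y$ the derivative is $f(x)_c$.
gradient:
In vector form:
$\nabla_{a^{(L+1)}(x)} -\log f(x)_y = -(e(y) - f(x))$
proof:
The proof here is incomplete and only covers one dimension. For the complete version, see the Zhihu article on matrix calculus (矩阵求导术) by a senior student from ECE; I will come back and finish the derivation later.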
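
Until the full proof is written out, the result $-(e(y) - f(x))$ can at least be checked numerically. This is a sketch with hypothetical pre-activations and label, comparing the analytic gradient against central finite differences.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def nll(a, y):
    """Negative log-likelihood -log f(x)_y for output pre-activations a."""
    return -np.log(softmax(a)[y])

a = np.array([0.5, -1.0, 2.0, 0.1])   # hypothetical output pre-activations
y = 2                                  # hypothetical true class
e_y = np.eye(len(a))[y]                # one-hot vector e(y)

analytic = -(e_y - softmax(a))

# Central finite differences, one coordinate at a time.
eps = 1e-6
basis = np.eye(len(a))
numeric = np.array([
    (nll(a + eps * basis[c], y) - nll(a - eps * basis[c], y)) / (2 * eps)
    for c in range(len(a))
])

print(analytic)
print(numeric)
```

Every component is nonzero, and the components sum to zero because both $e(y)$ and $f(x)$ sum to one.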

Backpropagation

Compute output gradient (before activation)

$\nabla_{a^{(L+1)}(x)} -\log f(x)_y \leftarrow - (e(y)-f(x))$

for k from L + 1 to 1

compute gradients of hidden layer parameters
$\nabla_{W^{(k)}} -\log f(x)_y \leftarrow (\nabla_{a^{(k)}(x)} -\log f(x)_y) h^{(k-1)}(x)^T$
$\nabla_{b^{(k)}} -\log f(x)_y \leftarrow \nabla_{a^{(k)}(x)} -\log f(x)_y$
compute gradient of hidden layer below
$\nabla_{h^{(k-1)}(x)} -\log f(x)_y \leftarrow W^{(k)T} (\nabla_{a^{(k)}(x)} -\log f(x)_y)$
compute gradient of hidden layer below (before activation)
$\nabla_{a^{(k-1)}(x)} -\log f(x)_y \leftarrow (\nabla_{h^{(k-1)}(x)} -\log f(x)_y) \odot [\dots, g'(a^{(k-1)}(x)_j), \dots]$
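
The loop above translates almost line by line into NumPy. The sketch below assumes tanh hidden units and a softmax output; the layer sizes, random weights and helper names (`forward`, `backward`, `loss`) are hypothetical. The last lines compare one entry of the result against a finite difference.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def forward(params, x):
    """Return activations h[0..L+1]: tanh hidden layers, softmax output."""
    h = [x]
    for k, (W, b) in enumerate(params):
        a = b + W @ h[-1]                               # pre-activation
        h.append(softmax(a) if k == len(params) - 1 else np.tanh(a))
    return h

def backward(params, h, y):
    """Gradients of -log f(x)_y for every (W, b), from the output layer down."""
    grad_a = -(np.eye(len(h[-1]))[y] - h[-1])           # -(e(y) - f(x))
    grads = []
    for k in reversed(range(len(params))):              # params[k] is layer k+1
        grads.append((np.outer(grad_a, h[k]), grad_a))  # gradients of W and b
        if k > 0:
            grad_h = params[k][0].T @ grad_a            # layer below, after activation
            grad_a = grad_h * (1 - h[k] ** 2)           # before activation: tanh' = 1 - tanh^2
    return grads[::-1]

sizes = [3, 5, 4, 2]   # hypothetical sizes: input, two hidden layers, output
params = [(rng.normal(size=(n_out, n_in)), rng.normal(size=n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
x, y = rng.normal(size=3), 1
grads = backward(params, forward(params, x), y)

def loss(params):
    return -np.log(forward(params, x)[-1][y])

# Finite-difference check on a single first-layer weight.
eps = 1e-6
W1 = params[0][0]
W1[0, 0] += eps
loss_plus = loss(params)
W1[0, 0] -= 2 * eps
loss_minus = loss(params)
W1[0, 0] += eps                                         # restore the weight
numeric = (loss_plus - loss_minus) / (2 * eps)
print(grads[0][0][0, 0], numeric)
```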

Regularization

L1 & L2 regularization
- L2 $\frac{\lambda}{2m} \sum |w|^2$
  L2 regularization is also known as weight decay as it forces the weights to decay towards zero (but not exactly zero).
- L1 $\frac{\lambda}{2m} \sum |w|$
  Unlike L2, the weights may be reduced to exactly zero here, giving a sparse solution. Hence it is very useful when we are trying to compress our model. Otherwise, we usually prefer L2 over it.
Dropout
So what does dropout do? At every iteration, it randomly selects some nodes and removes them along with all of their incoming and outgoing connections.

So each iteration has a different set of nodes and this results in a different set of outputs. It can also be thought of as an ensemble technique in machine learning.

Ensemble models usually perform better than a single model because averaging many models reduces variance. Similarly, a network trained with dropout usually generalizes better than the same network trained without it.

The probability with which each node is dropped is the hyperparameter of the dropout function. Dropout can be applied to both the hidden layers and the input layer.
Batch Normalization
- idea
  since training benefits from normalized input data, why not also normalize the inputs of the hidden layers, to reduce internal covariate shift.
- denormalization
  to avoid losing representational power through normalization, learnable scale and shift parameters let the network adjust, or even undo, the normalization
Implementation of a simple neural network

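What to Do on Weekends

Posted on 2018-06-16 | In daily life

Many people struggle with the weekend: what should I do?
Some go out with friends, some spend the whole weekend gaming, and of course some high achievers keep working.

I think what a weekend needs most is rest: a regular schedule and proper meals both matter. I have always felt that rest does not mean indulging your desires; a person's happiness may well come from mastering them. Staying up late to binge shows is not rest, and neither is staying up for the World Cup. That is indulgence, and it harms the body, whereas rest should not. Entertainment is of course necessary, but isn't it better to sleep well at night and binge shows during the day? So rest is the body readjusting. It is also the mind relaxing, letting go of the pressure and the load of the week.

Once you feel rested, you can naturally turn to work and personal development, which are fine too. What matters more at that point is reflection on the past week; as the saying goes, learning without thinking leads to confusion (学而不思则罔). You can also push forward your long-term projects, take stock of the week's gains and losses, and look ahead to the next.

In short, the weekend is a node, a waystation. Repeat the cycle week after week and you will reach the future you hope for.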

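The Path to Advancement

Posted on 2018-06-14 | In reflections

I have read this several times, and it feels new every time:
Author: 田渊栋 (Tian Yuandong)
Link: https://www.zhihu.com/question/30022694/answer/224543003
Source: Zhihu
Copyright belongs to the author. For commercial reprints, contact the author for authorization; for non-commercial reprints, cite the source.

Chasing a paper count is pointless. There are generally two purposes in reading papers:

1) See what everyone is doing and find a direction. At this stage you usually read just the abstract and introduction to get a first picture of the field: what it is mainly about, and what the concepts mean and how they relate. Not understanding everything is fine; it comes after a few more papers. Smart people can get through this step quickly.
2) Work out the details. Pick a direction that interests you and read closely, understanding the line of thought and the details of the reasoning, then follow the references to a large body of related literature and keep reading. The bar is a clear map of the field in your head and the ability to carry out most derivations and proofs on your own. A reliable check is to give a talk to your group or advisor and see whether they can follow. Often you think you understand, and the gaps show the moment you explain it to someone else. Discussion can also spark new ideas and may give you the starting point of your next paper. This step usually takes a researcher a great deal of time, and it is the key difference between amateur and professional research. In short, the time allotted to each paper varies enormously: a bad paper can be dropped in seconds, while a classic needs to be revisited and rethought every so often. As for judging a paper's quality, that takes taste built up over years of research. The next two stages cannot be reached by reading papers alone.
3) Write code to reproduce others' work, and improve on it. Every paper, intentionally or not, talks itself up and others down, and every paper has details that are hidden, intentionally or not. You cannot see these without doing the work yourself. So you have to spend the time implementing other people's methods, do whatever it takes to match their results, and then go back and reread the paper. Over time you quickly learn to spot the traces of what was deliberately hidden and understand why things were left unsaid. Reading alone builds this kind of experience far more slowly. "Armchair strategist" usually describes someone who skipped this step. When I joined Facebook AI Research in January 2015, I had no hands-on experience with deep learning. My first assignment was to reproduce VGG's performance on ImageNet. BatchNorm did not exist yet, and it was a good result when 2 out of 5 runs started to converge. It took me a few weeks in the end, and I learned a lot from the whole process.
4) Distill your experience, connect the pieces, and find and follow your own methodology. After repeating step 3 many times you may feel fairly experienced. You can talk fluently when asked, but what you say is often scattered, isolated experience. You will also find that you forget it easily, not because of poor memory but because your thinking is not systematic. This calls for repeated reflection and distillation until you form your own methodology. With a methodology you have an overall direction and stop trying things at random; you become far more efficient and can dig deep and persist in one research direction instead of chasing whichever topic is hot. You can also give targeted guidance to others. Reading the literature on this foundation, you will often understand much that you could not at first, for example why an author stresses A and rejects B: because of the philosophy and methodology behind A that they believe in.

If you find that you cannot distill your knowledge, or that it is fragmented to begin with, then either (1) you are not there yet, or (2) the field is still immature and its current knowledge is only a patchwork of fragments. (1) takes practice; (2) signals a major opportunity: what makes a researcher great is whether they can find new patterns where everyone else has given up.

Roughly, completing 1 is the level of news and popular science; 2 to 3 is the level of junior to senior PhD students; mastering 3 and entering 4 is postdoc level; mastering 4 is the level of research scientists and professors. There is also no fixed order from 1 to 4: you may be at 4 in one field and only at 1 or 2 in another; experience gained at 4 can feed back into 1 and 2 (this is very common); or you may skip 2 and go straight to 3, then fill in 2 once 3 has produced results, and so on. Jumping straight to 4 in one step, of course, is the level of a crank (民科).

Looking at myself, I am probably just entering stages 2 and 3. I need to keep working hard!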

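Batch-Renaming Image Files

Posted on 2018-06-13 | In tech

Reposted from 灰羽吖's CSDN blog:
This is not hard once you are familiar with the os package. In a face recognition project some images come from other sources; they are often used for testing, but externally sourced images come with inconsistent names. This post shows how to batch-rename images and other files.

Code: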

import os

path_name = '/home/huiyu/PycharmProjects/faceCodeByMe/testdata'
# path_name: the folder whose files you want to rename in bulk
i = 0
for item in os.listdir(path_name):  # loop over every file in the folder
    # os.path.join(path_name, item) builds each file's full path
    os.rename(os.path.join(path_name, item), os.path.join(path_name, str(i) + '.jpg'))
    i += 1
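
One caveat with the loop above: if the folder already contains a file named like a target (say `1.jpg`), a later rename can overwrite it, and every file ends up with a `.jpg` extension regardless of its real type. The variant below is a sketch of one way around both issues, not the reposted author's code; the function name `rename_sequentially` is hypothetical. It renames in two passes through temporary names and keeps each file's extension.

```python
import os
import uuid

def rename_sequentially(folder):
    """Rename every file in `folder` to 0.ext, 1.ext, ... keeping its extension."""
    names = sorted(n for n in os.listdir(folder)
                   if os.path.isfile(os.path.join(folder, n)))
    # Pass 1: move everything to unique temporary names so that no target
    # name (e.g. an existing "1.jpg") is overwritten by an earlier rename.
    temp_names = []
    for name in names:
        ext = os.path.splitext(name)[1]
        tmp_name = uuid.uuid4().hex + ext
        os.rename(os.path.join(folder, name), os.path.join(folder, tmp_name))
        temp_names.append(tmp_name)
    # Pass 2: assign the final sequential names, in the original sorted order.
    for i, tmp_name in enumerate(temp_names):
        ext = os.path.splitext(tmp_name)[1]
        os.rename(os.path.join(folder, tmp_name),
                  os.path.join(folder, str(i) + ext))
```

Calling `rename_sequentially(path_name)` on the test folder would then number the files in sorted filename order.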

Learning to Adapt Structured Output Space for Semantic Segmentation

Posted on 2018-06-13 | In transfer learning

The code for this paper is what I have been running lately, so it deserves a careful read.

abstract

Convolutional neural network-based approaches for semantic segmentation rely on supervision with pixel-level ground truth, but may not generalize well to unseen image domains. As the labeling process is tedious and labor intensive, developing algorithms that can adapt source ground truth labels to the target domain is of great interest. In this paper, we propose an adversarial learning method for domain adaptation in the context of semantic segmentation. Considering semantic segmentations as structured outputs that contain spatial similarities between the source and target domains, we adopt adversarial learning in the output space. To further enhance the adapted model, we construct a multi-level adversarial network to effectively perform output space domain adaptation at different feature levels. Extensive experiments and ablation study are conducted under various domain adaptation settings, including synthetic-to-real and cross-city scenarios. We show that the proposed method performs favorably against the state-of-the-art methods in terms of accuracy and visual quality.

introduction

model

1) a segmentation model to predict output results
2) a discriminator to distinguish whether the input is from the source or target segmentation output.
contributions
1) propose a domain adaptation method for pixel-level semantic segmentation via adversarial learning
2) demonstrate that adaptation in the output (segmentation) space can effectively align scene layout and local context between source and target images
3) develop a multi-level adversarial learning scheme to adapt features at different levels of the segmentation model, which leads to improved performance

structure

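It is not that hard to understand. It is a bit like U-Net: adversarial (GAN) losses are applied at different layers, which is the "multi-level" part.
The adaptation is mainly done in the output space, because the paper observes that however different the images of the two domains look, their predictions share many similarities in the output space.

model overview

The architecture has two modules: a generator $G$ (the segmentation network) and discriminators $D_i$ ($i$ indexes the level of the discriminator). Passing source images through the generator gives the source segmentation probability map $P_s$.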

loss function

$L(I_s, I_t) = L_{seg}(I_s) + \lambda_{adv} L_{adv}(I_t)$

$L_{seg}(I_s)$
cross-entropy loss using ground truth annotations in the source domain
$L_{adv}$
adversarial loss that pushes the distribution of target predictions towards that of source predictions
$\lambda_{adv}$
the weight that balances the two losses

Output space adaptation

Single-level adversarial learning

Discriminator Training

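cross entropy
- definition:
  given two distributions p and q over a sample set X, their cross entropy is
  $CEH(p, q) = E_p[-\log q] = - \sum_{x \in X}p(x)\log q(x) = H(p) + D_{KL}(p||q)$. When the entropy of p is fixed, the cross entropy and the KL divergence differ only by a constant, so the cross entropy can to some extent describe the distance between the two distributions.
- discussion: why use the cross entropy loss for classification? (answer by Jackon)
  - compared with plain classification error as the loss, it describes how good a model is in a finer-grained and more precise way
  - compared with MSE, it gives a convex optimization problem when combined with a sigmoid or softmax output (convex in the parameters of a linear classifier)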
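
The decomposition $CEH(p, q) = H(p) + D_{KL}(p||q)$ is easy to verify on a small example. The two distributions below are made up purely for illustration.

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # hypothetical "true" distribution
q = np.array([0.5, 0.3, 0.2])   # hypothetical model distribution

cross_entropy = -np.sum(p * np.log(q))
entropy = -np.sum(p * np.log(p))
kl = np.sum(p * np.log(p / q))

print(cross_entropy, entropy + kl)   # the two values agree
```

Since $H(p)$ does not depend on q, minimizing the cross entropy over q is the same as minimizing the KL divergence.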
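segmentation softmax output:
$P = G(I) \in R^{H \times W \times C}$, where C is the number of semantic categories. The discriminator then classifies each location into 2 classes: from the source domain or from the target domain.
cross-entropy loss:
We feed P into a fully convolutional discriminator D: $L_d(P) = - \sum_{h, w}(1 - z)\log(D(P)^{(h,w,0)}) + z\log(D(P)^{(h,w,1)})$, where $z = 0$ if the sample comes from the target domain and $z = 1$ if it comes from the source domain.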

Segmentation Network Training

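segmentation loss:
On the source domain we train as usual with the cross-entropy loss: $L_{seg}(I_s) = -\sum_{h, w}\sum_{c \in C}Y_s^{(h,w,c)}\log(P_s^{(h,w,c)})$
adversarial loss:
On the target domain the adversarial loss is $L_{adv}(I_t) = -\sum_{h,w}\log(D(P_t)^{(h,w,1)})$ with $P_t = G(I_t)$. This loss trains the segmentation network to fool the discriminator, so that target predictions become indistinguishable from source predictions.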
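
To see how the discriminator loss and the adversarial loss relate, here is a NumPy sketch that evaluates both on a fake discriminator output. It assumes the convention that channel 1 (and $z = 1$) means "source" and channel 0 (and $z = 0$) means "target"; the array shapes and function names are hypothetical, and no real networks are involved.

```python
import numpy as np

def discriminator_loss(d_out, z):
    """L_d for one sample: d_out is (H, W, 2) per-pixel probabilities, z=1 source, z=0 target."""
    return -np.sum((1 - z) * np.log(d_out[..., 0]) + z * np.log(d_out[..., 1]))

def adversarial_loss(d_out_target):
    """L_adv: reward the segmentation net when D labels target predictions as source."""
    return -np.sum(np.log(d_out_target[..., 1]))

rng = np.random.default_rng(0)
p_source = rng.uniform(0.05, 0.95, size=(4, 4))       # hypothetical D output: P(source) per pixel
d_out = np.stack([1 - p_source, p_source], axis=-1)   # channels: (target, source)

print(discriminator_loss(d_out, z=1))   # loss if the input really came from the source
print(discriminator_loss(d_out, z=0))   # loss if it really came from the target
print(adversarial_loss(d_out))          # same value as discriminator_loss(d_out, z=1)
```

The adversarial loss is the discriminator loss with the label flipped to "source", which is exactly what fooling the discriminator means here.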

Multi-level Adversarial Learning

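multi-level loss
The same adversarial loss is also added on a lower-level feature space, which is not hard to understand:
$L(I_s, I_t) = \sum_i \lambda_{seg}^i L_{seg}^i(I_s) + \sum_i \lambda_{adv}^i L_{adv}^i(I_t)$, where i indexes the level in the network.
whole picture
With the losses above, the problem is a min-max optimization:
$\max_D \min_G L(I_s, I_t)$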

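Self-Reflection

Posted on 2018-06-13 | In reflections

I reread Tian Yuandong's reflections on Zhihu. I come back to them often as a reminder:

Author: 田渊栋 (Tian Yuandong)
Link: https://zhuanlan.zhihu.com/p/26178137
Source: Zhihu
Copyright belongs to the author. For commercial reprints, contact the author for authorization; for non-commercial reprints, cite the source.

Think about danger, think about retreat, think about change (思危，思退，思变)

First, be self-disciplined. This is the most basic thing. Someone who cannot control their own behavior cannot carve out their own path. Doing what you say you will, exercising regularly, working hard: all of these matter.

Next, get out of your comfort zone. Take diligence as an example: it is itself a kind of comfort zone. The saying "diligence makes up for clumsiness" (勤能补拙) is a good one. First, it says that you are currently in a state of clumsiness and need to keep working. Second, diligence can only make up for clumsiness; it cannot turn clumsiness into skill, so it is not a fundamental solution. Diligence is an anxious state, not a contented one. It means you are not as good as others and have to spend more time to compensate: others get by on eight hours, and you need more than ten to keep up. This state does not last. A minor illness or a family matter, and you can no longer compensate and fall behind. The use of diligence is trial and error: when you are behind, it lets you spend more time finding the right method, so that you reach the same efficiency as others or better and thereby raise your ability. Others trained formally and I came in halfway, so of course I need more time to catch up; others learn efficiently and I do not, so I need more time to explore better methods. Diligence is a transient state. Its ultimate purpose is to find a better method, catch up in time and leave that state, not to take pride in staying in it.

Do not just follow the routine; be ready to overturn the chessboard at any time. The world changes fast, and all past effort, however hard-won, is sunk cost that must be thrown away when necessary. You may have worked on direction A for years; when the times tell you things look bad and you should switch to B, switch decisively. This may have been rare before, but it will happen more and more. You work on A for ten years, AI swallows A, you immediately move to B, five years later AI swallows B, and so on. Often the change does not happen overnight but through steady effort, like water wearing through stone: improve a little today, a little tomorrow, and change along with the field. If you run faster than others, you become scarce in the market and gain relative safety. Researchers are used to this: reading new things every day, opening their minds every day, constantly admitting they are a few months or weeks out of date, then dusting themselves off and catching up. A master who has learned this will not starve by teaching apprentices; teaching makes the master stronger, because the master steps up to be proven wrong and learns faster than the apprentice. For many people this is hard, especially those for whom everything has always gone smoothly. But if you never dare to look at the outside world, sooner or later you will be forced into it. Rather than being forced, jump in early of your own accord. History has taught, in blood and countless times, that survival of the fittest is an iron law. Humanity fought its way to dominance among millions of species and will carry the same fate into the future. A person who evades is eliminated; a field that evades is eliminated; a country that evades is eliminated.

At any moment you are certainly wrong about something. The frightening thing is not being wrong but not knowing where, and going further and further in the wrong direction. The feeling is especially strong with brilliant colleagues around: a casual exchange shows you what you lack, and you quietly note it down and go fill the gap right away. That is why speaking up and joining discussions matters: only when ideas collide do you find out where the problems are. I sometimes feel I am always walking along a cliff edge. Perhaps it has gone fine so far, but that is in the past, and the next step could land in the mud at the edge and send me falling. A wrong step does not matter as long as you notice in time and pull your foot back; the danger is always walking the safe road until you no longer know what a cliff looks like. What seniors and teachers say may very well be wrong too, and our task as the younger generation is to find their mistakes. Finding them is how ability grows. Confidence is often forged this way too. Why am I different from others? Because I chose a different path.

Further up, you must tell the primary from the secondary. Hold on to what matters and let go of what does not. Some people read many papers and books and lack neither diligence, discipline, curiosity nor a willingness to ask, yet master nothing; some do everything themselves and demand perfection, yet often lose their grip on the big things. The shortest-plank theory is flawed. Most positions do not need an all-rounder. They need one deep specialty plus breadth: being so strong in one direction that peers at your level can be counted on one hand, while everything else just needs to clear the bar, and nobody cares even if it does not. If you are not an expert, hiring you at a high salary makes no sense; if you are an expert, demanding perfection in everything makes no sense. A day has only 24 hours, so knowing what to give up matters. Often there is no gain without sacrifice and everything has a price: weigh the trade-offs beforehand and accept the outcome afterwards. Losing does not matter; try again. In reality the trade-offs are rarely that brutal. More often you find your direction and grow along it naturally, and the main obstacle then is the difficulty of reaching the summit rather than the pain of choosing. Still, even in calm and peaceful times you need to be prepared: one day you may have to make such a decision.

Finally, do not let a sense of superiority stop you. This is the worst thing fame and fortune do to a person: you have only just set out, but the cheering makes you feel you have crossed the finish line. What should have been a long march turns into a hundred-meter dash, and once you have accepted the flowers you never see the scenery further on. Only you know whether you have truly arrived. Crossing mountains and rivers, eating in the wind and sleeping in the open: others do the applauding, but the one transformed is you.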

work and life balance

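Posted on 2018-06-12 | In daily life

Day two of putting the plan into action. I feel a little nervous and under pressure; the job search and research are piling onto me the way dues used to. Just keep going. Taking on more means bearing more pressure, and that needs to be balanced with some entertainment and exercise. I have kept up swimming and will keep at it; maybe I can go to a concert. My convictions and the things I aspire to keep carrying me forward. I remember Tian Yuandong saying: write it down, and keep walking. Yes, exactly that.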

Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks

Posted on 2018-06-12 | In transfer learning

The previous posts mostly relied on the idea of style transfer, which is what this paper is about:

abstract

Image-to-image translation is a class of vision and graphics problems where the goal is to learn the mapping between an input image and an output image using a training set of aligned image pairs. However, for many tasks, paired training data will not be available. We present an approach for learning to translate an image from a source domain X to a target domain Y in the absence of paired examples. Our goal is to learn a mapping G : X → Y such that the distribution of images from G(X) is indistinguishable from the distribution Y using an adversarial loss. Because this mapping is highly under-constrained, we couple it with an inverse mapping F : Y → X and introduce a cycle consistency loss to enforce F(G(X)) ≈ X (and vice versa). Qualitative results are presented on several tasks where paired training data does not exist, including collection style transfer, object transfiguration, season transfer, photo enhancement, etc. Quantitative comparisons against several prior methods demonstrate the superiority of our approach.

unpaired

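They address unpaired translation: in other words, there are no paired examples linking an image in one domain to its counterpart in the other.

loss

The core idea of this paper is the cycle consistency loss. The intuition comes from translation: if a sentence is translated from English to French and then back from French, the result should match the original sentence, and vice versa. The cycle consistency loss applies this idea:
$G: X \rightarrow Y, F: Y \rightarrow X, F(G(x)) \approx x, G(F(y)) \approx y$

First, a quick review of GANs: the key is an adversarial loss that makes generated images hard to distinguish from real ones.

With that in mind, this paper's loss becomes clear: it is simply two GANs combined with the cycle consistency loss.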

adversarial loss

$L_{GAN}(G, D_{Y}, X, Y) = E_{y \sim p_{data}(y)}[log D_{Y}(y)] + E_{x \sim p_{data}(x)}[log(1 - D_{Y}(G(x)))]$
$L_{GAN}(F, D_{X}, Y, X) = E_{x \sim p_{data}(x)}[log D_{X}(x)] + E_{y \sim p_{data}(y)}[log(1 - D_{X}(F(y)))]$

cycle consistency loss

$L_{cyc}(G, F) = E_{x \sim p_{data}(x)}[\|F(G(x)) - x \|_1] + E_{y \sim p_{data}(y)}[\|G(F(y)) - y \|_1]$
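
As a toy illustration of the cycle consistency loss, the sketch below replaces the two generators with simple hand-written functions. These are hypothetical stand-ins, not real networks, and the expectation is estimated by a mean over a small random batch, which is my choice of normalization.

```python
import numpy as np

def cycle_consistency_loss(G, F, x_batch, y_batch):
    """Mean L1 distance between inputs and their round-trip reconstructions."""
    forward_cycle = np.mean(np.abs(F(G(x_batch)) - x_batch))    # x -> G -> F -> x
    backward_cycle = np.mean(np.abs(G(F(y_batch)) - y_batch))   # y -> F -> G -> y
    return forward_cycle + backward_cycle

# Toy stand-ins for the two generators (hypothetical, for illustration only).
G = lambda x: 2.0 * x + 1.0           # X -> Y
F_exact = lambda y: (y - 1.0) / 2.0   # exact inverse of G
F_bad = lambda y: y / 2.0             # forgets the offset

rng = np.random.default_rng(0)
x_batch = rng.normal(size=(8, 3))
y_batch = rng.normal(size=(8, 3))

loss_exact = cycle_consistency_loss(G, F_exact, x_batch, y_batch)
loss_bad = cycle_consistency_loss(G, F_bad, x_batch, y_batch)
print(loss_exact, loss_bad)           # about 0 for the exact inverse, about 1.5 otherwise
```

With `F_bad`, the forward cycle is off by 0.5 everywhere and the backward cycle by 1, so the loss penalizes any pair of mappings that are not inverses of each other.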

The overall idea is not hard to understand, and once you understand it, other papers become much easier to read, since many of them build on CycleGAN.

Chu Lin

Go where few have been, and leave your own footprints.
