Domain Adaptation Literature Notes
# Distribution Alignment
# Return of Frustratingly Easy Domain Adaptation[2015]
- Introduces CORAL (CORrelation ALignment) into computer vision
- Discusses how CORAL differs from feature normalization, manifold methods, and Maximum Mean Discrepancy (MMD)
- However, these approaches only align the bases of the subspaces, not the distribution of the projected points. They also require expensive subspace projection and hyperparameter selection.【Other unsupervised methods only align the subspace bases; they ignore how the data are distributed within the subspace】
- An alternative approach would be whitening the target and then re-coloring it with the source covariance. However, as demonstrated in (Harel and Mannor 2011; Fernando et al. 2013) and our experiments, transforming data from source to target space gives better performance. This might be due to the fact that by transforming the source to target space the classifier was trained using both the label information from the source and the unlabelled structure from the target.【Explains why the source is transformed into the target space rather than the other way around; see the sketch at the end of this section】
- For a linear classifier $f_{\vec{w}}(I)=\vec{w}^{T} \phi(I)$, we can apply an equivalent transformation to the parameter vector $\vec{w}$ instead of the features $u$. This results in added efficiency when the number of classifiers is small but the number and dimensionality of target examples is very high.【The feature transformation can instead be applied to the classifier parameters, which pays off when there are few classifiers but the target examples are numerous and high-dimensional】
- Relationship to Feature Normalization: In this example, although the features are normalized to have zero mean and unit variance in each dimension, the differences in correlations present in the source and target domains cause the distributions to be different.
- CORAL avoids subspace projection, which can be costly and requires selecting the hyper-parameter that controls the dimensionality of the subspace.【Why CORAL is preferable to low-dimensional manifold/subspace methods】
- Intuitively, symmetric transformations find a space that “ignores” the differences between the source and target domain while asymmetric transformations try to “bridge” the two domains.【A symmetric transformation between source and target tries to erase the difference, whereas an asymmetric one bridges the two domains; MMD is a symmetric transformation】
- In a neural network every layer's features suffer from domain shift; batch normalization standardizes each layer but erases the distributional characteristics of the two domains, so CORAL can also be applied inside neural networks
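A minimal sketch of the source-to-target CORAL transformation described above: whiten the source with its own regularized covariance, then re-color it with the target covariance. Function and argument names are chosen here for illustration, not taken from the paper.

```python
import numpy as np
from scipy import linalg


def coral(Xs, Xt, eps=1.0):
    """Align source features Xs (n_s x d) to target features Xt (n_t x d).

    Regularize both covariances with eps*I, whiten the source features,
    then re-color them with the target covariance.
    """
    d = Xs.shape[1]
    Cs = np.cov(Xs, rowvar=False) + eps * np.eye(d)
    Ct = np.cov(Xt, rowvar=False) + eps * np.eye(d)
    Xs_white = Xs @ linalg.fractional_matrix_power(Cs, -0.5)        # whitening
    Xs_aligned = Xs_white @ linalg.fractional_matrix_power(Ct, 0.5)  # re-coloring
    return np.real(Xs_aligned)  # drop tiny imaginary parts from the matrix roots
```

A classifier (e.g., a linear SVM) would then be trained on `coral(Xs, Xt)` with the source labels and applied directly to `Xt`.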
# Deep CORAL: Correlation Alignment for Deep Domain Adaptation[2016]
- However, it relies on a linear transformation and is not end-to-end trainable: it needs to first extract features, apply the transformation, and then train an SVM classifier in a separate step.【Limitations of the original CORAL】
- In this work, we extend CORAL to incorporate it directly into deep networks by constructing a differentiable loss function that minimizes the difference between source and target correlations–the CORAL loss.【Turns the original linear transformation into a nonlinear, end-to-end trainable one】
- Our proposed Deep CORAL approach is similar to DDC, DAN, and ReverseGrad in the sense that a new loss (CORAL loss) is added to minimize the difference in learned feature covariances across domains, which is similar to minimizing MMD with a polynomial kernel.【Similar in spirit to minimizing MMD; here the difference between covariances is minimized, see the sketch after this list】
- However, it is more powerful than DDC (which aligns sample means only), much simpler to optimize than DAN and ReverseGrad, and can be integrated into different layers or architectures seamlessly.【DDC (linear-kernel MMD) only aligns first-order statistics (means); CORAL goes one step further and aligns second-order statistics (covariances)】
- As mentioned before, the final deep features need to be both discriminative enough to train a strong classifier and invariant to the difference between source and target domains.【The network must both classify well and be robust to the domain shift】
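A hedged PyTorch sketch of the CORAL loss mentioned above: the squared Frobenius distance between the feature covariances of a source batch and a target batch, scaled by $1/(4d^2)$ as in the paper. The function name and arguments are placeholders.

```python
import torch


def coral_loss(source, target):
    """CORAL loss between a source batch and a target batch of features (n x d)."""
    d = source.size(1)

    def covariance(x):
        n = x.size(0)
        xc = x - x.mean(dim=0, keepdim=True)   # center the features
        return xc.t() @ xc / (n - 1)

    cs, ct = covariance(source), covariance(target)
    return ((cs - ct) ** 2).sum() / (4 * d * d)
```

During training this term is added to the source classification loss with a weight that trades off domain invariance against accuracy on the source labels.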
Further reading:
- DLID [1] trains a joint source and target CNN architecture with two adaptation layers. DDC [23] applies a single linear kernel to one layer to minimize Maximum Mean Discrepancy (MMD) while DAN [13] minimizes MMD with multiple kernels applied to multiple layers. ReverseGrad [5] and Domain-Confusion [22] add a binary classifier to explicitly confuse the two domains.【Several deep-network adaptation approaches】
# Discriminative Feature Alignment: Improving Transferability of Unsupervised Domain Adaptation by Gaussian-guided Latent Alignment[2020]
# Domain-Invariant Features
# 🛠️Unsupervised Domain Adaptation by Backpropagation[2014]
Project code: https://github.com/fungtion/DANN_py3
Proposes the DANN architecture.
Feature extraction and domain adaptation are trained jointly: minimize the classification loss of the main task (here, speaker classification) while maximizing the domain classification loss.
Gradient reversal layer (GRL): acts as the identity during the forward pass and multiplies the gradient by a negative scalar during the backward pass.
![](https://i.loli.net/2021/06/17/1JlRBFkuLw3SfDN.png)
The paper argues that since the features are high-dimensional, directly measuring the discrepancy between the two distributions is impractical; the domain classifier's loss is used instead as a proxy for the domain discrepancy.
To obtain domain-invariant features, the feature extractor must maximize the domain classifier's loss, while the domain classifier itself minimizes it.
The paper also argues that SGD can reach a saddle point of the objective, which is:
$$\begin{gathered} E\left(\theta_{f}, \theta_{y}, \theta_{d}\right)=\sum_{\substack{i=1..N \\ d_{i}=0}} L_{y}\left(G_{y}\left(G_{f}\left(\mathbf{x}_{i} ; \theta_{f}\right) ; \theta_{y}\right), y_{i}\right)-\lambda \sum_{i=1..N} L_{d}\left(G_{d}\left(G_{f}\left(\mathbf{x}_{i} ; \theta_{f}\right) ; \theta_{d}\right), d_{i}\right) \\ =\sum_{\substack{i=1..N \\ d_{i}=0}} L_{y}^{i}\left(\theta_{f}, \theta_{y}\right)-\lambda \sum_{i=1..N} L_{d}^{i}\left(\theta_{f}, \theta_{d}\right) \end{gathered}$$
Parameter updates:
$$\begin{gathered} \left(\hat{\theta}_{f}, \hat{\theta}_{y}\right)=\arg \min _{\theta_{f}, \theta_{y}} E\left(\theta_{f}, \theta_{y}, \hat{\theta}_{d}\right) \\ \hat{\theta}_{d}=\arg \max _{\theta_{d}} E\left(\hat{\theta}_{f}, \hat{\theta}_{y}, \theta_{d}\right) \end{gathered}$$
The GRL has a single hyperparameter $\lambda$, which is fixed before training rather than learned.
Implementing the GRL: subclass torch.autograd.Function, define the custom forward operation, and define the backward (gradient) function (reference: tsq292978891's CSDN blog).
In code, the GRL is placed directly after the feature extractor.
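A minimal PyTorch sketch of a GRL implemented as a torch.autograd.Function subclass, as described above; class and helper names are arbitrary.

```python
import torch


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda backward."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse and scale the gradient flowing back into the feature extractor;
        # the second return value is the (non-existent) gradient w.r.t. lambd.
        return -ctx.lambd * grad_output, None


def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)
```

In a DANN-style forward pass the domain classifier then receives `grad_reverse(features, lambd)` while the label classifier receives `features` unchanged.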
# Deep Domain Confusion: Maximizing for Domain Invariance[2014]
![](https://i.loli.net/2021/06/17/YFrxLANo3KXsmB4.png)
Adaptation layer + domain confusion loss, the latter based on Maximum Mean Discrepancy (MMD).
Domain confusion (MMD) can be used both to choose the dimensionality of the adaptation layer and to pick an effective position for a new adaptation layer in a pre-trained CNN, which is then fine-tuned.
The difference from DANN: DANN measures the gap between the two distributions with a domain discriminator, whereas here MMD is used.
$$\mathcal{L}=\mathcal{L}_{C}\left(X_{L}, y\right)+\lambda\, \mathrm{MMD}^{2}\left(X_{S}, X_{T}\right)$$
MMD decides which layer's activations to adapt: compute MMD on each candidate layer and place the adaptation layer after the layer with the smallest MMD. The dimensionality of the adaptation layer is chosen by a greedy search: pre-train with several candidate dimensionalities and keep the one with the smallest MMD.
The CNN parameters are shared between the source and target streams.
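A hedged sketch of the objective above with a linear-kernel MMD term (the squared distance between the source and target feature means); function names and the value of `lam` are illustrative only.

```python
import torch
import torch.nn.functional as F


def linear_mmd2(source_feat, target_feat):
    """Squared MMD with a linear kernel: distance between mean feature embeddings."""
    delta = source_feat.mean(dim=0) - target_feat.mean(dim=0)
    return (delta * delta).sum()


def ddc_objective(class_logits, labels, source_feat, target_feat, lam=0.25):
    # L = L_C(X_L, y) + lambda * MMD^2(X_S, X_T)
    return F.cross_entropy(class_logits, labels) + lam * linear_mmd2(source_feat, target_feat)
```

The same `linear_mmd2` value, computed on the activations of each candidate layer, is what drives the layer and dimensionality selection described above.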
# Simultaneous Deep Transfer Across Domains and Tasks[2015]
- Simultaneously optimizes for domain invariance to enable domain transfer, and uses a soft-label distribution-matching loss to transfer information between tasks (see the sketch below)
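A hedged sketch of the soft-label matching idea, assuming per-class average soft labels have already been computed on the source domain; all names and the temperature value are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def soft_label_loss(target_logits, class_soft_labels, target_labels, temperature=2.0):
    """Cross-entropy between per-class source soft labels and the
    temperature-softened predictions on labeled target examples.

    class_soft_labels: (num_classes, num_classes) average softened source
    predictions per class; target_labels: (batch,) class indices.
    """
    log_probs = F.log_softmax(target_logits / temperature, dim=1)
    soft_targets = class_soft_labels[target_labels]      # (batch, num_classes)
    return -(soft_targets * log_probs).sum(dim=1).mean()
```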
# Adversarial Discriminative Domain Adaptation[2017]
Reading notes: sinat_29381299's CSDN blog post and a Jianshu (jianshu.com) article on Adversarial Discriminative Domain Adaptation.
Prior generative approaches to adaptation can produce compelling visualizations, but they only handle small domain shifts; prior discriminative approaches can handle larger shifts, yet they impose tied (fixed) weights on the model and have not explored GAN-based losses.
The proposed ADDA combines discriminative modeling, untied weight sharing (symmetric vs. asymmetric mappings), and a GAN loss.
The paper argues that generative modeling of the input image distribution is unnecessary, because the end goal is to learn a discriminative representation.
Asymmetric mappings can model the differences in low-level features better than symmetric ones.
Compared with other generative methods, a GAN needs no complex sampling or inference during training; the drawback is that it is hard to train.
The source domain is labeled, so the source mapping can be learned with a supervised loss; the target domain is not, so the target mapping has to be constrained through its parameterization.
The target mapping usually shares the architecture of the source mapping, but various constraints are imposed on it so that after mapping the distance between source and target is minimized while the target features remain separable.
The constraint between the two mappings can be specified layer by layer, e.g., forcing corresponding layers of the two mappings to be identical.
The paper argues that a fully symmetric mapping (shared parameters) performs poorly, because a single network has to handle data from both domains; an asymmetric mapping ties only a subset of the parameters.
All adversarial discriminators use the standard classification loss; the methods differ in how the mapping is trained against it.
Although a GRL could train the mapping and the discriminator simultaneously, the paper argues that with the GRL the mapping's gradients vanish easily once the discriminator converges; the more common practice is the GAN loss with inverted labels (marked √ below). The discriminator loss is always:
$$\begin{aligned} \mathcal{L}_{\mathrm{adv}_{D}}\left(\mathbf{X}_{s}, \mathbf{X}_{t}, M_{s}, M_{t}\right) =&-\mathbb{E}_{\mathbf{x}_{s} \sim \mathbf{X}_{s}}\left[\log D\left(M_{s}\left(\mathbf{x}_{s}\right)\right)\right] \\ &-\mathbb{E}_{\mathbf{x}_{t} \sim \mathbf{X}_{t}}\left[\log \left(1-D\left(M_{t}\left(\mathbf{x}_{t}\right)\right)\right)\right] \end{aligned}$$
The minimax (GRL) formulation would set
$$\mathcal{L}_{\mathrm{adv}_{M}}=-\mathcal{L}_{\mathrm{adv}_{D}}$$
(√) The inverted-label GAN loss actually used for the target mapping:
$$\mathcal{L}_{\mathrm{adv}_{M}}\left(\mathbf{X}_{s}, \mathbf{X}_{t}, D\right)=-\mathbb{E}_{\mathbf{x}_{t} \sim \mathbf{X}_{t}}\left[\log D\left(M_{t}\left(\mathbf{x}_{t}\right)\right)\right]$$
In this paper the source and target mappings are separate: only the target mapping is trained adversarially, while the source mapping is kept fixed.
When both distributions keep changing during training, the optimization can oscillate: once the mapping converges to its optimum, the discriminator can simply flip the sign of its predictions in response.
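A hedged PyTorch sketch of one ADDA update using the losses above: the discriminator is trained with the standard loss, then only the target encoder is updated with the inverted-label term while the source encoder stays frozen. Module, optimizer, and batch names are placeholders, and the discriminator is assumed to output one logit per example.

```python
import torch
import torch.nn.functional as F


def adda_step(src_enc, tgt_enc, disc, opt_disc, opt_tgt, xs, xt):
    """One adversarial update: discriminator step, then target-encoder step."""
    # 1) Discriminator: source features labeled 1, target features labeled 0.
    with torch.no_grad():
        fs = src_enc(xs)                  # frozen source mapping
    ft = tgt_enc(xt).detach()             # no gradient to the target encoder here
    logits = torch.cat([disc(fs), disc(ft)]).squeeze(1)
    labels = torch.cat([torch.ones(fs.size(0), device=logits.device),
                        torch.zeros(ft.size(0), device=logits.device)])
    d_loss = F.binary_cross_entropy_with_logits(logits, labels)
    opt_disc.zero_grad()
    d_loss.backward()
    opt_disc.step()

    # 2) Target encoder: inverted labels, i.e. make target features look "source".
    m_loss = F.binary_cross_entropy_with_logits(
        disc(tgt_enc(xt)).squeeze(1),
        torch.ones(xt.size(0), device=logits.device))
    opt_tgt.zero_grad()
    m_loss.backward()
    opt_tgt.step()
    return d_loss.item(), m_loss.item()
```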