Literature notes on PLDA models
# LDA
# Local Pairwise Linear Discriminant Analysis for Speaker Verification[2018]
Optimizes LDA rather than PLDA.
Maximizes the local pairwise covariance, which captures the local structure of the target class and its neighboring non-target classes, rather than the between-class covariance, which captures the global structure of the data.
For a given speaker, nearby non-target utterances (locally confusable vectors) matter more for learning the decision boundary, so not every sample should be given equal importance [14-16].
LDA is an optimization problem: find the subspace V that maximizes the ratio of between-class to within-class scatter; [14-15] use NDA, which builds on the idea of local weighting.
This paper focuses on the non-target samples most similar to the target speaker. How are they selected? Take the n non-target samples closest to the target-class mean $\boldsymbol{\mu}_s$, compute their mean $\boldsymbol{\mu}_{\bar{s}}$, use the target and local non-target means as class representatives, and compute the local pairwise scatter (a numpy sketch follows the equation below):
$$\begin{aligned} \mathbf{S}_{\mathrm{lp}} &= \sum_{s=1}^{S} \frac{1}{2}\left[\left(\boldsymbol{\mu}_{s}-\frac{\boldsymbol{\mu}_{s}+\boldsymbol{\mu}_{\bar{s}}}{2}\right)\left(\boldsymbol{\mu}_{s}-\frac{\boldsymbol{\mu}_{s}+\boldsymbol{\mu}_{\bar{s}}}{2}\right)^{t}\right. \\ &\quad \left.+\left(\boldsymbol{\mu}_{\bar{s}}-\frac{\boldsymbol{\mu}_{s}+\boldsymbol{\mu}_{\bar{s}}}{2}\right)\left(\boldsymbol{\mu}_{\bar{s}}-\frac{\boldsymbol{\mu}_{s}+\boldsymbol{\mu}_{\bar{s}}}{2}\right)^{t}\right] \\ &= \frac{1}{4} \sum_{s=1}^{S}\left(\boldsymbol{\mu}_{s}-\boldsymbol{\mu}_{\bar{s}}\right)\left(\boldsymbol{\mu}_{s}-\boldsymbol{\mu}_{\bar{s}}\right)^{t} \end{aligned}$$
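A minimal numpy sketch of the neighbor selection and the scatter above (my own illustration, not the authors' code); the neighbor count `n_neighbors` is an assumed parameter, and similarity is the inner product after length normalization, as discussed in the following notes:

```python
import numpy as np

def local_pairwise_scatter(X, labels, n_neighbors=200):
    """X: (N, D) vectors; labels: (N,) speaker ids; returns the (D, D) scatter S_lp."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # length-normalize
    S_lp = np.zeros((X.shape[1], X.shape[1]))
    for spk in np.unique(labels):
        mu_s = X[labels == spk].mean(axis=0)           # target-class mean
        non_target = X[labels != spk]
        # n nearest non-target vectors: largest inner product = smallest distance
        idx = np.argsort(non_target @ mu_s)[-n_neighbors:]
        mu_bar = non_target[idx].mean(axis=0)          # local non-target mean
        d = (mu_s - mu_bar)[:, None]
        S_lp += 0.25 * (d @ d.T)                       # 1/4 (mu_s - mu_bar)(mu_s - mu_bar)^T
    return S_lp
```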
How should the number of negative samples n be chosen?
How is the distance between vectors computed? By the inner product after length normalization [1, 18]; note that a larger inner product indicates a smaller distance.
NDA finds the nearest negative samples for every positive sample, whereas LPLDA finds them for the positive class as a whole, which reduces the computation.
To pick the most suitable parameters k1 and k2 in the experiments, the authors visualize the results with an RGB map, which is intuitive and avoids both line charts and piles of tabulated numbers.
# GAN
# MTGAN: Speaker Verification through Multitasking Triplet Generative Adversarial Networks[2018]
- ASV is a zero-shot task: the training set and the test set share no speakers.
# Discriminant Analysis
# DISCRIMINATIVELY TRAINED PROBABILISTIC LINEAR DISCRIMINANT ANALYSIS FOR SPEAKER VERIFICATION
- Two-covariance (2cov) model: speaker and channel variability are represented by full-rank between-class and within-class covariance matrices, respectively.
- In general, the speaker and channel variables are not represented by full-rank covariance matrices; each is instead factored into a low-rank orthogonal matrix and its transpose [cf. "how to decompose a symmetric matrix into a matrix times its transpose, is there a MATLAB command?", Baidu Zhidao (baidu.com)], which constrains the speaker variable to the subspace spanned by the columns of the reduced-rank matrix.
- Using matrix notation, the scoring expression is rewritten as the inner product of a weight vector and an expanded vector built from the i-vector pair (a quick numerical check follows the equation):
$$\begin{aligned} s &=\mathbf{w}^{T} \varphi\left(\phi_{1}, \phi_{2}\right) \\ &=\left[\begin{array}{c} \operatorname{vec}(\boldsymbol{\Lambda}) \\ \operatorname{vec}(\boldsymbol{\Gamma}) \\ \mathbf{c} \\ k \end{array}\right]^{T}\left[\begin{array}{c} \operatorname{vec}\left(\phi_{1} \phi_{2}^{T}+\phi_{2} \phi_{1}^{T}\right) \\ \operatorname{vec}\left(\phi_{1} \phi_{1}^{T}+\phi_{2} \phi_{2}^{T}\right) \\ \phi_{1}+\phi_{2} \\ 1 \end{array}\right] \end{aligned}$$
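A quick numerical check (my own sketch, with random placeholder parameters) that the bilinear score and the expanded inner-product form above agree:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4
Lam = rng.standard_normal((D, D)); Lam = (Lam + Lam.T) / 2   # symmetric Lambda
Gam = rng.standard_normal((D, D)); Gam = (Gam + Gam.T) / 2   # symmetric Gamma
c = rng.standard_normal(D)
k = rng.standard_normal()
p1, p2 = rng.standard_normal(D), rng.standard_normal(D)      # the i-vector pair

# direct bilinear form of the score
s_direct = (p1 @ Lam @ p2 + p2 @ Lam @ p1
            + p1 @ Gam @ p1 + p2 @ Gam @ p2
            + c @ (p1 + p2) + k)

# expanded form: w^T phi(p1, p2)
w = np.concatenate([Lam.ravel(), Gam.ravel(), c, [k]])
phi = np.concatenate([(np.outer(p1, p2) + np.outer(p2, p1)).ravel(),
                      (np.outer(p1, p1) + np.outer(p2, p2)).ravel(),
                      p1 + p2, [1.0]])
assert np.isclose(s_direct, w @ phi)
```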
- Applying the sigmoid to the score above gives the probability of the same-speaker hypothesis, assuming equal priors for the two hypotheses.
$$p\left(\mathcal{H}_{s} \mid \phi_{1}, \phi_{2}\right)=\sigma(s)=(1+\exp (-s))^{-1}$$
- 🚩 Aims to train the weight vector without explicitly modelling the i-vector distribution.
- In logistic regression, the objective maximizes the accumulated log-likelihood of correctly classifying all training trials; equivalently, it minimizes the cross-entropy error accumulated over all training trials, which can be written with the logistic regression loss, where t ∈ {-1, 1}:
$$E(\mathbf{w})=\sum_{n=1}^{N} \alpha_{n} E_{L R}\left(t_{n} s_{n}\right)+\frac{\lambda}{2}|\mathbf{w}|^{2}$$
$$E_{L R}(t s)=\log (1+\exp (-t s))$$
- $\alpha$ weights the trial pairs: target pairs are far fewer than non-target pairs, so their influence on the classifier would otherwise be weak; with the weights their influence can be adjusted as needed.
- Similarly, the hinge loss from SVMs is also used; it maximizes the margin between classes and can be viewed as a piecewise-linear approximation of the logistic regression loss (a small sketch of both losses follows the equation below):
$$E_{S V}(t s)=\max (0,1-t s)$$
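For reference, a minimal sketch of the two per-trial losses and the weighted, regularized objective $E(\mathbf{w})$ (my illustration; the trial weights `alpha` and regularizer `lam` are assumed hyper-parameters):

```python
import numpy as np

def logistic_loss(ts):          # E_LR(t*s) = log(1 + exp(-t*s))
    return np.logaddexp(0.0, -ts)

def hinge_loss(ts):             # E_SV(t*s) = max(0, 1 - t*s)
    return np.maximum(0.0, 1.0 - ts)

def objective(w, scores, t, alpha, lam, loss=logistic_loss):
    """scores: per-trial scores s_n, t: labels in {-1, +1}, alpha: per-trial weights."""
    return np.sum(alpha * loss(t * scores)) + 0.5 * lam * np.dot(w, w)
```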
For both objective functions, the gradient of $E(\mathbf{w})$ is derived (omitted here).
Stack all i-vectors as the columns of one large matrix; trials are formed by drawing pairs (i, j). With t the labels, a the trial weights, and d the derivative of the loss, the gradient update can be written in matrix form, where each element of G is g = d·a (a numpy sketch follows the equation):
$$\nabla E(\mathbf{w})=\left[\begin{array}{c} \nabla_{\Lambda} L \\ \nabla_{\Gamma} L \\ \nabla_{c} L \\ \nabla_{k} L \end{array}\right]=\left[\begin{array}{c} 2 \cdot \operatorname{vec}\left(\boldsymbol{\Phi} \mathbf{G} \boldsymbol{\Phi}^{T}\right) \\ 2 \cdot \operatorname{vec}\left(\boldsymbol{\Phi}\left[\boldsymbol{\Phi}^{T} \circ\left(\mathbf{G} \mathbf{1} \mathbf{1}^{T}\right)\right]\right) \\ 2 \cdot \mathbf{1}^{T}\left[\boldsymbol{\Phi}^{T} \circ\left(\mathbf{G} \mathbf{1} \mathbf{1}^{T}\right)\right] \\ \mathbf{1}^{T} \mathbf{G} \mathbf{1} \end{array}\right]+\lambda \mathbf{w}$$
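The following numpy sketch is my reading of the matrix-form gradient above, assuming a symmetric G whose entry g_ij = a_ij · dE/ds for trial (i, j) and zero where no trial is formed:

```python
import numpy as np

def dplda_gradient(Phi, G, w, lam):
    """Phi: (D, N) i-vectors as columns; G: (N, N) symmetric weight*loss-derivative
    matrix; w: current stacked parameter vector; returns the stacked gradient."""
    N = Phi.shape[1]
    ones = np.ones(N)
    g_row = G @ ones                                    # (G 1)_i, row sums of G
    grad_Lam = 2.0 * (Phi @ G @ Phi.T)                  # sum_ij g_ij (phi_i phi_j^T + phi_j phi_i^T)
    grad_Gam = 2.0 * (Phi @ (Phi.T * g_row[:, None]))   # = 2 * Phi diag(G 1) Phi^T
    grad_c   = 2.0 * (Phi @ g_row)                      # sum_ij g_ij (phi_i + phi_j)
    grad_k   = ones @ G @ ones                          # sum_ij g_ij
    grad = np.concatenate([grad_Lam.ravel(), grad_Gam.ravel(), grad_c, [grad_k]])
    return grad + lam * w
```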
Previously the scores had to be computed first and the gradient updated afterwards; now, given the whole trial matrix, the gradient can be updated directly.
Four PLDA variants are compared: conventional, logistic-regression-trained, SVM-trained, and HT-PLDA [bsxfan/meta-embeddings: Meta-embeddings are a probabilistic generalization of embeddings in machine learning (github.com)].
The SVM-based objective works better, preferably combined with WCCN.
In fact HT-PLDA performs best, but its computational cost is also high.
# Neural-based
# NPLDA: A Deep Neural PLDA Model for Speaker Verification[2020]
- Proposes Neural PLDA and compares it with DPLDA and a pairwise Gaussian backend.
![](https://gitee.com/zhouwenjun2020/blog_pictures/raw/master/20210320111155.png)
- Input is a pair of embeddings (enrollment, test); the output score decides target vs. non-target.
- Uses the minimum detection cost (minDCF) as the loss function [the conventional loss, binary cross-entropy, leads to overfitting].
- LDA dimensionality reduction is implemented as a linear affine layer, length normalization as a non-linear activation layer, PLDA as another affine layer, and the final layer produces the score (see the sketch after this list).
- Initialized from the parameters of a conventionally trained PLDA; the result at iteration 0 serves as the baseline model; the genders in the training data need to be balanced.
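As referenced above, a minimal numpy sketch of this layer-wise view (my own illustration, not the released NPLDA code); the quadratic scoring parameters P and Q are an assumed form, initialized in the paper's spirit from a generatively trained PLDA:

```python
import numpy as np

class NeuralPLDASketch:
    def __init__(self, A_lda, b_lda, A_plda, b_plda, P, Q):
        self.A_lda, self.b_lda = A_lda, b_lda        # LDA as a linear affine layer
        self.A_plda, self.b_plda = A_plda, b_plda    # PLDA transform as another affine layer
        self.P, self.Q = P, Q                        # assumed quadratic scoring parameters

    def embed(self, x):
        h = self.A_lda @ x + self.b_lda
        h = h / np.linalg.norm(h)                    # length normalization (non-linear layer)
        return self.A_plda @ h + self.b_plda

    def score(self, x_enroll, x_test):
        e, t = self.embed(x_enroll), self.embed(x_test)
        return e @ self.P @ t + e @ self.Q @ e + t @ self.Q @ t
```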
Related papers:
- [x] PAIRWISE DISCRIMINATIVE NEURAL PLDA FOR SPEAKER VERIFICATION
- [x] LEAP System for SRE19 CTS Challenge - Improvements and Error Analysis
# mixPLDA
# Mixture of PLDA for Noise Robust I-Vector Speaker Verification
They found that NDA is more effective than the conventional LDA under noisy and channel degraded conditions.
However, to deal with cross channel tasks or tasks with varying noise and reverberation levels, the assumption of single Gaussian is rather limited.
In essence, the SNR is used as additional information to guide the clustering and dimension reduction process so that more prominent clusters in the ivector space can be formed.
SNR variability negatively affects PLDA speaker recognition accuracy, but its effect can be mitigated by explicitly modelling the SNR-dependent speaker subspace through mixture of PLDA.
Being close to the upper bound for the PC and close to 1.0 (the lower bound) for the PE suggests that i-vectors with variable noise levels have a clustering tendency, which means different noise levels shift the i-vectors to different positions in the i-vector space.
A key difference between this model and the one described in Section IV is that the posteriors of mixtures depend on the SNR of utterances instead of the i-vectors.
In other words, the same combination weights will be used regardless of the characteristics of the test utterances. This leads to a very inflexible mixture of PLDA. [SI-mPLDA uses fixed weights and ignores the characteristics of the test data.]
Generative Model of PLDA
$$S_{\mathrm{PLDA}}\left(\mathbf{x}_{s}, \mathbf{x}_{t}\right)=\frac{p\left(\mathbf{x}_{s}, \mathbf{x}_{t} \mid \text{same-speaker}\right)}{p\left(\mathbf{x}_{s} \mid \operatorname{Spk} s\right) p\left(\mathbf{x}_{t} \mid \operatorname{Spk} t\right)}=\frac{\mathcal{N}\left(\left[\mathbf{x}_{s}^{\top}\; \mathbf{x}_{t}^{\top}\right]^{\top} \mid\left[\mathbf{m}^{\top}\; \mathbf{m}^{\top}\right]^{\top}, \hat{\mathbf{V}} \hat{\mathbf{V}}^{\top}+\hat{\boldsymbol{\Sigma}}\right)}{\mathcal{N}\left(\mathbf{x}_{s} \mid \mathbf{m}, \mathbf{V} \mathbf{V}^{\top}+\boldsymbol{\Sigma}\right) \mathcal{N}\left(\mathbf{x}_{t} \mid \mathbf{m}, \mathbf{V} \mathbf{V}^{\top}+\boldsymbol{\Sigma}\right)}$$
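A small numpy/scipy sketch of this two-covariance likelihood ratio (my own illustration; it returns the log of S_PLDA, with m, V, Sigma assumed to come from a trained PLDA model):

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def plda_log_llr(x_s, x_t, m, V, Sigma):
    tot = V @ V.T + Sigma                          # marginal covariance of one i-vector
    ac = V @ V.T                                   # cross covariance under same-speaker
    joint = np.block([[tot, ac], [ac, tot]])       # covariance of the stacked pair
    num = mvn.logpdf(np.concatenate([x_s, x_t]), np.concatenate([m, m]), joint)
    den = mvn.logpdf(x_s, m, tot) + mvn.logpdf(x_t, m, tot)
    return num - den                               # log of S_PLDA
```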
SNR-INDEPENDENT MIXTURE OF PLDA
$$S_{\text{SI-mPLDA}}\left(\mathbf{x}_{s}, \mathbf{x}_{t}\right)=\frac{\sum_{k_{s}=1}^{K} \sum_{k_{t}=1}^{K} \varphi_{k_{s}} \varphi_{k_{t}} \mathcal{N}\left(\left[\mathbf{x}_{s}^{\top}\; \mathbf{x}_{t}^{\top}\right]^{\top} \mid\left[\mathbf{m}_{k_{s}}^{\top}\; \mathbf{m}_{k_{t}}^{\top}\right]^{\top}, \hat{\mathbf{V}}_{k_{s} k_{t}} \hat{\mathbf{V}}_{k_{s} k_{t}}^{\top}+\hat{\boldsymbol{\Sigma}}_{k_{s} k_{t}}\right)}{\left[\sum_{k_{s}=1}^{K} \varphi_{k_{s}} \mathcal{N}\left(\mathbf{x}_{s} \mid \mathbf{m}_{k_{s}}, \mathbf{V}_{k_{s}} \mathbf{V}_{k_{s}}^{\top}+\boldsymbol{\Sigma}_{k_{s}}\right)\right]\left[\sum_{k_{t}=1}^{K} \varphi_{k_{t}} \mathcal{N}\left(\mathbf{x}_{t} \mid \mathbf{m}_{k_{t}}, \mathbf{V}_{k_{t}} \mathbf{V}_{k_{t}}^{\top}+\boldsymbol{\Sigma}_{k_{t}}\right)\right]}$$
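A direct (and deliberately unoptimized) numpy/scipy sketch of the SI-mPLDA score above (my illustration); per-component weights `phis`, means, loadings `Vs` and covariances `Sigmas` are assumed inputs from a trained mixture model:

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def si_mplda_score(x_s, x_t, phis, means, Vs, Sigmas):
    K = len(phis)
    num = 0.0
    for ks in range(K):
        for kt in range(K):
            cross = Vs[ks] @ Vs[kt].T              # V_ks V_kt^T coupling block
            joint = np.block([[Vs[ks] @ Vs[ks].T + Sigmas[ks], cross],
                              [cross.T, Vs[kt] @ Vs[kt].T + Sigmas[kt]]])
            num += phis[ks] * phis[kt] * mvn.pdf(
                np.concatenate([x_s, x_t]),
                np.concatenate([means[ks], means[kt]]), joint)
    den_s = sum(phis[k] * mvn.pdf(x_s, means[k], Vs[k] @ Vs[k].T + Sigmas[k]) for k in range(K))
    den_t = sum(phis[k] * mvn.pdf(x_t, means[k], Vs[k] @ Vs[k].T + Sigmas[k]) for k in range(K))
    return num / (den_s * den_t)
```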
Essentially, the model incorporates supervised learning into the mixture of factor analysers [24]. [Why is this said to introduce supervised learning?]
SNR-DEPENDENT MIXTURE OF PLDA
$$S_{\text{SD-mPLDA}}\left(\mathbf{x}_{s}, \mathbf{x}_{t}\right)=\frac{\sum_{k_{s}=1}^{K} \sum_{k_{t}=1}^{K} \gamma_{\ell_{s}, \ell_{t}}\left(y_{k_{s}}, y_{k_{t}}\right) \mathcal{N}\left(\left[\mathbf{x}_{s}^{\top}\; \mathbf{x}_{t}^{\top}\right]^{\top} \mid\left[\mathbf{m}_{k_{s}}^{\top}\; \mathbf{m}_{k_{t}}^{\top}\right]^{\top}, \hat{\mathbf{V}}_{k_{s} k_{t}} \hat{\mathbf{V}}_{k_{s} k_{t}}^{\top}+\hat{\boldsymbol{\Sigma}}_{k_{s} k_{t}}\right)}{\left[\sum_{k_{s}=1}^{K} \gamma_{\ell_{s}}\left(y_{k_{s}}\right) \mathcal{N}\left(\mathbf{x}_{s} \mid \mathbf{m}_{k_{s}}, \mathbf{V}_{k_{s}} \mathbf{V}_{k_{s}}^{\top}+\boldsymbol{\Sigma}_{k_{s}}\right)\right]\left[\sum_{k_{t}=1}^{K} \gamma_{\ell_{t}}\left(y_{k_{t}}\right) \mathcal{N}\left(\mathbf{x}_{t} \mid \mathbf{m}_{k_{t}}, \mathbf{V}_{k_{t}} \mathbf{V}_{k_{t}}^{\top}+\boldsymbol{\Sigma}_{k_{t}}\right)\right]}$$
Further reading:
- SNR-invariant PLDA modeling in nonparametric subspace for robust speaker verification
- [theoretical basis of SI-mPLDA] The EM algorithm for mixtures of factor analyzers
- [origin of this paper] SNR-dependent mixture of PLDA for noise robust speaker verification
- [discussion of several PLDA variants] Unifying probabilistic linear discriminant analysis variants in biometric authentication
# Performance Evaluation of Mixtures of PLDA and Conventional PLDA for a Small-Set Speaker Verification System
SI-mPLDA: the i-vector is the posterior distribution of the mean, but it should not be restricted to a single Gaussian, so it is represented with a Gaussian mixture instead.
SD-mPLDA: each SNR level has its own factor analyzer (FA), and a GMM models the distribution of the SNR associated with each FA (the SNR is not a fixed value but fluctuates around some level).
Dev: randomly add babble, car, or office noise at 5-20 dB;
Enroll: randomly add babble, car, or office noise at 5-20 dB;
Test: add babble, car, office, or airplane noise at 0/6/10/20 dB.
# Supervized Mixture of PLDA Models for Cross-Channel Speaker Verification
# The EM Algorithm for Mixtures of Factor Analyzers[demo]
In this paper we present an EM learning algorithm for a method which combines one of the basic forms of dimensionality reduction, factor analysis, with a basic method for clustering, the Gaussian mixture model. What results is a statistical method which concurrently performs clustering and, within each cluster, local dimensionality reduction.
The diagonality of $\phi$ is one of the key assumptions of factor analysis: The observed variables are independent given the factors.
The diag operator sets all the off-diagonal elements of a matrix to zero.
Further reading:
# Mixture of PLDA Models in I-Vector Space for Gender-Independent Speaker Recognition
This paper also considers cross-gender trials (male enrollment, female test). [Does that mean the experiments I have been running are cross-gender trials?]
Is the gender of the enrollment data known?
In a speaker detection trial, two speech segments are given, each assumed to have been produced by a single speaker, and the question is asked whether the segments were produced by the same speaker, or by two different speakers. [This implies that no gender labels are available at the enrollment stage either.]
Note that by definition, target trials cannot have mixed gender, but non-target trials may be male, female or mixed.
In this paper, we are interested in the case where there may be mixed non-target trials and where no gender labels are provided.
# Identity Vector Extraction Using Shared Mixture of PLDA for Short-Time Speaker Recognition
However, short-time utterances have sparse statistics and the i-vectors extracted from these statistics are less reliable.
In our assumption, shared mixture of Gaussian PLDA is supposed to robust remodel i-vectors and each Gaussian component only represents one speaker subspace.
A 20 ms Hamming window with a 10 ms frame shift was used for frame feature extraction.
$\omega_{i j}=\sum_{m=1}^{M} c_{m}\left(\mu_{m}+V_{m} y_{i}+\epsilon_{i j}\right)$
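Read literally, the equation combines the M component terms with weights c_m around a shared speaker factor y_i and a single residual ε_ij; a tiny sampling sketch of that reading (my illustration, with an assumed residual covariance `Sigma`):

```python
import numpy as np

def sample_ivector(c, mus, Vs, y_i, Sigma, rng=None):
    """c: (M,) component weights; mus[m]: (D,); Vs[m]: (D, q); y_i: (q,) speaker factor."""
    rng = rng or np.random.default_rng()
    eps = rng.multivariate_normal(np.zeros(len(mus[0])), Sigma)   # residual for utterance j
    return sum(c_m * (mus[m] + Vs[m] @ y_i + eps) for m, c_m in enumerate(c))
```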
Experiments show that 2 mixture components for male speakers and 4 for female speakers work best; with higher mixture orders the model is hard to converge and performance degrades.
Further reading:
- 17, 2, 18
- 11-13: FP-PLDA (Full Posterior PLDA)
# PRML
# 12.2.4 Factor analysis
Its definition differs from that of probabilistic PCA only in that the conditional distribution of the observed variable x given the latent variable z is taken to have a diagonal rather than an isotropic covariance.
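A small sketch contrasting the two noise models (my illustration): probabilistic PCA takes p(x|z) = N(Wz + μ, σ²I), while factor analysis takes p(x|z) = N(Wz + μ, Ψ) with diagonal Ψ:

```python
import numpy as np

rng = np.random.default_rng(0)
D, q = 5, 2
W = rng.standard_normal((D, q))                 # loading matrix
mu = rng.standard_normal(D)
z = rng.standard_normal(q)                      # latent variable

sigma2 = 0.1
cov_ppca = sigma2 * np.eye(D)                   # PPCA: isotropic noise covariance
Psi = np.diag(rng.uniform(0.05, 0.5, size=D))   # FA: diagonal noise covariance

x_ppca = rng.multivariate_normal(W @ z + mu, cov_ppca)
x_fa = rng.multivariate_normal(W @ z + mu, Psi)
```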