Learning Sentence Patterns from Papers
Domain adaptation problems are typically set up such that there is a large amount of out-of-domain data and a small amount of labeled or unlabeled in-domain data available to train machine learning systems.
An important step in the speaker verification pipeline involves deriving a compact, low-dimensional representation from the audio of a speaker.
Fundamentally, the i-vector is a compact, fixed-length vector representation of a recording of arbitrary duration.
On the other hand, neural speaker embeddings show a distinct advantage over i-vectors when we consider short recordings [18, 20].
The metric used for performance evaluation is the equal error rate (EER).
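As a quick reminder of what EER measures, here is a minimal sketch in Python with NumPy (the function name and the simple threshold sweep are my own choices, not from any of the quoted papers): EER is the operating point where the false-acceptance rate equals the false-rejection rate.

```python
import numpy as np

def compute_eer(target_scores, impostor_scores):
    """Equal error rate: sweep thresholds and find the point where the
    false-acceptance rate (impostors accepted) equals the false-rejection
    rate (targets rejected)."""
    target_scores = np.asarray(target_scores, dtype=float)
    impostor_scores = np.asarray(impostor_scores, dtype=float)
    # Every observed score is a candidate threshold.
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    i = int(np.argmin(np.abs(far - frr)))   # closest crossing point
    return (far[i] + frr[i]) / 2.0
```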
In this section we provide the details of our experimental setup as well as our speaker verification results.【opening line for an experiments section】
Both theoretical (Ben-David et al. 2007; Blitzer, Dredze, and Pereira 2007) and practical results (Saenko et al. 2010; Torralba and Efros 2011) have shown that the test error of supervised methods generally increases in proportion to the “difference” between the distributions of training and test examples.
Specifically, it learns the within-class variability, which characterizes distortions, and the between-class variability, which characterizes speaker information.
This opens the door for using larger in-domain datasets since the cost of labeling is eliminated.【the advantage of unsupervised adaptation】
We now discuss the three key components of the approach: the clustering technique, the determination of the number of clusters, and the adaptation mechanism.
The challenging problem of domain mismatch arises when a speaker recognition system is used in a different domain (e.g., different languages, demographics, etc.) than that of the training data.
It is impractical to re-train the system for each and every domain, as collecting large labelled data sets is expensive and time-consuming.
PLDA adaptation is preferable in practice since the same feature extraction and speaker embedding front-end can be used, while domain-adapted PLDA backends cater for the conditions of each specific deployment.
In the case of unsupervised adaptation (i.e., no labels are given), the major challenge is how the adaptation can be performed on the within-class and between-class covariance matrices, given that only the total covariance matrix $\Sigma_{\mathrm{tot}} = \Sigma_{\mathrm{b}} + \Sigma_{\mathrm{w}}$ can be estimated directly from the in-domain data.
No class (i.e., speaker) label is used, and the method therefore belongs to the class of unsupervised adaptation techniques.
《Deep CORAL: Correlation Alignment for Deep Domain Adaptation》
- CORAL [18] is a simple unsupervised domain adaptation method that aligns the second-order statistics of the source and target distributions with a linear transformation (a minimal sketch follows this list).
- Instead of collecting labeled data and training a new classifier for every possible scenario, unsupervised domain adaptation methods try to compensate for the degradation in performance by transferring knowledge from labeled source domains to unlabeled target domains.
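A minimal sketch of the linear CORAL transform quoted above (this is classic CORAL, not the Deep CORAL loss of the paper's title; NumPy, the ridge term `eps`, and all names are my own assumptions):

```python
import numpy as np

def _sym_matrix_power(C, p):
    """Power of a symmetric positive-definite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(C)
    return (V * (w ** p)) @ V.T

def coral_align(source, target, eps=1e-5):
    """Re-color source features so their second-order statistics match the
    target domain.  source: (n_s, d) array; target: (n_t, d) unlabeled array."""
    d = source.shape[1]
    # Covariances with a small ridge so the matrix powers are well defined.
    cov_s = np.cov(source, rowvar=False) + eps * np.eye(d)
    cov_t = np.cov(target, rowvar=False) + eps * np.eye(d)
    # Whiten the source, then re-color with the target covariance:
    # a single linear map A = Cs^(-1/2) @ Ct^(1/2).
    A = _sym_matrix_power(cov_s, -0.5) @ _sym_matrix_power(cov_t, 0.5)
    return source @ A
```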
《An investigation of domain adaptation in speaker embedding space for speaker recognition》
- Mismatch conditions (Hansen and Hasan, 2015) can be divided into two categories: extrinsic (channel, noise, etc.) and intrinsic (duration, language, and speaker traits including stress, emotion, Lombard effect, vocal effort, and accent).
- These results confirm that SVDA can measurably improve speaker recognition performance on the SRE-16 and SRE-18 tasks, by +15% and +8% respectively in terms of min-Cprimary, and by +14% and +16% respectively in terms of EER, using i-vector speaker embeddings as the baseline.
- From an alternative viewpoint, domain mismatch compensation methods can also be categorized into supervised or unsupervised techniques. When in-domain data are unlabeled, pseudo-labeling can be integrated into the system to enable supervised adaptation (a minimal sketch follows this list).
- At the same time, the methods presented in this paper are applicable to other tasks that suffer from the domain mismatch problem, such as language recognition or dialect identification, where data are recorded under mismatched conditions.
- In the i-vector framework, discriminant analysis and dimension reduction techniques such as SVDA and LDA are shown to be more effective at compensating for domain mismatch than PLDA.
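A minimal sketch of the pseudo-labeling idea from the third bullet, assuming a recent scikit-learn; the clustering choice (agglomerative, cosine distance, average linkage), the threshold value, and the function name are illustrative assumptions, not the paper's recipe:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def pseudo_label(embeddings, distance_threshold=0.6):
    """Cluster unlabeled in-domain embeddings and treat cluster IDs as
    pseudo speaker labels for supervised back-end (e.g., PLDA) adaptation.
    embeddings: (n, d) array of speaker embeddings such as i-vectors."""
    # Length-normalize so cosine distance is meaningful.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    clusterer = AgglomerativeClustering(
        n_clusters=None,                       # let the threshold decide
        distance_threshold=distance_threshold,
        metric="cosine",
        linkage="average",
    )
    return clusterer.fit_predict(x)            # pseudo speaker labels
```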
《Mixture of PLDA for Noise Robust I-Vector Speaker Verification》
- This paper extends the SNR-dependent mixture of PLDA in [35] in the following four aspects:
- $A$ denotes the degree of membership of the vector $x_i$ in cluster $k$
- A value of PC close to 1, or a value of PE close to 0, indicates perfect clustering; on the other hand, a PC close to $1/K$ or a PE close to $\log K$ indicates the absence of clustering tendency [38] (see the sketch of both indices after this list).
- The speakers for training the PLDA models and for SID tests are mutually exclusive.
- Note that throughout the paper, we used a simplified variant of PLDA, commonly called Gaussian PLDA [13] or simplified PLDA.
- However, to deal with cross-channel tasks or tasks with varying noise and reverberation levels, the assumption of a single Gaussian is rather limited.
- Given the target speaker's i-vector $x_s$ and the test i-vector $x_t$, the likelihood ratio score is ... (a sketch of this score for the Gaussian PLDA case follows this list).
- The performance of ... is shown in ...; refer to ...
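For the PC/PE bullet above, a minimal sketch of the two standard fuzzy-clustering validity indices (partition coefficient and partition entropy), assuming NumPy and an N x K membership matrix whose rows sum to 1; the function names are mine:

```python
import numpy as np

def partition_coefficient(U):
    """PC = (1/N) * sum_ik u_ik^2.  Equals 1 for crisp clustering and
    1/K when every point belongs equally to all K clusters."""
    return float((U ** 2).sum() / U.shape[0])

def partition_entropy(U, eps=1e-12):
    """PE = -(1/N) * sum_ik u_ik * log(u_ik).  Equals 0 for crisp
    clustering and log(K) when memberships are uniform."""
    return float(-(U * np.log(U + eps)).sum() / U.shape[0])
```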
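And for the likelihood-ratio bullet, a sketch of the Gaussian (simplified) PLDA score in its two-covariance form, assuming zero-mean, length-normalized i-vectors; the parameterization by a between-speaker covariance `B` and within-speaker covariance `W` is the standard two-covariance model, but the names are mine, and the paper's mixture extension would combine such scores over SNR-dependent components:

```python
import numpy as np
from scipy.stats import multivariate_normal

def plda_llr(xs, xt, B, W):
    """log p(xs, xt | same speaker) - log p(xs, xt | different speakers).
    B: between-speaker covariance; W: within-speaker covariance."""
    d = xs.shape[0]
    T = B + W                          # total covariance of one i-vector
    x = np.concatenate([xs, xt])
    zeros = np.zeros((d, d))
    # Same speaker: the two i-vectors share one latent speaker factor,
    # so their joint Gaussian has cross-covariance B.
    cov_same = np.block([[T, B], [B, T]])
    # Different speakers: independent draws from the same marginal.
    cov_diff = np.block([[T, zeros], [zeros, T]])
    mean = np.zeros(2 * d)
    return (multivariate_normal.logpdf(x, mean=mean, cov=cov_same)
            - multivariate_normal.logpdf(x, mean=mean, cov=cov_diff))
```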