Professional English: English Presentation Outline
Tasks to complete:
- [x] 2022.11.15 Finish the first two paragraphs of the introduction
- [x] 2022.11.16 Finish every part except Fit-RNN and the abstract
- [x] 2022.11.17 Finish Fit-RNN and the abstract
- [x] 2022.11.18 Finish the detailed revisions
- [x] 2022.11.19 Finish the translation
- [x] 2022.11.20 Finish the slides and prepare 5 questions
- [x] 2022.11.21 Write the manuscript, D-day!
Page | Manuscript |
---|---|
1 | Thank you, Mr. Chairman. Today, I will introduce our recent work on text-independent speaker recognition using a folded-in-time neural network. |
2 | The outline of my talk is as follows. In the first part, I will introduce what speaker recognition is and why we use a folded-in-time neural network. Next, I will show how to derive[dɪˈraɪv] a feed-forward network from a delay differential[ˌdɪfəˈrenʃ(ə)l] equation with modulated feedback terms. In the third part, a slight modification is made to extend the folded-in-time concept to recurrent[rɪˈkʌrənt] neural networks. Then, experiments are conducted on the open audio dataset TIMIT, and the results are analyzed in that part. Finally, I will give a brief summary of the talk. |
3 | As one of the pattern[ˈpætə(r)n] recognition tasks, speaker recognition identifies persons from their voice. It can be classified into text-dependent and text-independent tasks. As the name implies, in text-independent speaker recognition the text content['kɒntent] is arbitrary and changeable. So, it is more robust[rəʊˈbʌst] to attacks such as recording replay and speech synthesis['sɪnθəsɪs], and of course more challenging. In our work, we focus on closed-set recognition: all speakers appear in the training phase[feɪz]. As we said in the last talk, it follows the basic flow of the deep-learning training process, that is, feature extraction, forward propagation, loss calculation, error back-propagation, and gradient['ɡreɪdiənt] and parameter[pəˈræmɪtə(r)] update. |
4 | In recent years, deep learning and neural networks have made rapid progress in pattern recognition, including speaker recognition. As we all know, unlike image signals, audio signals carry temporal['temp(ə)rəl] information, so different network structures have their own ways of incorporating it. Feedforward neural networks process discrete frame-level speech features directly, or stack nearby frames as a one-dimensional input. The time-delay neural network merges the left and right speech frames layer by layer. Recurrent neural networks incorporate temporal information in the network structure itself. |
5 | In this picture, LSTM is shown as a variant[ˈveəriənt] of the RNN. We can see that it considers not only the input at the current moment but also the state of the hidden nodes at the previous moment. This gives the network memory capability[keɪpə'bɪləti]. (A supplementary recurrence sketch for this page follows the table.) |
6 | Currently, the Graphics Processing Unit (GPU) is the mainstream platform for deep neural networks, since it is particularly suitable for parallel computing. However, the GPU is a von Neumann architecture, which means its computational module and storage unit are separated. Frequent I/O operations reduce its energy efficiency and processing speed. So, more researchers are focusing on alternative hardware platforms based on non-von Neumann architectures, such as photonic neural networks. Currently, three typical structures have been realized: feedforward neural networks, reservoir computing, and spiking neural networks. In this picture, a feedforward neural network is constructed based on light diffraction[dɪ'frækʃ(ə)n]. However, some optical devices are unstable and difficult to fine-tune, so it is difficult to realize large-scale network topologies. Therefore, how to build the network with fewer components has become one of the important challenges. |
7 | In 2011, the proposal of dynamic reservoir[ˈrezə(r)ˌvwɑː(r)] computing introduced the delay-dynamical system to neural networks. It uses one nonlinear neuron and one delay loop to realize reservoir computing. Reservoir computing is very similar to a recurrent neural network, but its input and hidden weights are fixed. As shown in the picture, with time-multiplexing['mʌltɪpleksɪŋ], the signals are serialized[ˈsɪəriəlaɪz] and fed into a single neuron, and then returned to the neuron via a delay loop. This allows the system to respond to the input signal over a long period[ˈpɪəriəd] of time, so it is usually used in speech recognition and chaotic time-series prediction. The picture in the lower right corner shows its hardware implementation. With such a space-time trade-off, a delay-dynamical system can also unfold neurons into arbitrary[ˈɑː(r)bɪtrəri] topologies[təˈpɒlədʒi]. However, similar to dynamic reservoir computing, the connection weights between virtual nodes are fixed, so the learning and expressive ability of the network is limited. |
8 | To remove this limitation, the folded-in-time DNN adds feedback modulation on the delay loops. When a feedforward network is folded into one neuron and multiple delay loops, the nodes within a layer can no longer be computed in parallel, so the folded-in-time DNN simulates each node in the hidden layer in turn. The picture in the upper left corner shows the composition of the system. The driving signal of the neuron is built from the input signal and the modulated feedback signals. The node state is obtained by applying a nonlinear function, such as the sine function, to the driving signal. The lengths of the delay loops determine the connection directions between nodes. In the picture on the right, the distance between virtual nodes is theta, and T is the time needed to compute the node states of one hidden layer. The blue lines indicate delay loops shorter than T, and the orange ones indicate delay loops longer than T. (A supplementary equation sketch for this page follows the table.) |
9 | Since we apply it to speaker recognition, in the following I will introduce it from the perspective of audio signal processing. The speech feature MFCC is extracted and multiplied by random input weights. The input signal is normalized to the range zero to one and is held constant within each interval theta. Then, the signal is fed into the delay-dynamical system, and the delay differential equation is solved to obtain the node states. It is worth noting that the non-zero values of the hidden weight matrix lie along diagonal directions. The output layer is processed in the same way as in the traditional method. (A supplementary preprocessing sketch for this page follows the table.) |
10 | The feedback modulation on the delay loops acts as the weights between nodes. Like a DNN, the folded-in-time DNN uses stochastic[stə'kæstɪk] gradient['ɡreɪdiənt] descent[dɪ'sent] to update the trainable parameters. Because of the coupling between nearby nodes, the chain rule needs to be modified. In short, the error is passed back node by node. Except for the last hidden layer, the gradient of the current node state comes from the gradient of the node states in the next layer and the gradient of the driving signal in the next interval. (A supplementary back-propagation sketch for this page follows the table.) |
11 | The parameter settings for error back-propagation are shown on this page. |
12 | We also extend the folding concept to the recurrent neural network. As we can see from the picture, taking a single-layer RNN as an example, the connections between adjacent time steps are equivalent to the connections between layers in a DNN. The delay loops connect nodes from different time steps, and the model parameters are shared across all time steps. Unlike the folded-in-time DNN, it accepts input information in every interval, and the modulation functions must be T-periodic[ˌpɪəriˈɒdɪk]. |
13 | We simulate the speaker recognition task on the open dataset TIMIT. Ten speakers are selected, five men and five women. For a fair comparison, the input samples of all networks are grouped sequentially into bins of 30 frames. For the RNN-based networks, the frames in one bin are fed into the network time step by time step. For the DNN-based networks, statistics are calculated over each bin. (A supplementary binning sketch for this page follows the table.) |
14 | The impact of two important parameters is explored: the virtual node distance theta and the number of delay loops. As theta increases, the network's update speed increases, the recognition accuracy gradually rises until it levels off, and the best result is obtained with fewer iterations. When theta is above zero point five, the system responds adequately to the input signal, and the recognition accuracy gradually surpasses that of the DNN as theta increases. When theta is below four, the coupling between adjacent nodes is an important part of the gradient update; this ensures that the errors are passed back in the correct direction. However, as theta increases further, the local connections become unnecessary. So, adjusting theta achieves a balance between computation time and the length of the delay loops. |
15 | Then, we select delay loops of different lengths by Gaussian sampling. The results in this table show that increasing the number of delay loops does not lead to a significant improvement in recognition accuracy, so sparse connectivity is advocated. With only 10 loops, the accuracy decreases only slightly. But deleting one loop removes all the values along one diagonal of the weight matrix, so how to find a better sparse form is one of the problems still to be solved. Now we have found suitable values of the virtual node distance and the number of delay loops for the later experiments. |
16 | In this part, we examine the applicability of the folded-in-time DNN to speaker recognition, including the accuracy and the confidence of the predictions. The confusion matrix above shows how the speech bins of each speaker are distributed over the predicted labels. False predictions occur mostly within the same gender and rarely across genders. We also calculate the margin between the best guess and the closest competitor to further examine the confidence of the decision. The picture below indicates that, within the same gender, women are more difficult to identify than men. These results are consistent with what is expected for speaker recognition. (A supplementary margin sketch for this page follows the table.) |
17 | In the last part, we examine the performance of the folded-in-time RNN. The one-layer RNN has the same number of model parameters as the two-layer equal-width DNN. The table shows the bin accuracy and utterance accuracy of the DNN and the RNN under the unfolded and folded conditions. The results indicate that the node states of the hidden layer characterize the temporal information of audio better than the statistics do. In addition, similar to Fit-DNN, the folded RNN has the advantage of local connections between virtual nodes. Both bin accuracy[ˈækjʊrəsi] and utterance['ʌt(ə)rəns] accuracy are improved over the RNN. |
18 | Now, let me give a brief summary; the specific numerical values can be found in the paper. We applied Fit-DNN to closed-set text-independent speaker recognition. An equivalent DNN was implemented with one nonlinear neuron and several delay loops, and a slight change was made to extend the folding approach to RNNs. The main conclusions are as follows. The local connections between adjacent nodes in Fit-DNN increase the complexity of the network topology. There is no clear relationship between the recognition accuracy and the number of delay loops in Fit-DNN, so sparse connectivity is advocated. The recognition results based on the folded-in-time DNN are consistent with what is expected for the speaker recognition task. For audio with temporal information, the folded-in-time recurrent neural network outperforms Fit-DNN. Similar to Fit-DNN, local connections exist between neighboring virtual nodes, so it also outperforms traditional RNNs. |
19 | I hope that was clear. Due to the time limit, the question session ends here. Now, let's welcome Doctor Wang Xiaofei to give her presentation. |
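
Supplementary recurrence sketch for page 5. These are the standard textbook equations for a vanilla RNN and its LSTM variant, written here only as a reminder of how the current input is combined with the previous hidden state; the notation is generic and not taken from the slide.

```latex
% Vanilla RNN: the hidden state mixes the current input x_t with the previous state h_{t-1}
h_t = f\left(W_x x_t + W_h h_{t-1} + b\right)

% LSTM variant: gates control what the cell state c_t forgets, stores, and exposes
f_t = \sigma(W_f [h_{t-1}, x_t] + b_f), \quad
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i), \quad
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)
c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c [h_{t-1}, x_t] + b_c), \qquad
h_t = o_t \odot \tanh(c_t)
```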
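
Supplementary equation sketch for page 8. The slide does not restate the equation in the manuscript, so the following is only the general form of a delay differential equation with modulated feedback that matches the verbal description above (one neuron, a driving signal built from the input plus delayed, modulated feedback, node distance theta, layer time T); all symbols are illustrative.

```latex
% Single-neuron state x(t) driven by a(t); f is the nonlinearity (e.g. the sine function)
\dot{x}(t) = -\alpha\, x(t) + f\big(a(t)\big),
\qquad
a(t) = J(t) + b(t) + \sum_{d=1}^{D} \mathcal{M}_d(t)\, x(t-\tau_d)

% J(t): time-multiplexed input, held constant on each interval of length \theta
% b(t): bias modulation; \mathcal{M}_d(t): feedback modulation on the d-th delay loop
% \tau_d < T gives the "blue" connections, \tau_d > T the "orange" ones,
% where T is the time needed to process one hidden layer
```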
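
Supplementary preprocessing sketch for page 9. A minimal NumPy illustration, under assumptions, of the input path described on that page: MFCC features are multiplied by random input weights, normalized to [0, 1], and held constant over each interval theta. The function name, shapes, and normalization are placeholders rather than our actual implementation, and solving the delay differential equation itself is omitted.

```python
import numpy as np

def prepare_input_signal(mfcc, n_virtual_nodes, samples_per_theta, rng=None):
    """Project frame-level MFCCs onto the virtual nodes and build a piecewise-constant drive.

    mfcc: array of shape (n_frames, n_coeffs); one row drives one pass through a hidden layer.
    samples_per_theta: solver samples per node interval theta (sample-and-hold).
    """
    rng = np.random.default_rng(rng)
    n_frames, n_coeffs = mfcc.shape

    # Fixed random input weights (assumed untrained in this sketch).
    w_in = rng.standard_normal((n_coeffs, n_virtual_nodes))

    # Project the features and normalize the whole signal into [0, 1].
    proj = mfcc @ w_in
    proj = (proj - proj.min()) / (proj.max() - proj.min() + 1e-12)

    # Hold each projected value constant over its interval of length theta.
    drive = np.repeat(proj, samples_per_theta, axis=1)
    return drive  # shape: (n_frames, n_virtual_nodes * samples_per_theta)
```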
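
Supplementary back-propagation sketch for page 10. This is only a schematic restatement of the modified chain rule described on that page, with illustrative symbols rather than the paper's exact notation: for every node that is not in the last hidden layer, its error collects one term from the node states of the next layer and one term from the driving signal of the next interval.

```latex
% \Delta_n^{(\ell)} := \partial\mathcal{L} / \partial x_n^{(\ell)}  (error of node n in hidden layer \ell)
\Delta_n^{(\ell)} =
\underbrace{\sum_{j} \frac{\partial x_j^{(\ell+1)}}{\partial x_n^{(\ell)}}\, \Delta_j^{(\ell+1)}}_{\text{via the node states of the next layer}}
\; + \;
\underbrace{\frac{\partial a_{n+1}^{(\ell)}}{\partial x_n^{(\ell)}}\,
\frac{\partial\mathcal{L}}{\partial a_{n+1}^{(\ell)}}}_{\text{via the driving signal of the next interval}}
```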
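
Supplementary binning sketch for page 13. A small NumPy helper showing the grouping scheme described on that page: frames are binned sequentially in groups of 30, the RNN-based networks consume a bin frame by frame, and the DNN-based networks see per-bin statistics. Mean and standard deviation are used here only as an illustrative choice of statistics; the exact statistics are not restated on the slide.

```python
import numpy as np

BIN_SIZE = 30  # frames per bin, as used in the experiments

def make_bins(features):
    """Group frame-level features (n_frames, n_coeffs) sequentially into bins of 30 frames."""
    n_bins = features.shape[0] // BIN_SIZE
    return features[: n_bins * BIN_SIZE].reshape(n_bins, BIN_SIZE, -1)

def dnn_inputs(bins):
    """Per-bin statistics for the DNN-based networks (mean/std chosen for illustration)."""
    return np.concatenate([bins.mean(axis=1), bins.std(axis=1)], axis=-1)

def rnn_inputs(bins):
    """For the RNN-based networks each bin is fed frame by frame, i.e. 30 time steps."""
    return bins  # shape: (n_bins, 30, n_coeffs)
```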
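
Supplementary margin sketch for page 16. One simple way to compute the decision margin mentioned on that page is the difference between the score of the best guess and that of the closest competitor; the score array and its interpretation here are placeholders for whatever the classifier actually outputs.

```python
import numpy as np

def decision_margin(scores):
    """Margin between the best guess and its closest competitor.

    scores: array of shape (n_bins, n_speakers) with one classification score per speaker.
    Returns the per-bin margin; larger values indicate a more confident decision.
    """
    top_two = np.sort(scores, axis=1)[:, -2:]   # the two highest scores per bin
    return top_two[:, 1] - top_two[:, 0]        # best guess minus runner-up
```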
Question preparation
Q1: On page 12, how is the continuous audio signal converted into the discrete MFCC features?
A1: Since the audio signal is a quasi-stationary signal, it is usually divided into frames during processing, with each frame about 20 ms to 30 ms long; within this interval the speech signal can be treated as stationary. Only stationary information can be handled by the signal-processing steps, so framing comes first. The frames are then passed through a set of filters that approximate the response of the human ear.
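
A minimal sketch of the framing and MFCC step described in A1, using librosa; the frame length (about 25 ms) and hop length (10 ms) are common defaults and only illustrative, not necessarily the settings used in the experiments.

```python
import librosa

def extract_mfcc(path, n_mfcc=13):
    """Frame the waveform (~25 ms windows, 10 ms hop) and compute MFCCs per frame."""
    y, sr = librosa.load(path, sr=16000)      # TIMIT audio is sampled at 16 kHz
    frame_length = int(0.025 * sr)            # ~25 ms analysis window
    hop_length = int(0.010 * sr)              # ~10 ms frame shift
    # librosa applies a mel filter bank (an approximation of the ear's response)
    # before the DCT that yields the cepstral coefficients.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame_length, hop_length=hop_length)
    return mfcc.T                             # shape: (n_frames, n_mfcc)
```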
Q2: Can the folded-in-time concept be extended to convolutional neural networks or other classic architectures?
A2: Unlikely at this time. One neuron and some delay loops alone cannot realize the pooling layers of convolutional neural networks. We might need to convert the optical signal to an electrical one, do some processing, and then convert it back to an optical signal; I am not very sure. Actually, DNNs, CNNs, and RNNs can all be used as components of more complex network architectures. We could use one neuron plus its delay loops as a component and join such components together, like Lego. But how the signal would be passed between them has not been worked out yet. Maybe further work can be done later.
Q3: Has the folded-in-time RNN been compared with dynamic reservoir computing? And can it be made bi-directional?
A3: About the first question, we have done comparisons on a small scale, with 50 speakers from TIMIT, but we have not had time to summarize this part of the work. It performs better than dynamic reservoir computing, since the input and hidden weights in dynamic reservoir computing are fixed, but we do not know how it behaves at large scale; we will continue afterwards. As for the second question, how to implement a bi-directional RNN: we have not implemented it yet, but it is theoretically possible. Maybe two neurons are needed, or maybe the connections of the delay loops need to be changed. Sorry, I cannot answer definitively yet.
Q4: On page 8, why are the non-zero values of the weight matrix located along the diagonal directions? Can dynamic sparsity be used in your work?
A4: The locations of the non-zero elements are determined by the delay loops. … [improvise here]
Q5: On page 14, why is Gaussian sampling used to select the lengths of the delay loops?
A5: The authors of Fit-DNN compared two methods for choosing the delays, Gaussian sampling and equal-interval sampling. Their results show that the influence of the chosen method on the quality of the results is small and seems to be insignificant.
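
A small sketch of the two delay-selection schemes mentioned in A5; the mean, spread, range, and rounding below are illustrative assumptions, not the exact settings used in the paper.

```python
import numpy as np

def sample_delays(n_loops, n_nodes, scheme="gaussian", rng=None):
    """Pick delay-loop lengths, measured in units of the node distance theta.

    'gaussian': lengths drawn around the layer length T = n_nodes;
    'equal': lengths spaced evenly across roughly the same range.
    """
    rng = np.random.default_rng(rng)
    if scheme == "gaussian":
        delays = rng.normal(loc=n_nodes, scale=n_nodes / 4, size=n_loops)
    else:  # equal-interval sampling
        delays = np.linspace(n_nodes / 2, 3 * n_nodes / 2, n_loops)
    # Round to whole node distances and keep the delays positive and unique.
    return np.unique(np.clip(np.round(delays), 1, None)).astype(int)
```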