[paper] ecapa_tdnn_summary!
Abstract
X-vector achitecture is TDNN. Statistics Pooling project utterance into speaker embeddings
ECAPA-TDNN has 3 enhancement to X-vector
1. Skip connection (Resnet) and Squeeze-and-Excitation (SENet) modules are introduced to model channel interdependencies.
The SE block expands the temporal context of frame layer by rescaling the channels according to global properties of the recording
2. Aggregating and Propagating features of different hierarchical levels. Originally, features operate separately in layers
3. Statistics Pooling is replaced with Channel-dependent frame attention.
Introduction
bottleneck -> low dimensional speaker embedding
Consine Similarity or PLDA training compares 2 embeddings
Additive Angular margin optimization, ResNet aprroaches in frame-level layer, temporal self-attention are introduced in X-Vector
Statistics Pooling (projects variable length input into a fixed-length representation)
AAM (good at image detection) ResNet (enable the back-propagation faster and avoid vanishing gradient) self-attention (focus on important frames)
DNN speaker recognition systems (Baseline)
Extended-TDNN x-vector
The initial frame layer of X-vector has 1 dimensional dilated convolutional layer interleaved with dense layers
Residual Connection
frame layer -> attentive statistics pooling layer -> 2 FC layer
dilated layer (gradually build up temporal context)
pooling layer (calculates mean and standard deviation of the final frame layer features)
FC layer (one creates bottleneck)
ResNet-based r-vector
ResNet18 and ResNet34 architecture
The convolutional frame layer process features as a 2-dimensional signal
Proposed ECAPA-TDNN architecture
frame-level and pooling-level enhancements
3.1. Channel- and context-dependent statistics pooling
enhance **X-vector + self-attention** by extending temporal attention mechnism into the channel dimension (to focus on speaker characteritics, not on time instances)
W is a parameter that belongs to R*C, where C denotes Channels. ReLU() transforms this into channel-dependent self-attention score
Then scalar score "e" is normalized over all frames (channel-wise across time)
The normalized "e", which is "a", represents self-attention score of each frame given the channel (attention reduces parameter and risks of overfitting)
And calculate the weighted statistics (mean and standard deviation) of channel c by multiplying self-attention score
Weighted Statistics are concatenated to create final output of pooling layer
(now self-attention can look at global properties)
3.2. 1-Dimensional Squeeze-Excitation Res2Blocks
enhance **temporal context of x-vector 15 frames** by rescaling the frame-level features with SE blocks
(SE block may model global channel interdependencies)
Squeeze - calculate mean vector of z of the frame-level features across time domain
Excitation - ReLU and Sigmoid to create s, which makes a bottleneck layer C*R (channel and dimensionality) s has weights between 0 and 1, then s is applied to original input through channel-wise multiplication
Using SE-block after each dilated convolution is recommended
SE-Res2Block suggests "dense layer + dilated convolution + dense layer" with a context of 1 frame
1st dense layer reduce dimension, 2nd one restore the dimension
Then Res2Net module enhances the central convolutional layer to process multi-scale features by constructing hierarchical residual like connections within
3.3. Multi-layer feature aggregation and summation (MFA)
enhance **X-vector using last frame-layer as pooling** by concatenating all feature maps of SE-Res2Blocks
Or, using SE-Res2BLock outputs and initial convolutional layer as input for each frame layer block
(summation of the feature maps instead of concatenation to reduce parameter)
Introduction
bottleneck -> low dimensional speaker embedding
Consine Similarity or PLDA training compares 2 embeddings
Additive Angular margin optimization, ResNet aprroaches in frame-level layer, temporal self-attention are introduced in X-Vector
Statistics Pooling (projects variable length input into a fixed-length representation)
AAM (good at image detection) ResNet (enable the back-propagation faster and avoid vanishing gradient) self-attention (focus on important frames)
DNN speaker recognition systems (Baseline)
Extended-TDNN x-vector
The initial frame layer of X-vector has 1 dimensional dilated convolutional layer interleaved with dense layers
Residual Connection
frame layer -> attentive statistics pooling layer -> 2 FC layer
dilated layer (gradually build up temporal context)
pooling layer (calculates mean and standard deviation of the final frame layer features)
FC layer (one creates bottleneck)
ResNet-based r-vector
ResNet18 and ResNet34 architecture
The convolutional frame layer process features as a 2-dimensional signal
Proposed ECAPA-TDNN architecture
frame-level and pooling-level enhancements
3.1. Channel- and context-dependent statistics pooling
enhance **X-vector + self-attention** by extending temporal attention mechnism into the channel dimension (to focus on speaker characteritics, not on time instances)
W is a parameter that belongs to R*C, where C denotes Channels. ReLU() transforms this into channel-dependent self-attention score
Then scalar score "e" is normalized over all frames (channel-wise across time)
The normalized "e", which is "a", represents self-attention score of each frame given the channel (attention reduces parameter and risks of overfitting)
And calculate the weighted statistics (mean and standard deviation) of channel c by multiplying self-attention score
Weighted Statistics are concatenated to create final output of pooling layer
(now self-attention can look at global properties)
3.2. 1-Dimensional Squeeze-Excitation Res2Blocks
enhance **temporal context of x-vector 15 frames** by rescaling the frame-level features with SE blocks
(SE block may model global channel interdependencies)
Squeeze - calculate mean vector of z of the frame-level features across time domain
Excitation - ReLU and Sigmoid to create s, which makes a bottleneck layer C*R (channel and dimensionality) s has weights between 0 and 1, then s is applied to original input through channel-wise multiplication
Using SE-block after each dilated convolution is recommended
SE-Res2Block suggests "dense layer + dilated convolution + dense layer" with a context of 1 frame
1st dense layer reduce dimension, 2nd one restore the dimension
Then Res2Net module enhances the central convolutional layer to process multi-scale features by constructing hierarchical residual like connections within
3.3. Multi-layer feature aggregation and summation (MFA)
enhance **X-vector using last frame-layer as pooling** by concatenating all feature maps of SE-Res2Blocks
Or, using SE-Res2BLock outputs and initial convolutional layer as input for each frame layer block
(summation of the feature maps instead of concatenation to reduce parameter)
Consine Similarity or PLDA training compares 2 embeddings Additive Angular margin optimization, ResNet aprroaches in frame-level layer, temporal self-attention are introduced in X-Vector
Statistics Pooling (projects variable length input into a fixed-length representation)
AAM (good at image detection) ResNet (enable the back-propagation faster and avoid vanishing gradient) self-attention (focus on important frames)