Detection of replay spoof speech using global self-attentive Teager energy features

Ming CHEN, Xueqin CHEN

School of Electronic Information Engineering, Soochow University, Suzhou 215006

Corresponding author: Xueqin CHEN, chenxueqin@suda.edu.cn

doi:10.12395/0371-0025.2023106

PACS: 43.72, 43.60

Abstract: This paper proposes an energy-based front-end feature extraction method to counter the threat of replay attacks against automatic speaker verification systems. The method provides variable resolution over the entire frequency band so as to fully exploit the highly discriminative nonlinear information in the sub-band energies of replayed and genuine speech. First, a variety of recording and playback devices are analyzed statistically using the F-ratio method. Then, based on the statistical results, a bank of filters covering the whole frequency band is designed to capture highly discriminative energy information. Finally, the Teager energy operator is used to compute the energy of the sub-band filtered signals, and the global self-attentive Teager energy cepstral coefficients (GSTECC) are proposed. To verify the effectiveness of the method, a Gaussian mixture model is used as the classifier and a series of experiments is conducted on the ASVspoof 2017 V2 and ASVspoof 2021 PA databases. The results show that the proposed GSTECC feature detects replay attacks better than other advanced feature extraction methods.

Keywords: Speaker verification; Replay attack detection

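Figure 1 Replayed speech detection module

Figure 2 F-ratio pattern curves for 10 recording and playback devices: (a) F-ratio curves; (b) enlarged view of the 0 to 0.006 amplitude range

Figure 3 Global self-attentive weights

Figure 4 Global self-attentive filter bank based on the nonlinear global frequency scale: (a) nonlinear global frequency scale transformation; (b) global self-attentive filter bank

Figure 5 Extraction pipeline of the global self-attentive Teager energy cepstral coefficients

Figure 6 Equal error rate on the development set as the number of global self-attentive filters varies

Figure 7 DET curves of each feature

Figure 8 DET curves of each feature after CMVN

Figure 9 Individual equal error rates of different features under different recording devices

Figure 10 Individual equal error rates of different features under different playback devices

Figure 11 Equal error rates of different features under recording and playback devices of different threat levels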

Table 1 Details of recording and playback devices

Recording device ID	Recording device	Playback device ID	Playback device
R01	Zoom H6 handy recorder	P01	All-in-one PC speakers
R02	BQ Aquaris M5 smartphone	P05	Beyerdynamic DT 770 PRO headphones
R03	Low-quality headset	P06	Dell laptop internal speakers
R04	Nokia Lumia 635 smartphone	P07	Dynaudio BM5A speaker
R05	Røde NT2 microphone	P08	HP Laptop internal speakers
R06	Røde smartLav+ microphone	P09	VIFA M10MD-39-08 speaker
R07	Samsung Galaxy 7s smartphone	-	-

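Table 2 The ASVspoof 2017 V2 and ASVspoof 2021 PA databases (number of utterances)

Database	Subset	Genuine speech	Replayed speech
ASVspoof 2017 V2	Training set	1507	1507
	Development set	760	950
	Evaluation set	1298	12008
ASVspoof 2021 PA	Evaluation set	94068	627264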

Table 3 Equal error rate (%) of replay attack detection for different features

Feature	Evaluation set
CQCC	29.35
MFCC	35.75
IMFCC	31.68
LFCC	33.43
AFCC	24.22
TECC	25.26
GSTECC	20.20
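The tables report performance as equal error rate (EER), the operating point at which the false-acceptance rate (replayed speech accepted as genuine) equals the false-rejection rate (genuine speech rejected). A minimal sketch of the metric follows, assuming that a higher score means "more likely genuine". The function name and the toy scores are illustrative only, and evaluation toolkits may interpolate between thresholds rather than pick the closest observed one as done here.

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Approximate EER by sweeping every observed score as a threshold.

    A trial is accepted when its score is >= threshold. The function returns
    the mean of FAR and FRR at the threshold where the two are closest.
    """
    best_gap, eer = float("inf"), None
    for t in sorted(set(genuine_scores) | set(spoof_scores)):
        # False acceptance: replayed trials scoring at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection: genuine trials scoring below the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (hypothetical, not taken from the paper).
genuine = [2.1, 1.7, 0.9, 1.5]
spoof = [0.2, 1.0, -0.3, 0.4]
print(f"EER = {100 * equal_error_rate(genuine, spoof):.2f}%")
```

In this toy example one of four replayed trials outscores one of four genuine trials, so both error rates meet at 25%.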

Table 4 Equal error rate (%) of replay attack detection for different features after CMVN

Feature	Evaluation set
CQCC + CMVN	19.28
MFCC + CMVN	29.47
IMFCC + CMVN	18.54
LFCC + CMVN	18.14
AFCC + CMVN	17.67
TECC + CMVN	13.27
GSTECC + CMVN	11.14
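CMVN (cepstral mean and variance normalization) shifts each cepstral coefficient to zero mean and scales it to unit variance. The sketch below normalizes over the frames of a single utterance. Whether the paper normalizes per utterance or over some other window is not stated in this excerpt, so that choice, the function name, and the `eps` guard are assumptions made for illustration.

```python
from statistics import fmean, pstdev


def cmvn(frames, eps=1e-10):
    """Per-utterance cepstral mean and variance normalization.

    frames: list of feature vectors (one per frame), all of equal length.
    Returns a new list in which every coefficient has zero mean and unit
    (population) variance across the frames of the utterance.
    """
    columns = list(zip(*frames))                 # one tuple per coefficient
    means = [fmean(c) for c in columns]
    stds = [pstdev(c) or eps for c in columns]   # eps guards constant columns
    return [[(v - m) / s for v, m, s in zip(frame, means, stds)]
            for frame in frames]


# Three frames of a hypothetical 2-dimensional feature.
normalized = cmvn([[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]])
for frame in normalized:
    print(frame)
```

The second coefficient is constant across frames, so it maps to zero rather than triggering a division by zero.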

Table 5 EER (%) of replay attack detection for different features

Feature	Evaluation set
CQCC	38.07
MFCC	46.07
IMFCC	42.54
LFCC	46.97
GSTECC	36.24

References

[1]	Evans N W, Kinnunen T, Yamagishi J. Spoofing and countermeasures for automatic speaker verification. Interspeech, Lyon, France, 2013: 925−929
[2]	Alegre F, Janicki A, Evans N. Re-assessing the threat of replay spoofing attacks against automatic speaker verification. International Conference of the Biometrics Special Interest Group, IEEE, Darmstadt, Germany, 2014: 1−6
[3]	Tapkir P A, Kamble M R, Patil H A, et al. Replay spoof detection using power function based features. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, IEEE, Honolulu, HI, USA, 2018: 1019−1029
[4]	Kamble M R, Tak H, Patil H A. Amplitude and frequency modulation-based features for detection of replay spoof speech. Speech Commun., 2020; 125: 114−127 doi: 10.1016/j.specom.2020.10.003
[5]	Kamble M R, Patil H A. Detection of replay spoof speech using teager energy feature cues. Comput. Speech Lang., 2021; 65: 101140 doi: 10.1016/j.csl.2020.101140
[6]	Therattil A, Gupta P, Chodingala P K, et al. Teager energy based-detection of one-point and two-point replay attacks: Towards cross-database generalization. The Speaker and Language Recognition Workshop (Odyssey 2022), Beijing, China, 2022: 47−54
[7]	Patil A T, Acharya R, Patil H A, et al. Improving the potential of enhanced Teager energy cepstral coefficients (ETECC) for replay attack detection. Comput. Speech Lang., 2022; 72: 101281 doi: 10.1016/j.csl.2021.101281
[8]	Todisco M, Delgado H, Evans N. Constant Q cepstral coefficients: A spoofing countermeasure for automatic speaker verification. Comput. Speech Lang., 2017; 45: 516−535 doi: 10.1016/j.csl.2017.01.001
[9]	Alluri K R, Achanta S, Kadiri S R, et al. SFF anti-spoofer: IIIT-H submission for automatic speaker verification spoofing and countermeasures challenge 2017. Interspeech, Stockholm, Sweden, 2017: 107−111
[10]	Tang S, Zhang E H, Tang Z M. Playback speech detection algorithm based on wavelet packet (in Chinese). Computer & Digital Engineering, 2022; 50(2): 238−242 doi: 10.3969/j.issn.1672-9722.2022.02.003
[11]	Font R, Espín J M, Cano M J. Experimental analysis of features for replay attack detection-results on the ASVspoof 2017 Challenge. Interspeech, Stockholm, Sweden, 2017: 7−11
[12]	Li L, Chen Y, Wang D, et al. A study on replay attack and anti-spoofing for automatic speaker verification. Interspeech, Stockholm, Sweden, 2017: 92−96
[13]	Liu M, Wang L, Dang J, et al. Replay attack detection using magnitude and phase information with attention-based adaptive filters. IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, UK, 2019: 6201−6205
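[14]	Chen S L, Zhang X S, Zhang P Y, et al. Audio fingerprint retrieval algorithm with silence masking and frequency-domain segmentation (in Chinese). Acta Acustica, 2022; 47(4): 531−540 doi: 10.15949/j.cnki.0371-0025.2022.04.011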
[15]	Liu M, Wang L, Dang J, et al. Replay attack detection using variable-frequency resolution phase and magnitude features. Comput. Speech Lang., 2021; 66: 101161 doi: 10.1016/j.csl.2020.101161
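[16]	Guo X C, Yu Y B. Robust speaker recognition with spoofing attack detection (in Chinese). Computer Science, 2022; 49(S1): 531−536 doi: 10.11896/jsjkx.210500147
[17]	Yu Y B, Yuan D M, Xue F. A nonlinear frequency scale transformation suited to speaker recognition (in Chinese). Acta Acustica, 2008; 33(5): 450−455 doi: 10.15949/j.cnki.0371-0025.2008.05.014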
[18]	Xu L, Yang J, You C H, et al. Device features based on linear transformation with parallel training data for replay speech detection. IEEE/ACM Trans. Audio Speech Lang. Process., 2023; 31: 1574−1586 doi: 10.1109/taslp.2023.3267610
[19]	Jiang T, Han J Q, Zheng T R. Speaker recognition method based on Gaussian mixture model shift-factor compensation (in Chinese). Acta Acustica, 2011; 36(6): 658−664 doi: 10.15949/j.cnki.0371-0025.2011.06.009
[20]	Delgado H, Todisco M, Sahidullah M, et al. ASVspoof 2017 Version 2.0: Meta-data analysis and baseline enhancements. The Speaker and Language Recognition Workshop (Odyssey 2018), Les Sables d'Olonne, France, 2018: 296−303


Introduction

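Automatic speaker verification (ASV) systems verify a speaker's identity from his or her voice and are an important biometric recognition technology. In practice, however, ASV systems are exposed to spoofing attacks, which fall broadly into four types: replay, impersonation, voice conversion, and speech synthesis. Of these, replay is the most common, the easiest to mount, and the most threatening. The attacker only needs a simple recording device (such as a mobile phone or a voice recorder) to capture the target speaker's voice and a playback device to replay it; no specialist technical knowledge is required [1-2]. As high-fidelity devices become more portable and widespread, this attack poses a serious threat to the security of ASV systems, so developing countermeasures that detect replayed spoof speech is essential. In recent years many front-end features have been proposed for replay detection, with some success.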

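Capturing the nonlinear characteristics of replayed speech is one line of research for improving detection performance. Tapkir et al. exploited the nonlinearity of power functions to propose power normalized cepstral coefficients (PNCC) and Q-log normalized cepstral coefficients (QLNCC) [3]. Kamble et al. proposed a series of features based on the nonlinear Teager energy operator, including Teager energy cepstral coefficients (TECC) and enhanced Teager energy cepstral coefficients (ETECC). These features perform well at capturing reverberation and suppressing noise [4-7], but they do not focus on the importance of resolution in the relevant frequency bands.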
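For readers unfamiliar with the Teager energy operator, the sketch below uses its standard discrete form, psi[n] = x[n]^2 - x[n-1]*x[n+1]. The exact formulation used in the paper is not shown in this excerpt, so the standard form is an assumption, and the tone parameters are illustrative. For a pure tone A*cos(omega*n) the operator returns the constant A^2*sin^2(omega), which shows that it tracks both amplitude and frequency rather than amplitude alone.

```python
import math


def teager_energy(x):
    """Discrete Teager energy operator: psi[n] = x[n]^2 - x[n-1] * x[n+1].

    Returns len(x) - 2 values, one for each interior sample of x.
    """
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


# Illustrative input: a pure tone with assumed amplitude and digital frequency.
amplitude, omega = 0.5, 0.3          # omega in radians per sample
tone = [amplitude * math.cos(omega * n) for n in range(64)]

psi = teager_energy(tone)
expected = (amplitude * math.sin(omega)) ** 2
print(max(abs(v - expected) for v in psi))   # deviation from A^2 * sin^2(omega)
```

In a TECC-style front end the operator would be applied to each sub-band filtered signal before frame averaging, log compression, and a cepstral transform; those later stages are not sketched here.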

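Adjusting the resolution at targeted locations to capture the differences between genuine and replayed speech is another direction in feature extraction. Constant Q cepstral coefficients (CQCC), the baseline feature of the ASVspoof 2017 challenge, capture the difference information between genuine and replayed speech more accurately [8]. The feature is based on the constant Q transform, which uses higher frequency resolution at low frequencies and higher time resolution at high frequencies. References [9-10] further confirmed that high resolution in key frequency bands benefits replay detection. Reference [11] compared the replay detection performance of Mel-frequency cepstral coefficients (MFCC), linear frequency cepstral coefficients (LFCC), inverse Mel-frequency cepstral coefficients (IMFCC), and other features; the results show that features designed on different frequency scales yield markedly different detection performance.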
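To make the notion of frequency scale concrete, the sketch below compares filter center frequencies on a linear, a Mel, and an inverse-Mel scale. It assumes the common Mel formula mel = 2595*log10(1 + f/700) and builds the inverse-Mel layout by mirroring the Mel centers about the band; the filter count and the 8 kHz upper edge are illustrative values, not the configurations used in the cited work.

```python
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def center_frequencies(num_filters, f_max, scale):
    """Center frequencies (Hz) of num_filters filters spanning 0..f_max."""
    steps = [(i + 1) / (num_filters + 1) for i in range(num_filters)]
    if scale == "linear":                      # uniform resolution (LFCC-like)
        return [s * f_max for s in steps]
    mel = [mel_to_hz(s * hz_to_mel(f_max)) for s in steps]
    if scale == "mel":                         # dense at low frequencies (MFCC-like)
        return mel
    if scale == "inverse-mel":                 # dense at high frequencies (IMFCC-like)
        return [f_max - f for f in reversed(mel)]
    raise ValueError(f"unknown scale: {scale}")


for name in ("linear", "mel", "inverse-mel"):
    centers = center_frequencies(8, 8000.0, name)
    print(name, [round(c) for c in centers])
```

Each layout fixes in advance where the resolution is concentrated, which is the limitation the data-driven weighting proposed in this paper is meant to address.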

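The feature extraction methods above, which rely on nonlinearity and on raising the resolution in key frequency bands, work well for replay detection. However, their multi-resolution capability is mostly organized around a coarse split into low and high frequency bands and lacks a rigorous theoretical basis. This paper uses the F-ratio to analyze in depth the spectral relationship between replayed and genuine speech under several recording and playback device conditions. On this basis it proposes global self-attentive weights, which represent the discriminative weight of each frequency point across the full band and thus provide a theoretical basis for nonlinear, multi-band feature design. A global self-attentive filter bank is designed according to these weights to extract information from the bands that best discriminate genuine from replayed speech, while the Teager energy operator is used to capture the nonlinear distortion of replayed speech. The result is a feature that effectively detects replayed spoof speech: the global self-attentive Teager energy cepstral coefficients (GSTECC).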
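The F-ratio mentioned above measures how well a single scalar feature separates classes: the variance of the class means divided by the average within-class variance. The sketch below uses this common textbook definition. The paper's exact formula and the way it groups recordings into classes are not given in this excerpt, so both are assumptions, and the band energies are made-up numbers.

```python
from statistics import fmean


def f_ratio(classes):
    """F-ratio of one scalar feature, e.g. the log energy in one frequency band.

    classes: list of lists, one list of observations per class.
    Returns (variance of class means) / (mean within-class variance).
    """
    means = [fmean(c) for c in classes]
    grand_mean = fmean(means)
    between = fmean((m - grand_mean) ** 2 for m in means)
    within = fmean(fmean((v - m) ** 2 for v in c) for c, m in zip(classes, means))
    return between / within


# Hypothetical log energies in two bands for genuine vs replayed utterances.
band_a = [[1.0, 1.2, 0.8], [3.0, 3.2, 2.8]]   # classes well separated
band_b = [[1.0, 3.0, 2.0], [2.0, 1.0, 3.0]]   # class means coincide
print(f_ratio(band_a), f_ratio(band_b))
```

Evaluating such a ratio at every frequency point gives a curve whose peaks mark the bands where genuine and replayed speech differ most, which is the kind of evidence a frequency weighting can be built on.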

4. Conclusion

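This paper designs a global self-attentive filter bank to select sub-band signals with strong discriminative power and applies the Teager energy operator to capture the nonlinear energy of these sub-band signals, yielding the GSTECC feature. Experiments on the ASVspoof 2017 V2 database show that GSTECC outperforms other common features. The features were also evaluated under the different recording and playback configurations in the evaluation set; compared with the other features, GSTECC achieves a relatively low EER under the different threat conditions. In addition, a generalization test on the ASVspoof 2021 PA database shows that GSTECC again delivers better detection performance.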

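Contents

1.1. Formation of the replayed speech signal

1.2. Analysis of the influence of different recording and playback devices using the F-ratio

1.3. Global self-attentive weights

2.1. Global self-attentive filter bank

2.1.1. Nonlinear global frequency scale transformation

2.1.2. Design of the global self-attentive filter bank

2.2. Feature extraction based on the Teager energy operator

3.1. Databases

3.2. Classification model

3.3. Experimental results on the ASVspoof 2017 V2 database

3.3.1. Choice of the number of filters

3.3.2. Performance before and after feature normalization

3.3.3. Performance under various recording and playback devices

3.4. Experimental results on the ASVspoof 2021 PA database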
