
dc.contributor.author: Sholokhov, Alexey
dc.contributor.author: Kinnunen, Tomi
dc.contributor.author: Vestman, Ville
dc.contributor.author: Aik Lee, Kong
dc.date.accessioned: 2019-12-12T13:40:23Z
dc.date.available: 2019-12-12T13:40:23Z
dc.date.issued: 2019
dc.identifier.uri: https://erepo.uef.fi/handle/123456789/7869
dc.description.abstract: How secure is automatic speaker verification (ASV) technology? More concretely, given a specific target speaker, how likely is it that another person can be found who is falsely accepted as that target? This question may be addressed empirically by studying naturally confusable pairs of speakers within a large enough corpus. To this end, one might expect to find at least some speaker pairs that are indistinguishable from each other in terms of ASV. To a certain extent, such an aim is mirrored in the standardized ASV evaluation benchmarks, for instance the series of speaker recognition evaluations (SRE) organized by the National Institute of Standards and Technology (NIST). Nonetheless, the number of speakers in such evaluation benchmarks arguably represents only a small fraction of all possible human voices, making it challenging to extrapolate performance beyond a given corpus. Furthermore, the impostors used in performance evaluation are usually selected randomly. A potentially more meaningful definition of an impostor, at least in the context of security-driven ASV applications, would be the closest (most confusable) other speaker to a given target. We put forward a novel performance assessment framework to address both the inadequacy of the random-impostor evaluation model and the size limitation of evaluation corpora by addressing ASV security against closest impostors on arbitrarily large datasets. The framework allows one to predict the safety of a given ASV technology, in its current state, for an arbitrarily large speaker database consisting of virtual (sampled) speakers. As a proof of concept, we analyze the performance of two state-of-the-art ASV systems, based on i-vector and x-vector speaker embeddings (as implemented in the popular Kaldi toolkit), on the recent VoxCeleb 1 and 2 corpora, containing a total of 7365 speakers. We fix the number of target speakers to 1000 and generate up to N = 100,000 virtual impostors sampled from the generative model. The model-based false alarm rates are in reasonable agreement with the empirical false alarm rates and, as predicted, increase substantially (values up to 98%) with N = 100,000 impostors. Neither the i-vector nor the x-vector system is immune to an increased false alarm rate at increased impostor database size, as predicted by the model.
dc.language.iso: English
dc.publisher: Elsevier BV
dc.relation.ispartofseries: Computer speech and language
dc.relation.uri: http://dx.doi.org/10.1016/j.csl.2019.101024
dc.rights: CC BY-NC-ND 4.0
dc.subject: speaker verification
dc.subject: population size
dc.subject: security
dc.subject: false alarm rate
dc.subject: random impostor
dc.subject: closest impostor
dc.subject: Bayesian score modeling
dc.subject: VoxCeleb
dc.title: Voice biometrics security: Extrapolating false alarm rate via hierarchical Bayesian modeling of speaker verification scores
dc.description.version: final draft
dc.contributor.department: School of Computing, activities
uef.solecris.id: 66486541 (en)
dc.type.publication: Scientific journal articles
dc.relation.doi: 10.1016/j.csl.2019.101024
dc.description.reviewstatus: peerReviewed
dc.relation.articlenumber: 101024
dc.relation.issn: 0885-2308
dc.relation.volume: 60
dc.rights.accesslevel: openAccess
dc.type.okm: A1
uef.solecris.openaccess: No
dc.rights.copyright: © Elsevier Ltd.
dc.type.displayType: article (en)
dc.type.displayType: article (fi)
dc.rights.url: https://creativecommons.org/licenses/by-nc-nd/4.0/

