To solve the task of surgical mask detection from audio recordings
in the scope of Interspeech’s ComParE challenge, we introduce
a phonetic recognizer that is able to differentiate between clear
and mask speech samples.
A deep recurrent phoneme recognition model is first trained on
spectrograms from a German corpus to learn the spectral properties
of different speech sounds. Under the assumption that each phoneme
sounds different in clear and mask speech, the model is then used
to compute frame-wise phonetic labels for the challenge data, including
information about the presence of a surgical mask. These labels then serve
to train a second phoneme recognition model, which is finally able to
differentiate between mask and clear phoneme productions.
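The abstract does not spell out the recognizer architecture; a minimal
sketch of a frame-wise recurrent recognizer, assuming a bidirectional
LSTM over mel-spectrogram frames and an illustrative inventory of 45
phonemes (both are assumptions, not the paper's exact setup), could
look as follows:

```python
import torch
import torch.nn as nn

class FramewisePhonemeRecognizer(nn.Module):
    """BiLSTM mapping a spectrogram to per-frame phoneme posteriors."""

    def __init__(self, n_mels=40, hidden=128, n_phonemes=45):
        super().__init__()
        self.rnn = nn.LSTM(n_mels, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_phonemes)

    def forward(self, spec):        # spec: (batch, frames, n_mels)
        h, _ = self.rnn(spec)       # (batch, frames, 2 * hidden)
        return self.out(h)          # per-frame logits over phonemes

# Frame-wise phonetic labels: the argmax phoneme per frame, which can
# then be combined with each recording's mask/clear condition.
model = FramewisePhonemeRecognizer()
spec = torch.randn(1, 200, 40)      # one utterance, 200 frames (dummy)
labels = model(spec).argmax(dim=-1) # (1, 200) phoneme indices
```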
For a single utterance, we compute a functional representation and
train a random forest classifier to detect whether a speech sample
was produced with or without a mask.
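As a hedged illustration of this last stage, the sketch below collapses
frame-wise posteriors (assumed here to cover 45 phonemes in two
conditions, i.e. 90 classes) into simple mean and standard deviation
functionals and fits a scikit-learn random forest on synthetic data;
the paper's concrete functional set may differ:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def functionals(frame_posteriors):
    """Collapse (frames, classes) posteriors into one fixed-length
    vector of statistics; the concrete functional set is illustrative."""
    return np.concatenate([frame_posteriors.mean(axis=0),
                           frame_posteriors.std(axis=0)])

# X: one functional vector per utterance; y: 1 = mask, 0 = clear.
rng = np.random.default_rng(0)
X = np.stack([functionals(rng.random((200, 90))) for _ in range(32)])
y = rng.integers(0, 2, size=32)     # dummy labels for illustration

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:4]))           # per-utterance mask/clear decision
```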
Our method performed better than the baseline methods on both the
validation and the test set. Furthermore, we were able to show how
wearing a mask influences the speech signal. Certain phoneme groups
were clearly affected by the obstruction in front of the vocal tract,
while others remained almost unaffected.