Noisy distractions aren’t much fun when you’re on a video call. Skype’s new noise suppression feature can knock them out!

22/02/2021 | Skype Blogs | Noise cancellation

Whether it’s people talking, kids playing, the construction workers next door, or your pet dog scaring that imaginary monster away from the front porch, background noise can be really distracting when you’re trying to talk on Skype.

Well, no more!

We are pleased to announce the release of our latest background noise suppression feature in the Skype desktop app.* Originally developed for Microsoft Teams, this new feature is designed to silence just about everything except for your voice when you’re meeting on Skype.

To see it in action, check out the video below.

To activate this handy new feature in your Skype desktop app, see this article in our support section.

The Technology Behind the Feature

For those who want to understand the tech behind the feature: it works by analysing your audio feed and then using specially trained deep neural networks to filter out the noise without affecting the speaker’s voice. While traditional noise suppression algorithms can address simple, consistent noises like that of a fan, these neural networks can learn the difference between speech and more complex, inconsistent noises such as typing on a keyboard, the crunch of a food wrapper, and, yes, that howling dog.
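To illustrate why a fixed noise estimate handles a fan but not a keyboard click, here is a minimal sketch of a traditional, stationary-noise approach. It is not the Skype implementation: the function names, the frame length, and the Wiener-style gain rule are all assumptions chosen for illustration. The idea is to estimate a constant noise floor from a few noise-only frames and then attenuate each frame according to how far its power rises above that floor.

```python
def frame_powers(samples, frame_len):
    """Mean power of each non-overlapping frame of `frame_len` samples."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def stationary_gains(powers, noise_frames):
    """Wiener-style gain per frame, using a fixed noise-floor estimate.

    The noise floor is the average power of the first `noise_frames` frames,
    which are assumed (hypothetically) to contain noise only.
    """
    if not 0 < noise_frames <= len(powers):
        raise ValueError("noise_frames must be between 1 and len(powers)")
    noise_floor = sum(powers[:noise_frames]) / noise_frames
    return [max(0.0, 1.0 - noise_floor / p) if p > 0 else 0.0 for p in powers]


if __name__ == "__main__":
    fan = [0.1, -0.1, 0.1, -0.1]      # steady, low-level hum
    click = [0.9, -0.9, 0.9, -0.9]    # short, loud keyboard-like burst
    signal = fan + fan + fan + click
    gains = stationary_gains(frame_powers(signal, 4), noise_frames=2)
    for index, gain in enumerate(gains):
        print(f"frame {index}: gain {gain:.3f}")
```

In this toy signal the steady fan frames are attenuated almost completely, while the click frame keeps a gain close to 1 because its power is far above the fixed noise floor. A rule like this cannot tell a loud burst of noise from speech, which is the gap the trained neural networks described above are meant to close.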

This technology relies on machine learning (ML) to learn the difference between clean speech and noise and is frequently referred to as artificial intelligence (AI). A representative dataset is used to train the ML model to work in most of the situations our Skype users experience. There needs to be enough diversity in the dataset in terms of the clean speech, noise types, and the environments from which our users are joining online calls.

To achieve this diversity, the dataset we use contains approximately 760 hours of clean speech data and 180 hours of noise data. To comply with Microsoft’s strict privacy standards, no customer data was collected for this dataset. Instead, we used either publicly available data or crowdsourcing to collect specific scenarios. For clean speech, we used a balance of female and male speech. We also collected data from 10+ languages, including tonal languages, to help ensure the model does not change the meaning of a sentence by distorting the tone of the words. For noise data, we included 150 noise types to cover the diverse scenarios our users may run into, from keyboard typing to running water to snoring, and more. We also included emotions in our clean speech so that expressions like laughter or crying will not be suppressed. The characteristics of the environment from which our users join a Skype meeting have a strong impact on the speech signal as well. To capture that diversity, we trained our model with data from more than 3,000 real room environments and more than 115,000 synthetically created rooms.
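The post does not say how clean speech, noise, and room characteristics are combined into training examples. A common approach, sketched here purely as an assumption, is to convolve clean speech with a room impulse response and then add noise scaled to a chosen signal-to-noise ratio (SNR). The function names `apply_room` and `mix_at_snr`, and the idea of representing a room as an impulse response, are hypothetical choices for this illustration.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def apply_room(clean, impulse_response):
    """Simulate a room by direct convolution with its impulse response."""
    out = [0.0] * (len(clean) + len(impulse_response) - 1)
    for i, sample in enumerate(clean):
        for j, tap in enumerate(impulse_response):
            out[i + j] += sample * tap
    return out


def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean` so the speech-to-noise ratio equals `snr_db`."""
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    clean_rms, noise_rms = rms(clean), rms(noise)
    if clean_rms == 0 or noise_rms == 0:
        raise ValueError("clean and noise must both be non-silent")
    # An amplitude ratio of 10 ** (snr_db / 20) corresponds to snr_db decibels.
    scale = clean_rms / (10 ** (snr_db / 20)) / noise_rms
    return [c + scale * n for c, n in zip(clean, noise)]


if __name__ == "__main__":
    speech = [0.5, -0.5, 0.5, -0.5]
    typing = [0.2, 0.2, -0.2, -0.2]
    echoey = apply_room(speech, [1.0, 0.5])[:len(speech)]  # one early reflection
    noisy = mix_at_snr(echoey, typing, snr_db=10)
    print(noisy)
```

A pipeline of this kind would pair each noisy mixture with its clean source, so the model has a target to learn from. Varying the speech clip, noise type, room response, and SNR is one way to get the diversity the paragraph above describes.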

Since we use deep learning, it is important to have a powerful model training infrastructure. We use Microsoft Azure to allow our team to develop improved versions of our ML model. Another challenge is that extracting the original clean speech from the noise needs to be done in a way that the human ear perceives as natural and pleasant. Since there are no objective metrics that are highly correlated with human perception, we developed a framework that allowed us to send processed audio samples to crowdsourcing vendors, where human listeners rated their audio quality on a one- to five-star scale to produce mean opinion scores (MOS). With these human ratings we were able to develop a new perceptual metric that, together with the subjective human ratings, allowed us to make fast progress on improving the quality of our deep learning models.
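As a rough illustration of how one- to five-star ratings become a mean opinion score, the sketch below averages the ratings for a single processed clip and adds an approximate 95% confidence interval. The function name, the sample ratings, and the normal-approximation interval are assumptions for this example. They are not details of the crowdsourcing framework itself.

```python
import math
from statistics import mean, stdev


def mos_with_interval(ratings, z=1.96):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    `ratings` is a list of listener scores on a 1-to-5 scale for one clip.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must be between 1 and 5")
    mos = mean(ratings)
    if len(ratings) < 2:
        return mos, 0.0  # spread cannot be estimated from a single rating
    half_width = z * stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


if __name__ == "__main__":
    listener_ratings = [4, 5, 3, 4, 4]  # made-up scores for one clip
    mos, half_width = mos_with_interval(listener_ratings)
    print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

Comparing scores like these across model versions, with enough listeners per clip to keep the interval narrow, is one way to tell whether a change actually sounds better to people.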

To advance the research in this field, we open-sourced our dataset and the perceptual quality crowdsourcing framework. These have been the basis of three competitions we hosted as part of the Interspeech 2020, Interspeech 2021 and ICASSP 2021 conferences, as outlined here: Deep Noise Suppression Challenge – INTERSPEECH 2021 - Microsoft Research

Finally, we created our deep learning model to run efficiently on the Skype desktop app in real time. By optimizing for human perception, we were able to achieve a good trade-off between quality and complexity, which means that most Windows devices our users are using can take advantage of our AI-based noise suppression. This feature is also available on Mac, and our team is working on bringing it to mobile platforms too.

*The noise suppression feature is not presently available in the web or mobile versions of Skype.

Note: The new features are rolling out gradually over the next couple of days so if you don’t see them right away, just check again soon!
