Audio Module

Additional dependencies

To use the soundevent.audio module you need to install some additional dependencies. Make sure you have them installed by running the following command:

pip install soundevent[audio]

soundevent.audio

Soundevent functions for handling audio files and arrays.

Classes

MediaInfo(samplerate_hz, duration_s, samples, channels, format, subtype) dataclass

MediaInfo Class.

Encapsulates essential metadata about audio data for processing and analysis. The information stored in this dataclass can typically be automatically extracted from the audio file itself.

Attributes

channels: int instance-attribute

The number of audio channels present in the data.

For example 1 for mono, 2 for stereo.

duration_s: float instance-attribute

The total duration of the audio, measured in seconds (s).

This represents the length of time the audio recording spans.

format: str instance-attribute

A code representing the audio file format.

For example "WAV", "MP3".

samplerate_hz: int instance-attribute

The sampling rate of the audio, measured in Hertz (Hz).

This indicates the number of samples taken per second to represent the analog audio signal.

samples: int instance-attribute

The total number of samples in the audio data.

subtype: str instance-attribute

A more specific subtype of the audio format.

For example, "PCM_16" or "A_LAW". These subtypes provide additional information about the audio data, such as the bit depth for PCM-encoded audio or the encoding algorithm for compressed audio formats.
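The attributes above are linked by simple arithmetic: the duration is the sample count divided by the sampling rate. The following is a minimal sketch of that relationship. `MediaInfoSketch` and `expected_duration_s` are hypothetical stand-ins, not part of soundevent, and the sketch assumes `samples` counts frames per channel.

```python
from dataclasses import dataclass


@dataclass
class MediaInfoSketch:
    """Hypothetical stand-in that mirrors the documented MediaInfo fields."""

    samplerate_hz: int
    duration_s: float
    samples: int
    channels: int
    format: str
    subtype: str


def expected_duration_s(samples: int, samplerate_hz: int) -> float:
    """Duration in seconds implied by a sample count and a sampling rate."""
    return samples / samplerate_hz


# Two seconds of stereo, 16-bit PCM audio sampled at 44.1 kHz.
info = MediaInfoSketch(
    samplerate_hz=44_100,
    duration_s=expected_duration_s(88_200, 44_100),
    samples=88_200,
    channels=2,
    format="WAV",
    subtype="PCM_16",
)
```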

Functions

compute_md5_checksum(path)

Compute the MD5 checksum of a file.

Parameters:

Name Type Description Default
path PathLike

Path to the file.

required

Returns:

Type Description
str

MD5 checksum of the file.
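For illustration, a file checksum of this kind can be computed with the standard-library `hashlib` module. This is a sketch under assumptions, not the soundevent implementation: `md5_of_file` and its `chunk_size` parameter are hypothetical names, and reading in chunks is a design choice that keeps memory use flat for long recordings.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=65536):
    """Hypothetical helper: hash a file in fixed-size chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The result is a 32-character hexadecimal string, which is convenient for detecting duplicated or modified recordings.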

compute_spectrogram(audio, window_size, hop_size, window_type='hann', detrend=False, padded=True, boundary='zeros')

Compute the spectrogram of a signal.

This function calculates the short-time Fourier transform (STFT), which decomposes a signal into overlapping windows and computes the Fourier transform of each window.

Parameters:

Name Type Description Default
audio DataArray

The audio signal.

required
window_size float

The duration of the STFT window in seconds.

required
hop_size float

The duration of the STFT hop (in seconds). This determines the time step between consecutive STFT frames.

required
window_type str

The type of window to use. Refer to scipy.signal.get_window for supported types.

'hann'
detrend Union[str, Callable, Literal[False]]

Specifies how to detrend each STFT window. See scipy.signal.stft for options. Default is False (no detrending).

False
padded bool

Indicates whether the input signal is zero-padded at the end so that it fits an integer number of window segments before performing the STFT. See scipy.signal.stft. Default is True.

True
boundary Optional[Literal['zeros', 'odd', 'even', 'constant']]

Specifies the boundary extension mode for padding the signal to perform the STFT. See scipy.signal.stft. Default is "zeros".

'zeros'

Returns:

Name Type Description
spectrogram DataArray

The spectrogram of the audio signal. This is a three-dimensional xarray data array with the dimensions frequency, time, and channel.

Notes

Time bin calculation:

  • The time axis of the spectrogram represents the center of each STFT window.
  • The first time bin is centered at time t = hop_size / 2.
  • Subsequent time bins are spaced by hop_size.
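Because `window_size` and `hop_size` are given in seconds, it can help to work out what they mean in samples and where the time bins fall. The sketch below only does that arithmetic. `stft_layout` is a hypothetical helper, rounding to the nearest sample is an assumption, and the frequency-bin count assumes a one-sided STFT of a real signal, as scipy.signal.stft produces by default.

```python
def stft_layout(samplerate, window_size, hop_size, num_frames):
    """Hypothetical helper: sizes in samples and time-bin centers in seconds."""
    # Assumption: durations are rounded to the nearest whole sample.
    window_samples = int(round(window_size * samplerate))
    hop_samples = int(round(hop_size * samplerate))
    # One-sided spectrum of a real signal: window_samples // 2 + 1 bins.
    freq_bins = window_samples // 2 + 1
    # Per the notes above: first bin at hop_size / 2, then steps of hop_size.
    times = [hop_size / 2 + i * hop_size for i in range(num_frames)]
    return window_samples, hop_samples, freq_bins, times


window_samples, hop_samples, freq_bins, times = stft_layout(
    samplerate=16_000, window_size=0.032, hop_size=0.016, num_frames=3
)
```

With a 16 kHz signal, a 32 ms window spans 512 samples and a 16 ms hop spans 256 samples, so consecutive windows overlap by half.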

filter(audio, low_freq=None, high_freq=None, order=5, dim=Dimensions.time.value)

Filter audio data.

This function assumes that the input audio object is an xarray.DataArray with a "samplerate" attribute and a "time" dimension.

The filtering is done using a Butterworth filter of the specified order. The type of filter (low-pass, high-pass, or band-pass) is determined by the specified cutoff frequencies. If only one cutoff frequency is specified, a low-pass or high-pass filter is used. If both cutoff frequencies are specified, a band-pass filter is used.

Parameters:

Name Type Description Default
audio DataArray

The audio data to filter with a "samplerate" attribute and a "time" dimension.

required
low_freq float

The low cutoff frequency in Hz.

None
high_freq float

The high cutoff frequency in Hz.

None
order int

The order of the filter. By default, 5.

5
dim str

The dimension along which to filter the audio data. By default, "time".

Dimensions.time.value

Returns:

Type Description
DataArray

The filtered audio data.

Raises:

Type Description
ValueError

If neither low_freq nor high_freq is specified, or if both are specified and low_freq > high_freq.
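The way the cutoff arguments determine the filter type can be sketched as a small decision function. This is not the library's code. `select_filter` is a hypothetical helper, and it assumes that giving only `high_freq` means a low-pass filter (keep everything below it) and giving only `low_freq` means a high-pass filter. The returned names match the `btype` values accepted by scipy.signal.butter.

```python
def select_filter(low_freq=None, high_freq=None):
    """Hypothetical helper: return (btype, cutoff) for the given cutoffs."""
    if low_freq is None and high_freq is None:
        raise ValueError("Specify at least one of low_freq or high_freq.")
    if low_freq is None:
        # Assumption: only an upper cutoff means "keep what is below it".
        return "lowpass", high_freq
    if high_freq is None:
        # Assumption: only a lower cutoff means "keep what is above it".
        return "highpass", low_freq
    if low_freq > high_freq:
        raise ValueError("low_freq must not exceed high_freq.")
    return "bandpass", (low_freq, high_freq)
```

For example, `select_filter(low_freq=200, high_freq=1000)` selects a band-pass design between 200 Hz and 1000 Hz, while swapping the two values raises a ValueError.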

get_audio_files(path, strict=False, recursive=True, follow_symlinks=False)

Return a generator of audio files in a directory.

Parameters:

Name Type Description Default
path PathLike

Path to the directory.

required
strict bool

Whether to check the file contents to ensure it is an audio file. This takes a bit longer to run. By default, False.

False
recursive bool

Whether to search the directory recursively, by default True. This means that all audio files in subdirectories will be included. If False, only the audio files at the top level of the directory will be included.

True
follow_symlinks bool

Whether to follow symbolic links, by default False. Care should be taken when following symbolic links to avoid infinite loops.

False

Yields:

Type Description
Path

Path to the audio file.

Raises:

Type Description
ValueError

If the path is not a directory.

Notes

This function uses the is_audio_file function to check if a file is an audio file. See the documentation for is_audio_file for more information on which audio file formats are supported.

Examples:

>>> from soundevent.audio.files import get_audio_files

Get all audio files in a directory recursively:

>>> for file in get_audio_files("path/to/directory"):
...     print(file)

Get all audio files in a directory without recursion:

>>> for file in get_audio_files(
...     "path/to/directory", recursive=False
... ):
...     print(file)
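To make the effect of `recursive` concrete, here is a rough standard-library sketch of a directory walk that filters by file extension. It is not how soundevent is implemented. `iter_audio_paths` and `AUDIO_SUFFIXES` are hypothetical names, the suffix set is only an illustrative subset, and the sketch ignores the `strict` and `follow_symlinks` options.

```python
from pathlib import Path

# Illustrative subset only; see is_audio_file for the documented formats.
AUDIO_SUFFIXES = {".wav", ".flac", ".mp3", ".ogg"}


def iter_audio_paths(directory, recursive=True):
    """Hypothetical helper: yield files whose suffix looks like audio."""
    root = Path(directory)
    if not root.is_dir():
        raise ValueError(f"{root} is not a directory")
    # "**/*" descends into subdirectories, "*" stays at the top level.
    pattern = "**/*" if recursive else "*"
    for candidate in sorted(root.glob(pattern)):
        if candidate.is_file() and candidate.suffix.lower() in AUDIO_SUFFIXES:
            yield candidate
```

Because this sketch is a generator function, the directory check only runs once iteration starts, for example when the result is passed to `list()`.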

get_media_info(path)

Return the media information from the WAV file.

The information extracted from the WAV file is the audio format, the bit depth, the sample rate, the duration, the number of samples, and the number of channels.

Parameters:

Name Type Description Default
path PathLike

Path to the WAV file.

required

Returns:

Name Type Description
media_info MediaInfo

Information about the WAV file.

Raises:

Type Description
ValueError

If the WAV file is not PCM encoded.
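As an illustration of the kind of header metadata described above, the standard-library `wave` module can read the same fields from a PCM WAV file. This is only a sketch with a swapped-in technique, not the soundevent implementation, and `read_wav_header` is a hypothetical helper.

```python
import wave


def read_wav_header(path):
    """Hypothetical helper: read basic metadata from a PCM WAV header."""
    with wave.open(str(path), "rb") as wav:
        samplerate_hz = wav.getframerate()
        samples = wav.getnframes()
        return {
            "samplerate_hz": samplerate_hz,
            "samples": samples,
            # Duration follows from the frame count and the sampling rate.
            "duration_s": samples / samplerate_hz,
            "channels": wav.getnchannels(),
            # getsampwidth() returns bytes per sample.
            "bit_depth": 8 * wav.getsampwidth(),
        }
```

Only the header is read here, so the cost does not depend on the length of the recording.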

is_audio_file(path, strict=False)

Return whether the file is an audio file.

Parameters:

Name Type Description Default
path PathLike

Path to the file.

required
strict bool

Whether to check the file contents to ensure it is an audio file. This takes a bit longer to run. By default, False.

False

Returns:

Type Description
bool

Whether the file is an audio file.

Notes

The list of supported audio file extensions contains most of the audio file formats supported by the libsndfile library. See: https://libsndfile.github.io/libsndfile/

Some formats were excluded as they do not support seeking and thus are not suitable for random access.

Supported formats:

  • aiff
  • au
  • avr
  • caf
  • flac
  • htk
  • ircam
  • mat4
  • mat5
  • mp3
  • mpc2k
  • nist
  • ogg
  • paf
  • pvf
  • rf64
  • sds
  • svx
  • voc
  • w64
  • wav
  • wavex
  • wve

load_audio(path, offset=0, samples=None)

Load an audio file.

Parameters:

Name Type Description Default
path PathLike

The path to the audio file.

required
offset int

The offset in samples from the start of the audio file.

0
samples Optional[int]

The number of samples to load. If None, load the entire file.

None

Returns:

Name Type Description
ndarray

The audio data.

samplerate int

The sample rate of the audio file in Hz.
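Both `offset` and `samples` are expressed in samples rather than seconds. The sketch below shows the same idea of seeking to an offset and reading a fixed number of frames, using the standard-library `wave` module in place of the library's own reader. `read_samples` is a hypothetical helper and, as an assumption for brevity, it handles 16-bit PCM only.

```python
import struct
import wave


def read_samples(path, offset=0, samples=None):
    """Hypothetical helper: read 16-bit PCM frames starting at `offset`."""
    with wave.open(str(path), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("This sketch only handles 16-bit PCM.")
        samplerate = wav.getframerate()
        channels = wav.getnchannels()
        # Seek to the requested frame, then read `samples` frames,
        # or everything that remains when `samples` is None.
        wav.setpos(offset)
        count = wav.getnframes() - offset if samples is None else samples
        raw = wav.readframes(count)
    # Little-endian signed 16-bit integers, interleaved by channel.
    values = struct.unpack(f"<{len(raw) // 2}h", raw)
    frames = [values[i:i + channels] for i in range(0, len(values), channels)]
    return frames, samplerate
```

To start reading at a time given in seconds, convert it first, for example `offset = int(start_s * samplerate)`.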

load_clip(clip, audio_dir=None)

Load a clip from a file.

Parameters:

Name Type Description Default
clip Clip

The clip to load.

required
audio_dir Optional[PathLike]

The directory containing the audio file. If None, the recording path is assumed to be relative to the current working directory or an absolute path.

None

Returns:

Name Type Description
audio DataArray

The loaded clip. The returned clip stores the samplerate and time expansion of the recording from which it was extracted.

load_recording(recording, audio_dir=None)

Load a recording from a file.

Parameters:

Name Type Description Default
recording Recording

The recording to load.

required
audio_dir Optional[PathLike]

The directory containing the audio file. If None, the recording path is assumed to be relative to the current working directory or an absolute path.

None

Returns:

Name Type Description
audio DataArray

The loaded recording.
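The role of `audio_dir` can be pictured as a path join. The sketch below is an assumption about the behavior described above rather than the library's code, and `resolve_audio_path` is a hypothetical helper.

```python
from pathlib import Path


def resolve_audio_path(recording_path, audio_dir=None):
    """Hypothetical helper: locate a recording relative to `audio_dir`."""
    path = Path(recording_path)
    if audio_dir is None:
        # Used as given: absolute, or relative to the working directory.
        return path
    return Path(audio_dir) / path
```

Storing recording paths relative to a dataset root and supplying that root as `audio_dir` at load time keeps annotation files portable between machines.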

pcen(array, smooth=0.025, gain=0.98, bias=2, power=0.5, eps=1e-06, dim=Dimensions.time.value)

Apply PCEN to a spectrogram.

Parameters:

Name Type Description Default
array DataArray

The spectrogram to which to apply PCEN.

required
smooth float

The time constant for smoothing the input spectrogram. By default, 0.025.

0.025
gain float

The gain factor for the PCEN transform. By default, 0.98.

0.98
bias float

The bias factor for the PCEN transform. By default, 2.

2
power float

The power factor for the PCEN transform. By default, 0.5.

0.5
eps float

An epsilon value to prevent division by zero. By default, 1e-6.

1e-06
dim str

The dimension along which to apply PCEN.

Dimensions.time.value

Returns:

Type Description
DataArray

Spectrogram with PCEN applied.

Notes

This function applies the Per-Channel Energy Normalization (PCEN) transform to a spectrogram, as described in [1].

The PCEN transform is defined as:

\[ PCEN(X) = \left(\frac{X}{(\epsilon + S)^{\alpha}} + \delta\right)^r - \delta^r \]

where \(X\) is the input spectrogram, \(S\) is the smoothed version of the input spectrogram, \(\alpha\) is the gain factor, \(\delta\) is the bias factor, \(r\) is the power factor, and \(\epsilon\) is a small constant that prevents division by zero.

The smoothed version of the input spectrogram is computed using a first-order IIR filter:

\[ S_t = (1 - \beta) S_{t-1} + \beta X_t \]

where \(\beta\) is the smoothing factor.

The default values for the parameters are taken from the PCEN paper [1].
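The two formulas above can be traced by hand on a single frequency band. The following pure-Python sketch does so for a list of values over time. `pcen_1d` is a hypothetical helper, not the library function. It assumes that `gain` is the exponent applied to the smoothed spectrogram, that `power` is the outer exponent (which is what the default values 0.98 and 0.5 suggest), that `smooth` plays the role of the smoothing factor, and that the smoother starts at the first value.

```python
def pcen_1d(values, smooth=0.025, gain=0.98, bias=2.0, power=0.5, eps=1e-6):
    """Hypothetical helper: PCEN for one frequency band across time."""
    if not values:
        return []
    # Assumption: initialise the smoother with the first frame.
    smoothed = values[0]
    output = []
    for index, x in enumerate(values):
        if index > 0:
            # First-order IIR smoother: S_t = (1 - s) * S_{t-1} + s * X_t.
            smoothed = (1 - smooth) * smoothed + smooth * x
        # Gain normalisation followed by root compression.
        normalised = x / (eps + smoothed) ** gain
        output.append((normalised + bias) ** power - bias ** power)
    return output


steady = pcen_1d([1.0, 1.0, 1.0, 1.0])
onset = pcen_1d([1.0, 1.0, 1.0, 10.0])
```

For a constant input the smoother equals the input, so the output settles near `(1 + bias) ** power - bias ** power`. A sudden increase, as in `onset`, is divided by a smoother that has not yet caught up and therefore produces a larger value.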

References

[1] Wang, Y., Getreuer, P., Hughes, T., Lyon, R. F., & Saurous, R. A. (2017, March). Trainable frontend for robust and far-field keyword spotting. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5670-5674). IEEE.

resample(array, target_samplerate, window=None, dim=Dimensions.time.value)

Resample array data to a target sample rate along a given dimension.

Parameters:

Name Type Description Default
array DataArray

The data array to resample.

required
target_samplerate int

The target sample rate of the resampled data in Hz.

required
window Optional[str]

The window to use for resampling. See scipy.signal.resample for details.

None
dim str

The dimension along which to resample the audio data. By default, "time".

Dimensions.time.value

Returns:

Type Description
DataArray

The resampled audio data.

Notes

This function uses scipy.signal.resample, which resamples the data with the Fourier method and is suitable for audio data. For other resampling methods, consider using the xarray.DataArray.interp method.
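The idea behind Fourier-method resampling is to transform the signal, shrink or zero-pad its spectrum to the new length, and transform back. The sketch below shows that idea with a naive O(n²) DFT in pure Python. It is not scipy's implementation: `fourier_resample` is a hypothetical helper, it works on a plain list, and it simply drops the Nyquist bin when the shorter length is even, which is a simplification.

```python
import cmath
import math


def fourier_resample(x, n_out):
    """Hypothetical helper: resample a short real signal to n_out samples."""
    n_in = len(x)
    # Forward DFT (naive, fine for tiny demonstration signals).
    spectrum = [
        sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_in) for n in range(n_in))
        for k in range(n_in)
    ]
    # Highest frequency index kept, strictly below the Nyquist bin.
    half = (min(n_in, n_out) - 1) // 2
    resized = [0j] * n_out
    for k in range(half + 1):
        resized[k] = spectrum[k]  # non-negative frequencies
    for k in range(1, half + 1):
        resized[n_out - k] = spectrum[n_in - k]  # negative frequencies
    # Inverse DFT, scaled by 1 / n_in so amplitudes are preserved.
    return [
        (
            sum(
                resized[k] * cmath.exp(2j * math.pi * k * m / n_out)
                for k in range(n_out)
            )
            / n_in
        ).real
        for m in range(n_out)
    ]


# One cycle of a cosine sampled at 8 points, upsampled to 16 points.
coarse = [math.cos(2 * math.pi * n / 8) for n in range(8)]
fine = fourier_resample(coarse, 16)
```

When resampling by rate rather than by length, the output length would be derived from the rates, for example `round(len(x) * target_samplerate / samplerate)`.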

Raises:

Type Description
ValueError

If the input audio object is not an xarray.DataArray, or if it does not have a "samplerate" attribute, or if it does not have a "time" dimension.