Evaluation#
In computational bioacoustics, a key aspiration is to replicate and automate the intricate analyses conducted by human experts. Achieving this would be a major step forward in scalability, enabling the analysis of vast amounts of data and uncovering new insights into animal ecology and behavior.
To assess the success and reliability of automated methods, it is essential to establish clear evaluation criteria. soundevent provides a set of objects dedicated to the evaluation process. This page describes these objects.
Matches#
When assessing a method's ability to accurately identify sound events in audio, a common practice is to compare the predicted sound events against the ground truth. Regardless of the specific matching approach, the outcome is typically a set of matched and unmatched sound events. To represent these outcomes, the soundevent package provides the Match object.
A Match object holds a source (predicted) and a target (annotated) sound event, an affinity score that numerically measures their geometric similarity, an overall numeric score for the match, and a set of additional metrics. Some predictions or ground truth sound events may remain unmatched; this is represented by a Match object in which either the source or the target (but not both) is empty. The additional metrics are instances of Feature, representing named continuous values.
```mermaid
erDiagram
    Match {
        float affinity
        float score
    }
    SoundEventPrediction
    SoundEventAnnotation
    Feature
    Match }|--o| SoundEventPrediction : source
    Match }|--o| SoundEventAnnotation : target
    Match ||--o{ Feature : metrics
```
Understanding Affinity and Score
Affinity measures how well the geometries or regions of interest of two matched sound events align, disregarding any information about the semantic "meaning" of the sound. In contrast, the overall score for the match incorporates this semantic information. For instance, a predicted sound event may align geometrically with one of the ground truth events while the assigned class is entirely incorrect. In that case the affinity is high, but the score is low because the semantic interpretation is wrong.
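The sketch below shows how these pieces fit together: one matched pair with high affinity but a low score, and one unmatched ground truth event. It is a minimal sketch that assumes the soundevent.data constructors accept the field names shown in the diagram above; the recording, geometry coordinates, and metric values are purely illustrative, and exact signatures may differ between soundevent versions.

```python
from soundevent import data

# Illustrative recording on which both annotation and prediction were made.
recording = data.Recording(
    path="recordings/survey_01.wav",  # hypothetical file
    duration=60.0,
    samplerate=250_000,
    channels=1,
)

# Ground truth: an annotated sound event with a bounding-box geometry.
# Coordinate order assumed to be [start_time, low_freq, end_time, high_freq].
annotated = data.SoundEventAnnotation(
    sound_event=data.SoundEvent(
        recording=recording,
        geometry=data.BoundingBox(coordinates=[1.0, 20_000, 1.2, 45_000]),
    ),
)

# Model output: a prediction whose geometry overlaps the annotation closely.
predicted = data.SoundEventPrediction(
    sound_event=data.SoundEvent(
        recording=recording,
        geometry=data.BoundingBox(coordinates=[1.02, 21_000, 1.25, 44_000]),
    ),
    score=0.87,
)

# High affinity (good geometric overlap) but a low overall score, e.g.
# because the predicted class was wrong. The metric is a Feature with a
# name/value pair, as described above.
match = data.Match(
    source=predicted,
    target=annotated,
    affinity=0.82,
    score=0.10,
    metrics=[data.Feature(name="iou", value=0.82)],
)

# An unmatched ground truth event: a Match with no source.
missed = data.Match(target=annotated, affinity=0.0, score=0.0)
```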
Clip Evaluation#
The ClipEvaluation object encapsulates all information related to the assessment of a Clip Prediction against the ground truth Clip Annotation. It contains the full set of Match objects, covering both matched and unmatched sound events, along with an overall numeric score for the prediction. A list of supplementary metrics provides insight into specific aspects of the prediction's performance.
```mermaid
erDiagram
    ClipEvaluation {
        UUID uuid
        float score
    }
    ClipAnnotation
    ClipPrediction
    Match
    Feature
    ClipEvaluation }|--|| ClipAnnotation : annotations
    ClipEvaluation }|--|| ClipPrediction : predictions
    ClipEvaluation ||--o{ Match : matches
    ClipEvaluation ||--o{ Feature : metrics
```
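Continuing the sketch from the Matches section, a ClipEvaluation can be assembled from a clip's annotation, its prediction, and the matches between them. Field names follow the diagram above; the clip boundaries and score values are illustrative.

```python
# Continuing the sketch above: group the annotation, the prediction, and the
# matches for a single clip into one ClipEvaluation.
clip = data.Clip(recording=recording, start_time=0.0, end_time=5.0)

clip_annotation = data.ClipAnnotation(clip=clip, sound_events=[annotated])
clip_prediction = data.ClipPrediction(clip=clip, sound_events=[predicted])

clip_evaluation = data.ClipEvaluation(
    annotations=clip_annotation,
    predictions=clip_prediction,
    matches=[match, missed],
    score=0.5,  # illustrative overall score for this clip
    metrics=[data.Feature(name="recall", value=0.5)],
)
```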
Evaluation#
The Evaluation object is a collection of clip evaluations, together with an overall score and additional metrics. It represents a model's performance across a set of Clips, providing a means to assess its correctness and reliability. Because the evaluation of performance is context-dependent, the object includes an evaluation_task field: a text field indicating the specific task the predictions attempted. This context ensures that the scores and metrics have a well-defined meaning. Examples of evaluation tasks include "Clip Classification", where predictions aim to determine the "class" of each processed clip. There are no strict restrictions on this field, but using standard names is recommended to make evaluations easier to compare.
```mermaid
erDiagram
    Evaluation {
        UUID uuid
        datetime created_on
        str evaluation_task
        float score
    }
    ClipEvaluation
    Feature
    Evaluation ||--|{ ClipEvaluation : clip_evaluations
    Evaluation ||--o| Feature : metrics
```
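As a rough sketch of how the pieces combine, the per-clip evaluations can then be gathered into a single Evaluation labelled with the task being assessed. The evaluation_task string and metric values below are illustrative; field names follow the diagram above.

```python
# Continuing the sketch: aggregate per-clip evaluations into one Evaluation,
# stating which task the predictions were attempting.
evaluation = data.Evaluation(
    evaluation_task="Sound Event Detection",  # free text; standard names recommended
    clip_evaluations=[clip_evaluation],
    score=0.5,  # illustrative overall score
    metrics=[data.Feature(name="mean_average_precision", value=0.42)],
)
```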
Evaluation Set#
An Evaluation Set is a curated collection of fully annotated clips intended for reliable evaluation; it can be viewed as the equivalent of a benchmark dataset. Each evaluation set has a name and a description: the name provides a clear identifier, while the description communicates the contents and intended use cases of the set. This allows researchers and practitioners to assess the performance and reliability of different models and algorithms in a standardized manner.
```mermaid
erDiagram
    EvaluationSet {
        UUID uuid
        datetime created_on
        str name
        str description
    }
    ClipAnnotation
    EvaluationSet }|--|{ ClipAnnotation : clip_annotations
```
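A minimal sketch of an evaluation set, reusing the fully annotated clip from the earlier examples; the name and description are hypothetical, and field names follow the diagram above.

```python
# Continuing the sketch: a named, described collection of fully annotated
# clips to evaluate predictions against.
evaluation_set = data.EvaluationSet(
    name="nocturnal-surveys-test",  # hypothetical benchmark name
    description="Held-out, fully annotated clips from the 2023 night surveys.",
    clip_annotations=[clip_annotation],
)
```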