AutoMin 2025 Workshop
Date: August 28, 2025
Time: 14:00 – 17:00 CEST
Format: Fully online (Zoom)
Recording: Watch on YouTube
AI-generated summary (Zoom): editable version (please fix any errors you spot), or the original Zoom Doc file
Schedule
- 14:00 – 14:05 — Introduction (Organizers)
- 14:05 – 14:50 — Keynote 1: Sebastian Stüker (Zoom Inc) — The role of speech summarisation in multilingual unified communication
- 14:50 – 15:10 — Overview of the AutoMin 2025 Shared Task (Organizers) — Slides (PDF)
- 15:10 – 15:20 — Break
- 15:20 – 15:40 — System Presentation: GETALP team — Gabriela‑Nicole Gonzalez‑Saez (LIG, France)
- 15:40 – 16:00 — System Presentation: HallucinationIndexes team — Praveenkumar Katwe (IIT BH & Informatica Business Solutions, India)
- 16:00 – 16:45 — Keynote 2: Sebastien Bratières (Translated, Italy) — Multimodal foundation models for speech
- 16:45 – 17:00 — Wrap‑up
All times are in Central European Summer Time (CEST).
System papers
Keynote presentations
Sebastian Stüker (Zoom Inc)
Title: The role of speech summarisation in multilingual unified communication
Abstract:
Speech communication occurs in many forms in today’s world. Unified communication
systems bundle many different kinds of communication, including speech communication,
into unified platforms. For example, Zoom offers a platform that combines meetings,
telephone calls, webinars, events, voice messages, and in-meeting recordings into a
single platform.
In this talk we will discuss some of the research and development challenges for
integrating speech summarisation into unified communication, with examples of how
it enriches the user experience of Zoom’s platform.
We will further show how speech summarisation has evolved from being a singular
assistive technology to becoming an integral part of AI Agents that make the vision
of “from conversation to completion” a reality.
Sebastien Bratières (Translated, Italy)
Title: Multimodal foundation models for speech
Abstract: Human language is produced in different ways, in several modalities: Unicode glyphs for written text, sign language captured as video, or handwriting dynamics from a stylus. Speech, in turn, requires several sensors for machines to perceive: the audio and video channels cover nearly all aspects of speech production, from acoustic non-verbals to gaze, facial expression, and gestures; these are typically complemented by text as an easy-to-engineer representation of (verbal) surface form and semantic content.
In this talk, we look at speech as a multimodal phenomenon among others. Neural foundation models offer the promise of processing several modalities at once, for perception and generation. To unlock this promise, what are the architectural, training recipe and data manipulation choices we should take today? What challenges will remain in the long term?
Test Data Instructions (April 17th 2025)
Track A (Minuting): the test data is described in the CfP and included in the dataset (submit freeform .txt files with your generated minutes for the test set).
Track B (Q&A): the test data is provided in the file elitr-bench-qa_test2.json, available in the archive data.zip. Use the following password to unzip it: utter.
Your submitted file should follow the format shown in the submission sample. Below is an example of the required structure:
{
  "id": "meeting_en_test2_001",
  "questions": [
    {
      "id": "1",
      "question-type": "what",
      "answer-position": "S",
      "question": "What was the purpose of the meeting?",
      "groundtruth-answer": "Preparation for a workshop or conference event on automatic text summarization.",
      "generated-responses": [
        {
          "model": "<your system name>",
          "generated-response": "<your generated response for this question>"
        }
      ]
    }
  ]
}
Please make sure your submission strictly follows the required format and includes all necessary fields.
Important: while the groundtruth-answer field is included in the provided data, it must not be used by your system to generate answers. However, please copy this field into your submission as-is, to help us with scoring and evaluation.
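As a minimal sketch of how a submission entry could be assembled, the Python snippet below fills the generated-responses field for each question while copying every provided field (including groundtruth-answer) unchanged. The system name and the answer_question function are placeholders, not part of the shared task; in practice you would load elitr-bench-qa_test2.json and call your own QA system.

```python
import json

# Assumption: replace with your actual system name for the "model" field.
SYSTEM_NAME = "my-system"

def answer_question(question):
    # Hypothetical placeholder; replace with your system's actual QA call.
    return "<your generated response for this question>"

def build_submission_entry(meeting):
    """Add a generated-responses entry to every question in a meeting,
    leaving all provided fields (including groundtruth-answer) as-is."""
    for q in meeting["questions"]:
        q["generated-responses"] = [{
            "model": SYSTEM_NAME,
            "generated-response": answer_question(q["question"]),
        }]
    return meeting

# Minimal example mirroring the required structure shown above.
meeting = {
    "id": "meeting_en_test2_001",
    "questions": [{
        "id": "1",
        "question-type": "what",
        "answer-position": "S",
        "question": "What was the purpose of the meeting?",
        "groundtruth-answer": "Preparation for a workshop or conference "
                              "event on automatic text summarization.",
    }],
}
print(json.dumps(build_submission_entry(meeting), indent=2))
```

Writing the resulting list of meeting objects back out with json.dumps should produce a file in the required format.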