Ever since most of our interactions went virtual, automatic support for running online meetings has become essential. Due to frequent meetings and the resulting context switching, people are experiencing an information overload of epic proportions. Hence, a tool that automatically summarizes meeting proceedings would be a valuable addition to the virtual workplace. Automatic minuting is close to summarization; however, there are subtle differences. While summarization aims to generate a concise and coherent summary of the text, minuting is more concerned with adequately capturing the contents of the meeting (where coverage is probably more important than coherence and conciseness). Apart from the main task of summarizing meeting transcripts, another crucial task is to develop effective evaluation measures to judge the quality of the automatically generated minutes.
AutoMin 2023 is the second run of our shared task on automatic minuting of meeting transcripts. Our objective is to drive community efforts towards understanding the challenges of the task and developing tools for this important use case. With this shared task, we invite the speech and natural language processing community to investigate the challenges of automatic minuting with real meeting data in two different settings: technical project meetings (in both English and Czech) and parliamentary proceedings (English). Additionally, we hope to explore better options for evaluating the quality of meeting minutes.
We propose four tasks in total. Every participant should take part in at least one of Task A and Task D; Tasks B and C are optional.
The two main training data sources for the two domains of the test set are:
For the ELITR Minuting Corpus, please use only the train and dev sets for training; do not use the test or test2 sets.
Aside from this, we recommend the following datasets for use in your training, although their domains do not match ours:
In any case, please clearly describe in your system paper which data was used and in what way. A comprehensive list of summarization datasets can be found here:
We will use several quality criteria that are commonly used to evaluate text produced by automatic language generation systems: adequacy, readability, grammaticality, and relevance. Unlike other similar tasks, textual coherence will not be taken into account, because we believe meeting minutes are not always supposed to have a coherent textual form. The manual evaluation will be carried out blindly by our annotators.
Additionally, we will launch a pilot evaluation via our ALIGNMEET tool. The evaluation will be based on the alignment between the transcript and minutes.
ROUGE will be the primary metric for automatic evaluation (ROUGE-1, ROUGE-2, ROUGE-L). Additionally, we will use BERTScore and/or BARTScore.
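As an illustration only, the sketch below shows how such scores could be computed with the publicly available rouge-score and bert-score Python packages; the example texts are made up, and this is a sketch of possible tooling, not the official evaluation pipeline.

```python
# Illustrative sketch: scoring one candidate minute against one reference minute.
# Assumes the rouge-score and bert-score packages; not the official AutoMin pipeline.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "The team agreed to release version 2.0 next Friday."          # hypothetical reference minute
candidate = "Version 2.0 will be released next Friday, the team decided."  # hypothetical system output

# ROUGE-1, ROUGE-2 and ROUGE-L F-measures
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
for name, result in scorer.score(reference, candidate).items():
    print(f"{name}: {result.fmeasure:.3f}")

# BERTScore precision, recall and F1 (expects lists of candidates and references)
P, R, F1 = bert_score([candidate], [reference], lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```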
For the subsidiary Tasks B and C, class-wise F1 will be the evaluation metric.
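For illustration, per-class F1 can be computed with scikit-learn as in the sketch below; the label set and predictions are hypothetical placeholders, not the actual task labels.

```python
# Illustrative sketch: class-wise F1 for a classification-style subtask.
# The labels and predictions below are invented placeholders.
from sklearn.metrics import f1_score

y_true = ["relevant", "irrelevant", "relevant", "relevant", "irrelevant"]
y_pred = ["relevant", "relevant", "relevant", "irrelevant", "irrelevant"]

labels = ["relevant", "irrelevant"]
per_class_f1 = f1_score(y_true, y_pred, labels=labels, average=None)
for label, score in zip(labels, per_class_f1):
    print(f"F1({label}) = {score:.3f}")
```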
Task D will not be evaluated by a single criterion. Instead, all submissions to Task D will be evaluated in terms of their Pearson correlation with all manual scores and all other automatic evaluation scores.
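As a sketch of what this means in practice, the Pearson correlation between a submitted metric and one manual criterion could be computed as below; the score vectors are invented for the example.

```python
# Illustrative sketch: correlating a submitted automatic metric with manual scores.
# The score values are invented; one entry per evaluated minute.
from scipy.stats import pearsonr

manual_adequacy = [4.0, 3.5, 2.0, 4.5, 3.0]        # hypothetical human adequacy judgments
submitted_metric = [0.71, 0.64, 0.38, 0.80, 0.55]  # hypothetical metric scores

r, p_value = pearsonr(manual_adequacy, submitted_metric)
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
```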
All participants are required to submit a brief technical report describing their method. The proceedings will be published in the ACL Anthology.
For further information about this task and dataset, please contact us at automin@ufal.mff.cuni.cz.