Ever since most of our interactions went virtual, automatic support for running online meetings has become essential. Due to frequent meetings and the resulting context switching, people are experiencing an information overload of epic proportions. Hence, a tool that automatically summarizes meeting proceedings would be a valuable addition to the virtual workplace. Automatic minuting is close to summarization; however, there are subtle differences. While summarization aims to generate a concise and coherent summary of the text, minuting is more concerned with adequately capturing the contents of the meeting (where coverage is probably more important than coherence and conciseness). Apart from the main task of summarizing meeting transcripts, another crucial task is to develop effective evaluation measures to judge the quality of the automatically generated minutes.
AutoMin 2023 is the second run of our shared task on automatic minuting of meeting transcripts. Our objective is to drive community efforts towards understanding the challenges of the task and developing tools for this important use case. With this shared task, we invite the speech and natural language processing community to investigate the challenges of automatic minuting with real meeting data in two different settings: technical project meetings (in both English and Czech) and parliamentary proceedings (English). Additionally, we hope to explore better options for evaluating the quality of meeting minutes.
We propose four tasks in total. Every participant should take part in at least one of Task A and Task D; Tasks B and C are optional.
The two main training data sources for the two domains of the test set are:
For the ELITR Minuting Corpus, please use only the train and dev sets for training; do not use the test or test2 sets.
Aside from this, we recommend the following datasets for use in your training, although their domains do not match ours:
In any case, please clearly describe in your system paper which data was used and in what way. A comprehensive list of summarization datasets can be found here:
We will use several quality criteria that are commonly used to evaluate text produced by automatic language generation systems: adequacy, readability, grammaticality, and relevance. Unlike other similar tasks, textual coherence will not be taken into account, because we believe meeting minutes are not always supposed to have a coherent textual form. The manual evaluation will be carried out blindly by our annotators.
Additionally, we will launch a pilot evaluation via our ALIGNMEET tool. The evaluation will be based on the alignment between the transcript and minutes.
ROUGE will be the primary metric for automatic evaluation (ROUGE-1, ROUGE-2, ROUGE-L). Additionally, we will use BERTScore and/or BARTScore.
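As an illustration only, the sketch below shows how such scores could be computed with the publicly available rouge-score and bert-score Python packages; the example texts are made up, and this is a sketch of possible tooling, not the official evaluation pipeline.

```python
# Illustrative sketch: scoring one candidate minute against one reference minute.
# Assumes the rouge-score and bert-score packages; not the official AutoMin pipeline.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "The team agreed to release version 2.0 next Friday."          # hypothetical reference minute
candidate = "Version 2.0 will be released next Friday, the team decided."  # hypothetical system output

# ROUGE-1, ROUGE-2 and ROUGE-L F-measures
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
for name, result in scorer.score(reference, candidate).items():
    print(f"{name}: {result.fmeasure:.3f}")

# BERTScore precision, recall and F1 (expects lists of candidates and references)
P, R, F1 = bert_score([candidate], [reference], lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```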
For the subsidiary Tasks B and C, class-wise F1 will be the evaluation metric.
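For illustration, per-class F1 can be computed with scikit-learn as in the sketch below; the label set and predictions are hypothetical placeholders, not the actual task labels.

```python
# Illustrative sketch: class-wise F1 for a classification-style subtask.
# The labels and predictions below are invented placeholders.
from sklearn.metrics import f1_score

y_true = ["relevant", "irrelevant", "relevant", "relevant", "irrelevant"]
y_pred = ["relevant", "relevant", "relevant", "irrelevant", "irrelevant"]

labels = ["relevant", "irrelevant"]
per_class_f1 = f1_score(y_true, y_pred, labels=labels, average=None)
for label, score in zip(labels, per_class_f1):
    print(f"F1({label}) = {score:.3f}")
```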
Task D will not be evaluated by a single criterion. Instead, all submissions to Task D will be evaluated in terms of their Pearson correlation with all manual scores and all other automatic evaluation scores.
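As a sketch of what this means in practice, the Pearson correlation between a submitted metric and one manual criterion could be computed as below; the score vectors are invented for the example.

```python
# Illustrative sketch: correlating a submitted automatic metric with manual scores.
# The score values are invented; one entry per evaluated minute.
from scipy.stats import pearsonr

manual_adequacy = [4.0, 3.5, 2.0, 4.5, 3.0]        # hypothetical human adequacy judgments
submitted_metric = [0.71, 0.64, 0.38, 0.80, 0.55]  # hypothetical metric scores

r, p_value = pearsonr(manual_adequacy, submitted_metric)
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
```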
All participants are required to submit a brief technical report describing their method. The proceedings will be published in the ACL Anthology.
For further information about this task and dataset, please contact us at automin@ufal.mff.cuni.cz.