Seminar: Abstractive summarization of fact check reports with pre-trained transformer tuning on extractive summaries

Date and time: 28 April 2022, 16:00 - 17:00
Room 468NB, and online via Zoom (to receive the link, write to svatek@vse.cz)

Presenter: Peter Vajdečka (KIZI VŠE)

Fact checking is an activity that aims to counter the global spread of disinformation. The results of this process, undertaken by numerous initiatives such as demagog.cz or politifact.com, are fact check reports written by human editors. Since the reports are frequently too long for a casual reader and contain auxiliary parts not directly relevant to judging the veracity of the claim, automated creation of fact check report summaries is a topical task. The reader can then look at the shorter summary, which contains the most salient points of the report, and decide whether to dig deeper into parts of the full report.

In natural language processing, neural network models with transformer architectures achieve state-of-the-art results on many downstream tasks, including text summarization. These models are pre-trained on massive text corpora, so only a small quantity of data is needed to fine-tune them, in contrast to the large amounts of training data needed when learning starts from scratch for a particular application.

We propose a novel procedure for text data reduction for the purpose of fine-tuning a natural language generation model, the Text-to-Text Transfer Transformer (T5), to summarize a fact check report. First, the Local Outlier Factor approach is used to generate an extractive summary of the report, with sentences vectorized using TF-IDF, Doc2Vec and BERT contextual representations. In addition, BERT is fine-tuned specifically for the given task and achieves the best results of all the vector representations compared. Finally, the T5 Transformer is fine-tuned on these extractive summaries (that is, reports reduced to fewer sentences than the originals) to generate the final abstractive summaries. On English texts from politifact.com, the new method outperformed all state-of-the-art methods. As regards the Czech language, we were, to our knowledge, the first to apply automatic summarization to demagog.cz data. For comparison, the new procedure was also applied to generate short summaries for a known Czech news dataset (SumeCzech); although we used only 10% of the initial training data for model fine-tuning, we surpassed most of the state-of-the-art results.
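To make the extractive step more concrete, here is a minimal sketch of Local Outlier Factor scoring over TF-IDF sentence vectors, written with the Python standard library only. It is an illustration under stated assumptions, not the presenter's implementation: the abstract does not say how LOF scores are turned into a sentence selection, so the rule used here (keep the sentences with the lowest LOF scores, in their original order) is an assumption, as are the function names `tfidf_vectors`, `lof_scores` and `extractive_summary`, the whitespace tokenizer, and the default `k`. The LOF variant is also simplified: it takes exactly `k` neighbours and ignores distance ties.

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Return one L2-normalised sparse TF-IDF vector (a dict) per sentence."""
    tokenized = [s.lower().split() for s in sentences]  # naive tokenizer (assumption)
    n = len(tokenized)
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed inverse document frequency, always positive.
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({t: v / norm for t, v in vec.items()})
    return vectors


def distance(a, b):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in keys))


def lof_scores(vectors, k=2):
    """Simplified Local Outlier Factor: values near 1 are inliers, larger values outliers."""
    n = len(vectors)
    if n < 2:
        return [1.0] * n
    k = min(k, n - 1)
    dist = [[distance(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    # Indices of the k nearest neighbours of each point, excluding the point itself.
    neighbours = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for i in range(n)
    ]
    k_dist = [dist[i][neighbours[i][-1]] for i in range(n)]
    # Local reachability density; the small constant avoids division by zero.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbours[i]]
        lrd.append(1.0 / (sum(reach) / len(reach) + 1e-10))
    return [
        sum(lrd[j] for j in neighbours[i]) / (len(neighbours[i]) * lrd[i])
        for i in range(n)
    ]


def extractive_summary(sentences, keep, k=2):
    """Keep the `keep` sentences with the lowest LOF score, in original order.

    Assumption: outlying sentences are treated as auxiliary and dropped.
    """
    scores = lof_scores(tfidf_vectors(sentences), k)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i])
    return [sentences[i] for i in sorted(ranked[:keep])]
```

Calling `extractive_summary(report_sentences, keep=3)` on a list of sentence strings returns three of them in their original order. In the procedure described above, reduced reports of this kind would then serve as the input for T5 fine-tuning, and the TF-IDF vectors could be swapped for Doc2Vec or BERT sentence representations; for real data, a library implementation of LOF would be the more practical choice.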