What's Up, Docs? Document Summarization with LLMs

Looking for a great way to start working with LLMs? See if you can summarize research papers from an open archive of the social sciences. #development #science


Problem description

In this practice competition, you have three goals:

  1. Assemble one-paragraph summaries of a corpus of social science academic papers
  2. Gain experience working with large language models (LLMs)
  3. Be a good friend and neighbor (accomplish this one in whatever way suits you)

Document summarization is one of the classic LLM applications, so this should provide a good introduction to getting useful, reliable work from LLMs (which isn't always their strong suit).
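A typical first step is turning each paper's body into a summarization prompt for whatever model you choose. Here is a minimal sketch; the prompt wording, the character budget, and the function name are all assumptions, not part of the challenge:

```python
# Sketch: build a one-paragraph-summary prompt from a paper's main text.
# The truncation limit and prompt wording are placeholders -- tune them
# for whichever LLM and context window you actually use.
def build_prompt(paper_text: str, max_chars: int = 12000) -> str:
    """Truncate the paper body and wrap it in a summary request."""
    excerpt = paper_text[:max_chars]
    return (
        "Summarize the following social science paper in a single "
        "abstract-style paragraph:\n\n" + excerpt
    )
```

You would then send `build_prompt(text)` to your LLM client of choice and collect one summary per `paper_id`.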

Data


The data for this challenge is text derived from a sample of social science papers on SocArXiv. The included papers were all licensed under the CC-BY 4.0 license, longer than 10,000 characters, and likely written in English.

Features

The feature in this challenge is the main text of each paper, extracted from its preprint. The extraction excludes (or attempts to exclude) the abstract, references, acknowledgements, other text irrelevant to the summarization task, and any text highly similar to the abstract.

The features consist of just two columns:

  • paper_id (str) - A unique identifier for the paper
  • text (txt) - The main body of the paper in Markdown format.

Label

The label in this challenge is the author-written abstract in plaintext.

  • summary (txt) - The abstract of the paper.
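With the schema above, loading and joining the features and labels is a one-liner with pandas. The file names here are assumptions; check the competition's data download page for the actual paths:

```python
import pandas as pd

def load_training_data(features_path: str, labels_path: str) -> pd.DataFrame:
    """Join the feature file (paper_id, text) to the label file
    (paper_id, summary) on paper_id."""
    features = pd.read_csv(features_path)  # columns: paper_id, text
    labels = pd.read_csv(labels_path)      # columns: paper_id, summary
    return features.merge(labels, on="paper_id")
```

The merged frame gives you (text, summary) pairs for prompt development or fine-tuning.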

Data example


A single row in the training dataset looks like this:

  • paper_id: 0
  • text: ## FROM SOVEREIGNTY TO EXTRATERRITORIAL CONSCIOUSNESS...
  • summary: In this article, Victor Fan argues that analysing contemporary Hong Kong cinema...

Performance metric


For scoring how well your summaries align with the author-written abstracts, we'll use the ROUGE-2 F1 score, averaged across documents.

ROUGE-2 measures the overlap of bigrams (word pairs) between a given piece of text and a reference text. We take the arithmetic mean of the F1 scores for each document.

Score |$= \frac{1}{N} \sum_{i=1}^{N} F_1(prediction_i, reference_i)$|

where |$F_1 = \frac{2 \cdot P \cdot R}{P + R}, P=\frac{bigram\_matches}{bigrams\_in\_prediction}, R=\frac{bigram\_matches}{bigrams\_in\_reference}$|
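The per-document score can be sketched in pure Python. This is a simplified version (lowercased whitespace tokenization, no stemming, clipped bigram counts); the official scorer may normalize text differently, so treat it as a rough local check:

```python
from collections import Counter

def rouge2_f1(prediction: str, reference: str) -> float:
    """ROUGE-2 F1: bigram-overlap F1 between a prediction and a reference."""
    def bigrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(zip(tokens, tokens[1:]))

    pred, ref = bigrams(prediction), bigrams(reference)
    matches = sum((pred & ref).values())  # clipped bigram matches
    if matches == 0:
        return 0.0
    precision = matches / sum(pred.values())  # matches / bigrams in prediction
    recall = matches / sum(ref.values())      # matches / bigrams in reference
    return 2 * precision * recall / (precision + recall)
```

Averaging `rouge2_f1` over all (prediction, reference) pairs gives the leaderboard-style score.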

See the "Evaluation" section of the benchmark for a longer discussion and implementation.

Submission format


The submission file is a CSV with two columns, paper_id and summary.

For example, the .csv file you submit would look like:

paper_id,summary
1000,"Humanity has long been plagued by..."
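Writing that file from a mapping of paper IDs to generated summaries is straightforward with the standard library. The function name and `predictions` structure are illustrative assumptions:

```python
import csv

def write_submission(predictions: dict, path: str = "submission.csv") -> None:
    """Write predictions (paper_id -> summary) to the required two-column CSV.
    csv.writer handles quoting of summaries that contain commas or newlines."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["paper_id", "summary"])
        for paper_id, summary in predictions.items():
            writer.writerow([paper_id, summary])
```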


Looking for a great tutorial to get you started? Check out the benchmark walkthrough created for this challenge.

Good luck and enjoy this! If you have any questions, you can always visit the user forum for this competition!