Extractive Supported Datasets

Note

In addition to the below datasets, all of the abstractive datasets can be converted for extractive summarization and thus be used to train models. See Option 2: Automatic pre-processing through nlp for more information.

There are several ways to obtain and process the datasets below:

  1. Download the converted extractive version for use with the training script, which will preprocess the data automatically (tokenization, etc.). Note that all the provided extractive versions are split every 500 documents and are compressed. You will have to process the data manually if you desire different chunk sizes.

  2. Download the processed abstractive version. This is the original data after being run through its respective processor located in the datasets folder.

  3. Download the original data in its original form, which depends on how it was obtained in the original paper.

The table under each heading contains quick links to download the data. Beneath that are instructions to process the data manually.

CNN/DM

The CNN/DailyMail (Hermann et al., 2015) dataset contains 93k articles from CNN and 220k articles from the Daily Mail newspapers. Both publishers supplement their articles with bullet-point summaries. The non-anonymized variant from See et al. (2017) is used.

| Type | Link |
|------|------|
| Processor Repository | artmatsak/cnn-dailymail |
| Data Download Link | CNN/DM official website |
| Processed Abstractive Dataset | Google Drive |
| Extractive Version | Google Drive |

Download and unzip the stories directories for both CNN and Daily Mail from the CNN/DM official website linked in the table above. The files can be downloaded from the terminal with gdown, which can be installed with pip install gdown.

pip install gdown
# download cnn_stories.tgz and dailymail_stories.tgz from Google Drive
gdown https://drive.google.com/uc?id=0BwmD_VLjROrfTHk4NFg2SndKcjQ
gdown https://drive.google.com/uc?id=0BwmD_VLjROrfM1BxdkxVaTY2bWs
# extract the cnn/stories and dailymail/stories directories
tar zxf cnn_stories.tgz
tar zxf dailymail_stories.tgz

Note

The above Google Drive links may be outdated depending on the time you are reading this. Check the CNN/DM official website for the most up-to-date download links.

Next, run the processing code in the git submodule for artmatsak/cnn-dailymail located in datasets/cnn_dailymail_processor. Run python make_datafiles.py /path/to/cnn/stories /path/to/dailymail/stories, replacing /path/to/cnn/stories with the path to where you saved the cnn/stories directory that you downloaded; similarly for dailymail/stories.
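For example, if the archives were extracted into a data directory in your home folder (these exact paths are assumptions; substitute wherever you unpacked cnn/stories and dailymail/stories), the invocation might look like:

cd datasets/cnn_dailymail_processor
python make_datafiles.py ~/data/cnn/stories ~/data/dailymail/stories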

For each of the URL lists (all_train.txt, all_val.txt, and all_test.txt) in cnn_dailymail_processor/url_lists, the corresponding stories are read from disk and written to the text files train.source, train.target, val.source, val.target, test.source, and test.target. These will be placed in the newly created cnn_dm directory.

The original processing code is available at abisee/cnn-dailymail, but this project uses the artmatsak/cnn-dailymail processing code because it does not tokenize and writes the data to the text files train.source, train.target, val.source, val.target, test.source, and test.target, which is the format expected by convert_to_extractive.py.
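As a quick sanity check (a hedged example; the cnn_dm path assumes the directory was created inside datasets/cnn_dailymail_processor), each .source file should contain the same number of lines as its .target counterpart:

wc -l cnn_dm/train.source cnn_dm/train.target
wc -l cnn_dm/val.source cnn_dm/val.target
wc -l cnn_dm/test.source cnn_dm/test.target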

WikiHow

WikiHow (Koupaee and Wang, 2018) is a large-scale dataset of instructions from the WikiHow.com website. Each of its roughly 200k examples consists of multiple instruction-step paragraphs along with a summarizing sentence. The task is to generate the concatenated summary sentences from the paragraphs.

| Statistic | Value |
|-----------|-------|
| Dataset Size | 230,843 |
| Average Article Length | 579.8 |
| Average Summary Length | 62.1 |
| Vocabulary Size | 556,461 |

| Type | Link |
|------|------|
| Processor Repository | HHousen/WikiHow-Dataset (Original Repo) |
| Data Download Link | wikihowAll.csv (mirror) and wikihowSep.csv |
| Processed Abstractive Dataset | Google Drive |
| Extractive Version | Google Drive |

Processing Steps:

  1. Download wikihowAll.csv (main repo for most up-to-date links) to datasets/wikihow_processor

  2. Run python process.py (runtime: 2m), which will create a new directory called wikihow containing the train.source, train.target, val.source, val.target, test.source and test.target files necessary for convert_to_extractive.py.
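A minimal sketch of the two steps above, assuming wikihowAll.csv has already been downloaded into datasets/wikihow_processor (the final ls simply lists the expected output files):

cd datasets/wikihow_processor
python process.py
ls wikihow/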

PubMed/ArXiv

ArXiv and PubMed (Cohan et al., 2018) are two long-document datasets of scientific publications from [arXiv.org](http://arxiv.org/) (215k) and PubMed (133k). The task is to generate the abstract from the paper body.

| Datasets | # docs | avg. doc. length (words) | avg. summary length (words) |
|----------|--------|--------------------------|-----------------------------|
| CNN | 92K | 656 | 43 |
| Daily Mail | 219K | 693 | 52 |
| NY Times | 655K | 530 | 38 |
| PubMed (this dataset) | 133K | 3016 | 203 |
| arXiv (this dataset) | 215K | 4938 | 220 |

| Type | Link |
|------|------|
| Processor Repository | HHousen/ArXiv-PubMed-Sum (Original Repo) |
| Data Download Link | PubMed (mirror) and ArXiv (mirror) |
| Processed Abstractive Dataset | Google Drive |
| Extractive Version | Google Drive |

Processing Steps:

  1. Download PubMed and ArXiv (main repo for most up-to-date links) to datasets/arxiv-pubmed_processor

  2. Run the command python process.py <arxiv_articles_dir> <pubmed_articles_dir> (runtime: 5-10m), which will create a new directory called arxiv-pubmed containing the train.source, train.target, val.source, val.target, test.source and test.target files necessary for convert_to_extractive.py.

See the repository’s README.md.
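A hedged sketch of the full sequence, assuming the downloaded archives were extracted into arxiv-dataset and pubmed-dataset inside the processor directory (substitute the actual directory names produced by the archives):

cd datasets/arxiv-pubmed_processor
python process.py arxiv-dataset pubmed-dataset
ls arxiv-pubmed/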

Note

To convert this dataset to the extractive format, using the --sentencizer option is recommended due to the size of the dataset. Additionally, --max_sentence_ntokens should be set to 300 and --max_example_nsents should be set to 600. See the Convert Abstractive to Extractive Dataset section for more information. The full command should be similar to:

python convert_to_extractive.py ./datasets/arxiv-pubmed_processor/arxiv-pubmed \
--shard_interval 5000 \
--sentencizer \
--max_sentence_ntokens 300 \
--max_example_nsents 600