RACE Dataset

Leaderboard

Description

RACE is a large-scale reading comprehension dataset with more than 28,000 passages and nearly 100,000 questions. The dataset is collected from English examinations in China, which are designed for middle school and high school students. The dataset can serve as training and test sets for machine comprehension.

Notes

1. The RACE dataset is available for non-commercial research purposes only.

2. All passages were obtained from the Internet and are not the property of Carnegie Mellon University. We are not responsible for the content or the meaning of these passages.

3. You agree not to reproduce, duplicate, copy, sell, trade, resell or exploit for any commercial purpose, any portion of the contexts and any portion of derived data.

4. We reserve the right to terminate your access to the RACE dataset at any time.

Download

Please use this link to download the datasets.

Paper

RACE: Large-scale ReAding Comprehension Dataset From Examinations

Guokun Lai*, Qizhe Xie*, Hanxiao Liu, Yiming Yang and Eduard Hovy.

@article{lai2017large,
    title={RACE: Large-scale ReAding Comprehension Dataset From Examinations},
    author={Lai, Guokun and Xie, Qizhe and Liu, Hanxiao and Yang, Yiming and Hovy, Eduard},
    journal={arXiv preprint arXiv:1704.04683},
    year={2017}
}

Contact

Please contact Guokun Lai and Qizhe Xie for questions about the dataset.

Data Usage

Each passage is a JSON file. The JSON file contains the following fields:

  1. article: A string containing the passage.
  2. questions: A list of strings, one per query. There are two types of questions: the first is an interrogative sentence; the second is a statement with a placeholder, represented by _.
  3. options: A list of option lists. Each option list contains 4 strings, the candidate options for the corresponding question.
  4. answers: A list containing the gold label for each query.
  5. id: A unique id for the passage within the dataset.
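
The fields above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the names `Example`, `load_passage`, `iter_examples`, and `answer_text` are hypothetical helpers introduced here for illustration. It assumes that questions, options, and answers are parallel lists, and that each gold label is one of the letters A to D indexing the four options in order; the field descriptions above do not state the label encoding, so check this against the downloaded files.

```python
import json
from typing import Iterator, List, NamedTuple

# Assumption: gold labels are letters that index the four options in order.
LABELS = ["A", "B", "C", "D"]


class Example(NamedTuple):
    """One question together with its passage (hypothetical container)."""
    passage_id: str
    article: str
    question: str
    options: List[str]
    answer: str


def load_passage(path: str) -> dict:
    """Read one passage file and return the parsed JSON object."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def iter_examples(passage: dict) -> Iterator[Example]:
    """Yield one Example per question in a passage."""
    questions = passage["questions"]
    options = passage["options"]
    answers = passage["answers"]
    # Assumption: entry i of each list describes question i.
    if not (len(questions) == len(options) == len(answers)):
        raise ValueError("questions, options and answers differ in length")
    for question, opts, answer in zip(questions, options, answers):
        yield Example(passage["id"], passage["article"], question, opts, answer)


def answer_text(example: Example) -> str:
    """Return the option string selected by the gold label.

    Raises ValueError if the label is not one of the assumed letters.
    """
    return example.options[LABELS.index(example.answer)]
```

A typical use would be `for ex in iter_examples(load_passage(path)): ...`, where `path` points at one passage file from the downloaded archive. Flattening a passage into per-question examples in this way keeps the article text attached to every question, which is convenient when feeding a model one question at a time.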