This package contains the Quasar datasets for question answering by search
and reading. There are two datasets -- Quasar-S and Quasar-T. Quasar-S
consists of cloze-style questions over software entities, and Quasar-T
consists of trivia questions. For both datasets we also provide long and
short contexts extracted from text corpora using a Lucene search for the
questions. The datasets are organized in the following directory structure:

.
|-- dataset_statistics.py
|-- quasar-s
|   |-- candidates.txt
|   |-- contexts
|   |   |-- long
|   |   |   |-- dev_contexts.json.gz
|   |   |   |-- test_contexts.json.gz
|   |   |   `-- train_contexts.json.gz
|   |   `-- short
|   |       |-- dev_contexts.json.gz
|   |       |-- test_contexts.json.gz
|   |       `-- train_contexts.json.gz
|   |-- questions
|   |   |-- dev_questions.json.gz
|   |   |-- test_questions.json.gz
|   |   `-- train_questions.json.gz
|   `-- relation_annotations.json
|-- quasar-t
|   |-- answer_annotations.json
|   |-- contexts
|   |   |-- long
|   |   |   |-- dev_contexts.json.gz
|   |   |   |-- dev_nps.json.gz
|   |   |   |-- test_contexts.json.gz
|   |   |   |-- test_nps.json.gz
|   |   |   |-- train_contexts.json.gz
|   |   |   `-- train_nps.json.gz
|   |   `-- short
|   |       |-- dev_contexts.json.gz
|   |       |-- dev_nps.json.gz
|   |       |-- test_contexts.json.gz
|   |       |-- test_nps.json.gz
|   |       |-- train_contexts.json.gz
|   |       `-- train_nps.json.gz
|   |-- genre_annotations.json
|   `-- questions
|       |-- dev_questions.json.gz
|       |-- test_questions.json.gz
|       `-- train_questions.json.gz
`-- readme.txt

There are two sub-directories for each dataset: 'questions/', containing
the questions and answers split into train/test/dev sets, and 'contexts/',
containing the long and short pseudo-documents retrieved for each question
by our retrieval system. There are three types of files in these folders:

1. _questions.json.gz: The questions, one JSON-formatted string per line,
   in the following format:

     {
       "answer": "sarajevo",
       "question": "In the act that incited WWI , Serbian Gavrilo Princip assassinated Archduke Franz Ferdinand in 1914 in what city ?",
       "uid": "s0q11",
       "tags": ["1tok", "yes-answer-long", "yes-answer-short"]
     }

   - If "tags" contains "1tok", the answer is a single token.
   - If "tags" contains "yes-answer-long", the answer is present in at
     least one retrieved long pseudo-document for this question.
   - If "tags" contains "yes-answer-short", the answer is present in at
     least one retrieved short pseudo-document for this question.

   Note: "yes-answer" is determined by searching for the answer string in
   the context string, without tokenizing either.

   For Quasar-S the questions are cloze-style, and the cloze to be filled
   in is denoted by "@placeholder". E.g.:

     {
       "answer": "programming-languages",
       "question": "lisp -- lisp is a family of general purpose @placeholder influenced by the lambda-calculus and with the ability to manipulate source code as a data structure .",
       "uid": "lisp@programming-languages@45",
       "tags": ["yes-answer-long"]
     }

2. _contexts.json.gz: The retrieved pseudo-documents (long / short) for
   the questions. Each line corresponds to the question on the same line
   in _questions.json.gz, and is a JSON-formatted string in the following
   format:

     {
       "contexts": [
         [
           62.570347,
           "On mac OS El Capitan I have a virtual-machine vagrant with laravel-homestead box ."
         ],
         ...
       ],
       "uid": "homestead@php@159"
     }

   Each pseudo-document is accompanied by a float -- its retrieval score.
   The documents are sorted according to their retrieval scores. The
   "uid" matches that of the question for which these contexts were
   retrieved.

3. _nps.json.gz (only for Quasar-T): Contiguous chunks of NN*-tagged
   tokens from the contexts, provided as candidate answers. Again, each
   line corresponds to the question on the same line in
   _questions.json.gz, in the format:

     {
       "nps": [
         ...
         [
           "aerosol spray",
           69,
           29
         ]
       ],
       "uid": "s3q41931"
     }

   Each element in "nps" is a list of three elements --
   [candidate, context_id, token_id]. The context_id is the index into
   the list of context documents, and token_id is the position of the
   start of the NP in that context when tokenized by whitespace. Both are
   0-based indices. If the correct answer is not detected as an NN*
   chunk, we add it to the list of NPs, with context_id and token_id set
   to -1. (A sketch that maps these indices back to context tokens
   appears after the reading example below.)

The "*.json.gz" files can be read in Python as follows:

    import gzip
    import json

    def read_data(path):
        # Each file holds one JSON object per line.
        with gzip.open(path, 'rt') as f:
            for line in f:
                yield json.loads(line)
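For example, here is a minimal sketch, using read_data from above, that
pairs each Quasar-T dev question with its retrieved short contexts and
re-computes the "yes-answer-short" check (the paths pick out one split
and one context length; adjust as needed):

    questions = read_data('quasar-t/questions/dev_questions.json.gz')
    contexts = read_data('quasar-t/contexts/short/dev_contexts.json.gz')
    n_total, n_found = 0, 0
    for q, c in zip(questions, contexts):
        assert q['uid'] == c['uid']  # the two files are line-aligned
        n_total += 1
        # Untokenized substring search, as used for the "yes-answer" tags.
        if any(q['answer'] in doc for _score, doc in c['contexts']):
            n_found += 1
    print('%d of %d dev answers appear in a short context' % (n_found, n_total))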
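Similarly, here is a sketch that maps each Quasar-T candidate NP back to
its source span, assuming each candidate string equals the
whitespace-joined tokens it was extracted from (any mismatches are
printed rather than treated as errors):

    np_lines = read_data('quasar-t/contexts/short/dev_nps.json.gz')
    ctx_lines = read_data('quasar-t/contexts/short/dev_contexts.json.gz')
    for np_line, c in zip(np_lines, ctx_lines):
        assert np_line['uid'] == c['uid']
        for candidate, context_id, token_id in np_line['nps']:
            if context_id == -1:
                continue  # gold answer that was not detected as an NN* chunk
            tokens = c['contexts'][context_id][1].split()
            span = ' '.join(tokens[token_id:token_id + len(candidate.split())])
            if span != candidate:
                print('mismatch in %s: %r vs %r' % (np_line['uid'], candidate, span))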
'quasar-s/' also contains candidates.txt, which is the output vocabulary
for the clozes; every answer is one of these candidates.

We also provide human-collected annotations over subsets of the dev
split for the two datasets, to allow analysis of the performance of
different models. These are provided as JSON-formatted dictionaries
mapping each annotation to the list of question "uid"s from the dev set
for which that annotation is true. (A sketch of using them appears at
the end of this file.)

For 'quasar-s/':

1. relation_annotations.json: Annotations of the relation type between
   the head entity of the cloze question and the answer entity.

For 'quasar-t/':

1. answer_annotations.json: Annotations of the type of the answer, such
   as "location" or "date/time".
2. genre_annotations.json: Annotations of the genre of the question,
   such as "arts" or "math/science".

Report bugs and missing information at bdhingra@andrew.cmu.edu.
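As an example of using the annotation files, here is a minimal sketch
that selects the Quasar-T dev questions annotated with one of the genre
labels mentioned above (the "math/science" key is taken from those
examples; read_data is the helper defined earlier):

    import json

    with open('quasar-t/genre_annotations.json') as f:
        genres = json.load(f)  # annotation -> list of dev-set uids

    science = set(genres.get('math/science', []))
    dev = read_data('quasar-t/questions/dev_questions.json.gz')
    subset = [q for q in dev if q['uid'] in science]
    print('%d dev questions annotated as math/science' % len(subset))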