Freebase Preprocessing for WebQuestions-KBIR

We've received several inquiries about our preprocessing and have compiled the following short guide accordingly. This is more a record than a how-to -- I haven't pulled out any installation-specific variables or written any kind of config script to make this reusable. Email krivard, at cs.cmu.edu, with any questions.

Which version of Freebase to use

The Freebase dump we used was the one published by Microsoft here:

https://www.microsoft.com/en-us/download/details.aspx?id=54511

That archive is intended to accompany their FastRDFStore project (https://github.com/Microsoft/FastRDFStore), which is referenced somewhere deep in the WebQuestionsSP documentation.

It is timestamped as 2015-08-09 and documented as being "the last dump of freebase," but that is almost certainly an error: the archive does not match the "final dump" published by Freebase proper (via https://developers.google.com/freebase/), whose last-modified date is also 2015-08-09. If you use the Freebase-published version on any WebQuestionsSP task, you wind up with a pile of missing answers. The MS archive (wherever they got it from) is the correct one to use for the WebQuestionsSP dataset and its descendants, including KBIR.

Extracting the 2-hop subgraphs

Preprocessing

Freebase is really big. If you use the entire graph and process it as a text file, everything is extremely slow (a database would speed things up, but that's outside our scope here). We cut the file down by removing predicates and domains we don't care about, including musicbrainz and most of the ontological relations, and by replacing the fully-qualified node ids with a shortened version.

The script we used for this step is smallerFreebase.webQSP.sh. The freebase file we used is fb_en.txt from the MS archive.
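If you need to reproduce this step without the script, the overall shape is roughly the following. This is a sketch only: the real predicate and domain lists live in smallerFreebase.webQSP.sh, unwanted_patterns.txt is a hypothetical stand-in for them, and it assumes fb_en.txt holds tab-separated triples with fully-qualified <http://rdf.freebase.com/ns/...> identifiers.

$ # Sketch; the actual filter lists are in smallerFreebase.webQSP.sh.
$ # unwanted_patterns.txt (hypothetical) would hold one pattern per line,
$ # e.g. musicbrainz and the ontological predicates to drop.
$ grep -v -f unwanted_patterns.txt fb_en.txt \
    | sed 's,http://rdf\.freebase\.com/ns/,fb:,g' \
    > freebase.smaller.txt

The sed step is what produces the short <fb:m.xxxxx> ids you'll see in the query file below.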

Graph walks

This step uses a really gnarly Makefile and a reasonably recent copy of GNU Parallel to generate a 2-hop graph walk for each query.

It takes a while. I believe we got it down to a few days (maybe a week, with restarts and debugging?) on a 64-core machine with 0.5 TB of memory. On a smaller machine you'll need to fiddle with the block sizes, split sizes, number of simultaneous jobs, etc. You'll wind up spending a lot more time merging intermediate result files, but it should still work. Definitely prototype your settings on a small query file first, with at least as many queries as outer jobs; you want to catch any resource-saturation issues before you have thousands of partially-computed graphs that just have to be thrown out.
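For the prototype run, something in this spirit works (the walk script name here is hypothetical -- in our setup the Makefile drives the real jobs -- but the GNU Parallel mechanics are the point):

$ # 32 queries >= 8 outer jobs, so saturation problems surface early.
$ head -32 freebase_oracle_questions.txt > queries.small.txt
$ parallel --jobs 8 --colsep '\t' ./two_hop_walk.sh {1} {2} :::: queries.small.txt

Here {1} is the question id and {2} the seed-entity field; scale --jobs (and any split/block sizes) up only once a run like this completes cleanly.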

Files used for this step are the Makefile, the shrunken Freebase file from the preprocessing step above, and a query file.

The query file format is tab-delimited:

questionid <seed_entity>=weight[,<seed_entity>=weight,...]
(though the weights are ignored, so you could just cut everything off after each entity id; see the one-liner after the sample)

Sample:

$ head /remote/bones/project/webquestions_kbir/entity_linking/data/freebase_oracle_questions.txt
WebQTrn-0       <fb:m.06w2sn5>=0.0
WebQTrn-1       <fb:m.09l3p>=0.0
WebQTrn-3       <fb:m.03st9j>=0.0
WebQTrn-4       <fb:m.0160w>=0.0
WebQTrn-5       <fb:m.02fgm7>=0.0
WebQTrn-6       <fb:m.0c2yrf>=0.0
WebQTrn-7       <fb:m.084l5>=0.0
WebQTrn-8       <fb:m.07484>=0.0
WebQTrn-9       <fb:m.0c9c0>=0.0
WebQTrn-11      <fb:m.01sn3>=0.0
[...]
WebQTrn-66      <fb:m.03mgx6z>=0.0,<fb:m.03_wpf>=0.0
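Since the weights are ignored downstream, you can strip them before running if you like; a one-liner in this spirit (the output filename is just an example):

$ # Drop the '=0.0' weights; lines with multiple comma-separated seeds are fine.
$ sed 's/=[0-9.]*//g' freebase_oracle_questions.txt > queries.noweights.txt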

15 April 2019, krivard