Suranah project status report

The Mechanisms

I have refined and partially validated the mechanisms I will be using for this project. The first mechanism verifies the exact relations extracted from ReadTheWeb. While this does not garner us any new data, it provides user judgments on the accuracy of the set of extracted facts. This is especially important for ReadTheWeb, as it helps select the relations, and hence the patterns, to be promoted, and it also flags high-probability facts that are actually wrong.
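
As a rough illustration of what this first mechanism produces, the sketch below aggregates per-fact player judgments and flags confidently extracted facts that players reject. All class and method names (and the thresholds) are hypothetical; the report does not describe the project's actual data structures.

 import java.util.List;
 
 public class FactVerifier {
 
   /** One extracted fact plus ReadTheWeb's confidence in it. */
   public static class Fact {
     final String entity, relation, value;
     final double extractorConfidence;
     Fact(String e, String r, String v, double c) {
       entity = e; relation = r; value = v; extractorConfidence = c;
     }
   }
 
   /** Fraction of players who judged the fact to be correct. */
   static double agreementRate(List<Boolean> judgments) {
     int yes = 0;
     for (boolean b : judgments) if (b) yes++;
     return (double) yes / judgments.size();
   }
 
   /** A high-probability fact that most players reject is exactly
    *  the kind of error this mechanism is meant to surface.
    *  The 0.9 and 0.5 cutoffs are arbitrary placeholders. */
   static boolean isSuspect(Fact f, List<Boolean> judgments) {
     return f.extractorConfidence > 0.9 && agreementRate(judgments) < 0.5;
   }
 }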

The second mechanism collects similarity judgments for a given set of entities with respect to a class. This is more useful for estimating which classes ReadTheWeb should extend to. In a broader context, putting human input in the loop without substantial cost can help prevent semantic drift in a wide range of bootstrapping approaches.

(The mechanisms are difficult to explain fully without a set of animations or a demonstration. But among other things, both mechanisms show N different entities at a time and extract information through asymmetric or symmetric verification between two players; the symmetric case is sketched below.)
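
As a concrete reading of the symmetric case, the sketch below assumes an output-agreement style of play: both players see the same N entities for a class, independently select the ones that belong, and only the selections they agree on are counted as evidence. This interpretation is mine; the report does not spell out the scoring.

 import java.util.Arrays;
 import java.util.HashSet;
 import java.util.Set;
 
 public class SymmetricRound {
 
   /** Entities that both players independently selected. */
   static Set<String> agreedSelections(Set<String> playerA, Set<String> playerB) {
     Set<String> agreed = new HashSet<String>(playerA);
     agreed.retainAll(playerB);   // set intersection
     return agreed;
   }
 
   public static void main(String[] args) {
     // Class shown to both players: 'sport'.
     Set<String> a = new HashSet<String>(Arrays.asList("rodeo", "chess", "Enrico Fermi"));
     Set<String> b = new HashSet<String>(Arrays.asList("rodeo", "chess"));
     // Prints [rodeo, chess] (order may vary): only agreed picks count.
     System.out.println(agreedSelections(a, b));
   }
 }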

Partial Data Analysis

I partially analyzed a subset of 11,577 high-probability relations from ReadTheWeb. Around 94% of them were 'is instance of' (class membership) relations. Though it cannot be completely accurate, I used WordNet to estimate the percentage of abstract relations, like 'rodeo is a sport', as opposed to concrete relations, like 'Enrico Fermi is a scientist'. It turns out that around 79% of the relations were abstract for the subset I worked on. As most of the relations may be too esoteric for most people, I am also working on extracting the relations that are most likely to be verified, using a measure based on Wikipedia and Google n-gram statistics. I will also attempt to maximize the information gain per data point in terms of the patterns related to the entity shown.
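
One rough way to make the abstract-vs-concrete estimate is to walk a noun's hypernym chain in WordNet and check whether it reaches the 'abstraction' synset before 'physical entity'. The sketch below does this with the MIT JWI library, looking at the first sense only; both the library choice and the heuristic details are my assumptions, as the report does not specify how WordNet was queried.

 import edu.mit.jwi.Dictionary;
 import edu.mit.jwi.IDictionary;
 import edu.mit.jwi.item.*;
 
 import java.net.URL;
 import java.util.ArrayDeque;
 import java.util.Deque;
 import java.util.HashSet;
 import java.util.Set;
 
 public class AbstractnessCheck {
 
   /** True if the noun's hypernym chain reaches 'abstraction'
    *  before 'physical_entity'. Rough: first sense only, and
    *  matching by lemma rather than synset id. */
   static boolean isAbstract(IDictionary dict, String noun) {
     IIndexWord idx = dict.getIndexWord(noun, POS.NOUN);
     if (idx == null) return false;  // not in WordNet: punt
     ISynset start = dict.getWord(idx.getWordIDs().get(0)).getSynset();
     Deque<ISynset> queue = new ArrayDeque<ISynset>();
     Set<ISynsetID> seen = new HashSet<ISynsetID>();
     queue.add(start);
     while (!queue.isEmpty()) {
       ISynset s = queue.poll();
       for (IWord w : s.getWords()) {
         if (w.getLemma().equals("abstraction")) return true;
         if (w.getLemma().equals("physical_entity")) return false;
       }
       for (ISynsetID hid : s.getRelatedSynsets(Pointer.HYPERNYM))
         if (seen.add(hid)) queue.add(dict.getSynset(hid));
     }
     return false;
   }
 
   public static void main(String[] args) throws Exception {
     // Path to the WordNet 'dict' directory is installation-specific.
     IDictionary dict = new Dictionary(new URL("file", null, "/usr/local/WordNet-3.0/dict"));
     dict.open();
     System.out.println(isAbstract(dict, "sport"));      // expected: true
     System.out.println(isAbstract(dict, "scientist"));  // expected: false
   }
 }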

It may be noted that while the second mechanism is used to collect similarity judgments, it also verifies the 'is instance of' relation in the process. As game time would be divided roughly equally between the two mechanisms, it is interesting that the skewness of the data can be used to our advantage.

Implementation

Besides the relation analysis and processing code, I have also written some code for the user interface, interaction management, scheduling, and other game-related components. Most of the data processing code is written in Java, with appropriate APIs for handling other sources such as WordNet. The game backend will be written entirely in Java, with various frameworks for communication. I am implementing the frontend in Adobe Flex.
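
One scheduling concern a two-player game backend has to handle is pairing arriving players into rounds. A minimal sketch of such a matcher is below; it is purely illustrative, since the report does not describe the actual scheduler.

 import java.util.ArrayDeque;
 import java.util.Deque;
 
 /** Pairs each arriving player with the next waiting one so a
  *  two-player verification round can begin. Illustrative only. */
 public class PlayerMatcher {
   private final Deque<String> waiting = new ArrayDeque<String>();
 
   /** Returns the matched partner's id, or null if the caller
    *  must wait for the next player to arrive. */
   public synchronized String join(String playerId) {
     String partner = waiting.poll();
     if (partner == null) {
       waiting.add(playerId);
       return null;      // caller is queued until someone joins
     }
     return partner;     // start a round with (playerId, partner)
   }
 }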

I plan to complete the entire backend by mid-November and begin evaluating the mechanisms. Though it is not certain yet, I may also be able to complete the frontend by late November and evaluate the mechanisms with the completed frontend.