Msamadi project abstract
What I plan to do
In the Analysis of Social Media class (and also as part of my PhD research), I developed a program that extracts instructions for any how-to query from the Web. The key ideas behind the program were to use certain HTML features together with a classifier to identify instructions in Web pages. However, many other websites contain instructions written in different formats, and the structure of these websites is not known to our program. In this project, I propose to use the redundancy of the data available on the Web to learn the structure of new websites.
Motivation
Recently, many websites have been developed that provide solutions and tips for a wide range of tasks and projects (e.g., eHow.com or WikiHow.com). Each of their how-to manuals provides step-by-step instructions that describe how to do a given task. Currently, eHow.com contains more than 1.5 million articles produced by both experts and amateur users. According to web statistics, 70 million people visit eHow.com each month. By extracting instructions from "unknown" websites, we may be able to add new instructions to collections such as these.
Interesting point
There is a lot of redundancy in the content of the instructions that our program can extract at this stage. This redundancy might be useful for extracting new instructions from the Web.
Evaluation
The performance of the system can be measured by comparing the instructions extracted by our program to the content of eHow.com or WikiHow.com. The comparison can be done either by me or by autonomous users.
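As a complement to the manual comparison, an automatic overlap score could give a rough first estimate. The following is only a sketch under stated assumptions: the function names, the Jaccard token overlap, and the 0.5 threshold are illustrative choices, not part of the existing program.

```python
import re


def tokens(text):
    """Lowercased set of alphanumeric word tokens in a step."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_scores(extracted, reference, threshold=0.5):
    """Return (precision, recall) of extracted steps against reference steps.

    A step counts as matched when its token overlap with some step on the
    other side reaches `threshold` (an assumed, tunable value).
    """
    ext = [tokens(s) for s in extracted]
    ref = [tokens(s) for s in reference]
    matched_ext = sum(any(jaccard(e, r) >= threshold for r in ref) for e in ext)
    matched_ref = sum(any(jaccard(e, r) >= threshold for e in ext) for r in ref)
    precision = matched_ext / len(ext) if ext else 0.0
    recall = matched_ref / len(ref) if ref else 0.0
    return precision, recall


# Hypothetical example data, for illustration only.
extracted_steps = ["Fill a pot with water", "Click here to subscribe"]
reference_steps = ["Fill a pot with water.", "Bring the water to a boil"]
print(overlap_scores(extracted_steps, reference_steps))
```

Such a score would only approximate human judgment, since paraphrased steps with little word overlap would be counted as mismatches.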
Techniques that can be used to solve this problem
- Using wrappers to learn the structure of websites.
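One way the redundancy idea could drive wrapper learning is sketched below. It assumes that a page on an unknown site repeats some steps already extracted from a known site, finds the HTML tag path under which those steps appear, and then reuses that path as a wrapper on other pages of the same site. All names, the sample pages, and the 0.6 overlap threshold are hypothetical; this is a minimal illustration, not the method of the existing program.

```python
import re
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class PathTextParser(HTMLParser):
    """Collect (tag_path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:  # void tags never get a closing tag
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.nodes.append(("/".join(self.stack), text))


def text_nodes(html):
    parser = PathTextParser()
    parser.feed(html)
    parser.close()
    return parser.nodes


def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def learn_path(html, known_steps, min_overlap=0.6):
    """Return the tag path whose text nodes most often match known steps."""
    known = [tokens(s) for s in known_steps]
    votes = Counter()
    for path, text in text_nodes(html):
        t = tokens(text)
        if any(jaccard(t, k) >= min_overlap for k in known):
            votes[path] += 1
    return votes.most_common(1)[0][0] if votes else None


def extract(html, path):
    """Apply the learned wrapper: return all text found under `path`."""
    return [text for p, text in text_nodes(html) if p == path]


# Hypothetical pages from one "unknown" site, for illustration only.
TRAIN_PAGE = ("<html><body><div><h1>How to boil an egg</h1><ol>"
              "<li>Fill a pot with water</li><li>Bring the water to a boil</li>"
              "<li>Add the egg and wait ten minutes</li></ol>"
              "<p>Share this page</p></div></body></html>")
NEW_PAGE = ("<html><body><div><h1>How to make tea</h1><ol>"
            "<li>Boil water</li><li>Steep the tea bag</li></ol>"
            "<p>Share this page</p></div></body></html>")
# Steps assumed to be already extracted from a known site.
KNOWN_STEPS = ["Fill a pot with water.", "Bring water to a boil"]

wrapper_path = learn_path(TRAIN_PAGE, KNOWN_STEPS)
print(wrapper_path)
print(extract(NEW_PAGE, wrapper_path))
```

A bare tag path is a deliberately simple wrapper; a fuller version would likely also use attributes such as `class` or `id` to separate instruction lists from other lists on the page.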
What question to answer
Is there enough redundancy in the extracted instructions that we can use it to extract new instructions? How many new instructions can be extracted by our program?
Conferences to publish paper
- International World Wide Web Conference
- The Twenty-Fourth AAAI Conference on Artificial Intelligence
Related Work
- Bootstrapping Information Extraction from Semi-structured Web Pages
- Web Data Extraction Based on Partial Tree Alignment
Team Member
Mehdi Samadi