Mehdi project abstract

From Cohen Courses

What I plan to do

In the Analysis of Social Media class (and also as part of my PhD research) I developed a program that extracts instructions for any how-to query from the Web. The key ideas behind the program were using HTML features and a classifier to extract instructions from the Web. However, many other websites contain instructions written in different formats, and the structure of these websites is not known to our program. In this project I propose to use the redundancy of the data available on the Web to learn the structure of new websites.

Motivation

Recently, many websites have been developed that provide solutions and tips for many tasks and projects (e.g., eHow.com or WikiHow.com). Each of these how-to manuals provides step-by-step instructions that describe how to do the given task. Currently, eHow.com contains more than 1.5 million articles produced by both experts and amateur users. According to web statistics, 70 million people visit eHow.com each month. By extracting new instructions from "unknown" websites, we may be able to add new instructions to these websites.

Interesting point

There is a lot of redundancy in the content of the instructions that our program can extract at this stage. This redundancy might be useful for extracting new instructions from the Web.

Evaluation

The performance of the system can be measured by comparing the instructions extracted by our program to the content of eHow.com or WikiHow.com. The comparison can be done by myself or by independent users.

Techniques that can be used to solve this problem

  • Using wrappers (wrapper induction) to learn the structure of new websites.
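As a rough illustration of the wrapper idea, the sketch below assumes that all instructions on a site share a common HTML tag path: known instructions (e.g., redundant ones already extracted elsewhere) are used to induce that path on one page, and the learned path is then applied to extract new instructions from other pages of the same site. The function names and the tag-path representation are my own simplifications, not part of the existing program.

```python
from html.parser import HTMLParser

class PathCollector(HTMLParser):
    """Record the tag path (e.g., 'html/body/ol/li') of every text node."""
    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.texts = []   # list of (path, text) pairs
    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
    def handle_endtag(self, tag):
        # pop to the matching tag; tolerates some unclosed tags
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass
    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append(("/".join(self.stack), text))

def induce_wrapper(html, known_instructions):
    """Learn the tag path shared by already-known instruction texts."""
    parser = PathCollector()
    parser.feed(html)
    paths = {path for path, text in parser.texts if text in known_instructions}
    # succeed only if the known instructions agree on a single path
    return paths.pop() if len(paths) == 1 else None

def apply_wrapper(html, path):
    """Extract all text nodes at the learned tag path from a new page."""
    parser = PathCollector()
    parser.feed(html)
    return [text for p, text in parser.texts if p == path]
```

For example, if redundant instructions ("Boil water", "Add pasta") are found inside `<li>` elements on one page of a site, the induced path `html/body/ol/li` can then pull previously unseen steps from that site's other pages. A real system would need a richer wrapper (attributes, indices, regularities across pages), but this captures the core idea of using redundancy to learn site structure.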

What question to answer

Is there enough redundancy in the extracted instructions that we can use it to extract new instructions? How many new instructions can be extracted by our program?

Team Member

Mehdi Samadi