Difference between revisions of "Msamadi project abstract"

Latest revision as of 13:12, 9 October 2010

What I plan to do

In the Analysis of Social Media class (and also as part of my PhD research), I developed a program that extracts instructions for any how-to query from the Web. The key ideas behind the program were to use HTML features together with a classifier to identify instructions. However, many other websites contain instructions written in different formats, and the structure of these websites is not known to our program. In this project, I propose to use the redundancy of the data available on the Web to learn the structure of new websites.

Motivation

Recently, many websites have been developed that provide solutions and tips for many tasks and projects (e.g., eHow.com or WikiHow.com). Each of these how-to manuals provides step-by-step instructions that describe how to do a given task. Currently, eHow.com contains more than 1.5 million articles produced by both experts and amateur users. According to web statistics, 70 million people visit eHow.com each month. By extracting new instructions from "unknown" websites, we may be able to add new instructions to these websites.

Interesting point

There is a lot of redundancy in the content of the instructions that our program can extract at this stage. This redundancy might be useful for extracting new instructions from the Web.

Evaluation

The performance of the system can be measured by comparing the instructions extracted by our program with the content of eHow.com or WikiHow.com. The comparison can be done by me or by independent users.
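
As a rough illustration of how such a comparison might be automated before any manual judging, the sketch below scores extracted steps against reference steps by word overlap. The function names, the Jaccard measure, and the 0.5 threshold are all assumptions made for this example; the abstract does not fix a metric.

```python
import re


def tokens(text):
    """Return the set of lowercase word tokens in a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Word-level Jaccard similarity between two strings."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def step_recall(extracted, reference, threshold=0.5):
    """Fraction of reference steps matched by at least one extracted step.

    The threshold is an illustrative assumption, not a tuned value.
    """
    if not reference:
        return 0.0
    matched = sum(
        1
        for ref in reference
        if any(jaccard(ref, ext) >= threshold for ext in extracted)
    )
    return matched / len(reference)
```

Here `reference` would hold the steps of an eHow.com or WikiHow.com article and `extracted` the steps our program returned for the same query. A low score would flag cases worth inspecting by hand.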

Techniques that can be used to solve this problem

Using wrappers to learn the structure of websites.
Using information extraction techniques to extract important keywords from a document.
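
A minimal sketch of how redundancy could drive wrapper learning, assuming that some steps already extracted from known sites reappear verbatim on a page from an unknown site. The tag path that most often contains those known steps is taken as the wrapper and then applied to other pages of the same site. The class and function names are hypothetical, and exact string matching stands in for whatever similarity measure the real system would use.

```python
from collections import Counter
from html.parser import HTMLParser


class PathTextParser(HTMLParser):
    """Collects (tag path, text) pairs for every non-empty text node."""

    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.stack = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        # Void elements never get a closing tag, so keep them off the stack.
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag (tolerates sloppy HTML).
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.items.append(("/".join(self.stack), text))


def path_texts(html):
    """Return all (tag path, text) pairs found in an HTML string."""
    parser = PathTextParser()
    parser.feed(html)
    parser.close()
    return parser.items


def learn_wrapper(html, known_steps):
    """Pick the tag path that most often holds an already-known step."""
    known = {step.lower() for step in known_steps}
    votes = Counter(
        path for path, text in path_texts(html) if text.lower() in known
    )
    return votes.most_common(1)[0][0] if votes else None


def apply_wrapper(html, path):
    """Extract every text node that sits at the learned tag path."""
    return [text for p, text in path_texts(html) if p == path]
```

For example, if `learn_wrapper` returns `html/body/ol/li` for one page of a new site, `apply_wrapper` can then pull the list items from other pages of that site, which is where new instructions would come from. A realistic wrapper would also need to use attributes such as `class` or `id`, which this sketch ignores.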

What question to answer

Is there enough redundancy in the extracted instructions that we can use it to extract new instructions? How many new instructions can our program extract?

Conferences to publish paper

International World Wide Web conference

The Twenty-Fourth AAAI Conference on Artificial Intelligence

Related Work

Bootstrapping Information Extraction from Semi-structured Web Pages

Web Data Extraction Based on Partial Tree Alignment

Team Member

Mehdi Samadi
