Msamadi project abstract

From Cohen Courses

What I plan to do

In the Analysis of Social Media class (and as part of my PhD research), I developed a program that can extract instructions for any how-to query from the Web. The program relies on HTML features and a classifier to extract the instructions. However, many other websites contain instructions written in different formats, and the structure of these websites is not known to our program. In this project, I propose to use the redundancy of the data available on the Web to learn the structure of new websites.


Recently, many websites have been developed that provide solutions and tips for many tasks and projects (e.g., or Each of these how-to manuals provides step-by-step instructions that describe how to do the given task. Currently, contains more than 1.5 million articles produced by both experts and amateur users. According to web statistics, each month 70 million people visit By extracting new instructions from "unknown" websites, we may be able to add new instructions to these websites.

Interesting point

There is a lot of redundancy in the content of the instructions that our program can extract at this stage. This redundancy might be useful for extracting new instructions from the Web.
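One way to make the notion of redundancy concrete is to measure word overlap between steps that are already extracted and candidate text from a new page. The sketch below is only an illustration under stated assumptions: the Jaccard measure, the function names, and the 0.5 threshold are all hypothetical choices, not part of the existing program.

```python
import re


def tokens(step):
    """Return the set of lowercase word tokens in one instruction step."""
    return set(re.findall(r"[a-z0-9]+", step.lower()))


def jaccard(a, b):
    """Jaccard similarity between the token sets of two steps."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def redundant_steps(known_steps, candidate_steps, threshold=0.5):
    """Return the candidates that closely match at least one known step.

    The threshold is an assumed value and would need tuning on real data.
    """
    return [
        c for c in candidate_steps
        if any(jaccard(c, k) >= threshold for k in known_steps)
    ]


if __name__ == "__main__":
    known = ["Preheat the oven to 350 degrees", "Mix the flour and sugar"]
    candidates = ["Preheat oven to 350 degrees", "Share this page"]
    print(redundant_steps(known, candidates))
```

Candidate text that matches known steps marks the regions of a new page that are likely to hold instructions; text with no match is either boilerplate or a genuinely new step.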


The performance of the system can be measured by comparing the instructions extracted by our program to the content of or The comparison can be done by me or by autonomous users.

Techniques that can be used to solve this problem

  • Using wrappers to learn the structure of websites.
  • Using information extraction techniques to extract important keywords from a document.

What question to answer

Is there enough redundancy in the extracted instructions that we can use it to extract new instructions? How many new instructions can be extracted by our program?

Conferences to publish paper

International World Wide Web conference

The Twenty-Fourth AAAI Conference on Artificial Intelligence

Related Work

Bootstrapping Information Extraction from Semi-structured Web Pages

Web Data Extraction Based on Partial Tree Alignment

Team Member

Mehdi Samadi