Project mehdi


This is one of the project reports written for the course Social Media Analysis 10-802 in Spring 2010.

  • Author: Mehdi Samadi
  • Presentation: [1]

Abstract

The ability to determine what actions are needed to perform a task is of interest in many application domains. Given a task (e.g., "how to cook pizza"), this paper presents a technique that uses the Web to find a set of steps describing how to perform the task. The goal is to build an automatic program that searches for websites that describe, step by step, how to do the given task. It then extracts all the steps from these websites and summarizes them for the user. For evaluation, we chose five random categories from WikiHow.com. For each category, three of the featured subjects were chosen randomly and given to our program for step extraction. Our experimental results show that our technique achieves an accuracy of 80%. We also show that our proposed method extracts twice as many steps as are written in eHow.com and WikiHow.com.

Brief Description

Planning techniques require that actions be defined precisely in a formal language, e.g. the Planning Domain Definition Language (PDDL). The traditional way to define a planning problem is to ask an expert to write an action model in PDDL. Automating this process has long been a goal of the planning community, but unfortunately it is still far from being a reality. In this paper we present a technique that can be used to automatically extract plans from the Web. This can serve as a first step toward the goal of automatic plan extraction.

For many domains there are websites that describe how to perform different tasks. For example, eHow.com and WikiHow.com are social websites that contain many plans in different categories, most of them entered by users. In this work we present a technique to automatically extract a step-by-step plan from the unstructured plan descriptions available on the Web.

Our proposed method has four different parts:

Query Suggester: Given a query from the user, this subsystem expands the query using different patterns. For example, consider the query "cook pizza" entered by the user. The query suggester rewrites it as "how to cook pizza step by step?", "hints to cook pizza", etc.
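
A minimal Python sketch of this kind of pattern-based query expansion is shown below. The pattern list is only illustrative; it is not the exact set of patterns used by the query suggester.

 # Pattern-based query expansion (illustrative patterns, not the system's exact list).
 QUERY_PATTERNS = [
     "how to {task} step by step",
     "hints to {task}",
     "steps to {task}",
     "{task} instructions",
 ]
 
 def suggest_queries(task):
     """Expand a user task (e.g. 'cook pizza') into several search queries."""
     return [pattern.format(task=task) for pattern in QUERY_PATTERNS]
 
 print(suggest_queries("cook pizza"))
 # ['how to cook pizza step by step', 'hints to cook pizza', ...]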

Crawler: The crawler subsystem submits each query produced by the query suggester to Google and extracts the top ten returned results. It downloads and stores each of these webpages for the next parts of the system. It also fixes the webpage encoding, and it ignores a webpage if it is not written in English or if it links to a file (e.g. a PDF file).
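
A rough Python sketch of the per-page filtering is given below, assuming the top result URLs have already been obtained (issuing the Google queries themselves is omitted). The requests and langdetect libraries are stand-ins chosen for illustration, not necessarily what the system uses.

 # Download-and-filter step for one result URL.  'requests' and 'langdetect'
 # are assumed stand-in libraries.
 import re
 import requests
 from langdetect import detect  # pip install langdetect
 
 def fetch_page(url):
     """Download one result page; return its HTML, or None if it is rejected."""
     resp = requests.get(url, timeout=10)
     if "text/html" not in resp.headers.get("Content-Type", ""):
         return None                              # e.g. a link to a PDF file
     resp.encoding = resp.apparent_encoding       # fix the webpage encoding
     plain = re.sub(r"<[^>]+>", " ", resp.text)   # crude tag stripping
     if detect(plain) != "en":                    # keep English pages only
         return None
     return resp.text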

Step Extractor: After crawling, the step extractor reads each of the downloaded webpages and extracts steps from it. For step extraction we use only text features. The step extractor also extracts the main content of the webpage and removes ads from it.
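
An illustrative Python sketch of locating step sentences in the cleaned page text is given below; the simple regular expression stands in for the text-feature-based extractor and is not the actual method.

 # Extract candidate steps from the main page text.  The regular expression is
 # a stand-in for the text-feature-based extractor.
 import re
 
 STEP_PATTERN = re.compile(r"(?:step\s*\d+\s*[.:]|^\s*\d+[.)])\s*(.+)",
                           re.IGNORECASE | re.MULTILINE)
 
 def extract_steps(page_text):
     """Return candidate steps such as 'Step 1: Preheat the oven.'"""
     return [m.group(1).strip() for m in STEP_PATTERN.finditer(page_text)]
 
 sample = "Step 1: Preheat the oven.\nStep 2: Roll out the dough."
 print(extract_steps(sample))  # ['Preheat the oven.', 'Roll out the dough.']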

Step Summarizer: This subsystem summarizes all of the steps extracted by the step extractor across the crawled pages.
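
A toy Python sketch of one way to merge near-duplicate steps collected from different pages is shown below; the word-overlap similarity and the threshold are assumptions, not the summarization method used in this work.

 # Merge near-duplicate steps by word overlap (Jaccard similarity).  The
 # similarity measure and threshold are assumptions.
 def jaccard(a, b):
     wa, wb = set(a.lower().split()), set(b.lower().split())
     return len(wa & wb) / len(wa | wb) if wa or wb else 0.0
 
 def summarize_steps(steps, threshold=0.5):
     """Keep one representative from each group of highly similar steps."""
     summary = []
     for step in steps:
         if all(jaccard(step, kept) < threshold for kept in summary):
             summary.append(step)
     return summary
 
 steps = ["Preheat the oven to 400F.",
          "Preheat your oven to 400F.",
          "Roll out the dough."]
 print(summarize_steps(steps))
 # ['Preheat the oven to 400F.', 'Roll out the dough.']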

Related Work

There has been very little work on extracting plans or action models from human-generated descriptions written in unstructured or semi-structured formats. To the best of our knowledge, the work most closely related to our research is the tool developed by Addis et al. [1]. They designed a method to parse WikiHow.com and convert its plans to a semi-structured format, relying on the semantic tags present in the HTML files.

There has been some other work [2,3] on developing tools for knowledge acquisition in planning and for learning plans from examples. However, its focus has been on extracting plans from structured data defined in a formal language, and it does not use automatic techniques to extract plans from the Web. In this category, the most closely related work is GIPO [2], a tool developed by R. Simpson that provides a graphical environment for users to design planning domains. GIPO has a database of written plans which can be used in the process of writing new planning domains. However, all of these plans are written by an expert.

[1] Andrea Addis, Giuliano Armano, and Daniel Borrajo. "Recovering plans from the web." In Proceedings of SPARK, Scheduling and Planning Applications woRKshop, ICAPS'09, Thessaloniki, Greece, September 2009.

[2] R. M. Simpson, D. E. Kitchin, and T. L. McCluskey. "Planning domain definition using GIPO." Knowledge Engineering Review, 22(2):117–134, 2007.

[3] Xuemei Wang. "Learning by observation and practice: An incremental approach for planning operator acquisition." In Proceedings of the 12th International Conference on Machine Learning, pages 549–557. Morgan Kaufmann, 1995.

Experimental Results

For evaluation, we chose five random categories from WikiHow.com. For each category, three of the featured subjects were chosen randomly and given to our program for step extraction. Our experimental results show that our technique achieves an accuracy of 80%. We also show that our proposed method extracts twice as many steps as are written in eHow.com and WikiHow.com.