Difference between revisions of "Forum-Based Language Learning Analysis"

From Cohen Courses

Latest revision as of 09:46, 15 February 2011

1 Team Members
2 Introduction
- 2.1 Motivation
3 Dataset
4 Proposed Work
- 4.1 Network Structure
5 Possible Methods
6 Evaluation
- 6.1 Coarse-grain evaluation
- 6.2 Fine-grain evaluation
7 References

Team Members

Adam Skory

Gabriel Parent

Introduction

Second-language learning requires a lot of time and effort. Fortunately, some tools can be used to facilitate the learning task. Online forums are a type of social medium used by learners, for example, to ask for help with a certain grammatical rule or a certain idiom.

Online forums have been used to create topic-topic, user-user, and user-topic graphs. These graphs have been used for such tasks as recommendation systems, investigating knowledge propagation, and identifying influence. In this work we plan to use data from a forum dedicated to studying the Spanish language to facilitate language learning by identifying salient topics.

Motivation

The primary goal of this work will be the extraction of topics in the forum. Our motivation is to find not just what learners of Spanish find difficult in the realms of vocabulary, grammar, and culture, but also how those difficulties relate to each other and change over time. In particular, we would like to investigate the stages of language learning in terms of topics of concern, with the intention of showing whether or not there is a general pattern amongst learners. If these patterns can be found, evidence of certain linguistic difficulties could be used to predict further difficulties, and students could be offered help possibly even before they are aware that help is needed. Along these lines, it could also be possible to suggest to a learner other forum users with related strengths and weaknesses as study peers.

Dataset

For this dataset we will be performing a crawl of http://forums.tomisimo.org/

Some statistics about the forum:

Threads: 9,046
Posts: 100,535
Members: 4,863
Active Members: 742

The primary areas of the forum are:

Vocabulary
Translations
Grammar
Practice & Homework
Teaching & Learning
Culture
Teaching and Learning Techniques
Introductions
General Chat

The forum is run on the vBulletin system and anonymous postings are not allowed.

Proposed Work

Network Structure

We will construct a network with nodes of types: Thread, Post, User, and Topic. The first three node types are explicit in the forum structure. The Topic nodes are not explicit, and must be extracted from the thread titles, post texts, and network structure. The following table shows potential link types between these nodes.

	Thread	Post	User	Topic
Thread	Hyperlink	Part-of	Creator, Participant	Primary, Secondary
Post		Direct Reply, Indirect Reply	Author	Primary, Secondary
User			Quotation, Hyperlink	Interest
Topic				Related

It will be possible to further attach the following attributes to these nodes:

Thread
- Date
- Posted in section
- Number of views

Post
- Date

User
- Date joined
- Native language
- Age
- Location
- Interests
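
The network described above can be sketched as a small typed graph. This is only an illustration, not the project's implementation: the node and link type names come from the table, while the class name ForumGraph, the node ids, the attribute keys, and the direction of each edge are assumptions made for the example.

```python
from dataclasses import dataclass, field

# Node types follow the proposal; everything else here is hypothetical.
NODE_TYPES = {"Thread", "Post", "User", "Topic"}


@dataclass
class ForumGraph:
    nodes: dict = field(default_factory=dict)  # node id -> (type, attributes)
    edges: list = field(default_factory=list)  # (source id, target id, link type)

    def add_node(self, node_id, node_type, **attrs):
        if node_type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {node_type}")
        self.nodes[node_id] = (node_type, attrs)

    def add_edge(self, source, target, link_type):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added before linking them")
        self.edges.append((source, target, link_type))

    def neighbors(self, node_id, link_type=None):
        """Targets reachable from node_id, optionally filtered by link type."""
        return [t for s, t, k in self.edges
                if s == node_id and (link_type is None or k == link_type)]


def build_example():
    """Build a toy graph: one thread with two posts and one user."""
    g = ForumGraph()
    g.add_node("u1", "User", native_language="English")
    g.add_node("t1", "Thread", section="Grammar", views=120)
    g.add_node("p1", "Post", date="2011-02-01")
    g.add_node("p2", "Post", date="2011-02-02")
    g.add_edge("t1", "p1", "Part-of")       # Thread-Post link
    g.add_edge("t1", "p2", "Part-of")
    g.add_edge("p2", "p1", "Direct Reply")  # Post-Post link
    g.add_edge("p1", "u1", "Author")        # Post-User link
    g.add_edge("t1", "u1", "Creator")       # Thread-User link
    return g
```

Topic nodes would be added the same way once they have been extracted; they are left out here because the extraction method is the open question of the proposal.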

Possible Methods

We will try a combination of edge-removal techniques, including Max-Flow Min-Cut and Yang et al.'s (2007) method for finding implicit communities in graphs. Additionally, we will try a HITS-inspired hub-authority approach, as well as additional unsupervised clustering techniques such as Yang & Meng's (2006) Markov clustering approach.
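
As one illustration of the hub-authority idea, the following is a minimal sketch of the standard HITS iteration on a directed edge list. It assumes, purely for the example, that an edge (a, b) means "user a replied to user b"; the function name, the user names, and the fixed iteration count are hypothetical, and the proposal does not say which graph the scores would be computed on.

```python
import math


def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for a list of directed edges."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the nodes pointing at it.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        # Hub score: sum of authority scores of the nodes it points at.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += auth[dst]
        hub = new_hub
        # Rescale both score vectors to unit length so they stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth


# Hypothetical reply graph: ana and ben both replied to carlos.
example_edges = [("ana", "carlos"), ("ben", "carlos"), ("ana", "dina")]
hub_scores, auth_scores = hits(example_edges)
```

In this toy graph carlos receives the highest authority score and ana the highest hub score, since ana replies to both of the users who receive replies.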

Evaluation

We will perform a coarse-grain and a fine-grain evaluation of our topic model. For both approaches, we will randomly partition the posts (nodes) into two sets: training and testing. The former will be used to train our topic model, while the latter will be used for evaluation.
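
The random partition could look like the sketch below. The 20% test fraction and the fixed seed are assumptions chosen for the example; the proposal does not specify a split ratio.

```python
import random


def split_posts(post_ids, test_fraction=0.2, seed=0):
    """Shuffle post ids reproducibly and return (training, testing) lists."""
    shuffled = list(post_ids)
    random.Random(seed).shuffle(shuffled)  # local RNG keeps the split repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train_ids, test_ids = split_posts(range(100))
```

Using a seeded local generator rather than the global one means the same posts land in the test set on every run, which keeps reported scores comparable across experiments.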

Coarse-grain evaluation

Since the forum is already structured in 9 broad categories (see above), these categories can be used for testing. The training data will be used to train our topic model, which will in turn be used to classify each testing node into one of the 9 categories. Accuracy and Kappa values will be reported for this task.
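
The two reported measures can be computed as in the following sketch. It assumes that "Kappa" refers to Cohen's kappa between the forum's own category and the predicted category; the function names and the toy labels are hypothetical.

```python
from collections import Counter


def accuracy(gold, predicted):
    """Fraction of items whose predicted category matches the gold category."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def cohen_kappa(gold, predicted):
    """Agreement corrected for the agreement expected by chance."""
    n = len(gold)
    p_observed = accuracy(gold, predicted)
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    # Chance agreement: probability both label sources pick the same category.
    p_expected = sum(gold_counts[c] * pred_counts[c] for c in gold_counts) / (n * n)
    if p_expected == 1.0:
        # Kappa is undefined when both sources use a single identical label;
        # returning 1.0 here is a convention of this sketch.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


gold = ["Grammar", "Grammar", "Vocabulary", "Vocabulary"]
pred = ["Grammar", "Vocabulary", "Vocabulary", "Vocabulary"]
```

For these toy labels the accuracy is 0.75 and the chance agreement is 0.5, giving a kappa of 0.5.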

Fine-grain evaluation

However, a more interesting question is how a topic model can be used to divide general categories, such as grammar, into more concrete topics such as noun gender or verb conjugation. To evaluate the validity of our model's divisions along these lines, we will generate a gold standard for topic categorization. To do this, we will use traditional bag-of-words techniques (such as LDA) to extract potential topics. We will then manually annotate a short list of these topics and calculate the agreement between our graph-based model and the gold standard.
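
The proposal does not name an agreement measure for this step. One possible choice, shown here only as an assumption, is the Rand index, which compares two groupings pair by pair and does not require the model's cluster ids to match the gold-standard topic names.

```python
from itertools import combinations


def rand_index(model_labels, gold_labels):
    """Fraction of item pairs on which the two groupings agree.

    A pair counts as an agreement when both groupings put the two items
    together, or both keep them apart. Needs at least two items.
    """
    pairs = list(combinations(range(len(gold_labels)), 2))
    agreements = sum(
        (model_labels[i] == model_labels[j]) == (gold_labels[i] == gold_labels[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Hypothetical gold-standard topics for four posts and two model groupings.
gold_topics = ["gender", "gender", "conjugation", "conjugation"]
perfect = rand_index([0, 0, 1, 1], gold_topics)
partial = rand_index([0, 0, 0, 1], gold_topics)
```

Here the first grouping reproduces the gold standard exactly and scores 1.0, while the second agrees on 3 of the 6 pairs and scores 0.5.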

References

Hao, J., Orlin, J.B. (1994) A Faster Algorithm for Finding the Minimum Cut in a Directed Graph, Journal of Algorithms

Yang, N., Lin, S., Gao, Q. (2007) An Exhaustive and Edge-Removal Algorithm to Find Cores in Implicit Communities, In Proceedings of the Joint 9th Asia-Pacific Web and 8th International Conference on Web-Age Information Management Conference on Advances in Data and Web Management

Yang, N., Meng, X. (2006) Identify Implicit Communities by Graph Clustering, Web Information Mining and Retrieval

Forman, G. (2003) An Extensive Empirical Study of Feature Selection Metrics for Text Classification, The Journal of Machine Learning Research

Swain, M., Brooks, L., Tocalli-Beller, A. (2002) Peer-Peer Dialogue as a Means of Second Language Learning, Annual Review of Applied Linguistics, 22, pp. 171-185

Wei, F.H., Lee, L.Y., Chen, G.D. (2004) Supporting Adaptive Mentor by Student Preference Within Context of Problem-Solving, IEEE ICALT
