Forum-Based Language Learning Analysis

1 Team Members
2 Introduction
3 Dataset
4 Proposed Work

Team Members

Introduction

Second-language learning requires a lot of time and effort. Fortunately, some tools can facilitate the task. Online forums are a social medium that learners use, for example, to ask for help with a particular grammatical rule or idiom.

Online forums have been used to create topic-topic, user-user, and user-topic graphs. These graphs have been used for tasks such as building recommendation systems, investigating knowledge propagation, and identifying influence. In this work, we plan to use data from a forum dedicated to studying the Spanish language to facilitate language learning, either by identifying salient topics or by proposing a study peer.

Dataset

For this dataset, we will crawl http://forums.tomisimo.org/

Some statistics about the forum:

Threads: 9,046
Posts: 100,535
Members: 4,863
Active Members: 742

The primary areas of the forum are:

Vocabulary
Translations
Grammar
Practice & Homework
Teaching & Learning
Culture
Teaching and Learning Techniques
Introductions
General Chat

The forum is run on the vBulletin system and anonymous postings are not allowed.
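
As a rough illustration of one step of the crawl, the sketch below collects thread IDs from the links on a forum page. It assumes that thread links follow the classic vBulletin pattern showthread.php?t=<id>; this has not been verified for this particular forum, and the names ThreadLinkParser and extract_thread_ids are hypothetical. Fetching pages, rate limiting, and pagination are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse, parse_qs


class ThreadLinkParser(HTMLParser):
    """Collect thread IDs from anchors that look like vBulletin thread links."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.thread_ids = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL before inspecting them.
        url = urlparse(urljoin(self.base_url, href))
        # ASSUMPTION: thread pages look like showthread.php?t=<numeric id>.
        if url.path.endswith("showthread.php"):
            ids = parse_qs(url.query).get("t", [])
            if ids and ids[0].isdigit():
                self.thread_ids.add(int(ids[0]))


def extract_thread_ids(html, base_url="http://forums.tomisimo.org/"):
    """Return the sorted, de-duplicated thread IDs linked from one HTML page."""
    parser = ThreadLinkParser(base_url)
    parser.feed(html)
    return sorted(parser.thread_ids)
```

Because the parser works on an HTML string, it can be tried on saved pages before any live crawling is done.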

Proposed Work

Network Structure

We will construct a network with nodes of types: Thread, Post, User, and Topic. The first three node types are explicit in the forum structure. The Topic nodes are not explicit, and must be extracted from the thread titles, post texts, and network structure. The following table shows potential link types between these nodes.

         Thread      Post                          User                    Topic
Thread   Hyperlink   Part-of                       Creator, Participant    Primary, Secondary
Post     -           Direct Reply, Indirect Reply  Author                  Primary, Secondary
User     -           -                             Quotation, Hyperlink    Interest
Topic    -           -                             -                       Related

It will be possible to further attach the following attributes to these nodes:

Thread
- Date
- Posted in section
- Number of views

Post
- Date

User
- Date joined
- Native language
- Age
- Location
- Interests
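
The node types, link types, and attributes described above could be represented along the following lines. This is only a minimal sketch in plain Python; the names Node, ForumNetwork, and ALLOWED_LINKS are hypothetical, and the choice to accept a link in either orientation is an assumption based on the table listing each pair of node types only once.

```python
from dataclasses import dataclass, field

# Link types permitted between each pair of node types, copied from the table.
ALLOWED_LINKS = {
    ("Thread", "Thread"): {"Hyperlink"},
    ("Thread", "Post"): {"Part-of"},
    ("Thread", "User"): {"Creator", "Participant"},
    ("Thread", "Topic"): {"Primary", "Secondary"},
    ("Post", "Post"): {"Direct Reply", "Indirect Reply"},
    ("Post", "User"): {"Author"},
    ("Post", "Topic"): {"Primary", "Secondary"},
    ("User", "User"): {"Quotation", "Hyperlink"},
    ("User", "Topic"): {"Interest"},
    ("Topic", "Topic"): {"Related"},
}


@dataclass
class Node:
    node_id: str
    node_type: str  # "Thread", "Post", "User", or "Topic"
    attrs: dict = field(default_factory=dict)  # e.g. date, section, native language


class ForumNetwork:
    def __init__(self):
        self.nodes = {}  # node_id -> Node
        self.edges = []  # (source_id, target_id, link_type)

    def add_node(self, node_id, node_type, **attrs):
        self.nodes[node_id] = Node(node_id, node_type, attrs)

    def add_link(self, source_id, target_id, link_type):
        src, dst = self.nodes[source_id], self.nodes[target_id]
        key = (src.node_type, dst.node_type)
        # ASSUMPTION: each pair appears once in the table, so try both orientations.
        allowed = ALLOWED_LINKS.get(key) or ALLOWED_LINKS.get(key[::-1], set())
        if link_type not in allowed:
            raise ValueError(f"{link_type!r} is not a valid link for {key}")
        self.edges.append((source_id, target_id, link_type))
```

Validating link types at insertion time keeps the network consistent with the table while Topic nodes are still being added experimentally.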

Motivation

The primary goal of this work will be the extraction of these Topic nodes. Our motivation is to find not just what learners of Spanish find difficult in the realms of vocabulary, grammar, and culture, but also how those difficulties relate to each other and change over time. In particular, we would like to investigate the stages of language learning in terms of topics of concern, with the intention of showing whether or not there is a general pattern among learners. If such patterns can be found, evidence of certain linguistic difficulties could be used to predict further difficulties, and students could be offered help, possibly even before they are aware that help is needed.

Challenges

The biggest challenge we will face in topic extraction is the fact that many posts are written in a mixture of Spanish and English. For this task we will try tools such as AutoMap. AutoMap offers capabilities such as named-entity recognition, stemming, collocation detection, and flexible ontology usage, but a significant amount of work will likely be needed to extend these capabilities to a bilingual corpus. One issue in particular will be the development of a thesaurus for mapping topics to their actual references in the text. For example, a common issue when learning Spanish is the difference between the verbs "ser" and "estar". This topic may be referred to as "ser y estar", "ser and estar", "estar vs. ser", "ser+estar", etc. We will need to be able to automatically detect and combine all these variants.
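
To make the variant problem concrete, the following sketch maps the "ser"/"estar" variants listed above to a single canonical key. It is an illustration only and does not use AutoMap; the function name canonical_topic and the list of connectors (English and Spanish conjunctions plus a few symbols) are assumptions, and a real thesaurus would need to handle far more variation.

```python
import re

# ASSUMPTION: topic mentions join their terms with one of these connectors.
_CONNECTORS = re.compile(
    r"\s*(?:\+|/|&|\bvs\b\.?|\bversus\b|\band\b|\by\b|\bor\b|\bo\b)\s*",
    re.IGNORECASE,
)


def canonical_topic(mention):
    """Reduce a topic mention to an order-independent canonical key."""
    parts = [p.strip() for p in _CONNECTORS.split(mention.lower())]
    # De-duplicate and sort so that "estar vs. ser" and "ser y estar" agree.
    terms = sorted({p for p in parts if p})
    return " + ".join(terms)


variants = ["ser y estar", "ser and estar", "estar vs. ser", "ser+estar"]
keys = {canonical_topic(v) for v in variants}
print(keys)  # all four variants collapse to the single key 'estar + ser'
```

Sorting the terms makes the key independent of word order, which is what allows "estar vs. ser" to be combined with the other variants.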
