Project Ideas - Derry, Reyyan
Social Media Analysis Project Ideas
Team Members
Derry Wijaya [dwijaya@cs.cmu.edu]
Reyyan Yeniterzi [reyyan@cs.cmu.edu]
Project Ideas
We have several possible ideas for the project:
• We propose to map events to opinions. An event, social or political in nature, can bring about a change in opinion, or vice versa.
• We propose to analyze opinions from the perspective of associative sorting and social contagion, for example to answer questions such as: when does an opinion get pushed aside? That is, we want to study the centrality and periphery of opinions in the opinion graph.
• We propose to construct a social graph in which the nodes are words rather than people. Using this social graph of words, we propose to analyze: (1) how co-occurrence with other words (associativity) can influence the meaning of a word (for example, 'BP' frequently co-occurred with negative words during and after the Gulf spill), (2) how new words (like 'Google') or new parts of speech (like 'googling') emerge in the graph, and (3) how the meaning and usage of words like "LOL" change over time, from "laughing out loud" to "whatever". A minimal sketch of such a word co-occurrence graph follows this list.
• We propose to automatically build a social graph of opinions from tweets, where nodes are people, links are follower/following relations, and node colors are attributes (positive or negative sentiment towards an entity of interest, such as 'toyota' or 'ford').
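As a minimal sketch of the word co-occurrence graph idea, assuming tweets have already been tokenized into lists of lowercased words (the tweets variable below is a hypothetical input), the networkx library keeps the bookkeeping short:

# Minimal sketch: build a word co-occurrence graph from tokenized tweets.
# `tweets` is a hypothetical input: a list of lists of lowercased tokens.
from itertools import combinations
import networkx as nx

def build_cooccurrence_graph(tweets):
    G = nx.Graph()
    for tokens in tweets:
        # Count each unordered word pair that co-occurs in the same tweet.
        for w1, w2 in combinations(set(tokens), 2):
            if G.has_edge(w1, w2):
                G[w1][w2]["weight"] += 1
            else:
                G.add_edge(w1, w2, weight=1)
    return G

# Example: words strongly associated with 'bp' during the Gulf spill
# would show up as its highest-weight neighbors.
tweets = [["bp", "oil", "spill", "terrible"], ["bp", "gulf", "spill"]]
G = build_cooccurrence_graph(tweets)
neighbors = sorted(G["bp"].items(), key=lambda kv: -kv[1]["weight"])
print(neighbors[:5])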
Dataset
For each of the ideas above, we propose to use the following datasets (listed in the same order as the ideas):
•
• We then define a variety of features over pairs of such chains, including all-word TF-IDF similarity, proper-noun TF-IDF similarity, proper-noun Soft TF-IDF similarity, Soft TF-IDF similarity between the names (representative named mentions) of each chain, the semantic similarity between the descriptions (representative common nouns or noun phrases) of each chain, etc. A minimal sketch of these pairwise features and the classifier appears after this list.
• Using these features, we train an SVM (libSVM) that classifies pairs of chains as co-referent or not.
• We take the outputs of this classifier and cluster all the chains that we have gathered from all the documents in the corpus.
• We store a persistent database of entities using this clustering, whereby each cluster represents a real-world entity. In other words, an entity is a list of chains in our database.
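The following is a minimal sketch of the pairwise-feature and classification steps, under the assumption that each chain can be represented by the text of its mentions; the names chain_pairs and labels are hypothetical inputs, and scikit-learn's SVC (a wrapper around libSVM) stands in for a direct libSVM call. Only the all-word TF-IDF similarity feature is shown; the remaining features would be appended to the same feature vector.

# Minimal sketch of the pairwise chain features and the co-reference classifier.
# `chain_pairs` and `labels` are hypothetical inputs: each chain is a list of
# mention strings, and labels[i] is 1 if the two chains in chain_pairs[i]
# refer to the same real-world entity, else 0.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.svm import SVC

def pair_features(chain_a, chain_b, vectorizer):
    # All-word TF-IDF cosine similarity; the other similarity features
    # (proper-noun TF-IDF, Soft TF-IDF, etc.) would be appended here.
    vecs = vectorizer.transform([" ".join(chain_a), " ".join(chain_b)])
    return [cosine_similarity(vecs[0], vecs[1])[0, 0]]

def train_coreference_classifier(chain_pairs, labels):
    all_text = [" ".join(chain) for pair in chain_pairs for chain in pair]
    vectorizer = TfidfVectorizer().fit(all_text)
    X = [pair_features(a, b, vectorizer) for a, b in chain_pairs]
    clf = SVC(kernel="rbf", probability=True).fit(X, labels)
    return vectorizer, clf

The subsequent chain clustering step could then, for example, apply agglomerative clustering to the pairwise co-reference scores produced by this classifier.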
Motivation
• We wish to augment our CDC system to store more information for entities than just a list of chains. It would be helpful to retain a summary of useful attribute information for each entity, such as gender, nationality, occupation, email address, phone number, etc.
• We also believe that by extracting such attributes at the chain level and using them as additional features in our SVM, we may be able to improve the performance of our CDC system.
• Our current cross-document visualization tool can only model relationships using co-occurrence statistics; we would like a more descriptive way of representing relationships.
• On a broader level, we wish to examine the upper limit of recall and precision associated with these problems, i.e. find answers to the questions:
o For how many entities does a given attribute exist in the data?
o For all such attributes, how accurately can we extract them?
Dataset
• To train and test our attribute and relation extraction modules, we plan to use one of the various ACE datasets (probably ACE 2004 or ACE 2005).
• For our CDC system, we are using the John Smith corpus, the WePS corpora, and a set of 400,000 news articles from summer 2010, produced and labeled by a commercial organization.
Techniques
• For attribute extraction, we plan to implement standard algorithms that take seed examples of entities and attributes and learn extraction patterns, as introduced by Ravichandran and Hovy, 2002, “Learning surface text patterns for a question answering system”. [1] A minimal sketch of this pattern-learning step follows this list.
• For relationship extraction, we plan to implement one of the papers referenced by Sunita Sarawagi in her survey on Information Extraction. [2]
• We may use different methods if we come across better ones while surveying the related literature over the course of the semester.
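The following is a minimal sketch of the seed-driven surface-pattern idea in the spirit of Ravichandran and Hovy (2002): given seed (entity, attribute value) pairs, find sentences containing both, keep the intervening text as a candidate pattern, and later instantiate the most frequent patterns for new entities. The sentences and seeds inputs below are hypothetical, and a real system would score patterns by precision rather than raw frequency.

# Minimal sketch of seed-driven surface pattern learning.
# `sentences` and `seeds` are hypothetical inputs.
import re
from collections import Counter

def learn_patterns(sentences, seeds):
    """seeds: list of (entity, attribute value) pairs, e.g. ('Mozart', '1756')."""
    patterns = Counter()
    for entity, value in seeds:
        for sent in sentences:
            if entity in sent and value in sent:
                # Keep the text between the two anchors as a candidate pattern.
                m = re.search(re.escape(entity) + "(.*?)" + re.escape(value), sent)
                if m:
                    patterns["ENTITY" + m.group(1) + "VALUE"] += 1
    # Most frequent patterns first; a real system would rank by pattern precision.
    return patterns.most_common()

def apply_pattern(pattern, entity, sentence):
    """Instantiate a learned pattern for a new entity and extract a value."""
    regex = re.escape(pattern).replace("ENTITY", re.escape(entity)).replace("VALUE", "(.+)")
    m = re.search(regex, sentence)
    # Greedy capture; a real system would constrain the extracted value span.
    return m.group(1) if m else None

patterns = learn_patterns(["Mozart was born in 1756."], [("Mozart", "1756")])
print(apply_pattern(patterns[0][0], "Gauss", "Gauss was born in 1777."))  # -> '1777.'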
Superpowers
We have none. But in terms of our individual backgrounds, Bo and Rushin have been working with Bob Frederking and Anatole Gershman on entity extraction and co-reference resolution [3], and Kevin has been working on question answering and computer-assisted language learning.