Project Brainstorming for 10-802 in Spring 2010

From Cohen Courses
For the sake of sanity, perhaps it is best to have the basic framework of each idea described on this page. Each person can create a new entry with the ==Name== syntax.

For actual discussion of a topic that you are interested in, take it to the 'Discussion' page (second tab at the top of this screen).

Skyler's Potential Ideas

I'll just come out and say that I'm probably one of the few people in the course that is not interested in NLP problems of social media. I hope that doesn't make me an outcast :p

My current research area involves developing new methods to scan through graph-based datasets looking for 'anomalous patterns' in connected components of the data. 'Anomalous patterns' is left intentionally vague, but in the most basic sense you could think of it as elevated levels of activity. That doesn't have to be the case though. I'm currently looking for new datasets to try out the algorithm (and tune it some more), and I think data from social media could be an excellent source. Please let me know if you have any ideas for data. (For example: is there a particular set of blogs, where edges represent links between them, that has higher than expected readership for a given hour/day/week?)
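To make the "elevated levels of activity" reading concrete, here is a minimal sketch in Python. It is not the algorithm mentioned above; it swaps in a standard expectation-based Poisson score and only scans each node's immediate neighborhood rather than arbitrary connected subsets. The graph, the observed counts, and the expected counts are all hypothetical toy data (think of nodes as blogs and counts as readership in one hour).

```python
import math

def poisson_score(count, baseline):
    """Expectation-based Poisson log-likelihood ratio.

    Returns 0 when observed activity is not above the expected baseline.
    """
    if baseline <= 0 or count <= baseline:
        return 0.0
    return count * math.log(count / baseline) + baseline - count

def scan_neighborhoods(graph, observed, expected):
    """Score each node's closed neighborhood (the node plus its direct neighbors).

    Returns (score, center_node, members) tuples, highest score first.
    """
    results = []
    for node, neighbors in graph.items():
        subset = {node} | set(neighbors)
        c = sum(observed[n] for n in subset)
        b = sum(expected[n] for n in subset)
        results.append((poisson_score(c, b), node, sorted(subset)))
    results.sort(key=lambda r: r[0], reverse=True)
    return results

# Hypothetical toy data: nodes are blogs, edges are links between them.
graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
observed = {"a": 30, "b": 28, "c": 25, "d": 10, "e": 9}   # readership this hour
expected = {"a": 10, "b": 10, "c": 10, "d": 10, "e": 10}  # assumed historical baseline

ranked = scan_neighborhoods(graph, observed, expected)
best_score, best_center, best_subset = ranked[0]
print(best_center, best_subset, round(best_score, 2))
```

In this toy example the cluster of blogs a, b, and c stands out because all three are well above baseline at once. A real scan would need a smarter search over connected subsets and some form of significance testing, both of which are left out here.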

I'm also interested in building on Luis von Ahn's idea of human computation. However, instead of accessing the power of individuals (or rather, pairs of individuals), can we ask/answer questions that would benefit from 100+ users interacting together? Human Computation + Wisdom of Crowds = Social Computation?

Brendan's stuff

Sky, I don't think the class is predominantly NLP-focused?

Here are two datasets I have that anyone is welcome to use:

I have a fairly large tweet scrape lying around (500 million messages, 5 billion words, user profiles, but *no* social graph info) if anyone wants to use it.

I also have data from a now-defunct face photo rating site that I like to call "multivariate Hot-or-Not." See my webpage for a paper on it; I'm sure people can think of many more things to do with it. It includes tags, friends, images, and judgments on attractiveness, age, gender, race, politics, and other interesting things.

Peter's Stuff

Hey all, just stealing some notes I'd made on my user page. They may be of interest to some of you.

Some sites/communities on which one might want to do a project

  • - Community generated music rankings; this is music detection as a social phenomenon.
  • Discogs - A community-generated archive of albums and artist relationships. User ratings for albums, comments, etc.
  • Wikipedia - Need we mention Wikipedia? It's critical, and there's still plenty to be done with it. But perhaps it's also boring. (Although I do like the idea of trying to figure out hot topics and political issues based on edit wars. Perhaps someone has already done so?)
  • Facebook - Another obvious choice, provided you can legally get the data.
  • wow.allakhazam - A large repository of data from World of Warcraft. Also an excellent source of data to analyze, but you might want to pair it with some other information, such as The Armory.
  • Encyclopedia Dramatica - Highly NSFW. Filled with descriptions of both memes and internet drama from standpoints ranging from the nonpartisan to the extremely biased. If you are curious about such events, this is one of the few sources of information, but computationally reducing it to a non-partisan collection of data will be quite a task.
  • Metafilter - An old, well-established online community. I don't know that it's ever been the hippest of sites, but perhaps that's part of its long-term success. As much as looking at a hot, new website or data source is a cool idea, an established online community may give us more information about our use of social media.

Some questions/projects one might want to think about examining

  • How do different types of forum software (such as phpBB as compared to MediaWiki or image board software such as Wakaba) engender different kinds of online interaction? How is the community a product of the technology?
  • A thorough comparison of a community's history of an event in contrast with the event's history as logged in social media. (Or how different sites have preserved different histories of the same event; say, a comparison of the history of the 2008 campaign as logged on Wikipedia and Conservapedia.)
  • What does it mean to troll? Can we automatically identify trolls?
  • How does the circulation of BitTorrent files compare with the popularity of media in the Billboard charts? How is pirate pop culture similar to, or divergent from, mainstream pop culture?
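
For the last question, one simple starting point might be a rank correlation between the two popularity lists. The sketch below is only an illustration: the titles and rankings are made up, the function name is hypothetical, and it assumes each list has no tied ranks. It computes Spearman's rho over the titles that appear in both lists.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation over titles present in both rankings.

    rank_a and rank_b map title -> rank (1 = most popular).
    Assumes no tied ranks within either list.
    """
    shared = [title for title in rank_a if title in rank_b]
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared titles")
    # Re-rank within the shared titles so both orderings run 0..n-1.
    order_a = {t: i for i, t in enumerate(sorted(shared, key=rank_a.get))}
    order_b = {t: i for i, t in enumerate(sorted(shared, key=rank_b.get))}
    d_squared = sum((order_a[t] - order_b[t]) ** 2 for t in shared)
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Hypothetical rankings: position by torrent downloads vs. position on a chart.
torrent_rank = {"song_w": 1, "song_x": 2, "song_y": 3, "song_z": 4}
chart_rank = {"song_w": 2, "song_x": 1, "song_y": 3, "song_z": 4, "song_q": 5}

rho = spearman_rho(torrent_rank, chart_rank)
print(round(rho, 2))
```

A value near 1 would suggest the pirate and mainstream rankings mostly agree, while a value near 0 or below would suggest they diverge. Titles that appear in only one list are ignored here, which is itself a design choice worth questioning, since those may be the most interesting cases.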