The AskMetaFilter Corpus

Introduction

The AskMeFi Corpus is a complete set of every question page on ask.metafilter.com and every user page for individual members of MetaFilter. This Wiki page is a brief description of the nature of the corpus, how it was obtained, and what to do if you'd like access to it.

Of What The Corpus Consists

Most simply, the corpus consists of 104,936 user profile pages (example) and 143,672 question pages from AskMetaFilter (example). This body of HTML reduces to 46,372 MetaFilter users and 141,660 questions with 1,962,040 responses.

The Obtaining and Processing Of The Corpus

Parsing & Scraping

All of these pages were scraped with cURL, driven by shell scripts that were themselves assembled using a combination of Python and awk. The scrapes were fairly simple to architect because pages on MetaFilter are numbered sequentially, starting at 1.
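
As an illustration of how simple the sequential numbering makes the scrape, here is a minimal sketch in Python. The URL pattern, output file naming, and one-second delay are assumptions for illustration, not the original scripts.

  import subprocess
  import time

  # Hypothetical sketch: fetch sequentially numbered AskMetaFilter pages with cURL.
  # The URL pattern, file naming, and delay are assumptions, not the original scripts.
  BASE_URL = "http://ask.metafilter.com/%d/"
  MAX_QUESTION_ID = 143672  # roughly the highest index at scrape time (see the counts above)

  for question_id in range(1, MAX_QUESTION_ID + 1):
      outfile = "question_%06d.html" % question_id
      subprocess.call(["curl", "--silent", "--output", outfile,
                       BASE_URL % question_id])
      time.sleep(1)  # pause between requests so as not to hammer the server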

User Pages

In the case of the user pages, the maximum extant user ID at the time of the scrape was found through trial and error (less difficult than it might sound, given a basic knowledge of some current user numbers and the fact that every user page shows the user's join date), and the contents of every page from 1 to that maximum value were then downloaded. A small subset of users was missed because of server errors, but in those cases MetaFilter itself returned a server error page. Some other, less certain number of users may also have been missed, but even in those cases cURL should have downloaded an empty page.
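
One way to implement that kind of trial-and-error search is a simple binary search over candidate user IDs. The sketch below is illustrative only: both the user page URL and the existence test (checking the HTTP status code with cURL) are assumptions rather than the method actually used.

  import subprocess

  # Hypothetical sketch: binary-search for the highest existing user ID by asking
  # cURL for the HTTP status code of each candidate user page.
  def user_page_exists(user_id):
      status = subprocess.check_output(
          ["curl", "--silent", "--output", "/dev/null",
           "--write-out", "%{http_code}",
           "http://www.metafilter.com/user/%d" % user_id])
      return status.strip() == b"200"

  low, high = 1, 200000  # bounds guessed from basic knowledge of current user numbers
  while low < high:
      mid = (low + high + 1) // 2
      if user_page_exists(mid):
          low = mid
      else:
          high = mid - 1
  print("Estimated maximum user ID: %d" % low)

The binary search assumes that every ID below the maximum returns a page, which matches the sequential numbering described above.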

Once the pages were obtained, they were parsed with a custom parser written using the BeautifulSoup library for Python; the issues below were dealt with as part of the parsing step, not the scraping step.

As noted above, there is a mismatch of roughly 58,500 between the number of user pages downloaded (104,936) and the number of users retained (46,372). Why? An empty page is generated when someone starts but does not finish the account creation process: once the initial step of generating a record has been taken, the counter is never decremented. The user page section thus contains three kinds of pages:

  1. Real user pages
  2. Empty placeholder pages for incomplete accounts
  3. Error pages

Because the placeholder pages have a quite regular form (example), filtering them out of the data was straightforward. User pages also have a very regular, three-column structure. For the purposes of Analysis of Social Media, there was no need to obtain any of the less regular information that users record, so the parser was optimized to retrieve a few specific pieces of data (a sketch of this parsing step appears after the list below):

  • User IDs
  • Join Dates (which themselves have three types: "Day One", "Sometime in 1999", and a specific date)
  • Whether or not the account has been disabled
  • Count of posts to MetaFilter
  • Count of comments to MetaFilter
  • Count of posts to MetaTalk
  • Count of comments to MetaTalk
  • Count of questions to AskMetaFilter
  • Count of answers to AskMetaFilter
  • Count of posts to MetaFilter Music
  • Count of comments to MetaFilter Music
  • Count of playlists posted to MetaFilter Music
  • Count of posts to MetaFilter Projects
  • Count of comments to MetaFilter Projects
  • Count of votes to MetaFilter Projects
  • Count of posts to MetaFilter Jobs
  • Count of different contributions by other users that this user has favorited
  • Count of times contributions by this user have been favorited by other users
  • Count of social network links that users have instantiated to this user
  • Count of social network links that this user has instantiated to other users
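
A minimal sketch of that parsing step follows, shown with the modern bs4 packaging of BeautifulSoup. The markup selectors and the two extracted fields are assumptions for illustration; the real parser extracted all of the fields listed above.

  import csv
  from bs4 import BeautifulSoup  # modern packaging of the BeautifulSoup library

  # Hypothetical sketch: pull a couple of fields out of one saved user page and
  # append them to a CSV row. The selectors are assumptions about the markup,
  # not the ones the real parser used.
  def parse_user_page(path):
      with open(path) as f:
          soup = BeautifulSoup(f.read(), "html.parser")
      title_tag = soup.find("title")
      username = title_tag.get_text(strip=True) if title_tag else ""
      # Placeholder and error pages have a very regular form and would be
      # filtered out at this point; that check is omitted for brevity.
      return [path, username]

  with open("users.csv", "a", newline="") as out:
      csv.writer(out).writerow(parse_user_page("user_000001.html"))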

All of this information was written to a CSV file for ready analysis, and that CSV was itself eventually uploaded to MySQL so that it could be queried in tandem with the far more elaborate question data described below.
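
As a sketch of that upload step (the driver choice, table name, and column names below are all assumptions, not the real schema), the CSV can be pushed into MySQL row by row:

  import csv
  import MySQLdb  # hypothetical driver choice; any MySQL client library would do

  # Hypothetical sketch: load the user CSV into a MySQL table so it can be
  # queried alongside the question data. Table and column names are assumptions.
  conn = MySQLdb.connect(host="localhost", user="askmefi", passwd="secret", db="askmefi")
  cur = conn.cursor()
  with open("users.csv") as f:
      for row in csv.reader(f):
          cur.execute("INSERT INTO users (user_id, join_date, askme_questions) "
                      "VALUES (%s, %s, %s)", row[:3])
  conn.commit()
  conn.close()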

AskMeFi Question Pages

The question pages were scraped on March 3, 2010, and were intentionally limited to questions asked before midnight on March 2, 2010 (going back to the first question, asked on December 8, 2003). Of course, because of the time required to actually scrape the questions, the corpus does include some comments made on March 3, 2010. A question page's index number is visible simply by visiting that page, so the process of assembling the set of question pages was once again quite straightforward.

As with the user pages, the question pages were parsed with a custom parser written using the BeautifulSoup library for Python.

Question pages contain the entire contents of a single thread on one question. The page has a generally regular structure, but when parsing the text it is important to note that responses marked as "Best" or made by the question asker live in a different type of "div" than conventional comments; it is quite easy to overlook this and discover that one is missing a large amount of content.
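
As an illustrative sketch of handling both kinds of div (the class names "comments" and "best" are assumptions about the markup, not verified selectors), a parser that gathers every response might look like this:

  from bs4 import BeautifulSoup

  # Hypothetical sketch: collect ordinary comments as well as the separately
  # classed "best"/asker responses. The class names are assumptions about the
  # page markup, not the real selectors.
  def extract_responses(html):
      soup = BeautifulSoup(html, "html.parser")
      texts = []
      for div in soup.find_all("div"):
          classes = div.get("class") or []
          if "comments" in classes or "best" in classes:
              texts.append(div.get_text(" ", strip=True))
      return texts

  # Matching only one of the two classes would silently drop the "best" and
  # asker responses, which is exactly the pitfall described above.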

The key portions of the individual question page that were specifically parsed out for future examination were:

  • The question title
  • The question subtitle - a one-line synopsis of the question. The subtitle, not the title, is the only portion of the question that appears in the site's main lists of questions.
  • The question body. If a question does not have a body, the parser intentionally dumps the subtitle into the body text and leaves the subtitle blank (see the sketch after this list).
  • The asker's user ID
  • The date and time the question was asked
  • The question's subject. All questions on the site must be classified as one of twenty subjects:
    • Computers & Internet
    • Food & Drink
    • Home & Garden
    • Sports, Hobbies, & Recreation
    • Travel & Transportation
    • Clothing, Beauty, & Fashion
    • Media & Arts
    • Law & Government
    • Education
    • Human Relations
    • Writing & Language
    • Health & Fitness
    • Work & Money
    • Science & Nature
    • Society & Culture
    • Grab Bag
    • Pets & Animals
    • Shopping
    • Technology
    • Religion & Philosophy
  • The number of times that the question has been favorited
  • The text of all responses to the question. Questions can receive responses for up to one year from the date on which they are asked, at which point the thread is closed.
  • The dates (not times) when responses were made
  • The user ID associated with each response
  • The number of times a particular response has been marked as favorite
  • Whether or not a response was marked as best
  • The different tags assigned to the question by different users. These can be quite free-form, and are often redundant or contain typos. Tags were not introduced to AskMetaFilter until February 16, 2005, and were then retroactively added to older questions.
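
The subtitle-to-body fallback mentioned in the list above is simple to express; this sketch uses hypothetical variable names rather than the parser's own:

  # Hypothetical sketch of the fallback described above: when a question has no
  # body text, the subtitle is moved into the body and the subtitle is blanked.
  def normalize_question(subtitle, body):
      if not body.strip():
          return "", subtitle  # (new subtitle, new body)
      return subtitle, body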

MySQL

Once the basic parser for all the question data had been completed, all of the data was loaded into a MySQL database. This process revealed a key fact about the data: mismatches in scrape times had resulted in some failed synchronization between the data on the user pages and the data on the AskMetaFilter pages. Because both data sets are technically accurate, this discrepancy has not been rectified within the MySQL database. However, anyone using the database should bear it in mind; it is irrelevant unless one is combining the two data sets. The most important ramifications are as follows (a sketch of a query for detecting the per-user mismatch appears after the list):

  • 2,324 different users have a mismatch of 0 to 12 questions between the data sets. (The Anonymous user account is the one outlier, with a mismatch of 53.)
  • Two users who either asked or answered just one question exist in the user page data but not in the AskMetaFilter data.
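
As a sketch of how the per-user mismatch can be detected once both data sets are in MySQL (the table and column names are assumptions, not the real schema), a query along these lines compares each user's profile question count against the number of questions attributed to them in the parsed AskMetaFilter data:

  import MySQLdb  # hypothetical driver choice, as in the loading sketch above

  # Hypothetical sketch: compare each user's question count from the profile data
  # against the number of parsed questions they asked. Schema names are assumptions.
  conn = MySQLdb.connect(host="localhost", user="askmefi", passwd="secret", db="askmefi")
  cur = conn.cursor()
  cur.execute("""
      SELECT u.user_id, u.askme_questions, COUNT(q.question_id) AS parsed_questions
      FROM users u
      LEFT JOIN questions q ON q.asker_id = u.user_id
      GROUP BY u.user_id, u.askme_questions
      HAVING u.askme_questions <> COUNT(q.question_id)
  """)
  for user_id, profile_count, parsed_count in cur.fetchall():
      print("user %d: profile says %d, parsed data has %d"
            % (user_id, profile_count, parsed_count))
  conn.close()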


What to do if you'd like access to the AskMeFi Corpus

Contact Peter Landwehr. He scraped it, and will likely be glad to let you use it for research.