Blog Authorship Corpus

From Cohen Courses

Latest revision as of 01:21, 2 November 2011

The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words, or approximately 35 posts and 7,250 words per blogger.
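The per-blogger averages follow directly from the totals above. A minimal sketch of the arithmetic, assuming exactly 140 million words (the page only says "over 140 million", so the word average is a lower bound); the variable names are illustrative:

```python
# Totals quoted in the corpus description.
NUM_BLOGGERS = 19_320
NUM_POSTS = 681_288
NUM_WORDS_LOWER_BOUND = 140_000_000  # assumption: "over 140 million" taken as exactly 140M

# Average posts and words per blogger.
posts_per_blogger = NUM_POSTS / NUM_BLOGGERS
words_per_blogger = NUM_WORDS_LOWER_BOUND / NUM_BLOGGERS

print(f"posts per blogger: {posts_per_blogger:.1f}")
print(f"words per blogger (lower bound): {words_per_blogger:.0f}")
```

Both values round to the approximate figures quoted in the description.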

Each blog is stored as a separate file whose name encodes the blogger's ID number and self-reported gender, age, industry, and astrological sign. (Every blogger is labeled for gender and age, but for many the industry and/or sign is marked as unknown.)
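The file-naming scheme can be read back into structured metadata. The sketch below is only illustrative: it assumes the fields are dot-separated in the order listed above (ID, gender, age, industry, sign), optionally followed by an `.xml` extension, and that unknown values use a marker such as `indUnk` or `unknown`. The exact layout and marker spellings are assumptions that should be checked against the downloaded files; `BloggerInfo`, `UNKNOWN_MARKERS`, and `parse_blog_filename` are hypothetical names.

```python
from typing import NamedTuple, Optional


class BloggerInfo(NamedTuple):
    blogger_id: str
    gender: str
    age: int
    industry: Optional[str]  # None when marked unknown
    sign: Optional[str]      # None when marked unknown


# Assumed spellings of the "unknown" marker; adjust to match the real files.
UNKNOWN_MARKERS = {"indunk", "unknown", "unk"}


def parse_blog_filename(filename: str) -> BloggerInfo:
    """Split an assumed 'id.gender.age.industry.sign[.xml]' name into fields."""
    parts = filename.split(".")
    # Drop a trailing file extension if one is present.
    if parts[-1].lower() == "xml":
        parts = parts[:-1]
    if len(parts) != 5:
        raise ValueError(
            f"expected 5 metadata fields, got {len(parts)}: {filename!r}"
        )
    blogger_id, gender, age, industry, sign = parts

    def known(value: str) -> Optional[str]:
        # Map any assumed "unknown" marker to None.
        return None if value.lower() in UNKNOWN_MARKERS else value

    return BloggerInfo(
        blogger_id, gender.lower(), int(age), known(industry), known(sign)
    )


# Hypothetical file name, used only to exercise the parser.
info = parse_blog_filename("1000331.female.37.indUnk.Leo.xml")
print(info)
```

Returning `None` for unknown industry or sign makes it easy to filter on the two fields that are always present (gender and age) while treating the others as optional.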

Text taken from: http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm. The corpus can be downloaded by visiting this site.

Related Paper: Modeling of Stylistic Variation in Social Media with Stretchy Patterns