Difference between revisions of "Blog Authorship Corpus"

Latest revision as of 02:21, 2 November 2011

The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus contains a total of 681,288 posts and over 140 million words, or approximately 35 posts and 7,250 words per blogger.

Each blog is presented as a separate file whose name indicates the blogger's ID number and the blogger's self-provided gender, age, industry, and astrological sign. (All blogs are labeled for gender and age, but for many the industry and/or sign is marked as unknown.)
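As a rough illustration of how the metadata might be recovered from a file name, here is a minimal Python sketch. It assumes a dot-separated layout of the form `<id>.<gender>.<age>.<industry>.<sign>.xml` and assumes that `indUnk` or `unknown` marks a missing value; neither the exact layout nor those markers is stated in the description above, so check them against the downloaded files. The names `BloggerInfo` and `parse_blog_filename` are hypothetical helpers, not part of the corpus distribution.

```python
from typing import NamedTuple, Optional, Tuple


class BloggerInfo(NamedTuple):
    """Metadata encoded in one corpus file name (hypothetical container)."""
    blogger_id: str
    gender: str
    age: int
    industry: Optional[str]  # None when marked as unknown
    sign: Optional[str]      # None when marked as unknown


def parse_blog_filename(
    filename: str,
    unknown_markers: Tuple[str, ...] = ("indUnk", "unknown"),  # assumed markers
) -> BloggerInfo:
    """Split an assumed '<id>.<gender>.<age>.<industry>.<sign>.xml' name."""
    parts = filename.split(".")
    if len(parts) != 6:
        raise ValueError(f"unexpected file name layout: {filename!r}")
    blogger_id, gender, age, industry, sign, _extension = parts
    return BloggerInfo(
        blogger_id=blogger_id,
        gender=gender,
        age=int(age),
        industry=None if industry in unknown_markers else industry,
        sign=None if sign in unknown_markers else sign,
    )


# Example with a made-up file name following the assumed layout.
info = parse_blog_filename("12345.female.27.indUnk.Leo.xml")
print(info)
```

Mapping the unknown markers to `None` keeps the missing industry and sign values easy to filter out when grouping bloggers by attribute.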

Text taken from: http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm. The corpus can be downloaded by visiting this site.

Related Paper: [[Modeling_of_Stylistic_Variation_in_Social_Media_with_Stretchy_Patterns]]
