Difference between revisions of "Google Books Ngram Data"

Latest revision as of 21:21, 14 February 2011

The Google Books Ngram dataset is a corpus of about 5 million digitized books that Google drew from over 40 university libraries around the world. Each page was scanned and the text digitized using OCR. The resulting corpus contains over 500 billion words in English (361 billion), French (45 billion), Spanish (45 billion), German (37 billion), Chinese (13 billion), Russian (35 billion) and Hebrew (2 billion). The oldest works were published in the 1500s. Because of copyright constraints, the data is released in the form of n-grams rather than as full text. An n-gram is a sequence of n 1-grams (roughly, single words), such as the phrases "stock market" (a 2-gram) and "the United States of America" (a 5-gram). A more detailed description of the corpus can be found in Michel et al. (2010) Quantitative Analysis of Culture Using Millions of Digitized Books: External Link
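To make the n-gram definition concrete, here is a minimal sketch in Python that lists the n-grams of a short phrase. The helper name `ngrams` is hypothetical, and splitting on whitespace is an assumption made for illustration; it is not a description of how the corpus itself was tokenized.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens, each as a tuple."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence of k tokens has k - n + 1 n-grams (none if k < n).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Assumption: plain whitespace splitting stands in for real tokenization.
tokens = "the United States of America".split()

print(ngrams(tokens, 2))  # the four 2-grams of the phrase
print(ngrams(tokens, 5))  # the single 5-gram, i.e. the whole phrase
```

A phrase of k tokens yields k - n + 1 n-grams, so the five-word phrase above contains four 2-grams and exactly one 5-gram.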

Corpus location: External Link

Revision as of 21:20, 14 February 2011, by Dwijaya (older edit)
Latest revision as of 21:21, 14 February 2011, by Dwijaya

Line 1:

− Google Books Ngram [[Category::Dataset|dataset]] is a corpus of about 5 million digitized books drawn from over 40 university libraries around the world by Google. Each page was scanned and the text digitized using OCR. The resulting corpus contains over 500 billion words in English (361 billion), French (45 billion), Spanish (45 billion), German (37 billion), Chinese (13 billion), Russian (35 billion) and Hebrew (2 billion). The oldest works were published in 1500s. The data is released in the form of n-grams (in light of copyright constraints). An n-gram is sequence of 1-grams, such as the phrases "stock market" (a 2-gram) and "the United States of America" (a 5-gram). More detailed description on the corpus can be found in [[RelatedPaper::Michel et. al. (2010) Quantitative Analysis of Culture Using Millions of Digitized Books]]: [http://www.sciencemag.org/content/early/2010/12/15/science.1199644 External Link]

+ Google Books Ngram [[Category::Dataset|dataset]] is a corpus of about 5 million digitized books drawn from over 40 university libraries around the world by Google. Each page was scanned and the text digitized using OCR. The resulting corpus contains over 500 billion words in English (361 billion), French (45 billion), Spanish (45 billion), German (37 billion), Chinese (13 billion), Russian (35 billion) and Hebrew (2 billion). The oldest works were published in 1500s. The data is released in the form of n-grams (in light of copyright constraints). An n-gram is sequence of 1-grams, such as the phrases "stock market" (a 2-gram) and "the United States of America" (a 5-gram). More detailed description on the corpus can be found in [[RelatedPaper::Michel et.al. (2010) Quantitative Analysis of Culture Using Millions of Digitized Books]]: [http://www.sciencemag.org/content/early/2010/12/15/science.1199644 External Link]
