Archive for the 'nlp' Category

[bnc] British National Corpus

Wednesday, August 3rd, 2011

100M-word collection for the English language

“The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British Engli…

Official Google Research Blog: All Our N-gram are Belong to You

Tuesday, September 22nd, 2009

Google word frequency dataset for n-gram models. Download or DVDs are available from U of Penn. ~24GB.
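For reference, a dataset like this boils down to counts of contiguous word sequences. Here is a minimal Python sketch of n-gram counting on a toy sentence; the function name `ngram_counts` and the sample text are my own illustration, not anything taken from the Google release or its file format.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous run of n tokens in a token list."""
    # Zipping n shifted views of the list yields each n-gram as a tuple.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


# Toy input; a real corpus would be tokenized far more carefully.
tokens = "to be or not to be".split()
bigrams = ngram_counts(tokens, 2)

for gram, count in bigrams.most_common():
    print(" ".join(gram), count)
```

The counts are keyed by tuples, so the same function works for unigrams, trigrams and so on by changing `n`.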

Lingua::EN::Splitter – Split text into words, paragraphs, segments, and tiles – search.cpan.org

Monday, July 7th, 2008

Perl module that splits text into words, segments, etc. Quality: unknown (i.e. I haven’t tested it yet)
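To give an idea of the kind of splitting involved, here is a rough Python sketch that breaks text into paragraphs and words. It is not the Lingua::EN::Splitter API and makes no claim to match its behavior; the function names and the rules (blank lines separate paragraphs, words are runs of letters, digits, and apostrophes) are assumptions chosen for illustration.

```python
import re


def split_paragraphs(text):
    """Split on blank lines (assumed paragraph separator)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_words(text):
    """Return lowercased runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


sample = "Don't panic.\n\nSplit this text, please."

for paragraph in split_paragraphs(sample):
    print(split_words(paragraph))
```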