[bnc] British National Corpus
http://www.natcorp.ox.ac.uk/
http://www.natcorp.ox.ac.uk/ 100M-word collection for the English language. “The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English, both spoken and written.”
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html Google word frequency dataset for n-gram models. Available as a download or on DVDs from U of Penn. ~24GB.
http://search.cpan.org/~splice/Lingua-EN-Segmenter-0.1/lib/Lingua/EN/Splitter.pm Perl module that splits text into words, segments, etc. Quality: unknown (I haven't tested it yet).
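
For reference, a minimal sketch of what the resources above are used for: splitting text into words and counting n-grams. This is plain Python, not the Perl module or the Google dataset format. The function names `tokenize` and `ngram_counts` and the sample sentence are made up for illustration, and the regex tokenizer is a crude assumption (lowercase ASCII letters and apostrophes only).

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return runs of letters/apostrophes as words."""
    return re.findall(r"[a-z']+", text.lower())

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (as tuples)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Hypothetical sample text; sentence boundaries are ignored,
# so n-grams may span across a full stop.
tokens = tokenize("The cat sat on the mat. The cat slept.")
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)

print(unigrams.most_common(3))
print(bigrams.most_common(3))
```

On a real corpus the counts would probably need to be streamed from disk rather than held in one in-memory list, and the tokenizer would need proper handling of punctuation, numbers and non-ASCII characters.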