Bookworm demonstrates a new way of interacting with the millions of recently digitized library books. The Harvard Cultural Observatory already collaborated with Google Books on the Google ngrams viewer that has data for years. Bookworm doesn't work so closely with Google Books: instead, it uses texts in the public domain, in this case books from the Open Library and Internet Archive. Together they have gathered millions of digital texts, along with the descriptions of them that librarians have written over the last two centuries. Bookworm uses that information to let you search for trends in any corpus you can create out of the library metadata, and to link to the underlying books so you can read them.
Library metadata makes all sorts of interesting queries possible. For example:
This project builds on the amazing work of the Open Library and Internet Archive projects. The Internet Archive makes scans of books publicly available with Optical Character Recognition already performed. The books come mostly from major research libraries and are scanned by the Internet Archive itself, Google, Microsoft and other scanning initiatives. The Open Library is the Internet Archive's cataloging wing; they hope to create a publicly editable library catalogue with an entry for every book ever published. We try to include all the books available through both the Open Library and the Internet Archive. Currently, that means about 950,000 books. When you build a corpus, you can see exactly how many books you are searching in the construction box.
A zipped, tab-separated text file with the author, title, publication year, and Open Library edition ID number.
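For readers who want to work with a file like this directly, here is a minimal Python sketch of loading it. The function name `read_catalog`, the column order in `FIELDS`, the UTF-8 encoding, and the idea that the archive holds a single text file are all assumptions made for illustration; check them against the actual download before relying on them.

```python
import csv
import io
import zipfile

# Assumed column order: the description lists these four fields,
# but does not say in which order they appear.
FIELDS = ["author", "title", "year", "olid"]


def read_catalog(zip_source):
    """Read a zipped, tab-separated catalog into a list of dicts.

    zip_source may be a file path or a binary file-like object.
    """
    records = []
    with zipfile.ZipFile(zip_source) as archive:
        # Assumption: the archive contains exactly one text file.
        name = archive.namelist()[0]
        with archive.open(name) as raw:
            # Assumption: the text is UTF-8 encoded.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            # QUOTE_NONE keeps quotation marks inside titles as plain text.
            reader = csv.reader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
            for row in reader:
                if len(row) != len(FIELDS):
                    continue  # skip lines that do not have four columns
                records.append(dict(zip(FIELDS, row)))
    return records
```

Every value comes back as a string, including the publication year, so convert it yourself if you need to filter by date.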
If you find mistakes in the catalog information (which you will!), you can go to a book's page at Open Library and correct whatever's wrong; when we next refresh our data against theirs, we'll get your changes in our system.
We only allow you to search for one or two words at a time. But what's a word? Basically, we split up the text at the spaces and keep a few punctuation marks attached in special cases described here. (Hopefully--we're still working out a few bugs.) If a word is too rare (not among the 1,000,000 most common words in the Google ngrams English corpus since 1700, after accounting for the relative rareness of pre-1900 words), it's not in the database. (There are a few other tricks to clean it up a bit more--contact us if you want to know the details.)
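To give a rough feel for the two steps described above (splitting at spaces, then dropping rare words), here is a simplified Python sketch. It is not our actual pipeline: the function name `tokenize` is made up, the sketch strips all surrounding punctuation instead of keeping some marks attached in special cases, and `vocabulary` stands in for the list of the 1,000,000 most common words.

```python
import string


def tokenize(text, vocabulary):
    """Split text at whitespace and keep only words found in vocabulary.

    Simplification: all leading and trailing punctuation is stripped,
    whereas the real rules keep a few marks attached in special cases.
    """
    tokens = []
    for piece in text.split():
        word = piece.strip(string.punctuation)
        # Drop empty strings and any word outside the allowed vocabulary.
        if word and word in vocabulary:
            tokens.append(word)
    return tokens
```

For example, with a vocabulary of `{"The", "whale", "the"}`, the text `"The whale, the whale! zyxq"` yields the four tokens `The`, `whale`, `the`, `whale`, and the rare string `zyxq` is dropped.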
Before 1830, there just weren't that many books printed, and many of those that were are fragile and so weren't included in the first round of book digitization. After 1922, copyright law keeps us from getting access to books even for this sort of project (call your congressperson to complain!). The books after 1922 are both scarce and tend to be different from the pre-1922 books.