What is Bookworm arXiv?

Bookworm demonstrates a new way of interacting with the millions of recently digitized library books. The Harvard Cultural Observatory already collaborated with Google Books on the Google ngrams viewer that has data for years. Bookworm doesn't work so closely with Google Books: instead, it uses texts in the public domain, in this case books from the Open Library and Internet Archive. They have gathered millions of digital texts, along with the descriptions of those texts that librarians have made over the last two centuries. Bookworm uses that information to let you search for trends in any corpus you can create out of the library metadata, and to link to the underlying books so you can read them.

What can I do with it?

Library metadata makes all sorts of interesting queries possible. For example:

  • Say you want to know about the history of Social Darwinism: when did "evolution" cross over from the sciences into the social sciences? You can compare the paths of keywords like "natural selection" in different genres. (Feel free to use Library of Congress Subject Headings as well, although LC classification--the shelf location of a book--is usually a more consistent category to use.)
  • You can also use geographical information to make comparisons. Suppose that you want to know when different countries most frequently mention the word "war". You can see the spike in the United States during the Civil War, and you can see that books from France are the first to mention "guerre" heavily during the First World War, while books from the UK and the United States only later start to use "war" a lot. (Remember, though, that cross-language comparisons are tricky--German books appear to use "Krieg" much less, but this likely says something about the language or the books we have available, not the underlying culture.)

    What books does this use?

    This project builds on the amazing work of the Open Library and Internet Archive projects. The Internet Archive makes scans of books publicly available with Optical Character Recognition already performed. The books come mostly from major research libraries and are scanned by the Internet Archive itself, Google, Microsoft and other scanning initiatives. The Open Library is the Internet Archive's cataloging wing; they hope to create a publicly editable library catalogue with an entry for every book ever published. We try to include all the books available in both the Open Library and the Internet Archive. Currently, that means about 950,000 books. When you build a corpus, you can see exactly how many books you are searching in the construction box.

    The list of books is available as a zipped, tab-separated text file with the author, title, publication year, and Open Library edition ID number.
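If you want to work with that file yourself, the following is a minimal sketch of how it might be read in Python. The column order (author, title, year, edition ID), the UTF-8 encoding, and the function names `read_catalog` and `count_in_range` are assumptions made for illustration; check them against the actual file before relying on them.

```python
import csv
import io
import zipfile

# Assumed column order, taken from the description of the file above.
# The real file may order or name its fields differently.
FIELDS = ["author", "title", "year", "olid"]


def read_catalog(zip_source, member=None):
    """Yield one dict per book from a zipped, tab-separated catalog.

    zip_source is a path or a binary file-like object. member is the name
    of the text file inside the archive (defaults to the first entry).
    """
    with zipfile.ZipFile(zip_source) as archive:
        name = member or archive.namelist()[0]
        with archive.open(name) as raw:
            # Assumes UTF-8 text; adjust if the file uses another encoding.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.reader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
            for row in reader:
                if len(row) != len(FIELDS):
                    continue  # skip malformed lines rather than guess
                yield dict(zip(FIELDS, row))


def count_in_range(books, first_year, last_year):
    """Count books whose year field is an integer within the given range."""
    total = 0
    for book in books:
        try:
            year = int(book["year"])
        except ValueError:
            continue  # catalog years are sometimes missing or non-numeric
        if first_year <= year <= last_year:
            total += 1
    return total
```

For example, `count_in_range(read_catalog("catalog.zip"), 1830, 1922)` would count the books in the default date range, where `catalog.zip` is a placeholder for wherever you saved the download.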

    If you find mistakes in the catalog information (which you will!), you can go to a book's page at Open Library and correct whatever's wrong; when we next refresh our data against theirs, we'll get your changes in our system.

    What's in a word?

    We only allow you to search for one or two words at a time. But what's a word? Basically, we split up the text at the spaces and keep a few punctuation marks attached in special cases described here. (Hopefully--we're still working out a few bugs.) If a word is too rare (not in the top 1,000,000 words in the Google ngrams English corpus since 1700, accounting for the relative rareness of pre-1900 words), it's not in the database. (There are a few other tricks to clean it up a bit more--contact us if you want to know the details.)
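As a rough illustration of the idea, here is a sketch of whitespace tokenization followed by a vocabulary filter. It is not Bookworm's actual code: the rule of trimming punctuation from the ends of each token is a simplification standing in for the special cases mentioned above, the `vocabulary` set stands in for the top-1,000,000 word list, case handling is left as an exact match because the text does not specify it, and all function names are hypothetical.

```python
import string


def tokenize(text):
    """Split on whitespace and trim punctuation from both ends of each token.

    Interior punctuation (as in "isn't" or "well-known") stays attached.
    This is a simplification, not Bookworm's real set of special cases.
    """
    tokens = []
    for chunk in text.split():
        word = chunk.strip(string.punctuation)
        if word:  # drop chunks that were punctuation only
            tokens.append(word)
    return tokens


def bigrams(tokens):
    """Return adjacent pairs of tokens as two-word phrases."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def keep_known(tokens, vocabulary):
    """Keep only tokens found in the vocabulary (exact match)."""
    return [token for token in tokens if token in vocabulary]
```

Under these assumptions, a one-word search corresponds to an entry from `tokenize`, and a two-word search such as "natural selection" corresponds to an entry from `bigrams`.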

    Why are the defaults 1830 to 1922?

    Before 1830, there just weren't that many books printed, and many of the ones that were printed are fragile and so weren't included in the first round of book digitization. After 1922, copyright law keeps us from getting access to books even for this sort of project (call your congressperson to complain!). The books we do have from after 1922 are scarce, and they tend to be different from the pre-1922 books.

    Why isn't the OCR/metadata/collection better?

    This is a proof-of-concept for the Digital Public Library of America's Beta Sprint to show how new interfaces can unlock library books. There are still OCR misreadings, duplicate books, missing metadata fields, and all sorts of exciting problems. If you click through on the chart to actually read the books, you might be able to figure out what's going on. We've solved most of the easy problems: think of all the rest of them as invitations to learn more about the condition of our digital resources.

    If you have any more specific questions, send us an e-mail and we'll get back to you as quickly as we can.