Documentation
This documentation explains how to build a zip file to be used with our OneClick Bookworm system.


To create a bookworm using OneClick, a zip file containing the following 3 components is required:


  1. A collection of raw texts in .txt format: /texts/raw/*.txt
  2. Description of each metadata field: /metadata/field_descriptions.json
  3. JSON objects with the metadata for each text as the lines of: /metadata/jsoncatalog.txt
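
The three requirements above can be checked before uploading. The following is a minimal sketch, not part of OneClick itself: the function names missing_components and demo_zip are hypothetical, and it assumes entries are stored in the archive relative to its root (with or without a leading slash).

```python
import io
import zipfile

# The two metadata files named in the list above, relative to the zip root.
REQUIRED_FILES = ("metadata/field_descriptions.json", "metadata/jsoncatalog.txt")


def missing_components(zip_source):
    """Return the required components that are absent from the archive."""
    with zipfile.ZipFile(zip_source) as archive:
        # Drop any leading slash so "/texts/raw/a.txt" and "texts/raw/a.txt" match.
        names = [name.lstrip("/") for name in archive.namelist()]
    missing = [path for path in REQUIRED_FILES if path not in names]
    # At least one raw text is needed under texts/raw/.
    if not any(n.startswith("texts/raw/") and n.endswith(".txt") for n in names):
        missing.append("texts/raw/*.txt")
    return missing


def demo_zip(paths):
    """Build a small in-memory zip containing the given paths (demo only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for path in paths:
            archive.writestr(path, "placeholder")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    incomplete = demo_zip(["texts/raw/a.txt", "metadata/field_descriptions.json"])
    print(missing_components(incomplete))
```

An empty list means all three components were found. The check only looks at file names, not at whether the contents are valid.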

Sometimes it's easiest if we just start by looking at some completed examples. In the table below you'll find a growing collection of demo zip files we've put together. These examples demonstrate how to incorporate different types of data into a Bookworm.

The Difficulty rating assigned to each example is primarily for relative comparison. Examples rated closer to Hard generally just make use of more types of metadata.

Although these example files are here for you to learn how to structure your own zip file, you should also feel free to create a Bookworm with them. Just use one of the URLs along with a name you come up with and Create a Bookworm. You may find this useful in seeing how quick and painless the whole process is!


Corpus | Link | Difficulty | Time Units | Description
US Congress Bills | congress.zip | Medium | Daily data using Monthly and Yearly bins. | Text files containing the summary of bills, resolutions, and amendments in the US Senate and House of Representatives from late 2006 to early 2013. The metadata here is marginally more complex than in the history dissertations and the text files are a lot longer (relatively speaking) as well.
History PhD Dissertation Titles | historydiss.zip | Easy-Medium | Annual data using Yearly bins. | Text files containing the titles of History Ph.D. dissertations dating back to the early 1800s. The .txt files themselves are still small, but the metadata is a bit more complex than in the Baby Names data.
Baby Names | babynames.zip | Easy | Annual data using Yearly bins. | Contains first names given to a sample of children born from 1920 to 2008.

These files should help get your feet wet with what to expect while creating your zip files. For a more fine-grained look, the next section provides a detailed description for each of the 3 required components.


Field Descriptions

field_descriptions.json

The field descriptions file describes the properties of each available metadata field. It is a json object consisting of an array of hashmaps, each corresponding to one metadata field which you will be supplying for at least some of the texts in your collection. Each hashmap consists of the following parameters:


Key | Type | Description
field | string | The name of the metadata variable.
datatype | string | The type of the data. categorical for things you want to be accessible by the front-end, time for datetime variables which should be displayed on an axis, searchstring for a field which specifies the HTML for search results in the front-end, and etc for data which you'd like to keep around but not load into SQL memory tables.
type | string | The format of the data. integer for ints, decimal for decimals (rounded to 4 decimal places), character for text which will be less than 255 characters, and text for strings of arbitrary length. time variables should be labelled as character if they are in datetime format (e.g. publishing date) or integer if they are not (e.g. author age at publication).
unique | boolean | Whether any given text can have only one value for this field (e.g. title) or not (e.g. subject).

If the datatype is time, there is an additional parameter "derived" which maps to an array of hashmaps, each corresponding to a time variable (x-axis) which you would like to make available to the API/front-end (e.g. month or year). Each hashmap consists of:


Key | Type | Description
resolution | string | The time resolution to bin by (e.g. year or month).
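
As an illustration of the structure described above, here is a minimal sketch that builds a field_descriptions.json in Python. The keys and allowed values come from the tables above; the "chamber" field is a made-up example, and the choice of year and month resolutions is only one possibility.

```python
import json

# One hashmap per metadata field, as described in the tables above.
field_descriptions = [
    {
        "field": "date",
        "datatype": "time",
        "type": "character",  # datetime-formatted values, e.g. "2009-03-12"
        "unique": True,
        # Only time fields carry "derived": one entry per x-axis resolution.
        "derived": [{"resolution": "year"}, {"resolution": "month"}],
    },
    {
        "field": "searchstring",
        "datatype": "searchstring",
        "type": "text",
        "unique": True,
    },
    {
        "field": "chamber",  # hypothetical categorical field for illustration
        "datatype": "categorical",
        "type": "character",
        "unique": True,
    },
]

REQUIRED_KEYS = {"field", "datatype", "type", "unique"}

if __name__ == "__main__":
    # Every hashmap should at least define the four keys from the table.
    for entry in field_descriptions:
        assert REQUIRED_KEYS <= entry.keys(), entry
    # The top level of the file is a JSON array of these hashmaps.
    print(json.dumps(field_descriptions, indent=2))
```

Writing the printed output to /metadata/field_descriptions.json would give a file in the shape this section describes. Python's True is serialized as the JSON boolean true.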


Metadata Catalog

jsoncatalog.txt

The metadata catalog file is a list of the metadata for each text, one json hashmap per line, each corresponding to one text in your collection. Each hashmap should consist of mappings from fields (as defined in the field_descriptions.json) to values for as many fields as are available.

There are 3 required fields that must be in each json hashmap:


Key | Description of Value
filename | The filename of the corresponding text file (with .txt omitted and no whitespace in the name).
date | The date corresponding to a text file. Dates which are not integers should be specified as a string in the format: YYYY-MM-DD.
searchstring | The HTML code displayed for a text when points are clicked on in the ngram graph.
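
To make the one-hashmap-per-line layout concrete, here is a minimal sketch that produces a single jsoncatalog.txt line and enforces the three required fields. The helper name catalog_line and the sample values (including the "chamber" field) are hypothetical. The checks reflect only the rules stated above and are not necessarily what OneClick validates.

```python
import json
import re
from datetime import date as Date


def catalog_line(filename, date, searchstring, **extra_fields):
    """Return one line of jsoncatalog.txt for a single text."""
    if filename.endswith(".txt") or re.search(r"\s", filename):
        raise ValueError("filename must omit .txt and contain no whitespace")
    if not isinstance(date, int):
        # Non-integer dates must be strings of the form YYYY-MM-DD.
        if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", date):
            raise ValueError("date must be an integer or a YYYY-MM-DD string")
        Date.fromisoformat(date)  # raises ValueError for impossible dates
    record = {"filename": filename, "date": date, "searchstring": searchstring}
    record.update(extra_fields)
    # json.dumps escapes embedded newlines, so each record stays on one line.
    return json.dumps(record)


if __name__ == "__main__":
    print(catalog_line("hr1234", "2009-03-12", "<b>H.R. 1234</b>", chamber="House"))
```

Joining one such line per text with newline characters gives the contents of /metadata/jsoncatalog.txt.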


Raw Texts

The raw texts are the text files in your collection (in .txt format). Each text file should contain only the raw text for each document. For example, here are 2 text files (bills in the US Congress) corresponding to the example jsoncatalog.txt and field_descriptions.json files used above:



Place all of your text files in the /texts/raw/ directory of the zip file. The contents of these text files should be encoded as Unicode (UTF-8). Our system does a pretty decent job of encoding ugly characters, but after too many of them it starts to get upset and may cause your Bookworm to fail when building. Also, avoid having any whitespace in the filename.
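
Both of these requirements can be checked locally before zipping. The sketch below is one possible approach: the function names find_problem_texts and demo are hypothetical, and it assumes the texts sit in a local directory that will become /texts/raw/.

```python
import tempfile
from pathlib import Path


def find_problem_texts(raw_dir):
    """Return (filename, reason) pairs for .txt files likely to cause trouble."""
    problems = []
    for path in sorted(Path(raw_dir).glob("*.txt")):
        if any(ch.isspace() for ch in path.name):
            problems.append((path.name, "whitespace in filename"))
        try:
            # Decoding the raw bytes fails if the file is not valid UTF-8.
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            problems.append((path.name, "not valid UTF-8"))
    return problems


def demo():
    """Create three sample files in a temporary directory and check them."""
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp)
        (raw / "good.txt").write_text("A bill to amend...", encoding="utf-8")
        (raw / "bad name.txt").write_text("Some text", encoding="utf-8")
        (raw / "latin1.txt").write_bytes("café".encode("latin-1"))
        return find_problem_texts(raw)


if __name__ == "__main__":
    for name, reason in demo():
        print(name, "->", reason)
```

Files flagged as not valid UTF-8 can usually be fixed by re-saving them with UTF-8 encoding in a text editor or by decoding them with their original encoding and writing them back out as UTF-8.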