Since its publication in 1993, Martha Mitchell's 629-page Encyclopedia Brunoniana has served as the definitive reference work on Brown University's history. Its 668 articles document the University's buildings, departments, people, and publications. The Liber Brunoniana project uses natural language processing techniques to transform Mitchell's text into hypertext, automatically inferring over 5,000 hyperlinks between articles, sorting content into categories, and constructing pages detailing the events of each year mentioned in the encyclopedia. This article details the techniques used to construct Liber Brunoniana, a book freed from the limitations of paper.
The possibility of creating Liber Brunoniana owes itself to Brown's longstanding distribution of a basic online edition of Encyclopedia Brunoniana. We used a simple Python script to scrape the article text of this online edition. The faithful rendition of the text into mostly semantic HTML (block quotations are enclosed in the appropriate tag, for example) eased the subsequent steps: transforming the text with rule-based natural language processing, and adapting the markup for our presentation.
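As a hedged illustration of this step, here is a stdlib-only sketch of pulling text out of such markup; the page fragment and class name are invented, not the project's actual script:

```python
# Illustrative only: collect the text content of a (simplified, invented)
# article page. The real scraper is a separate Python script.
from html.parser import HTMLParser

class ArticleText(HTMLParser):
    """Accumulate every text node encountered while parsing."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

page = "<p>He entered Brown in 1855.</p><blockquote>So noted.</blockquote>"
parser = ArticleText()
parser.feed(page)
text = "".join(parser.chunks)
```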
Presentation aside, the current online edition of Encyclopedia Brunoniana is an effective transformation of a text into HTML, but a poor case study in enriching a document with hypertext. This fault is particularly jarring since hypertext was designed not for creating applications (as is now the trend), but for organizing and presenting vast collections of documents. The only navigation mechanism the current online edition provides is the index of over six hundred hyperlinks on its home page. Index-based navigation is suitable for printed books because books afford the user the ability to browse with the mere flip of a page. When the ability to browse is removed, index-based navigation remains suitable only for users who have a precise quarry in mind.
For insight into what an effective hypertext rendition of the encyclopedic form entails, we looked to none other than Wikipedia. The English edition of the site effectively presents over five million articles, a volume for which a print rendition would be infeasible! While the smaller number of articles in Encyclopedia Brunoniana doesn't prohibit offering a definitive index of articles, it's enough that more expressive navigation mechanisms are useful. Liber Brunoniana borrows Wikipedia's classification of articles into categories (including the ability to classify categories themselves into categories) and its inter-document navigation via wikilinks. For technical reasons, we haven't yet implemented an integrated document search, but we suspect the necessity of search is diminished for collections of under a thousand documents. Finally, we borrowed Wikipedia's practice of thematic meta-pages, namely year pages, which summarize the events of a given year.
We initially tried to apply the clustering techniques detailed in Brandon Rose's Document Clustering in Python, but the results were, at a glance, underwhelming. That we could make this sort of at-a-glance evaluation owed itself to the uniform structure of Martha Mitchell's article titles. For many categories, we were able to write simple rules that matched precisely the set of articles we desired. Articles about people, for example, have titles following the structure "Last, First M.I." Likewise, articles about Brown's gates contain the string "Gate" in their titles.
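These title rules reduce to a couple of patterns. A sketch, where the titles and the exact regular expression are illustrative rather than the project's actual rules:

```python
import re

# Invented example titles; the rules, not the data, are the point.
titles = ["Hay, John", "Van Wickle Gates", "Sayles Hall"]

# "Last, First M.I." shape: a capitalized word, a comma, a capitalized name.
PERSON = re.compile(r"^[A-Z][\w'-]+, [A-Z]")

people = [t for t in titles if PERSON.match(t)]
gates = [t for t in titles if "Gate" in t]
```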
The number of categories that could be derived with total accuracy from title alone was very small, but the certainty of the process enabled us to iteratively bootstrap structured semantics onto the text. Articles about buildings, for example, invariably contain the phrase "built in", but so do other types of article: the article about a building's namesake often refers to the structure that memorializes that person, and non-building articles such as gates contain the phrase, too. To create a category containing all articles about buildings, we searched for the text "built in" only within the set of articles that had not already been placed into the "people" or "gates" categories.
Similarly, to construct the sub-category Professors, we filtered for the phrase "professor of" within articles that had already been categorized as people.
By iteratively structuring the document collection, we increased the precision of classification without increasing the complexity of performing it. No surefire regular expression identifies Publications with few false positives, but we could keep using very general search expressions by shrinking the search space with prior classifications.
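The bootstrapping order can be sketched end to end. All data below is invented; the point is only the narrowing of each search by prior classifications:

```python
import re

# Toy corpus: title -> body text (contents invented for illustration).
articles = {
    "Hay, John": "The library built in 1910 memorializes him.",
    "Chace, George I.": "He was appointed professor of chemistry in 1836.",
    "Van Wickle Gates": "The gates were built in 1901.",
    "Sayles Hall": "Sayles Hall was built in 1881.",
}

# Seed categories derived from titles alone.
PERSON = re.compile(r"^[A-Z][\w'-]+, [A-Z]")
people = {t for t in articles if PERSON.match(t)}
gates = {t for t in articles if "Gate" in t}

# Buildings: search for "built in" only among still-unclassified articles,
# so the library mentioned in a person's article is not misclassified.
classified = people | gates
buildings = {t for t, body in articles.items()
             if t not in classified and "built in" in body}

# A sub-category narrows an existing category rather than the whole corpus.
professors = {t for t in people if "professor of" in articles[t]}
```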
To create the same experience of exploration that makes sites like Wikipedia and TVTropes addictive to navigate, we set out to automatically identify keywords in articles that correspond to other articles and replace them with hyperlinks. The same structured properties of Martha Mitchell's titles that enabled classification simplified entity linking, too.
To identify keywords, we simply performed a case-sensitive search of each article's text for the names of other articles. This naïve technique performed surprisingly well. Applied directly to a collection like Wikipedia, such a process would flood articles with irrelevant hyperlinks (which is to say nothing of the problem of disambiguation), but it proves suitable for collections of documents with narrow breadth and uncommon names. The only problematic article in Encyclopedia Brunoniana was Well.
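A minimal sketch of this search, with invented titles and texts:

```python
# Naive entity discovery: a case-sensitive substring search of every
# article's text for every other article's title (all data invented).
articles = {
    "John Hay": "John Hay was a statesman and author.",
    "Sayles Hall": "At its dedication, John Hay delivered an address.",
}

mentions = {title: [other for other, text in articles.items()
                    if other != title and title in text]
            for title in articles}
```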
While a regular expression search-and-replace powered the initial attempt at entity linking, a common, confounding case rendered it useless. Brown (and by extension, Encyclopedia Brunoniana) tends to honor notable individuals in its rich history by naming buildings after them. Thus to "John Hay", add the "John Hay Library", and so on. With regular expressions alone, it is impossible to express that a hyperlink should never be nested inside another hyperlink; to express this, we must enter the realm of context-free languages. Handling these cases consistently threatened to explode the task into full-blown HTML parsing. Greg Hendershott's xexpr-map procedure reduced the challenge of expressing a context-aware tree transformer to a few lines of Racket.
For all articles, we perform linkification with an identity mapping between article name and keyword. For articles about people (which we can identify with absolute certainty), we additionally linkify with various common arrangements of name components.
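The Racket itself is not reproduced here, but the shape of the xexpr-map solution can be sketched in Python over an xexpr-like nested-list tree; the representation and names below are ours, for illustration only:

```python
# A node is either a string or [tag, attrs, *children]. The transformer
# threads an in_link flag down the tree, so a hyperlink is never created
# inside an existing one.
def linkify(node, keyword, url, in_link=False):
    if isinstance(node, str):
        if in_link or keyword not in node:
            return [node]
        pre, _, post = node.partition(keyword)  # link the first occurrence
        return [pre, ["a", {"href": url}, keyword], post]
    tag, attrs, *children = node
    inside = in_link or tag == "a"
    flat = [n for child in children
            for n in linkify(child, keyword, url, inside)]
    return [[tag, attrs, *flat]]

doc = ["p", {},
       "See the ",
       ["a", {"href": "/John_Hay"}, "John Hay"],
       " article; John Hay was a statesman."]
linked, = linkify(doc, "John Hay", "/John_Hay")
```

Because the flag propagates down the tree, the transformer can be run once per keyword without ever producing a nested hyperlink.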
Excluding links introduced by categories and year pages, this process introduced over 3,000 hyperlinks between documents. Visualized, it reflects the transformation of a disparate cloud of about 680 articles,
...into a complex web that leaves few documents orphaned (if you include hyperlinks introduced by date pages, there are no orphaned pages):
That alone is not sufficient evidence of a functional improvement, but by exploring Liber Brunoniana you can be the judge of that.
Date Fact Extraction
Generating date pages (like the one for 1828) also benefited from being able to confidently identify pages about people. Broadly, the datification process consisted of tokenizing articles into lists of sentences, filtering out all sentences not containing four-digit numbers, and adding the remainder to a date-fact database from which date pages are created. Two challenges arose. First, sentence tokenization, despite great support from Python's NLTK library, was confounded by the glut of esoteric abbreviations (mostly related to various degrees) whose periods were mistaken for sentence terminators. Fortunately, the tokenizer could easily be extended to recognize additional abbreviations.
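Stripped of tokenization (which NLTK handles in the real pipeline), the core filter is a one-liner over sentences; the sentences below are invented:

```python
import re

# Sentences as produced by an upstream tokenizer (invented examples).
sentences = [
    "The hall was dedicated in 1904.",
    "It houses the department of chemistry.",
    "He taught until 1921.",
]

# Keep only sentences that mention a four-digit number.
YEAR = re.compile(r"\b\d{4}\b")
date_facts = [(YEAR.search(s).group(), s)
              for s in sentences if YEAR.search(s)]
```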
The second, not unexpected, challenge was disambiguating sentences that identify their subject only by pronoun. While such sentences are perfectly acceptable in context, every date fact is a one-sentence fragment severed from its article. For articles about people, we know the likely subject of any pronoun, so we replaced pronouns with the person's name. Importantly, we avoided replacing pronouns in sentences that already contained the subject's name. Such ambiguous sentences are fortunately much less common in articles not about people, and we do not attempt to dereference any pronouns encountered there. While some complex disambiguation mechanism might be feasible, an ambiguous date fact is preferable to a date fact rendered unambiguously erroneous. Every date fact is followed by a citation to the article it was extracted from so that readers can learn more if they wish.
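A sketch of the guarded substitution; the pronoun pattern and names are illustrative:

```python
import re

def dereference(sentence, subject):
    """Substitute the subject's name for a pronoun, but only when the
    sentence does not already name the subject (heuristic sketch)."""
    if subject in sentence:
        return sentence
    return re.sub(r"\b(He|She)\b", subject, sentence, count=1)

fact = dereference("He was appointed professor of history in 1885.",
                   "John Hay")
```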
Stiki: The Static Wiki
Static site generators are enjoying a renaissance in popularity because they are easy to reason about and consistent in their performance and resource consumption. Much of this renaissance has centered on static blogs, and in trying to replicate the functionality of a blog, some prepackaged generators have grown quite complex. We believed that static site generation would be an even more natural fit for the domain of online wikis. Moreover, we believed that such a generator could match the functionality of a dynamic wiki while operating on simple principles and mechanics. Although 'static' may seem antithetical to the collaborative nature of a wiki, it actually delegates those responsibilities to more qualified agents. Liber Brunoniana delegates the responsibility of collaboration and change-tracking to Git, and webhook-triggered build scripts render changes within seconds.
Stiki, the static wiki, is a simple static site generator for wiki-like sites that derives complex functionality from the consistent application of two principles:
- Files are pages.
- Folders are categories.
Of all document management systems, the filesystem has the best tooling available. Stiki leverages this tooling to express relationships between documents extremely tersely. To list all of the categories an article or category belongs to:
find -L . -samefile "$1" -print0 | xargs -0 dirname -z
To list all sub-pages of a category:
find "$1" -mindepth 1 -maxdepth 1 -xtype f
To list all sub-categories of a category:
find "$1" -mindepth 1 -maxdepth 1 -xtype d
In these three commands, we've expressed the relationships necessary to create an encyclopedia-like site.
Stiki is less a piece of software than it is a set of principles. The initial, slightly unwieldy, Racket-powered generator created for this project occupies the Stiki repo, but the scripts in the Liber Brunoniana repo reflect the '1.0' rendition of the Stiki principles.
Bash as a Template Engine
Amidst the alphabet soup of web technologies, there is nothing less hip than a rusty shell script. In retrospect, however, it's not surprising that a shell created for an operating system where text is exchanged as the universal interface between programs is exceptionally good at manipulating it. For HTML templating, Bash is an unlikely hero.
Stiki combines multi-line strings, redirection, and parameter expansion to template pages: Liber Brunoniana defines a universal page template, and sub-templates that pipe into it.
Better yet, since this is Bash, the document relationships we've already expressed as shell scripts integrate seamlessly.
Make is very nearly the perfect companion to Stiki because of Stiki's one-to-one correspondence between dependencies and artifacts. However, it is Stiki's closeness to the filesystem that renders Make unusable. Because Stiki uses file names as page titles, page titles may be any valid file name; by preserving spaces in file names, article titles may contain spaces, a capability Liber Brunoniana uses extensively. Unfortunately, Make cannot handle spaces in file names. I'm not aware of any Make-like build systems that lack this limitation. (Suggestions welcome!) For now, we work around it with a short build script written in Bash.
This project, as a whole, grew out of a conviction that the presentation of history need not itself be historic. The navigational transformations we applied to the text are only part of the presentation experience; the aesthetics of the presentation are equally important. Designing an appropriate interface for presenting Encyclopedia Brunoniana was an act of reconciling history with modernity.
Armed with the improvements of CSS3, the most critical aspects of which enjoy broad browser support, designing modern interfaces for the web is no longer a challenge. A subtle way to project an impression of modernity is to ensure that your interface feels tailor-made to the user's device. Sites that use mobile-only or desktop-only designs are doomed to neglect some set of users. Sites that dynamically present users two distinct interfaces depending on their device can be the worst of both worlds, shattering any expectation the user had of a consistent application interface.
Through flexbox alone, Liber Brunoniana's interface is adaptable enough that users at all resolutions see virtually the same interface. When considering the role of device-dependent layout, we considered the order in which elements demand the user's attention. A consistent design isn't one that enforces identical layouts across canvas sizes; it's one that enforces a consistent hierarchy of visual importance. On wide displays, Liber Brunoniana reduces white space on the page with a sticky page-navigation sidebar to the left of the article.
Compared to this sidebar, the content of the article dominates the visual space, both in width and in location (centered on the screen). As the width of the canvas decreases, however, the dynamic-width article text occupies an ever smaller fraction of the screen, and the fixed-width sidebar becomes a competitor for attention. To handle this scenario, Liber Brunoniana moves the sidebar into the preamble of the article on sufficiently narrow displays.
Typography of Authority
By virtue of its medium, content published on the web suffers from an authority problem: the luminescent, evanescent frame of a web browser simply lacks the certainty of existence that emanates from a printed book, and the dubious content of much of what's published on the web doesn't help either. The online edition of Encyclopedia Brunoniana, sharing its contents with the printed edition, has equal functional authority, but fails to project an equal impression of authority. Liber Brunoniana projects authority to the user by borrowing the lessons of print typography.
Types of Trust
In a recent A/B experiment on readers, New York Times opinion writer Errol Morris tested (though not to academic standards) whether the typeface of a statement affects how likely readers are to believe it. Such an effect exists in folk knowledge: Comic Sans is notoriously unauthoritative, while Helvetica is prized for its neutral stateliness. As Morris's follow-up article detailed, there is academic evidence for such an effect. The million-dollar question, however, is whether a typeface can go beyond neutrality and exert authority. Morris writes,
The conscious awareness of Comic Sans promotes — at least among some people — contempt and summary dismissal. But is there a typeface that promotes, engenders a belief that a sentence is true? Or at least nudges us in that direction? And indeed there is.
It is Baskerville.
Following this evidence-based approach, Liber Brunoniana is set in Libre Baskerville. This improved upon an earlier iteration of the site, set in the Fell Types, in aesthetics, readability, and performance.
Watch the Width
The rule of thumb of limiting line lengths to 45-75 characters is perhaps the strongest enrichment that can be made to web typography, and it requires very little effort. We used Chris Coyier's bookmarklet to settle on a max-width of 50em for article content.
The pleasant typography of Liber Brunoniana (not developed by designers) would not have been possible without tools like Type-Scale.org, a web application for choosing visually appealing 'scales' of text sizes for headings. Liber Brunoniana uses an augmented-fourth scale for wide-display clients, and a perfect-fourth scale for narrow-display clients.
Ditch the Columns
Columnar layouts are one of the most compelling new features of CSS, but their use on Liber Brunoniana is limited to the print version of the site and to category listings.
In print, columns are a panacea, at once improving both legibility and information density. What works well in one medium, however, does not necessarily translate well to others. The choice to use columns effects a fundamental change in how the eye traverses the content: rather than strictly up-to-down, a print columnar design limits column height to the field of view of the eye so that the reader ‘scrolls’ their vision horizontally. It’s an inversion of the consumption model we’re used to on the web.
The key here is limited vertical field of view. The web favors vertical scrolling and websites have an effectively limitless vertical dimension; when that content spills out of the viewport, a scrollbar is simply added. Applying columns to a vertically-limitless design arbitrarily bisects the content at the midway point.
Columns aside, embrace print stylesheets. Encyclopedia Brunoniana feels authoritative in print, and Liber Brunoniana's web presentation (with some effort) feels authoritative, too. While digitizing a book is a herculean effort, converting a web page is as simple as Ctrl+P, and your users will do it. Liber Brunoniana retains its authority in printed form by using a print-targeting media query.
As a general check-list:
- Remove navigation elements
- Layout remaining elements appropriately
- Choose print-appropriate font sizes
- Limit line length
- Consider using columns, but expect flaky browser support
Applying these considerations, a modern browser can produce a print of a Liber Brunoniana article approaching the quality of a document typeset with LaTeX.
If Encyclopedia Brunoniana is to remain the definitive reference of Brown's history for decades to come, it will be as a living document. Liber Brunoniana is one step towards creating an infrastructure to support a growable, accessible historical record. The ink is far from dry on Liber Brunoniana; at the time of writing, the site represents a minimal presentable effort. In the coming months,
- the text will continue to be classified into categories
- the datify process will be refined to exclude likely quantities but include the trailing two digits of year ranges
- the HTML markup of site content will be converted to a more expressive wiki-text
- following improvements to authorship will come improvements to presentation
- support will be added for media pages, bringing Encyclopedia Brunoniana to life with historic images
Liber Brunoniana is open-source and hosted on GitHub.