An older relative of mine recently asked me to explain the Text Encoding Initiative. I began, as I often do when attempting to explain text encoding to my elders, by making a comparison to the versification and text coloring in the many editions of the Christian Bible, but quickly realized that, for perhaps the first time, I could also safely draw analogies from the idea of eBook readers and the use of semantic tagging by corporations to achieve higher Google ranks. The more I explained, the more I realized this should, in fact, be the glory year of the TEI, but it certainly doesn’t feel that way. In January of 2011, Amazon reported that they sold more Kindle eBooks than paperbacks in 2010. The reading public is increasingly moving from paper to digital media, and the texts consumed on these devices will need to be prepared in a way that allows them to be easily disseminated to a number of devices that, at present, make use of a set of incompatible, occasionally proprietary, encoding formats. The Internet Archive, for instance, releases all of their books in eight different ebook formats. Notably, they do not use TEI, nor do they provide tools to generate derivatives from texts that use our guidelines. Nor do I believe that any of the major ebook publishers use the guidelines as a base format for their publications.
2011 is also rapidly becoming, for Digital Humanists at least, the year of the semantic web. From the API workshop at the University of Maryland to the Linked-Open Data for Libraries and Museums conference at the Internet Archive, from the release of the microformats specification to the recent set of linked-open data awards given at the NEH Digital Humanities Startup Grant competition, semantic tagging has moved beyond the niche community of a few information architecture geeks to a central position in the scholarly conversation. Yet, despite the large number of eBook publishers and editors present at the linked open data summit in San Francisco, very few showed any interest in or even awareness of the TEI (despite the fact that, during at least one conversation, relevant TEI work on overlapping hierarchies was directly useful to the problem being discussed).
Finally, 2011 has also been a year when the potential of automatic processing of large corpora has become obvious to even non-digital humanists. Press coverage of the Google N-grams tool, the continuing work of Bamboo, and arguments over the value and legality of the HathiTrust show the research questions that can be generated even by processing unstructured texts with unreliable metadata.
We are, in fact, at a moment when the TEI should be ascendant. The TEI community’s primary value proposition is that it provides a standardized vocabulary for describing text that is more expressive than the vocabularies of related schemas (such as HTML, LaTeX, or EPUB) and can be used for interchange, corpus processing, software-independent preservation, and, according to some, literary criticism itself. We remain, however, on the fringes of the metadata world and of the non-scholarly text encoding world, and we have demonstrated only very moderate success using TEI for large-scale data processing. The TEI is, I fear, in danger of becoming a Dvorak keyboard. When the Apple IIc was released, it included a switch, mysterious to many, labeled simply “Keyboard.” For those who didn’t read the manual, the switch’s function seemed to be to change between “gibberish” and “normal” mode. In reality, the switch toggled the keyboard ROM between the familiar U.S. QWERTY standard and a supposedly more efficient arrangement, the Dvorak Simplified Keyboard. In 1936, the educational psychologist August Dvorak patented a keyboard that, he claimed, arranged the keys in a way that significantly improved typing efficiency on modern machines. The QWERTY layout, he explained, was intentionally inefficient: it separated letters that commonly occur together in English to prevent adjacent levers from jamming against each other when struck in rapid succession. Better typewriter designs and, later, computers eliminated the problem of jamming keys, and so a more efficient layout would, it would seem, place commonly co-occurring characters in close proximity to each other. Of course, the Dvorak keyboard has never really caught on. By failing to achieve significant buy-in from the most important user communities, the supposedly superior method of text inscription has become all but irrelevant save for a few niche user groups.
The standard narrative of the history of Dvorak claims that it failed for a set of reasons that can be summarized in the following four points:
- The advantages offered by Dvorak were not sufficient to upset the ubiquity of QWERTY.
- New technologies that extended the QWERTY standard made switching to a new one impractical.
- Dvorak was inconsistently implemented (the number keys in many Dvorak keyboard mappings follow the QWERTY arrangement).
- The economy of the 1930s did not encourage the development of implementations of the standard.
I fear that unless the TEI community and leadership undertake some dramatic and immediate changes, the TEI guidelines are in grave danger of suffering the same fate for many of the same reasons.
To begin, the advantages of TEI are not sufficient to upset the popularity of HTML, the text encoding standard with QWERTY-like ubiquity. I acknowledge that the analogy is in some ways imperfect. To begin with, TEI is, in fact, the older standard. Tim Berners-Lee first publicly released HTML in 1991; the TEI P1 guidelines were published in 1990. The two standards grew up together, but the wider applicability of HTML allowed it to catch on much more quickly, to the point that TEI is now the standard that must prove its worth against the more familiar HTML.
Today, HTML is one of the most commonly understood computer languages, and examples are readily accessible to anyone who learns how to “View source” in their web browser. Humanities students usually pick up the basics in less than an hour of instruction. The tags are generic and limited, and so can be mastered quickly, but the latest versions of the language are also extensible enough to give encoders most of the semantic power of TEI. TEI is, undoubtedly, more expressive, but the affordances of that additional power do not, for most, outweigh the learning curve and additional complexity of the schema.
A traditional argument against HTML for text encoding claims that its focus on presentational rather than descriptive markup limits its use cases too narrowly. This may have been convincing in 1996, but the position is simply no longer defensible by rational argument. Since HTML 4.0, presentational elements have been all but entirely deprecated (the “i” and “b” tags have given way to descriptive tagging styled with CSS). The recent developments in microformats (semantic markup encoded in the attributes of standard HTML elements), promoted by Google and Bing through the schema.org structures, signal the beginning of what will likely be increased use of HTML for semantic tagging.
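The shift is easy to see in the markup itself. As a hypothetical illustration (the type and property names below come from schema.org’s published vocabulary; the book is my own example, not drawn from any publisher’s pages), a scrap of ordinary HTML can carry machine-readable semantics entirely in its attributes:

```html
<!-- schema.org microdata: the semantics live in the attributes
     (itemscope, itemtype, itemprop), while the elements themselves
     remain plain, presentationally neutral HTML -->
<div itemscope itemtype="https://schema.org/Book">
  <span itemprop="name">Hamlet</span>,
  by <span itemprop="author">William Shakespeare</span>
</div>
```

A search engine that understands the schema.org vocabulary can extract the title and author; a browser that does not simply renders the text and ignores the attributes.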
All of this points to another example of why TEI cannot win the hearts and minds of the general public or new digital humanists. There is too much invested in HTML as the de facto standard of text markup, and too many new technologies assume its use. Microformats are just one example. The most popular open ebook standard, EPUB, uses XHTML as its base. The most common massive data-processing tools, internet search engines, have spent years developing algorithms to parse and process HTML. Fringe standards, like the TEI, must work, with negligible resources, to map their format into ones usable by these tools.
Indeed, even for TEI-based tools, a conversion process is often necessary. TEI may be a standard format, but it is rarely applied in a standard way. We have too many fundamentally redundant tags. When does a block of text lose enough of “the semantic baggage of a paragraph” to switch from a “p” to an “ab”? If a verse line in a play has a stage direction on the same physical line in the source document, should an “lb” tag follow the stage direction? If so, why does it not follow the rest of the verse lines? Must a parser really check to see if there is an “implied” line break due to the verse line? These are common questions raised by those simply trying to apply the canonical guidelines. For those who extend the TEI, even more inconsistencies emerge. Apologists point to the TEI-M standard, which managed to wrangle diverse and inconsistently tagged TEI documents into a corpus for text processing, but the fact that this had to be done at all suggests the TEI is not really the best tool for assisting “distant reading.” If it is not, though, then it really should not be recommended by granting bodies for its affordances for interoperability.
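To make the verse-line ambiguity concrete, here is a sketch of the two encodings an editor might defensibly produce for the same source line (the line and stage direction are invented for illustration; as far as I can tell, both forms are permitted by the guidelines):

```xml
<!-- Encoding 1: the verse line is taken to imply its own break,
     and the stage direction sits inside it -->
<l>Once more unto the breach, dear friends, once more;
  <stage>Alarum.</stage></l>

<!-- Encoding 2: an explicit lb follows the stage direction,
     though no other verse line in the document carries one -->
<l>Once more unto the breach, dear friends, once more;</l>
<stage>Alarum.</stage>
<lb/>
```

A tool that wants to reconstruct the physical lineation of the source must handle both, which is exactly the sort of per-project special-casing that makes generic TEI software so hard to build.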
This inconsistency is in part due to, and in part the cause of, the paucity of tools that make use of TEI. By 1993, the HTML specification had a popular web browser. We do not yet have a tool comparable to Mosaic or Netscape for the use of TEI files. I understand the argument that any good standard should be independent of a particular, ephemeral tool, but I believe the disappearance of Mosaic and the emergence of four or five major browsers (not counting the mobile versions) suggest that an interoperable standard actually depends upon the implementation of tools that test and prove it.
This is, I think, the future of the TEI, at least for the near term. The standard is working well enough for the very small subset of scholars and metadata specialists who are actually using it. In these days in which libraries have adopted a standard of “more product, less process,” a standard as process-intensive as TEI must have a product to justify its use. In a period in which e-publishing is clamoring for better and more functional interfaces for text, we need to show that our standard provides more functionality. I propose, then, a moratorium on funding for any discussion related to the TEI standard. New tags can wait. Indeed, fewer tags are probably what is needed, but for now, let’s leave that aside. Let us take the whole of our TEI vocabulary (the only thing we have that others don’t have at the moment), import it into microformats, and then build tools that can use it. Some of these tools might be built by specific institutions with project-specific grants, but I think we will actually go farther if we work together to build tools that belong to no one institution but are built by many. Interedition has already demonstrated the significant coding output that can be achieved simply by bringing programmers together in a room for a few days with only a few thousand dollars. If we were to use the TEI budget to fund these sorts of meetings rather than the committee-based writing that produces the TEI guidelines, if we were to allow institutions to pay their dues with programmer hours dedicated to a collaborative open source project over the course of a year, I suspect our value to the larger humanities enterprise, and perhaps even the general public, would be clearer. As it is, we are at present in danger of joining the Dvorak keyboard in the archive of irrelevant obscurities.
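What might importing the TEI vocabulary into microformats look like? A minimal sketch, assuming a hypothetical `data-tei` attribute convention (no such mapping has been standardized; the attribute names here are my own invention for illustration):

```html
<!-- Hypothetical: TEI semantics carried on plain HTML via data-* attributes,
     so existing browsers, search engines, and ebook readers can still
     render and index the page while TEI-aware tools recover the tagging -->
<p>As <span data-tei="persName" data-tei-ref="#WS">Shakespeare</span>
  wrote in <span data-tei="title">Hamlet</span>, the readiness is all.</p>
```

The attraction of such a mapping is that it inverts the current burden of proof: instead of asking the web to learn TEI, it lets TEI ride on the format the web already parses.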
The TEI has not been altogether successful thus far, but it would be a shame if what we have accomplished were simply discarded because our more pragmatic competitors convinced the public and our funders that we are entirely useless.