This document is NOT complete. It was, at one point, cleaner, but I have been adding issues that we need to resolve. I still need to write a public-release-level ATE specification. MVO.
Rather than create "yet another standard", we decided that our internal data representation should leverage existing standards whenever possible, and use standards that are simple, well documented, stable, and supported by a large number of existing free or inexpensive tools. We determined that the most effective representation is the combination of the (unqualified) Dublin Core 1.0 metadata specification with very basic HTML. Only a few extensions are currently used to build ARTFL-style databases: page identifiers, explicit sentence tags, and proper name tags, all of which are optional and may be omitted. PhiloLogic completely IGNORES all SGML tagging that we have not specifically described below. This allows compatibility with other systems, since you can pass tags blindly through the system.
By using a very basic and well documented set of encoding specifications, the design of PhiloLogic and ATE will allow users to build databases from different encoding schemes (see our SGML to ATE document LINK COMING), allow individuals to develop ATE-compatible documents directly using existing tools, and allow us to use PhiloLogic to index arbitrary collections of HTML documents in a WWW space with minimal encoding and no modification. In contrast to the TEI and other specifications, which work from the top down to provide an infrastructure for all possible encodings in a system-independent fashion, ATE is system dependent and built from the bottom up. We specify only tagging that we actually have in production or that we are planning to treat in PhiloLogic. Extensions to this basic specification will be, as always, completely optional and will use existing schemes, preferably TEI specifications, whenever possible.
We believe that a simple representation is sufficiently rich for the creation of complex databases and easily used by many scholars in environments with which they are already familiar.
For the default functionality of PhiloLogic bibliographic control, we rely on a selected subset of the 15 base data elements of the Dublin Core. The 15 base elements are, with a quick indication of the desired contents of each element:
<head>
<meta name="DC.title" content="Complete Title">
<meta name="DC.creator" content="Author Name">
<meta name="DC.publisher" content="Publisher">
<meta name="DC.date" content="date">
<meta name="DC.type" content="Genre or type">
<meta name="DC.identifier" content="Short Identifier">
<meta name="DC.contributor" content="Editor or other">
<meta name="DC.subject" content="TO BE DETERMINED">
<meta name="DC.format" content="TO BE DETERMINED">
<meta name="DC.language" content="TO BE DETERMINED">
<meta name="DC.description" content="TO BE DETERMINED">
<meta name="DC.relation" content="TO BE DETERMINED">
<meta name="DC.coverage" content="TO BE DETERMINED">
<meta name="DC.source" content="TO BE DETERMINED">
<meta name="DC.rights" content="TO BE DETERMINED">
</head>
This information is mapped to a refer bibliography containing the following information in the standard layout, which is used by PhiloLogic to build the bibliographic data handler:
%a Rousseau, J.-J.
%T Discours sur Sciences et Arts
%D 1750
%Y traite ou essai
%P In Oeuvres Completes, T.3. Paris, Gallimard, 1964.
%S DiScA
At this time, the default bibliographic representation allows for searching on
We have provided a more detailed outline of Recommended Dublin Core Contents for use under PhiloLogic. This document specifies in detail the format and contents of Dublin Core elements that will work best under PhiloLogic. PhiloLogic is compatible with basic Dublin Core as defined in the unqualified Dublin Core specification.
Randomly selected example:
<meta name="DC.title" content="De rationali et ratione uti">
<meta name="DC.creator" content="Gerbertus Auriliacensis">
<meta name="DC.publisher" content="Patrologia latina, vol. 139. J. P. Migne, ed. Parisiis: excudebat Migne, 1853">
<meta name="DC.date" content="MED">
<meta name="DC.identifier" content="GerAur, DeRaEtR">
<meta name="DC.contributor" content="Chadwyck-Healey (Release 5: 1995)">
<meta name="DC.format" content="ARTFL HTML-SGML">
<meta name="DC.language" content="la">
<meta name="DC.rights" content="c. 1995 Chadwyck-Healey Inc. Do not export or print from this database without checking your licence agreement to see what is permitted.">
Important note: I need to document alternative bibliographic control,
the refer format, since we do load databases in this way as well.
In fact, the loader extracts the bibliographic information from the
DC representation and generates a refer bibliography with
some additional information required by the loader.
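The extraction step described in the note above can be sketched roughly as follows. This is a minimal illustration in Python, not the actual loader: the function name `dc_to_refer` is hypothetical, and the mapping from Dublin Core elements to refer field codes is an assumption inferred from the Rousseau example earlier in this document. The "additional information required by the loader" is not modeled.

```python
import re

# ASSUMPTION: mapping inferred from the Rousseau refer example in this
# document; the real loader may use different or additional fields.
DC_TO_REFER = [
    ("dc.creator", "%a"),
    ("dc.title", "%T"),
    ("dc.date", "%D"),
    ("dc.type", "%Y"),
    ("dc.publisher", "%P"),
    ("dc.identifier", "%S"),
]

# Matches tags of the form <meta name="DC.xxx" content="...">
META_RE = re.compile(
    r'<meta\s+name="(DC\.[a-z]+)"\s+content="([^"]*)"\s*>', re.IGNORECASE
)


def dc_to_refer(html_head):
    """Build a refer-style record from the DC meta tags in html_head."""
    fields = {}
    for name, content in META_RE.findall(html_head):
        # Keep only the first occurrence of each element.
        fields.setdefault(name.lower(), content)
    lines = [f"{code} {fields[dc]}" for dc, code in DC_TO_REFER if dc in fields]
    return "\n".join(lines)
```

Elements with no assumed refer equivalent (for example DC.rights) are simply skipped in this sketch.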
We will adopt a multi-level structural hierarchy for main textual objects.
The top level is the document, and lower levels are as follows:
| <body> | |
| <h1>[ANY STRING]</h1> | 1st level division |
| <h2>[ANY STRING]</h2> | 2nd level division |
| <h3>[ANY STRING]</h3> | 3rd level division |
| </body> |
Note: The system expects section breaks -- <h[1-n]> ... </h[1-n]> -- to appear on a separate line. It is best if the entire header title appears on a single line:
<h1>Header Title</h1>
These can be as long as you want, but remember that we use them to display tables of contents, which use indentations to indicate nesting, so really long header titles will wrap, making the display less effective.
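To illustrate why single-line headers and short titles matter, here is a minimal Python sketch of how an indented table of contents could be derived from the division headers. The function name `table_of_contents` and the two-space indent are assumptions made for illustration; this is not PhiloLogic's own code.

```python
import re

# A division header that sits alone on its line, e.g. <h2>Chapter 1</h2>
HEADING_RE = re.compile(r"^<h([1-9])>(.*)</h\1>$")


def table_of_contents(lines, indent="  "):
    """Return header titles, indented one step per nesting level."""
    toc = []
    for line in lines:
        match = HEADING_RE.match(line.strip())
        if match:
            level = int(match.group(1))
            toc.append(indent * (level - 1) + match.group(2))
    return toc
```

A header that is split across lines, or that shares its line with other text, is not recognized by this sketch, which mirrors the one-header-per-line expectation stated above.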
Full document navigation is in full production for PhiloLogic databases as described in the sections on Retrieving and Navigating Documents and Navigating Documents from Word Searches of the PhiloLogic User's Manual.
Further object levels which descend from the lowest division level
are as follows:
| <p> | paragraph/stanza |
| sentence | delimited by punctuation or explicit tags (<sent>) |
| word | delimited by white space and punctuation |
We have decided that the value of the page numbers noted here will NOT include spaces. Generally, I would keep these babies short, since it is a matter of display space in KWIC reports. In general, these look like:
<page n="12">
but of course, pages can have letters and other oddities,
<page n="vol1:12">
for page 12 of volume one. And, you can have alternates, just don't
use a space:
<page n="23:[sheet_3]">
but of course
<page n="23:3"> or
<page n="23:sheet_3">
might be just as effective.
Important note: page tags must appear on their own lines.
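The two page-tag conventions above (no spaces in the identifier, tag alone on its line) are easy to check before loading. The following Python sketch shows one way to do so; the function name `check_page_tags` and the message strings are hypothetical and not part of the loader.

```python
import re

PAGE_RE = re.compile(r'<page n="([^"]*)">')


def check_page_tags(lines):
    """Return (line_number, message) pairs for page tags that break
    the conventions described in this section."""
    problems = []
    for number, line in enumerate(lines, start=1):
        for match in PAGE_RE.finditer(line):
            # The tag should be the only thing on its line.
            if line.strip() != match.group(0):
                problems.append((number, "page tag is not on its own line"))
            # The identifier must not contain white space.
            if any(ch.isspace() for ch in match.group(1)):
                problems.append((number, "page identifier contains a space"))
    return problems
```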
For search and retrieval purposes, any front matter (such as the editor's preface, title page, etc.) or back matter (appendices, indices, etc.) will be considered a logical unit of text tagged as a first-level structure.
The new loader should implement this system for tagging text structure,
but it should also be able to handle texts already marked up in HTML without
further modification. In other words, missing <page n="..."> markers
or <h1> ... </h1> tags should not prevent a file from loading into
a database.
Explicit sentences will be tagged as follows:
Implicit sentences will be identified by the punctuation marks . ? ! and object tags (e.g. <h1>, <p>, etc.) as we have in the current loader.
Note: the system will, for every paragraph, check to see if there are any <sent> tags. If not, it will apply implied sentence recognition.
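The per-paragraph fallback described in the note above might look like the following Python sketch. It assumes, purely for illustration, that explicit sentences are wrapped in paired `<sent>` ... `</sent>` tags; this document does not show the exact form. The implicit rule here is deliberately naive (it splits on every run of . ? !) and ignores object tags, so it is not the loader's actual recognizer.

```python
import re

# ASSUMPTION: explicit sentences are marked as <sent> ... </sent>.
SENT_RE = re.compile(r"<sent>(.*?)</sent>", re.DOTALL)

# Implicit rule: text up to and including a run of . ? !,
# or whatever trailing text remains at the end of the paragraph.
IMPLICIT_RE = re.compile(r"[^.?!]+[.?!]+|[^.?!]+$")


def split_sentences(paragraph):
    """Use explicit <sent> tags if the paragraph has any;
    otherwise fall back to punctuation-based splitting."""
    explicit = SENT_RE.findall(paragraph)
    if explicit:
        return [s.strip() for s in explicit]
    pieces = IMPLICIT_RE.findall(paragraph)
    return [p.strip() for p in pieces if p.strip()]
```

The sketch shows one reason explicit tagging is useful: an abbreviation such as "Dr." would end a sentence under the implicit rule, but not inside an explicitly tagged sentence.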
The current ARTFL loader handles all valid HTML special characters (for a list of these characters, see the official list). There are also many SGML character entities which do not map to ISO Latin, such as &obar; (the letter "o" with a macron over it). Commercially available data offers a wide variety of unofficial SGML character representations, so the preceding example is one of many possibilities. In order to account for unusual character representations, the loader should generate a words.R file where each entry has two fields:
In the future, we may be able to extend this representation
of unusual spellings and characters to include fields for things such as
parts of speech, root forms, etc.
NOT IMPLEMENTED YET
<h1>Notes</h1>
<page n="nts">
<notetext n="0" xpg="398" xpgobj="136">* Sono contrassegnati da un asterisco i capitoli di altri a Veronica Franco. </notetext>
Links to NOTES in the text (behaves like a ref):
deposited the mummies<note n="21" ref="21"> that had been <note n="0" ref="a">, che degna gloria <note n="5" ref="*">
where n=NOTE NUMBER -- the internal identifier which links to the note -- and ref=DISPLAY IDENTIFIER (the thing that gets displayed to indicate there is a note). We do not want numbers or other note identifiers in the running text, since these would be indexed as characters and break word adjacent searching.
The notetext tag appears as indicated. In 2t loaders, you want paragraphs between them; for 3t, the notetext tag will suffice.
<notetext n="21" xpg="58" xpgobj="58"> <i>mummies</i>.</notetext>
<notetext n="5" xpg="398" xpgobj="136">* Sono contrassegnati da un asterisco i capitoli di altri a Veronica Franco.
where n=NOTE NUMBER (a string tied to the note tag), xpg=PAGE IDENTIFIER (usually a page number), and xpgobj is an INT counted from 0 in the doc. Thus,
<notetext n="8" xpg="VIII" xpgobj="8">(9) Giulio Secondo....</notetext>
indicates that the page identifier is "VIII", which is the tag for the 8th page object. We normally echo out the Note Identifier at the beginning of the notetext, e.g. (9) or "*". It is called by
<note n="8" ref="9">
Most importantly, the two values for n="VALUE" must be identical, since this ties the reference to the notetext.
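Because the link depends entirely on the two n values matching, a quick consistency check can catch broken notes before loading. The Python sketch below is an illustration only; the function name `unmatched_notes` is hypothetical, and it assumes the attribute order shown in the examples above (n first, then ref).

```python
import re

# <note n="..." ref="..."> in the running text
NOTE_RE = re.compile(r'<note n="([^"]*)" ref="([^"]*)">')
# <notetext n="..." ...> holding the body of the note
NOTETEXT_RE = re.compile(r'<notetext n="([^"]*)"')


def unmatched_notes(text):
    """Return (references without a notetext, notetexts without a reference)."""
    refs = {n for n, _display in NOTE_RE.findall(text)}
    bodies = set(NOTETEXT_RE.findall(text))
    return sorted(refs - bodies), sorted(bodies - refs)
```

Note that n values are compared as strings, so n="8" and n="08" would not match in this sketch.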
External object linkage is performed in format.ph, typically by expanding an internal tag to a URL:
s/<FIGURE INLINE="." SYSID="([^"]*)">/<IMG SRC=$image_server$1>/g;
where $image_server is set to an appropriate value in format.ph, such as:
$image_server = "http://www.lib.uchicago.edu/efts/EVD/figures/";
For administrative ease and consistency, it is wise to adopt general conventions which we can put in the default installed format.ph script. Conventions are noted below.
<figure sys.id="V0740035.TIF" inline=n figno="35">
This allows us to clearly build a link and determine how an image
should be displayed. Thus,
<FIGURE SYSID="FILE_NAME.EXT">
<FIGURE INLINE="Y" SYSID="FILE_NAME.EXT">
will be functionally equivalent, building a link to an image
for display in the WWW browser using the
<IMG SRC=... HTML construction. Similarly,
<FIGURE INLINE="N" SYSID="FILE_NAME.EXT">
will be used to build clickable links to images using the
<A HREF="..." construction.
There are a couple of assumptions here.
Example Code:
$image_server = "http://www.lib.uchicago.edu/efts/VOLTAIRE/figures/";
# The following goes into the Object formatter
s/<FIGURE INLINE="Y" SYSID="([^"]*)">/<IMG SRC="$image_server$1">/g;
s/<FIGURE SYSID="([^"]*)">/<IMG SRC="$image_server$1">/g;
# You can modify the formatting of the link
s/<FIGURE INLINE="N" SYSID="([^"]*)">/[<a href="$image_server$1">image<\/a>]/g;
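For readers who want to experiment with these conventions outside format.ph, the same three substitutions can be sketched in Python. This is a port for illustration only: format.ph itself is Perl, and the function name `format_figures` is hypothetical.

```python
import re

IMAGE_SERVER = "http://www.lib.uchicago.edu/efts/VOLTAIRE/figures/"


def format_figures(text, image_server=IMAGE_SERVER):
    """Expand internal FIGURE tags into HTML, following the conventions above."""
    # INLINE="Y" or no INLINE attribute at all: display the image inline.
    text = re.sub(
        r'<FIGURE (?:INLINE="Y" )?SYSID="([^"]*)">',
        lambda m: f'<IMG SRC="{image_server}{m.group(1)}">',
        text,
    )
    # INLINE="N": build a clickable link to the image instead.
    text = re.sub(
        r'<FIGURE INLINE="N" SYSID="([^"]*)">',
        lambda m: f'[<a href="{image_server}{m.group(1)}">image</a>]',
        text,
    )
    return text
```

Like the Perl patterns, this sketch is case sensitive and expects the attributes in exactly this order, so tags written differently pass through unchanged.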
April 2, 99: ALL hyphens will act as word separators!
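The effect of this rule on word indexing can be illustrated with a small Python sketch. It assumes words are runs of letters, digits, or underscores once tags are removed, which is a simplification of "delimited by white space and punctuation"; it does not handle SGML character entities, and the function name `tokenize` is hypothetical.

```python
import re

TAG_RE = re.compile(r"<[^>]*>")
WORD_RE = re.compile(r"\w+")


def tokenize(text):
    """Split text into words; tags are dropped, and white space,
    punctuation, and hyphens all act as separators."""
    without_tags = TAG_RE.sub(" ", text)
    return WORD_RE.findall(without_tags)
```

Under this rule a hyphenated compound such as "porte-plume" is indexed as two words, "porte" and "plume".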
Notes are a problem in general. We are putting them at the end of documents as an h3 ... in order to get page fetching functioning properly, we will also add a page tag before the h3 notes......