Text Encoding for PhiloLogic
The ARTFL Project recommends that all users encode their texts following the Text Encoding Initiative's (TEI) TEI Lite encoding scheme.
PhiloLogic is known to handle more than TEI Lite encoding, as well as variations
in metadata, but it has not been extensively tested on very heavily encoded
documents.
PhiloLogic Specific Encoding
For optimal functionality under PhiloLogic we recommend the following specifications:
The TEI Header
Below is an example of a valid TEI Header known to run under PhiloLogic:
<!DOCTYPE TEI.2 SYSTEM "teixlite.dtd">
<TEI.2>
<teiHeader>
<fileDesc>
<titleStmt>
<title>TITLE of Electronic Resource</title>
<author>AUTHOR of Electronic Resource</author>
<sponsor>SPONSOR of Electronic Resource</sponsor>
<funder>FUNDER of Electronic Resource</funder>
<principal>PRINCIPAL RESEARCHER of Electronic Resource</principal>
<respStmt>
<resp>STATEMENT OF RESPONSIBILITY for Electronic Resource</resp>
<name>NAME</name>
</respStmt>
</titleStmt>
<editionStmt>
<edition>EDITION of Electronic Resource <date>DATE</date> </edition>
</editionStmt>
<extent>EXTENT of Electronic Resource</extent>
<publicationStmt>
<publisher>PUBLISHER of Electronic Resource</publisher>
<address>
<addrLine>ADDRESS</addrLine>
</address>
<date>DATE</date>
<idno>UNIQUE IDENTIFIER</idno>
<distributor>DISTRIBUTOR of Electronic Resource</distributor>
<availability>
<p>COPYRIGHT of Electronic Resource</p>
</availability>
</publicationStmt>
<seriesStmt>
<title>SERIES TITLE (to which Electronic Resource belongs)</title>
<respStmt>
<resp>STATEMENT OF RESPONSIBILITY for SERIES</resp>
<name>NAME</name>
</respStmt>
<idno>UNIQUE IDENTIFIER of SERIES</idno>
</seriesStmt>
<notesStmt>
<note>NOTES</note>
</notesStmt>
<sourceDesc>
<bibl>
<author>AUTHOR of SOURCE DOCUMENT <date>AUTHOR DATES</date> </author>
<title>TITLE of SOURCE DOCUMENT</title>
<editor>EDITOR of SOURCE DOCUMENT</editor>
<extent>EXTENT (page range) of SOURCE DOCUMENT</extent>
<imprint>
<pubPlace>PLACE of PUBLICATION for SOURCE DOCUMENT</pubPlace>
<publisher>PUBLISHER of SOURCE DOCUMENT</publisher>
<date>DATE of PUBLICATION for SOURCE DOCUMENT</date>
</imprint>
</bibl>
</sourceDesc>
</fileDesc>
<encodingDesc>
<projectDesc>
<p>PROJECT DESCRIPTION (Encoding of SOURCE DOCUMENT)</p>
</projectDesc>
<samplingDecl>
<p>SAMPLING of TEXTS (for Corpus/Collection)</p>
</samplingDecl>
<editorialDecl>
<p>CORRECTIONS to SOURCE DOCUMENT</p>
</editorialDecl>
<classDecl>
<taxonomy id="genre">
<category>
<catDesc>Genre</catDesc>
</category>
</taxonomy>
<taxonomy id="authorgender">
<category>
<catDesc>Author Gender</catDesc>
</category>
</taxonomy>
<taxonomy id="period">
<category>
<catDesc>Period</catDesc>
</category>
</taxonomy>
</classDecl>
</encodingDesc>
<profileDesc>
<creation>
<date>CREATION DATE of SOURCE DOCUMENT</date>
<address>
<addrLine>PLACE of CREATION</addrLine>
</address>
</creation>
<langUsage>
<language>LANGUAGE of SOURCE DOCUMENT</language>
</langUsage>
<textClass>
<keywords>
<list>
<item>KEYWORDS</item>
<item>KEYWORDS</item>
</list>
</keywords>
<keywords scheme="genre">
<list>
<item>GENRE</item>
</list>
</keywords>
<keywords scheme="authorgender">
<list>
<item>GENDER</item>
</list>
</keywords>
<keywords scheme="period">
<list>
<item>PERIOD</item>
</list>
</keywords>
</textClass>
</profileDesc>
<revisionDesc>
<change>
<date>DATE</date>
<respStmt>
<resp>BY</resp>
<name>NAME</name>
</respStmt>
<item>CHANGE</item>
</change>
</revisionDesc>
</teiHeader>
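In the header above, each <keywords> element in <textClass> carries a scheme attribute whose value matches the id of a <taxonomy> declared in <classDecl>. The fragment below is a minimal sketch of that pairing for a single classification, with the placeholder filled in. The value "Novel" is hypothetical and only illustrates where real data would go.

```xml
<!-- Declaration: goes inside <encodingDesc> -->
<classDecl>
  <taxonomy id="genre">
    <category>
      <catDesc>Genre</catDesc>
    </category>
  </taxonomy>
</classDecl>

<!-- Use: goes inside <profileDesc>; scheme="genre" matches the taxonomy id above -->
<textClass>
  <keywords scheme="genre">
    <list>
      <item>Novel</item> <!-- hypothetical value -->
    </list>
  </keywords>
</textClass>
```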
- Notes
Any note (end, foot, margin, etc.) occurring in the text should be coded as a
reference in the text paired with a note element, where both id="xxx" and
target="xxx" are unique identifiers and
n="x" represents the actual note reference
(usually a superscript numeral or an *). The id
and target values of an element must not be the same. By convention,
we use a letter prefix (n or r) to distinguish them, e.g. id="r1" for
references and id="n1" for notes.
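A minimal sketch of what the reference in the running text might look like. It assumes the reference is a <ref> element with type="note"; that element and type value are assumptions, not confirmed by this page. The identifiers are chosen to pair with a note whose id is "n1" and whose target is "ref1".

```xml
<!-- Reference in the running text. target points at the note's id;
     id is what the note's own target points back to. -->
<p>TEXT OF PARAGRAPH<ref type="note" id="ref1" target="n1" n="1">1</ref> MORE TEXT</p>
```

Here id ("ref1") and target ("n1") differ, and n holds the marker the reader sees in the text.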
In the "Notes" section of the document, this same note would appear as follows:
<div1 type="notes">
<head>Notes</head>
<pb n="nts"/>
<note id="n1" place="foot" target="ref1" resp="Author">1 TEXT OF NOTE</note>
- Internal Cross References
Cross references to textual objects (sections, chapters, etc.) should be coded in this manner:
The Object itself:
<div2 type="Chapter" id="c2">
The Reference to the object:
<ref type="cross" target="c2">See chapter 2</ref>
Note that both id="xxx" and target="xxx" use the same unique identifier.
- Images in the Text
References to images embedded in the text should be coded in this manner:
<figure n="filename.ext">
<figDesc>Caption</figDesc>
</figure>
PhiloLogic using other encoding schemes
Currently PhiloLogic is known to run coherently on databases encoded using the following schemes:
- MEP - The Model Editions Partnership (Example: The Sanger Archive in our Sample Databases)
- CES - Corpus Encoding Standard (Example: BBC Urdu Sample - Restricted access)
- ATE - ARTFL Text Encoding
(Examples forthcoming). This is HTML, Dublin Core, and optional extensions
for pages, notes, sentences, and the like. We specify a small subset
of HTML that PhiloLogic actually acts on, and the loader needs proper use
of heading tags (<h1> through <hN>). PhiloLogic is known to load arbitrary
HTML, but your mileage may vary. To load ATE and documents that look
like ATE, run philoload DBNAME texttype=ate and set TextType in
philo-db.cfg to ate.
- DocBook. Proof of
concept only. We loaded the only three samples of literary texts we
were able to find. The loader and system could easily be expanded to
handle most of DocBook if there is demand. We are not sure that text analysis
of technical documentation, the primary use of DocBook, is all that
worthwhile. Load and configure with texttype=docbook.
- Plaintext. Tested on Gutenberg (plaintext) and Liberliber documents.
Important caveat: input data files MUST be converted to UTF-8 before loading.
Load and configure with texttype=plaintext. The loader will try
to identify paragraphs, Gutenberg headers and trailers (available but
not indexed for searching), "chunkify" the document into reasonable
portions, and extract Author/Title info from Gutenberg files.