PhiloMine (release 0.1a)
Demonstration and Samples
ARTFL/DLDC University of Chicago
Since we are unable to provide a working demonstration of PhiloMine,
the following will simply be a set of runs with various parameters
on various datasets that we are working on. If you really
want to give this a whirl on a particular dataset, please contact
us and we will try to arrange it.
PhiloMine is a web-based set of CGI functions which are accessed by
search forms, such as
example one and
example two (note these are not working forms). The parameters
of the search forms are discussed in our Functions
and Options section.
Comparative Classification Examples
This is just a sample of different tests with a couple of different
parameters. It is interesting to note that Weka Naive Bayes and
SMO appear to generate rather different feature lists than our
Perl-based NB and SVMLight.
- Black Drama, 1950-2006, comparing American and non-American authors
using preset filter and a few words removed.
- Naive Bayes
accuracy: 88.8%, 84.4% cross validated.
- Naive Bayes
with document balancer on, accuracy: 88.8%, 83.9% cross validated.
- Naive Bayes with
random falsification on (note author ethnicity is mixed in
corpus one and corpus two). accuracy: 74.2%, 48.6% cross
validated (bottom of page). This is about chance level, which is what we want.
- Decision Tree (with
graphic and rules) accuracy: 100%, 83.6% cross validated.
- Weka: Naive Bayes
accuracy: 86.5%, 81.5% cross validated. Note the different
highly predictive features when compared with the other
Naive Bayes
implementation.
- SVM Light, note
slightly different way to represent accuracy:
error=10.04%, recall=90.10%, precision=91.97%
- Support Vector Machine (Weka
SMO), accuracy: 100%, 92.4% cross validated.
- Support Vector Machine (Weka
SMO), with Random Falsification, accuracy: 100%, 52.8%
cross validated.
- 18th and 19th century French male/female authors, roughly
balanced for genre, etc.
- A couple of interesting comparisons run by Glenn Roe at
ARTFL, which suggest potentially useful kinds of queries:
- Comparison of 17th and 20th century theatre
Sample 20. Only one
text was misclassified: Iphigénie: tragédie en cinq actes, by Jean
Moréas, a self-consciously neo-classicist playwright.
- Comparison of pre- and post-revolutionary poetry
Sample 21. Post-revolutionary
documents: 100% accuracy. Pre-revolutionary documents: 92%. Of the
pre-revolutionary texts, all 5 of André Chénier's texts are
misclassified, which would seem to support Victor Hugo's claim that
Chénier was a direct precursor of the Romantics.
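The SVM Light entry above reports error, recall and precision, while the other runs report a single accuracy figure. The following is a minimal sketch of how the two views relate, assuming a two-class problem summarized by a confusion matrix. The function name and the counts in the usage lines are hypothetical and are not taken from any PhiloMine run; the only figure from this page is the 10.04% error, which corresponds to 89.96% accuracy.

```python
def classification_metrics(tp, fp, fn, tn):
    """Summarize a two-class confusion matrix.

    tp/fn: positive documents classified correctly/incorrectly.
    tn/fp: negative documents classified correctly/incorrectly.
    """
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    error = (fp + fn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return {
        "accuracy": 1.0 - error,  # accuracy is the complement of error
        "error": error,
        "recall": recall,
        "precision": precision,
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=90, fp=8, fn=10, tn=92)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")

# Converting the error reported above into an accuracy figure.
svm_light_accuracy = round(100 - 10.04, 2)
print(f"SVM Light accuracy: {svm_light_accuracy}%")
```

Recall and precision are worth keeping next to accuracy when the two corpora differ in size, since a classifier can then reach high accuracy by favoring the larger class.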
Predictive Function Examples
- Encyclopedie: Multiclass NB predictions for unclassified
articles: Sample 3
and Sample 4.
Please see Rationale and Design Note for
further discussion.
- Coming soon, multiclass SVM prediction.
- Predict authors of unsigned articles?
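To make the idea of multiclass prediction for unclassified articles concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing. It is not PhiloMine's Perl implementation; the function names, class labels and toy "articles" are all hypothetical and stand in for classified and unclassified Encyclopedie articles.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_docs):
    """Collect per-class document and word counts.

    labeled_docs: iterable of (class_label, list_of_tokens).
    """
    class_doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in labeled_docs:
        class_doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_doc_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return (best_label, log_scores) for one unlabeled document."""
    class_doc_counts, word_counts, vocab = model
    total_docs = sum(class_doc_counts.values())
    scores = {}
    for label, n_docs in class_doc_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # class prior
        for tok in tokens:
            if tok not in vocab:
                continue  # skip words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get), scores


# Hypothetical classified articles (toy data).
training = [
    ("geography", "river town province sea".split()),
    ("geography", "town river mountain".split()),
    ("botany", "plant leaf flower root".split()),
    ("botany", "flower plant seed".split()),
    ("law", "court judge statute".split()),
]
model = train_nb(training)

# Hypothetical unclassified article.
label, scores = predict_nb(model, "flower seed river plant".split())
print(label)
```

The per-class log scores are returned alongside the winning label so that close calls between classes can be inspected rather than hidden behind a single prediction.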
Descriptive Function Examples
- Black Drama, 1950-2006, comparing American and non-American authors
Similarity/Clustering Function Examples
We are still examining ways to evaluate results from this function.
The examples from the Encyclopedie are a little easier to evaluate
since articles are classified and are usually about "something".
More difficult to evaluate are letters and div level objects of
French texts. This type of measure could be used for building
a document clustering scheme. Right now, an interactive environment
is helpful for testing.
- Encyclopedie: Sample 5
and Sample 6 (see
Rationale and Design Note for more
comments on these two examples).
- Napoleon Letters:
find the 10 most "similar" letters to 2 letters written on
Sept 26, 1797, out of 1,500 letters.
- Large Div level objects in 18th century French documents
Sample 19.
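A query such as "find the 10 most similar letters" can be sketched as a ranking of documents by vector similarity. The example below assumes cosine similarity over raw word counts; this page does not say which measure PhiloMine uses, so treat the measure, the function names and the toy letters as hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_id, docs, k=10):
    """Rank all other documents by similarity to docs[query_id].

    docs: dict mapping document id to its text.
    Returns up to k (doc_id, score) pairs, best first.
    """
    vectors = {d: Counter(text.lower().split()) for d, text in docs.items()}
    query = vectors[query_id]
    scored = [
        (cosine(query, vec), doc_id)
        for doc_id, vec in vectors.items()
        if doc_id != query_id
    ]
    # Highest score first; ties broken by document id for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, score) for score, doc_id in scored[:k]]


# Hypothetical toy "letters".
letters = {
    "a": "the army marches to the river",
    "b": "the army crosses the river at dawn",
    "c": "send bread and wine to paris",
    "d": "the river is high",
}
top = most_similar("a", letters, k=2)
for doc_id, score in top:
    print(doc_id, round(score, 3))
```

Raw counts let frequent function words such as "the" dominate the score, so a real run would likely need some form of term weighting or word filtering before the ranking becomes meaningful.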