PhiloMine (release 0.1a)
Functions and Options
ARTFL/DLDC University of Chicago
Functions and Options found in PhiloMine and
Encyclomine.
As noted in the Introduction, we currently have two flavors of
PhiloMine. The main version, called PhiloMine, is primarily for
comparisons of two sets of documents in a PhiloLogic database and
can be used with any existing PhiloLogic database installation.
The second variant, named Encyclomine, performs classification,
prediction, and vector space searching on "chunks" within one or more
documents and cannot simply be dropped into an existing PhiloLogic
database. We are planning to merge these two flavors in a later
release, in conjunction with an update release of PhiloLogic.
We have tested all PhiloMine functions on a number of large full-text
databases, including a 2,500-document (150 million words) Frantext database,
and on numerous smaller datasets. Encyclomine is currently being used
for the Encyclopedie of Diderot and d'Alembert, a large collection of
19th century newspapers, and an experimental version of Frantext
(built as div level chunks with additional metadata). Please
consult our Demonstration and Samples for a discussion of
sample results and applications.
What follows is the current list of functions and options found in
both PhiloMine and the more experimental Encyclomine variant.
Functions
- Descriptive
- Differential Relative Rates (DRR): a simple but useful report
that generates tables showing words that are over-represented in document
group one (c1) compared to document group two (c2). These are calculated
by frequency per 10,000 words in each document and selected by (currently
hard-wired) relative ratios of use.
- Information Gain (Weka): displays, as a list, the user-selected number of
top features as measured by Information Gain. These are displayed
without reference to either of the two groups of documents; the list shows
only those features which are most effective at distinguishing the two
vectors. This is a statistically more coherent test than DRR above. Note that
Information Gain is often used by other classification techniques (e.g.
decision trees) to build branches.
- Comparative
- Classifier: Multinomial Naive Bayesian (MNB)
- Classifier: Weka Naive Bayes
- Classifier: Decision Tree (DT), with an option to generate a graphic tree
- Classifier: SVMLight
Multiclass:
- SVMLight One-vs-All (OVA)
- SVMLight One-vs-One (OVO)
- Classifier: SMO (Weka)
- Predictive
- Predict: Multinomial Naive Bayesian (MNB). Train on c1, predict on c2.
- Cluster/Similarity
- Vector Space (frequency/normalized)
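As an illustration of the Differential Relative Rates report described above, here is a minimal Python sketch. It is not PhiloMine's own code. It assumes each document group has already been tokenized into a flat list of words, pools the counts per group rather than per document, and uses hypothetical `min_ratio` and `floor` parameters in place of the hard-wired ratios, whose actual values are not given here.

```python
from collections import Counter


def rate_per_10000(tokens):
    """Return each word's frequency per 10,000 tokens."""
    total = len(tokens)
    counts = Counter(tokens)
    return {word: count * 10000 / total for word, count in counts.items()}


def differential_relative_rates(c1_tokens, c2_tokens, min_ratio=2.0, floor=1.0):
    """List words over-represented in c1 relative to c2.

    min_ratio and floor are illustrative assumptions: a word is kept when
    its c1 rate is at least min_ratio times its c2 rate, and floor keeps
    the ratio finite for words that are rare or absent in c2.
    """
    rates1 = rate_per_10000(c1_tokens)
    rates2 = rate_per_10000(c2_tokens)
    rows = []
    for word, rate1 in rates1.items():
        rate2 = rates2.get(word, 0.0)
        ratio = rate1 / max(rate2, floor)
        if ratio >= min_ratio:
            rows.append((word, rate1, rate2, ratio))
    # Most strongly over-represented words first.
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows


if __name__ == "__main__":
    c1 = ["roi"] * 30 + ["le"] * 70
    c2 = ["roi"] * 5 + ["le"] * 95
    for word, r1, r2, ratio in differential_relative_rates(c1, c2):
        print(word, r1, r2, ratio)
```

Swapping the two arguments gives the words over-represented in c2 relative to c1.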
Options
- Balance Instances: randomly selects documents from the larger
vector to create a balanced comparison dataset. We often use this
several times to accumulate results comparing numerous documents
with a desired characteristic (e.g. male authors) against smaller
subsets (e.g. female authors) in large collections. Ideally,
multiple runs will give a good sense of classification accuracy.
- Limit Features (per number of documents): eliminates features
that occur in more than a user-selected number of documents and in fewer than
a user-selected number of documents. This is loosely based on Chen, Lee,
and Hwang's discussion of "Conformity and Uniformity" in "A Hierarchical
Neural Network Document Classifier with Linguistic Feature Selection", in
Applied Intelligence, 23, 277-294, 2005.
- Limit Features (per document frequency): an alternative feature
limitation which filters out user-selected high and low frequency
words by document. Probably not as effective as the limitation per number
of documents described above.
- Random Falsification (shuffle instances): for comparative classifications,
this shuffles the instance vectors in order to deliberately confuse the
classifier. Given the large number of features and relatively limited
number of instances, this is a good way to see if the classification
accuracy is based on the data or on your "greedy" classifiers. For example,
a Weka SMO run comparing male/female authors gets 85.4% cross-validated
accuracy. When shuffled, the same run gets 51.9% cross-validated accuracy.
Given the power of many systems to find patterns, we believe this is an
important control, along with N-fold cross validation, to make sure that the
classification criteria really have the impact that is detected.
See A. Ruiz and P.E. López-de-Teruel, "Random Falsifiability and
Support Vector Machines".
- Exclude Features: filter out particular features. This is
used to remove specific features that may unfairly favor the
classification task. In the Encyclopedie tasks, strings that
denote class of knowledge and authors should be removed, such as
"archit architecture architect". Use regular expressions as in normal
PhiloLogic searches. One should inspect the list of features
generated by most of the classifiers to see things that should be
removed. Also note that some of the classifiers will list
features that occur in one group and not in the other.
- Use Features: base classification only on these
selected features. We have not used this much to date, but it is an
interesting experiment to see just how predictive a particular set
of features may be. Further experimentation will be required.
- Stem Words: currently implemented for Vector Space searching.
This uses Sébastien Darribere-Pleyt's implementation (a Perl module)
of the Porter Stemmer for French. This will probably be
extended as an option to all classification tasks. It behaves oddly for
many pre-20th century forms and will not readily allow for links back to
the words in PhiloLogic. The English version could also be added.
So far, we have not found huge classification differences using stemmed
vs. unstemmed features, and there is some internal discussion of whether
form/tense is a useful classification criterion.
- SVMLight
- C: This tuning parameter is described by the SVMLight documentation as "trade-off between training error and margin". If you set it on the form, it will propagate to the command sent to SVMLight. Values should differ by orders of magnitude to substantially affect performance. This parameter applies to both binary and multiclass runs with SVMLight.
- Multiclass Field: The name of the field that contains the class information, for example, 'genre'.
- Multiclass Values: All values for which you would like to train models for the multiclass SVMLight run. Enter each value separated by spaces. For example, 'poetry mystery news'.
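Three of the options above (Balance Instances, Limit Features per number of documents, and Random Falsification) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not PhiloMine's implementation: documents are assumed to be lists of tokens, all function and parameter names are hypothetical, and the falsification step is written as a shuffle of the class labels, which has the same effect as shuffling the instance vectors against fixed labels.

```python
import random
from collections import Counter


def balance_instances(c1_docs, c2_docs, seed=None):
    """Randomly downsample the larger group to the size of the smaller one."""
    rng = random.Random(seed)
    n = min(len(c1_docs), len(c2_docs))
    return rng.sample(c1_docs, n), rng.sample(c2_docs, n)


def limit_features_by_doc_count(docs, min_docs, max_docs):
    """Keep features found in at least min_docs and at most max_docs documents."""
    doc_freq = Counter()
    for tokens in docs:
        # Count each feature once per document.
        doc_freq.update(set(tokens))
    return {w for w, n in doc_freq.items() if min_docs <= n <= max_docs}


def shuffle_labels(labels, seed=None):
    """Return a shuffled copy of the labels (the falsification control)."""
    rng = random.Random(seed)
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled


if __name__ == "__main__":
    larger = [["a", "b"], ["a", "c"], ["a", "b", "d"], ["a", "e"]]
    smaller = [["a", "f"], ["b", "f"]]
    bal1, bal2 = balance_instances(larger, smaller, seed=0)
    print(len(bal1), len(bal2))
    print(limit_features_by_doc_count(larger, min_docs=2, max_docs=3))
    print(shuffle_labels(["c1", "c1", "c2", "c2"], seed=0))
```

Running a classifier once on the true labels and once on the output of `shuffle_labels` gives the kind of comparison described under Random Falsification; repeating `balance_instances` with different seeds corresponds to the multiple runs suggested under Balance Instances.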