PhiloMine (release 0.1a) Rationale and Design Notes ARTFL/DLDC University of Chicago
PhiloMine is an attempt to implement extensions to PhiloLogic to support a wide set of data mining tasks and functions on relatively large data sets in an interactive, Web-based environment. It allows the user to select sets of documents to work with, choose tests or tasks to perform, select and modify features, and link back into the documents using standard PhiloLogic services whenever possible. The base version of PhiloMine installs as new cgi-bin executables that extend existing PhiloLogic databases without modifying them.
We have made a series of decisions about the kinds of tasks and approaches that we implemented first, growing out of opinions concerning the most effective approaches and our own research considerations. The following provides a brief outline of the design rationale and some of the directions that we are planning to follow in future.
There are a number of different ways to consider machine learning and text mining tasks that may augment and complement more traditional text analysis functions found in PhiloLogic. For the moment, we are drawing a very simple distinction between three kinds of text mining applications: comparative, predictive, and clustering/similarity. This is neither an exhaustive nor an exclusive list, but a point of departure.
Comparative Text Mining. This is perhaps the simplest way to frame a text mining problem, by comparing two sets of documents given a particular feature about the documents, authors, or other data typically thought of as metadata. These may include gender, race, nationality of authors, periods of composition, genres of texts, subject matter, and so on. Often, all of the salient information regarding the feature in question is well known for the collection at hand, and not subject to prediction. For example, the nationality and ethnicity of playwrights in the Alexander Street Press collection of Black Drama is already known. Comparative text mining uses existing metadata as a basis:
Predictive Text Mining. This is what is typically considered to be text or data mining: having a system learn on an established group of documents (the training set) and applying the generated model to another set of documents. Ideally, one would add salient information to the second set of documents. Sample 3 shows a predictive classification task. In this case, we trained a Naive Bayesian classifier on 9,239 articles with an assigned class of knowledge in Diderot's Encyclopedie, set as corpus 1, and applied the resulting model to 4,694 unclassified main articles. The classes of knowledge are normalized and translated renditions of the classes found in the Encyclopedie (with spaces removed in this example). The resulting classifications are interesting and impressive (links to the articles are disabled in this sample). We feel this is due to the coherence of the classification system, as demonstrated by the strong performance in many comparative tests (Sample 4). We are currently assessing the accuracy using several other indicators and may modify the feature selection for future applications.
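The train-then-apply workflow described above can be sketched in a few lines of Python. This is only an illustration, not PhiloMine's code: it assumes a multinomial Naive Bayes model with add-one smoothing over raw word counts, and the function names, class labels, and toy "articles" are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """Count documents per class and words per class.

    labeled_docs is a list of (tokens, class_label) pairs.
    """
    class_doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labeled_docs:
        class_doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_doc_counts, word_counts, vocabulary

def classify(tokens, model):
    """Return the class with the highest log posterior for the tokens."""
    class_doc_counts, word_counts, vocabulary = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(class_doc_counts):
        # Log prior: share of training documents in this class.
        score = math.log(class_doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocabulary:
                continue  # words never seen in training carry no evidence
            # Add-one (Laplace) smoothing, an assumption of this sketch.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training set: tokenized articles with an assigned class.
training_set = [
    ("triangle angle line".split(), "Geometry"),
    ("circle angle radius".split(), "Geometry"),
    ("verb noun tense".split(), "Grammar"),
    ("noun adjective verb".split(), "Grammar"),
]
model = train_naive_bayes(training_set)

# Apply the model to unclassified documents.
print(classify("angle circle".split(), model))
print(classify("verb tense".split(), model))
```

In a real run the training set would be the classified articles and the second set the unclassified ones. Feature selection, which these notes say may still change, would happen before the counting step.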
Document Similarity and Clustering. Comparative and Predictive text mining are instances of "supervised learning". PhiloMine also supports a type of unsupervised learning, based on vector space searching. This is a commonly applied technique in information retrieval which lends itself to a variety of applications. Our current implementation in PhiloMine is designed to find documents in a set that are most similar to a second set of query documents, where the two sets can be identical (compare each to all others). Depending on the dataset and features, vectors can be very large, at times exceeding 10,000 (very) sparsely populated matrix rows. Sample 5 shows the result of a normalized vector search of 4,880 articles in the Encyclopedie using the words in 303 articles by d'Alembert, displaying the top 11 (10 + the query article in this case) ranked by relevance (the cosine of the angle between the two document vectors). Many of the most similar articles are also by d'Alembert, which stands to reason as he wrote on a specific number of scientific topics, but there are quite plausibly similar articles by other authors and also outside of a particular class of knowledge. The comparison can be expanded, as in Sample 6, which shows the similarity of unclassified articles against all main articles. We have applied this component of PhiloMine to a variety of datasets with less topic structure, including correspondence, newspaper articles, and all "div" level chunks in the ARTFL database.
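As a rough illustration of normalized vector search, the sketch below stores each document as a sparse dictionary of word weights, scales it to unit length, and ranks documents by cosine similarity to a query document. It assumes raw word counts as weights. The function names and the three toy documents are hypothetical and are not taken from PhiloMine.

```python
import math
from collections import Counter

def to_unit_vector(tokens):
    """Build a sparse word-count vector and scale it to length 1."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {word: c / norm for word, c in counts.items()}

def cosine(u, v):
    """Dot product of two unit vectors, which equals their cosine."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter sparse vector
    return sum(weight * v.get(word, 0.0) for word, weight in u.items())

def most_similar(query_tokens, documents, top_n=10):
    """Return (score, doc_id) pairs, highest cosine first."""
    query = to_unit_vector(query_tokens)
    scored = [
        (cosine(query, to_unit_vector(tokens)), doc_id)
        for doc_id, tokens in documents.items()
    ]
    # Sort by descending score, breaking ties by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_n]

# Hypothetical tokenized documents.
documents = {
    "A": "force mass motion".split(),
    "B": "force motion velocity".split(),
    "C": "verb noun tense".split(),
}

# Querying with a document from the same set returns that document first,
# which is why a "top 10" listing may show 11 rows.
for score, doc_id in most_similar(documents["A"], documents):
    print(doc_id, round(score, 3))
```

Because only non-zero entries are stored, the cost of each comparison depends on the number of distinct words in the shorter document rather than on the full vocabulary size, which matters when vectors are very sparse.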
Traditional text analysis tools, such as PhiloLogic, provide many helpful ways to search, sort and sift through many occurrences of words, phrases, or intersections of words, but are distinctly limited in the degree to which such results can be generalized because they are tied closely to individual words and occurrences. This is becoming ever more problematic because of the vastly increasing scale of modern text databases. Comparative and Clustering/Similarity text mining functions, in particular, would seem to open the possibility of using the computer to address questions at a higher order of generalization by moving away from the individual word as the primary locus of analysis. Thus, for example, we can use "all the words" to examine the distinction between plays by American and non-American Black playwrights and determine which words are most distinctive of each. In this case, one may wish to use more traditional tools like PhiloLogic to inspect those break words. Document clustering or similarity again uses "all the words" -- or at least most of them -- to suggest potential associations between texts or parts of texts. Text mining applications may form a very useful complement to the more traditional searching, sorting and sifting facilities in systems like PhiloLogic. Determining just what that relationship is, or should be, is one of the objectives of PhiloMine.
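One simple way to picture "determining which words are most distinctive" of two document sets is to compare smoothed relative frequencies. The sketch below is an assumption for illustration only: these notes do not specify which statistic PhiloMine uses, and the function name and toy corpora are hypothetical.

```python
import math
from collections import Counter

def distinctive_words(corpus_a, corpus_b, top_n=1):
    """Rank words by the log ratio of their smoothed relative frequencies.

    Each corpus is a list of tokenized documents. Positive scores mark
    words favored by corpus A, negative scores words favored by corpus B.
    """
    counts_a = Counter(tok for doc in corpus_a for tok in doc)
    counts_b = Counter(tok for doc in corpus_b for tok in doc)
    vocabulary = set(counts_a) | set(counts_b)
    # Add-one smoothing keeps words absent from one corpus finite.
    total_a = sum(counts_a.values()) + len(vocabulary)
    total_b = sum(counts_b.values()) + len(vocabulary)
    scores = {
        word: math.log((counts_a[word] + 1) / total_a)
        - math.log((counts_b[word] + 1) / total_b)
        for word in vocabulary
    }
    top_a = sorted(vocabulary, key=lambda w: (-scores[w], w))[:top_n]
    top_b = sorted(vocabulary, key=lambda w: (scores[w], w))[:top_n]
    return top_a, top_b

# Hypothetical corpora, two tokenized documents each.
corpus_a = [["river", "church", "river"], ["river", "road"]]
corpus_b = [["city", "road", "city"], ["city", "church"]]

top_a, top_b = distinctive_words(corpus_a, corpus_b)
print(top_a, top_b)
```

Words surfaced this way are candidates for closer reading, for instance by looking up their occurrences with conventional PhiloLogic searches.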
As we have been working on various research projects, using PhiloMine as well as standalone applications, we have found that several considerations need to be kept in mind when using text mining systems, and we have attempted to build at least some controls into PhiloMine:
Without careful assessment and controls, these powerful systems will suggest connections that are not valid or meaningful. They may lay claim to knowledge, of a sort, but they are totally devoid of anything resembling common sense. Furthermore, given the almost limitless ability to define problems and associated feature sets, combined with a wide array of possible statistical tests, if you run 100 different experiments, you are virtually assured of finding at least some "interesting" results simply by random chance.
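The point about 100 experiments can be made concrete with a short calculation. The sketch below assumes the experiments are independent and that each counts as "interesting" when it passes a 0.05 significance threshold. Both are assumptions for illustration and do not describe how PhiloMine evaluates results. The function name is hypothetical.

```python
def chance_of_spurious_result(alpha, experiments):
    """Probability that at least one of several independent tests
    passes the significance threshold alpha purely by chance."""
    return 1.0 - (1.0 - alpha) ** experiments

# 100 independent experiments at the conventional 0.05 threshold.
uncorrected = chance_of_spurious_result(0.05, 100)

# A Bonferroni-style correction divides the threshold by the number
# of experiments, which brings the overall chance back under 0.05.
corrected = chance_of_spurious_result(0.05 / 100, 100)

print(round(uncorrected, 3), round(corrected, 3))
```

Under these assumptions the chance of at least one spurious finding is roughly 0.994, which matches the "virtually assured" wording above. With the corrected per-experiment threshold it stays just under 0.05.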