Todo

From PhilologicWiki

philo-db.cfg $BIBOPS{"dgphilodiv"} -> $BIBOPS{"dgphilodivid"}

In the stock version of philo-db.cfg, we have:

# Define DIV LEVEL for searching
$BIBOPS{"dgphilodiv"}="exact";

This should actually be, I believe:

# Define DIV LEVEL for searching
$BIBOPS{"dgphilodivid"}="exact";

Proximity Searching for same word fails

Paul Schaffner reports that a proximity search for the same word fails: it returns all of the occurrences of the word, with hit highlighting broken. The phrase search "day day" performs properly, as does the phrase "day to day", but a proximity search for "day day" fails. My initial assessment is that this is a search3 evaluation error. Judging by the duplicated "hit" highlighting, it anchors the proximity on one occurrence, gets that same occurrence back when it does an index search for the next item, and decides it is done. This appears to be a long-standing bug, as the same behavior is noted in search2. I believe this is the first report of this behavior in PhiloLogic 2 or 3 (that is, in about 8 years).

This is a fairly important bug fix since searching for repeated words in close proximity is of interest to literary scholars.
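A minimal sketch of the fix idea, not search3's actual code: if the second term is only allowed to match at a position strictly after the anchor, an occurrence can never pair with itself. The sub name proximity_pairs and the flat lists of word positions are assumptions made for illustration, and the sketch only handles the ordered case (second word after the first).

```perl
use strict;
use warnings;

# Hypothetical helper, not part of search3: given two sorted lists of
# word positions, return every pair (a, b) with a < b and b - a <= $max.
# Requiring a < b is what stops an occurrence from matching itself when
# both lists describe the same word.
sub proximity_pairs {
    my ($first, $second, $max) = @_;
    my @pairs;
    for my $a_pos (@$first) {
        for my $b_pos (@$second) {
            next if $b_pos <= $a_pos;          # skip the anchor itself and anything before it
            last if $b_pos - $a_pos > $max;    # sorted input: nothing further can match
            push @pairs, [$a_pos, $b_pos];
        }
    }
    return @pairs;
}

# Positions of "day" in an imaginary text.
my @day  = (3, 5, 40, 42, 90);
my @hits = proximity_pairs(\@day, \@day, 3);
print join(' ', map { "$_->[0]-$_->[1]" } @hits), "\n";   # prints "3-5 40-42"
```

The isolated occurrence at position 90 produces no hit, which is the behavior a user searching for repeated words would expect.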

Toggle Display of Titles in Frequency by Period/Author Reports

Robert suggests:

I was wondering if we should not try to arrange the frequency result query and response differently. Currently, you get back, say, decades with lists of works. There is a turn-off title button. I'm wondering if we could not have that be the default mode... but then when you get the list of decades with their frequencies... you could click on the decade and the titles for that decade would open up below (not in a java box) for that decade...


Permit phrase search on collocation report

Robert asks if we can support phrase searching for collocation table reports. Currently, I am limiting the permitted searches when getting a collocation report to single vectors of words, such as grand*. I did this because it would be difficult to decide what would be the "pole" in a multiword search. Robert suggests permitting searches for "grand* homme*". This will need modification to search3t and artfl_pole.pl. The logic of artfl_pole.pl is pretty hardwired to the single word per hit model, so it may need some significant hacking.

Search link back from collocation report

Robert suggests that we allow users to click on a word in the collocation report and run a search, presumably in the context of the pole word and bibliographic specification. This too will require modification to artfl_pole.pl. Looking at it, I think I erred in making artfl_pole.pl a standalone function. Next revision, I'll make it a search3t subroutine, which will make this and the above modification more coherent.


Sort on DivHead words in Frequency by DivHead Report

Martine Groulx reports that the Frequency by DivHead report sorts only on the frequencies of words appearing in each Div (Chambers Cyclopedia). These are typically dictionary or reference work entries. Charles also had to modify this report for a certain function in the Protestant database for ASP.

The solution is to leave hooks in either search3t: &dofreqbydiv or (more reasonably) philosubs.pl:&DivHeadFreqLinks (since we already have a subroutine), to generate a new sort key and re-sort on frequency and divhead. MVO might try to hack that and add it to Patches.


Hyphens break word highlighting

If you have loaded a database with hyphens as a non-word-breaking character, and your search comes up with a result that has a hyphen in it, it won't highlight properly. The word will be highlighted until the hyphen, then not.

Perhaps the word pattern that is used to load the database could be saved in philosubs and then used to do the highlighting too?

You can solve this by editing ConcSpan in philosubs.pl and removing - or any other character that should be considered as part of a word from the pattern.
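As a rough illustration of the "save the word pattern and reuse it" idea, and not the actual ConcSpan code: the sketch below keeps one definition of a word character and uses it for the highlighting boundaries. The names $WORDCHARS and highlight are made up for the example.

```perl
use strict;
use warnings;

# Assumed for illustration: one shared definition of what counts as a
# word character, here letters, digits, underscore and the hyphen.
my $WORDCHARS = qr/[\w-]/;

# Hypothetical highlighter: wrap each whole-word match of $term in <b>
# tags. The lookarounds use the same character class, so a hyphen inside
# the word no longer ends the highlighted span.
sub highlight {
    my ($text, $term) = @_;
    $text =~ s{(?<!$WORDCHARS)(\Q$term\E)(?!$WORDCHARS)}{<b>$1</b>}g;
    return $text;
}

print highlight("a well-known text", "well-known"), "\n";
# prints "a <b>well-known</b> text"
```

If the loader and the highlighter both read the pattern from one place, they cannot drift apart the way they do now.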

Triple Search Results for document 0

Vincenzo Lomiento lomiento_AT_alice.it reports that PhiloLogic produces each hit 3 times when the database holds a single document and some metadata is specified in addition to the word to be searched. It works fine without specifying metadata, which is probably why we have not seen this. Vincenzo also reports that this occurs in cases where the search produces more than 9 hits.

Some tests to run: will this behavior be replicated in cases of large databases where the first loaded document -- document 0 -- is specified by metadata?

My guess is search3 is failing to initialize properly on the "0", probably treating it as a failed logic test, whereas it is a valid document id.

Another solution may be simply to uniq the hitlist -- recall that it produces EXACTLY the same hits, including byte offsets, which is impossible in normal operation.
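Both ideas can be shown in Perl terms, since that is where the "0" problem usually comes from. This is only a sketch under the assumption that the id is tested for truth somewhere; search3 itself may do something different. The sub names and the "docid offset" hit strings are invented for the example.

```perl
use strict;
use warnings;

# Truth test: the string "0" is false in Perl, so this rejects a valid id.
sub has_docid_truthy {
    my ($id) = @_;
    return $id ? 1 : 0;
}

# Definedness and format test: accepts 0 like any other non-negative integer.
sub has_docid_defined {
    my ($id) = @_;
    return (defined $id && $id =~ /^\d+$/) ? 1 : 0;
}

# Drop exact duplicate hits while keeping the original order.
sub uniq_hits {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}

print has_docid_truthy("0"), " ", has_docid_defined("0"), "\n";   # prints "0 1"

# Hypothetical hit records: "docid byteoffset".
my @hits   = ("0 1520", "0 1520", "0 1520", "0 3310");
my @unique = uniq_hits(@hits);
print scalar(@unique), "\n";   # prints "2"
```

Deduplicating would hide the symptom; fixing the test on the document id would address the likely cause.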

Missing 'f' in sprintf on line 960 in search3t

Title says it all. Will be fixed in the next release.

Add sorted word count navigation

Robert suggests that we add a little navigation function to allow users to click on a letter [a,b,c] for word count files. This needs to be hooked into getwordcount.pl. It will only work for alphabetized lists.
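One way this could work, sketched outside of getwordcount.pl: scan the alphabetized list once and remember where each initial letter first appears, then point the letter links at those offsets. letter_index is a hypothetical name, and the sketch assumes plain ASCII initials; accented letters would need the list decoded to characters first.

```perl
use strict;
use warnings;

# Hypothetical helper: map each initial letter to the index of the first
# word starting with it. Assumes the list is already alphabetized.
sub letter_index {
    my @words = @_;
    my %first;
    for my $i (0 .. $#words) {
        my $letter = lc substr($words[$i], 0, 1);
        $first{$letter} = $i unless exists $first{$letter};
    }
    return %first;
}

my %idx = letter_index(qw(abbey able baron bread castle));
print join(' ', map { "$_=$idx{$_}" } sort keys %idx), "\n";   # prints "a=0 b=2 c=4"
```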

Investigate TRE agrep support for approximate matches

Allow for fuzzy matching with regexp:

http://laurikari.net/tre/download.html

TRE agrep can handle full regular expressions and approximate matching via edit distance. This would allow us to add similarity matching to regexp searches.
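To make "approximate matching via edit distance" concrete, here is a plain Levenshtein distance in Perl. This is a textbook stand-in for illustration only; it is not TRE and does not use its API, and the sub name edit_distance is made up.

```perl
use strict;
use warnings;

# Textbook Levenshtein distance: the minimum number of single-character
# insertions, deletions and substitutions needed to turn $s into $t.
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));
    for my $i (1 .. length($s)) {
        my @cur = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            my $best = $prev[$j - 1] + $cost;                           # substitute or keep
            $best = $prev[$j] + 1     if $prev[$j] + 1 < $best;         # delete from $s
            $best = $cur[$j - 1] + 1  if $cur[$j - 1] + 1 < $best;      # insert into $s
            push @cur, $best;
        }
        @prev = @cur;
    }
    return $prev[-1];
}

print edit_distance("homme", "home"), "\n";       # prints "1"
print edit_distance("kitten", "sitting"), "\n";   # prints "3"
```

A similarity search would then accept index words whose distance from the query falls under some threshold; TRE does the equivalent for whole regular expressions.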

contextualize.pl doesn't know when we are at the end of the document, throws up bogus next arrows

If you click on the page number link from a results page, contextualize.pl will bring you the page content, but it doesn't know about where the document ends, so it will put up "Next" or "Previous" page arrows even if you are at the last or first page of a document.
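The display logic itself is simple once the document's first and last page are known; getting those two values to contextualize.pl is the real work. A sketch with a hypothetical page_links sub, assuming numeric page numbers:

```perl
use strict;
use warnings;

# Hypothetical helper: decide which arrows to show for the current page,
# given the first and last page of the document.
sub page_links {
    my ($page, $first, $last) = @_;
    my @links;
    push @links, 'Previous' if $page > $first;
    push @links, 'Next'     if $page < $last;
    return @links;
}

print join(', ', page_links(1, 1, 10)), "\n";    # prints "Next"
print join(', ', page_links(5, 1, 10)), "\n";    # prints "Previous, Next"
print join(', ', page_links(10, 1, 10)), "\n";   # prints "Previous"
```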

PDF text recognizer

Robert Scholes (Brown) suggests that a PDF text recognizer (loader) would be a useful addition. This would probably work like a plaintext loader without document structure, but you could get most of the reporting functions. I'll have to think about it a bit more. Might need to create a "dummy" document in the background. MVO.

Fix 32K document limit

As noted in the Optional Code section, we still have a 32K document limit, which can easily be bumped to 64K documents. Probably just need to redo the unpack function when reading search hits.
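A jump from 32K to 64K is what you would see when a 16-bit value stops being read as signed and is read as unsigned instead. Assuming that is what is going on here (I have not checked the search code), the difference in unpack looks like this:

```perl
use strict;
use warnings;

# Two bytes holding the value 40000, packed as an unsigned 16-bit integer.
my $bytes = pack('S', 40000);

my $signed   = unpack('s', $bytes);   # signed 16-bit: wraps past 32767
my $unsigned = unpack('S', $bytes);   # unsigned 16-bit: good up to 65535

print "$signed $unsigned\n";   # prints "-25536 40000"
```

Going beyond 64K documents would need a wider field in the index, not just a different unpack template.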

Total word count for user selected groups of documents

We currently have word frequencies for individual documents and various word counting functions from searches. Jacques Guilhaumou suggests that we implement a word count feature for groups of documents, such as a global word count for all documents by a particular author. This would not be difficult to do. One way to do this would simply be to have a link from a bibliography search with the total word count. This could push to a variant of the current getwordcount.pl, which would simply sum the counts from a list of philodocids: getwordcount.pl?DBNAME.OBJ:OBJ:OBJ:OBJn.sortorder We could simply put a switch in to see if we have one OBJ or many, and then use the rest of the same code. We would need some thinking about this. 01/08
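The summing step might look like the sketch below. It assumes each document's counts are already available as a hash of word to count, which is not how getwordcount.pl reads them today; sum_counts is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: add up per-document word counts into one table.
# Each argument is a hash reference of word => count for one document.
sub sum_counts {
    my @docs = @_;
    my %total;
    for my $doc (@docs) {
        $total{$_} += $doc->{$_} for keys %$doc;
    }
    return %total;
}

my %total = sum_counts({ le => 10, roi => 2 }, { le => 7, reine => 1 });
print join(' ', map { "$_=$total{$_}" } sort keys %total), "\n";
# prints "le=17 reine=1 roi=2"
```

With a single document the loop runs once and the result is the same as the current per-document count, so the one-or-many switch may not need separate code paths.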

Using a user-selected bibliography, searches that return no results display erroneous bibliographies

If you run a null search on a database (by simply pressing "enter" in the search box or hitting "Submit" without filling in any search criteria), then select a bibliography using the checkboxes, then run a search that returns no results, the bibliography display contains duplicates and omits some entries. For example:

http://philologic.uchicago.edu/cgi-bin/philologic3/search3t?dbname=docsouth&word=foober&CONJUNCT=PHRASE&DISTANCE=2&PROXY=or+fewer&multidocid=72&multidocid=738&multidocid=1032&multidocid=418&multidocid=53&OUTPUT=conc&DFPERIOD=1&POLESPAN=5&THMPRTLIMIT=1

The solution is to sort the bibliographic result array before printing the bibliography.
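If the array holds philodocids, the sort should be numeric; Perl's default sort compares strings and would put 1032 before 418. A small illustration using the multidocid values from the URL above (that the array holds bare ids is an assumption):

```perl
use strict;
use warnings;

# Document ids in the order the checkboxes sent them.
my @multidocid = (72, 738, 1032, 418, 53);

my @by_string = sort @multidocid;                # default: string comparison
my @sorted    = sort { $a <=> $b } @multidocid;  # numeric comparison

print "@by_string\n";   # prints "1032 418 53 72 738"
print "@sorted\n";      # prints "53 72 418 738 1032"
```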

It also seems that you still can't select the document with philodocid of 0. Try running a search here:

http://robespierre.uchicago.edu/philologic/pmt3.whizbang.form.html

Select all documents... 0 doesn't appear.

Add Frequency by Div object formatting to philosubs.pl

Add a function to allow modification of result display for frequency by Div object, like chapters. Specific request to add author to Encyclopedie frequency by article report.

Add field sorting/counting for Frequency by Div Objects

Add a function, like we have for bibliographic metadata, to allow the user to get counts by selected metadata. Again, this is specific to the Encyclopedie. Frequency by author, class of knowledge, etc.

textload.cfg should be database specific

We ought to have textload.cfg be database-specific in some manner, or at least put a copy of it into the database directory after running a load, so you can go back and see what parameters you loaded it under.

Textload can generate errors if CHARSNOTTOINDEX reduces word to null string

If you have words that are entirely made up of bytes that match the pattern in your CHARSNOTTOINDEX, they will be reduced to nothing and philoload will fail with an error about counts differing. You can solve this with a hack like this:

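 # Remember the word as read, in case stripping empties it.
 $oldword = $theword;
 if ($CHARSNOTTOINDEX) {
     $theword =~ s/($CHARSNOTTOINDEX)//g;
 }
 # Fall back to the original word rather than index an empty string.
 if ($theword eq '') {
     $theword = $oldword;
 }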

Edit Install page to reflect *reality*

For example, indicate that Mac OS X is the main operating system we're supporting. Currently reads:

...if you were on Mac OS X (which is close to being supported but
is really cranky right now and I don't recommend trying it unless you
want to bang your head against it to make it work...

Theme-Rheme fails to sum Middle

Middle of Clause: out of 0

This should have a number and total count.

Sorted KWIC description message is incorrect

By default, kwicresort.pl points the page description at the wrong message: it uses 240 and should use 190. But we have to modify the display a bit, which might require a new message.

Compilation Failure of Index verification on 64 bit Ubuntu

[Blog Entry]

Docid 256 search results appended to low byte orders

For high frequency words when metadata returns low byte ids, results from docid 256 may be appended.