xindy

A Flexible Indexing System


Re: xindy, omega and Unicode

Yannis Haralambous writes:

 > Hi,
 > you may have heard about the Omega extension of TeX. We are using
 > 16-bit tables internally, so that in general our system is based on
 > Unicode. We need an indexing utility compatible with this scheme.
 > Would it be possible to upgrade xindy to be 16-bit compatible?
 > Here is what we need:
 > foo.idx files will contain characters in extended hexa notation:
 > ^^^^0123^^^^abcd^^^^0080 and so on
 > It should be possible to write a merge/sort file of the type
 > (merge-rule "^^^^41d8" "A") 
 > where A is the 16-bit character of hexa value 41D8
 > or to have a notation like in the newest perl (with utf8 module)
 > (merge-rule "^^^^41d8" "\x{41d8}")
 > This means that there can (and will) be more than 256 letter groups, and
 > that it also should be possible to define groups of groups (Latin
 > entries, then Greek entries, then Cyrillic entries, and so on).
 > Is this possible? If yes, in the short range? in the long range?

xindy is based on the CLISP implementation of Common Lisp. Additional
libraries for managing regular expressions (namely the GNU Rx library)
are used for the merge and sort rules. None of the listed components
directly supports 16-bit Unicode characters.

One could - at least to some extent - use the merge- and sort-rules to
achieve the results you need in an ad-hoc manner, though several
problems might arise:

- In all of the above systems strings are null-terminated, i.e.,
  Unicode characters of the form \x{yy00} or \x{00yy}, which contain
  a zero byte, cannot be properly handled.

- Merge- and sort-rules need to be 16-bit aligned for proper
  operation. Currently alignment occurs only on 8-bit (character)
  boundaries. To give an example, the rule

    (merge-rule "A" "a")

  (don't know if that makes any sense at all) applied to the character


  will result in a substitution which you probably don't want.
  One could circumvent this by applying "boundary characters", i.e.,
  encode the above string differently such as 

    "4A 3 xy ..."

  but obviously you will run into other troubles then.

- Another problem is the number of rules in the substitution database.
  The current solution will probably not scale well if several
  thousand substitution rules are in the database. I can only expect
  that things will slow down significantly. There is an internal
  hash table for efficient encoding of substitutions, which would
  first need to be expanded from 8 to 16 bits. Further optimization
  might be needed.

- As you already mentioned, the letter groups must be expanded to 16
  bits as well.
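
The first two problems can be reproduced outside xindy. The following
Python sketch is only an illustration under stated assumptions: it
encodes characters as big-endian 16-bit units (UTF-16-BE) and mimics
an 8-bit tool with plain byte operations. The helper names
encode16, c_string_view and aligned_replace are hypothetical and are
not part of xindy, CLISP or the Rx library.

```python
def encode16(text: str) -> bytes:
    """Encode text as big-endian 16-bit units (assumes BMP characters only)."""
    return text.encode("utf-16-be")


def c_string_view(data: bytes) -> bytes:
    """What a null-terminated string routine sees: the bytes before the first 0."""
    return data.split(b"\x00", 1)[0]


def aligned_replace(data: bytes, old: bytes, new: bytes) -> bytes:
    """Replace `old` by `new`, but only at 16-bit (two-byte) boundaries."""
    assert len(old) % 2 == 0 and len(new) % 2 == 0
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i:i + len(old)] == old:
            out += new
            i += len(old)
        else:
            out += data[i:i + 2]   # copy one 16-bit unit unchanged
            i += 2
    return bytes(out)


# Problem 1: "Ab" is 00 41 00 62, so a C-style routine sees an empty string.
print(c_string_view(encode16("Ab")))

# Problem 2a: an 8-bit rule "A" -> "a" hits the high byte of U+4161.
print(ascii(encode16("\u4161").replace(b"A", b"a").decode("utf-16-be")))

# Problem 2b: a 16-bit rule matched at an odd offset corrupts two characters.
data = encode16("\u6100\u4100")                        # bytes 61 00 41 00
print(data.replace(b"\x00A", b"\x00a") == data)        # unaligned: data changed
print(aligned_replace(data, b"\x00A", b"\x00a") == data)  # aligned: untouched
```

Scanning in two-byte steps is the simplest way to enforce alignment
here; it says nothing about how the Rx-based rule matching in xindy
would have to be changed to do the same.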

To sum up the above considerations, I think that there is a
substantial amount of work to do to extend xindy from 8 to 16 bits,
because it cuts across the inner workings of xindy dealing
with keywords at almost all levels.

A better approach might be to rework the whole model of merge- and
sort-rules into a more modular architecture that models letters as
objects. We discussed some of these aspects more than a year ago
on this list. Your requirements are in fact further arguments to rethink
the whole model of merging and sorting on a character basis without
higher-level concepts, which I consider to be vital for future
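
To make the idea of letters as objects a little more concrete, here is
a minimal Python sketch. It is not xindy's design: the Letter class,
its fields, the ALPHABET table and the decode_hex_notation helper are
all hypothetical, and the script order (Latin, then Greek, then
Cyrillic) simply follows the ordering asked for in the quoted request.

```python
import re
from dataclasses import dataclass

# Hypothetical "groups of groups": scripts are ranked first, letters second.
SCRIPT_RANK = {"Latin": 0, "Greek": 1, "Cyrillic": 2}


@dataclass(frozen=True)
class Letter:
    char: str     # the character itself
    script: str   # outer group (group of letter groups)
    group: str    # letter group the entry would be filed under
    rank: int     # position of the letter within its script

    def sort_key(self):
        return (SCRIPT_RANK[self.script], self.rank)


def decode_hex_notation(raw: str) -> str:
    """Turn each ^^^^xxxx (four hex digits) into the character it denotes."""
    return re.sub(r"\^\^\^\^([0-9a-fA-F]{4})",
                  lambda m: chr(int(m.group(1), 16)), raw)


# Tiny illustrative table; a real one would cover whole alphabets.
ALPHABET = {
    "a": Letter("a", "Latin", "A", 0),
    "b": Letter("b", "Latin", "B", 1),
    "\u03b1": Letter("\u03b1", "Greek", "\u0391", 0),     # Greek alpha
    "\u0431": Letter("\u0431", "Cyrillic", "\u0411", 1),  # Cyrillic be
}


def key_for(raw: str):
    """Sort key of an index entry: one (script, rank) pair per letter."""
    return [ALPHABET[c].sort_key() for c in decode_hex_notation(raw)]


entries = ["^^^^0431", "b", "^^^^03b1", "ab"]
print(sorted(entries, key=key_for))
# prints ['ab', 'b', '^^^^03b1', '^^^^0431']
```

The point of the sketch is only that ordering and grouping become
properties of letter objects rather than the result of byte-level
string substitutions.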

I personally will not be able to change xindy in the way needed, but
I'll provide any help I can to others who want to do so. I even think
that there is a lot of potential for research work in this area (at
least more than enough for a computer science diploma thesis) on more
general frameworks for this kind of problem.

One thing I've learned from the xindy project is that indexing is far
more complex than I ever thought and that it is hard to find good
trade-offs for providing practical solutions to this problem.

You are welcome to further discussions...

Cheers, Roger
Roger Kehr
Computer Science Department         Darmstadt University of Technology