home ~ history ~ concept ~ application aiding design ~ semitic languages ~ Sembase applications

a database project for the study of Semitic roots

Application Aiding Design

Project Design

Creating an initial table was relatively easy. After over a decade of working with the material on paper, I had a fairly good idea of what fields I would need. When Paradox offered the "formatted data" type, I faced a very different problem: font. First, I had to determine what system of transliteration I would use. For consonants, I adopted a system in common use among specialists in comparative Semitic linguistics. Some issues have been put off until I get to them, such as transliterating the sibilants in Old South Arabian (manipulation of data prefers one character for each sound). But how to type such a font? It was clear that I would not be able to complete this project in one lifetime if I could not more or less speed-type the transliteration.

It turned out that I needed to create my own font. First, I knew of no font that would handle the transliteration I needed. Second, I needed the font to sort properly. The first task was to determine how many characters I needed and then how many slots of the font table the Paradox database engine supports (since certain slots are reserved for software developers). To do this, I used the numeric pad to enter into Paradox each slot available to font designers in the font table, and noted whenever I got the message "the database engine does not support this character." In this manner I identified which slots I could use. It turned out that they totaled only enough for me to fill one font table with just the lower case, plus the letters for European languages, and common symbols. So I had to create two fonts, one for lower case, and another for upper case.

When putting my characters into the font table, where would I put them? I could leave the letters of the conventional English alphabet where I found them, and add the letters with diacriticals in any available slots left over. But if I were to rely on a stock alphabetization routine, "d-underscore" and "d-dot" would not be sorted in two respective groups following "d". The solution was to place the characters in the slots in the order I wanted them to alphabetize, and use a sort routine rather than an alphabetization routine. Using this sort order I have full control over alphabetizing material in the font.
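The idea of sorting by slot position rather than by a stock alphabetization routine can be sketched as follows. This is a minimal illustration in Python, not the Paradox mechanism itself: the `SLOT_ORDER` list, the `slot_key` function and the sample words are all assumptions made for the example, and Unicode characters stand in for the slots of the custom font.

```python
import unicodedata

# Hypothetical slot order: a character's index stands in for its font-table slot.
# Letters with diacriticals are placed directly after their base letter.
SLOT_ORDER = ["a", "b", "d", "ḏ", "ḍ", "g", "h", "ḥ", "k", "t", "ṯ", "ṭ"]
SLOT = {unicodedata.normalize("NFC", ch): i for i, ch in enumerate(SLOT_ORDER)}


def slot_key(word):
    """Return the list of slot numbers for a word, used as its sort key."""
    return [SLOT[ch] for ch in unicodedata.normalize("NFC", word)]


# Made-up strings, chosen only to show the grouping.
words = ["ḍab", "dka", "ḏab", "dab", "ṭab", "tab"]

custom_sorted = sorted(words, key=slot_key)  # sort by slot position
codepoint_sorted = sorted(words)             # stock ordering by code point
```

With the slot-based key, all words in plain "d" come first, then those in "d-underscore", then those in "d-dot", whereas the stock ordering pushes every letter with a diacritical to the end of the list.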

This was fine, but it meant that there would be virtually zero correspondence between the font table and the keyboard. Normally, each key addresses a particular slot, and the slot contents had been shifted from their original locations, in a sort of musical chairs. This for me was no obstacle, since I owned a Gateway Anykey keyboard (now unfortunately no longer in production). I could thus program the keyboard so that all letters without diacriticals corresponded to the keyboard, and characters with a line diacritical (underscore and macron) use the shift key. All letters formed with a dot are typed using the control key. All letters with the breve (and a few others when there is only a third alternative) use the alt key. A few additional characters were located on the spare set of function keys. This procedure enabled me to acquire adequate typing speed. Since this keyboard is essential to data entry, I have acquired five or six of them, located on different continents!
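The modifier scheme described above can be modeled roughly as a lookup from (key, modifier) to a character. The sketch below is only an illustration under stated assumptions: the Anykey keyboard was programmed in hardware, not in Python, and the names `MARKS`, `VOWELS` and `type_char` are invented here. Unicode combining marks stand in for the slots of the custom font.

```python
import unicodedata

VOWELS = set("aeiou")

# Assumed mapping of modifier keys to combining marks.
MARKS = {
    "ctrl": "\u0323",  # dot below
    "alt": "\u0306",   # breve
}


def type_char(key, modifier=None):
    """Return the character produced by a key with an optional modifier."""
    if modifier is None:
        return key  # letters without diacriticals match the keyboard
    if modifier == "shift":
        # Line diacritical: macron over vowels, underscore beneath consonants.
        mark = "\u0304" if key in VOWELS else "\u0331"
    else:
        mark = MARKS[modifier]
    return unicodedata.normalize("NFC", key + mark)


typed = [type_char("d"), type_char("d", "shift"),
         type_char("d", "ctrl"), type_char("a", "alt")]
```

Here `typed` holds plain "d", "d-underscore", "d-dot" and "a-breve", one per modifier layer.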

Although there exists a comparative Semitic standard for the representation of consonants, the same is not true for the vowels. This is partly due to the disproportionate importance of consonants in these languages. I have represented the vowels of Arabic according to norms common among Arabists, and the vowels of Hebrew according to the norms in Hebrew studies. The representation of the vowels in the Semitic languages of Ethiopia differs with respect to the short and long "a". I have represented Geez vowels according to the norms of Geez specialists, and Amharic vowels according to the norms for Amharic studies. Tigre is sometimes vowelled according to Geez, and sometimes according to Amharic, so I have adopted the Geez treatment of "a" for Tigre, because this is the same as in the rest of the Semitic languages. But all other Semitic languages of Ethiopia are treated according to Amharic norms. Within Aramaic, the respective norms are applied for Jewish Literary Aramaic and Syriac. All inscriptional materials are transcribed with only the consonants as found in the text, even when "w", "y" and aleph may have been used to indicate vowels, with the exception of Aramaic inscriptions, where aleph is transcribed with "a macron" when it is the emphatic ending.

I am presently debating whether to standardize the representation of the feminine ending, when it is just a vowel, on the norm for Hebrew ("a" circumflex). In harmony with Aramaic lexicography, my citation form for Aramaic is the emphatic, where the "t" of the feminine ending appears.

The Paradox memo data field enables me to change font within a single cell. This allows capitalization, which requires changing to the font version for capital letters. I was also able to develop a font to enter the Greek alphabet into the records, as needed. It was designed to enable rapid typing without having to master the complex keyboard layout normally used for Greek versions of modern software (which would not have been compatible with the Paradox database engine).

The following graphic of my data entry form for written Arabic illustrates the fields:

Sembase Data Entry Form
(note that clicking on tabs reveals additional options)

Note that the root appears four times. A bit of an overkill, no? No. In the field labeled "root," the xliterals "root consonants" are entered in one field. When searching for a root, the alternative to this field would be to search the fields labeled "Entry One" or "Entry Two." These are formatted data fields. Paradox stores them in a separate file, not in a table file. Although I am satisfied with the search speeds on these fields, searching the table itself, and only a five-character field, is faster.

The second root entry (labeled "Alph.") consists of five fields, one for each consonant, up to five. This allows sorts on the third consonant, for example, much like the early Arabic dictionary Kitab al-Ayn. Sorting on the consonants from first to last facilitates finding potentially related roots that begin with the same consonant. But it tends to result in overlooking roots that do not. Keeping each consonant in its own field allows one to sort on the consonants in any order. If C1 and C3 have shifted, sorting on C2 will still place the two (or more) potential cognates in nearly adjacent positions.
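Sorting on the consonant fields in an arbitrary order can be sketched like this. The example is a minimal Python illustration, not the Paradox table itself: `split_root`, `sort_by_positions` and the sample roots are assumptions, and plain ASCII letter order is used in place of the custom font order.

```python
def split_root(root, width=5):
    """Spread a root over `width` consonant fields, padding with empty strings."""
    fields = list(root) + [""] * (width - len(root))
    return tuple(fields[:width])


def sort_by_positions(roots, order):
    """Sort roots on the given consonant positions (1-based), e.g. (3, 2, 1)."""
    return sorted(roots, key=lambda r: tuple(split_root(r)[i - 1] for i in order))


# Made-up three-consonant strings, used only to show the effect of the sort order.
roots = ["ktb", "qtl", "gdl", "ktl", "bdl"]

by_first = sort_by_positions(roots, (1, 2, 3))  # conventional C1 C2 C3 order
by_third = sort_by_positions(roots, (3, 2, 1))  # C3 first, then C2, then C1
```

Sorting on C3 first places "ktl" and "qtl" next to each other, although they begin with different consonants and are far apart in the conventional order.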

The third root entry (labeled "Phon.") is the same as "Alph." except that the font used has the characters distributed in the font table in the order of proximity in the oral cavity. Of course this is not a linear relationship. A method of establishing this order will be explained later. Sorting here results in an "alphabetization" that places "b," "f" and "m" in close proximity. This feature will result in orderings of large numbers of records so that those with consonants that are possible sound shifts will be in proximity with each other. And one can sort in any consonantal order (C1 C2 C3 or C3 C2 C1, for example). Because the analysis to determine the best-fit (most useful) order of characters in this font will be done when much more data has been entered into Sembase, the entry of the xliterals will be finalized at a later time.
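A rough sketch of such a phonetic "alphabetization" follows. The order in `PHON_ORDER` is purely hypothetical (the best-fit order has yet to be determined, as noted above); it merely groups labials first so that "b", "f" and "m" fall together. The function names and sample roots are likewise assumptions for illustration.

```python
# Hypothetical articulation order: labials, then dentals/alveolars, then velars and back.
PHON_ORDER = "bfmwtdnszlrkgqh"
PHON_RANK = {ch: i for i, ch in enumerate(PHON_ORDER)}


def phonetic_key(root, positions=(1, 2, 3)):
    """Rank the consonants at the given 1-based positions by articulation order."""
    return tuple(PHON_RANK[root[i - 1]] for i in positions)


def phonetic_sort(roots, positions=(1, 2, 3)):
    """Sort roots by articulation order, taking the consonants in any order."""
    return sorted(roots, key=lambda r: phonetic_key(r, positions))


# Made-up roots that differ only in the first consonant, plus one outlier.
roots = ["krb", "drk", "mrk", "brk", "frk"]

alphabetical = sorted(roots)
phonetic = phonetic_sort(roots)
```

In the alphabetical list, "drk" separates "brk" from "frk", and "krb" separates "frk" from "mrk"; in the phonetic list the three labial-initial roots are adjacent.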

The "Entry .." fields are designed to be dictionary style. The results of a query can be put into a report in dictionary format.

The top line of the data entry form lists the four fields that constitute the table "key." The key can be set to disallow duplicates. It can also link tables. The "language" field would seem straightforward, but it really isn't. What does one do with a root that occurs in an inscription with insufficient text to determine what language it is? It may be necessary to have a code for a Semitic inscriptional group.

Dialects pose the same problem. Sources may give roots for North Yemeni Arabic, but is this all one dialect? If a source focuses clearly on a subdialect, should it be lumped in with the larger dialect community to which it belongs? At this point, all written Arabic is given the designation "Written Arabic" (Arb wrt). Sorting Lisan al-Arab into ancient Arabic dialects may not even be possible, and certainly is a separate project. At some point it would be nice to be able to code records to indicate that they are attested in first-century-hijri sources (and first/second-century as well), even if they do not constitute a single dialect strictly speaking.

At present, the Arabic "dialect" field is used to differentiate "written Arabic" from spoken dialects such as "Arb Ymn" for Yemeni, and from inscriptional material (Arb ins).

Geez is almost exclusively "written," although inscriptional entries can be so designated (Gez ins).

Hebrew records are designated as being Biblical (Heb bib), inscriptional (Heb ins), post-Biblical (Heb pbh; modern is of no interest, and no distinction is made between pbh and medieval), and Qumran (Heb Qum).

Aramaic is separated into Imperial (Arm Imp), other Old Aramaic (Arm old), Daniel (Arm Dan), Ezra (Arm Ezr), Jewish Literary Aramaic (Arm JLA, using Jastrow, Dalman and Levy), Jewish Babylonian Aramaic (Arm JBA, principally Sokoloff), Jewish Palestinian Aramaic (Arm JPA), Judaean Aramaic (Arm Jud, using Sokoloff), Syriac (Arm Syr), Christian Palestinian Aramaic (Arm CPA), Mandaic (Arm Man), Samaritan (Arm Sam), Nabatean (Arm Nab, including Petra, Mada'in Salih and others), Hatrian (Arm Hat), some specific inscriptional sources (such as Sam'ali, Arm Sam), other inscriptional materials lumped together (Arm ins), and Egyptian Aramaic (Arm EgA). When JLA and/or JBA have a root or word in common with Syr (with the typical vowel differences), the record is designated as being "written" Aramaic (Arm wrt, or simply Arm in the "entry" field). Inscriptional Aramaic records usually reference the site. The source lexical works are Aramaic/Aramaic, Aramaic/Latin, Aramaic/German, Aramaic/French and Aramaic/English. Sorting out all of this Aramaic, using over fifty lexical sources to check for each record, makes for slow going.

The "Concept" field is edited with a drop-down list to ensure uniformity. The number of possibilities has dropped from around 340 to 130, since many were merged. Since the subconcept coding was retained in merging, there was no information loss. The subconcept field (labeled "Key") accepts one or more key words. Thus a key word "build" can be used for "build, construct" and any other synonyms. This is so one does not have to search on every possible synonym that a definition may use for roughly the same thing. A drop-down list of key words has been established for every concept category. The tabs labeled "Concepts" display these drop-down lists for data entry into the "Key" field, one for each concept. A drop-down list will enter only one key word, but will at the same time present the full list for inspection, in case additional key words need to be typed in manually. One first determines and enters the concept category, and then the subcategory/ies within it. All manual editing requires extensive and systematic proofing.
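The relationship between concepts, key words and synonyms can be sketched as a pair of lookup tables. Everything named here is hypothetical: the concept names, the key word lists, the synonym table and the two functions are invented for illustration and are not taken from Sembase. Only the "build, construct" example comes from the text above.

```python
# Hypothetical drop-down lists: the allowed key words for each concept category.
KEYWORDS_BY_CONCEPT = {
    "construction": ["build", "dig", "roof"],
    "motion": ["go", "run", "return"],
}

# Hypothetical synonym table: words a definition might use, mapped to one key word.
SYNONYMS = {"construct": "build", "erect": "build", "excavate": "dig"}


def normalize_keyword(word):
    """Reduce a word from a definition to its key word, if one is known."""
    word = word.strip().lower()
    return SYNONYMS.get(word, word)


def unlisted_keywords(concept, keywords):
    """Return normalized key words missing from the concept's drop-down list.

    These are the manually typed entries that would need proofing.
    """
    allowed = set(KEYWORDS_BY_CONCEPT[concept])
    return [k for k in map(normalize_keyword, keywords) if k not in allowed]


needs_proofing = unlisted_keywords("construction", ["Construct", "repair"])
```

Here "Construct" reduces to the listed key word "build", so only "repair" is reported as needing review.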

Note the drop-down entry fields for:

"1=2" (roots where C1 = C2, i.e., the first consonant is the same as the second),

"1=3" (roots where C1 = C3),

"Metathesis" (roots with the same or nearly the same meaning and the same consonants but in different order) and

"MRP" (minimal root pair, i.e., two roots with the same or nearly the same meaning, two of the three consonants the same, and no metathesis). In the example presented in this record (Arb wrt: WHT), the root is a member of six such pairs, involving "shifts" associated with C3, which in this record is "t". It is also a member of at least one metathesis pair. This is considered to be a "characteristic" of this root.

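The four characteristics listed above can be expressed as simple tests on the consonants. The sketch below covers only the consonantal side: whether two roots have "the same or nearly the same meaning" is a judgment that the code does not attempt. The function names are assumptions, and the sample strings are made up to exercise the tests, not attested pairs.

```python
def c1_equals_c2(root):
    """True when the first and second consonants are the same ("1=2")."""
    return len(root) >= 2 and root[0] == root[1]


def c1_equals_c3(root):
    """True when the first and third consonants are the same ("1=3")."""
    return len(root) >= 3 and root[0] == root[2]


def is_metathesis(a, b):
    """True when two roots have the same consonants in a different order."""
    return a != b and sorted(a) == sorted(b)


def is_minimal_root_pair(a, b):
    """True when two triconsonantal roots differ in exactly one position."""
    if len(a) != 3 or len(b) != 3:
        return False
    return sum(x != y for x, y in zip(a, b)) == 1


def shifted_position(a, b):
    """Return the 1-based position of the differing consonant in a minimal pair."""
    if not is_minimal_root_pair(a, b):
        raise ValueError("not a minimal root pair")
    return next(i for i, (x, y) in enumerate(zip(a, b), start=1) if x != y)


# Made-up strings: the second differs from the first only in C3.
pair_position = shifted_position("wht", "whd")
```

A pair that differs in two positions, such as a metathesis pair, fails the minimal-pair test, which matches the "no metathesis" condition in the definition.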