
a database project for the study of Semitic roots

History of the Sembase Project

towards a synoptic Semitic resource

In Cairo in the seventies and eighties, students, professors and researchers met in each other's apartments without notice, scouted out Egyptian bars prized for the fact that no tourists would ever be found there (including an interesting bouza dive), and broke the crust together at various restaurants, often the old Cafe Riche (the Filfila being then only a few tables in an alleyway). Over Stella beer, our conversation ranged from politics to our research to the gossip of the academic expat community. It was then that I began calling my friends' attention to various Arabic roots that seemed related. The Stella encouraged some creativity in drawing these "relationships." At some point a friend asked if I was recording my observations. I was not. Another objected that with no controls, one can make anything into anything. This stimulus, and a bit more Stella, prompted the ultimate fantasy: what if there were a tool that would quickly search all essential lexical sources of all Semitic languages and display all information that might be relevant to evaluating a proposed relationship?

On a trip to Kharga Oasis, I took with me pages of typing paper cut into fourths and began recording "xliterals" (triliterals, biliterals, etc.) by assigning them to semantic categories. It occurred to me that although not all roots in a category would be related (far from it), the roots that are related should fall within the same semantic category. By recording only roots, and using sufficiently broad categories, I hoped to arrive at groups that would be amenable to human inspection, to facilitate the identification of root pairs or groups that might be related. Eventually, I went through several Arabic dictionaries, and arrived at about 34,000 records. Then I began doing the same for Hebrew.

Along the way I received training in database construction. But I was determined to use only a database program that would allow me to format entries much like dictionary entries (in addition to other fields), so that publishable reports could be produced with little additional formatting. For some time, no database software that I knew of provided this facility. Eventually, Paradox offered the "formatted data" type. I thought (wrongly) that Access must surely have the same type, and bought it. When I failed to find it in the manual, I called Microsoft and was informed that Access indeed did not permit complete text formatting within a single cell (although an aftermarket product claimed to give it this capability): if one italicized a single word in just one cell, the whole field became italic. So I went with Paradox. Subsequently Corel decided to cease further development of the product. Paradox will suffice quite nicely for the production of paper publications, but if Sembase is ever to be made available in electronic format, the data will at some point have to be migrated to another platform.

Note: This database has been designed for cultural and historical research as well.


Project Concept

The first step in designing a database is to identify the components. Lexical research deals with two basic components: phonology and semantics. The Semitic languages are based on consonantal roots, most often triliterals. Vowels are of substantially less importance. The term root is a bit vague. At times it may refer (A) to a stem or base form. In this sense, at a single point in time, a root is a phonological component that carries the base meaning, and it may be put through various forms to generate vocabulary. The term root may also refer (B) to an etymological root. It is then a lexeme at some point in the history of a language from which later forms developed. In addition, the term root may be (C) an abstraction, referring to an xliteral, usually a triliteral, without reference to language or semantic content. Even further, it may refer to (D) an xliteral together with that part of its semantic content that shares a common developmental history, as distinct from the same xliteral with some other semantic content that has a different history (both within the same language). In this case, if we assume that two historical roots, each with different xliterals, accidentally converged through one or more sound shifts so that they later have the same consonants but retain their different meanings, it would still be possible to consider them different roots although they share the same xliteral. Lexicographers often indicate this by registering the two as separate dictionary entries. Finally, and jumping out of the linguistic cage, we can also refer to (E) the cultural roots of a language. For example, two Semitic languages may associate rendering judgment with "cutting," although their respective expressions use unrelated "roots" meaning "to cut."

This project has to consider the possibility of accidental convergence any time the same xliteral carries substantially different meanings. Thus the same xliteral may be registered in more than one semantic category in the database. If it is found in three languages/dialects, and is entered in three semantic categories in the first, two in the second and one in the third, then it will be entered six times in the database (3 + 2 + 1). Therefore the root field must permit duplicate entries, and database records are unique only when four fields are combined: root, language, dialect and concept (the semantic category). These four fields together are defined as the key, and no two records may share the same combination of all four.
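The four-field key described above can be sketched in a few lines. This is only an illustration using Python's built-in sqlite3 module, not the Paradox tables Sembase actually uses; the table name `roots`, the column names, and the values "XYZ", "Lang1", "standard" and "cat_a" are placeholders invented for the example.

```python
import sqlite3

# In-memory database for illustration only; Sembase itself is kept in Paradox.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE roots (
        root     TEXT NOT NULL,   -- the xliteral; duplicates are allowed here
        language TEXT NOT NULL,
        dialect  TEXT NOT NULL,
        concept  TEXT NOT NULL,   -- the semantic category
        PRIMARY KEY (root, language, dialect, concept)
    )
""")

# Placeholder data: one xliteral in three languages/dialects, entered in
# three categories in the first, two in the second and one in the third.
PLACEHOLDER_ROWS = [
    ("XYZ", "Lang1", "standard", "cat_a"),
    ("XYZ", "Lang1", "standard", "cat_b"),
    ("XYZ", "Lang1", "standard", "cat_c"),
    ("XYZ", "Lang2", "standard", "cat_a"),
    ("XYZ", "Lang2", "standard", "cat_d"),
    ("XYZ", "Lang3", "standard", "cat_a"),
]


def insert_record(connection, record):
    """Insert one record; return False if the four-field key already exists."""
    try:
        with connection:  # commits on success, rolls back on error
            connection.execute("INSERT INTO roots VALUES (?, ?, ?, ?)", record)
        return True
    except sqlite3.IntegrityError:
        return False


results = [insert_record(conn, row) for row in PLACEHOLDER_ROWS]
entry_count = conn.execute(
    "SELECT COUNT(*) FROM roots WHERE root = ?", ("XYZ",)
).fetchone()[0]

# Re-entering an existing root/language/dialect/concept combination is refused.
duplicate_accepted = insert_record(conn, PLACEHOLDER_ROWS[0])

print(entry_count)         # 6
print(duplicate_accepted)  # False
```

The same root value appears six times, yet each record remains unique because uniqueness is enforced only on the combination of all four fields.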

When I had completed my Arabic set of about 34,000 records, on paper "cards," I had about 340 semantic categories, ranging from "to cut" (cut, snip, shear, tear, trim etc.) to "birds." It was clear that it would be impossible to handle many more than that. These were determined empirically, i.e., they were created when needed (when a root would not fit in an existing category); and when categories were found to be nonproductive, they were collapsed into a semantically similar category. Later, when the data were entered into the database, with this initial sort accomplished, it was then possible to assign the members of each category to subcategories. During this process, some categories were found to be more appropriately subcategories of another category. Thus by the time the data for "Written Arabic" were entered into the database, the number of main categories had dropped to 259, and the number of subcategories had reached nearly 2,000. I could never have sorted the material into 2,000 categories in a single pass. When Arabic was finished, drop-down data entry lists were created for each category to facilitate the assignment of the data from the remaining languages to the semantic subcategories, thereby enabling data entry in a single pass. The number of main categories continues to decrease, with no loss of subcategories (and hence no loss of information). As the project moves into a data analysis stage, a full review of the data and these categories will take place, and the categories and subcategories will be adjusted as needed.

It is necessary to find as many "characteristics" of a record item as possible, and code them into the database. Only in this manner can the database manipulate the data to produce results of use to the researcher. For example, a characteristic may be whether the first two consonants (C1 and C2) of a triliteral are the same. It may be whether two records have the same meaning and the same consonants, but with metathesis. It may be whether two roots have the same meaning, two of three consonants the same, and no metathesis. All of these characteristics will be determined for each record in the data analysis stage and recorded in the database.
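The three example characteristics could be computed along the following lines. This is a rough sketch under stated assumptions, not the project's own procedure: roots are represented as tuples of consonant symbols, "same meaning" is approximated as "same semantic category," "two of three consonants the same" is read as two matching consonants in the same positions, and all function names and sample consonant sequences are invented for the example (the sequences are arbitrary and not claimed to be attested roots).

```python
def first_two_identical(root):
    """True when C1 and C2 of the root are the same consonant."""
    return len(root) >= 2 and root[0] == root[1]


def is_metathesis(root_a, root_b):
    """True when both roots have the same consonants in a different order."""
    return (
        len(root_a) == len(root_b)
        and root_a != root_b
        and sorted(root_a) == sorted(root_b)
    )


def shares_two_in_place(root_a, root_b):
    """True when two triliterals agree in exactly two of three positions."""
    if len(root_a) != 3 or len(root_b) != 3:
        return False
    return sum(a == b for a, b in zip(root_a, root_b)) == 2


def compare_pair(root_a, concept_a, root_b, concept_b):
    """Label a pair of records, or return None if no characteristic applies.

    Only pairs in the same semantic category are considered, as a stand-in
    for the "same meaning" condition.
    """
    if concept_a != concept_b or root_a == root_b:
        return None
    if is_metathesis(root_a, root_b):
        return "metathesis"
    if shares_two_in_place(root_a, root_b):
        return "two_of_three"
    return None


# Arbitrary consonant sequences used purely as placeholders.
print(first_two_identical(("m", "m", "d")))                           # True
print(compare_pair(("k", "t", "b"), "cat_a", ("k", "b", "t"), "cat_a"))  # metathesis
print(compare_pair(("k", "t", "b"), "cat_a", ("k", "t", "m"), "cat_a"))  # two_of_three
```

Holding each root as a tuple rather than a plain string leaves room for transliterations in which a single consonant is written with more than one character.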
