In today's world, even people in the same general discipline, but specialized in different areas, may not understand each other's work. Nonlinguists working on Middle East topics, or linguists devoting their time to the study of other language families, may not have been exposed to the Semitic family of languages (or Semitic subfamily of Afro-Asiatic). The Semitic languages all share certain distinctive characteristics. This project is especially dependent on one of them, the consonantal root system. This very brief introduction is intended to enable the nonspecialist to more fully understand Sembase.

Geographically, the Semitic languages were spoken in the Middle East (the Fertile Crescent, Mesopotamia and the Arabian Peninsula). Some also spread into North Africa (Punic and Arabic) and what is today Ethiopia. They include Akkadian (along with Babylonian and Assyrian) in the East; Ugaritic, Eblaite, Amorite, Canaanite (Phoenician, Punic, Moabite and Hebrew), and Aramaic in the Northwest; Arabic, Sabaic, Hadhrami, Ma'ini, and Qatabani (among others) on the Arabian Peninsula; and Geez, Tigre, Tigrinya, Gurage, Harari and Amharic in Ethiopia (among others). Evidence indicates that all Semitic languages have developed from a common language in use long before writing (and hence unattested), which Semitists term Proto-Semitic. This would have been a member of the Afro-Asiatic family of languages, along with sister languages, probably including some ancestor of Egyptian (Proto-Egyptian?). At a much earlier date there would presumably have been a Proto-Afro-Asiatic. The territory of the Afro-Asiatic languages would have been Western Asia (the Middle East) and parts of Africa, although for all we know Proto-Semitic may have originated in Africa and migrated to the Middle East. Although we think of Proto-Semitic as a single language, it would have been divided into dialects, somewhat anticipating its development into the Semitic languages as attested by living languages or other surviving evidence. At all earlier stages, there would have been alien linguistic influences.

Of these languages, four are especially important partially due to their relative antiquity, but also due to the fact that considerable knowledge of extensive vocabularies has come down to us in an unbroken tradition: Aramaic, Classical Hebrew, Geez and Classical Arabic. Each has an extensive textual corpus providing the basis for a firm knowledge of morphology and syntax. Their alphabets were vocalized at a relatively early date. Although some short multilingual texts have been found, the "forgotten" languages (Akkadian, Ugaritic, Sabaic etc.) have had to be deciphered by comparison with these four. There are many living Semitic linguistic communities today (the Modern Arabic dialects, Amharic and other languages in Ethiopia, a number of dialects of Modern Aramaic, Modern Hebrew and modern South Arabian languages in Yemen and Oman). These are sufficiently changed, and/or influenced by non-Semitic languages, that their value for historical linguistic research is limited.

On the other hand, some of the forgotten languages were written very early and have great importance for that reason. Unfortunately, the first writing system used for Semitic languages was the one developed for Sumerian. For some time, logograms were used so extensively that the texts do not reveal sufficient reliable information regarding the phonology. Furthermore, early phonological characters were developed initially for Sumerian, which is not a Semitic language and does not have the same wealth of consonants. Thus the phonology of Old Akkadian (early third millennium B.C.E.), for example, cannot be said to be perfectly known. To the extent that the words represented by the ideograms came to be written with phonological characters, it is not always easy to know if the word is Semitic or Sumerian. And when a word is known to be Semitic, one cannot be sure that it was written nearly as it was spoken at the time. Some Semitic consonants were often not distinguished by the emerging phonological characters. And there may have already been an archaizing tradition, which was considered appropriate for writing (writing something down being a somewhat formal act in itself).

As for Hebrew, the earliest inscriptional Hebrew material is sufficiently limited that it cannot independently document most of the language recorded in Biblical texts. Scholars differ on the dates of composition of the various parts of the Bible, and on whether, or to what extent, the Biblical texts reflect a state of the language prior to those dates of composition. While some Biblical Hebrew can be assigned to the first half of the first millennium B.C.E., some is clearly only attested in the second century, and post-Biblical Hebrew even later. Furthermore, just as Old Akkadian borrowed from Sumerian, Hebrew borrowed words as well, from other Canaanite dialects, Aramaic and even Arabic.

Unfortunately, since the Semitic languages are so similar, it is not always possible to know what is borrowed and what is native to a language. Attempts even to identify the language of a short inscription may depend on where it was found, and the style of the script, as much as the linguistic content, if the text does not happen to contain a clear identifying trait. The challenge can be even greater when the material is found only in unvocalized texts.

In any case, the "age" of a language is not the same as the "age" of the traits that may have persisted in it. The Qur'an and some other Arabic compositions of significant length can be reliably assigned to the seventh century C.E. Their morphology, syntax and vocabulary would not be more recent. Even though some of the "pre-Islamic" poetry corpus is generally thought to antedate the Qur'an, it would appear that Classical Arabic belongs to a period around a thousand years after that of Biblical Hebrew. Even so, many Semitists feel that Arabic has preserved some especially old morphology, phonology and vocabulary. Arabic case endings are an example. Case endings have all but totally disappeared from Hebrew. But no single language is a magic key to unlock the secrets of Proto-Semitic.

A major defining characteristic of Semitic is the root system. Nearly all roots are triliterals: three consonants (most usually three different ones), referred to as radicals. Each root bears a basic meaning. The triliteral is put through various forms to create vocabulary, conjugate verbs, decline nouns and define their plural and dual forms. Thus for example, in Arabic, k/t/b bears the meaning of "to write." Examples of the forms include:




he wrote
it was written
I wrote

he writes

writ, document, ordinance
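
The way a root is "put through" forms can be sketched in a few lines of Python. This is only an illustration, not part of Sembase: the function name, the digit-slot notation for patterns, and the simplified transliterations (no long-vowel marks) are all assumptions made for the example.

```python
# Root-and-pattern sketch. In a pattern, the digits 1, 2, 3 stand for the
# first, second and third radical; every other character is copied as-is.
# The notation and the function name are hypothetical.

def apply_pattern(root: str, pattern: str) -> str:
    """Fill the digit slots of `pattern` with the radicals of `root` ('k/t/b')."""
    radicals = root.split("/")
    if len(radicals) != 3:
        raise ValueError("expected a triliteral root such as 'k/t/b'")
    return "".join(
        radicals[int(ch) - 1] if ch in "123" else ch
        for ch in pattern
    )

# Simplified Arabic patterns with their glosses.
PATTERNS = {
    "1a2a3a": "he wrote",
    "1u2i3a": "it was written",
    "1a2a3tu": "I wrote",
    "ya12u3u": "he writes",
    "ma12a3": "place of writing",
}

for pattern, gloss in PATTERNS.items():
    print(apply_pattern("k/t/b", pattern), "-", gloss)
```

The same patterns can be applied to any other triliteral, which is the point of the system: the consonants carry the base meaning and the pattern carries the grammatical form.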

The same three consonants (xliteral, in this case a triliteral) may have very different meanings, raising the possibility of accidental convergence. It is improbable that k/t/b "to write" derives from the same ancient root as k/t/b "to sew a waterskin":

he sewed (a waterskin)
he closed (a bag)
thongs for sewing a sack

Although there are important differences between these languages, all have remarkably similar forms based on triliteral roots (although these are somewhat modified or obscured in the currently spoken Semitic languages of Ethiopia). Verbal vocabulary is generated by modifications of the verbal root. Doubling the second consonant makes the verb causative, intensive, "factitive" or declarative:







it grew, became large
he enlarged (

it grew fat, fleshy
he fattened (

he killed
he massacred

he broke (
he shattered (

he became an unbeliever, infidel
he declared someone to be an infidel

to nickname
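
As a rough sketch of the doubling described above, the Arabic doubled form can be generated mechanically from the three radicals. The function name is hypothetical and the transliteration is simplified; only the Arabic pattern (radical, a, doubled radical, a, radical, a) is modeled.

```python
# Doubling the second radical (Arabic only, simplified transliteration).
# `doubled_form` is a hypothetical helper written for this illustration.

def doubled_form(root: str) -> str:
    """Return the Arabic form with the second radical doubled."""
    c1, c2, c3 = root.split("/")
    return f"{c1}a{c2}{c2}a{c3}a"

# Basic verb for each root, listed explicitly because its vowels vary.
BASIC = {
    "k/b/r": "kabura",   # it grew, became large
    "s/m/n": "samina",   # it grew fat
    "q/t/l": "qatala",   # he killed
    "k/s/r": "kasara",   # he broke
    "k/f/r": "kafara",   # he became an unbeliever
}

for root, basic in BASIC.items():
    print(basic, "->", doubled_form(root))
```

Note that the basic verbs differ in their vowels (kabura, samina, qatala), while the doubled form is uniform, which is why the sketch derives it from the root rather than from the basic verb.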

Prefixing a consonant (', s, sh, or h, all apparently cognate) also makes the verb causative:



to come to rest, settle in a place
to settle in a place, make someone stay in a place

to lean on, rely on
to prop up, support

Prefixing "n" makes the verb passive:



to write
to be written

to break
to be broken
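
The two prefixing derivations can be sketched in the same way. The example is limited to Arabic, where the causative prefix is ' and the "n" form begins with a helping vowel; the other cognate prefixes (s, sh, h) come with different vowel patterns in their own languages and are not modeled. Function names are hypothetical and the transliteration is simplified.

```python
# Prefixed verbal forms (Arabic only, simplified transliteration).
# Both helpers are hypothetical names used for illustration.

def causative(root: str) -> str:
    """Prefix ' and drop the vowel after the first radical."""
    c1, c2, c3 = root.split("/")
    return f"'a{c1}{c2}a{c3}a"

def n_passive(root: str) -> str:
    """Prefix n (with a helping vowel i) to the basic pattern."""
    c1, c2, c3 = root.split("/")
    return f"in{c1}a{c2}a{c3}a"

print(causative("s/k/n"))   # from sakana, to settle in a place
print(causative("s/n/d"))   # from sanada, to lean on
print(n_passive("k/s/r"))   # from kasara, to break
```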

These verbal forms are common to nearly all Semitic languages. Both the triliteral root system and some verbal forms clearly already existed in Proto-Semitic. Note for example that these three forms (doubling the second radical, prefixing "s" and prefixing "n") are attested in ancient Egyptian. Although it is known that in historical times there were Semitic incursions into the Nile Delta, it is not clear that population movements could account for structural Semitic influences in Egyptian before 3000 B.C.E. If these verbal forms are native to Egyptian, this is further evidence that they already existed in Proto-Semitic, or even earlier. Note too that doubling the second radical, which is universal in Semitic and common in Egyptian, is part and parcel of the use of triliteral roots (i.e., it is the second of three that is doubled).

It is also characteristic of the Semitic languages that the consonant-vowel configurations are mobile. In some forms, all three radicals are separated by vowels. But in one form the first and second radicals may be adjacent to each other, while in another it is the second and third that are adjacent. Because no pair of radicals is consistently in contact, this impedes sound shifts conditioned by assimilation. Although sound shifts and metathesis (the change in order of consonants) do happen, and are even common, it is not clear what conditions them, apart from the observation that sound shifts occurring in other language families also occur in Semitic. Sound-shift pairs tend to be consonants that are similar, proximate in the oral cavity, or otherwise associated. For example, "t"/"k" and "t emphatic"/"q" may be associated in cognitive speech processes although not adjacent in the oral cavity.

Yet another characteristic of the Semitic languages is the nearly total absence of compounding. A "construct" relationship does exist: "light the day" means "daylight." But although at times similar in function, constructs are far from being compounds. They cannot be verbalized, and therefore cannot be put through the verbal form system. Vocabulary generation by prefixing prepositions or adverbs (or even a negative or privative particle) is almost nonexistent. On the other hand, preformatives are used, especially the consonant "m" as in maktab above. The "m" in itself bears no meaning; depending on the consonant-vowel configuration it may produce a participle (sometimes a substantive), a noun of instrument, or a noun of place of the action. Possibly these derive from a relative+verb clause: one who does, that which does, that (place) where it happens.

In the absence of compounding, vocabulary has evolved partly by assigning different but related meanings to variants of a root caused by sound shifts. One wonders if pressure to generate vocabulary has also resulted in the rather large number of consonants (twenty-nine in the Old South Arabian group). Vocabulary is also generated by simply adding yet more meanings to existing words. This has caused Arabic, for example, to be considered to be very "rich," in the sense that the same word can mean many things. The language is thus highly context dependent, and ideal for poetry. This semantic accretion and the phenomenon of accidental convergence of roots often make it unclear what one might consider to be the base meaning of a root.

On the other hand, the root consonants are so obvious in Semitic that dictionaries traditionally list words under their roots. The discreteness of the triliterals, and the persistence of the basic forms, facilitate the analysis undertaken by Sembase.

Sembase Status
Last updated: 27 March, 2016

Sembase has a long way to go. Registering Aramaic is the current challenge. Post-Biblical Hebrew is being done in conjunction with Aramaic, since some of the best lexical sources are for Judaic literature, whether Hebrew or Aramaic, which undertake exhaustive coverage of both Jewish Literary Aramaic and Post-Biblical Hebrew. Dictionaries of this sort (cf. the works of Levy, Jastrow and Dalman) make sense because post-Biblical classical authors using either language incorporated extensive borrowing from the other (as well as from Greek and Latin).

Although the basic structure of the database is in place, it will need some editing. Since its inception, the 340 major semantic categories have been reduced to only around 135, and the subcategories are approaching 2,000. Even the fields for sorting are not final; I will modify the sort orders (alphabetical and phonological) when I have more languages in the table, to profit from what I have learned. At present, it has all of Biblical Hebrew and Geez. It has an aleph-through-ya' data set from several principal Arabic dictionaries, but Arabic will be comprehensively revisited when all other target languages are done. At that time, a protocol will be followed to glean from currently spoken dialects, with an emphasis on those in areas where a Semitic language was spoken at the time of the rise of Islam (the Arab conquests). The modern Semitic languages of Ethiopia will be done only as found useful. Note that the entries summarized below do not just represent the number of roots and words. Any given root or word can be entered into more than one semantic category, depending on its semantic extension. The data below represent the number of entries.
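
The difference between counting entries and counting roots or words can be illustrated with a small sketch. The record layout, the field names and the category labels below are assumptions made for the example; they are not the actual Sembase schema.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical record layout: one entry per (word, semantic category) pair.
@dataclass(frozen=True)
class Entry:
    language: str
    root: str
    word: str
    category: str  # illustrative label, not a real Sembase category

entries = [
    Entry("Arabic", "k/t/b", "kataba", "writing"),
    Entry("Arabic", "k/t/b", "kataba", "sewing"),   # same word, second category
    Entry("Arabic", "k/s/r", "kasara", "breaking"),
]

entry_count = len(entries)
distinct_words = len({(e.language, e.root, e.word) for e in entries})
per_language = Counter(e.language for e in entries)

print(entry_count, distinct_words, dict(per_language))
```

Here two distinct words yield three entries, because one of them is filed under two semantic categories. A per-language tally built this way counts entries, which is how the figures in the summary should be read.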

For the most part, the language and dialect categories are discrete. The glaring exception is Aramaic. This language spread far and wide as it became the lingua franca of the region. The notion that the language evolved into different dialects in an internally linguistically straightforward manner is clearly an oversimplification. The language spread to populations speaking other languages, usually Semitic, and those languages strongly influenced the dialectal development, not simply through borrowing, but also through the influence of the linguistic culture of the substrate populations on modifications of Aramaic grammar, syntax and phonology. I am using over fifty lexical sources for Aramaic alone. Some dialectal categories are discrete, such as the dialects of Syriac, Palmyrene, Nabatean and the dialect of Sefire. Jewish Literary Aramaic is a designation indicating the lexical source, referring to the material from the works of Jastrow, Levy, Dalman, etc. The material of Qumran could at least partly be categorized as Judean Aramaic, and Yerushalmi as Palestinian Literary Aramaic. Yerushalmi, Onkelos, Neofiti and Qumran materials are coded as such when the source seems to me to be of interest. The data entry procedure, adopted to determine what will be included in this survey, has been to first enter Syriac (Smith, Brockelmann, Sokoloff, etc.) and Jewish Literary Aramaic (as defined above, by its source lexica). Material from other dialects/periods is selected only if it has not already been included in the base survey. A word or root in Mandaic that is already recorded as Syriac or JLA is not selected, but only words or roots of interest that were not found in Syriac/JLA. However, identical or nearly identical material can be included if a) there is some interest in the fact that it is also found in another dialect, or b) it exhibits some linguistically interesting morphological or phonological peculiarity.
Some JLA entries could have been coded as Jewish Babylonian Aramaic, but are not. Thus JLA, JBA and JPA are not discrete categories, but the codes do direct the user to sources. All coding will be reviewed later, and I expect that "Old Aramaic" and "Inscriptional Aramaic" will be largely merged, with notes in the entry as to provenance. This coding complexity is limited to Aramaic. I should note, however, that Post-Biblical Hebrew occasionally includes Medieval Hebrew, which is not a separate category.
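
The selection rule for Aramaic material outside the base survey can be expressed as a simple filter. This is a simplified sketch under stated assumptions: the class, the function, the single `of_interest` flag (standing in for both exceptions a and b) and the placeholder keys are all hypothetical, and the editorial judgment of what counts as interesting is not something code can capture.

```python
from dataclasses import dataclass

# Hypothetical candidate record; `key` is a placeholder for a root or word.
@dataclass(frozen=True)
class Candidate:
    key: str
    dialect: str
    of_interest: bool = False  # stands in for exceptions (a) and (b)

def select_for_entry(candidates, base_keys):
    """Keep items absent from the base survey, plus flagged duplicates."""
    return [c for c in candidates if c.key not in base_keys or c.of_interest]

# Keys already recorded from the base survey (Syriac and JLA).
base_keys = {"root-A", "root-B"}

candidates = [
    Candidate("root-A", "Mandaic"),                    # already in base: skipped
    Candidate("root-B", "Mandaic", of_interest=True),  # duplicate, but flagged
    Candidate("root-C", "Mandaic"),                    # new material
]

selected = select_for_entry(candidates, base_keys)
print([c.key for c in selected])
```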

Note that "Hatrian Aramaic" has been replaced by East Mesopotamian Aramaic, following Beyer (1998), which includes inscriptions from Hatra, Assur (Asshur), Dura-Europos, Jaddala, Takrit, Qabr Abu Naif, 'Abrat al-Saghira, Sa'adiya & Tur 'Abdin.

At the time of this update, the data summary is as follows:

Records by Language

Arabic 34,414
-- Literary Arabic: 34,370
-- Algerian Arabic: 1
-- Damascene Arabic: 8
-- Dathini Arabic: 4
-- Egyptian Arabic: 5
-- Hawrani Arabic: 5
-- Iraqi Arabic: 1
-- Moroccan Arabic: 1
-- neo Arabic: 6
-- Tunisian Arabic: 1
-- Yemeni Arabic: 12
Hebrew 10,647
-- Biblical Hebrew: 7,012
-- Post-Biblical Hebrew: 3,560
-- Qumran Hebrew: 47
-- Ben Sira: 4
-- Inscriptional Hebrew: 18
-- Samaritan Hebrew: 5
-- Geniza Hebrew: 1
Geez 7,614
-- Literary Geez: 7,610
-- Inscriptional Geez: 4
Tigre 335
Tigrinya 173
Amharic 185
Gurage 5
Harari 5
Mehri 4
Canaanite 6
Aramaic in progress (total to date with c. 60% done) 13,055
-- Common to JLA, JBA & Syriac: 2,046
-- Syriac: 4,725
-- Jewish Literary Aramaic (JLA): 2,932
-- Jewish Babylonian Aramaic (JBA): 661
-- Samaritan Aramaic: 617
-- Mandaean Aramaic: 473
-- Christian Palestinian Aramaic (CPA): 164
-- Jewish Palestinian Aramaic: 311
-- Judean Aramaic: 14
-- Palmyrene Aramaic: 89
-- Nabatean Aramaic: 62
-- EMA (Hatra+) Aramaic: 50
-- Ezra: 28
-- Daniel: 57
-- Targum Onkelos: 55
-- Targum Neofiti: 52
-- Targum Jonathan: 27
-- Hagiographa: 38
-- Targum, unspecified: 54
-- Inscriptional Aramaic: 101
-- Imperial Aramaic: 58
-- Old Syriac: 11
-- Qumran Aramaic: 56
-- Sefire Aramaic: 34
-- Sam'ali Aramaic: 39
-- Egyptian Aramaic: 101
-- NSA (Neo Syriac Aramaic) Aramaic: 89
-- Elephantine Aramaic: 17
-- Old Aramaic: 87

Database Total 66,444

