4.2. Morphological Analysis
Morphological analysis underlies both stemming and lemmatization, which aim to conflate variants of a word into a single unit, a stem or a root. These text units are then used to represent documents and queries for document indexing and query/document matching. Stemming is usually preferred because lemmatization is computationally more expensive and yields only slight effectiveness improvements in IR tasks [52]. Stemming has also been applied in Amharic IR systems [25,26]. In our work, we hypothesize that stemming is insufficient for Amharic and that more sophisticated text analysis should be used because of the complexity of the language.
Morphological variants of Amharic words, especially verbs, can have more than one stem type. From a given Amharic root, more than 10 basic stems can be generated [53]. For example, the morphological variants ነገረ /nəgərə ‘he told’/, ተናጋሪ /tənagari ‘orator’/, and አናገረ /ʔənagərə ‘he communicated’/ have the basic stems ነገር- /nəgərɨ-/, ናጋር- /nagarɨ-/, and ናገር- /nagərɨ-/, respectively. As a result, stemming produces different stems even though the word variants are semantically similar, which means that Amharic verbal stems need one more reduction step to extract the words’ roots. Indeed, verbal stems are formed from roots, and all variants of an Amharic verb have one common root. For example, the common root of the aforementioned variants is ን-ግ-ር /n-g-r/. Therefore, roots are more appropriate than stems for Amharic, as roots conflate word variants more accurately. Thus, we developed a new root-based text representation for Amharic IR. We test our hypothesis experimentally by considering stem-based and root-based morphological analyses, in which the basic stems and roots of words are extracted from documents and queries. Basic stems serve for the formation of derived stems and surface forms of words. In Amharic, basic stems are usually derived from roots by inserting vowels between radicals.
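To make the conflation argument concrete, the following minimal sketch reduces the three romanized basic stems above to their consonant skeletons. It is only an illustration, not the analyzer used in this work: the function name `consonant_skeleton` and the vowel inventory are assumptions, every non-vowel character is treated as one radical (so digraphs such as tʃ and ejectives would need extra handling), and weak radicals that do not surface in a stem cannot be recovered this way.

```python
# Assumed vowel inventory for the romanization used in the text (illustrative only).
VOWELS = set("əaɨiueo")

def consonant_skeleton(stem: str) -> str:
    """Drop vowels and the trailing hyphen from a romanized basic stem,
    then join the remaining consonants (radicals) with '-'."""
    radicals = [ch for ch in stem.rstrip("-") if ch not in VOWELS]
    return "-".join(radicals)

# The three basic stems quoted in the paragraph above.
basic_stems = ["nəgərɨ-", "nagarɨ-", "nagərɨ-"]
roots = {consonant_skeleton(stem) for stem in basic_stems}
print(roots)  # three distinct stems, one shared skeleton
```

Three different index terms under stem-based indexing thus become one index term under root-based indexing.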
4.2.1. Stem-Based Morphological Analysis
We created the stem-based index using the basic stems of words. Several words can be formed by attaching affixes to a stem. Therefore, we performed morphological analysis to separate stems from the remaining morphemes, and we conflated variants of words to their basic stems. For example, the morphological analyses used to extract the stems of the words አልተመለሱም /ʔəlɨtəmələsumɨ ‘they have not returned’/ and ከቤተሰቦቹ /kəbetəsəbotʃu ‘from his families’/ are shown below (see Notations at the end of this article).
አልተመለሱም | ከቤተሰቦቹ |
አል_ተ_መለስ_ኡ_ም | ከ_ቤተሰብ_ኦች_ኡ |
ʔəlɨ_tə_mələsɨ_ʔu_mɨ | kə_betəsəbɨ_ʔotʃɨ_ʔu |
[neg]-[pas]-[stem]-[3,p]-[ncmp] | [pre]-[stem]-[p]-[3,s,m] |
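Given an analysis in the notation shown above, selecting the index term amounts to pairing each morpheme with its tag and keeping the one tagged as the stem (or, for root-based indexing, the root). The sketch below assumes the analyzer output is available as two strings in exactly this notation; the function names `tagged_morphemes` and `index_term` are hypothetical.

```python
import re

def tagged_morphemes(segmentation: str, tags: str) -> list[tuple[str, str]]:
    """Pair '_'-separated morphemes with their bracketed tags."""
    morphemes = segmentation.split("_")
    # Tags may contain commas, e.g. [3,p], so capture everything between brackets.
    labels = re.findall(r"\[([^\]]+)\]", tags)
    if len(morphemes) != len(labels):
        raise ValueError("segmentation and tag sequence differ in length")
    return list(zip(morphemes, labels))

def index_term(segmentation: str, tags: str, unit: str = "stem") -> str:
    """Return the morpheme tagged with `unit` ('stem' or 'root')."""
    for morpheme, label in tagged_morphemes(segmentation, tags):
        if label == unit:
            return morpheme
    raise ValueError(f"no [{unit}] morpheme found")

print(index_term("አል_ተ_መለስ_ኡ_ም", "[neg]-[pas]-[stem]-[3,p]-[ncmp]"))
print(index_term("ከ_ቤተሰብ_ኦች_ኡ", "[pre]-[stem]-[p]-[3,s,m]"))
```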
Furthermore, basic stems are used to form derived stems, which in turn are used for the formation of surface forms of words. The derived stems include causative (አ- /ʔə-/ and አስ- /ʔəsɨ-/), passive (ተ- /tə-/), infinitive (መ- /mə-/), and reduplicative types of verbal stems. For example, the variants ተደወለ /tədəwələ ‘is called’/, ከመደወሏ /kəmədəwəlwa ‘as soon as she called’/, በአስደወለ /bəʔəsɨdəwələ ‘since he caused to call’/, and ሲደዋውሉ /sidəwawɨlu ‘as they called each other’/ have the derived stems ተደውል- /tədəwɨlɨ-/, መደወል- /mədəwəlɨ-/, አስደወል- /ʔəsɨdəwəlɨ-/, and ደዋወል- /dəwawəlɨ-/, respectively, and share the basic stem ደወል- /dəwəlɨ-/. With regard to their meaning, there is no conceptual difference between derived and basic stems. Moreover, derived stems originate from basic stems, and basic stems are the shortest and the most common stems for many variants (see Table 3, where the core meaning is ‘kill’). For stem-based indexing and retrieval, we represent the variants of words by their basic stems.
Although basic stems conflate more variants than derived stems do, more than one type of basic stem can exist for the variants of a word (see Table 3). Therefore, in the case of verbs, basic stems alone cannot conflate all Amharic variants. In contrast, variants of primary nouns, adjectives, and adverbs have one common basic stem. Verbs and words derived from verbs thus need further morphological analysis to be represented by a common form. The morphological analysis of some words requires handling palatalization in order to extract basic stems. This is achieved after separating -ኢ and -ኢያ from a stem. For example, the morphological analyses of the words የጎጂዎች /jəgodʒiwotʃɨ ‘of harmful [p]’/, ገዳዮች /gədajotʃɨ ‘killers’/, and መጨረሻ /mətʃ’ərəʃa ‘end’/ are presented as follows.
የጎጂዎች | ገዳዮች | መጨረሻ |
የ_ጎድ_ኢ_ዎች | ገድል_ኢ_ኦች | መ_ጨረስ_ኢያ |
jə_godɨ_ʔi_wotʃɨ | gədɨlɨ_ʔi_ʔotʃɨ | mə_tʃ’ərəsɨ_ʔija |
[gen]-[stem]-[pal]-[p] | [stem]-[pal]-[p] | [inf]-[stem]-[pal] |
4.2.2. Root-Based Morphological Analysis
Roots are the basis for the formation of verbal stems and of many Amharic words, since verbs and words derived from verbal stems originate from roots. Stem and root have the same form for primary nouns, adjectives, adverbs, and functional words. For Amharic nouns, adjectives, and adverbs derived directly from verbal roots and stems, we propose the roots of their corresponding verbs as index and query terms. For example, the morphological analyses of the verb ከሰበርኳቸውማ /kəsəbərɨkwatʃəwɨma ‘if I break them even’/ and the derived noun ስብራቴ /sɨbɨrate ‘my state of being broken’/ are presented below.
ከሰበርኳቸውማ | ስብራቴ |
ከ_ሰበር_ኩ_አቸው_እማ | ስብር_አት_ኤ |
kə_səbərɨ_ku_ʔətʃəwɨ_ʔɨma | sɨbɨrɨ_ʔətɨ_ʔe |
[pre]-[stem]-[1,s]-[3,p]-[foc] | [stem]-[nom]-[1,s] |
ከ_ስ-ብ-ር_ኩ_አቸው_እማ | ስ-ብ-ር_አት_ኤ |
[pre]-[root]-[1,s]-[3,p]-[foc] | [root]-[nom]-[1,s,pos] |
Adjectives derived from primary nouns are represented using the root of the corresponding nouns. Nouns derived from primary adjectives are likewise represented using the root of the corresponding adjectives. The root representation of primary nouns, adjectives, adverbs, and functional words differs from the verbal root representation. For example, the roots of the words መኪናዋ /məkinawa ‘her car’/, ደጋግ /dəgagɨ ‘generous’/, and ሌሎች /lelotʃɨ ‘others’/ are መኪና /məkina/, ደግ /dəgɨ/, and ሌላ /lela/, respectively. However, the roots of verbals such as መልስ /məlɨsɨ ‘answer’/ and ረዥም /rəʒɨmɨ ‘long’/ are ም-ል-ስ /m-l-s/ and ር-ዝ-ም /r-z-m/, respectively. The reduced form of some variants of a verb is represented by the corresponding radical form. For instance, the roots of the verbs ሞተ /motə ‘he died’/ and ኖረ /norə ‘he lived’/ are ም-ው-ት and ን-ው-ር, respectively. Morphological variants of Amharic words, especially verbs, can have more than one stem but still share a common root. All variants of an Amharic verb can thus be represented by a single root during indexing. On the other hand, semantically unrelated words hardly ever have a common root. To sum up, the basic-stem representation is robust for primary nouns, adjectives, adverbs, and functional words, whereas the root representation is robust for all types of words, including verbs.
4.3. Amharic Stopword Identification and Removal
One of the major preprocessing tasks in IR and many other text processing applications is stopword removal. Accordingly, stopword lists have been constructed for many languages. However, no standard stopword list is yet available for Amharic IR. The common practice is to identify and remove stopwords before applying morphological analysis to the words in a text. This is also what has been done in previous Amharic IR studies. We consider this approach inappropriate for Amharic. Some Amharic stopwords do not exist as standalone words, and others may be attached to other words as prefixes or suffixes. For example, ‘the’ is usually considered a stopword in English; its Amharic equivalent is the suffix ‘-ኡ /-ʔu/’ or ‘-ው /-wɨ/’, which does not appear as a standalone word. Accordingly, ‘the house’ and ‘the student’, for instance, correspond to ‘ቤት /betɨ/+ -ኡ /-ʔu/’ → ‘ቤቱ /betu/’ and ‘ተማሪ /təmari/+ -ው /-wɨ/’ → ‘ተማሪው /təmariwɨ/’, respectively. Terms can appear in various morphological structures, as there can be several sequences of affixes representing articles, prepositions, numbers, etc. For instance, the stopword ውስጥ /wɨsɨt’ɨ ‘in’/ has the following variants: ውስጣዊ /wɨsɨt’awi/, ውስጣችን /wɨsɨt’atʃɨnɨ/, ውስጥና /wɨsɨt’ɨna/, ውስጥም /wɨsɨt’ɨmɨ/, የውስጥ /jəwɨsɨt’ɨ/, ለውስጥ /ləwɨsɨt’ɨ/, በውስጥ /bəwɨsɨt’ɨ/, ከውስጥ /kəwɨsɨt’ɨ/, የውስጥና /jəwɨsɨt’ɨna/, etc. Furthermore, some stopwords merge with each other or with other words to form new words. Thus, most Amharic stopwords cannot be found and removed before morphological analysis is applied. This calls for a different treatment of stopword identification and removal for Amharic than for morphologically simpler languages such as English.
As we designed stem-based and root-based Amharic IR systems, we also constructed stopword lists based on stem and root forms. In both cases, stopwords are identified after morphological segmentation of words from a large corpus of documents representing various domains and sources. For example, words such as ከትልልቆችም /kətɨlɨlɨkʼotʃɨm ‘even from big ones’/ and ስላላመጣቸው /sɨlalamət’atʃəw ‘since he did not bring them’/ undergo the following morphological process to extract the attached stem-based stopwords.
ከትልልቆችም | ስላላመጣቸው |
ከ_ትልቅ_ኦች_ም | ስለ_አል_አ_መጥ_አቸው |
kə_tɨlɨkʼ_otʃɨ_mɨ | sɨlə_ʔəlɨ_ʔə_mət’ɨ_ʔətʃəwɨ |
[pre]-[stem]-[p]-[foc] | [com]-[neg]-[cau]-[stem]-[3,p] |
Similar to the stem-based stopword list, the root-based stopwords were built based on the root-based morphological process as shown in the following example.
ከትልልቆችም | ስላላመጣቸው |
ከ_ት-ል-ቅ_ኦች_ም | ስለ_አል_አ_ም-ጥ_አቸው |
kə_t-l-kʼ_otʃɨ_mɨ | sɨlə_ʔəlɨ_ʔə_m-t’_ʔətʃəwɨ |
[pre]-[root]-[p]-[foc] | [pre]-[neg]-[cau]-[root]-[p] |
Statistical information about terms plays a significant role in the identification of stopwords. However, the notion of a term depends on the characteristics of the language. For morphologically simple languages such as English, stems can be considered as terms. This is not exactly the case for morphologically complex languages such as Amharic. We hypothesize that the morphemes used to form Amharic words can serve as a basis for computing term statistics. Thus, in this work, we consider morphemes as terms. Both the stem-based and root-based stopword lists were created by aggregating statistical information about morphemes (frequency, mean, variance, and entropy) from an Amharic corpus. Accordingly, over all the documents that we collected, we compute the frequency, mean, variance, and entropy of each morpheme in order to generate a corpus-based Amharic stopword list, as detailed below.
The frequency of a morpheme is represented by its document frequency and collection frequency. The document frequency of a morpheme is the number of documents in which the morpheme occurs, whereas the collection frequency is the total frequency of the morpheme throughout the corpus. In this work, we used un-normalized morpheme frequency. Morphemes were ranked according to their document frequency and collection frequency. Then, a threshold value was used to determine stopwords. Morphemes that are evenly distributed throughout the collection and satisfy the threshold value are considered as stopwords. The document frequency $df$ of each morpheme is computed as:

\[
df(M_i) = \sum_{j=1}^{N} s(M_i, D_j), \qquad
s(M_i, D_j) =
\begin{cases}
1 & \text{if } M_i \text{ occurs in } D_j \\
0 & \text{otherwise}
\end{cases}
\]

where $M_i$ is the $i$th morpheme in the corpus, $D_j$ is the $j$th document in the corpus, and $N$ is the total number of documents in the collection. If a morpheme appears in a given document, its status $s$ is 1, otherwise 0. The collection frequency $cf$ of each morpheme is computed as:

\[
cf(M_i) = \sum_{j=1}^{N} MFD_j(M_i)
\]

where $MFD_j(M_i)$ is the frequency of the morpheme in document $D_j$ and $N$ is the total number of documents in the corpus.
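As a minimal sketch of these two counts, assume each document has already been segmented into a list of morphemes. The function name and the toy documents are hypothetical and only illustrate the definitions.

```python
from collections import Counter

def document_and_collection_frequency(docs):
    """docs: list of documents, each a list of morphemes.
    Returns (df, cf) as Counters keyed by morpheme."""
    df, cf = Counter(), Counter()
    for morphemes in docs:
        counts = Counter(morphemes)
        cf.update(counts)         # add every occurrence
        df.update(counts.keys())  # add 1 per document containing the morpheme
    return df, cf

# Toy, hand-made segmented documents (not taken from the corpus).
docs = [["ከ", "ቤት", "ኡ"], ["ቤት", "ኦች", "ኡ", "ኡ"], ["ከ", "ትልቅ"]]
df, cf = document_and_collection_frequency(docs)
print(df["ኡ"], cf["ኡ"])
```

Here the suffix ኡ occurs in two documents but three times overall, which is exactly the difference between the two frequencies.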
The mean value of each morpheme is used to measure the overall distribution of morphemes in the whole corpus. The mean probability $mp$ of each unique morpheme over all documents is computed as:

\[
mp(M_i) = \frac{1}{N} \sum_{j=1}^{N} p_j(M_i)
\]

where $N$ is the total number of documents and $p_j(M_i)$ is the probability of the morpheme in document $D_j$, which is computed as:

\[
p_j(M_i) = \frac{MF_j(M_i)}{TM_j}
\]

where $MF_j(M_i)$ is the frequency of the morpheme in document $D_j$ and $TM_j$ is the total number of morphemes in that document.
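The mean probability can be sketched as follows, again assuming documents are lists of morphemes. The function name is hypothetical, and skipping empty documents in the per-document probability (while still counting them in N) is an assumption of this sketch.

```python
from collections import Counter

def mean_probability(docs):
    """Average, over all documents, of each morpheme's within-document
    probability MF / TM. Documents not containing a morpheme contribute 0."""
    n_docs = len(docs)
    totals = Counter()
    for morphemes in docs:
        if not morphemes:  # assumption: an empty document contributes nothing
            continue
        for morpheme, count in Counter(morphemes).items():
            totals[morpheme] += count / len(morphemes)
    return {morpheme: total / n_docs for morpheme, total in totals.items()}

# Toy segmented documents (illustrative only).
docs = [["ከ", "ቤት", "ኡ", "ኡ"], ["ቤት", "ኡ"]]
mp = mean_probability(docs)
print(mp["ኡ"], mp["ከ"])
```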
The variance of morphemes is used to check the distribution of morphemes throughout the documents in the corpus. Variance $v$ is computed as:

\[
v(M_i) = \frac{1}{N} \sum \left( n(M_i) - m(M_i) \right)^2
\]

where $v(M_i)$ is the variance of the $i$th morpheme, $N$ is the total number of distinct morphemes in the document, $n(M_i)$ is the normalized morpheme frequency in a document, and $m(M_i)$ is the mean value, computed as follows:

\[
m(M_i) = \frac{MF}{N}
\]

where $MF$ is the morpheme frequency and $N$ is the number of words in a document.
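One way to read this measure is as the spread of a morpheme's normalized frequency across documents. The sketch below makes that reading explicit and should be taken as an assumption rather than the exact computation used here: it normalizes by document length, takes the mean over documents, and uses the population variance. The function name is hypothetical.

```python
def morpheme_variance(docs, morpheme):
    """Population variance, across documents, of the morpheme's
    length-normalized frequency (assumed reading of the definition)."""
    relative = [doc.count(morpheme) / len(doc) for doc in docs if doc]
    mean = sum(relative) / len(relative)
    return sum((x - mean) ** 2 for x in relative) / len(relative)

# Toy segmented documents (illustrative only).
docs = [["ከ", "ቤት", "ኡ", "ኡ"], ["ቤት", "ኡ"]]
print(morpheme_variance(docs, "ኡ"))  # evenly distributed morpheme
print(morpheme_variance(docs, "ከ"))  # unevenly distributed morpheme
```

Under this reading, a morpheme that makes up the same share of every document has zero variance, which matches the idea that stopwords are evenly distributed.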
Entropy is used to measure the information value $e$ of each morpheme in the corpus. This method is based on the amount of information a morpheme carries. Stopwords are known to have low explanatory value. If the entropy value of a morpheme is high, then its information value is low. The entropy value of each morpheme in the corpus is calculated as:

\[
e(M_i) = -\sum_{j=1}^{N} p_j(M_i) \log p_j(M_i)
\]

where $p_j(M_i)$ is the probability of the morpheme in document $D_j$, calculated by dividing the morpheme frequency by the total number of morphemes in the document, and the sum runs over the $N$ documents in the corpus.
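The entropy score can be sketched in the same setting. The logarithm base (2 here), the summation over documents, and the function name are assumptions of this sketch; documents in which the morpheme does not occur contribute nothing to the sum.

```python
import math

def morpheme_entropy(docs, morpheme):
    """Sum over documents of -p * log2(p), where p is the morpheme's
    frequency divided by the number of morphemes in the document."""
    total = 0.0
    for doc in docs:
        if not doc:
            continue
        p = doc.count(morpheme) / len(doc)
        if p > 0:  # the limit of p * log(p) as p -> 0 is 0
            total -= p * math.log2(p)
    return total

# Toy segmented documents (illustrative only).
docs = [["ከ", "ቤት", "ኡ", "ኡ"], ["ቤት", "ኡ"]]
print(morpheme_entropy(docs, "ኡ"), morpheme_entropy(docs, "ከ"))
```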
The stopwords were selected based on the aggregation of the document frequency, mean, variance, and entropy values of morphemes. Initially, four lists, each containing the top 250 morphemes, were selected using the statistical information of morphemes in the corpus. Out of these lists, 180 morphemes that appear in all four lists were then selected through empirical analysis (see Table 4).
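The automatic part of this aggregation can be sketched as taking the top-k morphemes under each statistic and intersecting the resulting lists. The function names, the toy scores, and the ranking direction chosen for each statistic are assumptions for illustration; the empirical analysis that fixed the final 180 morphemes is not modeled.

```python
def top_k(scores, k, highest_first=True):
    """Return the k best-ranked morphemes for one statistic."""
    ranked = sorted(scores, key=scores.get, reverse=highest_first)
    return set(ranked[:k])

def candidate_stopwords(ranked_statistics, k):
    """ranked_statistics: list of (scores, highest_first) pairs.
    Keep only morphemes that appear in every top-k list."""
    lists = [top_k(scores, k, highest_first)
             for scores, highest_first in ranked_statistics]
    return set.intersection(*lists)

# Toy scores, invented for illustration only.
toy_df = {"ኡ": 10, "ከ": 9, "ቤት": 1}
toy_mean = {"ኡ": 0.30, "ከ": 0.20, "ቤት": 0.01}
toy_variance = {"ኡ": 0.001, "ከ": 0.500, "ቤት": 0.002}
toy_entropy = {"ኡ": 5.0, "ከ": 4.0, "ቤት": 1.0}

# Assumption: low variance and high values of the other statistics rank first.
candidates = candidate_stopwords(
    [(toy_df, True), (toy_mean, True), (toy_variance, False), (toy_entropy, True)],
    k=2,
)
print(candidates)
```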
The final stopword list also includes a few words that were selected manually based on Amharic subword class criteria. These are functional words used mainly for the formation of phrases, sentences, and paragraphs. They are characterized by having no meaning of their own, by their inability to undergo morphological derivation and inflection, by the lack of morphemes for various parameters, and by their small word size [53]. They include words such as ወደ /wədə ‘towards’/, እንደ /ʔɨndə ‘like’/, ስለ /sɨlə ‘about’/, እስከ /ʔɨskə ‘up to’/, ወዘተ /wəzətə ‘and so on’/, ጎሽ /goʃ ‘bravo’/, ዋ /wa ‘warning’/, ይልቅ /jɨlɨk ‘instead’/, etc. Based on these criteria, we selected 42 words. Thus, the final stopword list contains 222 terms.
The stem-based and root-based stopwords are then removed from the respective documents and queries during the process of indexing and query processing. The stem-based and root-based stopword lists we have built are re-usable, and they are one of the outputs of this research as we used a large set of documents from different domains and sources.
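The removal step itself can be sketched as a filter over the morphemes of each segmented word, applied identically to documents and queries. The function name and the small stopword set below are hypothetical and are not the published lists.

```python
def remove_stopwords(segmented_words, stopwords):
    """segmented_words: words whose morphemes are joined by '_'.
    Returns the morphemes that are not in the stopword set."""
    kept = []
    for word in segmented_words:
        kept.extend(m for m in word.split("_") if m not in stopwords)
    return kept

# Hypothetical stem-based stopword morphemes, for illustration only.
stem_stopwords = {"ከ", "ኦች", "ም", "ስለ", "አል", "አ", "አቸው"}
terms = remove_stopwords(["ከ_ትልቅ_ኦች_ም", "ስለ_አል_አ_መጥ_አቸው"], stem_stopwords)
print(terms)
```

The same function works for the root-based pipeline when the segmented words carry roots and a root-based stopword set is supplied.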
4.6. Matching and Ranking
In the proposed system, the extraction of index and query terms from documents and user queries follows the same workflow: text preprocessing, morphological analysis, and stopword removal. Searching for relevant documents is based on matching query terms (representing the information need of users) with index terms (representing documents). We used exact vocabulary term matching, which searches for documents that contain the query terms without analyzing the semantics of words and without considering the semantic connections between them. The retrieval probability of a relevant document for a given query differs between stem-based and root-based retrieval. In general, the retrieval probability of a relevant document for a given query is higher with root-based matching than with stem-based matching. For example, consider the sample document (Appendix A) and the following four query terms derived from ስ-ብ-ር /s-b-r/:
(1) ስብራት /sɨbɨratɨ ‘the state of something which is broken’/
(2) መሰበር /məsəbərɨ ‘to break’/
(3) ሰባራ /səbara ‘something which is broken’/
(4) አሳባሪ /ʔəsabari ‘cause to break’/.
The document (Appendix A) is relevant to the four queries. With stem-based matching, the retrieval probability of the document is 0.032 for query term (1), 0.017 for (2), 0.004 for (3), and 0 for (4). With root-based matching, however, the retrieval probability of the document (Appendix A) is the same for all four queries (0.057), which is higher than with stem-based matching. Furthermore, the stem-based text representation retrieves non-relevant documents since it may conflate semantically unrelated words.
We use the Lemur toolkit for ranking. For a given query $Q$ and a collection of retrieved documents $D$, the Lemur toolkit ranks retrieval results based on their possible relevance. It implements both BM25 and language modeling, both of which take document length into account. BM25 ranks documents based on the following equation:

\[
\mathrm{Score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot
\frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left( 1 - b + b\,\frac{|D|}{avgdl} \right)}
\tag{10}
\]

where $n$ is the number of terms in $Q$, $f(q_i, D)$ is the term frequency of $q_i$ in the document $D$, $|D|$ is the length of the document $D$ in words, $avgdl$ is the average document length in the text collection from which documents are drawn, and $k_1$ and $b$ are free parameters. $\mathrm{IDF}(q_i)$ is the inverse document frequency weight of the query term $q_i$. From Equation (10), $\mathrm{Score}(D, Q)$ is higher when the query terms in $Q$ have higher frequencies in document $D$. For each $q_i$ in $Q$, the following inequality holds for Amharic documents:

\[
f_r(q_i, D) \ge f_s(q_i, D) \ge f_w(q_i, D)
\]

where $f_r$, $f_s$, and $f_w$ denote term frequency in the root-based, stem-based, and word-based representations, respectively. Thus, it can be inferred that the root-based representation of queries and documents provides better information for ranking, as $f_r(q_i, D)$ provides the highest possible score.
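The effect of term frequency on the BM25 score can be checked with a small sketch of the scoring function. This is a plain re-implementation for illustration, not the Lemur toolkit's code: the IDF weight is passed in rather than computed, and the defaults k1 = 1.2 and b = 0.75 are commonly used values assumed here, not settings reported in this work.

```python
def bm25_term(tf, doc_len, avgdl, idf, k1=1.2, b=0.75):
    """Contribution of one query term to the BM25 score."""
    length_norm = k1 * (1 - b + b * doc_len / avgdl)
    return idf * tf * (k1 + 1) / (tf + length_norm)

def bm25_score(query_terms, doc_tf, doc_len, avgdl, idf, k1=1.2, b=0.75):
    """Sum the per-term contributions over all query terms."""
    return sum(
        bm25_term(doc_tf.get(q, 0), doc_len, avgdl, idf.get(q, 0.0), k1, b)
        for q in query_terms
    )

# Same document length and IDF, increasing term frequency
# (e.g. word-based < stem-based < root-based counts for one query term).
scores = [bm25_term(tf, doc_len=100, avgdl=100, idf=1.0) for tf in (1, 2, 4)]
print(scores)
```

With everything else held fixed, the contribution grows with the term frequency, so the representation that yields the largest count for a query term also yields the largest BM25 contribution.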
For language modeling, the similarity between a document $D$ and a query $Q$ is measured by the Kullback-Leibler (KL) divergence between the document model $D_\theta$ and the query model $Q_\theta$. The KL divergence ranking function captures the term occurrence distributions and is computed as:

\[
KL(Q_\theta \,\|\, D_\theta) = \sum_{w \in v} p(w \mid Q_\theta) \, \log \frac{p(w \mid Q_\theta)}{p(w \mid D_\theta)}
\]

where $w$ is a word, $v$ is the word vector (vocabulary), $p(w \mid Q_\theta)$ is the estimated probability of the query term, and $p(w \mid D_\theta)$ is the smoothed probability of a term seen in the document.
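A minimal sketch of KL-divergence scoring is given below. The smoothing method is an assumption: it uses simple linear interpolation with the collection model (Jelinek-Mercer style, with a hypothetical weight `lam`), which is not necessarily the configuration used with the Lemur toolkit here. A smaller divergence means the document model is closer to the query model.

```python
import math
from collections import Counter

def kl_divergence(query, doc, collection, lam=0.5):
    """KL divergence between a maximum-likelihood query model and a
    document model smoothed by interpolation with the collection model."""
    q_counts, d_counts, c_counts = Counter(query), Counter(doc), Counter(collection)
    kl = 0.0
    for w, count in q_counts.items():
        p_q = count / len(query)
        p_d = (1 - lam) * d_counts[w] / len(doc) + lam * c_counts[w] / len(collection)
        if p_d == 0:
            raise ValueError(f"query term {w!r} does not occur in the collection")
        kl += p_q * math.log(p_q / p_d)
    return kl

# Toy root-based representations (illustrative only).
doc_a = ["ስ-ብ-ር", "ስ-ብ-ር", "ቤት"]
doc_b = ["ቤት", "መኪና", "ሌላ"]
collection = doc_a + doc_b
query = ["ስ-ብ-ር"]
print(kl_divergence(query, doc_a, collection), kl_divergence(query, doc_b, collection))
```

The document containing the query root receives the smaller divergence and would therefore be ranked first.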