Handwriting-Based Text Line Segmentation from Malayalam Documents

P V, Pearlsy; Sankar, Deepa

doi:10.3390/app13179712


by Pearlsy P V * and Deepa Sankar

Division of Electronics Engineering, School of Engineering, Cochin University of Science and Technology, Kochi 682022, India

* Author to whom correspondence should be addressed.

Appl. Sci. 2023, 13(17), 9712; https://doi.org/10.3390/app13179712

Submission received: 13 July 2023 / Revised: 22 August 2023 / Accepted: 23 August 2023 / Published: 28 August 2023

(This article belongs to the Special Issue Current Trends and Future Perspectives on Computer Vision and Pattern Recognition)



Featured Application

The proposed technique and the database created will be useful for the development of an optical character recognition system for Malayalam handwritten documents.

Abstract

Optical character recognition systems for Malayalam handwritten documents have become an open research area. A major hindrance in this research is the unavailability of a benchmark database. Therefore, a new database of 402 Malayalam handwritten document images and ground truth images of 7535 text lines is developed for the implementation of the proposed technique. This paper proposes a technique for the extraction of text lines from handwritten documents in the Malayalam language, specifically based on the handwriting of the writer. Text lines are extracted based on horizontal and vertical projection values, the size of the handwritten characters, the height of the text lines and the curved nature of the Malayalam alphabet. The proposed technique is able to overcome incorrect segmentation due to the presence of characters written with spaces above or below other characters and the overlapping of lines because of ascenders and descenders. The performance of the proposed method for text line extraction is quantitatively evaluated using the MatchScore value metric and is found to be 85.507%. The recognition accuracy, detection rate and F-measure of the proposed method are found to be 99.39%, 85.5% and 91.92%, respectively. It is experimentally verified that the proposed method outperforms some of the existing language-independent text line extraction algorithms.

Keywords: handwritten document; Malayalam; text line extraction; horizontal projection

1. Introduction

Malayalam is the language spoken by the people of the state of Kerala in India. Since the pandemic caused by the novel coronavirus, it has become necessary to encode the local language, conventionally written with pen and paper, in an electronic format. Optical character recognition (OCR) systems convert handwritten documents into a computer-editable digital form. This helps individuals to share documents among offices and banks, and helps teachers to share notes when conducting online classes in the local language.

Malayalam has a rich character set and is written from left to right. Some of the vowel diacritics extend above or below the full height of a normal letter. The language contains compound letters formed of two different characters, and the characters have a looped and curved nature. Some characters, like ‘Chandrakkala’, are written with a space above another character. Because of the large variations in handwriting styles, large gaps are created above or below the characters where such letters are written. Some of these possible cases are illustrated in Figure 1. All these factors make the recognition of handwritten documents in this language a complex problem [1,2]. As a result, OCR for unconstrained handwritten document images in Malayalam has seen little progress, and most of the research has addressed the recognition of isolated handwritten Malayalam characters. The work presented in this paper extracts the text lines from handwritten Malayalam documents by considering the different handwriting styles in the language.

For unconstrained handwritten documents, the recognition of individual characters can be performed hierarchically. First, text lines are extracted; then, words are segmented from each line; after this, characters are segmented from each word. These characters are recognized and converted into digital form. Using this digital representation, the handwritten characters are encoded into a printable text format. The success of this entire process depends mainly on the correctness of the text line extraction step. In this paper, a novel technique for the extraction of text lines from Malayalam handwritten documents is developed.

The paper is organized as follows. Section 2 describes the related work in this area. Section 3 presents the proposed methods for text line extraction. The experimental setup, results and discussion are detailed in Section 4, while Section 5 concludes the work.

2. Related Works

Optical character recognition (OCR) converts PDF files, scanned documents, images containing text, and printed and handwritten documents into editable electronic documents. OCR can be implemented for printed character recognition and handwritten character recognition. The latter is further categorized into online and offline recognition systems. In online systems, characters are recognized in real time while the user is writing. Offline recognition is performed on document images or PDFs, where only the finished image is available, and it is therefore a more complex problem than online recognition. OCR for languages like English and Chinese is available in handheld devices and personal computers [3,4].

In developing OCR for handwritten documents, text line segmentation is a major step and a challenging problem. Ref. [5] gives an overview of the different methods used for text line extraction, such as projection profiles, smearing methods, Hough transform-based methods and stochastic methods using the Viterbi algorithm. A Khandelwal et al. proposed a method to extract text lines by considering the height of the connected components and neighborhood connectivity [6]. In [7], the connected components in the image are partitioned into three subsets based on their height. Every connected component in a subset is bounded by an equally sized block, and text lines are extracted by applying a Hough transform to these blocks. Ref. [8] presents text line extraction from handwritten documents using natural learning techniques based on the Hough transform. The Hough transform is applied to the minima points of connected components in a small strip of the image, and the minima points are then clustered by applying a moving window algorithm to extract the text lines. A Souhar et al. [9] performed text line segmentation using a watershed transform for Arabic documents. Text lines were linked by connected component analysis and the watershed transform was applied to detect the text lines. Another approach detects a baseline using a projection profile and then applies a watershed transform on the extracted path to segment the text lines. Deep learning architectures like convolutional neural networks (CNN) and generative adversarial networks (GAN), trained using annotated images, are used to extract the text lines in [10,11]. B K Barakat et al. [12] proposed an unsupervised deep learning technique for the extraction of text lines. It makes use of the difference in the number of foreground pixels between the text line region and the space between the text lines. Ref. [13] presents a learning-free algorithm for text line segmentation. Convolving the input image with the second derivative of an anisotropic Gaussian filter detects blob lines that strike through the text lines, and text lines are extracted by assigning connected components to the detected blob lines. Text line extraction from handwritten documents in Indian languages such as Oriya, Bangla and Kannada has been performed by applying a projection profile on the document image [14,15,16].

Works on the recognition of handwritten Malayalam characters have mainly focused on classifying isolated letters. Jomy John et al. [17] proposed a method to recognize the individual characters in Malayalam using the gradient and curvature features of handwritten characters. In [18], handwritten Malayalam character recognition is performed by determining the position and number of horizontal and vertical lines in the skeletonized character set. A chain code histogram from the chain code representation of the boundary of a skeletonized character image is used as a feature vector for character recognition in [19]. Ref. [20] presents the recognition of handwritten Malayalam vowel characters using a Hidden Markov Model toolkit. P. Jino et al. proposed stacked long short-term memory (LSTM) neural networks for the recognition of Malayalam letters [21]. In [22], the count of the zero-crossings of the wavelet coefficients is used as a feature to classify characters. In [23], a 1D wavelet transform is applied on the horizontal and vertical projections of the character image and the transform coefficients are used as the feature vectors for classification. A benchmark database for isolated Malayalam handwritten characters was created in the work published in [24], which also showcased the letters of the Malayalam language. Malayalam OCR for the recognition of printed characters is available online [25,26]. The software e-Aksharayan converts printed documents in seven Indian languages, including Malayalam, into an editable text form [27]. C Shanjana et al. [28] proposed a technique for the recognition of characters from unconstrained Malayalam handwritten documents. They segmented text lines, words and characters in the handwritten document image using horizontal and vertical projection.

A novel method is proposed in this paper to extract the text lines from Malayalam handwritten documents. The proposed method is detailed in the next section.

3. Proposed Method

The different steps involved in the newly developed technique for the extraction of text lines are shown in Figure 2. The main steps in the proposed method can be summarized as preprocessing, the detection and segmentation of overlapping lines, and the detection of incorrectly segmented short lines and their joining to the correct lines. These steps are explained in the following subsections.

3.1. Preprocessing

The handwritten document is first scanned and a color image in PNG format is obtained. It is converted into a grayscale image, which is binarized using Otsu’s global thresholding algorithm [29]. In MATLAB, the function graythresh() returns an image-dependent threshold value, calculated using Otsu’s global thresholding algorithm. This threshold is used by the function imbinarize() to convert the grayscale image into binary with black characters on a white background. The binarized image is then inverted so that the image can be subjected to horizontal and vertical projection to segment the text lines. The preprocessed image thus obtained is named $I_{b}$. The image $I_{b}$ is divided into three vertical stripes of equal width. A morphological operation is performed on each vertical stripe to eliminate the isolated pixels, i.e., a 1 surrounded by 0s. Following this, a morphological bridge operation is performed on all the vertical stripes to connect or bridge the pixels that are not connected. A summary of the overall processes in performing text line extraction after the preprocessing step is given below.
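The binarization step can be illustrated outside MATLAB. The following is a minimal sketch in plain Python, not the authors' code: it assumes an 8-bit grayscale image stored as a list of rows, and the function names `otsu_threshold` and `binarize_inverted` are hypothetical. Unlike graythresh(), which returns a normalized level, this sketch returns an integer gray level.

```python
def otsu_threshold(gray):
    """Return the gray level t (0..255) that maximizes between-class variance.

    Pixels with value <= t form one class and pixels > t the other
    (Otsu's global thresholding on the image histogram).
    """
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the dark class
    sum0 = 0    # sum of gray levels in the dark class
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue            # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break               # bright class empty from here on
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mu0 - mu1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize_inverted(gray):
    """Binarize and invert in one step: ink (dark) becomes 1, paper becomes 0."""
    t = otsu_threshold(gray)
    return [[1 if value <= t else 0 for value in row] for row in gray]
```

Producing the inverted image directly (text as 1, background as 0) is a convenience of this sketch; the paper describes binarization and inversion as two separate steps.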

Text lines are extracted in each vertical stripe separately using the horizontal projection (HP) method described in [28]. The line segments extracted from each vertical stripe may be correctly or incorrectly segmented. Incorrect segmentation is caused either by overlapping lines or by special characters such as ‘Chandrakkala’ in the Malayalam language and the wide separation of some letters written below a particular character. The latter case causes such characters to be segmented into separate lines, called short lines. All these possibilities are checked for and corrected by the proposed technique. As a first step, each line extracted from a vertical stripe is checked to determine whether it is an overlapped line. If not, it is checked for a short line. If a line is detected as an overlapped line, it is further divided into individual lines. The individual lines obtained after segmenting the overlapped lines are then checked for incorrectly segmented short lines. If a short line is detected, it is joined to the correct line. The line segments obtained in the three vertical stripes after all these procedures are joined with the corresponding line segments in the other vertical stripes. These methods are developed with the assumption that none of the lines begin as a paragraph with large indentation from the margin. Moreover, it is assumed that, except for the last line in the page, no lines terminate before the end of a line in the page. These assumptions ensure that all three vertical stripes contain parts of every line except the last line. A detailed description of the developed techniques to detect and process overlapping lines and short lines is given in the following subsections.
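As an illustration of the stripe-wise horizontal projection step, here is a minimal Python sketch. The details of the HP method in [28] are not reproduced in this text, so the sketch assumes the common variant in which a line segment is a maximal run of rows with a non-zero projection value. The function names are hypothetical, and the image is a list of rows with text pixels equal to 1.

```python
def split_into_stripes(img, n=3):
    """Split a binary image (list of rows) into n vertical stripes of near-equal width."""
    width = len(img[0])
    bounds = [width * k // n for k in range(n + 1)]
    return [[row[bounds[k]:bounds[k + 1]] for row in img] for k in range(n)]


def horizontal_projection(img):
    """Number of text pixels in each row."""
    return [sum(row) for row in img]


def extract_lines(img):
    """Return inclusive (top, bottom) row ranges of runs with non-zero HP."""
    hp = horizontal_projection(img)
    lines, start = [], None
    for r, value in enumerate(hp):
        if value > 0 and start is None:
            start = r                      # a new line segment begins
        elif value == 0 and start is not None:
            lines.append((start, r - 1))   # the segment ended on the previous row
            start = None
    if start is not None:                  # segment touching the bottom edge
        lines.append((start, len(hp) - 1))
    return lines
```

Each stripe returned by `split_into_stripes` would be passed to `extract_lines` separately, so a fragment that occupies only one stripe shows up as an extra range in that stripe alone.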

3.2. Detection of Overlapping Lines

1. Find the average value of the height of the characters ($Avg_{c_{ht}}$) in the preprocessed handwritten image, $I_{b}$. This is obtained by finding the average height of the connected components in $I_{b}$.
2. Identify the region containing the lines in each vertical stripe. Then, find the median value of the number of rows in each region, which indicates the line height, $L_{ht}$. The obtained value is the median value of the line heights, $M_{L_{ht}}$, in a vertical stripe.
3. The threshold value, $T_{ov}$, for identifying the overlapping lines is calculated based on the values obtained from step 1 and step 2. Threshold $T_{ov}$ is computed as follows.

   If ($M_{L_{ht}} > 4 \times Avg_{c_{ht}}$),

   $T_{ov} = 1.5 \times M_{L_{ht}}.$ (1)

   Otherwise,

   $T_{ov} = 1.9 \times M_{L_{ht}}.$ (2)
4. Compare the height of each line with the threshold value, $T_{ov}$.
5. If the height of the line segment is above or equal to $T_{ov}$, it is detected as an overlapped line.
6. If a line is detected as an overlapped line segment, then the number of lines $L_{cnt}$ in the overlapped line is calculated as follows.

   If ($M_{L_{ht}} > 5 \times Avg_{c_{ht}}$),

   $L_{cnt} = \frac{L_{ht}}{5 \times Avg_{c_{ht}}}.$ (3)

   If ($M_{L_{ht}} > 4 \times Avg_{c_{ht}}$ and $M_{L_{ht}} \leq 5 \times Avg_{c_{ht}}$),

   $L_{cnt} = \frac{L_{ht}}{M_{L_{ht}}}.$ (4)

   If ($M_{L_{ht}} > 3 \times Avg_{c_{ht}}$ and $M_{L_{ht}} \leq 4 \times Avg_{c_{ht}}$),

   $L_{cnt} = \frac{L_{ht}}{1.4 \times M_{L_{ht}}}.$ (5)

   If ($M_{L_{ht}} > 2 \times Avg_{c_{ht}}$ and $M_{L_{ht}} \leq 3 \times Avg_{c_{ht}}$),

   $L_{cnt} = \frac{L_{ht}}{1.7 \times M_{L_{ht}}}.$ (6)

The size of the characters written may vary from person to person. This is considered while framing Equations (1)–(6) to compute the threshold value $T_{ov}$ and the number of lines in the overlapped segment $L_{cnt}$ in the proposed technique. Once a line segment is identified as having overlapping lines, it has to be segmented into individual lines. The steps involved in segmenting the overlapping lines are shown in Figure 3 and a detailed description is given in the next subsection.
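Equations (1)–(6) translate directly into a few comparisons. The sketch below is one possible Python reading, with hypothetical function names. Two points are assumptions rather than statements from the text: the ratio in Equations (3)–(6) is returned unrounded because no rounding rule is given, and the case $M_{L_{ht}} \leq 2 \times Avg_{c_{ht}}$, which the equations do not cover, raises an error.

```python
def overlap_threshold(m_lht, avg_cht):
    """T_ov from Equations (1) and (2)."""
    if m_lht > 4 * avg_cht:
        return 1.5 * m_lht
    return 1.9 * m_lht


def is_overlapped(l_ht, m_lht, avg_cht):
    """A line segment is overlapped when its height reaches T_ov."""
    return l_ht >= overlap_threshold(m_lht, avg_cht)


def line_count(l_ht, m_lht, avg_cht):
    """L_cnt from Equations (3)-(6), returned as an unrounded ratio."""
    if m_lht > 5 * avg_cht:
        return l_ht / (5 * avg_cht)        # Equation (3)
    if m_lht > 4 * avg_cht:
        return l_ht / m_lht                # Equation (4)
    if m_lht > 3 * avg_cht:
        return l_ht / (1.4 * m_lht)        # Equation (5)
    if m_lht > 2 * avg_cht:
        return l_ht / (1.7 * m_lht)        # Equation (6)
    # Not covered by Equations (3)-(6); failing loudly is a choice of this sketch.
    raise ValueError("M_Lht <= 2 * Avg_cht is not covered by Equations (3)-(6)")
```

For example, with the illustrative values `m_lht=100` and `avg_cht=20`, the threshold is 150 and a segment of height 200 is counted as two lines by Equation (4).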

3.3. Separation of Overlapping Lines

A novel technique is developed to separate overlapping lines present in the extracted text line segment from vertical stripes. The number of times that the method is applied for the separation of these overlapping lines depends on the number of overlapping lines in the extracted text line. This is obtained using Equations (3)–(6). The technique then determines the region, $R_{ov}$, where the initial segmentation is to be carried out. Let $r_{1}$ and $r_{2}$ be the rows between which the overlapping of lines occurs, which are determined by Equations (7)–(10).

If ($M_{L_{ht}} > 4 \times Avg_{c_{ht}}$),

$r_{1} = M_{L_{ht}} - \frac{Avg_{c_{ht}}}{2},$ (7)

$r_{2} = M_{L_{ht}} + \frac{Avg_{c_{ht}}}{2}.$ (8)

Otherwise,

$r_{1} = M_{L_{ht}} + \frac{Avg_{c_{ht}}}{2},$ (9)

$r_{2} = M_{L_{ht}} + Avg_{c_{ht}}.$ (10)
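Equations (7)–(10) can be checked with a short Python sketch; the function name is hypothetical. The inputs 54.5 and 18 in the usage note are hypothetical values, chosen only because the "otherwise" branch then reproduces the $r_{1} = 63.5$ and $r_{2} = 72.5$ reported in Section 4.2. The text does not state the actual $M_{L_{ht}}$ and $Avg_{c_{ht}}$ for that example.

```python
def overlap_region(m_lht, avg_cht):
    """Return (r1, r2), the row bounds of the overlap region R_ov."""
    if m_lht > 4 * avg_cht:
        # Equations (7) and (8): a band centred on the median line height.
        return m_lht - avg_cht / 2, m_lht + avg_cht / 2
    # Equations (9) and (10): a band just below the median line height.
    return m_lht + avg_cht / 2, m_lht + avg_cht
```

Calling `overlap_region(54.5, 18)` returns `(63.5, 72.5)`. In the first branch the band is $Avg_{c_{ht}}$ rows tall, while in the second it is half that height.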

The reason for overlapping lines is the presence of ascenders and descenders in the language. The ascenders and descenders will be present in the region between $r_{1}$ and $r_{2}$. The parts of the characters that are present in this region must be identified as ascenders or descenders. Accordingly, the overlapping lines have to be segmented. Each line from the region containing the overlapping lines is segmented out one by one.

For this, the beginning and end of each character in the region $R_{ov}$ are found using the horizontal projection method. The beginning of a character is the first non-zero horizontal projection (HP) value that exists between $r_{1}$ and $r_{2}$. Similarly, the last non-zero HP value is the end of the character in the vertical direction. These positions can be named $ch_{o_{beg}}$ and $ch_{o_{end}}$. This region is specified as the character region in the flowchart shown in Figure 3. In order to identify the presence of ascenders and descenders in this region, the row numbers of the largest two HP values between $ch_{o_{beg}}$ and $ch_{o_{end}}$ are found. Let the row containing the highest HP value be $rh_{p_{1}}$ and the row containing the second highest HP value be $rh_{p_{2}}$. Now, the text line region containing the overlapping lines is divided into three parts, the upper region ($U_{ov}$), lower region ($L_{ov}$) and character region. The region above the character region is termed the upper region, $U_{ov}$, and that below the character region is termed the lower region, denoted as $L_{ov}$. The portion of the characters located in the region of overlapping lines is associated with the appropriate line as per the following rule.

If ($rh_{p_{1}} < rh_{p_{2}}$),

Join rows $ch_{o_{beg}}$ to $rh_{p_{1}} - 1$ to the bottom of $U_{ov}$;

Join rows $rh_{p_{2}} + 1$ to $ch_{o_{end}}$ to the top of $L_{ov}$.

Otherwise,

Join rows $ch_{o_{beg}}$ to $rh_{p_{2}} - 1$ to the bottom of $U_{ov}$;

Join rows $rh_{p_{1}} + 1$ to $ch_{o_{end}}$ to the top of $L_{ov}$.
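The row bookkeeping in this rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: rows are 0-indexed, the candidate rows are the integer rows strictly between $r_{1}$ and $r_{2}$ (in line with Section 4.2, where 63.5 and 72.5 give rows 64 to 72), and the function name `split_overlap` is hypothetical. Because both branches of the rule hand the rows above the upper peak to $U_{ov}$ and the rows below the lower peak to $L_{ov}$, the sketch uses the minimum and maximum of the two peak rows.

```python
import math


def split_overlap(hp, r1, r2):
    """Split an overlapped segment into upper, middle and lower row ranges.

    hp is the horizontal projection of the overlapped segment (one value per row).
    Returns inclusive (first, last) ranges, or None if the band holds no ink.
    """
    lo, hi = math.ceil(r1), math.floor(r2)
    inked = [r for r in range(lo, hi + 1) if hp[r] > 0]
    if not inked:
        return None
    ch_beg, ch_end = inked[0], inked[-1]           # character region

    # Rows of the two largest HP values inside the character region.
    ranked = sorted(range(ch_beg, ch_end + 1), key=lambda r: hp[r], reverse=True)
    rhp1 = ranked[0]
    rhp2 = ranked[1] if len(ranked) > 1 else ranked[0]
    top, bottom = min(rhp1, rhp2), max(rhp1, rhp2)

    upper = (0, top - 1)                 # U_ov after joining rows ch_beg..top-1
    middle = (top, bottom)               # rows still to be split columnwise
    lower = (bottom + 1, len(hp) - 1)    # L_ov after joining rows bottom+1..ch_end
    return upper, middle, lower
```

The `middle` range is the part that the text goes on to divide columnwise; a range whose last index is smaller than its first is empty.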

Thus, the upper region and lower region are updated. The rows between $rh_{p_{1}}$ and $rh_{p_{2}}$ are divided columnwise to separate the portion of the characters present in this region. This portion contains the ascenders and descenders that are responsible for the overlapping. The columnwise segmentation is performed using the vertical projection (VP) method. For each such segmented character, if the HP value in the first row is greater than that in the last row, the character is joined to the lower region, considering it as an ascender. Otherwise, it is considered as a descender and joined to the upper region.

After segmenting the overlapped lines, it is checked whether they are short lines. The technique developed to detect the short lines is described in the next subsection.

3.4. Detection of Incorrectly Segmented Short Lines

As discussed in Section 3.1, short lines are the result of incorrect segmentation due to the spaces created by characters written above or below a letter. Depending on the handwriting, this spacing will vary. Short lines contain sparsely scattered fragments of characters and therefore the height will be smaller. Hence, the threshold to detect short lines depends on the height of the line, the width of the characters present in the line and the number of non-zero vertical projection values. Because of this dependency, thresholds for detecting short lines are developed and are given in Equations (11)–(14).

1. Determine the threshold value $t_{sh_{1}}$ using the equation

   $t_{sh_{1}} = 1.9 \times Avg_{ch_{wid}}$ (11)

   where $Avg_{ch_{wid}}$ is the average character width in a document image.
2. A second threshold value $t_{sh_{2}}$ is determined as follows.

   If ($M_{L_{ht}} \leq 3 \times Avg_{c_{ht}}$),

   $t_{sh_{2}} = \frac{t_{sh_{1}}}{3}.$ (12)

   If ($M_{L_{ht}} > 3 \times Avg_{c_{ht}}$ and $M_{L_{ht}} \leq 4 \times Avg_{c_{ht}}$),

   $t_{sh_{2}} = \frac{t_{sh_{1}}}{4}.$ (13)

   If ($M_{L_{ht}} > 4 \times Avg_{c_{ht}}$),

   $t_{sh_{2}} = \frac{t_{sh_{1}}}{7}.$ (14)
3. Compute the number of non-zero vertical projection values, $VPN$, in the given text line segmented from the handwritten document. This is determined to check the sparsity of character fragments present in the short line.
4. A short line is detected if ($VPN < 2 \times t_{sh_{1}}$ and $VPN > t_{sh_{2}}$), or if ($VPN < 9 \times t_{sh_{1}}$ and $VPN > t_{sh_{2}}$ and $L_{ht} < 0.5 \times M_{L_{ht}}$).

Depending on the handwriting, the values of $Avg_{c_{ht}}$, $Avg_{ch_{wid}}$ and $M_{L_{ht}}$ may vary from one document to another.
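The short-line test of Equations (11)–(14) can be written compactly. The following Python sketch uses hypothetical function names and assumes that the line segment is a list of rows with text pixels equal to 1.

```python
def short_line_thresholds(m_lht, avg_cht, avg_chwid):
    """Return (t_sh1, t_sh2) from Equations (11)-(14)."""
    t_sh1 = 1.9 * avg_chwid                 # Equation (11)
    if m_lht <= 3 * avg_cht:
        t_sh2 = t_sh1 / 3                   # Equation (12)
    elif m_lht <= 4 * avg_cht:
        t_sh2 = t_sh1 / 4                   # Equation (13)
    else:
        t_sh2 = t_sh1 / 7                   # Equation (14)
    return t_sh1, t_sh2


def count_nonzero_vp(line_img):
    """VPN: number of columns that contain at least one text pixel."""
    return sum(1 for col in zip(*line_img) if any(col))


def is_short_line(vpn, l_ht, m_lht, t_sh1, t_sh2):
    """Apply the two alternative short-line conditions."""
    narrow = t_sh2 < vpn < 2 * t_sh1
    wide_but_low = t_sh2 < vpn < 9 * t_sh1 and l_ht < 0.5 * m_lht
    return narrow or wide_but_low
```

The lower bound $t_{sh_{2}}$ appears in both conditions, so a segment with very few inked columns is never flagged by this test, whatever its height.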

3.5. Joining of Incorrectly Segmented Line to the Correct Line

If the detected short line is the first line in a vertical stripe, we join it with the next line. If it is the last line, we join it with the previous line. If the line is between the first and last lines, we find the position at which the HP value is highest in the line. If the position is after the middle of the line, then the line is considered incorrectly segmented due to the character ‘Chandrakkala’. Therefore, it is joined to the next line; otherwise, it is joined to the previous line.
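The joining rule can be sketched as a small decision function in Python. The text does not say how "the middle of the line" is measured; since the HP is indexed by row, this sketch assumes that the position is a row index within the short line and that the middle is its vertical midpoint. The function name and the string return values are hypothetical.

```python
def join_target(index, n_lines, hp_line):
    """Decide whether a short line joins the 'next' or the 'previous' line.

    index   -- position of the short line among the lines of the stripe (0-based)
    n_lines -- number of lines extracted from the stripe
    hp_line -- horizontal projection of the short line (one value per row)
    """
    if index == 0:
        return "next"                      # first line in the stripe
    if index == n_lines - 1:
        return "previous"                  # last line in the stripe
    peak_row = max(range(len(hp_line)), key=lambda r: hp_line[r])
    middle = (len(hp_line) - 1) / 2
    # A peak below the midpoint is treated as a 'Chandrakkala' of the next line.
    return "next" if peak_row > middle else "previous"
```

A stripe that contains only a single line has no neighbour to join to; that case is not handled here.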

4. Results and Discussion

The proposed techniques are implemented using the software MATLAB R2022b. A new database of Malayalam handwritten documents, LIPI, is created as an initial step in this research work. All the techniques proposed in this paper are validated using images taken from this database. As discussed in Section 1, the proposed methods are developed by considering the specific characteristics of the Malayalam alphabet and different handwriting styles. Therefore, a publicly available database of other languages cannot be used to test the proposed method. A brief description of the newly developed database, LIPI, is given in the next subsection.

4.1. Database for Malayalam Handwritten Documents

The database is created by collecting Malayalam handwritten documents from professionals and undergraduate and postgraduate students aged between 18 and 45. The articles for the manuscripts are collected from leading Malayalam newspapers and textbooks and cover all the letters, consonant diacritics and conjunct consonants in the Malayalam language. Each manuscript is written on A4 size paper without any constraints on the pen or on the script of the Malayalam language. At present, people write using a combination of the old Malayalam script and the new Malayalam script. In total, 402 handwritten documents collected from 200 people are scanned using an Epson L310 flatbed scanner at 200 dpi. The time taken to scan one document is approximately 29 s, and all images are scanned at the same rate. A faithful representation of the images is obtained. In 1% of the scanned images, some straight lines are seen at the bottom of the page where text is not present. These lines are automatically removed during the processing of the vertical stripes to eliminate short lines. No prominent errors arising from the handwriting style of the author are observed. Text that is written at the right end of the paper without a margin may be lost, as it crosses the scanning boundary.

Each scanned image is in PNG format and its size is 2338 × 1654 pixels. An overall description of the images in the database is given in Table 1. The ground truth images for all the lines in the documents are created manually using the freehand crop tool in the MATLAB software.

A sample image from the newly developed LIPI database is shown in Figure 4 and the image after binarization is shown in Figure 5. The ground truth images created for text lines 1 and 5 in the image of Figure 5 are shown in Figure 6a,b, respectively. Ground truth images are created for each of the 7535 text lines in the handwritten document images. As shown in Figure 6a,b, the exact position of the text lines in the document images is retained in the ground truth images.

4.2. Implementation Results

The image obtained from the scanner is a color image, as in Figure 4. It is converted to grayscale and then to binary. The binary image is inverted to make the background black and foreground (handwritten characters) white. The binary image obtained after such processing is shown in Figure 5. The binary image is divided into three vertical stripes so as to have almost straight lines, as illustrated in Figure 7. In the proposed method, the number of vertical stripes is fixed at three. If the page is not divided into vertical stripes, then the text lines will be slanting. A similar case arises when the number of vertical stripes is 2. In both these cases, text line extraction based on the horizontal projection method will not yield good accuracy. If the number of vertical stripes is increased above three, the width of each vertical stripe decreases accordingly. Then, there is a possibility of missing the portion of the text line present in these vertical stripes completely, which is against the assumptions based on which the algorithms are designed.

The text lines extracted from the vertical stripes using the horizontal projection (HP) method contain overlapping and short lines, as shown in Figure 8. Multiple lines or overlapping lines are present in vertical stripe 1 ($VS_{1}$) and vertical stripe 2 ($VS_{2}$) in lines 12 and 13, respectively. The average height of the characters for a document in the database is between 11 and 23 and depends on the handwriting of the author. Based on the average height of the characters, the threshold $T_{ov}$, computed using Equations (1) and (2) to detect overlapping lines, is in the range of 91.5 to 140.65 for all the images in the database. This is because of the size variations in the characters written by different authors. To demonstrate the segmentation of overlapping lines using the proposed technique, such lines in $VS_{2}$ in Figure 8 are shown separately in Figure 9. It is observed that the HP value is non-zero between the overlapping lines because of ascenders and descenders in the language. To segment the overlapping lines, a region is found using the proposed method; it is shown using dotted lines in Figure 9 for better understanding. The overlapping lines are segmented correctly by applying the proposed techniques discussed in Section 3.2 and Section 3.3, and the result is shown in Figure 10. For the overlapping lines shown in Figure 9, of size 162 × 552, the region for the segmentation of the lines is found by computing the values $r_{1}$ and $r_{2}$ using Equations (7)–(10), which are obtained as 63.5 and 72.5, respectively. $r_{1}$ and $r_{2}$ are marked manually using dotted lines in Figure 9. Rows 64 to 72 represent the character region, as discussed in Section 3.3. Therefore, rows 1 to 63 are the upper region, and rows 73 to 162 are the lower region. From the database of 402 handwritten document images, 629 overlapping lines are detected, of which 441 are segmented correctly. The overlapping lines in $VS_{1}$ and $VS_{2}$ in Figure 8 are segmented correctly and the result of the segmentation is depicted in Figure 11.

The remaining short lines in Figure 11 are addressed using the proposed methods discussed in Section 3.4 and Section 3.5. The height of the short lines is very small and the characters present in such lines can be as small as a single dot. The short lines are joined to the appropriate lines, as shown in Figure 12. The thresholds $t_{sh_{1}}$ and $t_{sh_{2}}$, computed according to the proposed method in Section 3.4 to detect short lines, are in the range of 19 to 39.9 and 2.7145 to 13.3, respectively. These two thresholds depend on the average character width per page, which lies between 10 and 22 for the document images in the database. With the HP method, the gap between a Malayalam letter and a character such as ‘Chandrakkala’ placed above it, or the gap in compound characters where a letter is placed below another letter, results in the segmentation of such characters into separate lines called short lines. Since neither of these cases is present in Figure 11, examples of ‘Chandrakkala’ and of a compound character segmented as short lines are given in Figure 13, Figure 14, Figure 15 and Figure 16. In Figure 13, ‘Chandrakkala’ is incorrectly segmented due to the small gap between it and the characters above which it is written, and this gap will vary depending on individual handwriting styles. The proposed techniques join ‘Chandrakkala’ to the correct line, as shown in Figure 14. While writing a compound character with one letter below another letter, a space is created due to the writing style. Due to this gap, the letter written below is segmented as a short line, which is illustrated in Figure 15. The proposed techniques rejoin the letter to its line, as depicted in Figure 16. From the images in the LIPI database, 2607 short lines are detected, out of which 2577 short lines are joined to the appropriate lines.

To obtain the complete line, the lines extracted in the three vertical stripes are joined together with the corresponding lines. The result of complete text line extraction is shown in Figure 17.

When a text line is extracted from a vertical stripe, the upper and lower positions of the line in the vertical stripe are stored in an array. When short lines are detected and joined to the correct lines, the position of the newly formed line is stored and that of the short line will be deleted. When a short line is joined to the correct line, the total number of lines extracted from the vertical stripe is reduced accordingly. Similarly, as for the overlapping lines, the positions of newly formed lines after segmenting the overlapping lines will be updated and the total number of lines in the vertical stripe will be increased accordingly. After completing the line extraction separately from each vertical stripe, it has to be joined to the corresponding part of this line in the other two vertical stripes.

One of the challenges encountered is that the lines extracted from the vertical stripe may not have the same height. To overcome this, a template of black pixels with the same size as the binarized document image is created. Then, a text line from vertical stripe 1 is placed in the position from which it is extracted. Similarly, the corresponding part of this text line from vertical stripe 2 and vertical stripe 3 is placed and a complete line is formed. As can be observed, the extracted text line appears identical to the text line in the inverted binarized image. Then, the text line formed is stripped off from the template and the output is shown in Figure 17.
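The template-based joining described above can be sketched in Python as follows. The function name and the data layout are assumptions of this sketch: the binarized page is a list of rows, each stripe is given by its half-open column range, and the line segment found in each stripe is given by its inclusive row range.

```python
def assemble_line(binary, col_bounds, row_ranges):
    """Paste the three stripe segments of one text line onto a black template.

    binary     -- inverted binarized page (text = 1, background = 0)
    col_bounds -- one (c0, c1) half-open column range per vertical stripe
    row_ranges -- one (top, bottom) inclusive row range per stripe for this line
    """
    height, width = len(binary), len(binary[0])
    template = [[0] * width for _ in range(height)]   # black page of the same size
    for (c0, c1), (top, bottom) in zip(col_bounds, row_ranges):
        for r in range(top, bottom + 1):
            # Each segment keeps the position from which it was extracted.
            template[r][c0:c1] = binary[r][c0:c1]
    return template
```

Because every segment is copied back to its original coordinates, segments of different heights in the three stripes line up without any rescaling.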

The process of joining the text lines extracted from the three vertical stripes is performed successfully, exactly matching the text line in the binarized image. An error occurs only if the text lines are not extracted correctly from any of the three vertical stripes.

4.3. Analysis of Word Area and Text Line Density in Malayalam Handwritten Documents

The self-similarity and complexity of different shapes in space can be quantitatively expressed using the Minkowski Dimension [30]. To calculate this, consider a grid of boxes covering an object in space. Measure the number of boxes that cover the object and repeat the same using boxes scaled at different sizes. The Minkowski Dimension is calculated using Equation (15).

$$\text{Minkowski Dimension, } MD = -\frac{\log n(s)}{\log s} \qquad (15)$$

where $n(s)$ is the number of boxes of size $s$ that cover the object.
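A box-counting estimate in the spirit of Equation (15) can be sketched as below. The sketch makes assumptions that the paper does not state: the word image is a binary NumPy array, box sizes are powers of two in pixels, and MD is taken as minus the least-squares slope of $\log n(s)$ against $\log s$ over several scales, which is a common practical estimator and not necessarily the authors' procedure. The function names are hypothetical.

```python
import numpy as np


def box_count(binary, size):
    """Number of size x size boxes that contain at least one ON pixel."""
    h, w = binary.shape
    # Pad with OFF pixels so the image divides evenly into boxes.
    pad_h, pad_w = -h % size, -w % size
    padded = np.pad(binary.astype(bool), ((0, pad_h), (0, pad_w)))
    H, W = padded.shape
    blocks = padded.reshape(H // size, size, W // size, size)
    return int(blocks.any(axis=(1, 3)).sum())


def minkowski_dimension(binary, sizes=(1, 2, 4, 8, 16)):
    """Estimate MD as minus the slope of log n(s) against log s."""
    counts = np.array([box_count(binary, s) for s in sizes], dtype=float)
    if np.any(counts == 0):
        raise ValueError("image has no ON pixels")
    slope, _ = np.polyfit(np.log(sizes), np.log(counts), 1)
    return -slope


filled = np.ones((64, 64), dtype=np.uint8)   # area-filling shape
stroke = np.zeros((64, 64), dtype=np.uint8)  # a single thin horizontal stroke
stroke[32, :] = 1
print(minkowski_dimension(filled), minkowski_dimension(stroke))
```

For these two synthetic shapes the estimates should come out close to 2 and 1, the values expected for a filled region and a thin stroke.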

The Minkowski Dimension of words in ten Malayalam handwritten pages, written by ten different authors with the same content, is shown in Figure 18. For objects in two-dimensional space, the Minkowski Dimension, MD, lies between 0 and 2. It gives a measure of the area occupied by a word, that is, how densely the characters fill the word. A value between 0 and 1 indicates that the characters fill the word space less densely, leaving more void space in the word, and that the spacing between the characters is not uniform. A value between 1 and 2 indicates that the characters are placed more densely and with uniform spacing inside the word. The Minkowski Dimension thus gives insight into the regularity of the strokes, spacing, loops, curves and connections within a word, and it can be used to analyze the handwriting styles of different authors through the fractal properties of the handwritten text. The values displayed in Figure 18 range between 0.4524 and 1.1581.

The Minkowski Dimension is very useful in document analysis as it gives insight into the complexity of characters, words and text lines.

Text line density is a measure that is useful in the segmentation of text lines. It is found using Equation (16):

$$\text{Text line density} = \frac{\text{Area of text lines in the document}}{\text{Area of the document}} \qquad (16)$$

Text line density analysis gives more information about the authors, writing tools and structures of documents. The text line density of 15 Malayalam handwritten pages written by 15 authors is shown in Figure 19. It ranges between 14.06% and 29.7%, which indicates that the text lines are less densely packed, with larger spacing and large margins.
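Equation (16) can be evaluated with a few lines of Python. The paper does not state how the area of a text line is measured, so this sketch assumes it is the area of the line's bounding box; the function name and the example line dimensions are hypothetical, while the page size and the average of 18 lines per page are taken from Table 1.

```python
def text_line_density(line_boxes, page_height, page_width):
    """Ratio of summed line bounding-box area to page area, as in Equation (16).

    line_boxes: iterable of (top, bottom, left, right) with exclusive bottom/right.
    """
    line_area = sum((bottom - top) * (right - left)
                    for top, bottom, left, right in line_boxes)
    return line_area / (page_height * page_width)


# Hypothetical page: 18 lines, each 40 px high and 1200 px wide, on a 2338 x 1654 image.
boxes = [(100 + 110 * k, 140 + 110 * k, 200, 1400) for k in range(18)]
density = text_line_density(boxes, page_height=2338, page_width=1654)
print(f"{density:.2%}")
```

With these made-up line sizes the density falls inside the 14.06% to 29.7% range reported for Figure 19, which shows how much of such a page is left to margins and inter-line spacing.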

The accuracy of the detection and joining of short lines, the detection and segmentation of overlapping lines and text line extraction are given in Table 2. The performance of the text line extraction process is quantitatively evaluated using the metrics discussed in the next subsection.

4.4. Performance Evaluation

The performance of the newly developed text line extraction method is quantitatively evaluated using standard metrics such as the MatchScore, Detection Rate (DR), Recognition Accuracy (RA) and F-measure (FM) [31].

4.4.1. MatchScore

The MatchScore value gives a quantitative measure of the number of ON pixels in the ground truth line matching the ON pixels in the detected line. A MatchScore table is constructed by computing the MatchScore value of a detected text line with all ground truth text lines of the corresponding document. The MatchScore value between the $i$th detected line and the $j$th ground truth line is calculated using Equation (17):

$$\text{MatchScore}(i, j) = \frac{C(G_{j} \cap D_{i} \cap I_{b})}{C((G_{j} \cup D_{i}) \cap I_{b})} \qquad (17)$$

where $C(X)$ gives the number of elements in set $X$, $I_{b}$ is the preprocessed (binary) image that is used to extract the text lines, $G_{j}$ is the set of pixels in the $j$th ground truth line and $D_{i}$ is the set of pixels in the $i$th detected line. The MatchScore ranges between 0 and 1, where 1 indicates the best match. A detected text line is considered correct if there exists a one-to-one match between the detected text line and a ground truth line and the MatchScore value is greater than a threshold value, $T_{ms}$. The threshold value chosen is 0.9999, as compared with the values used in the existing works [31,32]. It is observed that if the MatchScore is greater than 0.9999, the detected text line and ground truth line are perfectly matched. The total number of ground truth (GT) lines is 7535, and 6482 lines are detected from the 402 document images. Of the detected lines, 6443 have a one-to-one match with a ground truth line with a MatchScore value greater than 0.9999. This shows that 85.5% of the ground truth text lines (6443 of 7535) are perfectly matched by a detected line across all the documents considered for simulation. The number of ground truth (GT) lines and detected (D) lines and the % of detected lines with a MatchScore higher than $T_{ms}$ are depicted in Table 2.
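The MatchScore table and the one-to-one matching rule can be sketched as follows. The sketch assumes that every line region and the binary image are boolean NumPy masks of the same shape; the function names are hypothetical, and the uniqueness test in `count_one_to_one` is one plausible reading of "one-to-one match", not necessarily the evaluation code used by the authors.

```python
import numpy as np


def match_score(gt_mask, det_mask, binary):
    """Equation (17): |G & D & Ib| / |(G | D) & Ib| on boolean pixel masks."""
    g, d, ib = (np.asarray(m, dtype=bool) for m in (gt_mask, det_mask, binary))
    union = np.count_nonzero((g | d) & ib)
    if union == 0:
        return 0.0  # no ON pixels in either region
    return np.count_nonzero(g & d & ib) / union


def match_score_table(gt_masks, det_masks, binary):
    """Rows index detected lines i, columns index ground truth lines j."""
    return np.array([[match_score(g, d, binary) for g in gt_masks]
                     for d in det_masks])


def count_one_to_one(table, threshold=0.9999):
    """Count pairs above the threshold that are unique in both their row and column."""
    hits = table > threshold
    row_unique = hits.sum(axis=1, keepdims=True) == 1
    col_unique = hits.sum(axis=0, keepdims=True) == 1
    return int((hits & row_unique & col_unique).sum())


def band(first_row, last_row):
    """Hypothetical helper: mask covering rows [first_row, last_row) of a 6 x 6 page."""
    mask = np.zeros((6, 6), dtype=bool)
    mask[first_row:last_row] = True
    return mask


ink = band(0, 2) | band(3, 5)           # ON pixels of two text lines
gt = [band(0, 3), band(3, 6)]           # ground truth regions
det = [band(0, 3), band(4, 6)]          # second detection misses one row of ink
table = match_score_table(gt, det, ink)
print(table, count_one_to_one(table))
```

In this toy case the first detected line matches its ground truth line exactly, while the second recovers only half of the ink and therefore falls below the threshold.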

4.4.2. Detection Rate (DR), Recognition Accuracy (RA) and F-Measure (FM)

The Detection Rate (DR) is the ratio of the number of correctly detected lines to the given number of ground truth lines in the handwritten image document and is computed as in Equation (18).

$$\text{Detection Rate, } DR = \frac{N_{c}}{N_{g}} \qquad (18)$$

where $N_{c}$ is the number of correctly detected lines and $N_{g}$ is the number of ground truth lines. The detection rate for the experiment is found to be 85.5%. The accuracy of the detected lines is calculated using the recognition accuracy metric as in Equation (19):

$$\text{Recognition Accuracy, } RA = \frac{N_{c}}{N_{d}} \qquad (19)$$

where $N_{d}$ is the number of detected lines. The implementation of the proposed method results in recognition accuracy of 99.39%.

The harmonic mean of DR and RA is called the F-measure (FM) and it is computed using Equation (20):

$$\text{F-measure, } FM = \frac{2 \times DR \times RA}{DR + RA} \qquad (20)$$

The value of FM lies between 0 and 1. FM is 1 only when both DR and RA reach their maximum value, and FM is zero if either DR or RA is zero. For the LIPI database, the proposed method results in an F-measure of 91.92%. The observed values of DR, RA and FM for the experiment, in percentages, are given in Table 3.
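As a quick check of Equations (18) to (20), the three metrics can be recomputed from the line counts reported in this section (6443 correctly detected lines, 7535 ground truth lines and 6482 detected lines). The function name is hypothetical, and the zero-denominator guard is an added convention rather than part of the paper.

```python
def detection_metrics(n_correct, n_ground_truth, n_detected):
    """Return (DR, RA, FM) as fractions, following Equations (18)-(20)."""
    dr = n_correct / n_ground_truth          # Equation (18)
    ra = n_correct / n_detected              # Equation (19)
    # Guard the harmonic mean against DR = RA = 0.
    fm = 0.0 if dr + ra == 0 else 2 * dr * ra / (dr + ra)   # Equation (20)
    return dr, ra, fm


dr, ra, fm = detection_metrics(n_correct=6443, n_ground_truth=7535, n_detected=6482)
print(f"DR={dr:.2%}  RA={ra:.2%}  FM={fm:.2%}")
```

The recomputed values should agree with Table 3 to within rounding of the last reported digit, since the table entries appear to be rounded or truncated.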

The proposed method is compared with language-independent text line extraction algorithms, namely A* Path Planning [33] and the piecewise painting algorithm [34]. The accuracy of these text line extraction algorithms on the newly developed Malayalam handwritten document image database LIPI is displayed in Table 4. In total, 378 document images from the LIPI database are selected to perform the experiment. The proposed method extracts 4912 of the 5599 text lines perfectly, whereas the A* Path Planning and piecewise painting algorithms extract 3258 and 1495 text lines, respectively. The proposed method therefore shows a higher accuracy of 87.7%, compared to 58.19% and 26.7% for the A* Path Planning and piecewise painting algorithms, respectively. From the experiments conducted, it is observed that the piecewise painting algorithm incorrectly segments ascenders and descenders in the Malayalam language, the character ‘Chandrakkala’ and overlapping lines, which is the reason for its low accuracy of 26.7%. An interesting observation is that the A* Path Planning algorithm segments the text lines containing the character ‘Chandrakkala’ and the ascenders and descenders almost correctly. However, it fails to segment some of the text lines, so its accuracy is only 58.19%.

To perform the comparison, the database used is the newly created Malayalam handwritten document image database, LIPI. Since the proposed method is specific to the Malayalam language, it will not show good performance for databases of other languages. A database for Malayalam handwritten document images is not publicly available. Therefore, a comparison with another database in the Malayalam language is also not possible. However, the proposed technique is expected to perform well even if new Malayalam handwritten document images are added to the dataset.

The detection rate, recognition accuracy and F-measure of the proposed method, A* Path Planning and the piecewise painting algorithm are plotted in Figure 20, Figure 21 and Figure 22, respectively. The proposed method has the highest Detection Rate of 87.7%, while this value is 58.19% and 26.7% for the A* Path Planning and piecewise painting algorithms, respectively, as given in Figure 20. The A* Path Planning and piecewise painting algorithms have low detection rates for the document images in the newly developed LIPI database because these algorithms fail to segment some of the text lines. Moreover, the Recognition Accuracy and F-measure values are the highest for the proposed method, as depicted in Figure 21 and Figure 22, respectively. The Recognition Accuracy indicates what proportion of the detected lines are perfectly matched with the ground truth lines. As given in Figure 21, the piecewise painting algorithm has a recognition accuracy of only 33.94%. This is because the algorithm fails to correctly segment characters such as ‘Chandrakkala’ and the ascenders and descenders in the Malayalam language. The Recognition Accuracy obtained for A* Path Planning is 86.35%, which indicates that this language-independent algorithm can segment the ascenders, descenders and ‘Chandrakkala’ effectively.

A new database for Malayalam handwritten document images has been created in this work. To the best of our knowledge, a database for Malayalam handwritten document images is not available publicly. The language-specific text line extraction algorithm proposed in this work segments 85.507% of the text lines perfectly from the document images. Moreover, a total of 7535 ground truth images are created for the text lines in the document images, which are used to evaluate the method proposed in this paper. From the comparisons performed, it is observed that the proposed technique outperforms the language-independent A* Path Planning and piecewise painting algorithms.

5. Conclusions

The diverse handwriting styles of individual writers make text line extraction from handwritten documents a difficult task. In this paper, a novel method based on the size variations of the written alphabet due to different handwriting styles is proposed to extract the text lines from handwritten Malayalam documents. Various thresholds are derived by measuring the average height and width of the written characters in a document image, so these thresholds vary dynamically with each handwriting style. In the proposed technique, horizontal projection (HP) values are used to identify the positions at which to perform text line extraction. Two main problems are encountered when using horizontal projection values: extracted segments that contain multiple overlapping lines, and gaps between two characters when one character is written below another, which vary with handwriting style and cause such a line to be segmented into two separate lines. Both problems are addressed and solved effectively using the proposed method. Overall, 85.507% of the text lines in the newly created LIPI database of Malayalam handwritten document images are extracted so that they perfectly match the ground truth lines when evaluated using the MatchScore metric. Moreover, the technique proposed in this paper outperforms language-independent text line extraction algorithms like A* Path Planning and the piecewise painting algorithm on the LIPI database. Due to the unavailability of a Malayalam handwritten document image database, a new database of 402 images is created and is named LIPI. Another major contribution is the ground truth images created for the 7535 text lines in the document images. The proposed method is an initial step in digitizing Malayalam handwritten documents, which will be highly beneficial in enabling individuals to share handwritten documents in their local language.

Author Contributions

Conceptualization, P.P.V. and D.S.; Data curation, P.P.V. and D.S.; Formal analysis, P.P.V. and D.S.; Investigation, P.P.V. and D.S.; Methodology, P.P.V. and D.S.; Project administration, P.P.V. and D.S.; Resources, P.P.V. and D.S.; Software, P.P.V.; Supervision, D.S.; Validation, P.P.V. and D.S.; Visualization, P.P.V. and D.S.; Writing—original draft, P.P.V. and D.S.; Writing—review and editing, P.P.V. and D.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The database presented in this work is available at https://github.com/pearlsypv/LIPI-Database, accessed on 20 August 2023.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Liritzis, I.; Iliopoulos, I.; Andronache, I.; Kokkaliari, M.; Xanthopoulou, V. Novel Archaeometrical and Historical Transdisciplinary Investigation of Early 19th Century Hellenic Manuscript Regarding Initiation to Secret “Philike Hetaireia”. Mediterr. Archaeol. Archaeom. 2023, 23, 135–164.
2. Andronache, I.; Liritzis, I.; Jelinek, H.F. Fractal Algorithms and RGB Image Processing in Scribal and Ink Identification on an 1819 Secret Initiation Manuscript to the “Philike Hetaereia”. Sci. Rep. 2023, 13, 1735.
3. Srihari, S.N.; Yang, X.; Ball, G.R. Offline Chinese Handwriting Recognition: An Assessment of Current Technology. Front. Comput. Sci. China 2007, 1, 137–155.
4. Memon, J.; Sami, M.; Khan, R.A.; Uddin, M. Handwritten Optical Character Recognition (OCR): A Comprehensive Systematic Literature Review (SLR). IEEE Access 2020, 8, 142642–142668.
5. Likforman-Sulem, L.; Zahour, A.; Taconet, B. Text Line Segmentation of Historical Documents: A Survey. Int. J. Doc. Anal. Recognit. 2007, 9, 123–138.
6. Khandelwal, A.; Choudhury, P.; Sarkar, R.; Basu, S.; Nasipuri, M.; Das, N. Text Line Segmentation for Unconstrained Handwritten Document Images Using Neighborhood Connected Component Analysis. In Pattern Recognition and Machine Intelligence; Chaudhury, S., Mitra, S., Murthy, C.A., Sastry, P.S., Pal, S.K., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2009; Volume 5909, pp. 369–374. ISBN 978-3-642-11163-1.
7. Louloudis, G.; Gatos, B.; Halatsis, C. Text Line Detection in Unconstrained Handwritten Documents Using a Block-Based Hough Transform Approach. In Proceedings of the Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), Parana, Brazil, 23–26 September 2007; Volume 2, pp. 599–603.
8. Lee, S.-W. Advances in Handwriting Recognition; Series in Machine Perception and Artificial Intelligence; World Scientific: Singapore, 1999; Volume 34, ISBN 978-981-02-3715-8.
9. Souhar, A.; Boulid, Y.; Ameur, E.; Ouagague, M. Segmentation of Arabic Handwritten Documents into Text Lines Using Watershed Transform. Int. J. Interact. Multimed. Artif. Intell. 2017, 4, 96.
10. Barakat, B.; Droby, A.; Kassis, M.; El-Sana, J. Text Line Segmentation for Challenging Handwritten Document Images Using Fully Convolutional Network. arXiv 2021, arXiv:2101.08299.
11. Kundu, S.; Paul, S.; Kumar Bera, S.; Abraham, A.; Sarkar, R. Text-Line Extraction from Handwritten Document Images Using GAN. Expert Syst. Appl. 2020, 140, 112916.
12. Barakat, B.K.; Droby, A.; Alaasam, R.; Madi, B.; Rabaev, I.; Shammes, R.; El-Sana, J. Unsupervised Deep Learning for Text Line Segmentation. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 2304–2311.
13. Kurar Barakat, B.; Cohen, R.; Droby, A.; Rabaev, I.; El-Sana, J. Learning-Free Text Line Segmentation for Historical Handwritten Documents. Appl. Sci. 2020, 10, 8276.
14. Tripathy, N.; Pal, U. Handwriting Segmentation of Unconstrained Oriya Text. In Proceedings of the Ninth International Workshop on Frontiers in Handwriting Recognition, Tokyo, Japan, 26–29 October 2004; pp. 306–311.
15. Pal, U.; Datta, S. Segmentation of Bangla Unconstrained Handwritten Text. In Proceedings of the Seventh International Conference on Document Analysis and Recognition, Edinburgh, UK, 3–6 August 2003; Volume 1, pp. 1128–1132.
16. Mamatha, H.R.; Srikantamurthy, K. Morphological Operations and Projection Profiles Based Segmentation of Handwritten Kannada Document. Int. J. Appl. Inf. Syst. 2012, 4, 13–19.
17. Kannan, B.; Jomy, J.; Pramod, K.V. A System for Offline Recognition of Handwritten Characters in Malayalam Script. Int. J. Image Graph. Signal Process. 2013, 5, 53–59.
18. Rahiman, M.A.; Rajasree, M.S.; Masha, N.; Rema, M.; Meenakshi, R.; Kumar, G.M. Recognition of Handwritten Malayalam Characters Using Vertical & Horizontal Line Positional Analyzer Algorithm. In Proceedings of the 2011 3rd International Conference on Electronics Computer Technology, Kanyakumari, India, 8–10 April 2011; pp. 268–274.
19. John, J.; Pramod, K.V.; Balakrishnan, K. Offline Handwritten Malayalam Character Recognition Based on Chain Code Histogram. In Proceedings of the 2011 International Conference on Emerging Trends in Electrical and Computer Technology, Nagercoil, India, 23–24 March 2011; pp. 736–741.
20. Gayathri, P.; Ayyappan, S. Off-Line Handwritten Character Recognition Using Hidden Markov Model. In Proceedings of the 2014 International Conference on Advances in Computing, Communications and Informatics (ICACCI), New Delhi, India, 24–27 September 2014; pp. 518–523.
21. Jino, P.J.; John, J.; Balakrishnan, K. Offline Handwritten Malayalam Character Recognition Using Stacked LSTM. In Proceedings of the 2017 International Conference on Intelligent Computing, Instrumentation and Control Technologies (ICICICT), Kannur, India, 6–7 July 2017; pp. 1587–1590.
22. Raju, G. Recognition of Unconstrained Handwritten Malayalam Characters Using Zero-Crossing of Wavelet Coefficients. In Proceedings of the 2006 International Conference on Advanced Computing and Communications, Mangalore, India, 20–23 December 2006; pp. 217–221.
23. John, R.; Raju, G.; Guru, D.S. 1D Wavelet Transform of Projection Profiles for Isolated Handwritten Malayalam Character Recognition. In Proceedings of the International Conference on Computational Intelligence and Multimedia Applications (ICCIMA 2007), Sivakasi, India, 13–15 December 2007; pp. 481–485.
24. Manjusha, K.; Kumar, M.A.; Soman, K.P. On Developing Handwritten Character Image Database for Malayalam Language Script. Eng. Sci. Technol. Int. J. 2019, 22, 637–645.
25. Optical Character Recognition. Available online: https://ocr.smc.org.in/ (accessed on 12 July 2023).
26. Malayalam Typing Utility. Available online: https://kuttipencil.in/ (accessed on 12 July 2023).
27. OCR for Indian Languages. Available online: https://ocr.tdil-dc.gov.in/ (accessed on 16 September 2021).
28. Shanjana, C.; James, A. Offline Recognition of Malayalam Handwritten Text. Procedia Technol. 2015, 19, 772–779.
29. Gonzales, R.C.; Wintz, P. Digital Image Processing; Wesley Longman Publishing Co., Inc.: Boston, MA, USA, 1987.
30. Marana, A.N.; Da Fontoura Costa, L.; Lotufo, R.A.; Velastin, S.A. Estimating Crowd Density with Minkowski Fractal Dimension. In Proceedings of the 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing, Phoenix, AZ, USA, 15–19 March 1999; Volume 6, pp. 3521–3524.
31. Gatos, B.; Stamatopoulos, N.; Louloudis, G. ICDAR2009 Handwriting Segmentation Contest. Int. J. Doc. Anal. Recognit. 2011, 14, 25–33.
32. Papavassiliou, V.; Stafylakis, T.; Katsouros, V.; Carayannis, G. Handwritten Document Image Segmentation into Text Lines and Words. Pattern Recognit. 2010, 43, 369–377.
33. Surinta, O.; Holtkamp, M.; Karabaa, F.; Oosten, J.-P.V.; Schomaker, L.; Wiering, M. A Path Planning for Line Segmentation of Handwritten Documents. In Proceedings of the 2014 14th International Conference on Frontiers in Handwriting Recognition, Crete, Greece, 1–4 September 2014; pp. 175–180.
34. Alaei, A.; Pal, U.; Nagabhushan, P. A New Scheme for Unconstrained Handwritten Text-Line Segmentation. Pattern Recognit. 2011, 44, 917–928.

Figure 1. Sample of ascender, descender, Chandrakkala, and compound letters with varying gaps depending on handwriting in Malayalam.

Figure 2. Flow of the processes involved in text line extraction of Malayalam handwritten documents.

Figure 3. Segmentation of overlapping lines from each vertical stripe into individual lines.

Figure 4. Sample image of Malayalam handwritten document from LIPI database.

Figure 5. Sample image converted to binary.

Figure 6. (a) Ground truth of the first text line in the image shown in Figure 5; (b) ground truth of the fifth text line in the image shown in Figure 5.

Figure 7. Binarized handwritten document image divided into three vertical stripes.

Figure 8. Text lines extracted from vertical stripes.

Figure 9. Manually marked region of segmentation found using the proposed technique.

Figure 10. Result of applying the proposed method to segment overlapping lines.

Figure 11. Result after segmenting the overlapping lines detected in the vertical stripes.

Figure 12. Results after joining short lines to the appropriate text lines in the vertical stripe.

Figure 13. Short line containing character ‘Chandrakkala’ above the alphabet.

Figure 14. Result of applying proposed method to join short line containing ‘Chandrakkala’ back to the original line.

Figure 15. Short line containing the character written below an alphabet.

Figure 16. Result of applying proposed method to join short line back to the original line.

Figure 17. Text lines extracted from handwritten document image in Figure 4.

Figure 18. Minkowski dimension for words in 10 handwritten Malayalam pages.

Figure 19. Text line density in 15 Malayalam handwritten pages.

Figure 20. Detection Rate of proposed method, A* Path Planning and piecewise painting algorithm.

Figure 21. Recognition Accuracy of proposed method, A* Path Planning and piecewise painting algorithm.

Figure 22. F-measure of proposed method, A* Path Planning and piecewise painting algorithm.

Table 1. Overview of the database.

No. of Writers	Resolution of the Image	No. of Malayalam Handwritten Document Images	Average Number of Lines per Page	Total No. of Text Lines in the Document Images	No. of Ground Truth Images Created for the Text Lines Extracted from the Documents
200	2338 × 1654	402	18	7535	7535

Table 2. Accuracy of the proposed method in segmenting text lines, overlapping lines and short lines.

Type of Text Line	No. of Lines	No. of Correctly Segmented Lines	Accuracy (%)
Text lines	7535	6443	85.507
Overlapping lines	629	441	70.11
Short lines	2607	2577	98.85

Table 3. Evaluation of text line segmentation using the detection rate (DR), recognition accuracy (RA) and F-measure (FM).

No. of GT Lines	No. of Detected Lines	DR (%)	RA (%)	FM (%)
7535	6482	85.5	99.39	91.92

Table 4. Comparison of accuracy obtained for different existing text line extraction methods and the proposed method.

Sl. No.	Algorithm for Text Line Extraction	No. of Correctly Segmented Text Lines	Accuracy (%)
1	A* Path Planning	3258	58.19
2	Piecewise Painting	1495	26.7
3	Proposed Method	4912	87.7

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
