Parallel and Distributed Computing: Algorithms and Applications

A topical collection in Algorithms (ISSN 1999-4893). This collection belongs to the section "Parallel and Distributed Algorithms".

Viewed by 60924

Editors


Dr. Charalampos Konstantopoulos
Collection Editor
Department of Informatics, University of Piraeus, 185 34 Piraeus, Greece
Interests: design and analysis of algorithms; parallel and distributed computing; mobile ad hoc networks; sensor networks

Prof. Dr. Grammati Pantziou
Collection Editor
Department of Informatics and Computer Engineering, University of West Attica, 122 43 Athens, Greece
Interests: design of algorithms; parallel and distributed computing; pervasive computing; sensor networks; security and privacy issues in pervasive environments

Topical Collection Information

Dear Colleagues,

Parallel and distributed computing is now ubiquitous in nearly all computational scenarios, from mainstream computing to high-performance and distributed architectures such as cloud platforms and supercomputers. The ever-increasing complexity of parallel/distributed systems requires effective algorithmic techniques to unleash the enormous computational power of these systems and attain the performance that parallel/distributed computing promises. Moreover, the new possibilities offered by high-performance systems pave the way for a new genre of applications that were considered far-fetched only a short while ago.

This Topical Collection is focused on all algorithmic aspects of parallel and distributed computing and applications. Essentially, every scenario where multiple operations or tasks are executed at the same time is within the scope of this Topical Collection. Topics of interest include (but are not limited to) the following:

  • Theoretical aspects of parallel and distributed computing;
  • Design and analysis of parallel and distributed algorithms;
  • Algorithm engineering in parallel and distributed computing;
  • Load balancing and scheduling techniques;
  • Green computing;
  • Algorithms and applications for big data, machine learning and artificial intelligence;
  • Game-theoretic approaches in parallel and distributed computing;
  • Algorithms and applications on GPUs and multicore or manycore platforms;
  • Cloud computing, edge/fog computing, IoT and distributed computing;
  • Scientific computing;
  • Simulation and visualization;
  • Graph and irregular applications.

Dr. Charalampos Konstantopoulos
Prof. Dr. Grammati Pantziou
Collection Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the collection website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Algorithms is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Parallel algorithms 
  • Distributed algorithms 
  • GPUs 
  • Multicore and manycore architectures 
  • Supercomputing 
  • Data centers 
  • Big data 
  • Cloud architectures 
  • IoT

Published Papers (23 papers)

2024

26 pages, 3378 KiB  
Article
Parallel PSO for Efficient Neural Network Training Using GPGPU and Apache Spark in Edge Computing Sets
by Manuel I. Capel, Alberto Salguero-Hidalgo and Juan A. Holgado-Terriza
Algorithms 2024, 17(9), 378; https://doi.org/10.3390/a17090378 - 26 Aug 2024
Viewed by 1023
Abstract
The training phase of a deep learning neural network (DLNN) is a computationally demanding process, particularly for models comprising multiple layers of intermediate neurons. This paper presents a novel approach to accelerating DLNN training using the particle swarm optimisation (PSO) algorithm, which exploits the GPGPU architecture and the Apache Spark analytics engine for large-scale data processing tasks. PSO is a bio-inspired stochastic optimisation method whose objective is to iteratively enhance the solution to a (usually complex) problem by approximating a given objective. The expensive fitness evaluation and updating of particle positions can be supported more effectively by parallel processing. Nevertheless, the parallelisation of an efficient PSO is not a simple process due to the complexity of the computations performed on the swarm of particles and the iterative execution of the algorithm until a solution close to the objective with minimal error is achieved. In this study, two forms of parallelisation have been developed for the PSO algorithm, both of which are designed for execution in a distributed execution environment. The synchronous parallel PSO implementation guarantees consistency but may result in idle time due to global synchronisation. In contrast, the asynchronous parallel PSO approach reduces the necessity for global synchronisation, thereby improving execution time and making it more appropriate for large datasets and distributed environments such as Apache Spark. The two variants of PSO have been implemented with the objective of distributing the computational load supported by the algorithm across the different executor nodes of the Spark cluster to effectively achieve coarse-grained parallelism. The result is a significant performance improvement over current sequential variants of PSO. Full article
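
As a rough illustration of the coarse-grained pattern described above, the sketch below parallelises the fitness evaluations of one synchronous PSO iteration with a Python process pool. It is a minimal stand-alone sketch, not the authors' Spark/GPGPU implementation: the sphere function stands in for the DLNN training loss, and all parameter values are illustrative assumptions.

    import numpy as np
    from concurrent.futures import ProcessPoolExecutor

    def fitness(position):
        # Placeholder objective (sphere function); in the paper the objective is the DLNN training loss.
        return float(np.sum(position ** 2))

    def pso_step(positions, velocities, personal_best, personal_best_val, global_best,
                 w=0.7, c1=1.5, c2=1.5, pool=None):
        """One synchronous PSO iteration: evaluate all particles in parallel, then update."""
        # Synchronous variant: pool.map acts as a global barrier after the fitness evaluations.
        values = np.array(list(pool.map(fitness, positions)))
        improved = values < personal_best_val
        personal_best[improved] = positions[improved]
        personal_best_val[improved] = values[improved]
        global_best = personal_best[np.argmin(personal_best_val)]
        r1, r2 = np.random.rand(*positions.shape), np.random.rand(*positions.shape)
        velocities = (w * velocities
                      + c1 * r1 * (personal_best - positions)
                      + c2 * r2 * (global_best - positions))
        positions = positions + velocities
        return positions, velocities, personal_best, personal_best_val, global_best

    if __name__ == "__main__":
        n_particles, dim = 64, 10
        positions = np.random.uniform(-5, 5, (n_particles, dim))
        velocities = np.zeros_like(positions)
        personal_best = positions.copy()
        personal_best_val = np.full(n_particles, np.inf)
        global_best = positions[0]
        with ProcessPoolExecutor() as pool:
            for _ in range(50):
                (positions, velocities, personal_best,
                 personal_best_val, global_best) = pso_step(
                    positions, velocities, personal_best, personal_best_val,
                    global_best, pool=pool)
        print("best value:", fitness(global_best))

An asynchronous variant would instead submit evaluations and update the swarm as results arrive, trading the barrier for reduced idle time, as the abstract describes.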

23 pages, 5573 KiB  
Article
Research on Distributed Fault Diagnosis Model of Elevator Based on PCA-LSTM
by Chengming Chen, Xuejun Ren and Guoqing Cheng
Algorithms 2024, 17(6), 250; https://doi.org/10.3390/a17060250 - 7 Jun 2024
Viewed by 851
Abstract
A Distributed Elevator Fault Diagnosis System (DEFDS) is developed to tackle frequent malfunctions stemming from the widespread distribution and aging of elevator systems. Due to the complexity of elevator fault data and the subtlety of fault characteristics, traditional methods such as visual inspections and basic operational tests fall short in detecting early signs of mechanical wear and electrical issues. These conventional techniques often fail to recognize subtle fault characteristics, necessitating more advanced diagnostic tools. In response, this paper introduces a Principal Component Analysis–Long Short-Term Memory (PCA-LSTM) method for fault diagnosis. The distributed system decentralizes the fault diagnosis process to individual elevator units, utilizing PCA’s feature selection capabilities in high-dimensional spaces to extract and reduce the dimensionality of fault features. Subsequently, the LSTM model is employed for fault prediction. Elevator models within the system exchange data to refine and optimize a global prediction model. The efficacy of this approach is substantiated through empirical validation with actual data, achieving an accuracy rate of 90% and thereby confirming the method’s effectiveness in facilitating distributed elevator fault diagnosis. Full article
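
For readers unfamiliar with the PCA-LSTM pipeline, the following minimal sketch chains scikit-learn's PCA with a small Keras LSTM classifier. All shapes, layer sizes, and the randomly generated data are placeholder assumptions for illustration; the paper's DEFDS model, features, and training setup are not reproduced here.

    import numpy as np
    from sklearn.decomposition import PCA
    import tensorflow as tf

    # Hypothetical sensor data: (samples, timesteps, raw_features); labels are fault / no-fault.
    n_samples, timesteps, raw_features, n_components = 1000, 50, 40, 8
    X = np.random.rand(n_samples, timesteps, raw_features).astype("float32")
    y = np.random.randint(0, 2, n_samples)

    # PCA reduces each time step's feature vector; fit on the flattened (samples*timesteps) matrix.
    pca = PCA(n_components=n_components)
    X_reduced = pca.fit_transform(X.reshape(-1, raw_features)).reshape(n_samples, timesteps, n_components)

    # The LSTM consumes the reduced sequences and predicts the fault label.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(timesteps, n_components)),
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X_reduced, y, epochs=5, batch_size=32, validation_split=0.2)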

19 pages, 773 KiB  
Article
Distributed Data-Driven Learning-Based Optimal Dynamic Resource Allocation for Multi-RIS-Assisted Multi-User Ad-Hoc Network
by Yuzhu Zhang and Hao Xu
Algorithms 2024, 17(1), 45; https://doi.org/10.3390/a17010045 - 19 Jan 2024
Cited by 2 | Viewed by 2231
Abstract
This study investigates the problem of decentralized dynamic resource allocation optimization for ad-hoc network communication with the support of reconfigurable intelligent surfaces (RIS), leveraging a reinforcement learning framework. In the present context of cellular networks, device-to-device (D2D) communication stands out as a promising technique to enhance the spectrum efficiency. Simultaneously, RIS have gained considerable attention due to their ability to enhance the quality of dynamic wireless networks by maximizing the spectrum efficiency without increasing the power consumption. However, prevalent centralized D2D transmission schemes require global information, leading to a significant signaling overhead. Conversely, existing distributed schemes, while avoiding the need for global information, often demand frequent information exchange among D2D users, falling short of achieving global optimization. This paper introduces a framework comprising an outer loop and inner loop. In the outer loop, decentralized dynamic resource allocation optimization has been developed for self-organizing network communication aided by RIS. This is accomplished through the application of a multi-player multi-armed bandit approach, completing strategies for RIS and resource block selection. Notably, these strategies operate without requiring signal interaction during execution. Meanwhile, in the inner loop, the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm has been adopted for cooperative learning with neural networks (NNs) to obtain optimal transmit power control and RIS phase shift control for multiple users, with a specified RIS and resource block selection policy from the outer loop. Through the utilization of optimization theory, distributed optimal resource allocation can be attained as the outer and inner reinforcement learning algorithms converge over time. Finally, a series of numerical simulations are presented to validate and illustrate the effectiveness of the proposed scheme. Full article
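
The outer-loop idea of learning a RIS/resource-block selection without signalling can be illustrated with a basic bandit sketch. The epsilon-greedy learner below is a simplified single-user stand-in (the paper uses a multi-player multi-armed bandit and a TD3 inner loop); the arm count, reward model, and parameters are assumptions for illustration.

    import numpy as np

    class EpsilonGreedyBandit:
        """Per-user bandit over joint (RIS, resource block) arms; a simplified stand-in for the
        multi-player multi-armed bandit outer loop described in the abstract."""
        def __init__(self, n_arms, epsilon=0.1):
            self.epsilon = epsilon
            self.counts = np.zeros(n_arms)
            self.values = np.zeros(n_arms)   # running mean reward (e.g., achieved rate) per arm

        def select(self):
            if np.random.rand() < self.epsilon:
                return np.random.randint(len(self.counts))   # explore
            return int(np.argmax(self.values))                # exploit

        def update(self, arm, reward):
            self.counts[arm] += 1
            self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

    # Toy usage: 4 RIS x 8 resource blocks = 32 arms; the reward is a simulated spectral efficiency.
    bandit = EpsilonGreedyBandit(n_arms=4 * 8)
    true_means = np.random.rand(32)
    for t in range(5000):
        arm = bandit.select()
        reward = true_means[arm] + 0.05 * np.random.randn()
        bandit.update(arm, reward)
    print("best arm found:", int(np.argmax(bandit.values)), "true best:", int(np.argmax(true_means)))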

18 pages, 492 KiB  
Article
GPU Algorithms for Structured Sparse Matrix Multiplication with Diagonal Storage Schemes
by Sardar Anisul Haque, Mohammad Tanvir Parvez and Shahadat Hossain
Algorithms 2024, 17(1), 31; https://doi.org/10.3390/a17010031 - 12 Jan 2024
Viewed by 2316
Abstract
Matrix–matrix multiplication is of singular importance in linear algebra operations with a multitude of applications in scientific and engineering computing. Data structures for storing matrix elements are designed to minimize overhead information as well as to optimize the operation count. In this study, we utilize the notion of the compact diagonal storage method (CDM), which builds upon the previously developed diagonal storage—an orientation-independent uniform scheme to store the nonzero elements of a range of matrices. This study exploits both these storage schemes and presents efficient GPU-accelerated parallel implementations of matrix multiplication when the input matrices are banded and/or structured sparse. We exploit the data layouts in the diagonal storage schemes to expose a substantial amount of fine-grained parallelism and effectively utilize the GPU shared memory to improve the locality of data access for numerical calculations. Results from an extensive set of numerical experiments with the aforementioned types of matrices demonstrate orders-of-magnitude speedups compared with the sequential performance. Full article
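
To make the idea of diagonal storage concrete, the sketch below performs a matrix-vector product for a banded matrix stored by diagonals, using an assumed row-oriented layout data[k, j] = A[j, j + offsets[k]]. It is only a toy illustration of the storage concept; the paper's CDM layout and its GPU matrix-matrix kernels are considerably more involved.

    import numpy as np

    def dia_matvec(offsets, data, x):
        """y = A @ x for a matrix stored by diagonals.

        data[k, j] holds A[j, j + offsets[k]] (row-oriented diagonal storage);
        entries that fall outside the matrix are simply never referenced."""
        n = x.shape[0]
        y = np.zeros(n)
        for k, off in enumerate(offsets):
            rows = np.arange(max(0, -off), min(n, n - off))   # rows where column j + off is valid
            cols = rows + off
            y[rows] += data[k, rows] * x[cols]
        return y

    # Toy tridiagonal example: offsets -1, 0, +1.
    n = 6
    offsets = [-1, 0, 1]
    data = np.zeros((3, n))
    data[0, 1:] = -1.0       # sub-diagonal, stored at the row index of each element
    data[1, :] = 2.0         # main diagonal
    data[2, :n - 1] = -1.0   # super-diagonal
    x = np.ones(n)
    print(dia_matvec(offsets, data, x))   # -> [1. 0. 0. 0. 0. 1.]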

2023

27 pages, 2365 KiB  
Article
Finding Bottlenecks in Message Passing Interface Programs by Scalable Critical Path Analysis
by Vladimir Korkhov, Ivan Gankevich, Anton Gavrikov, Maria Mingazova, Ivan Petriakov, Dmitrii Tereshchenko, Artem Shatalin and Vitaly Slobodskoy
Algorithms 2023, 16(11), 505; https://doi.org/10.3390/a16110505 - 31 Oct 2023
Viewed by 1730
Abstract
Bottlenecks and imbalance in parallel programs can significantly affect performance of parallel execution. Finding these bottlenecks is a key issue in performance analysis of MPI programs especially on a large scale. One of the ways to discover bottlenecks is to analyze the critical path of the parallel program: the longest execution path in the program activity graph. There are a number of methods of finding the critical path; however, most of them suffer a performance drop when scaled. In this paper, we analyze several methods of critical path finding based on classical Dijkstra and Delta-stepping algorithms along with the proposed algorithm based on topological sorting. Corresponding algorithms for each approach are presented including additional enhancements for increasing performance. The implementation of the algorithms and resulting performance for several benchmark applications (NAS Parallel Benchmarks, CP2K, OpenFOAM, LAMMPS, and MiniFE) are analyzed and discussed. Full article
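
The core of the topological-sorting approach, finding the longest path in a program activity graph, can be sketched sequentially as follows. This is the textbook longest-path-in-a-DAG computation on a toy graph, not the paper's scalable implementation for MPI traces.

    from collections import defaultdict, deque

    def critical_path(n_nodes, edges):
        """Longest (critical) path in a DAG given weighted edges (u, v, w), via topological sorting.
        Returns the path length and the node sequence."""
        adj = defaultdict(list)
        indeg = [0] * n_nodes
        for u, v, w in edges:
            adj[u].append((v, w))
            indeg[v] += 1
        dist = [0.0] * n_nodes
        pred = [-1] * n_nodes
        queue = deque(i for i in range(n_nodes) if indeg[i] == 0)
        while queue:
            u = queue.popleft()
            for v, w in adj[u]:
                if dist[u] + w > dist[v]:      # relax along the longest path
                    dist[v] = dist[u] + w
                    pred[v] = u
                indeg[v] -= 1
                if indeg[v] == 0:
                    queue.append(v)
        end = max(range(n_nodes), key=lambda i: dist[i])
        path, node = [], end
        while node != -1:
            path.append(node)
            node = pred[node]
        return dist[end], path[::-1]

    # Toy program activity graph: nodes are events, weights are elapsed times between them.
    edges = [(0, 1, 3.0), (0, 2, 2.0), (1, 3, 4.0), (2, 3, 6.0), (3, 4, 1.0)]
    print(critical_path(5, edges))   # -> (9.0, [0, 2, 3, 4])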

19 pages, 1143 KiB  
Article
Algorithm for Enhancing Event Reconstruction Efficiency by Addressing False Track Filtering Issues in the SPD NICA Experiment
by Gulshat Amirkhanova, Madina Mansurova, Gennadii Ososkov, Nasurlla Burtebayev, Adai Shomanov and Murat Kunelbayev
Algorithms 2023, 16(7), 312; https://doi.org/10.3390/a16070312 - 22 Jun 2023
Viewed by 1239
Abstract
This paper introduces methods for parallelizing the algorithm to enhance the efficiency of event recovery in Spin Physics Detector (SPD) experiments at the Nuclotron-based Ion Collider Facility (NICA). The problem of eliminating false tracks during the particle trajectory detection process remains a crucial challenge in overcoming performance bottlenecks in processing collider data generated in high volumes and at a fast pace. In this paper, we propose and show fast parallel false track elimination methods based on the introduced criterion of a clustering-based thresholding approach with a chi-squared quality-of-fit metric. The proposed strategy achieves a good trade-off between the effectiveness of track reconstruction and the pace of execution on today’s advanced multicore computers. To facilitate this, a quality benchmark for reconstruction is established, using the root mean square (rms) error of spiral and polynomial fitting for the datasets identified as the subsequent track candidate by the neural network. Choosing the right benchmark enables us to maintain the recall and precision indicators of the neural network track recognition performance at a level that is satisfactory to physicists, even though these metrics will inevitably decline as the data noise increases. Moreover, it has been possible to improve the processing speed of the complete program pipeline by 6 times through parallelization of the algorithm, achieving a rate of 2000 events per second, even when handling extremely noisy input data. Full article

30 pages, 19621 KiB  
Article
Probability Density Estimation through Nonparametric Adaptive Partitioning and Stitching
by Zach D. Merino, Jenny Farmer and Donald J. Jacobs
Algorithms 2023, 16(7), 310; https://doi.org/10.3390/a16070310 - 21 Jun 2023
Cited by 1 | Viewed by 1709
Abstract
We present a novel nonparametric adaptive partitioning and stitching (NAPS) algorithm to estimate a probability density function (PDF) of a single variable. Sampled data is partitioned into blocks using a branching tree algorithm that minimizes deviations from a uniform density within blocks of various sample sizes arranged in a staggered format. The block sizes are constructed to balance the load in parallel computing as the PDF for each block is independently estimated using the nonparametric maximum entropy method (NMEM) previously developed for automated high throughput analysis. Once all block PDFs are calculated, they are stitched together to provide a smooth estimate throughout the sample range. Each stitch is an averaging process over weight factors based on the estimated cumulative distribution function (CDF) and a complementary CDF that characterize how data from flanking blocks overlap. Benchmarks on synthetic data show that our PDF estimates are fast and accurate for sample sizes ranging from 2⁹ to 2²⁷, across a diverse set of distributions that account for single and multi-modal distributions with heavy tails or singularities. We also generate estimates by replacing NMEM with kernel density estimation (KDE) within blocks. Our results indicate that NAPS(NMEM) is the best-performing method overall, while NAPS(KDE) improves estimates near boundaries compared to standard KDE. Full article

10 pages, 327 KiB  
Article
A Multithreaded Algorithm for the Computation of Sample Entropy
by George Manis, Dimitrios Bakalis and Roberto Sassi
Algorithms 2023, 16(6), 299; https://doi.org/10.3390/a16060299 - 15 Jun 2023
Cited by 2 | Viewed by 1419
Abstract
Many popular entropy definitions for signals, including approximate and sample entropy, are based on the idea of embedding the time series into an m-dimensional space, aiming to detect complex, deeper and more informative relationships among samples. However, for both approximate and sample entropy, the high computational cost is a severe limitation. Especially when large amounts of data are processed, or when parameter tuning is employed, requiring a large number of executions, the necessity of fast computation algorithms becomes urgent. In the past, our research team proposed fast algorithms for sample, approximate and bubble entropy. In the general case, the bucket-assisted algorithm was the one presenting the lowest execution times. In this paper, we exploit the opportunities given by multithreading technology to further reduce the computation time. Without special requirements in hardware, since today even cost-effective home computers support multithreading, the computation of entropy definitions can be significantly accelerated. The aim of this paper is threefold: (a) to extend the bucket-assisted algorithm for multithreaded processors, (b) to present updated execution times for the bucket-assisted algorithm since advances in hardware and compiler technology affect both execution times and gain, and (c) to provide a Python library which wraps fast C implementations capable of running in parallel on multithreaded processors. Full article
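
For orientation, the sketch below computes sample entropy by brute force and distributes the template comparisons across worker processes with concurrent.futures. It illustrates only the general parallelisation pattern; the paper's bucket-assisted algorithm avoids the quadratic comparison cost shown here.

    import numpy as np
    from concurrent.futures import ProcessPoolExecutor
    from functools import partial
    from math import log

    def _partial_counts(i_chunk, x, m, r):
        """Count m- and (m+1)-length template matches for templates starting in i_chunk."""
        n_templates = len(x) - m
        B = A = 0
        for i in i_chunk:
            j = np.arange(i + 1, n_templates)
            if j.size == 0:
                continue
            # Chebyshev distance between template i and all later templates, vectorised over j.
            diff_m = np.max(np.abs(x[j[:, None] + np.arange(m)] - x[i:i + m]), axis=1)
            match_m = diff_m <= r
            B += int(match_m.sum())
            A += int((match_m & (np.abs(x[j + m] - x[i + m]) <= r)).sum())
        return B, A

    def sample_entropy_parallel(x, m=2, r=None, workers=4):
        x = np.asarray(x, dtype=float)
        if r is None:
            r = 0.2 * x.std()
        chunks = np.array_split(np.arange(len(x) - m), workers)
        with ProcessPoolExecutor(max_workers=workers) as pool:
            results = pool.map(partial(_partial_counts, x=x, m=m, r=r), chunks)
        B = A = 0
        for b, a in results:
            B += b
            A += a
        return -log(A / B)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        signal = rng.standard_normal(2000)
        print(sample_entropy_parallel(signal, m=2))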

16 pages, 1679 KiB  
Article
Fully Parallel Homological Region Adjacency Graph via Frontier Recognition
by Fernando Díaz-del-Río, Pablo Sanchez-Cuevas, María José Moron-Fernández, Daniel Cascado-Caballero, Helena Molina-Abril and Pedro Real
Algorithms 2023, 16(6), 284; https://doi.org/10.3390/a16060284 - 31 May 2023
Viewed by 1753
Abstract
Relating image contours and regions and their attributes according to connectivity based on incidence or adjacency is a crucial task in numerous applications in the fields of image processing, computer vision and pattern recognition. In this paper, the crucial incidence topological information of 2-dimensional images is extracted in an efficient manner through the computation of a new structure called the HomDuRAG of an image; that is, the dual graph of the HomRAG (a topologically consistent extended version of the classical RAG). These representations are derived from the two traditional self-dual square grids (in which physical pixels play the role of 2-dimensional cells) and encapsulate the whole set of topological features and relations between the three types of objects embedded in a digital image: 2-dimensional (regions), 1-dimensional (contours) and 0-dimensional objects (crosses). Here, a first version of a fully parallel algorithm to compute this new representation is presented, whose timing complexity order (in the worst case and supposing one processing element per 0-cell) is O(log(M×N)), with M and N being the height and width of the image. Efficient implementations of this parallel algorithm would allow images to be processed in real time, as well as permit us to uncover fast algorithms for contour detection and segmentation, opening new perspectives within the image processing field. Full article

27 pages, 958 KiB  
Article
Parallel Algorithm for Solving Overdetermined Systems of Linear Equations, Taking into Account Round-Off Errors
by Dmitry Lukyanenko
Algorithms 2023, 16(5), 242; https://doi.org/10.3390/a16050242 - 7 May 2023
Cited by 5 | Viewed by 2678
Abstract
The paper proposes a parallel algorithm for solving large overdetermined systems of linear algebraic equations with a dense matrix. This algorithm is based on the use of a modification of the conjugate gradient method, which is able to take into account rounding errors accumulated during calculations when making a decision to terminate the iterative process. The parallel algorithm is constructed in such a way that it takes into account the capabilities of the message passing interface (MPI) parallel programming technology, which is used for the software implementation of the proposed algorithm. The programming examples are shown using the Python programming language and the mpi4py package, but all programs are built in such a way that they can be easily rewritten using the C/C++/Fortran programming languages. The advantage of using the modern MPI-4.0 standard is demonstrated. Full article
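
A minimal flavour of the distributed building blocks can be given with mpi4py: the sketch below runs CGLS (conjugate gradients on the normal equations) with the rows of the matrix distributed across MPI ranks and global reductions for the inner products. The stopping test here is a naive tolerance, not the round-off-aware criterion proposed in the paper, and the test system is synthetic.

    # Run with: mpiexec -n 4 python cgls_mpi.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Hypothetical overdetermined system: each rank owns a block of rows of A and b.
    m_local, n = 500, 50
    rng = np.random.default_rng(rank)
    A_loc = rng.standard_normal((m_local, n))
    x_true = np.arange(1, n + 1, dtype=float)
    b_loc = A_loc @ x_true

    def allreduce_vec(v):
        out = np.empty_like(v)
        comm.Allreduce(v, out, op=MPI.SUM)
        return out

    # CGLS: conjugate gradients applied to the normal equations A^T A x = A^T b.
    x = np.zeros(n)
    r_loc = b_loc - A_loc @ x                 # local residual rows
    s = allreduce_vec(A_loc.T @ r_loc)        # s = A^T r (global)
    p, gamma = s.copy(), float(s @ s)
    for it in range(200):
        q_loc = A_loc @ p
        q_norm2 = comm.allreduce(float(q_loc @ q_loc), op=MPI.SUM)
        alpha = gamma / q_norm2
        x += alpha * p
        r_loc -= alpha * q_loc
        s = allreduce_vec(A_loc.T @ r_loc)
        gamma_new = float(s @ s)
        if np.sqrt(gamma_new) < 1e-10:        # naive tolerance; the paper derives a round-off-aware test
            break
        p = s + (gamma_new / gamma) * p
        gamma = gamma_new

    if rank == 0:
        print("iterations:", it + 1, "error:", np.linalg.norm(x - x_true))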

19 pages, 1208 KiB  
Article
Asynchronous Gathering in a Dangerous Ring
by Stefan Dobrev, Paola Flocchini, Giuseppe Prencipe and Nicola Santoro
Algorithms 2023, 16(5), 222; https://doi.org/10.3390/a16050222 - 26 Apr 2023
Cited by 2 | Viewed by 1394
Abstract
Consider a set of k identical asynchronous mobile agents located in an anonymous ring of n nodes. The classical Gather (or Rendezvous) problem requires all agents to meet at the same node, not a priori decided, within a finite amount of time. This problem has been studied assuming that the network is safe for the agents. In this paper, we consider the presence in the ring of a stationary process located at a node that disables any incoming agent without leaving any trace. Such a dangerous node is known in the literature as a black hole, and the determination of its location has been extensively investigated. The presence of the black hole makes it deterministically unfeasible for all agents to gather. So, the research concern is to determine how many agents can gather and under what conditions. In this paper we establish a complete characterization of the conditions under which the problem can be solved. In particular, we determine the maximum number of agents that can be guaranteed to gather in the same location depending on whether k or n is unknown (at least one must be known). These results are tight: in each case, gathering with one more agent is deterministically unfeasible. All our possibility proofs are constructive: we provide mobile agent algorithms that allow the agents to gather within a predefined distance under the specified conditions. The analysis of the time costs of these algorithms shows that they are optimal. Our gathering algorithm for the case of unknown k is also a solution for the black hole location problem. Interestingly, its bounded time complexity is Θ(n); this is a significant improvement over the existing O(n log n) bounded time complexity. Full article

2022

21 pages, 653 KiB  
Article
A Methodology to Design Quantized Deep Neural Networks for Automatic Modulation Recognition
by David Góez, Paola Soto, Steven Latré, Natalia Gaviria and Miguel Camelo
Algorithms 2022, 15(12), 441; https://doi.org/10.3390/a15120441 - 22 Nov 2022
Cited by 4 | Viewed by 2309
Abstract
Next-generation communication systems will face new challenges related to efficiently managing the available resources, such as the radio spectrum. DL is one of the optimization approaches to address and solve these challenges. However, there is a gap between research and industry. Most AI models that solve communication problems cannot be implemented in current communication devices due to their high computational capacity requirements. New approaches seek to reduce the size of DL models through quantization techniques, changing the traditional method of operations from a 32-bit (or 64-bit) floating-point representation to a (usually small) fixed-point one. However, there is no analytical method to determine the level of quantization that can be used to obtain the best trade-off between the reduction of computational costs and an acceptable accuracy in a specific problem. In this work, we propose an analysis methodology to determine the degree of quantization in a DNN model to solve the problem of AMR in a radio system. We use the Brevitas framework to build and analyze different quantized variants of the DL architecture VGG10 adapted to the AMR problem. The evaluation of the computational cost is performed with the FINN framework of Xilinx Research Labs to obtain the computational inference cost. The proposed design methodology allows us to obtain the combination of quantization bits per layer that provides an optimal trade-off between the model performance (i.e., accuracy) and the model complexity (i.e., size) according to a set of weights associated with each optimization objective. For example, using the proposed methodology, we found a model architecture that reduced the model size by 75.8% compared to the non-quantized baseline model, with a performance degradation of only 0.06%. Full article

28 pages, 699 KiB  
Article
Recent Developments in Low-Power AI Accelerators: A Survey
by Christoffer Åleskog, Håkan Grahn and Anton Borg
Algorithms 2022, 15(11), 419; https://doi.org/10.3390/a15110419 - 8 Nov 2022
Cited by 9 | Viewed by 7153
Abstract
As machine learning and AI continue to rapidly develop, and with the ever-closer end of Moore’s law, new avenues and novel ideas in architecture design are being created and utilized. One avenue is accelerating AI as close to the user as possible, i.e., at the edge, to reduce latency and increase performance. Therefore, researchers have developed low-power AI accelerators, designed specifically to accelerate machine learning and AI at edge devices. In this paper, we present an overview of low-power AI accelerators published between 2019 and 2022. Low-power AI accelerators are defined in this paper based on their acceleration target and power consumption. In this survey, 79 low-power AI accelerators are presented and discussed. The reviewed accelerators are discussed based on five criteria: (i) power, performance, and power efficiency, (ii) acceleration targets, (iii) arithmetic precision, (iv) neuromorphic accelerators, and (v) industry vs. academic accelerators. CNNs and DNNs are the most popular accelerator targets, while Transformers and SNNs are on the rise. Full article

25 pages, 751 KiB  
Article
Modeling Different Deployment Variants of a Composite Application in a Single Declarative Deployment Model
by Miles Stötzner, Steffen Becker, Uwe Breitenbücher, Kálmán Képes and Frank Leymann
Algorithms 2022, 15(10), 382; https://doi.org/10.3390/a15100382 - 19 Oct 2022
Cited by 5 | Viewed by 2138
Abstract
For automating the deployment of composite applications, typically, declarative deployment models are used. Depending on the context, the deployment of an application has to fulfill different requirements, such as costs and elasticity. As a consequence, one and the same application, i.e., its components and their dependencies, often needs to be deployed in different variants. If each different variant of a deployment is described using an individual deployment model, this quickly results in a large number of models, which are error-prone to maintain. Deployment technologies, such as Terraform or Ansible, support conditional components and dependencies which allow modeling different deployment variants of a composite application in a single deployment model. However, there are deployment technologies, such as TOSCA and Docker Compose, which do not support such conditional elements. To address this, we extend the Essential Deployment Metamodel (EDMM) by conditional components and dependencies. EDMM is a declarative deployment model which can be mapped to several deployment technologies including Terraform, Ansible, TOSCA, and Docker Compose. Preprocessing such an extended model, i.e., evaluating conditional elements and either preserving or removing them, generates an EDMM-conformant model. As a result, conditional elements can be integrated on top of existing deployment technologies that are unaware of such concepts. We evaluate this by implementing a preprocessor for TOSCA, called OpenTOSCA Vintner, which employs the open-source TOSCA orchestrators xOpera and Unfurl to execute the generated TOSCA-conformant models. Full article
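
The preprocessing idea, evaluating conditions against variant inputs and pruning the model, can be sketched in a few lines. The model layout, condition syntax, and helper names below are assumptions for illustration only and do not follow the EDMM or OpenTOSCA Vintner formats.

    # Minimal sketch of conditional-element preprocessing under an assumed model layout:
    # elements carrying a "condition" are evaluated against variant inputs and either kept or removed.
    def preprocess(model, inputs):
        def keep(element):
            cond = element.get("condition")
            # eval() is acceptable only for this toy sketch; a real preprocessor would parse conditions.
            return cond is None or bool(eval(cond, {}, inputs))

        components = {name: c for name, c in model["components"].items() if keep(c)}
        relations = [r for r in model["relations"]
                     if keep(r) and r["source"] in components and r["target"] in components]
        return {"components": components, "relations": relations}

    model = {
        "components": {
            "shop": {"type": "web_app"},
            "static_vm": {"type": "vm", "condition": "elastic == False"},
            "k8s_cluster": {"type": "kubernetes", "condition": "elastic == True"},
        },
        "relations": [
            {"source": "shop", "target": "static_vm", "condition": "elastic == False"},
            {"source": "shop", "target": "k8s_cluster", "condition": "elastic == True"},
        ],
    }
    print(preprocess(model, {"elastic": True}))   # only the elastic (Kubernetes) variant remains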

30 pages, 1418 KiB  
Article
A Dynamic Distributed Deterministic Load-Balancer for Decentralized Hierarchical Infrastructures
by Spyros Sioutas, Efrosini Sourla, Kostas Tsichlas, Gerasimos Vonitsanos and Christos Zaroliagis
Algorithms 2022, 15(3), 96; https://doi.org/10.3390/a15030096 - 18 Mar 2022
Cited by 2 | Viewed by 2391
Abstract
In this work, we propose D3-Tree, a dynamic distributed deterministic structure for data management in decentralized networks, by engineering and extending an existing decentralized structure. Conducting an extensive experimental study, we verify that the implemented structure outperforms other well-known hierarchical tree-based structures since it provides better complexities regarding load-balancing operations. More specifically, the structure achieves an O(log N) amortized bound (N is the number of nodes present in the network), using an efficient deterministic load-balancing mechanism, which is general enough to be applied to other hierarchical tree-based structures. Moreover, our structure achieves O(log N) worst-case search performance. Last but not least, we investigate the structure’s fault tolerance, which has not been sufficiently tackled in previous work, both theoretically and through rigorous experimentation. We prove that D3-Tree is highly fault-tolerant and achieves O(log N) amortized search cost under massive node failures, accompanied by a significant success rate. Afterwards, by incorporating this novel balancing scheme into the ART (Autonomous Range Tree) structure, we go one step further to achieve sub-logarithmic complexity and propose the ART+ structure. ART+ achieves an O(log_b² log N) communication cost for query and update operations (b is a double-exponential power of 2 and N is the total number of nodes). Moreover, ART+ is a fully dynamic and fault-tolerant structure, which supports the join/leave node operations in O(log log N) expected number of hops WHP (with high probability) and performs load-balancing in O(log log N) amortized cost. Full article

17 pages, 432 KiB  
Article
Tries-Based Parallel Solutions for Generating Perfect Crosswords Grids
by Virginia Niculescu and Robert Manuel Ştefănică
Algorithms 2022, 15(1), 22; https://doi.org/10.3390/a15010022 - 13 Jan 2022
Cited by 2 | Viewed by 3142
Abstract
A general crossword grid generation is considered an NP-complete problem and theoretically it could be a good candidate to be used by cryptography algorithms. In this article, we propose a new algorithm for generating perfect crossword grids (with no black boxes) that relies on trie data structures, which greatly reduce the time needed to find solutions and also offer good opportunities for parallelisation. The algorithm uses a special trie representation and it is very efficient, but through parallelisation the performance is improved to a level that allows the solution to be obtained extremely fast. The experiments were conducted using a dictionary of almost 700,000 words, and the solutions were obtained using the parallelised version with an execution time in the order of minutes. We demonstrate here that a perfect crossword grid can be found faster than previously estimated, if we use tries as supporting data structures together with parallelisation. Still, if the size of the dictionary is increased substantially (e.g., considering a set of dictionaries for different languages, not only for one), or through a generalisation to a 3D space or multidimensional spaces, then the problem could still be investigated for a possible usage in cryptography. Full article
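
Since the approach hinges on tries, the following sketch shows the prefix-pruning primitive a crossword generator relies on: a partial row or column is only extended while its prefix exists in the trie. The tiny dictionary and class layout are illustrative; the paper's special trie representation and its parallel search are not reproduced.

    class TrieNode:
        __slots__ = ("children", "is_word")
        def __init__(self):
            self.children = {}
            self.is_word = False

    class Trie:
        """Prefix tree used to prune partial words as soon as no dictionary word can extend them."""
        def __init__(self, words=()):
            self.root = TrieNode()
            for w in words:
                self.insert(w)

        def insert(self, word):
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.is_word = True

        def has_prefix(self, prefix):
            node = self.root
            for ch in prefix:
                node = node.children.get(ch)
                if node is None:
                    return False
            return True

        def is_word(self, word):
            node = self.root
            for ch in word:
                node = node.children.get(ch)
                if node is None:
                    return False
            return node.is_word

    dictionary = Trie(["care", "cart", "area", "rear", "tear", "cat"])
    print(dictionary.has_prefix("ca"), dictionary.has_prefix("cz"), dictionary.is_word("cart"))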

2021

13 pages, 1466 KiB  
Article
Parallel Computing of Edwards–Anderson Model
by Mikhail Alexandrovich Padalko, Yuriy Andreevich Shevchenko, Vitalii Yurievich Kapitan and Konstantin Valentinovich Nefedev
Algorithms 2022, 15(1), 13; https://doi.org/10.3390/a15010013 - 27 Dec 2021
Cited by 4 | Viewed by 2885
Abstract
A scheme for parallel computation of the two-dimensional Edwards–Anderson model based on the transfer matrix approach is proposed. Free boundary conditions are considered. The method may find application in calculations related to spin glasses and in quantum simulators. Performance data are given. The scheme of parallelisation for various numbers of threads is tested. Application to a quantum computer simulator is considered in detail; in particular, a parallelisation scheme for the operation of a quantum computer simulator is described. Full article

21 pages, 811 KiB  
Article
An O(log₂ N) Fully-Balanced Resampling Algorithm for Particle Filters on Distributed Memory Architectures
by Alessandro Varsi, Simon Maskell and Paul G. Spirakis
Algorithms 2021, 14(12), 342; https://doi.org/10.3390/a14120342 - 26 Nov 2021
Cited by 8 | Viewed by 4081
Abstract
Resampling is a well-known statistical algorithm that is commonly applied in the context of Particle Filters (PFs) in order to perform state estimation for non-linear non-Gaussian dynamic models. As the models become more complex and accurate, the run-time of PF applications becomes increasingly slow. Parallel computing can help to address this. However, resampling (and, hence, PFs as well) necessarily involves a bottleneck, the redistribution step, which is notoriously challenging to parallelize if using textbook parallel computing techniques. A state-of-the-art redistribution takes O((log₂ N)²) computations on Distributed Memory (DM) architectures, which most supercomputers adopt, whereas redistribution can be performed in O(log₂ N) on Shared Memory (SM) architectures, such as GPU or mainstream CPUs. In this paper, we propose a novel parallel redistribution for DM that achieves an O(log₂ N) time complexity. We also present empirical results that indicate that our novel approach outperforms the O((log₂ N)²) approach. Full article
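
For context, the sketch below shows the standard sequential systematic resampling step that particle filters perform; the paper's contribution is carrying out the subsequent redistribution in O(log₂ N) on distributed memory, which is not shown here.

    import numpy as np

    def systematic_resample(particles, weights, rng=None):
        """Standard O(N) sequential systematic resampling.
        Returns the resampled particles with equal weights."""
        rng = np.random.default_rng() if rng is None else rng
        n = len(weights)
        positions = (rng.random() + np.arange(n)) / n          # evenly spaced, jittered positions
        cumulative = np.cumsum(weights)
        cumulative[-1] = 1.0                                   # guard against round-off
        indexes = np.searchsorted(cumulative, positions)
        return particles[indexes], np.full(n, 1.0 / n)

    particles = np.random.randn(8, 2)                          # 8 particles in a 2-D state space
    weights = np.random.rand(8); weights /= weights.sum()
    new_particles, new_weights = systematic_resample(particles, weights)
    print(new_particles.shape, new_weights)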

21 pages, 4019 KiB  
Article
Parallel Implementation of the Algorithm to Compute Forest Fire Impact on Infrastructure Facilities of JSC Russian Railways
by Nikolay Viktorovich Baranovskiy, Aleksey Podorovskiy and Aleksey Malinin
Algorithms 2021, 14(11), 333; https://doi.org/10.3390/a14110333 - 15 Nov 2021
Cited by 3 | Viewed by 2601
Abstract
Forest fires have a negative impact on the economy in a number of regions, especially in Wildland Urban Interface (WUI) areas. An important link in the fight against fires in WUI areas is the development of information and computer systems for predicting the fire safety of infrastructural facilities of Russian Railways. In this work, a numerical study of heat transfer processes in the enclosing structure of a wooden building near the forest fire front was carried out using the technology of parallel computing. The novelty of the development lies in the creation of original program code, which is planned to be put into operation either in the Information System for Remote Monitoring of Forest Fires ISDM-Rosleskhoz or in the information and computing system of JSC Russian Railways. In the Russian Federation, it is forbidden to use foreign systems in the security services of industrial facilities. The implementation of the deterministic model of heat transfer in the enclosing structure, with algorithmic complexity O(2N² + 2K), is presented. The program is implemented in Python 3.x using the NumPy and Concurrent libraries. Calculations were carried out on a multiprocessor cluster at the Sirius University of Science and Technology. The results of calculations and the acceleration coefficient for operating modes with 1, 2, 4, 8, 16, 32, 48 and 64 processes are presented. The developed algorithm can be applied to assess the fire safety of infrastructure facilities of Russian Railways. The main merit of the new development is the ability to use large computational domains with a large number of computational grid nodes in space and time. The use of caching of intermediate data in files made it possible to distribute a large number of computational nodes among the processors of the computing multiprocessor system. However, a drawback should also be noted: the acceleration of computational operations decreases when a large number of nodes of the multiprocessor computing system is involved, which is explained by the write and read cycles of the cache files. Full article
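
As a point of reference for the deterministic model, the sketch below advances a two-dimensional explicit finite-difference heat-conduction grid with a fire-facing boundary. Material properties, geometry, and boundary values are rough placeholder assumptions, not the parameters of the paper's model, and no multiprocessing is shown.

    import numpy as np

    def step(T, alpha, dx, dt):
        """One explicit update of dT/dt = alpha * (d²T/dx² + d²T/dy²) with fixed boundaries."""
        lap = (T[:-2, 1:-1] + T[2:, 1:-1] + T[1:-1, :-2] + T[1:-1, 2:]
               - 4 * T[1:-1, 1:-1]) / dx**2
        T_new = T.copy()
        T_new[1:-1, 1:-1] += alpha * dt * lap
        return T_new

    N, dx = 101, 0.01                       # grid nodes and spacing (m), placeholder values
    alpha = 1.5e-7                          # assumed thermal diffusivity of wood (approximate, m²/s)
    dt = 0.2 * dx**2 / alpha                # below the explicit stability limit dt <= dx²/(4*alpha)
    T = np.full((N, N), 300.0)              # ambient temperature, K
    T[:, 0] = 1200.0                        # fire-facing boundary, K
    for _ in range(1000):
        T = step(T, alpha, dx, dt)
    print("max interior temperature:", T[1:-1, 1:-1].max())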

24 pages, 1026 KiB  
Article
Load Balancing Strategies for Slice-Based Parallel Versions of JEM Video Encoder
by Héctor Migallón, Otoniel López-Granado, Miguel O. Martínez-Rach, Vicente Galiano and Manuel P. Malumbres
Algorithms 2021, 14(11), 320; https://doi.org/10.3390/a14110320 - 1 Nov 2021
Cited by 1 | Viewed by 1994
Abstract
The proportion of video traffic on the internet is expected to reach 82% by 2022, mainly due to the increasing number of consumers and the emergence of new video formats with more demanding features (depth, resolution, multiview, 360, etc.). Efforts are therefore being made to constantly improve video compression standards to minimize the necessary bandwidth while retaining high video quality levels. In this context, the Joint Collaborative Team on Video Coding has been analyzing new video coding technologies to improve the compression efficiency with respect to the HEVC video coding standard. A software package known as the Joint Exploration Test Model has been proposed to implement and evaluate new video coding tools. In this work, we present parallel versions of the JEM encoder that are particularly suited for shared memory platforms, and can significantly reduce its huge computational complexity. The proposed parallel algorithms are shown to achieve high levels of parallel efficiency. In particular, in the All Intra coding mode, the best of our proposed parallel versions achieves an average efficiency value of 93.4%. The proposed versions also achieve high levels of scalability, aided by the inclusion of an automatic load-balancing mechanism. Full article

10 pages, 284 KiB  
Article
A Parallel Algorithm for Dividing Octonions
by Aleksandr Cariow and Janusz P. Paplinski
Algorithms 2021, 14(11), 309; https://doi.org/10.3390/a14110309 - 24 Oct 2021
Viewed by 1942
Abstract
The article presents a parallel hardware-oriented algorithm designed to speed up the division of two octonions. The advantage of the proposed algorithm is that the number of real multiplications is halved as compared to the naive method for implementing this operation. In the synthesis of the discussed algorithm, the matrix representation of this operation was used, which allows us to present the division of octonions by means of a vector–matrix product. Taking into account a specific structure of the matrix multiplicand allows for reducing the number of real multiplications necessary for the execution of the octonion division procedure. Full article

24 pages, 1776 KiB  
Article
Rough Estimator Based Asynchronous Distributed Super Points Detection on High Speed Network Edge
by Jie Xu and Wei Ding
Algorithms 2021, 14(10), 277; https://doi.org/10.3390/a14100277 - 25 Sep 2021
Viewed by 1918
Abstract
Super points detection plays an important role in network research and application. With the increase of network scale, distributed super points detection has become a hot research topic. The key point of super points detection in a multi-node distributed environment is how to reduce communication overhead. Therefore, this paper proposes a three-stage communication algorithm to detect super points in a distributed environment, the Rough Estimator based Asynchronous Distributed super points detection algorithm (READ). READ uses a lightweight estimator, the Rough Estimator (RE), which is fast in computation and takes little memory, to generate candidate super points. Meanwhile, the well-known Linear Estimator (LE) is applied to accurately estimate the cardinality of each candidate super point, so as to detect the super points correctly. In READ, each node scans IP address pairs asynchronously. When reaching the time window boundary, READ starts three-stage communication to detect the super points. This paper proves that the accuracy of READ in a distributed environment is no less than that in a single-node environment. Four groups of 10 Gb/s and 40 Gb/s real-world high-speed network traffic are used to test READ. The experimental results show that READ not only has high accuracy in a distributed environment, but also incurs less than 5% of the communication burden of existing algorithms. Full article
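
The Linear Estimator mentioned above is a bitmap-based cardinality sketch; a toy version is given below. The hash, bitmap size, and IP generation are illustrative assumptions and do not reflect READ's actual data structures or its three-stage communication.

    import math
    import random

    class LinearEstimator:
        """Linear counting sketch: estimates the number of distinct items (e.g., distinct contacted IPs)
        from the fraction of zero bits in a bitmap; a toy stand-in for the LE used in READ."""
        def __init__(self, n_bits=1024, seed=0):
            self.n_bits = n_bits
            self.bits = bytearray(n_bits)
            self.seed = seed

        def add(self, item):
            h = hash((self.seed, item)) % self.n_bits
            self.bits[h] = 1

        def estimate(self):
            zeros = self.n_bits - sum(self.bits)
            zeros = max(zeros, 1)                       # avoid log(0) when the bitmap saturates
            return self.n_bits * math.log(self.n_bits / zeros)

    est = LinearEstimator()
    distinct = {f"10.0.{random.randrange(256)}.{random.randrange(256)}" for _ in range(600)}
    for ip in distinct:
        est.add(ip)
    print(len(distinct), round(est.estimate()))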

22 pages, 6397 KiB  
Article
Accelerating In-Transit Co-Processing for Scientific Simulations Using Region-Based Data-Driven Analysis
by Marcus Walldén, Masao Okita, Fumihiko Ino, Dimitris Drikakis and Ioannis Kokkinakis
Algorithms 2021, 14(5), 154; https://doi.org/10.3390/a14050154 - 12 May 2021
Viewed by 2430
Abstract
Increasing processing capabilities and input/output constraints of supercomputers have increased the use of co-processing approaches, i.e., visualizing and analyzing data sets of simulations on the fly. We present a method that evaluates the importance of different regions of simulation data and a data-driven approach that uses the proposed method to accelerate in-transit co-processing of large-scale simulations. We use the importance metrics to simultaneously employ multiple compression methods on different data regions to accelerate the in-transit co-processing. Our approach strives to adaptively compress data on the fly and uses load balancing to counteract memory imbalances. We demonstrate the method’s efficiency through a fluid mechanics application, a Richtmyer–Meshkov instability simulation, showing how to accelerate the in-transit co-processing of simulations. The results show that the proposed method can expeditiously identify regions of interest, even when using multiple metrics. Our approach achieved a speedup of 1.29× in a lossless scenario. The data decompression time was sped up by 2× compared to using a single compression method uniformly. Full article
