Article

Hardware-Based Implementation of Algorithms for Data Replacement in Cache Memory of Processor Cores

by Larysa Titarenko 1,2, Vyacheslav Kharchenko 3, Vadym Puidenko 3,4, Artem Perepelitsyn 3,* and Alexander Barkalov 1
1 Institute of Metrology, Electronics and Computer Science, University of Zielona Gora, ul. Prof. Z. Szafrana, 2, 65-516 Zielona Gora, Poland
2 Department of Infocommunication Engineering, Faculty of Infocommunications, Kharkiv National University of Radio Electronics, Nauky Avenue 14, 61166 Kharkiv, Ukraine
3 Department of Computer Systems, Networks and Cybersecurity, National Aerospace University “KhAI”, 17, Chkalov Str., 61070 Kharkiv, Ukraine
4 Kharkiv Radio Engineering Professional College, Sums’ka Str., 18/20, 61057 Kharkiv, Ukraine
* Author to whom correspondence should be addressed.
Computers 2024, 13(7), 166; https://doi.org/10.3390/computers13070166
Submission received: 27 May 2024 / Revised: 25 June 2024 / Accepted: 2 July 2024 / Published: 5 July 2024

Abstract: Replacement policies play an important role in the functioning of the cache memory of processor cores. The implementation of a successful policy allows us to increase the performance of the processor core and the computer system as a whole. Replacement policies are most often evaluated by the percentage of cache hits during the cycles of the processor bus when accessing the cache memory. The policies that focus on replacing the Least Recently Used (LRU) or Least Frequently Used (LFU) elements, whether instructions or data, are relevant for use. It should be noted that in the paging cache buffer, these replacement policies can also be used to replace address information. The pseudo-LRU (PLRU) policy performs replacement based on approximate information about the age of the elements in the cache memory. The hardware implementation of any replacement policy algorithm is a logic circuit. This hardware part of the processor core has certain characteristics: the latency of the search for a candidate element for replacement, the gate complexity, and the reliability. The circuits of the PLRUt and PLRUm replacement policies are synthesized, and their characteristics are investigated. Both are varieties of the PLRU replacement policy, which is close to the LRU policy in terms of the percentage of cache hits. In the current study, the hardware implementations of these policies are evaluated, and the possibility of adapting the cache memory of the processor core to each of the policies according to a selected priority characteristic is analyzed. The growth of the delay and gate complexity with an increase in the associativity of the cache memory is shown. The advantage of the hardware implementation of the PLRUt algorithm over the PLRUm algorithm for higher values of associativity is shown.

1. Introduction

The cache memory for data or instructions is an essential component of the processor cores of computer systems. It is known that loading data or instructions from the cache into the processor core is faster than loading them from the corresponding segments or pages of the working memory of the computer system. It should be added that a miss event in the cache memory activates the corresponding policy of replacing the data or instructions, which is an effective algorithm for selecting an element for replacement. Among the known replacement policies, the PLRU algorithm can be highlighted. It is implemented in the cache memory microarchitectures of i486 [1] and Pentium [2] processors. It is also implemented in Intel Core i7 processor cores of the following microarchitectures: Nehalem, Sandy Bridge, Ivy Bridge, Haswell, Broadwell, and Skylake [3]. The circuits of the PLRU replacement policy synthesized on the basis of the obtained research results can be implemented with the use of FPGA.

1.1. Motivation

The three main parameters of the cache memory of the processor core are the informational size, the associativity (number of ways), and the size of the row element in bytes. The associativity (number of ways) of the cache memory of modern processor cores is 4, 8, or 16. However, this sequence is not a limit, because the tendency to increase the size of the cache memory of modern processor cores will lead to an increase in the number of ways from which data elements are selected as replacement candidates. This trend also affects the hardware characteristics of the circuits of the PLRU policy, which is divided into the PLRUt and PLRUm policies. It should be noted that the characteristics of the circuits are the delay in the selection of a data element as a candidate for replacement, the hardware complexity measured in gates, and the reliability. However, most of the existing publications are not devoted to the study of these characteristics.
The description of the automata model and the further synthesis of the hardware implementation of the data replacement algorithm will allow us to determine the exact values of the mentioned characteristics and to investigate their dependence on the growth of associativity. It should also be added that the synthesized and researched PLRUt and PLRUm policy circuits provide an opportunity to conclude that there is adaptation of the cache memory of the processor core to each of the policies depending on the selected priority characteristic.

1.2. State of the Art

Replacement policies in the cache memory of the processor core are mainly characterized by the intensity of hits or misses. To understand the possible solutions, it is necessary to perform the analysis of known implementations and applications of the replacement policies.
The performed analysis of references shows that publications on this topic are focused on the following:
  • The study of replacement policies during the caching of instructions and data in the cache memory of the CPU;
  • The study of replacement policies during the caching in the device with FPGA;
  • Research on the adaptive associativity of a reconfigurable cache scheme;
  • The study of replacement policies in content-oriented computer networks;
  • Research on data caching schemes for wireless networks;
  • The study of the efficiency of different methods of caching of key-value storage for Blockchain in FPGA;
  • The study of a cache-based matrix technique using the Hadoop Distributed File System (HDFS), where the EC volume consists of a Reed–Solomon code with parameters of (6, 3).
An analysis of existing solutions shows that there are FPGA-based implementations of configuration memory controllers with LRU and LFU replacement policies [4]. But the results of reconfiguration are known only for hit rates in sequences for cache sizes of 8, 16, and 32 blocks of a non-adaptive mechanism with the following policies: LRU, Random, and LFU.
Other VHDL implementations of LRU and PLRU policies with the use of two different methods for a reconfigurable cache memory show a 100% hit rate for an associative set of an eight-way cache memory [5].
The use of a multi-stage LRU cache replacement algorithm allows us to reduce the indirect costs during execution and to achieve greater accuracy both in the memory and in the CPU due to an increase in the ratio of cache hits compared to the LRU policy [6]. The investigation of the use of the PLRU policy with dependency on the parameters of the cache memory shows the reduction in the cache miss rate and instruction cache cycles [7]. Randomization greatly simplifies the integration of PLRU logic [8]. The performance of Rand-PLRU is comparable to that of the PLRU policy [9]. But there is no detailed investigation of the dependency of latency, complexity, and reliability for different parameters of the implementation of modifications of this policy.
The simulation of high-performance cache implementations in FPGA shows that the two-way multi-associative cache, PLRUt, is the best choice because it requires a smaller number of stored bits but offers the same performance. The considered options include the number of ways, lines, and words per line. The PLRUm policy shows the highest performance in all three cache levels, while the LRU policy gives the worst performance [10].
The PAC-PLRU replacement policy not only uses but also intelligently aggregates the prediction information. The use of PAC-PLRU with a prefetch block reduces the average L2 cache miss rate compared to an ordinary PLRU policy [11]. Most important is that PAC-PLRU requires only minor changes in the existing cache architecture to achieve these benefits. At the same time, the simulation of the LRU policy shows the lowest cache miss rate for a private L1 data cache for the various benchmarks, regardless of the number of cores [12].
While the FSM model of the replacement logic of the PLRU policy of the four-way cache memory is well described [13], the synthesis of the hardware implementation with the estimation for a wide list of the parameter values of delay, complexity, and reliability is still an unsolved problem. But the values for conceptual models of the PLRU replacement policy for the cache memory of the processor core with four ways and eight ways are available for comparison [14]. Known variants of minimization of the PLRU algorithm for a q-way associative cache memory allow us to increase the characteristics of the speed and reliability of the logic circuit of the LRU block [15].
Implementations of the automata model of the pair of compatible adaptive algorithms with discrete functions and structural block diagrams are also known [16]. The results of the implementation of an adaptive associative reconfigurable cache scheme based on the decision tree in the GEM5 show in a simulation the dependency of the values of frequency and associativity for cache memories with a different number of ways [17].
The modifications of existing and new replacement policies find an application in network caching, like the Immature Used policy [18], and in wireless networks [19]. The simulation of a replacement policy that takes heterogeneity into account allows us to consider the improvement in the access speed and overall performance of asymmetric multicore systems with a separated Last Level cache memory for systems with different numbers of processor cores [20].
Research on data caching on the FPGA accelerator card from Xilinx programmed with Vitis 2020 is a possible way of experimentally investigating the performance of the implementation of replacement policies [21]. The use of FPGA also allows us to implement methods of caching for network purposes of key-value storage [22], as well as for distributed file systems [23]. But even for minimal hardware implementation of the PLRU algorithm, only a small number of ways are available for comparison [24].
Thus, a review of replacement policies shows the relevance of their study and use not only in the cache memory of computer systems with CPU and FPGA modules but also in network caching. Modern FPGAs are powerful enough for the implementation of a processor [25], with the possibility of evaluating its parameters [26]. The considered studies of replacement policies in the cache memory of computer systems are focused on the dependence of cache hit or miss ratios on the reconfiguration periods of adaptation, the associativity (number of ways), and the cache memory volume. Research on replacement policies in content-oriented computer networks focuses on performance evaluation (cache hit ratio, path length, delay, and link load). With regard to complexity evaluation, the complexity of the PLRUt and PLRUm replacement policies is estimated as the additional logic (LUTs) for addressing lines in the cache and the total number of FF bits (PIM module) [7], and the cache circuit reconfiguration control logic is estimated in [14]. The delay during the search for a candidate element for replacement in a cache line, the gate complexity, and the reliability still need a clearer and more detailed study, which is missing in the reviewed modern sources on replacement policies.
The summarized directions of the research on caching in each considered reference are presented in Table 1.

1.3. Objectives and Structure

The goal of this article is to investigate and develop hardware solutions for implementing algorithms for data replacement in the cache memory of processor cores to improve the key characteristics of time delays, complexity, and reliability.
The main objectives are as follows:
  • To analyze the structures of synchronous digital automata for further synthesis of the minimal hardware implementations of PLRUt and PLRUm replacement policies;
  • To develop a minimal logic model for further research in a simulation environment;
  • To evaluate the parameters of hardware implementations from the point of view of the delay of the transient process with the use of logical elements of NAND basis;
  • To evaluate the complexity measured in logical elements and the reliability of hardware implementations;
  • To analyze the possibility of the adaptation of the processor core cache memory to each of the PLRUt and PLRUm policies depending on the selected priority characteristic.
According to the research results, it will be possible to ascertain the possibility of adapting the cache memory of the processor core to each of the PLRUt or PLRUm policies depending on the selected priority characteristics, such as delay, gate complexity, and reliability. Further research on hardware solutions will contribute to the development of a method of adapting replacement policy mechanisms to change various algorithms depending on the parameters of the computing process.
The structure of this article is as follows: the methodology of the investigation is described in Section 2; Section 3 is dedicated to developing automata models and the technique of hardware implementation of PLRUt and PLRUm algorithms; the summarized results of the research on algorithm hardware implementation are presented in Section 4 for ways of q = 8, q = 16, and q = 32; Section 5 and Section 6 discuss the results and conclude this study correspondingly.

2. Methodology and Stages of This Study

The research methodology consists of the following:
  • A formal description of the structure of the synchronous digital automata followed by the synthesis of logical models of the circuits of PLRUt and PLRUm replacement algorithms, which are based on the theory of digital automata;
  • The minimization of defined switching functions;
  • An evaluation of gate complexity and the reliability of the circuits;
  • The experimental implementation of the model in FPGA and an investigation of the parameters.
The methodology is implemented in the following stages:
Stage 1—a formal description of the structure of the synchronous digital automata implementation of the replacement algorithms PLRUt and PLRUm.
Stage 2—the synthesis of logical models for these replacement algorithms.
Stage 3—the study of computer models for simulating automata and checking the correctness of the implementation of PLRUt and PLRUm replacement algorithms.
Stage 4—the study of the characteristics of the means of implementation of these replacement algorithms.
Stage 5—an analysis of the results of the comparison of the characteristics of means implementing the PLRUt and PLRUm algorithms and determining the possibility of the adaptation of the cache memory of the processor core by choosing one of the algorithms depending on the characteristics of the circuits.

3. Hardware Implementation of PLRUt and PLRUm

The PLRUt and PLRUm replacement policies are variants of the PLRU replacement policy, which is close to the LRU policy in terms of the cache hit percentage. In the PLRUt replacement policy, the index t indicates that the least recently used element for replacement is determined based on a binary tree algorithm. In the PLRUm replacement policy, the index m indicates that the least recently used element to replace is determined based on the age bits.
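For orientation, the short C sketch below shows the textbook tree-based PLRUt bookkeeping for one four-way set. It is a behavioral illustration only, not the minimized circuit synthesized in this paper (which uses m = log2(q) JK flip-flops); the bit names B0, B1, B2 and the chosen encoding convention are assumptions made for the example.

#include <stdio.h>

/* Textbook tree-PLRU (PLRUt) bookkeeping for one four-way set.
   B0 selects the half that currently holds the replacement candidate;
   B1 and B2 select the candidate way inside the left and right halves. */
static int B0, B1, B2;

static void touch(int way)            /* cache hit: point the tree away from the used way */
{
    if (way < 2) { B0 = 1; B1 = (way == 0); }   /* candidate moves to the right half */
    else         { B0 = 0; B2 = (way == 2); }   /* candidate moves to the left half  */
}

static int victim(void)               /* cache miss: follow the tree bits to the candidate */
{
    return (B0 == 0) ? (B1 ? 1 : 0) : (B2 ? 3 : 2);
}

int main(void)
{
    const int accesses[] = { 0, 1, 2, 3, 0 };
    for (int k = 0; k < 5; k++) {
        touch(accesses[k]);
        printf("hit on way %d -> replacement candidate: way %d\n",
               accesses[k], victim());
    }
    return 0;
}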

3.1. Automata Model of Algorithm PLRUt

The automata model of the hardware implementation of the PLRUt replacement algorithm [15] (Figure 1) consists of a combinational circuit of the CSdec way decoder and memory elements of the synchronous JK flip-flop type. The operation of the automaton is described by the following transition and output functions ((1) and (2)):
Q_i^+ = f(Φ_i(Q), Ψ_i(Q), H, R),   (1)
L_r = Λ_r(Q),   (2)
where Q = {Q_0, Q_1, …, Q_(m−1)} is the set of the internal states of the automaton at the current moment of time t; Q^+ = {Q_0^+, Q_1^+, …, Q_(m−1)^+} is the set of internal states of the automaton at the next moment t^+; Φ = {Φ_0(Q), Φ_1(Q), …, Φ_(m−1)(Q)} is the set of simple switching functions for the excitation of the information inputs J of the synchronous flip-flops; Ψ = {Ψ_0(Q), Ψ_1(Q), …, Ψ_(m−1)(Q)} is the set of simple switching functions for the excitation of the information inputs K of the synchronous flip-flops; Λ = {Λ_0(Q), Λ_1(Q), …, Λ_(q−1)(Q)} is the set of simple switching functions for the selection of the q ways in the set of lines with the data element L; Φ_i(Q), Ψ_i(Q), Λ_r(Q), Q_i, Q_i^+ ∈ {0, 1}; L_r is the data element selected in the cache memory at the current way r; i ∈ {0, 1, 2, …, m − 1}; r ∈ {0, 1, 2, …, q − 1}; H is the signal of a cache hit/miss; R is the reset signal; m is the number of memory elements of the synchronous JK flip-flop type; q = 2^m is the number of ways in the cache memory; m, q ∈ N0.

3.2. Synthesis of Hardware Implementation of PLRUt

The synthesized computer models of the PLRUt data replacement algorithm with q = 4 and q = 8 are described in [15] and [24]. These models provided the possibility to observe the corresponding sequences of changes in the q-indexes, L0 → L2 → L1 → L3 and L0 → L4 → L2 → L6 → L1 → L5 → L3 → L7, of the data element L for replacement in the case of a cache hit event.
The observed sequences of changes in the q-indexes allowed us to synthesize a general algorithm of changing the q-index (Figure 2), according to which it is possible to synthesize the computer models of the PLRUt data replacement algorithm with q = 16 and q = 32.
Synthesized computer models of the PLRUt data replacement algorithm with q = 16 and q = 32 are described by logical equations, (3)–(6) and (7)–(11), where JK-type synchronous flip-flops are used as memory elements. The synthesized model is shown in Figure 3, taking into account the truth tables of the ways q of the decoders.
When studying computer models, the following sequences of q-index changes were observed:
q = 16: L0 → L8 → L2 → L10 → L4 → L12 → L6 → L14 → L1 → L9 → L3 → L11 → L5 → L13 → L7 → L15;
q = 32: L0 → L16 → L2 → L18 → L4 → L20 → L6 → L22 → L8 → L24 → L10 → L26 → L12 → L28 → L14 → L30 → L1 → L17 → L3 → L19 → L5 → L21 → L7 → L23 → L9 → L25 → L11 → L27 → L13 → L29 → L15 → L31.
The observed sequences of changes in the q-index correspond to the general algorithm shown in Figure 2.
The logical equations of the computer model of the hardware implementation of the PLRUt algorithm for q = 16:
J_0 = K_0 = Q_3 & Q_2 & Q_1,   (3)
J_1 = K_1 = Q_3,   (4)
J_2 = K_2 = Q_3 & Q_1,   (5)
J_3 = K_3 = 1.   (6)
The logical equations of the computer model of the hardware implementation of the PLRUt algorithm for q = 32:
J_0 = K_0 = Q_4 & Q_3 & Q_2 & Q_1 = J_3 & Q_3,   (7)
J_1 = K_1 = Q_4,   (8)
J_2 = K_2 = Q_4 & Q_1,   (9)
J_3 = K_3 = Q_4 & Q_2 & Q_1 = J_2 & Q_2,   (10)
J_4 = K_4 = 1.   (11)
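The following minimal C sketch is a software check of Equations (3)–(6) for q = 16, written for this text rather than taken from the paper's toolchain. Each JK flip-flop with J = K toggles when its excitation function equals 1; the decoding of the state bits into the way index r is an assumption chosen so that, with a cache hit on every clock, the printed order reproduces the q = 16 sequence listed above.

#include <stdio.h>

int main(void)
{
    int Q0 = 0, Q1 = 0, Q2 = 0, Q3 = 0;          /* internal states after reset R       */
    for (int step = 0; step < 16; step++) {
        int r = 8 * Q3 + 4 * Q2 + 2 * Q1 + Q0;   /* assumed CSdec decoding of way L_r   */
        printf("L%d ", r);
        int T0 = Q3 & Q2 & Q1;                   /* Equation (3): J0 = K0               */
        int T1 = Q3;                             /* Equation (4): J1 = K1               */
        int T2 = Q3 & Q1;                        /* Equation (5): J2 = K2               */
        int T3 = 1;                              /* Equation (6): J3 = K3               */
        Q0 ^= T0; Q1 ^= T1; Q2 ^= T2; Q3 ^= T3;  /* clock edge: J = K = 1 toggles a FF  */
    }
    printf("\n");  /* prints L0 L8 L2 L10 L4 L12 L6 L14 L1 L9 L3 L11 L5 L13 L7 L15 */
    return 0;
}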

3.3. Research of Hardware Implementation for Algorithm PLRUt

The synthesized computer models (including Figure 3) [15,24] allow us to create Table 2.
The probability of failure-free operation is determined by Equation (12), which allows us to calculate the corresponding probabilities for the synthesized circuits:
P(t) = e^(−λ·t),   (12)
where e is the base of the natural logarithm; λ is the failure rate of one gate or logical element; t is the operating time before failure, measured in hours.
The complexity values given in Table 2 and the calculated corresponding probabilities of failure-free operation allow us to create appropriate graphs of the dependence of the complexity of the hardware implementations of the PLRUt algorithm on the ways q = 4, q = 8, q = 16, and q = 32 (Figure 4) and the probabilities of failure-free operation in the time sections t = 100, t = 1000, t = 10,000, and t = 100,000 (Figure 5).
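As an illustration of how Equation (12) scales with circuit size, the short C program below treats the circuit as a series system of C identical gates, so that the circuit failure rate is C·λ. The per-gate failure rate λ = 10^-7 1/h used here is an assumed value for the example only; the complexities are taken from Table 2.

#include <stdio.h>
#include <math.h>

int main(void)
{
    const double lambda = 1e-7;                  /* assumed per-gate failure rate, 1/h  */
    const double t = 10000.0;                    /* operating time, hours               */
    const int ways[]  = { 4, 8, 16, 32 };
    const int gates[] = { 12, 21, 34, 55 };      /* C_PLRUt per line (Table 2)          */
    for (int i = 0; i < 4; i++) {
        double p = exp(-gates[i] * lambda * t);  /* series reliability of C gates       */
        printf("q = %2d: C = %2d gates, P(t = 10,000 h) = %.4f\n",
               ways[i], gates[i], p);
    }
    return 0;
}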
The total delay in the transient processes of the hardware implementations of the PLRUt and PLRUm algorithms with q = 4 ways is determined by Expression (13):
τ_total = τ_mem + τ_dec,   (13)
where τ is the delay in the transient process of the NAND logical element; τ_mem is the delay in the transients of the synchronous flip-flop of JK type; τ_dec is the delay in the transient processes of the combinational logic of the CSdec way decoder; τ_total is the total delay in the transient processes of the hardware implementation of the PLRUt and PLRUm algorithms with q ways.
The complexity of these hardware implementations, including the hardware implementations of the PLRUt and PLRUm algorithms, can be calculated using Formulas (14)–(16):
C_total = C_mem + C_dec + C_add_log,   (14)
where C_mem is the complexity, measured in logical elements or gates, of the memory elements; C_dec is the complexity, measured in logical elements or gates, of the way decoder; C_add_log is the synthesized or predicted additional gate complexity.
C_mem = N × k,   (15)
where N is the number of memory elements; k is the complexity, measured in logical elements or gates, of one memory element (k = 4 for a memory element implemented based on the synchronous flip-flop of JK type).
C_dec = q × v,   (16)
where q is the number of ways of the cache memory; v is the complexity, measured in logical elements or gates, of the decoder of one way q_i.
The partial complexity v, measured in logical elements or gates, of the decoder of one way q_i is provided in Table 3. The columns v 2AND, v 3AND, and v 4AND give the numbers of AND logic elements with 2, 3, and 4 inputs, respectively.
The value of additional gate complexity Cadd_log is provided in Table 4. The synthesis of hardware implementations of the PLRUt algorithm for q = 4, q = 8, q = 16, and q = 32 shows that when the number of ways is doubled, the value of the additional gate complexity Cadd_log increases by 1. Therefore, it allows us to predict the additional complexity measured in logical elements or the gates for q ≥ 64 (Table 4).
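As a consistency check of Formulas (14)–(16), consider the PLRUt implementation with q = 4: it uses N = 2 JK flip-flops, so C_mem = N × k = 2 × 4 = 8 gates; the way decoder needs v = 1 gate per way (Table 3), so C_dec = q × v = 4 × 1 = 4 gates; with C_add_log = 0, the total is C_total = 8 + 4 + 0 = 12 gates, which agrees with the values for q = 4 in Table 2 and Table 4.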
Thus, a more complete picture of the dependence of the complexity of the hardware implementations of the PLRUt algorithm on the number of ways q in the associative cache memory of the processor can be represented in the form of a curve (Figure 6).
Thus, it is possible to reproduce the indicators of delay, complexity, and reliability of circuits of the PLRUt algorithm’s implementation for the range from q = 4 to q = 2048 (Table 5).
However, it is necessary to take into account that the total delay in the hardware implementation of the PLRUt algorithm consists of the sum of two components, the delay of the flip-flop of JK type and the delay of the way decoder, that are provided in Table 4.
Table 5 allows us to represent the dependence of probabilities of failure-free operation of implementations of the PLRUt algorithm for the range from q = 4 to q = 2048 (Figure 7). The way selection delay in Figure 8 shows the step-like rise between q = 16 and q = 32.

3.4. Automata Model of Hardware Implementation of PLRUm Algorithm

The automata model of the hardware implementation of the PLRUm replacement algorithm (Figure 9) consists of a combinational circuit of the CSdec-way decoder and the memory elements of the synchronous flip-flop of JK type. The operation of the automaton is described by the following transition and output functions ((17) and (18)):
A_i^+ = f(Φ_i(A), Ψ_i(A), H, R),   (17)
L_r = Λ_r(A),   (18)
where A = {A_0, A_1, …, A_(q−1)} is the set of internal automaton states at the current moment of time t; A^+ = {A_0^+, A_1^+, …, A_(q−1)^+} is the set of internal states at the next moment t^+; Φ = {Φ_0(A), Φ_1(A), …, Φ_(q−1)(A)} is the set of simple switching functions for the excitation of the information inputs J of the synchronous flip-flops; Ψ = {Ψ_0(A), Ψ_1(A), …, Ψ_(q−1)(A)} is the set of simple switching functions for the excitation of the information inputs K of the flip-flops; Λ = {Λ_0(A), Λ_1(A), …, Λ_(q−1)(A)} is the set of simple switching functions for the selection of the q ways with the recently used data elements L; Φ_i(A), Ψ_i(A), Λ_r(A), A_i, A_i^+ ∈ {0, 1}; L_r is the data element selected in the cache memory by the current way r; i ∈ {0, 1, 2, …, q − 1}; r ∈ {0, 1, 2, …, q − 1}; H is the cache hit/miss signal; R is the reset signal; q is the number of memory elements and, at the same time, the number of ways in the cache memory; q ∈ N0.

3.5. Synthesis of Hardware Implementation of PLRUm Algorithm with Ways of q = 4

The synthesis of the hardware implementation is based on the idea of using a FIFO queue (Figure 10) to fill the additional bits Ai with a logical value of 1 in the case of a cache hit event (H = 1). In the case of a cache miss event, the hardware implementation should automatically choose the way i + 1, for which the data element L with the corresponding bit Ai+1 = 0 will be selected.
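To make the selection rule concrete, the following C sketch models the PLRUm bookkeeping for one four-way set at the behavioral level. It is written for this text and is not the synthesized circuit of Figure 11: in particular, it tracks the way that was actually hit, whereas the FIFO-style circuit fills the bits in order.

#include <stdio.h>

#define Q 4                               /* number of ways in the set            */

static int A[Q];                          /* one "recently used" bit per way      */

static void hit(int way)                  /* cache hit: mark the way as used      */
{
    A[way] = 1;
    int all_set = 1;
    for (int i = 0; i < Q; i++) all_set &= A[i];
    if (all_set) {                        /* all bits are 1: clear the others     */
        for (int i = 0; i < Q; i++) A[i] = 0;
        A[way] = 1;
    }
}

static int victim(void)                   /* cache miss: first way whose bit is 0 */
{
    for (int i = 0; i < Q; i++)
        if (A[i] == 0) return i;
    return 0;                             /* unreachable because of the clearing  */
}

int main(void)
{
    const int accesses[] = { 0, 1, 3, 0, 2 };
    for (int k = 0; k < 5; k++) {
        hit(accesses[k]);
        printf("hit on way %d -> replacement candidate: way %d\n",
               accesses[k], victim());
    }
    return 0;
}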
The computer model of the hardware implementation of the PLRUm algorithm with ways q = 4 is provided in Figure 11.
This model of the hardware implementation of the PLRUm algorithm with the parameter of ways q = 4 is represented by the following logical Equations (19)–(24):
J_0 = K_3 = 1, K_0 = K_1 = K_2 = A_3,   (19)
J_1 = A_0,   (20)
J_2 = A_1 & A_0,   (21)
J_3 = A_2,   (22)
L_0 = ¬A_0,   (23)
L_(n+1) = ¬A_(n+1) & (A_0 & A_1 & … & A_n).   (24)
The study of the computer model of the PLRUm algorithm for q = 4 shows the compliance of its operation with the theoretical expectations determined by the idea of the PLRUm replacement algorithm.

3.6. Research of Hardware Implementation of PLRUm Algorithm with Ways q = 4

3.6.1. Evaluation of Latency and Complexity

According to Expression (13), in the case of an event of a cache hit (H = 1), the total delay in the hardware implementation of the PLRUm algorithm with ways of q = 4 will be as follows:
τ_total = 3τ + τ = 4τ.
Figure 12 shows the functional circuit of a JK flip-flop with static synchronization with a gate complexity of k = 4.
Expressions (14)–(16) allow us to summarize the gate complexity of the hardware implementation of the PLRUm algorithm for a number of ways of q = 4 (Table 6).

3.6.2. Evaluation of Reliability

Based on Expression (12) and taking into account Table 6, it is possible to represent the evaluation of the fault-free operation of the PLRUm algorithm hardware implementation with ways q = 4 in the form of a graph (Figure 13).

4. Summarized Results of Study of Hardware Implementation of PLRUm Algorithm with Ways of q = 8, q = 16, and q = 32

The synthesis of hardware implementations of the PLRUm algorithm with ways of q = 8, q = 16, and q = 32 is based on the synthesis of the hardware implementation of the PLRUm algorithm with ways of q = 4.
The computer model of the hardware implementation of the PLRUm algorithm with ways of q = 8 (Figure 14) is based on the following logical Equations (23)–(26):
J_0 = K_7 = 1, K_0 = K_1 = K_2 = K_3 = K_4 = K_5 = K_6 = A_7,
J_(m+1) = A_m.
During the study of the computer models of the PLRUm algorithm for q = 8 and q = 16, the compliance of the results with the theoretical expectations determined by the idea of the PLRUm replacement algorithm is observed.

4.1. Research of Hardware Implementations of PLRUm Algorithm with Ways of q = 8, q = 16, and q = 32

4.1.1. Evaluation of Complexity and Reliability

Based on Expressions (14)–(16), it is possible to summarize the complexity measured in logical elements or the gates of hardware implementations of the PLRUm algorithm with ways of q = 8 and q = 16 (Table 7).
The computer model of the hardware implementation of the PLRUm algorithm with ways of q = 16 allows us to perform an accurate estimation of the complexity of the CSdec combinational circuit and hardware implementation of the PLRUm algorithm with ways of q = 32 without the process of synthesis. It should be noted that at the same time, the complexity of memory elements Cmem will increase by exactly two times compared to the hardware implementation of the PLRUm algorithm with ways of q = 16.
With the use of Expressions (14)–(16), it is possible to summarize the gate complexity of the hardware implementation of the PLRUm algorithm with ways of q = 32 (Table 8).

4.1.2. Evaluation of Delay

Based on Expression (13), in the case of an event of a cache hit (H = 1), the total delay in PLRUm hardware implementation with ways of q = 8, q = 16, and q = 32 will be as follows:
τ_total (q = 8) = 3τ + 2τ = 5τ;
τ_total (q = 16) = 3τ + 2τ = 5τ;
τ_total (q = 32) = 3τ + 3τ = 6τ.
Based on Table 6, Table 7 and Table 8, it is possible to estimate the complexity and reliability of the hardware implementations of the PLRUm algorithm with ways of q = 4, q = 8, q = 16, and q = 32 and represent them in the form of the graphs (Figure 15).

4.2. Possible Limitations and Challenges of Hardware Implementation of Both Algorithms

The current study is primarily targeted toward the investigation of the parameters of the hardware implementation of both the PLRUt and PLRUm algorithms. This understanding allows us to estimate the characteristics for different values of the parameter q (number of ways) without physical implementation and to determine the dependency on this q parameter.
The performed investigation allows us to highlight the possible limitations or problems during the implementation of the hardware of the PLRUt algorithm. In this study, the investigation of probabilities of failure-free operation of the hardware implementation of the PLRUt algorithm is shown in Figure 7. It illustrates the probabilities of failure-free operation of the hardware implementation of the PLRUt algorithm for q = 4, q = 8, q = 16, q = 32, q = 64, q = 128, q = 256, q = 512, q = 1024, and q = 2048. Here, it can be seen that the reliability of the hardware implementation of the algorithm for q = 64, q = 128, q = 256, q = 512, q = 1024, and q = 2048 rapidly decreases from 0.8 to almost 0.
This is also valid for the hardware implementation of the PLRUm algorithm. The investigation results for this algorithm are provided in Figure 15.
Therefore, the implementation of the hardware of these algorithms is reasonable for the parameters of ways of q = 4, q = 8, q = 16, and q = 32 (Figure 16).

5. Discussion

5.1. Comparative Analysis

The results of the performed investigation of the synthesized hardware implementations of the PLRUt and PLRUm algorithms for one cache memory line, together with the replacement policy resource utilization reported in Table 3 of [10], are summarized in Table 9.
The results of the performed investigation in Table 9 are provided for ways of q = 4 and q = 8, where a smaller number of LUTs and a smaller or equal number of FFs are observed for both algorithms compared to the results of Table 3 of [10]. The replacement policy resource data of [10] for ways of q = 16 and q = 32 are not available.
The proposed method of synthesizing the hardware implementation of the PLRUt algorithm allows us to obtain results for a higher number of ways (Table 10).
The universal form of the synthesized hardware implementations of the PLRUt algorithm for q = 4, q = 8, q = 16, and q = 32 allows us to determine the values of the delay, gate complexity, and reliability for the entire range of 4 ≤ q ≤ 2048.

5.2. Adaptation

One of the goals of this study is to determine the possibility of the adaptation [16] of the cache memory of the processor core to each of the PLRUt or PLRUm policies depending on one of the following priority characteristics selected in the core: delay, complexity, or reliability. The characteristic values summarized in Table 11 provide an opportunity to build comparative graphs for each of the characteristics (Figure 16).
Based on the comparison of the hardware implementations of the algorithms provided in Figure 16, it is possible to conclude that the hardware implementation of the PLRUt algorithm is better than the hardware implementation of the PLRUm algorithm in terms of all characteristics. This makes the adaptation impossible but, based on these advantages, increases the probability of the use of the hardware implementation of this algorithm for the cache memory of the CPU or implementation in FPGA.

5.3. Further Development

The obtained results allow us to make improvements to the hardware implementation of the algorithms of the replacement policy. The fast verification of the workability of such algorithms can be performed with the use of FPGA to create the hardware implementation of the required circuit.
The results are also promising for the improvement in the architectures of the existing processors and for prototyping the superscalar processor models. Modern FPGA chips allow for the implementation of the super-parallelized scalable and parameterizable dedicated data processing systems for intensive searches of data in dynamic HBM memory and the implementation of services with dedicated architectures [25].
The creation of a dedicated architecture before the production of the chip allows us to perform practical measurements to compare the results with the results of theoretical research [26].
The possibility of implementing not only the logic of a cache memory replacement policy but also the queue of the commands for execution allows us to perform the next step of improving the performance of the processors while ensuring backward compatibility with existing software for the processors. This step allows us to improve the performance by means of out-of-order execution of commands without data dependency.
The practical experience of the creation of the dedicated architecture with powerful FPGA chips and the development flow with the software for prototyping allows us to apply the same approach for the creation of hardware models of such algorithms for proving the concepts and quickly checking the optimizations. It is possible to do this because at the moment, most FPGAs are produced by the same companies that are involved in the production of modern processors. The same technological process is used during the production. At the same time, the total performance of some FPGAs is higher than that of CPUs [27].
Thus, finding a new model of the prediction of candidates for replacement in the cache memory and the creation of a queue of commands for execution are the most promising directions for the next steps of future research. Involving modern developed environments for FPGA project prototyping allows us to reduce the time taken for the implementation and testing of such theoretical models. Combining theoretical research with practical experiments is the best way to enhance processor architectures with a prototyping process with low costs and efforts.

6. Conclusions

The main contribution of this study lies in developing a method of hardware implementation of algorithms for data replacement in the cache memory of processor cores, as well as the accessible assessments of complexity, latency, and reliability that allow us to obtain the proof of concept of the adaptive solutions.
This paper proposes the description of automata models and the results of the synthesis of hardware solutions of PLRUt and PLRUm data replacement algorithms. It allows us to determine the parameters and values of characteristics, such as the delay in the selection of a data element for a candidate for replacement, the complexity measured in logical elements, and the reliability of the equipment, as well as to investigate their dependence on the growth of the associativity. With the growth of the associativity of the cache memory, both the delay parameters and the complexity measured in logical elements increase.
Implementations of both algorithms are compared and considered in detail. The hardware implementation of the data replacement algorithm PLRUt has an advantage in delay of one τ over the hardware implementation of the data replacement algorithm PLRUm in the case of the associativity of 8, 16, and 32.
However, the delays of the implementations of the two algorithms are equal at q = 4. According to the parameters of complexity and reliability, the hardware implementation of the PLRUt data replacement algorithm shows better results than the hardware implementation of the PLRUm data replacement algorithm at q = 4, 8, 16, and 32. This advantage increases the probability of using the implementation of this algorithm in the CPU cache memory module or for FPGA implementation.
Further research can be focused on adding and enhancing the set of algorithms for different replacement policies to determine their priority and selection based on preferred parameters. The possibility of rapid prototyping and testing with the use of FPGA helps to prove the workability of the algorithm. In addition, the development of a method of adapting the mechanisms of replacement policies for changing various algorithms depending on the parameters of the computing process is very promising.

Author Contributions

Conceptualization, A.B., V.K., and V.P.; methodology, V.K., V.P., and A.P.; software, V.P.; validation, L.T. and A.P.; formal analysis, A.B. and V.K.; investigation, V.P. and A.P.; resources, L.T.; data curation, V.P. and A.P.; writing—original draft preparation, V.P., A.P. and V.K.; writing—review and editing, L.T. and A.B.; visualization, V.P. and A.P.; supervision, V.K.; project administration, L.T., A.B. and V.K. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by the Ministry of Education and Science of Ukraine, state grant registration number 0122U001065, project title “Methods, software and hardware tools, and technologies for providing dependability of intelligent industrial Internet of Things”. This publication reflects the views only of the authors, and the Ministry of Education and Science of Ukraine cannot be held responsible for any use that may be made of the information contained therein.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

The authors are grateful to the scientific staff of the Department of Computer Systems, Networks and Cybersecurity of the National Aerospace University “KhAI” and the Institute of Metrology, Electronics and Computer Science of the University of Zielona Gora for their invaluable inspiration, hard work, and creative analysis during the preparation of this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Embedded Intel486 Processor Hardware Reference Manual; Intel Corporation: Santa Clara, CA, USA, 1997; p. 334.
  2. Brey, B.B. The Intel Microprocessors: 8086/8088, 80186/80188, 80286, 80386, 80486, Pentium, Pentium Pro Processor, Pentium II, Pentium III, Pentium 4, and Core2 with 64-Bit Extensions: Architecture, Programming, and Interfacing; Pearson Prentice Hall: Old Bridge, NJ, USA, 2009; p. 925. [Google Scholar]
  3. Adkins, A.; Ammeson, B.; Anouna, J.; Garside, T.; Hunker, L.; Mailand, S. Intel Core i7 Memory Hierarchy. Available online: http://web.cs.wpi.edu/~cs4515/d15/Protected/LecturesNotes_D15/Week3_TeamA_i7-Presentation.pdf (accessed on 23 May 2023).
  4. Clemente, J.A.; Gran, R.; Chocano, A.; del Prado, C.; Resano, J. Hardware Architectural Support for Caching Partitioned Reconfigurations in Reconfigurable Systems. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2016, 24, 530–543. [Google Scholar] [CrossRef]
  5. Omran, S.S.; Amory, I.A. Implementation of LRU Replacement Policy for Reconfigurable Cache Memory Using FPGA. In Proceedings of the 2018 International Conference on Advanced Science and Engineering (ICOASE), Kurdistan Region, Iraq, 9–11 October 2018; pp. 13–18. [Google Scholar]
  6. Inoue, H. Multi-step LRU: Simd-based Cache Replacement for Lower Overhead and Higher Precision. In Proceedings of the 2021 IEEE International Conference on Big Data (Big Data), Orlando, FL, USA, 15–18 December 2021; pp. 174–180. [Google Scholar]
  7. Wang, C.; Wang, J.; Wang, M. Cache Performance Research for Embedded Processors. Phys. Procedia 2012, 25, 1322–1328. [Google Scholar] [CrossRef]
  8. Shimizu, A.; Townley, D.; Joshi, M.; Ponomarev, D. EA-PLRU: Enclave-aware Cache Replacement. In Proceedings of the HASP ’19: Proceedings of the 8th International Workshop on Hardware and Architectural Support for Security and Privacy, Phoenix, AZ, USA, 23 June 2019. [Google Scholar]
  9. Abel, A.; Reineke, J. Reverse engineering of cache replacement policies in Intel microprocessors and their evaluation. In Proceedings of the 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2014), Monterey, CA, USA, 23–25 March 2014; IEEE Computer Society: Washington, DC, USA, 2014; pp. 141–142. [Google Scholar]
  10. Roque, J.V.; Lopes, J.D.; Véstias, M.P.; de Sousa, J.T. IOb-Cache: A High-Performance Configurable Open-Source Cache. Algorithms 2021, 14, 218. [Google Scholar] [CrossRef]
  11. Zhang, K.; Wang, Z.; Chen, Y.; Zhu, H.; Sun, X.-H. PAC-PLRU: A Cache Replacement Policy to Salvage Discarded Predictions from Hardware Prefetchers. In Proceedings of the 2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, Newport Beach, CA, USA, 23–26 May 2011; pp. 265–274. [Google Scholar]
  12. Lentz, M.; Franklin, M. Performance of Private Cache Replacement Policies for Multicore Processors. In Proceedings of the 4th International Conference on Computer Science, Engineering and Applications, Dubai, United Arab Emirates, 7–8 March 2014; Volume 4, pp. 1–7. [Google Scholar]
  13. Perez, W.J.H.; Sanchez, E.; Reorda, M.S.; Tonda, A.; Medina, J.V. Functional Test Generation for the Plru Replacement Mechanism of Embedded Cache Memories. In Proceedings of the 2011 12th Latin American Test Workshop (LATW), Beach of Porto de Galinhas, Brazil, 27–30 March 2011; pp. 1–6. [Google Scholar]
  14. Grund, D.; Reineke, J. Toward Precise PLRU Cache Analysis. In Proceedings of the 10th International Workshop on Worst-Case Execution Time Analysis, WCET 2010, Brussels, Belgium, 6 July 2010; Volume 15, pp. 23–35. [Google Scholar]
  15. Puidenko, V.; Kharchenko, V. The minimizating of logical scheme for implementation of pseudo LRU by inter-type transition in trigger structures. Radioel. Comp. Syst. 2020, 2, 33–47. [Google Scholar]
  16. Puidenko, V. Automaton model, device synthesis and adaptive substitution algorithm for cache memory. Radioel. Comp. Syst. 2020, 4, 68–78. [Google Scholar]
  17. Zhu, W.; Zeng, X. Decision Tree-Based Adaptive Reconfigurable Cache Scheme. Algorithms 2021, 14, 176. [Google Scholar] [CrossRef]
  18. Rashid, S.; Razak, S.A.; Ghaleb, F.A. IMU: A Content Replacement Policy for CCN, Based on Immature Content Selection. Appl. Sci. 2022, 12, 344. [Google Scholar] [CrossRef]
  19. Sheraz, M.; Shafique, S.; Imran, S.; Asif, M.; Ullah, R.; Ibrar, M.; Khan, J.; Wuttisittikulkij, L. A Reinforcement Learning Based Data Caching in Wireless Networks. Appl. Sci. 2022, 12, 5692. [Google Scholar] [CrossRef]
  20. Fang, J.; Kong, H.; Yang, H.; Xu, Y.; Cai, M. A Heterogeneity-Aware Replacement Policy for the Partitioned Cache on Asymmetric Multi-Core Architectures. Micromachines 2022, 13, 2014. [Google Scholar] [CrossRef] [PubMed]
  21. Knoben, P.; Alachiotis, N. Improving Performance of Hardware Accelerators by Optimizing Data Movement: A Bioinformatics Case Study. Electronics 2023, 12, 586. [Google Scholar] [CrossRef]
  22. Siddiqui, M.F.; Ali, F.; Javed, M.A.; Khan, M.B.; Saudagar, A.K.J.; Alkhathami, M.; Abul Hasanat, M.H. An FPGA-Based Performance Analysis of Hardware Caching Techniques for Blockchain Key-Value Database. Appl. Sci. 2023, 13, 4092. [Google Scholar] [CrossRef]
  23. Shin, D.-J.; Kim, J.-J. Cache-Based Matrix Technology for Efficient Write and Recovery in Erasure Coding Distributed File Systems. Symmetry 2023, 15, 872. [Google Scholar] [CrossRef]
  24. Puidenko, V.; Kharchenko, V. The Minimizating of Hardware for Implementation of Pseudo LRU Algorithm for Cache Memory. In Proceedings of the 2020 IEEE 11th International Conference on Dependable Systems, Services and Technologies (DESSERT), Kyiv, Ukraine, 14–18 May 2020; pp. 65–71. [Google Scholar]
  25. Perepelitsyn, A. Method of Creation of FPGA based Implementation of Artificial Intelligence as a Service. Radioel. Comp. Syst. 2023, 3, 27–36. [Google Scholar] [CrossRef]
  26. Perepelitsyn, A.; Kulanov, V.; Zarizenko, I. Method of QoS Evaluation of FPGA as a Service. Radioel. Comp. Syst. 2022, 4, 153–160. [Google Scholar] [CrossRef]
  27. Barkalov, A.; Titarenko, L.; Mazurkiewicz, M. Foundations of Embedded Systems; Studies in Systems, Decision and Control, V. 195; Springer: Cham, Switzerland, 2019; 167p. [Google Scholar]
Figure 1. Automata model of hardware implementation of PLRUt algorithm.
Figure 2. The general algorithm of changing the q-index according to the PLRUt algorithm.
Figure 3. Computer model of hardware implementation of PLRUt algorithm for q = 16.
Figure 4. The dependence of the complexity of the PLRUt algorithm hardware implementations on the number of ways q in the associative cache memory of the processor.
Figure 5. Probabilities of trouble-free operation of hardware implementations of PLRUt algorithm (a) for q = 4; (b) q = 8; (c) q = 16; (d) q = 32.
Figure 6. The extended dependence of the complexity of hardware implementations of the PLRUt algorithm on the number of ways q in the multi-associative cache memory of the processor core.
Figure 7. The probabilities of failure-free operation of the hardware implementation of the PLRUt algorithm for q = 4, q = 8, q = 16, q = 32, q = 64, q = 128, q = 256, q = 512, q = 1024, and q = 2048.
Figure 8. The delays in way selection in the hardware implementation of the PLRUt algorithm for q = 4, q = 8, q = 16, q = 32, q = 64, q = 128, q = 256, q = 512, q = 1024, and q = 2048.
Figure 9. The automata model of the hardware implementation of the PLRUm algorithm.
Figure 10. The directed graph of transitions for filling the additional bits Ai with a logical value of 1 at the cache hit event (H = 1) for q = 4.
Figure 11. The computer model of the hardware implementation of the PLRUm algorithm for q = 4.
Figure 12. A functional diagram of a JK flip-flop with static synchronization with a complexity of k = 4.
Figure 13. A graph of the failure-free operation of the hardware implementation of the PLRUm algorithm with ways of q = 4.
Figure 14. The computer model of the hardware implementation of the PLRUm algorithm for q = 8.
Figure 15. Graphs of the delay, complexity, and reliability of hardware implementations of the PLRUm algorithm for ways of q = 4, q = 8, q = 16, and q = 32: (a) delays; (b) complexity; (c) reliability.
Figure 16. Graphs of the comparison of the obtained delay, complexity, and reliability of the hardware implementations of the PLRUm and PLRUt algorithms for parameters of ways of q = 4, q = 8, q = 16, and q = 32: (a) delays; (b) complexity; (c) reliability.
Table 1. A comparative analysis of the research directions in the considered studies.

The Direction of Research in the Reference | Considered References
Study of replacement policies during the caching of instructions and data in the cache memory of the CPU | [6,7,8,9,11,12,13,14,15,16,20,24]
Study of replacement policies during the caching in the device based on FPGA | [4,5,10,21]
Research on the adaptive associativity of a reconfigurable cache scheme of the CPU | [17]
Study of replacement policies in content-oriented computer networks | [18]
Study of caching schemes for wireless networks for growing data traffic and preventing overload | [19,27]
Study of the efficiency of different methods of caching of key-value storage for Blockchain in FPGA | [22,25,26]
Study of a cache-based matrix technique using the HDFS, where the EC volume consists of a Reed–Solomon code with the parameters of (6, 3) | [23]
Table 2. The complexity of the hardware implementation of the PLRUt algorithm.

Num. of q | Components | Number (Logical Elements per Line)
q = 4 | Logical elements AND-OR-NOT | 4
q = 4 | Logical elements NAND | 4
q = 4 | Logical elements AND | 4
Total (q = 4) | | 12
q = 8 | Logical elements AND-OR-NOT | 6
q = 8 | Logical elements NAND | 6
q = 8 | Logical elements AND | 9
Total (q = 8) | | 21
q = 16 | Logical elements AND-OR-NOT | 8
q = 16 | Logical elements NAND | 8
q = 16 | Logical elements AND | 18
Total (q = 16) | | 34
q = 32 | Logical elements AND-OR-NOT | 10
q = 32 | Logical elements NAND | 10
q = 32 | Logical elements AND | 35
Total (q = 32) | | 55
Table 3. The complexity v measured in the logical elements or the gates for the way decoder q.

Num. of Ways, q | Lines, N | v 2AND | v 3AND | v 4AND | v | τdec
4 | 2 | 1 | 0 | 0 | 1 | τ
8 | 3 | 0 | 1 | 0 | 1 | τ
16 | 4 | 0 | 0 | 1 | 1 | τ
32 | 5 | 1 | 0 | 1 | 2 |
64 | 6 | 2 | 0 | 1 | 3 |
128 | 7 | 1 | 1 | 1 | 3 |
256 | 8 | 1 | 0 | 2 | 3 |
512 | 9 | 0 | 1 | 2 | 3 |
1024 | 10 | 1 | 1 | 2 | 4 |
2048 | 11 | 0 | 2 | 2 | 4 |
Table 4. The complexity of the hardware implementations of the PLRUt algorithm for q = 4, q = 8, q = 16, q = 32, q = 64, q = 128, q = 256, q = 512, q = 1024, and q = 2048.

Num. of Ways, q | Total Complexity in Gates per Line, CPLRUt | Complexity of Memory Elements, Cmem | Complexity of Decoder of Ways, Cdec | τdec | Synthesized and Predicted Additional Gate Complexity, Cadd_log | Cadd_log to CPLRUt (%)
4 | 12 | 8 | 4 | τ | 0 | 0.00
8 | 21 | 12 | 8 | τ | 1 | 4.76
16 | 34 | 16 | 16 | τ | 2 | 5.88
32 | 55 | 20 | 64 | | 3 | 5.46
64 | 220 | 24 | 192 | | 4 | 1.82
128 | 417 | 28 | 384 | | 5 | 1.20
256 | 806 | 32 | 768 | | 6 | 0.74
512 | 1579 | 36 | 1536 | | 7 | 0.05
1024 | 4144 | 40 | 4096 | | 8 | 0.19
2048 | 8245 | 44 | 8192 | | 9 | 0.11
Table 5. The indicators of delay, complexity, and reliability of hardware implementations of the PLRUt algorithm for the range of 4 ≤ q ≤ 2048.

Num. of Ways, q | Complexity in Gates per Line, CPLRUt | Reliability, P (t = 10,000) | Delay, τ
4 | 12 | 0.9981 |
8 | 21 | 0.9792 |
16 | 34 | 0.9665 |
32 | 55 | 0.9465 |
64 | 220 | 0.8025 |
128 | 417 | 0.6590 |
256 | 806 | 0.4467 |
512 | 1579 | 0.2062 |
1024 | 4144 | 0.0159 |
2048 | 8245 | 0.0003 |
Table 6. The gate complexity of the hardware implementation of the PLRUm algorithm with the parameter of ways of q = 4.

Num. of q | Symbol | Components | Number of Gates (V) per Line
q = 4 | Cmem | Logical elements NAND with 3 inputs | 8
q = 4 | | Logical elements AND with 4 inputs | 8
q = 4 | Cadd_log | Logical elements AND with 2 inputs | 1
q = 4 | Cdec | Logical elements AND with 2 inputs | 1
q = 4 | | Logical elements AND with 3 inputs | 1
q = 4 | | Logical elements AND with 4 inputs | 1
Ctotal (q = 4) | | | V = 20
Table 7. The gate complexity of the hardware version of the PLRUm algorithm with ways of q = 8 and q = 16.

Num. of q | Symbol | Components | Number of Gates (V) per Line
q = 8 | Cmem | Logical elements NAND with 3 inputs | 16
q = 8 | | Logical elements AND with 4 inputs | 16
q = 8 | Cdec | Logical elements AND with 2 inputs | 6
q = 8 | | Logical elements AND with 3 inputs | 2
q = 8 | | Logical elements AND with 4 inputs | 3
Ctotal (q = 8) | | | V = 43
q = 16 | Cmem | Logical elements NAND with 3 inputs | 32
q = 16 | | Logical elements AND with 4 inputs | 32
q = 16 | Cdec | Logical elements AND with 2 inputs | 8
q = 16 | | Logical elements AND with 3 inputs | 8
q = 16 | | Logical elements AND with 4 inputs | 11
Ctotal (q = 16) | | | V = 91
Table 8. The gate complexity of the hardware implementation of the PLRUm algorithm with the parameter of ways of q = 32.

Num. of q | Symbol | Components | Number of Gates (V) per Line
q = 32 | Cmem | Logical elements NAND with 3 inputs | 64
q = 32 | | Logical elements AND with 4 inputs | 64
q = 32 | Cdec | Logical elements AND with 2 inputs | 13
q = 32 | | Logical elements AND with 3 inputs | 18
q = 32 | | Logical elements AND with 4 inputs | 44
Ctotal (q = 32) | | | V = 203
Table 9. The results of the performed investigation into the replacement policy resources.

Policy | Ways | Single Cache Line | LUT | LUT (Table 3 of [10]) | FF | FF (Table 3 of [10])
PLRUm | 4 | 1 | 8 | 8 | 4 | 4
PLRUt | 4 | 1 | 6 | 6 | 2 | 3
PLRUm | 8 | 1 | 16 | 22 | 8 | 8
PLRUt | 8 | 1 | 11 | 13 | 3 | 7
PLRUm | 16 | 1 | 32 | | 16 |
PLRUt | 16 | 1 | 20 | | 4 |
PLRUm | 32 | 1 | 64 | | 32 |
PLRUt | 32 | 1 | 37 | | 5 |
Table 10. The results of the performed investigation into the replacement policy resources for 64 ≤ q ≤ 2048.

Policy | Ways | Single Cache Line | LUT | LUT (Table 3 of [10]) | FF | FF (Table 3 of [10])
PLRUt | 64 | 1 | 70 | | 6 |
PLRUt | 128 | 1 | 135 | | 7 |
PLRUt | 256 | 1 | 264 | | 8 |
PLRUt | 512 | 1 | 521 | | 9 |
PLRUt | 1024 | 1 | 1034 | | 10 |
PLRUt | 2048 | 1 | 2059 | | 11 |
Table 11. The values of the characteristics of gate complexity, reliability, and delay in hardware implementations of PLRUm and PLRUt algorithms with ways of q = 4, q = 8, q = 16, and q = 32.

Num. of Ways, q | Gate Complexity, CPLRUt | Gate Complexity, CPLRUm | Reliability, P (t = 10,000), PLRUt | Reliability, P (t = 10,000), PLRUm | Delay, τ, PLRUt | Delay, τ, PLRUm
4 | 12 | 20 | 0.9981 | 0.9802 | 4τ | 4τ
8 | 21 | 43 | 0.9792 | 0.9579 | 4τ | 5τ
16 | 34 | 91 | 0.9665 | 0.9130 | 4τ | 5τ
32 | 55 | 203 | 0.9465 | 0.8163 | 5τ | 6τ
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
