Shared memory multiprocessor system

586 papers

0 followers

About this topic

A shared memory multiprocessor system is a computer architecture where multiple processors access a common memory space, allowing them to communicate and coordinate their operations efficiently. This system enables concurrent execution of processes, facilitating parallel computing and improving performance for applications that require high computational power.
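As a minimal sketch of the idea described above, the following Python snippet has several workers update one counter that lives in a common address space, coordinating through a lock. The names `run_workers`, `num_threads`, and `increments` are illustrative assumptions, and Python threads only approximate true multiprocessor parallelism; the point is the shared-memory communication pattern, not performance.

```python
import threading


def run_workers(num_threads, increments):
    """Have several threads increment one shared counter and return its final value."""
    counter = {"value": 0}   # shared state visible to every thread
    lock = threading.Lock()  # coordinates access to the shared state

    def worker():
        for _ in range(increments):
            # The lock makes each read-modify-write step atomic.
            with lock:
                counter["value"] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]
```

With the lock in place, every increment is preserved, so the result equals `num_threads * increments`.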

Key research themes

1. How can persistent and low-latency shared memory be efficiently realized in distributed multiprocessor datacenter systems?

This theme investigates the integration of next-generation non-volatile memories (NVMs) into distributed shared memory (DSM) systems to provide a global persistent memory abstraction with low latency, reliability, and high availability in datacenter-scale multiprocessor environments. It matters because NVMs offer DRAM-like speeds combined with persistence and high density, which can significantly enhance large-scale application performance, persistence, and fault tolerance, but leveraging these benefits across distributed nodes requires novel system software and hardware designs.

Distributed shared persistent memory

by Yizhou Shan

2021

Key finding: Introduced Distributed Shared Persistent Memory (DSPM) framework and implemented Hotpot, a kernel-level system providing a global persistent shared memory space accessible via native load/store instructions in distributed... Read more


Realization Features of System Software of Multiprocessor Computing Systems

by Eugene Fedorov

2022, Advances in Computer and Electrical Engineering

Key finding: Demonstrated system software techniques leveraging RDMA over InfiniBand for efficient remote memory access in multiprocessor computing systems, enabling memory data transfers without OS or application intervention on target... Read more


Virtual memory and backing storage management in multiprocessor operating systems using object-oriented design techniques

by Roy Campbell

2022, ACM SIGPLAN Notices

Key finding: Developed the Choices OS architecture employing object-oriented design to implement a modular and extensible virtual memory and backing store system for shared memory and networked multiprocessors, enabling uniform and... Read more


Cache Coherence Protocols in Distributed Systems

by Subhi R M Zeebaree and

2020, Journal of Applied Science and Technology Trends (JASTT)

Key finding: Reviewed cache coherence protocols (SI, MI, MSI, MESI, MOSI, MOESI) in distributed multiprocessor environments highlighting their impact on maintaining data consistency, coherence overheads, and system performance, informing... Read more


Shared Memory Multiprocessor-A Brief Study

by Rashmi Dewan

2022

Key finding: Provided a comprehensive overview of shared memory multiprocessor architectures including Uniform Memory Access (UMA) design, cache coherence models, and system software layers, emphasizing architectural and software... Read more


2. What software and runtime techniques effectively manage scheduling, load balancing, and parallelism in shared-memory multiprocessor systems?

This theme focuses on dynamic scheduling approaches, task parallelism exploitation, and runtime mechanisms that optimize workload distribution and parallel execution on shared-memory multiprocessors. Efficient scheduling is essential to fully leverage the hardware parallelism, improve load balance, and increase application throughput in multiprocessor systems.

Lazy binary-splitting: a run-time adaptive work-stealing scheduler

by Uzi Vishkin

2021, ACM SIGPLAN …

Key finding: Presented Lazy Binary Splitting (LBS), an adaptive user-level scheduler that improves upon eager binary splitting by reducing the need for manual tuning of stop-splitting thresholds for nested parallel do-all loops, thereby... Read more


Implicit Transactional Memory in Kilo-Instruction Multiprocessors

by Per Stenström

2023, Lecture Notes in Computer Science

Key finding: Proposed implicit transactional memory implemented using a multi-checkpoint mechanism allowing speculative execution beyond synchronization points without explicit software identification, which reduces serialization and... Read more


A fast parallel matching algorithm for continuous interest management

by Georgios Theodoropoulos

2023, Proceedings of the 2010 Winter Simulation Conference

Key finding: Developed a parallel interest matching algorithm for distributed virtual environments executed on shared-memory multiprocessors that distributes the workload of space-time event matching across multiple processors,... Read more


A New Algorithm for VHDL Parallel Simulation

by Santiago Benites Rodriguez

2023, ACM Transactions on Design Automation of Electronic Systems

Key finding: Proposed a parallel synchronous simulation algorithm for VHDL that increases parallelism by analyzing signal dependencies and relaxing synchronization barriers, enabling efficient execution on shared-memory multiprocessors... Read more


PBB: A Parallel Bioinformatics Benchmark Suite for Shared Memory Multiprocessors

by Chuntao HONG

2022

Key finding: Created a diverse suite of seven parallel bioinformatics applications optimized with thread-level parallelism for shared memory multiprocessors, enabling evaluation of parallel programming techniques and workload... Read more


3. How can software-level memory management policies and algorithms mitigate memory contention and improve scalability in shared memory multiprocessors?

This theme studies operating system and hardware memory management strategies, including page allocation, memory bank partitioning, and transactional memory buffering, to reduce interference, contention, and coherence overhead in shared memory multiprocessors. These approaches aim to enhance throughput and energy efficiency by optimizing access to shared DRAM banks and maintaining cache coherence.

A Software Memory Partition Approach for Eliminating Bank-level Interference in Multicore Systems

by lei liu

2025

Key finding: Introduced Bank-level Partition Mechanism (BPM), a software-based page-coloring scheme implemented in the OS kernel that partitions DRAM banks across cores to eliminate bank-level memory interference, improving average system... Read more


The Impact of Non-coherent Buffers on Lazy Hardware Transactional Memory Systems

by Jose Lizaraga Garcia

2024, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum

Key finding: Analyzed how coherent buffering in private caches causes inefficiencies in lazy HTM by prematurely exposing speculative writes to coherence mechanisms; showed that employing non-coherent write buffers can mitigate overhead,... Read more


Exploiting thread-level parallelism in the iterative solution of sparse linear systems

by José Aliaga

2023, Parallel Computing

Key finding: Developed a parallel iterative solver for large sparse linear systems based on multilevel incomplete LU preconditioners leveraging nested dissection and task parallelism, employing dynamic scheduling for load balancing,... Read more


Implicit Transactional Memory in Kilo-Instruction Multiprocessors

by Per Stenström

2023, Lecture Notes in Computer Science

Key finding: Demonstrated a low-complexity implicit transactional memory design that leverages multi-checkpoint execution to support large speculative memory accesses with sequential consistency, reducing synchronization overhead and... Read more


The habanero multicore software research project

by Reman Barik

2024, Proceedings of the 24th ACM SIGPLAN conference companion on Object oriented programming systems languages and applications

Key finding: Proposed a two-level programming model combining a high-level coordination language (Concurrent Collections) and a lower-level parallel language (Habanero Java) to support flexible task distribution and mutual exclusion on... Read more


All papers in Shared memory multiprocessor system

Using proxies to reduce controller contention in large shared-memory multiprocessors

by Sarah Bennett

2025, Lecture Notes in Computer Science


Adaptive Proxies: Handling Widely-Shared Data in Shared-Memory Multiprocessors

by Sarah Bennett

2025, Lecture Notes in Computer Science


Lock-free garbage collection for multiprocessors

by Eliot Moss

2025, IEEE Transactions on Parallel and Distributed Systems


The Location Consistency memory model and cache protocol: Specification and verification

by José Amaral

2025

We use the Abstract State Machine methodology to give formal operational semantics for the Location Consistency memory model and cache protocol. With these formal models, we prove that the cache protocol satisfies the memory model, but in... more


The Cachemire Test Bench A Flexible And Effective Approach For Simulation Of Multiprocessors

by Per Stenström

2025, [1993] Proceedings 26th Annual Simulation Symposium

The approach of program-driven simulation of multiprocessors has generally been believed to be too slow in order to perform experiments and performance evaluations with realistic workloads. We show that the program-driven approach for... more


The DiSOM distributed shared object memory

by Paulo Guedes

2025, Proceedings of the 6th workshop on ACM SIGOPS European workshop Matching operating systems to application needs - EW 6


Adaptive flow control in time warp

by Richard Fujimoto

2025, Simuletter

It is well known that Time Warp may suffer from poor performance due to excessive rollbacks caused by overly optimistic execution. Here we present a simple flow control mechanism using only local information and GVT that limits the number of uncommitted messages generated by a processor, thus throttling overly optimistic TW execution. The flow control scheme is analogous to traditional networking flow control mechanisms. A "window" of messages defines the maximum number of uncommitted messages allowed to be scheduled by a process. Committing messages is analogous to acknowledgments in networking flow control. The initial size of the window is calculated using a simple analytical model that estimates the instantaneous number of messages that a process will eventually commit. This window is expanded so that the process may progress up to the next commit point (generally the next fossil collection), and to accommodate optimistic execution. The expansions to the window are based on monitoring TW performance statistics so the window size automatically adapts to changing program behaviors. The flow control technique presented here is simple and fully automatic. No global knowledge or synchronization (other than GVT) is required. We also develop an implementation of the flow control scheme for shared memory multiprocessors that uses dynamically sized pools of free message buffers. Experimental data indicates that the adaptive flow control scheme maintains high performance for "balanced workloads", and achieves as much as a factor of 7 speedup over unthrottled TW for certain irregular workloads.

1 Introduction

Time Warp is a well known parallel discrete synchronization protocol that detects out-of-order executions of events as they occur, and recovers using a rollback mechanism [11]. It is well known that Time Warp may suffer from long rollbacks due to overly optimistic execution. Depending on the cost and frequency of rollback, the rollback overheads may dominate processing time. In addition, logical processes (LPs) further ahead in virtual time consume memory, which can be better utilized by LPs closer to GVT. In such cases it is better to block the optimistic LPs and prevent long rollbacks rather than spend resources in undoing the wrong computation after the fact. Numerous variations of Time Warp have been proposed that attempt to reduce the amount of rolled back computation that may occur. Surveys of methods in this regard are described in [7, 18]. Broadly, there are two classes of optimism control schemes: non-adaptive and adaptive. System parameters, e.g. window sizes, remain static in non-adaptive schemes, whereas they are dynamic in the adaptive schemes.
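The window mechanism described in this abstract can be sketched roughly as follows. This is an illustrative Python sketch, not the authors' implementation: the class name `FlowControlWindow` and its methods are hypothetical, and the analytical model for the initial window size and the statistics-driven expansion policy are left to the caller, since the abstract does not spell them out.

```python
class FlowControlWindow:
    """Caps how many uncommitted messages one process may have outstanding."""

    def __init__(self, initial_size):
        self.size = initial_size  # maximum uncommitted messages allowed
        self.uncommitted = 0      # messages sent but not yet committed

    def try_send(self):
        """Return True and count the message if the window has room, else False."""
        if self.uncommitted >= self.size:
            return False  # window is full: the process must block
        self.uncommitted += 1
        return True

    def commit(self, count):
        """Release slots for messages committed by a GVT advance (like ACKs)."""
        self.uncommitted -= min(count, self.uncommitted)

    def resize(self, new_size):
        """Adapt the window; the policy choosing new_size is an assumption here."""
        self.size = max(1, new_size)
```

A process would call `try_send` before scheduling each message and block while it returns `False`; each GVT advance calls `commit` with the number of newly committed messages, which reopens the window in the same way acknowledgments do in network flow control.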


Buffer management in shared-memory Time Warp systems

by Richard Fujimoto

2025, Workshop on Parallel and Distributed Simulation

Mechanisms for managing message buffers in Time Warp parallel simulations executing on cache-coherent shared-memory multiprocessors are studied. Two simple buffer management strategies called the sender pool and receiver pool mechanisms... more


An abstract model for parallel execution of prolog

by Pedro Patinho

2025

Logic programming has been used in a broad range of fields, from artificial intelligence applications to general purpose applications, with great success. Through its declarative semantics, by making use of logical conjunctions and... more


A parallel depth first search branch and bound algorithm for the quadratic assignment problem

by Catherine Roucairol

2025, European Journal of Operational Research

We propose a new parallel Branch and Bound algorithm for the Quadratic Assignment Problem, which is a Combinatorial Optimization problem known to be very hard to solve exactly. An original method to distribute work to processors using the... more


Reducing Network Traffic of Token Protocol Using Sharing Relation Cache

by Jinglei Wang

2025, Tsinghua Science & Technology

Token protocol provides a new coherence framework for shared-memory multiprocessor systems. It avoids indirections of directory protocols for common cache-to-cache transfer misses, and achieves higher interconnect bandwidth and lower... more


Volume visualization on shared memory architectures

by Anton Koning

2025, Parallel Computing

Direct volume rendering algorithms are too computationally expensive to offer interactive frame rates when rendering large 3D medical datasets on standard workstations. This article presents an image space parallelization of an image... more


A Shared-Memory Multiprocessor Scheduling Algorithm

by Mauricio Solar

2025, IFIP International Federation for Information Processing

This paper presents an extension of the Latency Time (LT) scheduling algorithm for assigning tasks with arbitrary execution times on a multiprocessor with shared memory. The Extended Latency Time (ELT) algorithm adds to the priority... more


Buffer management in shared-memory Time Warp systems

by Richard Fujimoto

2025, Simuletter


Adaptive flow control in time warp

by Richard Fujimoto

2025


Improving Multiprocessor Average-Case Schedulability using A Modified Global Dual Priority Algorithm

by alex arenas

2025

In this paper we present a modification of the Dual Priority Scheduling Algorithm to work on shared memory multiprocessor systems improving the average-case schedulability. The proposal deals with global fixed-priority preemptive... more


The FLASH Multiprocessor: Designing a Flexible and Scalable System

by John Hennessy

2025

The choice of a communication paradigm, or protocol, is central to the design of a large-scale multiprocessor system. Unlike traditional multiprocessors, the FLASH machine uses a programmable node controller, called MAGIC, to implement all protocol processing. The architecture of the MAGIC chip allows FLASH to support multiple communication paradigms (in particular, cache-coherent shared memory and high-performance message passing) while minimizing both hardware and software overhead. Each node in FLASH contains a microprocessor, a portion of the machine's global memory, a port to the interconnection network, an I/O interface, and MAGIC, the custom node controller. The MAGIC chip handles all communication both within the node and among nodes, using hardwired data paths for efficient data movement and a programmable processor optimized for executing protocol operations. The result is a system that is flexible and scalable, yet competitive in performance with a traditional multiprocessor that implements a single communication paradigm completely in hardware. The application results are used to evaluate the performance costs of flexibility by comparing the performance of FLASH to that of a hardwired machine on representative parallel applications and multiprogramming workloads. These results show that poor application memory reference or load balancing characteristics cause the performance of the FLASH system to degrade more rapidly than the performance of the hardwired system; that is, FLASH's performance is less robust. For applications that incur a large number of remote misses or exhibit substantial hot-spotting, the increased remote access latencies or the occupancy of MAGIC lead to lower performance for the flexible design. Overall, however, the performance of FLASH can be competitive with the performance of the hardwired machine.
Specifically, for a range of optimized parallel applications, the performance differences between the hardwired machine and FLASH are small, typically less than 10% at 32 processors and less than 15% at 64 processors. For these programs, either the processor cache miss rates are small or the latency of the programmable protocol processing can be hidden behind the memory access time.


The Impact of Non-coherent Buffers on Lazy Hardware Transactional Memory Systems

by Jose Lizaraga Garcia

2024, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum

When supported in silicon, transactional memory (TM) promises to become a fast, simple and scalable parallel programming paradigm for future shared memory multiprocessor systems. Among the multitude of hardware TM design points and... more


An Effective Approach of Data Security for Distributed Shared Memory Multiprocessors

by Shafakhatullah Khan Mohammed

2024, International Journal of Computer Science and Information Technologies

The Concept of Distributed System made life easier to communicate and share resources from any other system with the help of network. Due to the emergence of Distributed system, Data Security has become an increasing concern, and... more


The habanero multicore software research project

by Reman Barik

2024, Proceedings of the 24th ACM SIGPLAN conference companion on Object oriented programming systems languages and applications

Multiple programming models are emerging to address an increased need for dynamic task parallelism in multicore shared-memory multiprocessors. This poster describes the main components of Rice University's Habanero Multicore Software... more


Tuning compiler optimizations for simultaneous multithreading

by Dean Tullsen

2024, International Symposium on Microarchitecture


How to simulate 1000 cores

by Daniel Ortega

2024, ACM SIGARCH Computer Architecture News

This paper proposes a novel methodology to efficiently simulate shared-memory multiprocessors composed of hundreds of cores. The basic idea is to use thread-level parallelism in the software system and translate it into core-level... more


FastFlow: Efficient Parallel Streaming Applications on Multi-core

by Marco Aldinucci

2024, arXiv (Cornell University)

Shared memory multiprocessors come back to popularity thanks to rapid spreading of commodity multi-core architectures. As ever, shared memory programs are fairly easy to write and quite hard to optimise; providing multi-core programmers... more


Scalability Evaluation of Barrier Algorithms for OpenMP

by Oscar Hernandez

2024, Lecture Notes in Computer Science

OpenMP relies heavily on barrier synchronization to coordinate the work of threads that are performing the computations in a parallel region. A good implementation of barriers is thus an important part of any implementation of this API.... more


The Cachemire Test Bench A Flexible And Effective Approach For Simulation Of Multiprocessors

by Mats Brorsson

2024, [1993] Proceedings 26th Annual Simulation Symposium


Performance debugging shared memory multiprocessor programs with MTOOL

by John Hennessy

2024, Proceedings of the 1991 ACM/IEEE conference on Supercomputing - Supercomputing '91

This paper describes MTOOL, a software tool for analyzing performance losses in shared memory parallel programs. MTOOL augments a program with low overhead instrumentation which perturbs the program's execution as little as possible while... more


Edge Detection by Maximum Entropy: Application to Omnidirectional and Perspective Images

by El Mustapha Mouaddib

2024, HAL (Le Centre pour la Communication Scientifique Directe)

In the edge detection, the classical operators based on the derivation are sensitive to noise which causes detection errors. It is even more erroneous in the case of omnidirectional images, due to geometric distortions caused by the used... more


A comment on "A circular list-based mutual exclusion scheme for large shared-memory multiprocessors"

by Ting-Lu Huang

2024, IEEE Transactions on Parallel and Distributed Systems


The impact of shared-cache clustering in small-scale shared-memory multiprocessors

by jaswinder singh

2024

As processor performance continues to increase, greater demands are placed on the bus and memory systems of small-scale shared-memory multiprocessors. In this paper, we investigate how to reduce these demands by organizing groups of... more


A novel approach to reduce L2 miss latency in shared-memory multiprocessors

by Jose María García

2024, Proceedings 16th International Parallel and Distributed Processing Symposium

Recent technology improvements allow multiprocessor designers to put some key components inside the processor chip, such as the memory controller, the coherence hardware and the network interface/router. In this work we exploit such... more


Comparative evaluation of latency reducing and tolerating techniques

by John Hennessy

2024, ACM SIGARCH Computer Architecture News

Techniques that can cope with the large latency of memory accesses are essential for achieving high processor utilization in large-scale better performance than each one on its own. Overall, we show that using suitable combinations of the... more


An evaluation of directory schemes for cache coherence

by John Hennessy

2024, ACM SIGARCH Computer Architecture News

The problem of cache coherence in shared-memory multiprocessors has been addressed using two basic approaches: directory schemes and snoopy cache schemes. Directory schemes have been given less attention in the past several years, while... more


Characterizing the caching and synchronization performance of a multiprocessor operating system

by John Hennessy

2024, ACM SIGPLAN Notices

Good cache memory performance is essential to achieving high CPU utilization in shared-memory multiprocessors. While the performance of caches is determined by both application and operating system (OS) references, most research has... more


Performance evaluation of memory consistency models for shared-memory multiprocessors

by John Hennessy

2024, ACM SIGPLAN Notices

The memory consistency model supported by a multiprocessor architecture determines the amount of buffering and pipelining that may be used to hide or reduce the latency of memory accesses. Several different consistency models have been... more


A quantitative analysis of the performance and scalability of distributed shared memory cache coherence protocols

by John Hennessy

2024, IEEE Transactions on Computers


Efficient Parallel Algorithms for the Minimum Cost Flow Problem

by PATRIZIA BERALDI

2024, Journal of Optimization Theory and Applications

In this paper, we propose efficient parallel implementations of the auction/sequential shortest path and the ε-relaxation algorithms for solving the linear minimum cost flow problem. In the parallel auction algorithm, several augmenting... more


Scheduling FFT computation on SMP and multicore systems

by Ayaz Ali

2024, Proceedings of the 21st annual international conference on Supercomputing

Increased complexity of memory systems to ameliorate the gap between the speed of processors and memory has made it increasingly harder for compilers to optimize an arbitrary code within a palatable amount of time. With the emergence of... more


Determinacy and Concurrency Issues in Process Engineering

by Murat Tanik

2024, Journal of Systems Integration

The goal of creating high-quality process systems for real-world applications leads to the need for an engineering approach to process system development. The development of process engineering as a distinct discipline can be greatly... more


An efficient cache design for scalable glueless shared-memory multiprocessors

by Jose Lizaraga Garcia

2024, Proceedings of the 3rd conference on Computing frontiers - CF '06

Traditionally, cache coherence in large-scale shared-memory multiprocessors has been ensured by means of a distributed directory structure stored in main memory. In this way, the access to main memory to recover the sharing status of the... more


Two proposals for the inclusion of directory information in the last-level private caches of glueless shared-memory multiprocessors

by Jose Lizaraga Garcia

2024, Journal of Parallel and Distributed Computing

In glueless shared-memory multiprocessors where cache coherence is usually maintained using a directory-based protocol, the fast access to the on-chip components (caches and network router, among others) contrasts with the much slower... more


The Impacts of Timing Constraints on Virtual Channels Multiplexing in Interconnect Networks*

by Hamid Sarbazi-Azad

2024, 2006 IEEE International Performance Computing and Communications Conference

Interconnect networks employing wormhole-switching play a critical role in shared memory multiprocessor systems-on-chip (MPSoC) designs, Multicomputer systems and System Area Networks. Virtual channels greatly improve the performance of... more


Analysis of task migration in shared-memory multiprocessor scheduling

by Randolph Nelson

2024, Performance evaluation review

In shared-memory multiprocessor systems it may be more efficient to schedule a task on one processor than on another. Due to the inevitability of idle processors in these environments, there exists an important tradeoff between keeping the... more

Figure 1: General Structure of Task Migration Model

Our analysis of the task migration model consists of solving the model of a single processor such that the arrival rate has been modified to reflect the migration of tasks from other processors and the departure rate has been modified to reflect the removal of tasks by other processors. The general structure of our decomposed queueing model is depicted in Figure 2(a). This model is represented by the Markov process (I(t), J(t)), where I(t) and J(t) are as defined above for processor k. In Figure 2(b) we illustrate the state transition diagram for this process and an arbitrary threshold T. The upper chain represents the states where the processor is either migrating a task or executing a migrated task, while the lower chain represents the states where the processor is either idle or executing a task that arrived locally. Due to our assumption that A;, >> A (see Section 2), the likelihood of a local task arrival while an (idle) processor probes and migrates a non-local task is quite small, and therefore is not reflected in our model. We use Piz, to denote the probability that an idle processor finds a queue containing more than T tasks, including the task in service, within L, unique random probes; with probability 1—p;, the processor remains idle until a task arrives locally. When the processor is over threshold, i.e., its queue length is greater than T, other processors that are idle remove tasks from its queue with rate H,,;.
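The probing step of the threshold policy described above can be illustrated with a short Python sketch. It is not taken from the paper: the function `probe_for_task` and its parameters are hypothetical names, and it models only the decision an idle processor makes, namely sampling a limited number of distinct remote queues and migrating a task from the first one whose length (including the task in service) exceeds the threshold T.

```python
import random


def probe_for_task(queue_lengths, idle, threshold, max_probes, rng=random):
    """Return the index of a processor to migrate a task from, or None.

    queue_lengths[i] counts the tasks at processor i, including the one in service.
    The idle processor makes at most max_probes unique random probes.
    """
    candidates = [i for i in range(len(queue_lengths)) if i != idle]
    probes = min(max_probes, len(candidates))
    # sample() draws without replacement, so every probe targets a distinct queue.
    for victim in rng.sample(candidates, probes):
        if queue_lengths[victim] > threshold:
            return victim  # found a queue over threshold: migrate from it
    return None  # no overloaded queue found: stay idle until a local arrival
```

Raising `threshold` makes a successful probe less likely, which matches the observation in the excerpts that migration eventually ceases as T grows.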

where B_00, B_01 and B_10 are finite matrices of dimensions 2T × 2T, 2T × 2 and 2 × 2T, respectively, whose elements depend upon the value of T. The remaining matrices have dimension 2 × 2 and are given by: Given this form for the generator matrix of the Markov process, the components of the steady state probability vector y can be obtained exactly via matrix-geometric techniques [20]. In particular, the geometric portion of the probability vector, representing when the processor is above threshold, can be solved as

identical performance. Similarly, the other solid curves represent the points where the above ratio is as labeled.

We note that the unity contour where the threshold and non-migratory policies provide identical performance has moved up, and continues to do so for increasing T. By searching for an overloaded processor, the threshold policy can make better migration decisions and thus improve the sharing of work among the processors. This yields overall performance that is as good as or better than the non-migratory policy even with a significantly larger service demand for migrated tasks. As T continues to rise, task migration eventually ceases to occur because an idle processor is never able to find a queue over threshold, and both policies exhibit the same behavior. This is also the reason for the separation between the unity contour and the y-axis in Figure 4; i.e., the contour is defined only for values of λ > 0.22 because the migratory policy effectively never migrates a waiting task when λ < 0.22 and T = 8.

under the migration policy is exhibited by the two-phase behavior of its response time curve, as indicated by the dashed line. In the first phase when load is relatively light, an idle processor migrates a task waiting at another processor and proceeds to execute the task, which requires 8 units of time on average. In the meantime, tasks arrive at this processor's queue but since load is fairly light, these tasks are migrated by other processors. When the processor of interest finishes servicing the migrated task, there is a good chance that its queue is empty, in which case the processor will migrate a task from another processor. This cycle continues, and increases in intensity with system load, until the load is high enough to yield a non-empty queue after completion of the migrated task. We note that the dashed line indicates the value of λ for which p,,, (the probability that a processor is executing a migrated task) is greatest. As illustrated in Figure 6, this two-phase behavior also exists in migration policies with thresholds greater than one; the location of instability moves to higher system loads with increasing T. The key point is that the processing of tasks is dominated by the execution of migrated tasks toward the end of the first phase, while this factor continually decreases through the second phase. This form of processor thrashing where processors are spending most of their time executing migrated tasks clearly must be avoided.

Figure 5: Mean Task Response Times for the Non-Migratory Policy and the Threshold Scheduling Policy with a Threshold of 1 (T=1)

processor. After measuring the execution time of this task (which we call the local task), the experiment was repeated with the addition of an identical task running concurrently on a second processor (which we call a remote task), where both independent versions of the code and data are stored in the memory module associated with the first processor, and the average execution times of both tasks were obtained. Similar measures were obtained for k, 2 < k < 64, identical remote tasks running concurrently on distinct processors together with the local task, yielding the desired performance degradation factors. The reduction in the computing power of the processor running the local task, as a function of k, is given by

Figure 8: Degradation in the Computing Capacity of a Processor Due to Contention for a Remote Memory Module on a 64-Way RP3 System

The functions f_a-local(k) and f_a-remote(k) are substituted in equations (17) and (18), respectively, and our task migration model is solved as described in Section 3.3. In Figures 9, 10 and 11 we plot the corresponding response time contours for threshold values of 1, 8 and 16. Our results clearly show the significant effects that these degradation factors can have on system performance. In particular, the region of benefit has been reduced considerably under the greedy migration policy, i.e., T = 1, and for most system loads the policy performs consistently worse than a non-migratory policy (see Figure 9). The greedy policy suffers from poor migration decisions, as we have previously shown, and the effects of these poor decisions are compounded by the resulting increase in contention for system resources. This compounded degradation in performance is most significant at moderate to fairly heavy loads, where a sufficient fraction of the processors become idle while a sufficient fraction of the processors are over threshold.

It is important to note, however, that the unity contour moves up considerably for larger policy thresholds. By searching for an overloaded processor, the threshold policy can make better migration decisions and thus improve the sharing of work among the processors. This yields overall performance that is as good as or better than the non-migratory policy even with a significantly larger service demand for migrated tasks. The improvement in system performance continues to be most significant toward heavier loads, since there are sufficient numbers of waiting tasks and the majority of processors are busy executing tasks that arrived locally. The key point is that, even when task migration costs are increased due to significant contention for system resources, performance benefits may be gained in shared-memory multiprocessor systems by migrating a waiting task to an idle processor, provided proper policy thresholds are employed (at least for the class of system environments considered in this paper).

Figure 11: Task Migration Response Time Contours for a Threshold of 16 (T=16), with the Addition of RP3 Measurements


Design of high performance RTI software

by Richard Fujimoto

2024

This paper describes the implementation of RTI-Kit, a modular software package to realize runtime infrastructure (RTI) software for distributed simulations such as those for the High Level Architecture. RTI-Kit software spans a wide...


Interprocedural analysis for loop scheduling and data allocation

by trung nhân nguyễn

2024, Parallel Computing

In order to reduce remote memory accesses on CC-NUMA multiprocessors, we present an interprocedural analysis to support static loop scheduling and data allocation. Given a parallelized program, the compiler constructs graphs which...


Scalability Port: A Coherent Interface for Shared Memory Multiprocessors

by Fayé Briggs

2024, IEEE Symposium on High Performance Interconnects

The scalability port (SP) is a point-to-point cache consistent interface to build scalable shared memory multiprocessors. The SP interface consists of three layers of abstraction: the physical layer, the link layer and the protocol layer....


Comparative Study of Reconfigurable Cache Memory

by Ibrahim A. Amory

2024, Cihan University-Erbil scientific journal

Reconfigurable cache memory is important for improving cache performance and reducing energy consumption. In this paper, a review of previous papers related to reconfigurable cache memory is presented and compared with our...

Fig. 1. 4-way Set Associative Cache using Selective Cache ways.

Fig. 1 shows the 4-way set-associative cache using selective cache ways. The design requires a combination of hardware and software elements: a partitioning of the data and tag arrays into one or more subarrays for each cache way; gating hardware and decision logic for disabling the operation of particular ways; and a cache way select register, a software-visible register that signals the hardware to enable or disable particular ways.
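The role of the way select register can be illustrated with a toy software model. The sketch below is an assumption-laden illustration rather than the reviewed hardware design: the class name, the bit-mask encoding of the register, the round-robin replacement among enabled ways, and the invalidation of a way when it is disabled are all choices made for this example.

```python
class SelectiveWayCache:
    """Toy set-associative cache whose ways can be enabled or disabled."""

    def __init__(self, num_sets: int = 4, num_ways: int = 4) -> None:
        self.num_sets = num_sets
        self.num_ways = num_ways
        # Hypothetical software-visible register: bit w set means way w is enabled.
        self.way_select = (1 << num_ways) - 1
        self.tags = [[None] * num_ways for _ in range(num_sets)]
        self.fill_count = [0] * num_sets  # drives round-robin victim choice

    def enabled_ways(self) -> list:
        return [w for w in range(self.num_ways) if (self.way_select >> w) & 1]

    def set_way_select(self, mask: int) -> None:
        mask &= (1 << self.num_ways) - 1
        if mask == 0:
            raise ValueError("at least one way must stay enabled")
        self.way_select = mask
        # Simplification: a disabled way loses its contents.
        for row in self.tags:
            for w in range(self.num_ways):
                if not (mask >> w) & 1:
                    row[w] = None

    def access(self, block_addr: int) -> bool:
        """Return True on a hit, False on a miss (the block is then filled)."""
        index = block_addr % self.num_sets
        tag = block_addr // self.num_sets
        ways = self.enabled_ways()
        # Only enabled ways are probed, which is where the energy saving comes from.
        if any(self.tags[index][w] == tag for w in ways):
            return True
        victim = ways[self.fill_count[index] % len(ways)]
        self.tags[index][victim] = tag
        self.fill_count[index] += 1
        return False
```

Calling `set_way_select(0b0011)` on a four-way instance leaves two ways active, so each access probes half as many tag and data subarrays at the cost of a smaller effective capacity.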

Fig. 2. Miss rate and normalized energy for data cache of different associativity.

Their results show that a way-concatenable cache yields an average energy savings of 37% compared to a conventional four-way cache, with savings over 60% for several examples. Compared to a conventional direct-mapped cache, the average savings are more modest, but the direct-mapped cache suffers large penalties for some examples, up to 284% for parser, with degraded performance in several examples.

A Special Issue for 2nd International Conference of Cihan University-Erbil on Communication Engineering & Computer Sciences (CIC-COCOS’17), March 29-30, 2017

For some other benchmarks, the set-only and set-and-way approaches do not do well compared to the way-concatenation and Smart caches. The reason is that these benchmarks require a 2MB cache with two-way associativity, which is offered only by the way-concatenation and Smart caches. For these architectures, dynamic energy is reduced by accessing fewer ways, which is not possible in the set-only and set-and-way caches.

Fig. 8 shows a 256KB cache configured as two-way set associative. The CPU now uses the first 2 sets (Set1 and Set2), shown in blue, as Way0 with a size of 128KB, and Set3 and Set4, shown in green, as Way1. The other 4 sets are shown in red. [...] size selected with direct-mapped organization. The CPU now uses the first 4

Fig. 8. Cache Sets with 2 Way and size 256KB.


Comparative Study of Reconfigurable Cache Memory

by Ibrahim A. Amory

2024, 2nd International Conference of Cihan University-Erbil on Communication Engineering and Computer Science


Coherence controller architectures for scalable shared-memory multiprocessors

by Dr. Ashwini Nanda

2024, IEEE Transactions on Computers

Scalable distributed shared-memory architectures rely on coherence controllers on each processing node to synthesize cache-coherent shared memory across the entire machine. The coherence controllers execute coherence protocol handlers...


Design and performance of directory caches for scalable shared memory multiprocessors

by Dr. Ashwini Nanda

2024, Proceedings Fifth International Symposium on High-Performance Computer Architecture

Recent research shows that the occupancy of the coherence controllers is a major performance bottleneck for distributed cache coherent shared memory multiprocessors. A significant part of the occupancy is due to the latency of accessing...


SMARTS: Exploiting Temporal Locality through Vertical Execution

by Steve Karmesin

2024

In the solution of large-scale numerical problems, parallel computing is becoming simultaneously more important and more difficult. The complex organization of today’s multiprocessors with several memory hierarchies has forced the...
