Advanced Operating Systems

This course is taught by Dr. Chris Gniady.

The final project presentation can be found here.

The final report of the project can be found here.

The final project github repository is given here.

The following papers have been discussed thoroughly throughout the course. Before discussion of each paper, we summarized the papers and wrote 3 technical main points and 1 weakness of the proposals in each paper.

ECOSystem: Managing Energy as a First Class Operating System Resource

Currentcy Model:
This research mainly contributes to an energy accounting framework that is based on a Currentcy Model. The Currentcy is a combination of the two words- “current” and “currency”. Currentcy model unifies the management of resources for various system components and allows definite management of energy.

2. ECOSystem Implementation:
ECOSystem is an OS prototype that has been implemented for demonstrating the unified currentcy model. The framework manages energy as a first-class resource. The experiments also show that the system describes an asynchronous device operation.

3. Validation of ECOSystem:
A sequence of experiments has been conducted in the research paper to validate the energy accounting and allocation of ECOSystem.
An experiment is conducted to evaluate the energy accounting advantages of the currentcy model. Three synthetic benchmarks have been used to exercise the network, disk, and CPU.
To check the design objective of the ECOSystem an experiment is done to obtain the target battery lifetime. For this, a microbenchmark of CPU has been used, also an accounting error for the CPU power consumption has been added.
Another experiment is to validate the proportional share of limited currentcy allocation through the synchronous running of ijpeg and Netscape. It is observed that the results of the experiments match the expectations.

The weakness I found in the methodology of the paper is given below:
The occurrence of the data retransmission in the MAC layer is a result of the data being corrupted. This data retransmission affects the accuracy of the energy accounting by consuming additional energy which is not visible to the OS.

Instruction-Based Prediction Techniques in Operating Systems

1. PCAP:
This research paper presented a mechanism for accurately predicting idle periods in I/O operations accurately. The prediction method is termed as Program Counter based Access Predictor (PCAP). It is based on the idea that a significant correlation exists between sequence of I/O operations in an application and its subsequent idle period.

2. Implementation of PCAP:
In PCAP implementation, a path is encoded that adds the Program Counters arithmetically for prediction of cache accesses. From the encoded path results, a 4 byte variable called signature is obtained.The storage requirements of PCAP is minimized by the encoder and establishes the possibility of two different paths in the same signature.

3. Global Prediction:
Pcap implementation has been done on a per application basis. But, in real systems, multiple processes execute at the same time. Global Shutdown Predictor is a prediction system that generates a global shutdown prediction from input of all processes. Each process has its own private PCAP, which generates local prediction. Global Shutdown Predictor predicts shutdown only when PCAP predicts shutdown for every process.

Weakness:
Misprediction of the global predictor is a weakness. The global predictor mispredicted shutdown even if only one local predictor predicted shutdown incorrectly while other local predictors successfully predicted shutdown.

Interaction‐Aware Memory Energy Management

1. IAMEM:
The authors have proposed a memory management process during execution named Interaction-Aware Memory Energy Management (IAMEM). It measures the user activity and predicts the power state transitions for maximal energy savings.

2. Memory Power State Management by IAMEM:
There are many states for a modern Synchronous Dynamic RAM (SDRAM). The paper has discussed 6 such states and the power specification of each of the states in a transition figure. The I/O seems to consume the highest energy whereas SELF_REF state consumes the least energy among all states. So if IAMEM can keep most of the memory time in that state by predicting user activity correctly, then vast energy will be resolved consumed by memory.

3. Global Prediction:
Specific weight (alpha) has been used to calculate power impact on the system. Each time UI event occurs, IAMEM captures the interaction ID and tries to predict the future task the user might perform. Then according to that, IAMEM tries to change the transition aggressiveness of the memory to lower power usage states.

Weakness:
The most power-hungry components of a computer are usually Display, Hard Disk, Network Interface, and CPU. So saving energy by optimizing memory should not produce very high battery efficiency in overall.

Energy Efficient Prefetching and Caching

Aim of the research:
The goal of the research is to create bursty access patterns for devices that are in non-operational low-power states, increasing the average length of idle periods and utilization maximization during active devices. The work is mainly focused on hard disks.

Design of the Energy-Aware Prefetching:
Energy-Aware Prefetching algorithm design depends on the decision of when to fetch a block, which block is to be fetched, and which block is to be replaced. The prefetching algorithm fetches data from the disk-based on the two Rules of Maximize Data Utilization and Respect Idle Time.

Evaluation of Energy-Conscious Prefetching algorithm:
A comparison with an energy-conscious prefetching algorithm (bursty) is done based on some metrics. The first metric is the Length of idle periods. An increase in length of idle improves the power management policy. The second metric is Energy Savings. Energy savings for bursty and Linux is compared for different memory sizes. The third metric is Slowdown. The challenge for Bursty system is performance penalties minimization caused by increased disk congestion and disk spin-up operations.

Weakness:
This process of increasing idle time for I/O highly depends on the amount of memory available for prefetching. So this approach might not be available for an embedded intelligent operating system with limited memory.

An Efficient Low Inter-Reference Recency Set Replacement Policy to Improve Buffer Cache Performance

LIRS:
Low Inter-reference Recency Set (LIRS) is an efficient buffer cache replacement policy that is proposed in this research paper. The objective is to address the limits of LRU effectively for a general purpose, maintaining the low overhead merit of LRU, and outperforming the replacement policies depending on the access regularity detections.

LIRS algorithm:
The blocks in the cache are divided into two sets which are: High Inter-reference Recency (HIR) block set and Low Inter-reference Recency (LIR) block set. The status might either be LIR or HIR. The key to implementing LIRS successfully depends on the ability to dynamically and responsively maintain the LIR block set and HIR block set. LIRS algorithm ensures that the LIR block set is less than the allocated cache size and keeps the set in the cache. Each block having cache history information has a status.

Overhead Analysis:
In a comparison of the time and space overhead of LIRS and LRU, it is observed in the research that, LIRS keeps LRU merit of low overhead. The time overhead of LIRS is almost the same as LRU in addition to operations as the list for resident HIR blocks.

Weakness:
Using Rmax as the threshold, if it is set to high, then the replacement algorithm converts into an LRU cache replacement algorithm. And with a small threshold value, it increases the stability of LIR blocks, making the algorithm more immune to long-term pattern use cases. Although the study of varying Rmax values is shown, there was no algorithm proposed for finding the perfect Rmax value. So I assume this value will depend upon the use case and set by a user which seems to be infeasible.

Unified Buffer Management (UBM)

UBM, a solution to LRU problem:
The research presents a Unified Buffer Management (UBM) scheme to solve the problem of the LRU replacement scheme that does not make use of regularities present in the reference patterns of applications resulting in degraded performance. The UBM scheme detects sequential and looping references automatically and stores the detected blocks in separate buffer cache partitions. These partitions are managed by appropriate replacement strategies based on the features of their detected patterns.

Main Modules of UBM:
Unified Buffer Management (UBM) scheme consists of three primary modules which are: Detection, Replacement, and Allocation. The detection module automatically detects sequential and looping references. After detection, the block references are classified as sequential, looping, or other references. The replacement module applies different replacement methods to the blocks which belong to three reference patterns according to the features of each pattern. The allocation module allocates limited buffer cache space among three partitions which correspond to sequential, looping, or other references.

Description of block references:
The UBM scheme detects sequential, looping, and other references according to some instructions. Sequential references are consecutive block references occurring only once. Looping references are sequential references occurring in repetition with specific intervals. Looping references are initially detected as sequential references until they are re-referenced. Other references are not recognized as sequential or looping references.

Weakness:
The performance evaluation of UBM was not done in Linux operating system. It was performed only in FreeBSD. Linux is more popular for implementing cache replacement algorithms.

Speculative Execution in a Distributed File System

In-kernel Disk Prefetcher:
The paper presents an in-kernel design that takes advantage of existing operating system features for its easier implementation. It may be beneficial to arbitrary unmodified application binaries. The design automates disk prefetching for virtual memory accesses and explicit I/O calls as well, allowing it to benefit the I/O access methods used by applications.

Basic design of in-kernel speculative execution:
The speculative process is created by forking a normal process when it blocks a disk request for the first time. The speculative process is completely destroyed only when its parent process exits. In order to ensure the safety of the speculative execution, to allow the speculative processes to issue prefetches instead of their parent processes, and to limit the resource utilization of the speculative process to prevent them from harming performance, the operating system treats speculative processes in a different way from normal processes. Speculative processes are never created for normal processes that perform no disk I/O.

Improved prefetching for swapping applications:
Additions to standard operating system mechanisms are done which improves prefetching performance for memory-intensive applications. Swapping applications have large page tables since they rely on file-and-swap-backed virtual memory to hide I/O. The cost of synchronization is reduced by adding a fast, preemptible refork operation. Copy-on-write faults can introduce significant delays because of the page allocation and data copy.

Weakness:
In comparison to a previous design called SpecHunt, the in-kernel design has some drawbacks. It requires a speculative execution-specific operating system. It does not allow deployment on systems where OS modifications are not feasible. Moreover, it is difficult for in-kernel to exploit static analyses and transformations to specialize the application code for speculative execution.

CLOCK-Pro: An Effective Improvement of the CLOCK Replacement

Objective of CLOCK-Pro:
The goal of the research is to develop an improved CLOCK replacement policy called CLOCK-Pro which takes the advantage of incorporating LIRS (an I/O buffer cache replacement algorithm) into CLOCK. CLOCK-Pro also works similar to CLOCK by keeping track of a limited number of replaced pages with a virtual memory affordable cost.

Trace-Driven Simulation Experiment:
The simulation experiments are conducted in three steps with different workload traces. In the first step, the replacement algorithms are evaluated on I/O traces to measure how effectively CLOCK-Pro retains LIRS performance merits. In the second step, the algorithms on VM traces of application program executions are tested. Due to the mistreatment of file data and process pages, the algorithm is tested on the VM and file I/O traces in the third step. In all experiments, CLOCK-Pro performed significantly well.

Overhead of CLOCK-Pro:
The original system’s paging infrastructure is preserved except for replacing the active/inactive lists with a unified clock list and introducing a hash table. CLOCK-Pro’s additional overhead is limited to the clock list ad hash table operations. The average number of entries the clock hands sweep over per page fault on the lists for the two systems is measured. The results show that CLOCK-Pro has a similar number of hand movements to the original system except for large memory sizes, where the original system reduces its movement number while CLOCK-Pro does not. It results in that the additional overhead is negligible compared with the original replacement implementation.

Weakness:
During one victim page search, the CLOCK-Pro can be moved upto three pointers which are higher than trivial CLOCK implementation.

Design and Implementation of Power-Aware Virtual Memory

Contributions of PAVM:
Power-Aware Virtual Memory (PAVM) reduces power dissipation by energy footprint minimization of each system process. It uses Non-Uniform Memory Access (NUMA) techniques as an abstraction layer for managing nodes in power dissipation reduction. Techniques exploration to reconfigure page allocation dynamically yield additional savings from reducing per-process energy footprint.

Memory Node Concept:
Designing PAVM needs the concept of a memory node. It is assumed that system memory is partitioned into one or more nodes, where a single node is the smallest unit of memory that has independent power management. The node concept generalizes the unit of control available for memory power management operations.

NUMA Management layer:
Implementing PAVM based approaches is not easy in modern operating systems, where virtual memory is extensively used. The unequal treatment sections of memory due to latency and overheads for access are not limited to power-managed memory. By using a NUMA layer, PAVM can be implemented with preferential node allocation without having to re-implement complex low-level physical page allocators.

Weakness:
The current implementation does not fully address a limitation. There is no consideration of direct memory access (DMA) by other hardware components on nodes that have educed power states which results in performance degradation.

Dynamic tracking of page miss ratio curve for memory management

Research proposed methods:
The paper provides two approaches to dynamically track the Miss Ratio Curve (MRC) of applications at run time. The first method is to track MRC at fine time granularity using a hardware MRC monitor. The simulation results show that the monitor has negligible performance and energy overheads. The second method is the implementation of OS-only which tracks MRC at coarse time granularity. The results show the addition of only 7-10% overhead.

Implementation of Mattson’s Stack Algorithm:
The paper has two implementations. One is a hardware MRC monitor which tracks the MRC of an application at a fine time. The MRC monitor connects to the memory bus and observes all memory accesses by snooping the bus. The second one is an OS-only solution that tracks MRC at a coarse time granularity which requires extra hardware support.

MRC-directed Memory Allocation:
To find a good memory management algorithm, first, the problem is to be formalized as a resource utility problem. As an optimal solution to the problem, a greedy algorithm is used. The main idea of the greedy algorithm is to divide the memory into small units and incrementally allocate a memory amount to the process that receives this memory with the highest gradient of the utility function at its current memory allocation.

**Weakness: **
The experiments have been performed only in a few multiprogramming parameters. The processes are yet to be evaluated particularly for MRC-directed energy management. Other applications such as program phase tracking and phase prediction are yet to be investigated.

The Case for Compressed Caching in Virtual Memory Systems

Two elements of innovative approach:
The proposal’s methodology contains two parts that are novel. First, innovative compression algorithms that are suitable for compressing in-memory data representations are introduced. These algorithms compete with and complement more advanced Ziv-Lempel compressors. Second, by keeping track of recent program behavior, calculate how much memory should be compacted.

Use of three compression algorithms:
Three compression algorithms were utilized in the studies. The first is WKdm, a recency-based compressor that uses a direct-mapped 16-word dictionary and fast encoding and operates in machine words. The second is LZO, which is a coded Lempel-Ziv implementation that is optimized for decompression. The third is LZRW1, which is another fast Lempel-Ziv implementation. It does not perform as well as LZO, but it is used to demonstrate the effectiveness of compressed caching.

Online Cost/Benefit Analysis:
The adaptive cache-sizing technique solves the problem of adaptability by conducting an online cost-benefit analysis based on recent program behavior information. The method takes advantage of recency information preserved by standard replacement procedures, i.e. it keeps track of the pages’ approximate order. The system is extended by keeping information on recently evicted pages. It keeps an LRU ordering of the pages in memory and the number of recently evicted pages.

Weakness:
The paper has not addressed any compression algorithm for machine code. If a different compression algorithm is used for machine code, there will be an overhead to determine or segment between instruction sets and data. Although this paper has mentioned some machine codes specialized compression algorithms and their compression factors, it has not taken these overheads into account.

Implementing Lottery Scheduling: Matching the Specializations in Traditional Schedulers

Primary Contribution:
The research goal is to improve responsiveness with windowed ticket boost, a method for dynamically identifying interactive processes and boosting ticket allocation. In lottery scheduling, tickets are the priority abstraction. If a process uses less CPU than ticket allocation, it is marked as currently interactive and ticket allocation is increased to affect scheduling order. To regulate relative CPU time, it is done without affecting lottery scheduling.

Hybrid Lottery Scheduler:
Lottery scheduling is improved to be more responsive to interactive applications and to reduce kernel lock contention with techniques from FreeBSD scheduler, resulting in a hybrid lottery scheduler. The schedule processes hold kernel resources rather than in-kernel ticket transfers to address kernel resource contention. When a sleeping process wakes up or when a process receives an event such as a signal, the scheduler forces a context switch. They made boost tickets for CPU and interactive jobs.

Implementation:
The research performed an implementation of lottery scheduling which enabled control over process execution rates and processor load insulation. They added kernel priorities, abbreviated quanta, and windowed ticket boost to lottery scheduling to match performance to FreeBSD scheduler resulting in a hybrid lottery scheduler.

Weakness:
The paper argued that kernel priorities are more desirable than ticket transfers to let processes release kernel resources quickly. However, the chosen sequence of kernel priorities has not been thoroughly studied resulting in suboptimal performance in all circumstances.

Symbiotic Jobscheduling with Priorities for a Simultaneous Multithreading Processor

Research Proposal:
The paper discusses several ways for supporting the notion of priority on an SMT processor while allowing symbiotic jobscheduling. It shows that low priority jobs can coexist with high priority jobs using a realistic simulation of SMT architecture. System utilization remains high even when priority jobs are allocated the majority of system resources.

Simple priority Mechanism:
The authors explore a variety of hardware and software options for supporting SMT priorities. They begin by implementing priority on SMT, then explore more complex methods that boost throughput while preserving schedule viability. There are two possible meanings of priority: (A) guarantee a fraction of machine proportional to priority, and (B) guarantee a fraction of single-threaded performance proportionate to priority.

SOS:
The authors previously presented SOS, a symbiotic OS-level jobscheduler that samples the space of running schedules, evaluates hardware performance counters and uses heuristics to guess optimal schedules. It uses a sample phase that collects data on how jobs run when executed independently to preserve the semantics of priority. It works with hardware support for enforcing priorities.

Weakness:
Scheduling limits the space of schedules since schedules that coschedule at a lower level lower than hardware supports are not evaluated. Multithreading designs are made under the assumption that more multithreading is better. This optimizes the opportunity for latency hiding via issue from independent instruction streams.

Fast and secure distribution read-only file system

SFS read-only file system:
The file system presented in this paper, SFS read-only file system uses SFS’s naming scheme and fits into the SFS framework. SFS read-only file system consists of three programs: a database generator, a server, and a client. The database generator generates a signed database from a file system’s content. In response, the server retrieves information from its database copy and returns it to the client.

SFS read-only data structures:
The primary read-only data structure clients interpret is the read-only inode structure, which specifies the whole contents of a file. The database stores file system data structures in XDR marshaled form. There are benefits to using XDR. It simplifies the client implementation, XDR representation clearly defines what the database contains, and it improves the performance of the server by not marshaling.

Applications of SFS read-only file system:
Certificate authorities and software distribution are two applications for the SFS read-only file system. SFS uses file systems to certify public keys of servers. SFS certificate authorities get queried interactively, unlike traditionally. By distributing software through SFS read-only servers, users can browse the distribution as a regular file system and compile software straight from the sources stored on the SFS file system.

Weakness:
File system updates using fhdb scheme have drawbacks. Even minor changes to the file system cause majority of the hash tree under fhdb to change-making incremental database updates unnecessarily expensive. In the read-only file system, because handles are based on file contents, there is no distinction between modifying a file and deleting then recreating it.

Energy-Efficiency and Storage Flexibility in the Blue File System

Blue File System:
The Blue File System presented in this paper is a distributed file system for mobile computing that tackles the new opportunities and difficulties that have arisen as mobile storage has evolved. BlueFS offers a variable cache hierarchy that adapts when and where data is accessed based on each device’s performance and energy characteristics.

BlueFS Server:
BlueFS Server is primarily focused on the mobile client, hence the file server implementation is standard. A single BlueFS Server is expected to handle the same number of clients as a current AFS or NFS server. BlueFS Server stores the primary replica of every object. When a client modifies an object, it sends the change to the server. BlueFS server is aware of individual storage devices.

BlueFS kernel module:
The majority of functionality in BlueFS is handled by a user-level daemon, which simplifies implementation. The BlueFS kernel module intercepts Linux VFS operations and redirects them
to the daemon. The kernel module also handles caching and invalidation of data and metadata in the Linux file cache. Write, create, and rename operations are synchronously forwarded to the daemon.

Weakness:
BlueFS utilizes multiple storage devices with heterogeneous characteristics. The iPAQ handheld contains a small amount of NOR flash used by Linux distribution as the root partition. Flash writes are expensive. Measurements show that the iPAQ flash has lower write throughput and higher latency than the microdrive. It made it difficult for BlueFS to use iPAQ’s flash.

MASSIVE ARRAYS OF IDLE DISKS FOR STORAGE ARCHIVES

Design of MAID arrays:
The design of MAID arrays requires accurate models for drive performance and power. A combination of existing performance models and measurements from sample IDE drives is used to develop a unified performance and power model. The simulator model is an extension of an analytic model for SCSI drives.

Traces for Evaluation:
The first version of MAID has a block-based interface. Traces of file system activity that was reduced to a sequence of block requests (read or write) to a disk array is used to evaluate this version. A request’s arrival time is logged for each request. Requests are issued no sooner than the original arrival time.

Metrics for Evaluation of MAID;
Two performance metrics are used to compare system configurations: energy and request response time. Energy is required to spin up a drive which is greater than the energy required to maintain it spinning. Request response time provides an accurate basis for performance comparison when using trace data in simulation.

Weakness:
The authors expected that a linear data layout would be more energy-efficient than a stipend configuration since fewer drives would be set up for each request. But, the workload at the supercomputer center includes enormous transfers, resulting in longer I/O queues for each disk, increasing response time and spinning-up time.

Operating System I/O Speculation: How two invocations are faster than one

Conditions for Success:
The paper targeted distributed file systems as the first clients of Speculator because they show three ideal features. The first is that the outcome of speculative operations is highly predictable. Second, the checkpoint is frequently faster than remote I/O. Third, modern computers often have spare resources.

Speculation implementation:
Checkpointing is implemented in Speculator by creating a copy-on-write fork of the currently executing process. It saves the state of any open file descriptors and copies any signals pending for the checkpointed process. The speculator discards a checkpoint by recovering the kernel data structures associated with the child process if the speculations on which it is based prove to be valid.

Speculative execution in BlueFS:
The researchers analyzed if speculative execution could help a distributed file system guarantee excellent consistency and safety while maintaining appropriate performance. The authors used Speculator to create a clean sheet design while designing a novel distributed file system called BlueFS. This opportunity is used to ensure that single-copy file semantics are consistent and synchronous I/O is safe.

Weakness:
Speculator implies that a client only communicates with one file server at a time while handling mutation activities. It does not block any process that attempts to write to a server while it has uncommitted speculations that depend on another server.

Wide-Area Cooperative Storage with CFS

Cooperative File System:
The Cooperative File System (CFS) is a new read-only peer-to-peer storage system with verifiable guarantees for file storage and retrieval efficiency, robustness, and load-balancing. The distributed hash table (DHash) provided by CFS servers is used to store blocks. DHash blocks are treated as a file system by CFS clients.

Chord Layer:
The Chord protocol is used by CFS to locate blocks. Chord supports only one operation: it determines the node accountable for a specific key given a key. Chord does not store keys and values; instead, it provides primitives that enable higher-layer applications to create a wide range of storage systems.

DHash Layer:
The CFS DHash layer manages the distribution, replication, and caching of uniquely identifiable blocks, as well as their storage and retrieval. Chord is used by DHash to locate blocks. The CFS design decision to break each file system into blocks and distribute those blocks across servers is reflected in DHash. This setup distributes the load of serving popular files across multiple servers.

Weakness:
While CFS is robust in the case of fail-stop failures, it isn’t designed to protect against malicious participants. An internally consistent CFS system may be formed by a set of malicious nodes; such a system could not falsify documents because CFS can verify authenticity, but it could falsely assert that a document did not exist.

Ivy: A Read/Write Peer-to-Peer File System

Ivy:
Ivy is a read/write peer-to-peer file system for multiple users. An Ivy file system is made up entirely of logs, one for each participant. The DHash distributed hash table is where Ivy saves its logs. Each participant obtains data by consulting all logs, but only modifies its own log by attaching it.

Design of Ivy:
A set of logs, one for each participant, make up an Ivy file system. A log is a record of all changes made to a file system by a single participant. Each person simply adds to their own log but reads from all of them. The DHash distributed hash system, which supports per-record replication and authentication, is used by participants to store log records.

Cache Consistency:
Ivy implements a modified version of close-to-open consistency for file reads and writes. If application A1 writes data to a file, then closes it, and another program A2 accesses and reads the file after the close, A2 will see the data written by A1. Ivy caches log records, but because the records are immutable, this cache never needs to be invalidated.

Weakness:
Ivy’s CVS performance has been poor. CVS analyzes every file in the repository during a commit or update, and Ivy examines each file access to see if another participant has recently updated the file. CVS has locked the repository, making such changes difficult; however, Ivy is unaware of this.

Internet of Things

Internet of Things (IoT):
The Internet of Things (IoT) is a network of physical objects (devices, vehicles, and buildings) that are integrated with electronics, software, sensors, and network connectivity which allows these objects to collect and exchange data.

Changing Networks:
The networks are changing in various ways through extensions, expansions, and enhancements. Extensions include more connections with nodes, IPv6, and other M2M connections. Expansions is done through broadband. Enhancements include data-centric and content-oriented networking.

Attacking IoT:
When mechanisms to ensure physical security are not in place, physical attacks on IoT devices result in tampering to implant malicious code into IoT devices. Non-traditional communication protocols, such as Bluetooth, Zigbee, Zwave, Sigfox, NFC, and 6LowPAN, are used to infiltrate networks and may be beyond the scope of typical incident and forensic management systems.

Weakness:
In IoT, with the concept of “always-on”, such technology will require a change in mindset when considering implementation of products and services. As, the Internet of Things is becoming more and more integrated into people’s and organizations’ daily lives, safeguarding privacy, security, and commercial operations/opportunities will become more of a priority both now and in the future.