Alibaba's 2021 and 2022 microservice datasets are the only publicly
available sources of request-workflow traces from a large-scale
microservice deployment. They have the potential to strongly
influence future research as they provide much-needed visibility into
industrial microservices' characteristics. We conduct the first
systematic analyses of both datasets to help facilitate their use by
the community. We find that the 2021 dataset contains numerous
inconsistencies preventing accurate reconstruction of full trace
topologies. The 2022 dataset also suffers from inconsistencies, but
at a much lower rate. Tools that strictly follow Alibaba's specs for
constructing traces from these datasets will silently ignore these
inconsistencies, misinforming researchers by creating traces of the
wrong sizes and shapes. Tools that discard traces with inconsistencies
will discard many traces. We present Casper, a construction method
that uses redundancies in the datasets to sidestep the
inconsistencies. Compared to an approach that discards traces with
inconsistencies, Casper accurately reconstructs an additional
25.5% of traces in the 2021 dataset (going from 58.32% to 83.82%)
and an additional 12.18% in the 2022 dataset (going from 86.42% to
98.6%).
2023
ATC
Lifting the veil on Meta's microservice architecture: Analyses of topology and request workflows
Huye, Darby,
Shkuro, Yuri,
and Sambasivan, Raja R.
The microservice architecture is a novel paradigm for building and operating distributed applications in many organizations. This paradigm changes many aspects of how distributed applications are built, managed, and operated in contrast to monolithic applications. It introduces new challenges to solve and requires changing assumptions about previously well-known ones. But, today, the characteristics of large-scale microservice architectures are invisible outside their organizations, depressing opportunities for research. Recent studies provide only partial glimpses and represent only single design points. This paper enriches our understanding of large-scale microservices by characterizing Meta’s microservice architecture. It focuses on previously unreported (or underreported) aspects important to developing and researching tools that use the microservice topology or traces of request workflows. We find that the topology is extremely heterogeneous, is in constant flux, and includes software entities that do not cleanly fit in the microservice architecture. Request workflows are highly dynamic, but local properties can be predicted using service and endpoint names. We quantify the impact of obfuscating factors in microservice measurement and conclude with implications for tools and future-work opportunities.
2022
JSys
[SoK] Identifying Mismatches Between Microservice Testbeds and Industrial Perceptions of Microservices
Seshagiri, Vishwanath,
Huye, Darby,
Liu, Lan,
Wildani, Avani,
and Sambasivan, Raja R.
Industrial microservice architectures vary so wildly in their characteristics, such as size or communication method, that comparing systems is difficult and often leads to confusion and misinterpretation. In contrast, the academic testbeds used to conduct microservices research employ a very constrained set of design choices. This lack of systemization in these key design choices when developing microservice architectures has led to uncertainty over how to use experiments from testbeds to inform practical deployments and indeed whether this should be done at all. We conduct semi-structured interviews with industry participants to understand the representativeness of existing testbeds' design choices. Surprising results included the presence of cycles in industry deployments, as well as a lack of clarity about the presence of hierarchies. We then systematize the possible design choices we learned about from the interviews, and identify important mismatches between our interview results and testbeds' designs that will inform future, more representative testbeds.
Developers use logs to diagnose performance problems in distributed applications. But, it is difficult to know a priori where logs are needed and what information in them is needed to help diagnose problems that may occur in the future. We summarize our work on the Variance-driven Automated Instrumentation Framework (VAIF), which runs alongside distributed applications. In response to newly-observed performance problems, VAIF automatically searches the space of possible instrumentation choices to enable the logs needed to help diagnose them. To work, VAIF combines distributed tracing (an enhanced form of logging) with insights about how response-time variance can be decomposed on the critical-path portions of requests’ traces.
2021
SoCC
Automating instrumentation choices for performance problems in distributed applications with VAIF
Toslali, Mert,
Ates, Emre,
Ellis, Alex,
Zhang, Zhaoqi,
Huye, Darby,
Liu, Lan,
Puterman, Samantha,
Coskun, Ayse K.,
and Sambasivan, Raja R.
Developers use logs to diagnose performance problems in distributed applications. However, it is difficult to know a priori where logs are needed and what information in them is needed to help diagnose problems that may occur in the future. We present the Variance-driven Automated Instrumentation Framework (VAIF), which runs alongside distributed applications. In response to newly-observed performance problems, VAIF automatically searches the space of possible instrumentation choices to enable the logs needed to help diagnose them. To work, VAIF combines distributed tracing (an enhanced form of logging) with insights about how response-time variance can be decomposed on the critical-path portions of requests’ traces. We evaluate VAIF by using it to localize performance problems in OpenStack and HDFS. We show that VAIF can localize problems related to slow code paths, resource contention, and problematic third-party code while enabling only 3-34% of the total tracing instrumentation.
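A minimal sketch of VAIF-style variance-driven instrumentation, under simplifying assumptions: each trace exposes its critical path as (span, latency) pairs, and instrumentation is modeled as a per-span switch. VAIF's real search over instrumentation choices is more sophisticated.

```python
import statistics

def high_variance_spans(traces, cv_threshold=0.5):
    """Flag spans whose critical-path latency varies most across requests
    expected to perform similarly."""
    by_span = {}
    for trace in traces:
        for span, latency in trace["critical_path"]:
            by_span.setdefault(span, []).append(latency)
    flagged = []
    for span, samples in by_span.items():
        if len(samples) < 2:
            continue
        cv = statistics.stdev(samples) / max(statistics.mean(samples), 1e-9)
        if cv > cv_threshold:            # coefficient of variation
            flagged.append(span)
    return flagged

def control_loop(traces, enabled):
    """Enable finer-grained tracepoints inside high-variance spans so the
    next round of traces localizes the problem further."""
    for span in high_variance_spans(traces):
        enabled.add(span + "/fine-grained")
    return enabled
```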
2019
BigData
D3N: A multi-layer cache for the rest of us
Kaynar, Emine Ugur,
Abdi, Mania,
Hajkazemi, Mohammad Hossein,
Turk, Ata,
Sambasivan, Raja R,
Cohen, David,
Rudolph, Larry,
Desnoyers, Peter,
and Krieger, Orran
In IEEE International Conference on Big Data (Big Data)
2019
Current caching methods for improving the performance of big-data jobs assume high (e.g., full bi-section) bandwidth; however many enterprise data centers and co-location facilities have large network imbalances due to over-subscription and incremental networking upgrades. We describe D3N, a multi-layer cooperative caching architecture that mitigates network imbalances by caching data on the access side of each layer of a hierarchical network topology, adaptively adjusting cache sizes of each layer based on observed workload patterns and network congestion. We have added (and submitted upstream) a 2-layer D3N cache to the Ceph RADOS Gateway; read bandwidth achieves the 5GB/s speed of our SSDs, and we show that it substantially improves big-data job performance while reducing network traffic.
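The following hypothetical sketch illustrates the adaptive-sizing idea: split a fixed cache budget between an access-side layer (L1) and a higher layer (L2), weighting L2 hits by observed congestion on the oversubscribed links. The weighting is invented for illustration and is not D3N's actual algorithm.

```python
def rebalance(budget_gb, l1_hits, l2_hits, congestion):
    """Split a cache budget across two layers. `congestion` in [0, 1] scales
    the value of L2 hits, which spare the oversubscribed core links."""
    l2_value = l2_hits * (1.0 + congestion)
    total = (l1_hits + l2_value) or 1.0
    l1_share = l1_hits / total
    return l1_share * budget_gb, (1.0 - l1_share) * budget_gb

# e.g., periodically: l1_gb, l2_gb = rebalance(512, l1_hits, l2_hits, 0.6)
```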
SoCC
An automated, cross-layer instrumentation framework for diagnosing performance problems in distributed applications
Ates, Emre,
Sturmann, Lily,
Toslali, Mert,
Krieger, Orran,
Megginson, Richard,
Coskun, Ayse K,
and Sambasivan, Raja R.
Diagnosing performance problems in distributed applications is extremely challenging. A significant reason is that it is hard to know where to place instrumentation a priori to help diagnose problems that may occur in the future. We present the vision of an automated instrumentation framework, Pythia, that runs alongside deployed distributed applications. In response to a newly-observed performance problem, Pythia searches the space of possible instrumentation choices to enable the instrumentation needed to help diagnose it. Our vision for Pythia builds on workflow-centric tracing, which records the order and timing of how requests are processed within and among a distributed application’s nodes (i.e., records their workflows). It uses the key insight that localizing the sources of high performance variation within the workflows of requests that are expected to perform similarly gives insight into where additional instrumentation is needed.
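The key insight lends itself to a short sketch: group requests whose workflows look alike (here, by the sequence of components visited; a simplifying assumption) and rank groups by response-time variance. High-variance groups mark where instrumentation is missing.

```python
from collections import defaultdict
import statistics

def rank_groups(traces):
    """traces: dicts with 'components' (visited, in order) and 'response_time'."""
    groups = defaultdict(list)
    for t in traces:
        skeleton = tuple(t["components"])  # requests expected to be similar
        groups[skeleton].append(t["response_time"])
    ranked = [(statistics.variance(rts), skel)
              for skel, rts in groups.items() if len(rts) >= 2]
    return sorted(ranked, reverse=True)    # highest variance first
```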
2017
SIGCOMM
Bootstrapping evolvability for inter-domain routing with D-BGP
Sambasivan, Raja R.,
Tran-Lam, David,
Akella, Aditya,
and Steenkiste, Peter
In ACM Special Interest Group on Data Communication (SIGCOMM)
2017
The Internet’s inter-domain routing infrastructure, provided today by BGP, is extremely rigid and does not facilitate the introduction of new inter-domain routing protocols. This rigidity has made it incredibly difficult to widely deploy critical fixes to BGP. It has also depressed ASes’ ability to sell value-added services or replace BGP entirely with a more sophisticated protocol. Even if operators undertook the significant effort needed to fix or replace BGP, it is likely the next protocol will be just as difficult to change or evolve. To help, this paper identifies two features needed in the routing infrastructure (i.e., within any inter-domain routing protocol) to facilitate evolution to new protocols. To understand their utility, it presents D-BGP, a version of BGP that incorporates them.
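As a rough illustration of the kind of feature the paper argues for, the sketch below models an advertisement that carries a baseline path every AS understands plus opaque per-protocol information that unsupporting ASes pass through unmodified. Field names and the handler interface are invented; see the paper for D-BGP's actual design.

```python
from dataclasses import dataclass, field

@dataclass
class IntegratedAdvertisement:
    prefix: str
    baseline_path: list                         # classic BGP AS path
    extra: dict = field(default_factory=dict)   # protocol name -> opaque info

def process(ad, my_asn, handlers):
    """handlers: protocol name -> update function, for protocols this AS
    speaks. Everything else is passed through unmodified."""
    ad.baseline_path = ad.baseline_path + [my_asn]
    for proto, info in ad.extra.items():
        if proto in handlers:
            ad.extra[proto] = handlers[proto](info, my_asn)
        # else: pass-through, so partial deployments still propagate info
    return ad
```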
2016
SoCC
Principled workflow-centric tracing of distributed systems
Sambasivan, Raja R.,
Shafer, Ilari,
Mace, Jonathan,
Sigelman, Benjamin H.,
Fonseca, Rodrigo,
and Ganger, Gregory R.
Workflow-centric tracing captures the workflow of causally-related events (e.g., work done to process a request) within and among the components of a distributed system. As distributed systems grow in scale and complexity, such tracing is becoming a critical tool for understanding distributed system behavior. Yet, there is a fundamental lack of clarity about how such infrastructures should be designed to provide maximum benefit for important management tasks, such as resource accounting and diagnosis. Without research into this important issue, there is a danger that workflow-centric tracing will not reach its full potential. To help, this paper distills the design space of workflow-centric tracing and describes key design choices that can help or hinder a tracing infrastructure’s utility for important tasks. Our design space and the design choices we suggest are based on our experiences developing several previous workflow-centric tracing infrastructures.
TechReport
Bootstrapping evolvability for inter-domain routing with D-BGP
Sambasivan, Raja R.,
Tran-Lam, David,
Akella, Aditya,
and Steenkiste, Peter
It is extremely difficult to utilize new routing protocols in today’s Internet. As a result, the Internet’s baseline inter-domain protocol for connectivity (BGP) has remained largely unchanged, despite known significant flaws. The difficulty of using new protocols has also depressed opportunities for (currently commoditized) transit providers to provide value-added routing services. To help, this paper proposes Darwin’s BGP (D-BGP), a modified version of BGP that can support evolvability to new protocols. D-BGP modifies BGP’s advertisements and advertisement processing based on requirements imposed by key evolvability scenarios, which we identified via analyses of recently-proposed routing protocols.
2015
HotNets
Bootstrapping evolvability for inter-domain routing
Sambasivan, Raja R.,
Tran-Lam, David,
Akella, Aditya,
and Steenkiste, Peter
In ACM Workshop on Hot Topics in Networks (HotNets)
2015
It is extremely difficult to deploy new inter-domain routing protocols in today’s Internet. As a result, the Internet’s baseline protocol for connectivity, BGP, has remained largely unchanged, despite known significant flaws. The difficulty of deploying new protocols has also depressed opportunities for (currently commoditized) transit providers to provide value-added routing services. To help, we identify the key deployment models under which new protocols are introduced and the requirements each poses for enabling their usage goals. Based on these requirements, we argue for two modifications to BGP that will greatly improve support for new routing protocols.
2014
TechReport
So, you want to trace your distributed system? Key design insights from years of practical experience
Sambasivan, Raja R.,
Fonseca, Rodrigo,
Shafer, Ilari,
and Ganger, Gregory R.
End-to-end tracing captures the workflow of causally-related activity (e.g., work done to process a request) within and among the components of a distributed system. As distributed systems grow in scale and complexity, such tracing is becoming a critical tool for management tasks like diagnosis and resource accounting. Drawing upon our experiences building and using end-to-end tracing infrastructures, this paper distills the key design axes that dictate trace utility for important use cases. Developing tracing infrastructures without explicitly understanding these axes and choices for them will likely result in infrastructures that are not useful for their intended purposes. In addition to identifying the design axes, this paper identifies good design choices for various tracing use cases, contrasts them to choices made by previous tracing implementations, and shows where prior implementations fall short. It also identifies remaining challenges on the path to making tracing an integral part of distributed system design.
2013
HotStorage
Specialized storage for big numeric time series
Shafer, Ilari,
Sambasivan, Raja R,
Rowe, Anthony,
and Ganger, Gregory R
In USENIX Workshop on Hot Topics in Storage (HotStorage)
2013
Numeric time series data has unique storage requirements and access patterns that can benefit from specialized support, given its importance in Big Data analyses. Popular frameworks and databases focus on addressing other needs, making them a suboptimal fit. This paper describes the support needed for numeric time series, suggests an architecture for efficient time series storage, and illustrates its potential for satisfying key requirements.
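As a generic illustration of the specialization argument (not this paper's proposed architecture), fixed-width columnar records with delta-encoded timestamps already beat general-purpose row stores for numeric series:

```python
import struct

def encode(points):
    """points: [(timestamp_ms, value)]. Regular sampling makes the deltas
    tiny and highly compressible; a real encoder would varint/bit-pack them
    instead of using fixed 8-byte fields as done here for brevity."""
    out, prev_ts = bytearray(), None
    for ts, val in points:
        delta = ts if prev_ts is None else ts - prev_ts
        out += struct.pack("<qd", delta, val)
        prev_ts = ts
    return bytes(out)
```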
TechReport
Visualizing request-flow comparison to aid performance diagnosis in distributed systems
Sambasivan, Raja R.,
Shafer, Ilari,
Mazurek, Michelle L,
and Ganger, Gregory R.
Distributed systems are complex to develop and administer, and performance problem diagnosis is particularly challenging. When performance degrades, the problem might be in any of the system’s many components or could be a result of poor interactions among them. Recent research efforts have created tools that automatically localize the problem to a small number of potential culprits, but effective visualizations are needed to help developers understand and explore their results. This paper compares side-by-side, diff, and animation-based approaches for visualizing the results of one proven automated localization technique called request-flow comparison. Via a 26-person user study, which included real distributed systems developers, we identify the unique benefits that each approach provides for different usage modes and problem types.
InfoVis
Visualizing request-flow comparison to aid performance diagnosis in distributed systems
Sambasivan, Raja R,
Shafer, Ilari,
Mazurek, Michelle L,
and Ganger, Gregory R
IEEE Transactions on Visualization and Computer Graphics
2013
Distributed systems are complex to develop and administer, and performance problem diagnosis is particularly challenging. When performance degrades, the problem might be in any of the system’s many components or could be a result of poor interactions among them. Recent research efforts have created tools that automatically localize the problem to a small number of potential culprits, but research is needed to understand what visualization techniques work best for helping distributed systems developers understand and explore their results. This paper compares the relative merits of three well-known visualization approaches (side-by-side, diff, and animation) in the context of presenting the results of one proven automated localization technique called request-flow comparison. Via a 26-person user study, which included real distributed systems developers, we identify the unique benefits that each approach provides for different problem types and usage modes.
Dissertation
Diagnosing performance changes in distributed systems by comparing request flows
Diagnosing performance problems in modern datacenters and distributed systems is challenging, as the root cause could be contained in any one of the system’s numerous components or, worse, could be a result of interactions among them. As distributed systems continue to increase in complexity, diagnosis tasks will only become more challenging. There is a need for a new class of diagnosis techniques capable of helping developers address problems in these distributed environments. As a step toward satisfying this need, this dissertation proposes a novel technique, called request-flow comparison, for automatically localizing the sources of performance changes from the myriad potential culprits in a distributed system to just a few potential ones. Request-flow comparison works by contrasting the workflow of how individual requests are serviced within and among every component of the distributed system between two periods: a non-problem period and a problem period. By identifying and ranking performance-affecting changes, request-flow comparison provides developers with promising starting points for their diagnosis efforts. Request workflows are obtained with less than 1% overhead via use of recently developed end-to-end tracing techniques. To demonstrate the utility of request-flow comparison in various distributed systems, this dissertation describes its implementation in a tool called Spectroscope and describes how Spectroscope was used to diagnose real, previously unsolved problems in the Ursa Minor distributed storage service and in select Google services. It also explores request-flow comparison’s applicability to the Hadoop File System. Via a 26-person user study, it identifies effective visualizations for presenting request-flow comparison’s results and further demonstrates that request-flow comparison helps developers quickly identify starting points for diagnosis. This dissertation also distills design choices that will maximize an end-to-end tracing infrastructure’s utility for diagnosis tasks and other use cases.
2012
TechReport
Visualizing request-flow comparison to aid performance diagnosis in distributed systems
Sambasivan, Raja R.,
Mazurek, Michelle L,
and Shafer, Ilari
Distributed systems are complex to develop and administer, and performance problem diagnosis is particularly challenging. When performance decreases, the problem might be in any of the system’s many components or could be a result of poor interactions among them. Recent research has provided the ability to automatically identify a small set of most likely problem locations, leaving the diagnoser with the task of exploring just that set. This paper describes and evaluates three approaches for visualizing the results of a proven technique called “request-flow comparison” for identifying likely causes of performance decreases in a distributed system. Our user study provides a number of insights useful in guiding visualization tool design for distributed system diagnosis. For example, we find that both an overlay-based approach (e.g., diff) and a side-by-side approach are effective, with tradeoffs for different users (e.g., expert vs. not) and different problem types. We also find that an animation-based approach is confusing and difficult to use.
HotCloud
Automated diagnosis without predictability is a recipe for failure
Sambasivan, Raja R,
and Ganger, Gregory R
In USENIX Workshop on Hot Topics in Cloud Computing (HotCloud)
2012
Automated management is critical to the success of cloud computing, given its scale and complexity. But, most systems do not satisfy one of the key properties required for automation: predictability, which in turn relies upon low variance. Most automation tools are not effective when variance is consistently high. Using automated performance diagnosis as a concrete example, this position paper argues that for automation to become a reality, system builders must treat variance as an important metric and make conscious decisions about where to reduce it. To help with this task, we describe a framework for reasoning about sources of variance in distributed systems and describe an example tool for helping identify them.
2011
NSDI
Diagnosing performance changes by comparing request flows
Sambasivan, Raja R,
Zheng, Alice X,
De Rosa, Michael,
Krevat, Elie,
Whitman, Spencer,
Stroucken, Michael,
Wang, William,
Xu, Lianghong,
and Ganger, Gregory R
In USENIX Conference on Networked Systems Design and Implementation (NSDI)
2011
The causes of performance changes in a distributed system often elude even its developers. This paper develops a new technique for gaining insight into such changes: comparing request flows from two executions (e.g., of two system versions or time periods). Building on end-to-end request-flow tracing within and across components, algorithms are described for identifying and ranking changes in the flow and/or timing of request processing. The implementation of these algorithms in a tool called Spectroscope is evaluated. Six case studies are presented of using Spectroscope to diagnose performance changes in a distributed storage service caused by code changes, configuration modifications, and component degradations, demonstrating the value and efficacy of comparing request flows. Preliminary experiences of using Spectroscope to diagnose performance changes within select Google services are also presented.
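The comparison can be sketched in a few lines: bin traces from the two executions by workflow structure, then surface bins whose frequency shifts (structural changes) and bins whose latency shifts (timing changes). Spectroscope ranks candidates with statistical tests; this sketch only shows the shape of the computation, with hypothetical field names.

```python
from collections import Counter
import statistics

def compare(before, after):
    """before/after: lists of traces, each with a 'structure' signature and 'rt'."""
    f_before = Counter(t["structure"] for t in before)
    f_after = Counter(t["structure"] for t in after)
    # Structural changes: workflow shapes whose frequency shifted.
    structural = {s: f_after[s] - f_before[s]
                  for s in f_before.keys() | f_after.keys()}
    # Timing changes: same shape, different latency.
    def mean_rt(traces, s):
        rts = [t["rt"] for t in traces if t["structure"] == s]
        return statistics.mean(rts) if rts else 0.0
    timing = {s: mean_rt(after, s) - mean_rt(before, s)
              for s in f_before if s in f_after}
    return structural, timing  # rank both by magnitude to find culprits
```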
TechReport
Automation without predictability is a recipe for failure
Automated management seems a must, as distributed systems and datacenters continue to grow in scale and complexity. But, automation of performance problem diagnosis and tuning relies upon predictability, which in turn relies upon low variance—most automation tools aren’t effective when variance is regularly high. This paper argues that, for automation to become a reality, system builders must treat variance as an important metric and make conscious decisions about where to reduce it. To help with this task, we describe a framework for understanding sources of variance and describe an example tool for helping identify them.
2010
ATC
A transparently-scalable metadata service for the Ursa Minor storage system
Sinnamohideen, Shafeeq,
Sambasivan, Raja R,
Hendricks, James,
Liu, Likun,
and Ganger, Gregory R
The metadata service of the Ursa Minor distributed storage system scales metadata throughput as metadata servers are added. While doing so, it correctly handles metadata operations that involve items served by different metadata servers, consistently and atomically updating the items. Unlike previous systems, it does so by reusing existing metadata migration functionality to avoid complex distributed transaction protocols. It also assigns item IDs to minimize the occurrence of multi-server operations. Ursa Minor’s approach allows one to implement a desired feature with less complexity than alternative methods and with minimal performance penalty (under 1% in non-pathological cases).
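A sketch of the migration-based idea, with hypothetical helper names: rather than running a distributed transaction, collocate the items a multi-server operation touches on one server and execute it there as an ordinary single-server operation.

```python
def execute(op, items, owner_of, migrate):
    """op spans `items`; owner_of maps item -> server; migrate reuses the
    system's existing redistribution machinery. All names are hypothetical."""
    owners = {owner_of(item) for item in items}
    target = owners.pop()                  # pick any involved server
    for item in items:
        if owner_of(item) != target:
            migrate(item, target)          # collocate items on `target`
    return op.run_locally(target, items)   # now a single-server operation
```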
Patent
Managing execution of database queries
Krompass, Stepan,
Kuno, Harumi Anne,
Dayal, Umeshwar,
Wiener, Janet,
and Sambasivan, Raja R.
One embodiment is a method to manage queries in a database. The method identifies a query that executes on the database for an elapsed time that is greater than a threshold and then implements a remedial action when the query executes on the database for an execution time that is greater than an estimated execution time.
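A minimal sketch of the claimed method, with placeholder names for the threshold, estimate, and remedial action:

```python
def manage_query(query, now, threshold_s, remediate):
    """Flag a query running longer than a threshold, then act once it also
    exceeds its estimated execution time (both placeholders)."""
    elapsed = now - query.start_time
    if elapsed > threshold_s and elapsed > query.estimated_runtime_s:
        remediate(query)  # e.g., abort, throttle, or deprioritize
```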
TechReport
Diagnosing performance problems by visualizing and comparing system behaviours
Sambasivan, Raja R.,
Zheng, Alice X.,
Krevat, Elie,
Whitman, Spencer,
and Ganger, Gregory R.
Spectroscope is a new toolset aimed at assisting developers with the long-standing challenge of performance debugging in distributed systems. To do so, it mines end-to-end traces of request processing within and across components. Using Spectroscope, developers can visualize and compare system behaviours between two periods or system versions, identifying and ranking various changes in the flow or timing of request processing. Examples of how Spectroscope has been used to diagnose real performance problems seen in a distributed storage system are presented, and Spectroscope’s primary assumptions and algorithms are evaluated.
TechReport
Diagnosing performance changes by comparing system behaviours
Sambasivan, Raja R.,
Zheng, Alice X,
Krevat, Elie,
Whitman, Spencer,
Stroucken, Michael,
Wang, William,
Xu, Lianghong,
and Ganger, Gregory R.
The causes of performance changes in a distributed system often elude even its developers. This paper develops a new technique for gaining insight into such changes: comparing system behaviours from two executions (e.g., of two system versions or time periods). Building on end-to-end request flow tracing within and across components, algorithms are described for identifying and ranking changes in the flow and/or timing of request processing. The implementation of these algorithms in a tool called Spectroscope is described and evaluated. Five case studies are presented of using Spectroscope to diagnose performance changes in a distributed storage system caused by code changes and configuration modifications, demonstrating the value and efficacy of comparing system behaviours.
2009
CACM
Relative fitness modeling
Mesnier, Michael P.,
Wachs, Matthew,
Sambasivan, Raja R.,
Zheng, Alice X.,
and Ganger, Gregory R.
Relative fitness is a new approach to modeling the performance of storage devices (e.g., disks and RAID arrays). In contrast to a conventional model, which predicts the performance of an application’s I/O on a given device, a relative fitness model predicts performance differences between devices. The result is significantly more accurate predictions.
2007
SIGMETRICS
Modeling the relative fitness of storage
Mesnier, Michael P.,
Wachs, Matthew,
Sambasivan, Raja R.,
Zheng, Alice X.,
and Ganger, Gregory R.
In ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS)
2007
Relative fitness is a new black-box approach to modeling the performance of storage devices. In contrast with an absolute model that predicts the performance of a workload on a given storage device, a relative fitness model predicts performance differences between a pair of devices. There are two primary advantages to this approach. First, because a relative fitness model is constructed for a device pair, the application-device feedback of a closed workload can be captured (e.g., how the I/O arrival rate changes as the workload moves from device A to device B). Second, a relative fitness model allows performance and resource utilization to be used in place of workload characteristics. This is beneficial when workload characteristics are difficult to obtain or concisely express (e.g., rather than describe the spatio-temporal characteristics of a workload, one could use the observed cache behavior of device A to help predict the performance of B). This paper describes the steps necessary to build a relative fitness model, with an approach that is general enough to be used with any black-box modeling technique. We compare relative fitness models and absolute models across a variety of workloads and storage devices. On average, relative fitness models predict bandwidth and throughput within 10–20% and can reduce prediction error by as much as a factor of two when compared to absolute models.
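In sketch form, a relative fitness model is an ordinary regressor whose features include performance and utilization observed on device A and whose target is the workload's performance on device B. The paper uses CART-style regression trees; the data layout below is assumed for illustration.

```python
from sklearn.tree import DecisionTreeRegressor

def fit_relative_model(samples):
    """samples: dicts with numeric lists 'workload_features' and
    'observed_on_A' (performance/utilization on device A), plus the
    measured 'bandwidth_on_B' for the same workload."""
    X = [s["workload_features"] + s["observed_on_A"] for s in samples]
    y = [s["bandwidth_on_B"] for s in samples]
    return DecisionTreeRegressor(min_samples_leaf=5).fit(X, y)
```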
HotAC
Categorizing and differencing system behaviours
Sambasivan, Raja R,
Zheng, Alice X,
Thereska, Eno,
and Ganger, Gregory R
In USENIX Workshop on Hot Topics in Autonomic Computing
2007
Making request flow tracing an integral part of software systems creates the potential to better understand their operation. The resulting traces can be converted to per-request graphs of the work performed by a service, representing the flow and timing of each request’s processing. Collectively, these graphs contain detailed and comprehensive data about the system’s behavior and the workload that induced it, leaving the challenge of extracting insights. Categorizing and differencing such graphs should greatly improve our ability to understand the runtime behavior of complex distributed services and diagnose problems. Clustering the set of graphs can identify common request processing paths and expose outliers. Moreover, clustering two sets of graphs can expose differences between the two; for example, a programmer could diagnose a problem that arises by comparing current request processing with that of an earlier non-problem period and focusing on the aspects that change. Such categorizing and differencing of system behavior can be a big step in the direction of automated problem diagnosis.
FAST
//TRACE: Parallel trace replay with approximate causal events
Mesnier, Michael P,
Wachs, Matthew,
Sambasivan, Raja R,
Lopez, Julio,
Hendricks, James,
Ganger, Gregory R,
and O’Hallaron, David R
In USENIX Conference on File and Storage Technologies
2007
//TRACE is a new approach for extracting and replaying traces of parallel applications to recreate their I/O behavior. Its tracing engine automatically discovers inter-node data dependencies and inter-I/O compute times for each node (process) in an application. This information is reflected in per-node annotated I/O traces. Such annotation allows a parallel replayer to closely mimic the behavior of a traced application across a variety of storage systems. When compared to other replay mechanisms, //TRACE offers significant gains in replay accuracy. Overall, the average replay error for the parallel applications evaluated in this paper is below 6%.
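A sketch of annotated-trace replay: each node thread replays its I/Os in order, sleeping the recorded inter-I/O compute time and blocking on inter-node data dependencies before issuing dependent I/Os. The record format is invented for illustration.

```python
import threading, time
from collections import defaultdict

def replay_node(records, barriers, do_io):
    """records: one node's annotated I/O trace, in order. barriers maps a
    dependency name to a threading.Event shared by all node threads."""
    for rec in records:
        for dep in rec["waits_for"]:     # inter-node data dependencies
            barriers[dep].wait()         # block until the producer signals
        time.sleep(rec["compute_s"])     # recorded inter-I/O compute time
        do_io(rec["io"])                 # reissue the recorded I/O
        for sig in rec.get("signals", ()):
            barriers[sig].set()          # release dependent nodes

barriers = defaultdict(threading.Event)  # pre-populate before starting threads
```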
2006
IEEE Bulletin
Early experiences on the journey towards self-* storage
Abd-El-Malek, Michael,
Courtright II, William V,
Cranor, Chuck,
Ganger, Gregory R,
Hendricks, James,
Klosterman, Andrew J,
Mesnier, Michael P,
Prasad, Manish,
Salmon, Brandon,
and Sambasivan, Raja R
Self-* systems are self-organizing, self-configuring, self-healing, self-tuning and, in general, self-managing. Ursa Minor is a large-scale storage infrastructure being designed and deployed at Carnegie Mellon University, with the goal of taking steps towards the self-* ideal. This paper discusses our early experiences with one specific aspect of storage management: performance tuning and projection. Ursa Minor uses self-monitoring and rudimentary system modeling to support analysis of how system changes would affect performance, exposing simple What-if query interfaces to administrators and tuning agents. We find that most performance predictions are sufficiently accurate (within 10-20%) and that the associated performance overhead is less than 6%. Such embedded support for What-if queries simplifies tuning automation and reduces the administrator expertise needed to make acquisition decisions.
TechReport
Improving small file performance in object-based storage
Hendricks, James,
Sambasivan, Raja R.,
Sinnamohideen, Shafeeq,
and Ganger, Gregory R.
This paper proposes architectural refinements, server-driven metadata prefetching and namespace flattening, for improving the efficiency of small file workloads in object-based storage systems. Server-driven metadata prefetching consists of having the metadata server provide information and capabilities for multiple objects, rather than just one, in response to each lookup. Doing so allows clients to access the contents of many small files for each metadata server interaction, reducing access latency and metadata server load. Namespace flattening encodes the directory hierarchy into object IDs such that namespace locality translates to object ID similarity. Doing so exposes namespace relationships among objects (e.g., as hints to storage devices), improves locality in metadata indices, and enables use of ranges for exploiting them. Trace-driven simulations and experiments with a prototype implementation show significant performance benefits for small file workloads.
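Namespace flattening can be sketched directly: derive each object's ID from its path components so that objects in the same directory share high-order ID bits and can be fetched or prefetched as an ID range. Bit widths and the hash are arbitrary choices here, not the paper's.

```python
import hashlib

def flat_object_id(path: str, bits_per_dir: int = 12) -> int:
    """Encode the directory hierarchy into the object ID so that namespace
    locality translates to object ID similarity."""
    oid = 0
    for comp in path.strip("/").split("/"):
        h = int.from_bytes(hashlib.sha1(comp.encode()).digest()[:4], "big")
        oid = (oid << bits_per_dir) | (h & ((1 << bits_per_dir) - 1))
    return oid  # siblings share high-order bits, so ranges exploit locality
```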
TechReport
Eliminating cross-server operations in scalable file systems
Hendricks, James,
Sinnamohideen, Shafeeq,
Sambasivan, Raja R.,
and Ganger, Gregory R.
Distributed file systems that scale by partitioning files and directories among a collection of servers inevitably encounter cross-server operations. A common example is a RENAME that moves a file from a directory managed by one server to a directory managed by another. Systems that provide the same semantics for cross-server operations as for those that do not span servers traditionally implement dedicated protocols for these rare operations. This paper suggests an alternate approach that exploits the existence of dynamic redistribution functionality (e.g., for load balancing, incorporation of new servers, and so on). When a client request would involve files on multiple servers, the system can redistribute those files onto one server and have it service the request. Although such redistribution is more expensive than a dedicated cross-server protocol, the rareness of such operations makes the overall performance impact minimal. Analysis of NFS traces indicates that cross-server operations make up fewer than 0.001% of client requests, and experiments with a prototype implementation show that the performance impact is negligible when such operations make up as much as 0.01% of operations. Thus, when dynamic redistribution functionality exists in the system, cross-server operations can be handled with little additional implementation complexity.
No single data encoding scheme or fault model is right for all data. A versatile storage system allows these to be data-specific, so that they can be matched to access patterns, reliability requirements, and cost goals. Ursa Minor is a cluster-based storage system that allows data-specific selection of and on-line changes to encoding schemes and fault models. Thus, different data types can share a scalable storage infrastructure and still enjoy customized choices, rather than suffering from “one size fits all.” Experiments with Ursa Minor show performance penalties as high as 2–3× for workloads using poorly-matched choices. Experiments also show that a single cluster supporting multiple workloads is much more efficient when the choices are specialized rather than forced to use a “one size fits all” configuration.
FAST
Ursa Minor: Versatile cluster-based storage
Abd-El-Malek, Michael,
Courtright II, William V,
Cranor, Chuck,
Ganger, Gregory R,
Hendricks, James,
Klosterman, Andrew J,
Mesnier, Michael P,
Prasad, Manish,
Salmon, Brandon,
and Sambasivan, Raja R
In USENIX Conference on File and Storage Technologies
2005
No single encoding scheme or fault model is optimal for all data. A versatile storage system allows them to be matched to access patterns, reliability requirements, and cost goals on a per-data item basis. Ursa Minor is a cluster-based storage system that allows data-specific selection of, and on-line changes to, encoding schemes and fault models. Thus, different data types can share a scalable storage infrastructure and still enjoy specialized choices, rather than suffering from “one size fits all.” Experiments with Ursa Minor show performance benefits of 2–3× when using specialized choices as opposed to a single, more general, configuration. Experiments also show that a single cluster supporting multiple workloads simultaneously is much more efficient when the choices are specialized for each distribution rather than forced to use a “one size fits all” configuration. When using the specialized distributions, aggregate cluster throughput nearly doubled.
MASCOTS
Replication policies for layered clustering of NFS servers
Sambasivan, Raja R.,
Klosterman, Andrew J,
and Ganger, Gregory R.
In IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems
2005
Layered clustering offers cluster-like load balancing for unmodified NFS or CIFS servers. Read requests sent to a busy server can be offloaded to other servers holding replicas of the accessed files. This paper explores a key design question for this approach: which files should be replicated? We find that the popular policy of replicating read-only files offers little benefit. A policy that replicates read-only portions of read-mostly files, however, implicitly coordinates with client cache invalidations and thereby allows almost all read operations to be offloaded. In a read-heavy trace, 75% of all operations and 52% of all data transfers can be offloaded.
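The winning policy is easy to sketch: track which (file, block) pairs of read-mostly files have been read, offload reads of replicated blocks, and invalidate a block's replicas on any write, mirroring client cache invalidation. Structures are simplified for illustration.

```python
class ReplicationPolicy:
    """Replicate read-only portions of read-mostly files."""

    def __init__(self):
        self.replicated = set()          # (file, block) pairs with replicas

    def on_read(self, file, block, is_read_mostly):
        if is_read_mostly(file):
            self.replicated.add((file, block))
        return (file, block) in self.replicated  # True => offloadable

    def on_write(self, file, block):
        self.replicated.discard((file, block))    # invalidate replicas
```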
TechReport
Selected project reports, Spring 2005 Advanced OS & Distributed Systems (15-712)
This technical report contains six final project reports contributed by participants in CMU’s Spring 2005 Advanced Operating Systems and Distributed Systems course (15-712), offered by professor Garth Gibson. This course examines the design and analysis of various aspects of operating systems and distributed systems through a series of background lectures, paper readings, and group projects. Projects were done in groups of two or three and required some kind of implementation and evaluation pertaining to the classroom material, with the topic of each project left up to the group. Final reports were held to the standard of a systems conference paper submission, a standard well met by the majority of completed projects. Some of the projects will be extended for future submissions to major systems conferences.
The reports that follow cover a broad range of topics. These reports present a characterization of synchronization behavior and overhead in commercial databases, and a hardware-based lock predictor based on the characterization; design and implementation of a partitioned protocol offload architecture that provides Direct Data Placement (DDP) functionality and better utilizes both the network interface and the host CPU; design and implementation of file indexing inside file systems for fast content searching support; comparison-based server verification techniques for stateful and semi-deterministic protocols such as NFSv4; data-plane protection techniques for link-state routing protocols such as OSPF, which is resilient to the existence of compromised routers; and performance comparison of in-band and out-of-band data access strategies in file systems.
While not all of these reports conclude definitively and positively, all are worth reading because they involve novelty in the systems explored and bring forth interesting research questions.