Monitoring Large Systems Via Statistical Sampling
DEPARTMENT OF COMPUTER SCIENCE, UNIVERSITY OF ILLINOIS URBANA, USA
As parallel systems scale toward petaflop performance, driven by advances in circuit density and by an increasingly available computational Grid, the development of efficient mechanisms for monitoring large systems becomes imperative. When computational components are coupled via dynamically shifting connections with various remote resources, the number of potential factors affecting system behavior is enormous. Yet the overhead of monitoring every component can be prohibitive. In this paper we present a new technique for monitoring large systems based on statistical sampling. Rather than monitoring each component, we select a statistically valid sample and measure the behavior of the sample members. We describe the formal requirements of sample selection and verify the feasibility of our approach with experiments on large parallel systems and wide-area networks. Our results show that this technique can be a powerful tool for effective monitoring without incurring the large costs typically associated with exhaustive checking.
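The idea of measuring a sample instead of every component can be illustrated with a minimal sketch. The abstract does not state the paper's formal sample-selection requirements, so the code below assumes the standard textbook formula for estimating a proportion with a finite population correction. The function names, the `is_overloaded` predicate, and the default margin and confidence values are all hypothetical and chosen only for illustration.

```python
import math
import random
from statistics import NormalDist


def required_sample_size(population, margin, confidence=0.95, p=0.5):
    """Sample size for estimating a proportion within +/- margin.

    Assumption: textbook normal-approximation formula with finite
    population correction, not necessarily the paper's own criterion.
    p=0.5 is the worst case (largest variance).
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    n0 = z * z * p * (1 - p) / margin ** 2      # infinite-population size
    n = n0 / (1 + (n0 - 1) / population)        # finite population correction
    return min(population, math.ceil(n))


def estimate_proportion(nodes, predicate, margin=0.05, confidence=0.95, seed=None):
    """Estimate the fraction of nodes satisfying predicate from a random sample."""
    rng = random.Random(seed)
    n = required_sample_size(len(nodes), margin, confidence)
    sample = rng.sample(nodes, n)               # simple random sample, no replacement
    hits = sum(1 for node in sample if predicate(node))
    return hits / n, n


if __name__ == "__main__":
    # Hypothetical system: 10,000 node ids, every tenth node "overloaded".
    nodes = list(range(10_000))
    is_overloaded = lambda node: node % 10 == 0
    estimate, sampled = estimate_proportion(nodes, is_overloaded, seed=1)
    print(f"sampled {sampled} of {len(nodes)} nodes, estimate = {estimate:.3f}")
```

Under these assumptions, a system of 10,000 components needs only 370 sampled components to estimate a proportion to within 5 percentage points at 95% confidence, which suggests why sampling can cost far less than exhaustive checking.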
Key Words: large systems, statistical sampling, performance monitoring
International Journal of High Performance Computing Applications, Vol. 18, No. 2, 267-277 (2004)