|
Sign In to gain access to subscriptions and/or personal tools.
|
International Journal of High Performance Computing Applications, Vol. 18, No. 1,
95-113 (2004)
DOI: 10.1177/1094342004041294
Sparse Tiling for Stationary Iterative Methods
Michelle Mills Strout
ARGONNE NATIONAL LABORATORY
Larry Carter
Jeanne Ferrante
UNIVERSITY OF CALIFORNIA, SAN DIEGO
Barbara Kreaseck
LA SIERRA UNIVERSITY
In modern computers, a programs data locality can affect performance significantly. This paper details full sparse tiling, a run-time reordering transformation that improves the data locality for stationary iterative methods such as GaussSeidel operating on sparse matrices. In scientific applications such as finite element analysis, these iterative methods dominate the execution time. Full sparse tiling chooses a permutation of the rows and columns of the sparse matrix, and then an order of execution that achieves better data locality. We prove that full sparsetiled GaussSeidel generates a solution that is bitwise identical to traditional GaussSeidel on the permuted matrix. We also present measurements of the performance improvements and the overheads of full sparse tiling and of cache blocking for irregular grids, a related technique developed by Douglas et al.
Key Words: tiling iterative alogorithms static and dynamic analysis irregular grids data locality sparse matrix computer architecture
References
- Adams, M. F. 2001. A distributed memory unstructured Gauss Seidel algorithm for multigrid smoothers . In Proceedings of 2001 Conference on Supercomputing, Denver, CO, ACM/ IEEE, New York.
- Ahmed, N., Mateev, N., and Pingali, K. 2000a. Synthesizing transformations for locality enhancement of imperfectlynested loop nests . In Proceedings of the 2000 ACM International Conference on Supercomputing (ICS), pp. 141152 .
- Ahmed, N., Mateev, N., and Pingali, K. 2000b. Synthesizing transformations for locality enhancement of imperfectlynested loop nests . In Conference Proceedings of the 2000 International Conference on Supercomputing, Santa Fe, NM, ACM SIGARCH, pp. 141152 .
- Barrett, R., Berry, M., Chan, T. F., Demmel, J., Donato, J., Dongarra, J., Eijkhout, V., Pozo, R., Romine, C., and der Vorst, H. V. 1994. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd edition, SIAM, Philadelphia, PA .
- Bassetti, F., Davis, K., and Quinlan, D. 1998. Optimizing transformations of stencil operations for parallel object-oriented scientific frameworks on cache-based architectures. Lecture Notes in Computer Science, Vol. 1505, Spinger-Verlag, Berlin .
- Carr, S. and Kennedy, K. 1992. Compiler blockability of numerical algorithms . In Proceedings of 1992 Conference on Supercomputing, Minneapolis, MN, IEEE Computer Society Press, Los Alamitos, CA, pp. 114124
- Das, R., Uysal, M., Saltz, J., and Hwang, Y.-S. S. 1994. Communication optimizations for irregular scientific computations on distributed memory architectures . Journal of Parallel and Distributed Computing 22(3): 462478 .[CrossRef]
- Ding, C. and Kennedy, K. 1999. Improving cache performance in dynamic applications through data and computation reorganization at run time . In Proceedings of the ACM SIGPLAN 1999 Conference on Programming Language Design and Implementation, Atlanta, GA, ACM SIGPLAN Notices, pp. 229241 .
- Douglas, C. C., Hu, J., Kowarschik, M., Rüde, U., and Weiss, C. 2000. Cache o ptimization for structured and unstructured grid multigrid . Electronic Transaction on Numerical Analysis 10: 2140 .
- Gannon, D., Jalby, W., and Gallivan, K. 1988. Strategies for cache and local memory management by global program transformation . Journal of Parallel and Distributed Computing 5(5): 587616 .[CrossRef]
- Garey, M. R., Johnson, D. S., and Stockmeyer, L. 1976. Some simplified NP-complete graph problems . Theoretical Computer Science 1: 237267 .[CrossRef]
- Gatlin, K. S. and Carter, L. 1999. Architecture-cognizant divide and conquer algorithms . In Proceedings of 1999 Conference on Supercomputing, Portland, OR, ACM/IEEE, New York.
- Han, H. and Tseng, C.-W. 2000. Efficient compiler and runtime support for parallel irregular reductions . Parallel Computing 26(1314): 18611887 .[CrossRef]
- Hoare, C. A. R. 1969. An axiomatic basis of computer programming . Communications of the ACM 12: 576580 .[CrossRef][ISI]
- Im, E.-J. 2000. Optimizing the performance of sparse matrix vector multiply. Ph.D. Thesis, University of California, Berkeley.
- Im, E.-J. and Yelick, K. 2001. Optimizing sparse matrix computations for register reuse in sparsity. In Computational Science ICCS 2001, San Jose, CA, V. N. Alexandrov et al., editors, Lecture Notes in Computer Science, Springer-Verlag, Berlin , pp. 127136.
- Irigoin, F. and Triolet, R. 1988. Supernode partitioning . In Proceedings of the 15th Annual ACM SIGPLAN Symposium on Principles of Programming Languages, San Diego, CA, pp. 319329 .
- Jin, G., Mellor-Crummey, J., and Fowler, R. 2001. Increasing temporal locality with skewing and recursive blocking . In Proceedings of 2001 ACM/IEEE Conference on Supercomputing, Denver, CO.
- Karypis, G. and Kumar, V. 1998. Multilevel k-way partitioning scheme for irregular graphs . Journal of Parallel and Distributed Computing 48(1): 96129 .[CrossRef][ISI]
- Kelly, W. and Pugh, W. 1994. Finding legal reordering transformations using mappings . In Proceedings of the 7th International Workshop on Languages and Compilers for Parallel Computing, Ithaca, NY, K. Pingali et al., editors, LNCS Vol. 892, Springer-Verlag, Berlin, pp. 107124 .
- Kelly, W. and Pugh, W. 1995. A unifying framework for iteration reordering transformations, Technical Report CSTR-3430, University of Maryland, College Park .
- Kodukula, I., Ahmed, N., and Pingali, K. 1997. Data-centric multi-level blocking . In Proceedings of the 1997 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), Las Vegas, NV, pp. 346357 .
- London, K., Dongarra, J., Moore, S., Mucci, P., Seymour, K., and Spencer, T. 2001. End-user tools for application performance analysis using hardware counters . In International Conference on Parallel and Distributed Computing Systems, Kyongju City, Korea.
- McKinley, K. S., Carr, S., and Tseng, C.-W. 1996. Improving data locality with loop transformations . ACM Transactions on Programming Languages and Systems 18(4): 424453 .[CrossRef]
- Mellor-Crummey, J., Whalley, D., and Kennedy, K. 2001. Improving memory hierarchy performance for irregular applications using data and computation reorderings . International Journal of Parallel Programming 29(3): 217247 .[CrossRef]
- Mirchandaney, R., Saltz, J. H., Smith, R. M., Nicol, D. M., and Crowley, K. 1988. Principles of run-time support for parallel processors . In Proceedings of the 1988 ACM International Conference on Supercomputing (ICS), Saint Malo, France, pp. 140152 .
- Mitchell, N., Carter, L., and Ferrante, J. 1999. Localizing nonaffine array references . In Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques, Newport Beach, CA, pp. 192202 .
- Pugh, W. and Rosser, E. 1999. Iteration space slicing for locality . In Proceedings of the 12th Workshop on Languages and Compilers for Parallel Computing (LCPC) San Diego, CA.
- Pugh, W. and Shpeisman, T. 1998. SIPR: A new framework for generating efficient code for sparse matrix computations . In Proceedings of the 11th International Workshop on Languages and Compilers for Parallel Computing (LCPC) Chapel Hill, NC.
- Pugh, B. and Wonnacott, D. 1994. Nonlinear array dependence analysis, Technical Report CS-TR-3372, Department of Computer Science, University of Maryland .
- Rivera, G. and Tseng, C.-W. 2000. Tiling optimizations for 3D scientific computations . In Proceedings of 2000 Conference on Supercomputing, Dallas, TX, ACM/IEEE, New York, pp. 6061 .
- Sellappa, S. and Chatterjee, S. 2001. Cache-efficient multigrid algorithms. In Computational Science ICCS 2001, San Jose, CA, V. N. Alexandrov et al., editors, Lecture Notes in Computer Science, Springer-Verlag, Berlin , pp. 107116.
- Song, Y. and Li, Z. 1999. New tiling techniques to improve cache temporal locality . In Proceedings of the ACM SIGPLAN 99 Conference on Programming Language Design and Implementation, Atlanta, GA, pp. 215228 .
- Strout, M. M. 2003. Performance transformations for irregular applications, Ph.D. Thesis, University of California, San Diego.
- Strout, M. M., Carter, L., and Ferrante, J. 2001. Rescheduling for locality in sparse matrix computations. In Computational Science ICCS 2001, San Jose, CA, V. N. Alexandrov et al., editors, Lecture Notes in Computer Science, Springer-Verlag, Berlin .
- Strout, M. M., Carter, L., Ferrante, J., Freeman, J., and Kreaseck, B. 2002. Combining performance aspects of irregular GaussSeidel via sparse tiling . In Proceedings of the 15th Workshop on Languages and Compilers for Parallel Computing (LCPC) College Park, MD.
- Wolf, M. E. and Lam, M. S. 1991. A data locality optimizing algorithm . In Proceedings of ACM SIGPLAN 1991 Conference on Programming Language Design and Implementation, Toronto, Canada, ACM SIGPLAN Notices, Vol. 26, pp. 3044 .
- Wolfe, M. J. 1987. Iteration space tiling for memory hierarchies . In Proceedings of the 3rd Conference on Parallel Processing for Scientific Computing, Los Angeles, CA, SIAM, Philadelphia, PA, pp. 357361 .
- Wolfe, M. 1989. More iteration space tiling . In Proceedings of 1989 Conference on Supercomputing, Reno, NV, ACM, New York, pp. 655664 .
- Wonnacott, D. 2002. Achieving scalable locality with time skewing . International Journal of Parallel Programming 30(3): 181221 .
- Wu, C. H. 1990. A multicolour SOR method for the finite-element method . Journal of Computational and Applied Mathematics 30(3): 283294 .[CrossRef]

CiteULike Connotea Del.icio.us Digg Reddit Technorati What's this?
|