Suchen und Finden
Service
Software Automatic Tuning - From Concepts to State-of-the-Art Results
Ken Naono, Keita Teranishi, John Cavazos, Reiji Suda
Verlag Springer-Verlag, 2010
ISBN 9781441969354 , 377 Seiten
Format PDF
Kopierschutz Wasserzeichen
Software Automatic Tuning
3
Preface
5
Contents
7
Contributors
11
Part I Introduction
15
Chapter1 Software Automatic Tuning: Concepts and State-of-the-Art Results
16
1.1 Software Automatic Tuning: General Concepts
17
1.2 Software Automatic Tuning: State-of-the-Art
22
Part II Achievements in Scientific Computing
29
Chapter2 ATLAS Version 3.9: Overview and Status*
30
2.1 Introduction
30
2.1.1 Basic AEOS Requirements
32
2.1.2 Methods of Software Adaptation
33
2.2 ATLAS Overview
34
2.2.1 Explicit LAPACK Support In ATLAS
35
2.2.2 Dense Level 3 BLAS Support in ATLAS
36
2.2.2.1 Using Assembly in ATLAS
38
2.2.3 Dense Level 2 BLAS Support in ATLAS
39
2.2.4 Dense Level 1 BLAS Support in ATLAS
39
2.3 Status
40
References
40
Chapter3 Autotuning Method for Deciding Block Size Parameters in Dynamically Load-Balanced BLAS
44
3.1 Introduction
44
3.2 Background
45
3.2.1 BLAS
46
3.2.2 DL-BLAS
46
3.2.3 Motivation of the present Research
48
3.3 Proposed Methods
49
3.3.1 Parallel Efficiency Analysis
49
3.3.2 Diagonal Search
51
3.3.3 Reductive Search
51
3.3.4 Parameter Selection
53
3.4 Experiments
53
3.4.1 Environments
53
3.4.2 Performance of Diagonal Search
54
3.4.3 Analysis and Discussion of Diagonal Search
55
3.4.4 Performance Evaluation Tests
56
3.5 Conclusion
57
References
59
Chapter4 Automatic Tuning for Parallel FFTs
60
4.1 Introduction
60
4.2 Parallel One-Dimensional FFT
61
4.2.1 A Block Nine-Step FFT Algorithm
61
4.2.2 A Parallel One-Dimensional FFT Algorithm Based on the Block Nine-Step FFT
63
4.3 Parallel Multi-Dimensional FFT
65
4.3.1 Parallel Two-Dimensional FFT
65
4.3.2 Parallel Three-Dimensional FFT
66
4.4 Automatic Tuning of Parallel FFTs
68
4.4.1 Automatic Tuning of All-to-All Communication
68
4.4.2 Selection of the Radices
69
4.4.3 Selection of the Block Size
69
4.5 Performance Results
70
4.5.1 Performance of All-to-All Communication
72
4.5.2 Performance of One-Dimensional FFT
72
4.5.3 Performance of Multi-Dimensional FFTs
75
4.6 Conclusion
76
References
78
Chapter5 Dynamic Programming Approaches to Optimizing the Blocking Strategy for Basic Matrix Decompositions
79
5.1 Introduction
79
5.2 Basics of the Householder QR Algorithm
81
5.2.1 The Householder QR Algorithm
81
5.2.2 Blocking Strategies for the Householder QR Algorithm
81
5.2.2.1 Fixed-size Blocking
82
5.2.2.2 Recursive Blocking
82
5.2.3 Another Blocking Strategy: The TSQR Algorithm
83
5.3 Dynamic Programming Approaches to Optimizing the Blocking Strategy
84
5.3.1 Limitations of Conventional Blocking Strategies
84
5.3.2 Approaches for the Sequential Case
85
5.3.2.1 Variable-size Blocking
85
5.3.2.2 Generalized Recursive Blocking
87
5.3.3 Approaches for the Parallel Case
89
5.3.3.1 Variable-size Blocking for Distributed Memory Machines
89
5.3.3.2 Combination of Variable-size Blocking and TSQR
89
5.4 Future Directions
92
5.4.1 Use of Task Level Parallelism
92
5.4.2 Extension to Heterogeneous Architectures
92
5.4.3 Online Refinement of the Performance Models and the Blocking Strategy
92
5.4.4 Application to Other Linear Algebra Algorithms
93
5.5 Conclusion
93
References
93
Chapter6 Automatic Tuning of the Division Number in the Multiple Division Divide-and-Conquer for Real Symmetric Eigenproblem
96
6.1 Introduction
96
6.2 DCk Algorithm and Deflation
97
6.3 Deflation Occurrence Rate (DOR)
99
6.4 Proposal of Algorithm to Estimate the Optimal Division Number
102
6.5 Numerical Experiment
103
6.6 Summary
109
References
109
Chapter7 Automatically Tuned Mixed-Precision Conjugate Gradient Solver
111
7.1 Introduction
111
7.2 Iterative Refinement
113
7.3 Choosing the Target Residual Reduction for the Inner Solver
115
7.4 Stopping the Iterative Refinement
119
7.4.1 Stopping Based on Threshold
119
7.4.2 Stopping Based on Condition Number Estimation
121
7.4.3 Combined Stopping Method
122
7.5 Condition Number Estimation
123
7.6 Conclusions
126
References
127
Chapter8 Automatically Tuned Sparse Eigensolvers
128
8.1 Introduction
128
8.2 Background: AT Research Trends
129
8.2.1 Definition of AT
129
8.2.2 Six Studies on AT
130
8.2.3 AT Research Trends
131
8.3 Proposal of AT-Restarted-Lanczos
133
8.3.1 Restarted-Lanczos and Its Problem
133
8.3.2 Preliminary Experiment
133
8.3.3 Proposal of the Run-Time Automatic Tuning Algorithm
135
8.3.4 Evaluations
137
8.4 Conclusions
139
References
139
Chapter9 Systematic Performance Evaluation of Linear Solvers Using Quality Control Techniques
141
9.1 Introduction
141
9.2 Systematic Performance Evaluation and Systematic Analyses for Numerical Computation Algorithms
142
9.2.1 Systematic Performance Evaluation of Solution Algorithms
142
9.2.2 Construction of the Performance Information DB on the Solution Algorithms
143
9.3 Quality Control in the Solution Process
145
9.3.1 Quality Control Techniques
145
9.3.2 Systematic Performance Evaluation Using Survey Charts
148
9.4 Analysis of Causes of Poor Solution Quality
152
9.4.1 Preconditioning of Solution Algorithm for Linear Equations
154
9.4.2 Residual Vector and Convergence Criterion for Each Preconditioning System
155
9.5 Concluding Remarks
156
References
158
Chapter 10 Application of Alternating Decision Trees in Selecting Sparse Linear Solvers*
159
10.1 Introduction
159
10.2 Machine Learning Methods
161
10.2.1 Adaboost
162
10.2.2 Alternating Decision Trees
163
10.3 Solver Selection as a Classification Problem
164
10.3.1 Feature Selection and Computation
166
10.4 Applying Machine Learning
167
10.5 Experimental Results
168
10.5.1 Experiment 1: Classification on Linear Systems Generated from PDE-based Simulation Code
169
10.5.2 Experiment 2: Classification on Only Coefficient Matrices
171
10.6 Conclusions and Future Plans
175
References
177
Chapter11 Toward Automatic Performance Tuning for Numerical Simulations in the SILC Matrix Computation Framework
180
11.1 Introduction
180
11.2 The SILC Matrix Computation Framework
182
11.3 Performance Modeling
184
11.4 Experiments
185
11.4.1 Cloth Simulation
186
11.4.2 Fluid Simulation
189
11.4.3 Simple Initial Value Problem
190
11.5 Automatic Performance Tuning
193
11.5.1 Autotuning Mechanism for Numerical Simulations in SILC
193
11.5.2 Related Work and Discussions
195
11.6 Concluding Remarks
195
References
196
Chapter12 Exploring Tuning Strategies for Quantum Chemistry Computations
198
12.1 Introduction
198
12.2 Related Works
200
12.3 Methodology
200
12.4 Performance Results and Observations
203
12.5 Tuning Strategy
208
12.6 Conclusions and Future Work
211
References
211
Chapter13 Automatic Tuning of CUDA Execution Parameters for Stencil Processing
214
13.1 Introduction
214
13.2 Related Work
216
13.3 Performance Tuning for CUDA
217
13.4 Automatic Performance Tuning of CUDA Execution Parameters
219
13.4.1 Parameter Exploration Space
220
13.4.2 Auto-Tuning Mechanism
223
13.5 Performance Evaluation
225
13.5.1 Automatic Tuning on Himeno Benchmark
225
13.5.2 Automatic Tuning on LU Decomposition
229
13.6 Conclusions
232
References
232
Chapter14 Static Task Cluster Size Determination in Homogeneous Distributed Systems
234
14.1 Introduction
234
14.2 Problem Definition and Assumptions
236
14.2.1 Assumed Model
236
14.2.2 Problems in Conventional Cluster Merging
237
14.2.3 Requirements for Reducing Both the Schedule Length and the Number of Processors
238
14.3 Definition of an Indicative Value for Schedule Length
239
14.3.1 Abstract of Worst Schedule Length
239
14.3.2 WSL Definition
239
14.4 Derivation of Cluster Size for Minimizing WSL
240
14.4.1 An Upper Bound of WSL
240
14.4.2 Derivation of dopt
246
14.4.3 Relationship Between WSLs and Schedule Length
247
14.4.4 Requirements for Minimizing WSL in a Task Clustering
248
14.5 Task Clustering
248
14.5.1 Overview of the Algorithm
248
14.5.2 Cluster Selection Policies
249
14.6 Experimental Evaluation
251
14.6.1 Comparison Targets
251
14.6.2 Simulation Setup
251
14.6.3 Comparison of WSLS and Schedule Length in the Fixed Number of Processors
252
14.6.4 Processor Utilization
253
14.6.5 Comparison of WSLS and Schedule Length of Real Application DAGs
254
14.6.6 Discussion
255
14.7 Conclusion and Future Works
256
References
256
Part III Evolution to a General Paradigm
258
Chapter15 Algorithmic Parameter Optimization of the DFO Method with the OPAL Framework
259
15.1 Introduction
259
15.2 Optimization of Algorithmic Parameters
262
15.2.1 Black Box Construction
262
15.2.2 Direct Search Algorithms
264
15.3 The OPAL Package
265
15.3.1 The OPAL Structure
266
15.3.2 Usage of OPAL
266
15.4 Application to Derivative-Free Optimization
268
15.4.1 General Description of DFO
268
15.4.2 Two DFO Parameter Optimization Problems
268
15.4.3 Numerical Results
272
15.5 Discussion
276
References
277
Chapter16 A Bayesian Method of Online Automatic Tuning
279
16.1 Introduction
279
16.2 Automatic Tuning Concepts
280
16.3 Review of the Proposed Methods
281
16.3.1 Online Automatic Tuning
282
16.3.2 Tuning Based on Cost Function Modeling
283
16.3.3 Bayesian Analysis to Treat Uncertainties
284
16.3.4 Bayesian Suboptimal Sequential Experimental Design
286
16.3.5 Desirable Properties of Mathematical Methods for Automatic Tuning
288
16.3.6 Infinite Dilution
291
16.3.7 Pseudocode of the Proposed Method
292
16.4 Evaluation
293
16.4.1 Infinite Dilution with Random Sampling
293
16.4.2 Performance Evaluation
293
16.5 Conclusion
296
References
297
Chapter17 ABCLibScript: A Computer Language for Automatic Performance Tuning
298
17.1 Introduction
298
17.2 The FIBER Framework
299
17.2.1 FIBER Framework Basics
299
17.2.2 The Two Users and Auto-tuning Software Construction Assumption
299
17.2.2.1 Software Developer Phase
300
17.2.2.2 Software User (End-User) Phase
301
17.3 The ABCLibScript
301
17.4 Examples of ABCLibScript Descriptions
304
17.4.1 Software Developer API
304
17.4.2 End-User API: Before Execute-Time Auto-tuning
305
17.4.3 Other Programming Examples for Software Developers Using ABCLibScript
306
17.4.3.1 Install-Time Auto-tuning
306
17.4.3.2 Before Execute-Time Auto-tuning
308
17.4.3.3 Run-Time Auto-tuning
309
17.5 Future Direction of ABCLibScript
310
17.5.1 Evaluation of Compiler Optimization and Hardware Performance
310
17.5.2 Adapting ABCLibScript to Embedded Systems
312
17.5.3 Adapting ABCLibScript to Low-Power Optimization
314
17.6 Concluding Remarks
315
References
316
Chapter18 Automatically Tuning Task-Based Programs for Multicore Processors
317
18.1 Introduction
317
18.2 Example
318
18.2.1 Task Extensions
319
18.2.2 Execution
319
18.2.3 Scheduling for Multicore Processor
320
18.3 Implementation Synthesis
322
18.3.1 Dependence Analysis
323
18.3.2 Profiling
323
18.3.3 Candidate Implementation Synthesis
323
18.3.4 Simulation Stage
325
18.3.5 Directed-Simulated Annealing
326
18.3.5.1 Generation of Optimized Implementation
327
18.3.6 Dynamic Scheduling
328
18.4 Evaluation
328
18.4.1 Benchmarks
329
18.4.2 Accuracy of Scheduling Simulator
330
18.4.3 Optimality of Directed Simulated Annealing
331
18.4.4 Generality of Synthesized Implementation
332
18.4.5 Overhead of Bamboo
333
18.5 Related Work
334
18.6 Conclusion
335
References
335
Chapter19 Efficient Program Compilation Through Machine Learning Techniques
337
19.1 Introduction
337
19.1.1 Motivation
338
19.2 Design and Implementation
339
19.2.1 Preparing for Data Collection
339
19.2.2 Gathering Training Data
342
19.2.3 Learning Process
344
19.2.4 Deployment Process
345
19.3 Experimental Setup
346
19.4 Results and Discussion
346
19.4.1 Quality of Training Data
347
19.4.2 Classifier Evaluation
348
19.4.3 Other Benchmarks
349
19.5 Related Work
350
19.6 Conclusions and Future Work
352
19.7 Acknowledgments and Trademarks
352
References
352
Chapter20 Autotuning and Specialization: Speeding up Matrix Multiply for Small Matrices with Compiler Technology
354
20.1 Introduction
354
20.2 Compiler Technology: Autotuning and Specialization
356
20.2.1 Optimization Strategy
358
20.2.2 Specialization and Transformation Using CHiLL
359
20.2.3 Autotuning: Heuristics for Pruning the Search Space
361
20.2.4 Building the Library
362
20.3 Experiments
363
20.3.1 Experimental Environment
364
20.3.2 Generating the Library
364
20.3.3 Evaluating the Pruning Heuristics
365
20.3.4 Performance Results
367
20.4 Related Work
369
20.5 Conclusion
370
References
370
Index
372