Inhaltsverzeichnis von Software Automatic Tuning - From Concepts to State-of-the-Art Results vom Springer-Verlag

Suchen und Finden

Service

Infos und Kontakt

Software Automatic Tuning - From Concepts to State-of-the-Art Results

Ken Naono, Keita Teranishi, John Cavazos, Reiji Suda

Verlag Springer-Verlag, 2010

ISBN 9781441969354 , 377 Seiten

Format PDF

Kopierschutz Wasserzeichen

Geräte

Kapitelübersicht
Kurzinformation
Inhaltsverzeichnis
Leseprobe
Blick ins Buch
Fragen zum eBook

Software Automatic Tuning: 3
Preface: 5
Contents: 7
Contributors: 11
Part I Introduction: 15
Chapter1 Software Automatic Tuning: Concepts and State-of-the-Art Results: 16
1.1 Software Automatic Tuning: General Concepts: 17
1.2 Software Automatic Tuning: State-of-the-Art: 22
Part II Achievements in Scientific Computing: 29
Chapter2 ATLAS Version 3.9: Overview and Status*: 30
2.1 Introduction: 30
2.1.1 Basic AEOS Requirements: 32
2.1.2 Methods of Software Adaptation: 33
2.2 ATLAS Overview: 34
2.2.1 Explicit LAPACK Support In ATLAS: 35
2.2.2 Dense Level 3 BLAS Support in ATLAS: 36
2.2.2.1 Using Assembly in ATLAS: 38
2.2.3 Dense Level 2 BLAS Support in ATLAS: 39
2.2.4 Dense Level 1 BLAS Support in ATLAS: 39
2.3 Status: 40
References: 40
Chapter3 Autotuning Method for Deciding Block Size Parameters in Dynamically Load-Balanced BLAS: 44
3.1 Introduction: 44
3.2 Background: 45
3.2.1 BLAS: 46
3.2.2 DL-BLAS: 46
3.2.3 Motivation of the present Research: 48
3.3 Proposed Methods: 49
3.3.1 Parallel Efficiency Analysis: 49
3.3.2 Diagonal Search: 51
3.3.3 Reductive Search: 51
3.3.4 Parameter Selection: 53
3.4 Experiments: 53
3.4.1 Environments: 53
3.4.2 Performance of Diagonal Search: 54
3.4.3 Analysis and Discussion of Diagonal Search: 55
3.4.4 Performance Evaluation Tests: 56
3.5 Conclusion: 57
References: 59
Chapter4 Automatic Tuning for Parallel FFTs: 60
4.1 Introduction: 60
4.2 Parallel One-Dimensional FFT: 61
4.2.1 A Block Nine-Step FFT Algorithm: 61
4.2.2 A Parallel One-Dimensional FFT Algorithm Based on the Block Nine-Step FFT: 63
4.3 Parallel Multi-Dimensional FFT: 65
4.3.1 Parallel Two-Dimensional FFT: 65
4.3.2 Parallel Three-Dimensional FFT: 66
4.4 Automatic Tuning of Parallel FFTs: 68
4.4.1 Automatic Tuning of All-to-All Communication: 68
4.4.2 Selection of the Radices: 69
4.4.3 Selection of the Block Size: 69
4.5 Performance Results: 70
4.5.1 Performance of All-to-All Communication: 72
4.5.2 Performance of One-Dimensional FFT: 72
4.5.3 Performance of Multi-Dimensional FFTs: 75
4.6 Conclusion: 76
References: 78
Chapter5 Dynamic Programming Approaches to Optimizing the Blocking Strategy for Basic Matrix Decompositions: 79
5.1 Introduction: 79
5.2 Basics of the Householder QR Algorithm: 81
5.2.1 The Householder QR Algorithm: 81
5.2.2 Blocking Strategies for the Householder QR Algorithm: 81
5.2.2.1 Fixed-size Blocking: 82
5.2.2.2 Recursive Blocking: 82
5.2.3 Another Blocking Strategy: The TSQR Algorithm: 83
5.3 Dynamic Programming Approaches to Optimizing the Blocking Strategy: 84
5.3.1 Limitations of Conventional Blocking Strategies: 84
5.3.2 Approaches for the Sequential Case: 85
5.3.2.1 Variable-size Blocking: 85
5.3.2.2 Generalized Recursive Blocking: 87
5.3.3 Approaches for the Parallel Case: 89
5.3.3.1 Variable-size Blocking for Distributed Memory Machines: 89
5.3.3.2 Combination of Variable-size Blocking and TSQR: 89
5.4 Future Directions: 92
5.4.1 Use of Task Level Parallelism: 92
5.4.2 Extension to Heterogeneous Architectures: 92
5.4.3 Online Refinement of the Performance Models and the Blocking Strategy: 92
5.4.4 Application to Other Linear Algebra Algorithms: 93
5.5 Conclusion: 93
References: 93
Chapter6 Automatic Tuning of the Division Number in the Multiple Division Divide-and-Conquer for Real Symmetric Eigenproblem: 96
6.1 Introduction: 96
6.2 DCk Algorithm and Deflation: 97
6.3 Deflation Occurrence Rate (DOR): 99
6.4 Proposal of Algorithm to Estimate the Optimal Division Number: 102
6.5 Numerical Experiment: 103
6.6 Summary: 109
References: 109
Chapter7 Automatically Tuned Mixed-Precision Conjugate Gradient Solver: 111
7.1 Introduction: 111
7.2 Iterative Refinement: 113
7.3 Choosing the Target Residual Reduction for the Inner Solver: 115
7.4 Stopping the Iterative Refinement: 119
7.4.1 Stopping Based on Threshold: 119
7.4.2 Stopping Based on Condition Number Estimation: 121
7.4.3 Combined Stopping Method: 122
7.5 Condition Number Estimation: 123
7.6 Conclusions: 126
References: 127
Chapter8 Automatically Tuned Sparse Eigensolvers: 128
8.1 Introduction: 128
8.2 Background: AT Research Trends: 129
8.2.1 Definition of AT: 129
8.2.2 Six Studies on AT: 130
8.2.3 AT Research Trends: 131
8.3 Proposal of AT-Restarted-Lanczos: 133
8.3.1 Restarted-Lanczos and Its Problem: 133
8.3.2 Preliminary Experiment: 133
8.3.3 Proposal of the Run-Time Automatic Tuning Algorithm: 135
8.3.4 Evaluations: 137
8.4 Conclusions: 139
References: 139
Chapter9 Systematic Performance Evaluation of Linear Solvers Using Quality Control Techniques: 141
9.1 Introduction: 141
9.2 Systematic Performance Evaluation and Systematic Analyses for Numerical Computation Algorithms: 142
9.2.1 Systematic Performance Evaluation of Solution Algorithms: 142
9.2.2 Construction of the Performance Information DB on the Solution Algorithms: 143
9.3 Quality Control in the Solution Process: 145
9.3.1 Quality Control Techniques: 145
9.3.2 Systematic Performance Evaluation Using Survey Charts: 148
9.4 Analysis of Causes of Poor Solution Quality: 152
9.4.1 Preconditioning of Solution Algorithm for Linear Equations: 154
9.4.2 Residual Vector and Convergence Criterion for Each Preconditioning System: 155
9.5 Concluding Remarks: 156
References: 158
Chapter 10 Application of Alternating Decision Trees in Selecting Sparse Linear Solvers*: 159
10.1 Introduction: 159
10.2 Machine Learning Methods: 161
10.2.1 Adaboost: 162
10.2.2 Alternating Decision Trees: 163
10.3 Solver Selection as a Classification Problem: 164
10.3.1 Feature Selection and Computation: 166
10.4 Applying Machine Learning: 167
10.5 Experimental Results: 168
10.5.1 Experiment 1: Classification on Linear Systems Generated from PDE-based Simulation Code: 169
10.5.2 Experiment 2: Classification on Only Coefficient Matrices: 171
10.6 Conclusions and Future Plans: 175
References: 177
Chapter11 Toward Automatic Performance Tuning for Numerical Simulations in the SILC Matrix Computation Framework: 180
11.1 Introduction: 180
11.2 The SILC Matrix Computation Framework: 182
11.3 Performance Modeling: 184
11.4 Experiments: 185
11.4.1 Cloth Simulation: 186
11.4.2 Fluid Simulation: 189
11.4.3 Simple Initial Value Problem: 190
11.5 Automatic Performance Tuning: 193
11.5.1 Autotuning Mechanism for Numerical Simulations in SILC: 193
11.5.2 Related Work and Discussions: 195
11.6 Concluding Remarks: 195
References: 196
Chapter12 Exploring Tuning Strategies for Quantum Chemistry Computations: 198
12.1 Introduction: 198
12.2 Related Works: 200
12.3 Methodology: 200
12.4 Performance Results and Observations: 203
12.5 Tuning Strategy: 208
12.6 Conclusions and Future Work: 211
References: 211
Chapter13 Automatic Tuning of CUDA Execution Parameters for Stencil Processing: 214
13.1 Introduction: 214
13.2 Related Work: 216
13.3 Performance Tuning for CUDA: 217
13.4 Automatic Performance Tuning of CUDA Execution Parameters: 219
13.4.1 Parameter Exploration Space: 220
13.4.2 Auto-Tuning Mechanism: 223
13.5 Performance Evaluation: 225
13.5.1 Automatic Tuning on Himeno Benchmark: 225
13.5.2 Automatic Tuning on LU Decomposition: 229
13.6 Conclusions: 232
References: 232
Chapter14 Static Task Cluster Size Determination in Homogeneous Distributed Systems: 234
14.1 Introduction: 234
14.2 Problem Definition and Assumptions: 236
14.2.1 Assumed Model: 236
14.2.2 Problems in Conventional Cluster Merging: 237
14.2.3 Requirements for Reducing Both the Schedule Length and the Number of Processors: 238
14.3 Definition of an Indicative Value for Schedule Length: 239
14.3.1 Abstract of Worst Schedule Length: 239
14.3.2 WSL Definition: 239
14.4 Derivation of Cluster Size for Minimizing WSL: 240
14.4.1 An Upper Bound of WSL: 240
14.4.2 Derivation of dopt: 246
14.4.3 Relationship Between WSLs and Schedule Length: 247
14.4.4 Requirements for Minimizing WSL in a Task Clustering: 248
14.5 Task Clustering: 248
14.5.1 Overview of the Algorithm: 248
14.5.2 Cluster Selection Policies: 249
14.6 Experimental Evaluation: 251
14.6.1 Comparison Targets: 251
14.6.2 Simulation Setup: 251
14.6.3 Comparison of WSLS and Schedule Length in the Fixed Number of Processors: 252
14.6.4 Processor Utilization: 253
14.6.5 Comparison of WSLS and Schedule Length of Real Application DAGs: 254
14.6.6 Discussion: 255
14.7 Conclusion and Future Works: 256
References: 256
Part III Evolution to a General Paradigm: 258
Chapter15 Algorithmic Parameter Optimization of the DFO Method with the OPAL Framework: 259
15.1 Introduction: 259
15.2 Optimization of Algorithmic Parameters: 262
15.2.1 Black Box Construction: 262
15.2.2 Direct Search Algorithms: 264
15.3 The OPAL Package: 265
15.3.1 The OPAL Structure: 266
15.3.2 Usage of OPAL: 266
15.4 Application to Derivative-Free Optimization: 268
15.4.1 General Description of DFO: 268
15.4.2 Two DFO Parameter Optimization Problems: 268
15.4.3 Numerical Results: 272
15.5 Discussion: 276
References: 277
Chapter16 A Bayesian Method of Online Automatic Tuning: 279
16.1 Introduction: 279
16.2 Automatic Tuning Concepts: 280
16.3 Review of the Proposed Methods: 281
16.3.1 Online Automatic Tuning: 282
16.3.2 Tuning Based on Cost Function Modeling: 283
16.3.3 Bayesian Analysis to Treat Uncertainties: 284
16.3.4 Bayesian Suboptimal Sequential Experimental Design: 286
16.3.5 Desirable Properties of Mathematical Methods for Automatic Tuning: 288
16.3.6 Infinite Dilution: 291
16.3.7 Pseudocode of the Proposed Method: 292
16.4 Evaluation: 293
16.4.1 Infinite Dilution with Random Sampling: 293
16.4.2 Performance Evaluation: 293
16.5 Conclusion: 296
References: 297
Chapter17 ABCLibScript: A Computer Language for Automatic Performance Tuning: 298
17.1 Introduction: 298
17.2 The FIBER Framework: 299
17.2.1 FIBER Framework Basics: 299
17.2.2 The Two Users and Auto-tuning Software Construction Assumption: 299
17.2.2.1 Software Developer Phase: 300
17.2.2.2 Software User (End-User) Phase: 301
17.3 The ABCLibScript: 301
17.4 Examples of ABCLibScript Descriptions: 304
17.4.1 Software Developer API: 304
17.4.2 End-User API: Before Execute-Time Auto-tuning: 305
17.4.3 Other Programming Examples for Software Developers Using ABCLibScript: 306
17.4.3.1 Install-Time Auto-tuning: 306
17.4.3.2 Before Execute-Time Auto-tuning: 308
17.4.3.3 Run-Time Auto-tuning: 309
17.5 Future Direction of ABCLibScript: 310
17.5.1 Evaluation of Compiler Optimization and Hardware Performance: 310
17.5.2 Adapting ABCLibScript to Embedded Systems: 312
17.5.3 Adapting ABCLibScript to Low-Power Optimization: 314
17.6 Concluding Remarks: 315
References: 316
Chapter18 Automatically Tuning Task-Based Programs for Multicore Processors: 317
18.1 Introduction: 317
18.2 Example: 318
18.2.1 Task Extensions: 319
18.2.2 Execution: 319
18.2.3 Scheduling for Multicore Processor: 320
18.3 Implementation Synthesis: 322
18.3.1 Dependence Analysis: 323
18.3.2 Profiling: 323
18.3.3 Candidate Implementation Synthesis: 323
18.3.4 Simulation Stage: 325
18.3.5 Directed-Simulated Annealing: 326
18.3.5.1 Generation of Optimized Implementation: 327
18.3.6 Dynamic Scheduling: 328
18.4 Evaluation: 328
18.4.1 Benchmarks: 329
18.4.2 Accuracy of Scheduling Simulator: 330
18.4.3 Optimality of Directed Simulated Annealing: 331
18.4.4 Generality of Synthesized Implementation: 332
18.4.5 Overhead of Bamboo: 333
18.5 Related Work: 334
18.6 Conclusion: 335
References: 335
Chapter19 Efficient Program Compilation Through Machine Learning Techniques: 337
19.1 Introduction: 337
19.1.1 Motivation: 338
19.2 Design and Implementation: 339
19.2.1 Preparing for Data Collection: 339
19.2.2 Gathering Training Data: 342
19.2.3 Learning Process: 344
19.2.4 Deployment Process: 345
19.3 Experimental Setup: 346
19.4 Results and Discussion: 346
19.4.1 Quality of Training Data: 347
19.4.2 Classifier Evaluation: 348
19.4.3 Other Benchmarks: 349
19.5 Related Work: 350
19.6 Conclusions and Future Work: 352
19.7 Acknowledgments and Trademarks: 352
References: 352
Chapter20 Autotuning and Specialization: Speeding up Matrix Multiply for Small Matrices with Compiler Technology: 354
20.1 Introduction: 354
20.2 Compiler Technology: Autotuning and Specialization: 356
20.2.1 Optimization Strategy: 358
20.2.2 Specialization and Transformation Using CHiLL: 359
20.2.3 Autotuning: Heuristics for Pruning the Search Space: 361
20.2.4 Building the Library: 362
20.3 Experiments: 363
20.3.1 Experimental Environment: 364
20.3.2 Generating the Library: 364
20.3.3 Evaluating the Pruning Heuristics: 365
20.3.4 Performance Results: 367
20.4 Related Work: 369
20.5 Conclusion: 370
References: 370
Index: 372

Mediengruppe Stein

Buchhandlungen

Öffentliche Hand

Education

Bibliotheken

RWS

Corporate

Medical

Service

Infos

Shop

Alle Preise verstehen sich inklusive der gesetzlichen MwSt.

Widerruf

Download-Voucher einlösen

Kopierschutz

= Kopierschutzfrei

= Wasserzeichen

= DRM Kopierschutz

Software Automatic Tuning - From Concepts to State-of-the-Art Results

Ken Naono, Keita Teranishi, John Cavazos, Reiji Suda

Software Automatic Tuning

Preface

Contents

Contributors

Part I Introduction

Chapter1 Software Automatic Tuning: Concepts and State-of-the-Art Results

1.1 Software Automatic Tuning: General Concepts

1.2 Software Automatic Tuning: State-of-the-Art

Part II Achievements in Scientific Computing

Chapter2 ATLAS Version 3.9: Overview and Status*

2.1 Introduction

2.1.1 Basic AEOS Requirements

2.1.2 Methods of Software Adaptation

2.2 ATLAS Overview

2.2.1 Explicit LAPACK Support In ATLAS

2.2.2 Dense Level 3 BLAS Support in ATLAS

2.2.2.1 Using Assembly in ATLAS

2.2.3 Dense Level 2 BLAS Support in ATLAS

2.2.4 Dense Level 1 BLAS Support in ATLAS

2.3 Status

References

Chapter3 Autotuning Method for Deciding Block Size Parameters in Dynamically Load-Balanced BLAS

3.1 Introduction

3.2 Background

3.2.1 BLAS

3.2.2 DL-BLAS

3.2.3 Motivation of the present Research

3.3 Proposed Methods

3.3.1 Parallel Efficiency Analysis

3.3.2 Diagonal Search

3.3.3 Reductive Search

3.3.4 Parameter Selection

3.4 Experiments

3.4.1 Environments

3.4.2 Performance of Diagonal Search

3.4.3 Analysis and Discussion of Diagonal Search

3.4.4 Performance Evaluation Tests

3.5 Conclusion

References

Chapter4 Automatic Tuning for Parallel FFTs

4.1 Introduction

4.2 Parallel One-Dimensional FFT

4.2.1 A Block Nine-Step FFT Algorithm

4.2.2 A Parallel One-Dimensional FFT Algorithm Based on the Block Nine-Step FFT

4.3 Parallel Multi-Dimensional FFT

4.3.1 Parallel Two-Dimensional FFT

4.3.2 Parallel Three-Dimensional FFT

4.4 Automatic Tuning of Parallel FFTs

4.4.1 Automatic Tuning of All-to-All Communication

4.4.2 Selection of the Radices

4.4.3 Selection of the Block Size

4.5 Performance Results

4.5.1 Performance of All-to-All Communication

4.5.2 Performance of One-Dimensional FFT

4.5.3 Performance of Multi-Dimensional FFTs

4.6 Conclusion

References

Chapter5 Dynamic Programming Approaches to Optimizing the Blocking Strategy for Basic Matrix Decompositions

5.1 Introduction

5.2 Basics of the Householder QR Algorithm

5.2.1 The Householder QR Algorithm

5.2.2 Blocking Strategies for the Householder QR Algorithm

5.2.2.1 Fixed-size Blocking

5.2.2.2 Recursive Blocking

5.2.3 Another Blocking Strategy: The TSQR Algorithm

5.3 Dynamic Programming Approaches to Optimizing the Blocking Strategy

5.3.1 Limitations of Conventional Blocking Strategies

5.3.2 Approaches for the Sequential Case

5.3.2.1 Variable-size Blocking

5.3.2.2 Generalized Recursive Blocking

5.3.3 Approaches for the Parallel Case

5.3.3.1 Variable-size Blocking for Distributed Memory Machines

5.3.3.2 Combination of Variable-size Blocking and TSQR

5.4 Future Directions

5.4.1 Use of Task Level Parallelism

5.4.2 Extension to Heterogeneous Architectures

5.4.3 Online Refinement of the Performance Models and the Blocking Strategy

5.4.4 Application to Other Linear Algebra Algorithms