dummies
 

Suchen und Finden

Titel

Autor/Verlag

Inhaltsverzeichnis

Nur ebooks mit Firmenlizenz anzeigen:

 

Software Automatic Tuning - From Concepts to State-of-the-Art Results

Software Automatic Tuning - From Concepts to State-of-the-Art Results

Ken Naono, Keita Teranishi, John Cavazos, Reiji Suda

 

Verlag Springer-Verlag, 2010

ISBN 9781441969354 , 377 Seiten

Format PDF

Kopierschutz Wasserzeichen

Geräte

149,79 EUR

  • Von PDM zu PLM - Prozessoptimierung durch Integration
    Die Vier Lektionen des Lebens
    Konstruieren mit CAD - Das Komplettpaket für 3D Modellieren im Maschinenbau
    Blown Film Extrusion - An Introduction
    Gottesentrümplung - Warum es nicht verrückt ist, heute religiös zu sein
    Kleines Lexikon christlicher Irrtümer - Von Abendmahl bis Zungenreden
    Yb:KYW regenerativer Verstärker für ultrakurze Pulse
    Nutzen und Grenzen guter Fokussierbarkeit beim Laserschweißen
  • Metallobiolomics - Analysis, Function, Clinic Trials
    Modernisierungsfall(e) Universität - Wege zur Selbstfindung einer eigensinnigen Institution
    Wo lassen Sie denken? - Warum der Glaube an die Wissenschaft uns dumm macht

     

     

     

     

     

     

 

 

Software Automatic Tuning

3

Preface

5

Contents

7

Contributors

11

Part I Introduction

15

Chapter1 Software Automatic Tuning: Concepts and State-of-the-Art Results

16

1.1 Software Automatic Tuning: General Concepts

17

1.2 Software Automatic Tuning: State-of-the-Art

22

Part II Achievements in Scientific Computing

29

Chapter2 ATLAS Version 3.9: Overview and Status*

30

2.1 Introduction

30

2.1.1 Basic AEOS Requirements

32

2.1.2 Methods of Software Adaptation

33

2.2 ATLAS Overview

34

2.2.1 Explicit LAPACK Support In ATLAS

35

2.2.2 Dense Level 3 BLAS Support in ATLAS

36

2.2.2.1 Using Assembly in ATLAS

38

2.2.3 Dense Level 2 BLAS Support in ATLAS

39

2.2.4 Dense Level 1 BLAS Support in ATLAS

39

2.3 Status

40

References

40

Chapter3 Autotuning Method for Deciding Block Size Parameters in Dynamically Load-Balanced BLAS

44

3.1 Introduction

44

3.2 Background

45

3.2.1 BLAS

46

3.2.2 DL-BLAS

46

3.2.3 Motivation of the present Research

48

3.3 Proposed Methods

49

3.3.1 Parallel Efficiency Analysis

49

3.3.2 Diagonal Search

51

3.3.3 Reductive Search

51

3.3.4 Parameter Selection

53

3.4 Experiments

53

3.4.1 Environments

53

3.4.2 Performance of Diagonal Search

54

3.4.3 Analysis and Discussion of Diagonal Search

55

3.4.4 Performance Evaluation Tests

56

3.5 Conclusion

57

References

59

Chapter4 Automatic Tuning for Parallel FFTs

60

4.1 Introduction

60

4.2 Parallel One-Dimensional FFT

61

4.2.1 A Block Nine-Step FFT Algorithm

61

4.2.2 A Parallel One-Dimensional FFT Algorithm Based on the Block Nine-Step FFT

63

4.3 Parallel Multi-Dimensional FFT

65

4.3.1 Parallel Two-Dimensional FFT

65

4.3.2 Parallel Three-Dimensional FFT

66

4.4 Automatic Tuning of Parallel FFTs

68

4.4.1 Automatic Tuning of All-to-All Communication

68

4.4.2 Selection of the Radices

69

4.4.3 Selection of the Block Size

69

4.5 Performance Results

70

4.5.1 Performance of All-to-All Communication

72

4.5.2 Performance of One-Dimensional FFT

72

4.5.3 Performance of Multi-Dimensional FFTs

75

4.6 Conclusion

76

References

78

Chapter5 Dynamic Programming Approaches to Optimizing the Blocking Strategy for Basic Matrix Decompositions

79

5.1 Introduction

79

5.2 Basics of the Householder QR Algorithm

81

5.2.1 The Householder QR Algorithm

81

5.2.2 Blocking Strategies for the Householder QR Algorithm

81

5.2.2.1 Fixed-size Blocking

82

5.2.2.2 Recursive Blocking

82

5.2.3 Another Blocking Strategy: The TSQR Algorithm

83

5.3 Dynamic Programming Approaches to Optimizing the Blocking Strategy

84

5.3.1 Limitations of Conventional Blocking Strategies

84

5.3.2 Approaches for the Sequential Case

85

5.3.2.1 Variable-size Blocking

85

5.3.2.2 Generalized Recursive Blocking

87

5.3.3 Approaches for the Parallel Case

89

5.3.3.1 Variable-size Blocking for Distributed Memory Machines

89

5.3.3.2 Combination of Variable-size Blocking and TSQR

89

5.4 Future Directions

92

5.4.1 Use of Task Level Parallelism

92

5.4.2 Extension to Heterogeneous Architectures

92

5.4.3 Online Refinement of the Performance Models and the Blocking Strategy

92

5.4.4 Application to Other Linear Algebra Algorithms

93

5.5 Conclusion

93

References

93

Chapter6 Automatic Tuning of the Division Number in the Multiple Division Divide-and-Conquer for Real Symmetric Eigenproblem

96

6.1 Introduction

96

6.2 DCk Algorithm and Deflation

97

6.3 Deflation Occurrence Rate (DOR)

99

6.4 Proposal of Algorithm to Estimate the Optimal Division Number

102

6.5 Numerical Experiment

103

6.6 Summary

109

References

109

Chapter7 Automatically Tuned Mixed-Precision Conjugate Gradient Solver

111

7.1 Introduction

111

7.2 Iterative Refinement

113

7.3 Choosing the Target Residual Reduction for the Inner Solver

115

7.4 Stopping the Iterative Refinement

119

7.4.1 Stopping Based on Threshold

119

7.4.2 Stopping Based on Condition Number Estimation

121

7.4.3 Combined Stopping Method

122

7.5 Condition Number Estimation

123

7.6 Conclusions

126

References

127

Chapter8 Automatically Tuned Sparse Eigensolvers

128

8.1 Introduction

128

8.2 Background: AT Research Trends

129

8.2.1 Definition of AT

129

8.2.2 Six Studies on AT

130

8.2.3 AT Research Trends

131

8.3 Proposal of AT-Restarted-Lanczos

133

8.3.1 Restarted-Lanczos and Its Problem

133

8.3.2 Preliminary Experiment

133

8.3.3 Proposal of the Run-Time Automatic Tuning Algorithm

135

8.3.4 Evaluations

137

8.4 Conclusions

139

References

139

Chapter9 Systematic Performance Evaluation of Linear Solvers Using Quality Control Techniques

141

9.1 Introduction

141

9.2 Systematic Performance Evaluation and Systematic Analyses for Numerical Computation Algorithms

142

9.2.1 Systematic Performance Evaluation of Solution Algorithms

142

9.2.2 Construction of the Performance Information DB on the Solution Algorithms

143

9.3 Quality Control in the Solution Process

145

9.3.1 Quality Control Techniques

145

9.3.2 Systematic Performance Evaluation Using Survey Charts

148

9.4 Analysis of Causes of Poor Solution Quality

152

9.4.1 Preconditioning of Solution Algorithm for Linear Equations

154

9.4.2 Residual Vector and Convergence Criterion for Each Preconditioning System

155

9.5 Concluding Remarks

156

References

158

Chapter 10 Application of Alternating Decision Trees in Selecting Sparse Linear Solvers*

159

10.1 Introduction

159

10.2 Machine Learning Methods

161

10.2.1 Adaboost

162

10.2.2 Alternating Decision Trees

163

10.3 Solver Selection as a Classification Problem

164

10.3.1 Feature Selection and Computation

166

10.4 Applying Machine Learning

167

10.5 Experimental Results

168

10.5.1 Experiment 1: Classification on Linear Systems Generated from PDE-based Simulation Code

169

10.5.2 Experiment 2: Classification on Only Coefficient Matrices

171

10.6 Conclusions and Future Plans

175

References

177

Chapter11 Toward Automatic Performance Tuning for Numerical Simulations in the SILC Matrix Computation Framework

180

11.1 Introduction

180

11.2 The SILC Matrix Computation Framework

182

11.3 Performance Modeling

184

11.4 Experiments

185

11.4.1 Cloth Simulation

186

11.4.2 Fluid Simulation

189

11.4.3 Simple Initial Value Problem

190

11.5 Automatic Performance Tuning

193

11.5.1 Autotuning Mechanism for Numerical Simulations in SILC

193

11.5.2 Related Work and Discussions

195

11.6 Concluding Remarks

195

References

196

Chapter12 Exploring Tuning Strategies for Quantum Chemistry Computations

198

12.1 Introduction

198

12.2 Related Works

200

12.3 Methodology

200

12.4 Performance Results and Observations

203

12.5 Tuning Strategy

208

12.6 Conclusions and Future Work

211

References

211

Chapter13 Automatic Tuning of CUDA Execution Parameters for Stencil Processing

214

13.1 Introduction

214

13.2 Related Work

216

13.3 Performance Tuning for CUDA

217

13.4 Automatic Performance Tuning of CUDA Execution Parameters

219

13.4.1 Parameter Exploration Space

220

13.4.2 Auto-Tuning Mechanism

223

13.5 Performance Evaluation

225

13.5.1 Automatic Tuning on Himeno Benchmark

225

13.5.2 Automatic Tuning on LU Decomposition

229

13.6 Conclusions

232

References

232

Chapter14 Static Task Cluster Size Determination in Homogeneous Distributed Systems

234

14.1 Introduction

234

14.2 Problem Definition and Assumptions

236

14.2.1 Assumed Model

236

14.2.2 Problems in Conventional Cluster Merging

237

14.2.3 Requirements for Reducing Both the Schedule Length and the Number of Processors

238

14.3 Definition of an Indicative Value for Schedule Length

239

14.3.1 Abstract of Worst Schedule Length

239

14.3.2 WSL Definition

239

14.4 Derivation of Cluster Size for Minimizing WSL

240

14.4.1 An Upper Bound of WSL

240

14.4.2 Derivation of dopt

246

14.4.3 Relationship Between WSLs and Schedule Length

247

14.4.4 Requirements for Minimizing WSL in a Task Clustering

248

14.5 Task Clustering

248

14.5.1 Overview of the Algorithm

248

14.5.2 Cluster Selection Policies

249

14.6 Experimental Evaluation

251

14.6.1 Comparison Targets

251

14.6.2 Simulation Setup

251

14.6.3 Comparison of WSLS and Schedule Length in the Fixed Number of Processors

252

14.6.4 Processor Utilization

253

14.6.5 Comparison of WSLS and Schedule Length of Real Application DAGs

254

14.6.6 Discussion

255

14.7 Conclusion and Future Works

256

References

256

Part III Evolution to a General Paradigm

258

Chapter15 Algorithmic Parameter Optimization of the DFO Method with the OPAL Framework

259

15.1 Introduction

259

15.2 Optimization of Algorithmic Parameters

262

15.2.1 Black Box Construction

262

15.2.2 Direct Search Algorithms

264

15.3 The OPAL Package

265

15.3.1 The OPAL Structure

266

15.3.2 Usage of OPAL

266

15.4 Application to Derivative-Free Optimization

268

15.4.1 General Description of DFO

268

15.4.2 Two DFO Parameter Optimization Problems

268

15.4.3 Numerical Results

272

15.5 Discussion

276

References

277

Chapter16 A Bayesian Method of Online Automatic Tuning

279

16.1 Introduction

279

16.2 Automatic Tuning Concepts

280

16.3 Review of the Proposed Methods

281

16.3.1 Online Automatic Tuning

282

16.3.2 Tuning Based on Cost Function Modeling

283

16.3.3 Bayesian Analysis to Treat Uncertainties

284

16.3.4 Bayesian Suboptimal Sequential Experimental Design

286

16.3.5 Desirable Properties of Mathematical Methods for Automatic Tuning

288

16.3.6 Infinite Dilution

291

16.3.7 Pseudocode of the Proposed Method

292

16.4 Evaluation

293

16.4.1 Infinite Dilution with Random Sampling

293

16.4.2 Performance Evaluation

293

16.5 Conclusion

296

References

297

Chapter17 ABCLibScript: A Computer Language for Automatic Performance Tuning

298

17.1 Introduction

298

17.2 The FIBER Framework

299

17.2.1 FIBER Framework Basics

299

17.2.2 The Two Users and Auto-tuning Software Construction Assumption

299

17.2.2.1 Software Developer Phase

300

17.2.2.2 Software User (End-User) Phase

301

17.3 The ABCLibScript

301

17.4 Examples of ABCLibScript Descriptions

304

17.4.1 Software Developer API

304

17.4.2 End-User API: Before Execute-Time Auto-tuning

305

17.4.3 Other Programming Examples for Software Developers Using ABCLibScript

306

17.4.3.1 Install-Time Auto-tuning

306

17.4.3.2 Before Execute-Time Auto-tuning

308

17.4.3.3 Run-Time Auto-tuning

309

17.5 Future Direction of ABCLibScript

310

17.5.1 Evaluation of Compiler Optimization and Hardware Performance

310

17.5.2 Adapting ABCLibScript to Embedded Systems

312

17.5.3 Adapting ABCLibScript to Low-Power Optimization

314

17.6 Concluding Remarks

315

References

316

Chapter18 Automatically Tuning Task-Based Programs for Multicore Processors

317

18.1 Introduction

317

18.2 Example

318

18.2.1 Task Extensions

319

18.2.2 Execution

319

18.2.3 Scheduling for Multicore Processor

320

18.3 Implementation Synthesis

322

18.3.1 Dependence Analysis

323

18.3.2 Profiling

323

18.3.3 Candidate Implementation Synthesis

323

18.3.4 Simulation Stage

325

18.3.5 Directed-Simulated Annealing

326

18.3.5.1 Generation of Optimized Implementation

327

18.3.6 Dynamic Scheduling

328

18.4 Evaluation

328

18.4.1 Benchmarks

329

18.4.2 Accuracy of Scheduling Simulator

330

18.4.3 Optimality of Directed Simulated Annealing

331

18.4.4 Generality of Synthesized Implementation

332

18.4.5 Overhead of Bamboo

333

18.5 Related Work

334

18.6 Conclusion

335

References

335

Chapter19 Efficient Program Compilation Through Machine Learning Techniques

337

19.1 Introduction

337

19.1.1 Motivation

338

19.2 Design and Implementation

339

19.2.1 Preparing for Data Collection

339

19.2.2 Gathering Training Data

342

19.2.3 Learning Process

344

19.2.4 Deployment Process

345

19.3 Experimental Setup

346

19.4 Results and Discussion

346

19.4.1 Quality of Training Data

347

19.4.2 Classifier Evaluation

348

19.4.3 Other Benchmarks

349

19.5 Related Work

350

19.6 Conclusions and Future Work

352

19.7 Acknowledgments and Trademarks

352

References

352

Chapter20 Autotuning and Specialization: Speeding up Matrix Multiply for Small Matrices with Compiler Technology

354

20.1 Introduction

354

20.2 Compiler Technology: Autotuning and Specialization

356

20.2.1 Optimization Strategy

358

20.2.2 Specialization and Transformation Using CHiLL

359

20.2.3 Autotuning: Heuristics for Pruning the Search Space

361

20.2.4 Building the Library

362

20.3 Experiments

363

20.3.1 Experimental Environment

364

20.3.2 Generating the Library

364

20.3.3 Evaluating the Pruning Heuristics

365

20.3.4 Performance Results

367

20.4 Related Work

369

20.5 Conclusion

370

References

370

Index

372