ParCo'95                                                                                                                                                                 BELGACOM


Debugging Parallel Programmes: some issues and solutions

Coordinators:
Jacques Chassin de Kergommeaux
IMAG, projet APACHE
46 avenue Félix Viallet
F-38031 Grenoble Cedex 1, France
E-mail: Jacques.Chassin-de-Kergommeaux@imag.fr
Tel: +33 76 57 48 72

Luk Levrouw
ELIS, RUG
Sint-Pietersnieuwstraat 41
B-9000 Gent, Belgium
E-mail: Luk.Levrouw@elis.rug.ac.be
Tel: +32+9/264.33.67

Abstract

This workshop will survey the main problems encountered when debugging parallel programmes, and will describe the most promising solutions to some of these problems. Both correctness and performance debugging will be addressed. The workshop will conclude with a panel discussion summarizing the specific problems of parallel debugging, the classical solutions to some of them and the remaining open problems.

Description of workshop and intended audience

It is widely acknowledged that one of the main obstacles to the use of parallel computers is the difficulty of correctness and performance debugging of parallel programmes. Solutions used for sequential debugging are not sufficient for parallel debugging. Compared to sequential programme executions, parallel programme executions generate a large number of states and are extremely sensitive to perturbations (the probe effect), which can drastically change their behaviour. In addition, a large number of objects are involved in the execution of a parallel programme (processors, processes, threads, communication links, etc.), all of which ought to be observed and controlled in order to catch errors and bottlenecks.

The goal of this workshop is to give the audience a survey of the specific problems of parallel debugging, the existing solutions and the remaining open problems.

The first part of the workshop will be dedicated to correctness debugging and in particular to the solutions to the problem of non-determinism of parallel executions. Non-determinism occurs even for programmes producing deterministic results and may result in transient errors which appear infrequently or vanish when debugging tools are used, because these tools alter the causal relationships between parallel processes.

The classical technique for catching transient errors that appear during executions of parallel programmes is to record an initial execution and then, using the recorded information, to force subsequent replayed executions to be deterministic with respect to it. Debugging an erroneous programme then amounts to recording an erroneous execution and applying cyclic debugging techniques during subsequent replayed executions. Efficient record-replay techniques for shared-memory, message-passing and client-server programming models will be described, as well as several optimisations proposed to reduce the overhead of the initial recording.
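The record-replay idea can be illustrated with a minimal sketch, assuming a toy message-passing model in which a single receiver gets messages from several senders and the only source of non-determinism is the arrival order. The function `run`, the sender names and the message contents are hypothetical and chosen for illustration only; they do not correspond to any tool presented at the workshop.

```python
import random

def run(senders, rng=None, log=None):
    """Deliver all pending messages to a single receiver, one per step.

    Record mode (log is None): the next sender is picked at random,
    which models the non-deterministic arrival order of messages.
    Replay mode (log given): the next sender is read from the log,
    so the arrival order of the recorded execution is reproduced.
    Returns (messages in arrival order, sender chosen at each step).
    """
    queues = {name: list(msgs) for name, msgs in senders.items()}
    received, order = [], []
    replay = iter(log) if log is not None else None
    while any(queues.values()):
        if replay is None:
            # Only senders with a pending message can be chosen.
            ready = sorted(name for name, q in queues.items() if q)
            name = rng.choice(ready)
        else:
            name = next(replay)
        received.append(queues[name].pop(0))
        order.append(name)
    return received, order

# Hypothetical workload: two senders, two messages each.
senders = {"p1": ["a1", "a2"], "p2": ["b1", "b2"]}

# Initial execution: only the order of senders is recorded,
# not the message contents.
recorded, trace = run(senders, rng=random.Random())

# Replayed execution: forced to follow the recorded order.
replayed, _ = run(senders, log=trace)
assert replayed == recorded
```

Only the sequence of sender names is logged here, which hints at why recording can be kept cheap: the message contents are regenerated by the replayed execution itself.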

Other issues pertaining to parallel correctness debugging will also be briefly surveyed, such as testing parallel programmes for global properties or setting causally coherent breakpoints.

The second part of the workshop will be dedicated to performance debugging of parallel programmes. Although performance is the main reason for parallelizing applications, it is very hard to approach the peak performance of a parallel system with a real application. This problem has motivated the development of performance measurement tools for parallel programmes. Tools used for measuring the performance of sequential programmes are mainly based on sampling. They are not adequate for parallel programmes since they can neither measure communication overheads nor expose bottlenecks.

Specific tools based on trace collection and analysis were developed together with visualisation tools which measure and display the activity of the entities involved in the parallel computation: processors, processes, communication links, etc. The main problems raised by trace collection, analysis and visualisation, and the existing solutions to cope with these problems, will be surveyed: the probe effect, the lack of a global clock in distributed systems, and the presentation of a large number of objects with complex relationships.
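One well-known way to order trace events without a global clock is Lamport's logical clocks. The sketch below is an illustration of that general technique under stated assumptions, not a description of the solutions covered in the workshop: the function `lamport_timestamps`, the event format and the sample trace are all hypothetical, and the input is assumed to list each send before its matching receive.

```python
def lamport_timestamps(events):
    """Assign Lamport logical timestamps to trace events.

    events: list of (process, kind, msg_id) tuples, where kind is
    "local", "send" or "recv". Each send must appear before the
    matching receive in the list.
    Returns a list of (timestamp, process, kind, msg_id).
    """
    clock = {}   # per-process logical clock
    sent = {}    # msg_id -> timestamp carried by the message
    stamped = []
    for proc, kind, msg in events:
        t = clock.get(proc, 0)
        if kind == "recv":
            # A receive happens after the corresponding send.
            t = max(t, sent[msg])
        t += 1
        clock[proc] = t
        if kind == "send":
            sent[msg] = t
        stamped.append((t, proc, kind, msg))
    return stamped

# Hypothetical trace of two processes exchanging two messages.
trace = [
    ("p1", "local", None),
    ("p1", "send", "m1"),
    ("p2", "local", None),
    ("p2", "recv", "m1"),
    ("p2", "send", "m2"),
    ("p1", "recv", "m2"),
]
stamped = lamport_timestamps(trace)
```

With these timestamps, every receive is stamped later than its send, so a visualisation tool sorting events by timestamp never shows a message arriving before it was sent.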

One interesting perspective is to combine correctness and performance debugging tools in a single environment, since performance debugging tools often provide programmers with a "high-level" view of parallel programme executions (for example, graphical representations of processes and their interrelations), which can be a significant help in understanding and controlling the execution behaviour of a parallel programme.

The workshop will conclude with a panel discussion on the state of the art in parallel debugging to identify specific problems, classical solutions and open problems.

The intended audience is anyone concerned with the debugging of parallel programmes, whether a developer of parallel programmes interested in hearing about problems and solutions or a researcher in the field interested in debating the state of the art.

OUTLINE

Total time: 3 hours.

First hour: correctness debugging.

Second hour: performance debugging.

Third hour: panel discussion.