AUTO_P(5)
NAME
AUTO_P - Automatic Parallelization
TOPIC
This man page discusses automatic parallelization and how to achieve it
with the Silicon Graphics MIPSpro Automatic Parallelization Option. The
following topics are covered:
Automatic Parallelization and the MIPSpro Compilers
Using the MIPSpro Automatic Parallelization Option
Automatic Parallelization and the MIPSpro Compilers
Parallelization is the process of analyzing sequential programs for
parallelism so that they may be restructured to run efficiently on
multiprocessor systems. The goal is to minimize the overall computation
time by distributing the computational work load among the available
processors. Parallelization can be automatic or manual.
During automatic parallelization, the MIPSpro Automatic Parallelization
Option, hereafter called the auto-parallelizer, analyzes and structures
the program with little or no intervention by the developer. The auto-
parallelizer can automatically generate code that splits the processing
of loops among multiple processors. The alternative is manual
parallelization by which the developer performs the parallelization using
pragmas and other programming techniques. Manual parallelization is
discussed in the mp(3f) and mp(3c) man pages.
Automatic parallelization begins with the determination of data
dependence of variables and arrays in loops. Data dependence can prevent
loops from being safely run in parallel because the final outcome of the
computation may vary depending on the order the various processors access
the variables and arrays. Data dependence and other obstacles to
parallelization are discussed in more detail in the next section.
Once data dependences are resolved, a number of automatic parallelization
strategies can be employed. They can consist of the following:
Loop interchange of nested loops
Scalar expansion
Loop distribution
Automatic synchronization of DOACROSS loops
Intraprocedural array privatization
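To give a flavor of these transformations, the following C sketch shows scalar expansion (the helper names and the C rendering are illustrative, not compiler output). In the "before" form the single temporary t would be shared by all iterations if the loop ran in parallel; expanding t into a per-iteration array removes that dependence, and the resulting loop pair also illustrates loop distribution, which the expansion enables:

```c
/* Before: one t is reused across iterations, so a naive parallel run
 * would race on it. */
void square_before(const double *b, double *a, int n) {
    double t;
    for (int i = 0; i < n; i++) {
        t = b[i];              /* shared temporary */
        a[i] = t * t;
    }
}

/* After scalar expansion: each iteration owns t_x[i], so the two
 * resulting loops are each safe to run in parallel. */
void square_expanded(const double *b, double *a, double *t_x, int n) {
    for (int i = 0; i < n; i++)
        t_x[i] = b[i];         /* expanded temporary */
    for (int i = 0; i < n; i++)
        a[i] = t_x[i] * t_x[i];
}
```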
The 7.2 release of the MIPSpro compilers marks a major revision of the
auto-parallelizer. The new release incorporates automatic parallelization
into the other optimizations performed by the MIPSpro compilers. Previous
versions relied on preprocessors to provide source-to-source conversions
prior to compilation. This change provides several benefits to
developers:
Automatic parallelization is integrated with optimizations for single
processors
A set of options and pragmas consistent with the rest of the MIPSpro
compilers
Support for C++
Better run-time and compile-time performance
The MIPSpro Automatic Parallelization Option
Developers exploit parallelism in programs to provide better performance
on multiprocessor systems. You do not need a multiprocessor system to use
the automatic parallelizer. Although there is a slight performance loss
when a single-processor system runs multiprocessed code, you can use the
auto-parallelizer on any Silicon Graphics system to create and debug a
program.
The automatic parallelizer is an optional software product that is used
as an extension to the following compilers:
MIPSpro Fortran 77
MIPSpro Fortran 90
MIPSpro C
MIPSpro C++
It is controlled by flags inserted in the command lines that invoke the
supported compilers.
Using the MIPSpro Automatic Parallelizer
This section describes how to use the auto-parallelizer when you compile
and run programs with the MIPSpro compilers.
Using the MIPSpro Compilers to Parallelize Programs
You invoke the auto-parallelizer by using the -pfa or -pca flags on the
command lines for the MIPSpro compilers. The syntax for compiling
programs with the auto-parallelizer is as follows:
For Fortran 77 and Fortran 90 use -pfa:
%f77 options -pfa [{ list | keep }] [ -mplist ] filename
%f90 options -pfa [{ list | keep }] [ -mplist ] filename
For C and C++ use -pca:
%cc options -pca [{ list | keep }] [ -mplist ] filename
%CC options -pca [{ list | keep }] [ -mplist ] filename
where options are MIPSPro compiler command-line options. For details on
the other options see the documentation for your MIPSPro compiler.
-pfa and -pca
Invoke the auto-parallelizer and enable any multiprocessing
directives.
list
Produce an annotated listing of the parts of the program that can
(and cannot) run in parallel on multiple processors. The listing
file has the suffix .l.
keep
Generate the listing file (.l), the transformed equivalent
program (.m), and an output file for use with WorkShop Pro
MPF (.anl).
-mplist
Generate a transformed equivalent program in a .w2f.f file for
Fortran 77 or a .w2c.c file for C.
filename
The name of the file containing the source code.
To use the automatic parallelizer with Fortran programs, add the -pfa
flag to both the compile and link line. For C or C++, add the -pca flag.
If you link separately, you must also add -mp to the link line. Previous
versions of the Power compilers had a large set of flags to control
optimization. The 7.2 version uses the same set of options as the rest of
the MIPSPro compilers. So, for example, while in the older Power
compilers the option -pfa,-r=0 turned off roundoff changing
transformations in the pfa preprocessor, in the new compiler
-OPT:roundoff=0 turns off roundoff changing transformations in all phases
of the compiler.
The -pfa list option generates a .l file. The .l file lists the loops in
your code, indicating which were parallelized and which were not. If any
were not parallelized, it explains why not. The -pfa keep option
generates a .l file, a .m file, and a .anl file that is used by the Workshop
ProMPF tool. The .m file is similar to the .w2f.f or .w2c.c file except
that the file is annotated with some information used by Workshop ProMPF
tool.
The -mplist option will, in addition to compiling your program, generate
a .w2f.f file (for Fortran 77, .w2c.c file for C) that represents the
program after the automatic parallelization phase. These programs should
be readable and in most cases should be valid code suitable for
recompilation. The -mplist option can be used to see what portions of
your code were parallelized.
For Fortran 90 and C++, automatic parallelization happens after the
source program has been converted into an internal representation. It is
not possible to regenerate Fortran 90 or C++ after parallelization.
Examples:
Example Analyzing a .l File
%cat foo.f
subroutine sub(arr,n)
real*8 arr(n)
do i=1,n
arr(i) = arr(i) + arr(i-1)
end do
do i=1,n
arr(i) = arr(i) + 7.0
call foo(a)
end do
do i=1,n
arr(i) = arr(i) + 7.0
end do
end
%f77 -O3 -n32 -mips4 -pfa list foo.f -c
Here is the associated .l file:
Parallelization Log for Subprogram sub_
3: Not Parallel
     Array dependence from arr on line 4 to arr on line 4.
6: Not Parallel
     Call foo on line 8.
10: PARALLEL (Auto) __mpdo_sub_1
Example Analyzing a .w2f.f File
%cat test.f
subroutine trivial(a)
real a(10000)
do i=1,10000
a(i) = 0.0
end do
end
%f77 -O3 -n32 -mips4 -c -pfa -mplist test.f
We get both an object file, test.o, and a test.w2f.f file that contains
the following code
SUBROUTINE trivial(a)
IMPLICIT NONE
REAL*4 a(10000_8)
INTEGER*4 i
C$DOACROSS local(i), shared(a)
DO i = 1, 10000, 1
a(i) = 0.0
END DO
RETURN
END ! trivial
Running Your Program
Invoke your program as if it were a sequential program. The same binary
can execute using different numbers of processors. By default, the
runtime will select how many processors to use based on the number of
processors in the machine. The developer can use the environment
variable MP_SET_NUMTHREADS to change the default to use an explicit
number of processors. In addition, the developer can have the number of
processors vary dynamically from loop to loop based on system load by
setting the environment variable MP_SUGNUMTHD. Refer to the mp(3f) and
mp(3c) man pages for more details.
Simply passing code through the auto-parallelizer does not always produce
all of the increased performance available. The following sections
discuss strategies for making effective use of the product when the
auto-parallelizer is not able to fully parallelize an application.
Analyzing the Automatic Parallelizer's Results
Running a program through the auto-parallelizer often results in
excellent parallel speedups, but there are cases that cannot be
automatically well parallelized. By understanding the listing files, you
can sometimes identify small problems that prevent a loop from running
safely in parallel. With a relatively small amount of work, you can
remove these data dependencies and dramatically improve the program's
performance.
Hint: When trying to find loops to run in parallel, focus your efforts
on the areas of the code that use the bulk of the run time. Spending time
trying to run a routine in parallel that uses only one percent of the run
time of the program cannot significantly improve the overall performance
of your program. To determine where your code spends its time, take an
execution profile of the program using the Speedshop performance tools.
The auto-parallelizer provides several mechanisms to analyze what it did.
For Fortran 77 and C programs, the -mplist option generates the code after
parallelization. Manual parallelism directives are inserted on loops that
have been automatically parallelized. For details about these directives,
refer to Chapters 5-7, "Fortran Enhancements for Multiprocessors," of the
MIPSpro Fortran 77 Programmer's Guide, or Chapter 11, "Multiprocessing
C/C++ Compiler Directives," of the C Language Reference Manual.
The output code in the .w2f.f or .w2c.c file should be readable and
understandable. The user can use it as a tool to gain insight into what the
auto-parallelizer did. The user can then use that insight to make changes
to the original source program.
Note that the auto-parallelizer is not a source-to-source preprocessor,
but is instead an internal phase of the MIPSpro compilers. With a
preprocessor system, a post parallelization file would always be
generated and fed into the regular compiler. This is not the case with
the auto-parallelizer. Therefore, compiling a .w2f.f or .w2c.c file
through a MIPSPro compiler will not generate identical code to compiling
the original source through the MIPSPro auto-parallelizer. But, often the
two will be almost the same.
The auto-parallelizer also provides a listing mechanism via the keep or
list argument to the -pfa or -pca option. This causes the compiler to
generate a .l file. The .l file lists the original loops in the
program along with messages telling whether or not the loops were
parallelized. For loops that were not parallelized, an explanation will
be given.
Parallelization Failures With the Automatic Parallelizer
This section discusses mistakes you can avoid and actions you can take to
enhance the performance of the auto-parallelizer. The auto-parallelizer
is not always able to parallelize programs effectively. This can be true
for a number of reasons, some of which you can address. There are three
broad categories of parallelization failure:
The auto-parallelizer does not detect that a loop is safe to parallelize
The auto-parallelizer chooses the wrong nested loop to make parallel
The auto-parallelizer parallelizes a loop that would run more efficiently
sequentially
Failure to Recognize Safe Loops
We want the auto-parallelizer to recognize every loop that is safe to
parallelize. A loop is not safe if there is data dependence, so the
automatic parallelizer analyzes each loop in a sequential program to try
to prove it is safe. If it cannot prove a loop is safe, it does not do
the parallelization. A loop that contains any of the constructs described
in this section may not be proved safe. However, in many instances the
loop can be proved safe after minor changes. You should review your
program's .l file to see if there are any of these constructs in your
code.
Usually the failure to recognize a loop as safe is related to one or more
of the following practices:
Function Calls in Loops
GO TO Statements in Loops
Complicated Array Subscripts
Conditionally Assigned Temporary Variables in Loops
Unanalyzable Pointer Usage in C/C++
Function Calls in Loops
By default, the auto-parallelizer does not parallelize a loop that
contains a function call because the function in one iteration may modify
or depend on data in other iterations of the loop. However, a couple of
tools can help with this problem.
Interprocedural analysis, specified by the -IPA command-line option, can
provide the auto-parallelizer with enough additional information to
parallelize some loops that contain function calls. For more information
on interprocedural analysis, see the MIPSpro Compiling and Performance
Tuning Guide.
The C*$* ASSERT CONCURRENT CALL Fortran assertion, discussed below, allows
you to tell the auto-parallelizer to ignore function calls when analyzing
the specified loops.
GO TO Statements in Loops
The use of GO TO statements in loops can cause two problems:
Early exits from loops.
It is not possible to parallelize loops with early exits, either
automatically or manually.
Unstructured control flows.
The auto-parallelizer attempts to convert unstructured control flows
in loops into structured constructs. If the auto-parallelizer cannot
restructure these control flows, your only alternatives are manual
parallelization or restructuring the code.
Complicated Array Subscripts
There are several cases where array subscripts are too complicated to
permit parallelization.
Indirect Array References
The auto-parallelizer is not able to analyze indirect array
references. Consider the following Fortran example.
do i= 1,n
a(b(i)) = ...
end do
This loop cannot be run safely in parallel if the indirect reference
b(i) is equal to the same value for different iterations of i. If
every element of array b is unique, the loop can safely be made
parallel. In such cases, use either manual methods or the C*$* ASSERT
PERMUTATION Fortran assertion, discussed below, to achieve parallelism.
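The following C sketch (a contrived, hypothetical helper) shows why duplicate index values are unsafe: when b(i) repeats, two iterations write the same element of a, so the final value depends on the order of the writes. Run sequentially, the last write wins; a parallel schedule could produce a different answer.

```c
/* Indirect scatter: iterations i with equal b[i] collide on the same
 * element of a, creating an order-dependent result. */
void scatter(double *a, const int *b, int n) {
    for (int i = 0; i < n; i++)
        a[b[i]] = (double)i;
}
```

With b = {0, 0}, a sequential run leaves a[0] == 1.0 (iteration 1 wrote last); a parallel run could leave 0.0. If every b[i] is distinct, any execution order gives the same result, which is exactly what the permutation assertion promises.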
Unanalyzable Subscripts
The auto-parallelizer cannot parallelize loops containing arrays
with unanalyzable subscripts. In the following case, the auto-
parallelizer is not able to analyze the / in the array subscript and
cannot reorder the loop.
do i = l,u,2
a(i/2) = ...
end do
Hidden Knowledge
In the following example there may be hidden knowledge about the
relationship between the variables m and n.
do i = 1,n
a(i) = a(i+m)
end do
The loop can be run in parallel if m > n, because the arrays will
not overlap. However, because the auto-parallelizer does not know
the value of the variables, it cannot make the loop parallel.
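The hidden knowledge above can be written as a simple predicate (a hypothetical helper, not anything the compiler emits). The loop writes a(1)..a(n) and reads a(1+m)..a(n+m); per the text, m > n guarantees the two regions do not overlap, so the iterations are independent. The compiler cannot evaluate this without knowing the run-time values of m and n:

```c
#include <stdbool.h>

/* True when the write region a(1..n) and the read region a(1+m..n+m)
 * of the loop "a(i) = a(i+m)" cannot overlap (sufficient condition
 * from the text; a symmetric condition holds for negative m). */
bool iterations_independent(int m, int n) {
    return m > n;
}
```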
Conditionally Assigned Temporary Variables in Loops
When parallelizing a loop, the auto-parallelizer often localizes
(privatizes) temporary scalar and array variables. Consider the following
example.
do i = 1,n
do j = 1,n
tmp(j) = ...
end do
do j = 1,n
a(j,i) = a(j,i) + tmp(j)
end do
end do
The array tmp is used for local scratch space. To successfully
parallelize the outer (i) loop, each processor must be given a distinct,
private tmp array. In this example, the auto-parallelizer is able to
localize tmp and parallelize the loop. The auto-parallelizer runs into
trouble when a conditionally assigned temporary variable might be used
outside of the loop, as in the following example.
subroutine s1(a,b)
common t
...
do i = 1,n
if (b(i)) then
t = ...
a(i) = a(i) + t
end if
end do
call s2()
If the loop were to be run in parallel, a problem would arise if the
value of t were used inside subroutine s2(). Which processor's private
copy of t should s2() use? If t were not conditionally assigned, the
answer would be the processor that executed iteration n. But t is
conditionally assigned and the auto-parallelizer cannot determine which
copy to use.
The loop is inherently parallel if the conditionally assigned variable t
is localized. If the value of t is not used outside the loop, you should
replace t with a local variable. Unless t is a local variable, the auto-
parallelizer must assume that s2() might use it.
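The suggested fix, sketched in C (names hypothetical): when the temporary's value is not needed after the loop, declare it inside the loop body so it is private to each iteration, and therefore to each processor in a parallel run:

```c
/* t is declared in the loop body, so no use outside the loop is
 * possible and each iteration gets its own private copy. */
void accumulate(double *a, const int *cond, const double *src, int n) {
    for (int i = 0; i < n; i++) {
        if (cond[i]) {
            double t = src[i];
            a[i] += t;
        }
    }
}
```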
Unanalyzable Pointer Usage in C/C++
The C and C++ languages have features that make them more difficult than
Fortran to automatically parallelize. Many of these features are related
to the use of pointers. The following practices involving pointers
interfere with the auto-parallelizer's effectiveness:
Arbitrary Pointer Dereferences
The auto-parallelizer does not analyze arbitrary pointer
dereferences. The only pointers it analyzes are array references and
pointer dereferences that can be converted into array references.
The auto-parallelizer can subdivide the trees formed by
dereferencing arbitrary pointers and run the parts in parallel.
However, it cannot determine if the tree is really a directed graph
with an unsafe multiple reference. Therefore the parallelization is
not done.
Arrays of Arrays
Multidimensional arrays are sometimes implemented as arrays of
arrays. Consider this example:
double **p;
for (int i = 0; i < n; i++)
for (int j = 0; j < n; j++)
p[i][j] = ...
If p is a true multi-dimensional array, the outer loop can be run
safely in parallel. If two of the array pointers, p[2] and p[3] for
example, reference the same array, the loop must not be run in
parallel. Although this duplicate reference is unlikely, the auto-
parallelizer cannot prove it doesn't exist. You can avoid this
problem by always using true arrays. To parallelize the code
fragment above, rewrite it as follows:
double p[n][n];
for (int i = 0; i < n; i++)
for (int j = 0; j < n; j++)
p[i][j] = ...
Note: Although ANSI C does not allow variable-sized multi-
dimensional arrays, there is a proposal to allow them in the next
standard. The MIPSPro 7.2 auto-parallelizer already implements this
proposal.
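A contrived sketch makes the hazard concrete: if two row pointers refer to the same storage, the rows are not independent and the outer loop must not run in parallel.

```c
/* Fill each "row" with its row index. If row pointers alias, later
 * rows overwrite earlier ones, so the result is order-dependent. */
void fill_rows(double **p, int rows, int cols) {
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            p[i][j] = (double)i;
}
```

With p[0] and p[1] both aliased to one buffer, a sequential run leaves the buffer filled with 1.0 (row 1 wrote last); a parallel schedule could leave 0.0. A true two-dimensional array makes such aliasing impossible, which is why the rewrite above is safe.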
Loops Bounded by Pointer Comparisons
The auto-parallelizer reorders only those loops in which the number
of iterations can be exactly determined. In Fortran programs this
is rarely a problem, but in C and C++ subtle issues relating to
overflow and unsigned arithmetic can come into play. One consequence
of this is that loops should not be bounded by pointer comparisons
such as
int *pl, *pu;
for (int *p = pl; p != pu; p++)
This loop cannot be made parallel, and compiling it will result in a
.l file entry stating the bound cannot be standardized. To avoid
this result, restructure the loop to be of the form
int lb, ub;
for (int i = lb; i <= ub; i++)
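The rewrite can also be phrased directly in terms of the pointers (a hypothetical helper): the index form has the explicit trip count pu - pl, which is what the parallelizer needs. This assumes pl and pu point into the same array with pl <= pu.

```c
#include <stddef.h>

/* Countable form of "for (int *p = pl; p != pu; p++) *p = 0;". */
void zero_range(int *pl, int *pu) {
    ptrdiff_t n = pu - pl;          /* exact iteration count */
    for (ptrdiff_t i = 0; i < n; i++)
        pl[i] = 0;
}
```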
Aliased Parameter Information
Perhaps the most frequent impediment to parallelizing C and C++ is
aliased information. Although Fortran guarantees that multiple
parameters to a subroutine are not aliased to each other, C and C++
do not. Consider the following example:
void sub(double *a, double *b, int n) {
for (int i = 0; i < n; i++)
a[i] = b[i];
}
This loop can be parallelized only if arrays a and b do not overlap.
With the option -OPT:alias=restrict, you can assure the auto-
parallelizer that the arrays do not overlap. This assurance permits
the auto-parallelizer to proceed with the parallelization. See the
MIPSpro Compiling and Performance Tuning Guide for details about
this option.
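In modern C (C99 and later), the restrict qualifier expresses the same non-overlap guarantee per parameter, rather than globally as -OPT:alias=restrict does. A sketch, with the function name hypothetical:

```c
/* 'restrict' promises the compiler that a and b never overlap, so the
 * copy loop is safe to parallelize. (Behavior is undefined if the
 * promise is broken and they do overlap.) */
void copy_vec(double * restrict a, const double * restrict b, int n) {
    for (int i = 0; i < n; i++)
        a[i] = b[i];
}
```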
Incorrectly Parallelized Nested Loops
The auto-parallelizer parallelizes a loop by distributing its
iterations among the available processors.
Because the resulting performance is usually better, the auto-
parallelizer tries to parallelize the outermost loop.
If it cannot do so, probably for one of the reasons mentioned in the
previous section, it tries to interchange the outermost loop with an
inner one that it can parallelize.
Example Nested Loops
do i = 1,n
do j = 1,n
...
end do
end do
Even when most of your program is parallelized, it is possible that
the wrong loop is parallelized. Given a nest of loops, the auto-
parallelizer will only parallelize one of the loops in the nest. In
general, it is better to parallelize outer loops rather than inner
ones.
The auto-parallelizer will try to either parallelize the outer loop
or interchange the parallel loop so that it will be outermost, but
sometimes it is not possible. For any of the reasons mentioned in
the previous section, the auto-parallelizer might be able to
parallelize an inner loop but not the outer one. Even if this
results in most of your code being parallelized, it might be
advantageous to modify your code so that the outer loop is
parallelized.
It is better to parallelize loops that do not have very small trip
counts. Consider the following example.
do i = 1,m
do j = 1,n
The auto-parallelizer may decide to parallelize the i loop, but if m
is very small, it would be better to interchange the j loop to be
outermost and then parallelize it. The auto-parallelizer might not
have any way to know that m is small. In such cases, the user can
either use the C*$* ASSERT DO PREFER directives discussed in the next
section to tell the auto-parallelizer that it is better to parallelize
the j loop, or the user can use manual parallelism directives.
Because of memory hierarchies, performance can be improved if the
same processors access the same data in all parallel loop nests.
Consider the following two examples.
Example Inefficient Loop
do i = 1,n
...a(i)
end do
do i = n,1,-1
...a(i)...
end do
Assume that there are p processors. In the first loop, the first
processor will access the first n/p elements of a, the second
processor will access the next n/p and so on. In the second loop,
the first processor will access the last n/p elements of a. Assuming
n is not too large, those elements will be in the cache of a
different processor. Accessing data that is in some other
processor's cache can be very expensive. This example might run much
more efficiently if we reverse the direction of one of the loops.
Example Efficient Loop
do i = 1,n
do j = 1,n
a(i,j) = b(j,i) + ...
end do
end do
do i = 1,n
do j = 1,n
b(i,j) = a(j,i) + ...
end do
end do
In this second example, the auto-parallelizer might choose to
parallelize the outer loop in both nests. This means that in the
first loop the first processor is accessing the first n/p rows of a
and the first n/p columns of b, while in the second loop the first
processor is accessing the first n/p columns of a and the first n/p
rows of b. This example will run much more efficiently if we
parallelize the i loop in one nest and the j loop in the other. The
user can add the prefer directives described in the next section to
solve this problem.
Unnecessarily Parallelized Loops
The auto-parallelizer may parallelize loops that would run better
sequentially. While this is usually not a disaster, it can cause unnecessary
overhead. There is a certain overhead to running loops in parallel. If,
for example, a loop has a small number of iterations, it is faster to
execute the loop sequentially. When bounds are unknown (and even
sometimes when they are known), the auto-parallelizer parallelizes loops
conditionally. In other words, code is generated for both a parallel and
sequential version of the loop. The parallel version is executed only
when the auto-parallelizer thinks that there is sufficient work for it to
be worthwhile to execute the loop in parallel. This estimate depends on
the iteration count, what code is inside the loop body, how many
processors are available, and the auto-parallelizer's estimate of the
overhead cost to invoke a parallel loop. The user can control the
compiler's estimate for the invocation overhead using the option
-LNO:parallel_overhead=n. The default value for n will vary on different
systems, but typical values are in the low thousands.
By generating two versions of the loop, we avoid going parallel in small
trip count cases, but versioning does incur an overhead to do the dynamic
check. The user can use the DO PREFER assertions to ensure that a loop
goes parallel or sequential without incurring a run-time test.
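The two-version dispatch can be sketched as follows (a hypothetical cost model, not the compiler's actual formula): the parallel version runs only when the estimated work exceeds the invocation overhead, the quantity -LNO:parallel_overhead=n controls.

```c
#include <stdbool.h>

/* Run-time test guarding a conditionally parallelized loop. Costs are
 * in abstract units; the real heuristic also weighs the processor
 * count and the contents of the loop body. */
bool run_parallel_version(long trip_count, long cost_per_iter,
                          long overhead) {
    return trip_count * cost_per_iter > overhead;
}
```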
Nested parallelism is not supported. Consider the following case:
subroutine caller
do i
call sub
end do
subroutine sub
...
do i
..
end do
end
Suppose that the first loop is parallelized. It is not possible to
execute the loop inside sub in parallel whenever sub is called by caller.
Thus the auto-parallelizer must generate a test for every parallel loop
that checks whether the loop is being invoked from another parallel loop
or region. While this check is not very expensive, in some cases it can
add to overhead. If the user knows that sub is always called from caller,
the user can use the prefer directives to force the loop in sub to go
sequential.
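The guard generated for every parallel loop amounts to a check like this sketch, where the depth argument stands in for the runtime's internal state (the name is hypothetical):

```c
#include <stdbool.h>

/* Nested parallelism is not supported: a loop may start a parallel
 * region only when it is not already executing inside one. */
bool loop_may_go_parallel(int parallel_nesting_depth) {
    return parallel_nesting_depth == 0;
}
```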
Assisting the Silicon Graphics Automatic Parallelizer
This section discusses actions you can take to enhance the performance of
the auto-parallelizer.
Assisting the Automatic Parallelizer
There are circumstances that interfere with the auto-parallelizer's
ability to optimize programs. As shown in Parallelization Failures With
the Automatic Parallelizer, problems are sometimes caused by coding
practices. Other times, the auto-parallelizer does not have enough
information to make good parallelization decisions. You can pursue three
strategies to attack these problems and achieve better results with the
auto-parallelizer.
The first approach is to modify your code to avoid coding practices that
the auto-parallelizer cannot analyze well.
The second strategy is to assist the auto-parallelizer with the manual
parallelization directives described in the MIPSpro Compiling and
Performance Tuning Guide. The auto-parallelizer is designed to recognize
and coexist with manual parallelism. You can use manual directives with
some loop nests, while leaving others to the auto-parallelizer. This
approach has both positive and negative aspects.
On the positive side, the manual parallelism directives are well defined
and deterministic. If you use a manual directive, the specified loop will
run in parallel.
Note: This last statement assumes that the trip count is greater than
one and that the specified loop is not nested in another parallel loop.
On the negative side, you must carefully analyze the code to determine
that parallelism is safe. Also, you must mark all variables that need to
be localized.
The third alternative is to use the automatic parallelization directives
and assertions to give the auto-parallelizer more information about your
code. The automatic directives and assertions are described in Directives
and Assertions for Automatic Parallelization. Like the manual directives,
they have positive and negative features:
On the positive side, automatic directives and assertions are easier to
use and they allow you to express the information you know without your
having to be certain that all the conditions for parallelization are met.
On the negative side, they are hints and thus do not impose parallelism.
In addition, as with the manual directives, you must ensure that you are
using them legally. Because they require less information than the manual
directives, automatic directives and assertions can have subtle meanings.
Directives and Assertions for Automatic Parallelization
Directives enable, disable, or modify features of the auto-parallelizer.
Assertions assist the auto-parallelizer by providing it with additional
information about the source program. The automatic directives and
assertions do not impose parallelism; they give hints and assertions to
the auto-parallelizer in order to assist it in parallelizing the right
loops. To invoke a directive or assertion, include it in the input
file. Listed below are the Fortran directives and assertions for the
auto-parallelizer.
C*$* NO CONCURRENTIZE
Do not parallelize the subroutine or file in which it appears.
C*$* CONCURRENTIZE
Not used. (See below.)
C*$* ASSERT DO (CONCURRENT)
Ignore perceived dependences between two references to the same
array when parallelizing.
C*$* ASSERT DO (SERIAL)
Do not parallelize the following loop.
C*$* ASSERT CONCURRENT CALL
Ignore subroutine calls when parallelizing.
C*$* ASSERT PERMUTATION (array_name)
Array array_name is a permutation array.
C*$* ASSERT DO PREFER (CONCURRENT)
Parallelize the following loop if it is safe.
C*$* ASSERT DO PREFER (SERIAL)
Do not parallelize the following loop.
Note: The general compiler option -LNO:ignore_pragmas causes the
auto-parallelizer to ignore all of these directives and assertions.
C*$* NO CONCURRENTIZE
The C*$* NO CONCURRENTIZE directive prevents parallelization. Its
effect depends on where it is placed.
When placed inside a subroutine, the directive prevents the
parallelization of the subroutine. In the following example, SUB1()
is not parallelized. Example:
SUBROUTINE SUB1
C*$* NO CONCURRENTIZE
...
END
When placed outside of a subroutine, C*$* NO CONCURRENTIZE prevents
the parallelization of all the subroutines in the file. The
subroutines SUB2() and SUB3() are not parallelized in the next
example. Example:
SUBROUTINE SUB2
...
END
C*$* NO CONCURRENTIZE
SUBROUTINE SUB3
...
END
The C*$* NO CONCURRENTIZE directive is valid only when the -pfa or
-pca command-line option is used.
C*$* CONCURRENTIZE
The C*$* CONCURRENTIZE directive exists only to maintain backwards
compatibility, and its use is discouraged. Using the -pfa or -pca
option replaces using this directive.
C*$* ASSERT DO (CONCURRENT)
C*$* ASSERT DO (CONCURRENT) says that when analyzing the loop
immediately following this assertion, the auto-parallelizer should
ignore any perceived dependences between two references to the same
array. The following example is a correct use of the assertion when
M > N.
Example:
C*$* ASSERT DO (CONCURRENT)
DO I = 1, N
A(I) = A(I+M)
END DO
This assertion is usually used to help the auto-parallelizer with
loops that have indirect array references. There are other facts to
be aware of when using this assertion.
If multiple loops in a nest can be parallelized, C*$* ASSERT DO
(CONCURRENT) causes the auto-parallelizer to prefer the loop
immediately following the assertion. The assertion does not affect
how the auto-parallelizer analyzes CALL statements and dependences
between two potentially aliased pointers.
Note: If there are real dependences between array references, C*$*
ASSERT DO (CONCURRENT) may cause the auto-parallelizer to generate
incorrect code.
C*$* ASSERT DO (SERIAL)
C*$* ASSERT DO (SERIAL) instructs the auto-parallelizer to not
parallelize the loop following the assertion.
C*$* ASSERT CONCURRENT CALL
The C*$* ASSERT CONCURRENT CALL assertion tells the auto-
parallelizer to ignore subroutine calls contained in a loop when
deciding if that loop is parallel. The assertion applies to the loop
that immediately follows it and to all loops nested inside that
loop. The auto-parallelizer ignores subroutine FRED() when it
analyzes the following loop.
C*$* ASSERT CONCURRENT CALL
DO I = 1, N
CALL FRED
...
END DO
SUBROUTINE FRED
...
END
To prevent incorrect parallelization, you must make sure the
following conditions are met when using C*$* ASSERT CONCURRENT CALL:
A subroutine cannot read from a location inside the loop that is
written to during another iteration. This rule does not apply to a
location that is a local variable declared inside the subroutine.
A subroutine cannot write to a location inside the loop that is read
from during another iteration. This rule does not apply to a
location that is a local variable declared inside the subroutine.
The following code shows an illegal use of the assertion. Subroutine
FRED() writes to variable T which is also read from by WILMA()
during other iterations.
C*$* ASSERT CONCURRENT CALL
DO I = 1,M
CALL FRED(B, I, T)
CALL WILMA(A, I, T)
END DO
SUBROUTINE FRED(B, I, T)
REAL B(*)
T = B(I)
END
SUBROUTINE WILMA(A, I, T)
REAL A(*)
A(I) = T
END
By localizing the variable T, you could manually parallelize the
above example safely. But, the auto-parallelizer does not know to
localize T, and it illegally parallelizes the loop because of the
assertion.
C*$* ASSERT PERMUTATION (array_name)
C*$* ASSERT PERMUTATION tells the auto-parallelizer that array_name
is a permutation array: every element of the array has a distinct
value. Array B is asserted to be a permutation array in this
example.
Example:
C*$* ASSERT PERMUTATION (B)
DO I = 1, N
A(B(I)) = ...
END DO
As shown in the previous example, you can use this assertion to
parallelize loops that use arrays for indirect addressing. Without
this assertion, the auto-parallelizer is not able to determine that
the array elements used as indexes are distinct.
Note: The assertion does not require the permutation array to be
dense.
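What the assertion promises can be written as a hypothetical O(n^2) check: every element of the index array is distinct, so a(b(i)) touches each location at most once. The compiler takes the assertion on faith; nothing like this runs at compile time or run time.

```c
#include <stdbool.h>

/* True when b[0..n-1] contains no repeated value, i.e. when b would be
 * a legal target of C*$* ASSERT PERMUTATION (density not required). */
bool indices_are_distinct(const int *b, int n) {
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (b[i] == b[j])
                return false;
    return true;
}
```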
C*$* ASSERT DO PREFER (CONCURRENT)
C*$* ASSERT DO PREFER (CONCURRENT) says that the auto-parallelizer
should parallelize the loop immediately following the assertion, if
it is safe to do so. The following code encourages the auto-
parallelizer to run the I loop in parallel.
C*$*ASSERT DO PREFER (CONCURRENT)
DO I = 1, M
DO J = 1, N
A(I,J) = B(I,J)
END DO
...
END DO
When dealing with nested loops, follow these guidelines:
If the loop specified by this assertion is safe to parallelize, the
auto-parallelizer chooses it to parallelize, even if other loops in
the nest are safe.
If the specified loop is not safe, the auto-parallelizer chooses
another loop that is safe, usually the outermost.
This assertion can be applied to more than one loop in a nest. In
this case, the auto-parallelizer uses its heuristics to choose one
of the specified loops.
Note: C*$* ASSERT DO PREFER (CONCURRENT) is always safe to use. The
auto-parallelizer will not illegally parallelize a loop because of
this assertion.
C*$* ASSERT DO PREFER (SERIAL)
The C*$* ASSERT DO PREFER (SERIAL) assertion requests the auto-
parallelizer not to parallelize the loop that immediately follows.
In the following case, the assertion requests that the J loop be run
serially.
DO I = 1, M
C*$*ASSERT DO PREFER (SERIAL)
DO J = 1, N
A(I,J) = B(I,J)
END DO
...
END DO
Using C*$* ASSERT DO PREFER (SERIAL)
The assertion applies only to the loop directly after the assertion.