Table of Contents
- Getting Started
- Using the GPU version of the code
- Using the ADIOS library for I/O
- Using HDF5 for file I/O
- Adding OpenMP support in addition to MPI
- Configuration summary
- Compiling on an IBM BlueGene
- Visualizing the subroutine calling tree of the source code
- Becoming a developer of the code, or making small modifications in the source code
- References
Getting Started
To download the SPECFEM3D_Cartesian software package, type this:
git clone --recursive --branch devel https://github.com/SPECFEM/specfem3d.git
Then, to configure the software for your system, run the configure shell script. This script will attempt to guess the appropriate configuration values for your system. However, at a minimum, it is recommended that you explicitly specify the appropriate command names for your Fortran compiler (another option is to define FC, CC and MPIF90 in your .bash_profile or your .cshrc file):
./configure FC=gfortran CC=gcc
If you want to run in parallel, i.e., using more than one processor core, then you would type
./configure FC=gfortran CC=gcc MPIFC=mpif90 --with-mpi
You can replace the GNU compilers above (gfortran and gcc) with other compilers if you want to; for instance, for Intel ifort and icc, use FC=ifort CC=icc instead. Note that MPI must be installed with MPI-IO enabled because parts of SPECFEM3D perform I/O through MPI-IO.
Before running the configure script, you should probably edit file flags.guess to make sure that it contains the best compiler options for your system. Known issues or things to check are:
Intel ifort compiler
See if you need to add -assume byterecl for your machine. With this compiler, we have noticed that initial release versions sometimes have bugs or issues that can lead to wrong results when running the code; we therefore strongly recommend using a version for which at least one service pack or update has been installed. In particular, for version 17 of the compiler, users have reported problems (the code crashing at run time) with the -assume buffered_io option; if you notice problems, remove that option from file flags.guess or change it to -assume nobuffered_io and try again.
IBM compiler
See if you need to add -qsave or -qnosave for your machine.
Mac OS
You will probably need to install XCODE.
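For the Intel -assume buffered_io issue mentioned above, the change in flags.guess could be scripted roughly as follows. This is only a sketch: the helper name disable_buffered_io is hypothetical, and it assumes that the option appears literally as "-assume buffered_io" in the file. A backup copy with the suffix .bak is kept.

```shell
#!/bin/sh
# Hypothetical helper (not part of SPECFEM3D): switch "-assume buffered_io"
# to "-assume nobuffered_io" in the given file (flags.guess by default).
disable_buffered_io() {
    file="${1:-flags.guess}"
    [ -f "$file" ] || { echo "no such file: $file" >&2; return 1; }
    # Keep a backup, then rewrite the original in place (portable "sed -i").
    cp "$file" "$file.bak" || return 1
    sed 's/-assume buffered_io/-assume nobuffered_io/g' "$file.bak" > "$file"
}
```

Calling disable_buffered_io without an argument edits flags.guess in the current directory; running it a second time leaves the file unchanged.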
When compiling on an IBM machine with the xlf and xlc compilers, we suggest running the configure script with the following options:
./configure FC=xlf90_r MPIFC=mpif90 CC=xlc_r CFLAGS="-O3 -q64" FCFLAGS="-O3 -q64" --with-scotch-dir=...
If you have problems configuring the code on a Cray machine, for instance if you get an error message from the configure script, try exporting these two variables: MPI_INC=${CRAY_MPICH2_DIR}/include and FCLIBS=" ". For more details, refer to the utils/infos/Cray_compiler_information directory. You can also have a look at the configure script called utils/infos/Cray_compiler_information/configure_SPECFEM_for_Piz_Daint.bash.
On SGI systems, flags.guess automatically informs configure to insert “TRAP_FPE=OFF” into the generated Makefile in order to turn underflow trapping off.
You can add --enable-vectorization to the configuration options to speed up the code in the fluid (acoustic) and elastic parts. This works fine if (and only if) your computer always allocates a contiguous memory block for each allocatable array; this is the case for most machines and most compilers, but not all. To disable this feature, use option --disable-vectorization. For more details see github.com/SPECFEM/specfem3d/issues/81. To check if that option works fine on your machine, run the code with and without it for an acoustic/elastic model and make sure the seismograms are identical.
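The comparison of seismograms with and without vectorization can be automated with a small script such as the one below. It is a minimal sketch: the helper name compare_seismograms is hypothetical, and it assumes that you have copied the seismogram files of the two runs into two separate directories with identical file names.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if every regular file in directory $1
# has a byte-identical counterpart with the same name in directory $2.
compare_seismograms() {
    ref_dir="$1"; new_dir="$2"; status=0
    for ref in "$ref_dir"/*; do
        [ -f "$ref" ] || continue
        name=$(basename "$ref")
        # cmp -s is silent and fails if the files differ or one is missing.
        if ! cmp -s "$ref" "$new_dir/$name"; then
            echo "differs or missing: $name"
            status=1
        fi
    done
    return $status
}
```

For example, compare_seismograms run_without_vectorization run_with_vectorization && echo "identical" (both directory names are placeholders).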
Note that we use CUBIT (now called Trelis) to create meshes of hexahedra, but other packages can be used as well, for instance GiD from https://www.gidsimulation.com or Gmsh from http://gmsh.info (Geuzaine and Remacle 2009). Even mesh creation packages that generate tetrahedra, for instance TetGen from http://tetgen.berlios.de, can be used because each tetrahedron can then easily be decomposed into four hexahedra as shown in the picture of the TetGen logo at http://tetgen.berlios.de/figs/Delaunay-Voronoi-3D.gif; while this approach does not generate hexahedra of optimal quality, it can ease mesh creation in some situations and it has been shown that the spectral-element method can very accurately handle distorted mesh elements (Oliveira and Seriani 2011).
The SPECFEM3D Cartesian software package relies on the SCOTCH library to partition meshes created with CUBIT. METIS (Karypis and Kumar 1998a, 1998b, 1998c) can be used instead of SCOTCH if you prefer, by changing the parameter PARTITIONING_TYPE in the DATA/Par_file. You will then also need to install and compile Metis version 4.0 (do NOT install Metis version 5.0, which has incompatible function calls), and edit Makefile.in to uncomment the METIS link flag in that file before running configure.
The SCOTCH library (Pellegrini and Roman 1996) provides efficient static mapping, graph and mesh partitioning routines. SCOTCH is a free software package developed by François Pellegrini et al. from LaBRI and INRIA in Bordeaux, France, downloadable from the web page https://gitlab.inria.fr/scotch/scotch. If no SCOTCH libraries can be found on the system, the configuration will bundle the version provided with the source code for compilation. The path to an existing SCOTCH installation can be set explicitly with the option --with-scotch-dir. For example:
./configure FC=ifort MPIFC=mpif90 --with-scotch-dir=/opt/scotch
If you use the Intel ifort compiler to compile the code, we recommend that you use the Intel icc C compiler to compile Scotch, i.e., use:
./configure CC=icc FC=ifort MPIFC=mpif90
When compiling the SCOTCH source code, if you get a message such as “ld: cannot find -lz”, the Zlib compression development library is probably missing on your machine and you will need to install it or ask your system administrator to do so. On Linux machines the package is often called “zlib1g-dev” or similar (thus “sudo apt-get install zlib1g-dev” would install it).
To compile a serial version of the code for small meshes that fit on one compute node and can therefore be run serially, run configure with the --without-mpi option to suppress all calls to MPI.
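As an illustration of the choice between the parallel and serial configurations, the sketch below prints (without running it) one of the configure command lines given in this section, depending on whether an MPI Fortran wrapper is found on the PATH. The helper name suggest_configure is hypothetical, and the GNU compilers are only an example.

```shell
#!/bin/sh
# Hypothetical helper: print a configure command line, choosing --with-mpi
# only if the given MPI Fortran wrapper (mpif90 by default) can be found.
suggest_configure() {
    mpi_wrapper="${1:-mpif90}"
    if command -v "$mpi_wrapper" >/dev/null 2>&1; then
        echo "./configure FC=gfortran CC=gcc MPIFC=$mpi_wrapper --with-mpi"
    else
        echo "./configure FC=gfortran CC=gcc --without-mpi"
    fi
}

suggest_configure mpif90
```

The script only prints a suggestion; check the compiler names against your own system before running the command.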
To run the package on Windows rather than on Unix machines, you can install Docker or VirtualBox (installing Linux inside VirtualBox in the latter case) and run the package from inside it.
We recommend that you add ulimit -S -s unlimited to your .bash_profile file and/or limit stacksize unlimited to your .cshrc file to suppress any potential limit to the size of the Unix stack.
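One possible way to add these lines without duplicating them on repeated runs is sketched below. The helper name add_line_once is hypothetical; the example calls are commented out so that nothing is modified by accident.

```shell
#!/bin/sh
# Hypothetical helper: append a line to a startup file only if that exact
# line is not already present in the file.
add_line_once() {
    line="$1"; file="$2"
    touch "$file" || return 1
    # -x: match whole lines, -F: fixed string, -q: quiet.
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example usage:
# add_line_once 'ulimit -S -s unlimited' "$HOME/.bash_profile"
# add_line_once 'limit stacksize unlimited' "$HOME/.cshrc"
```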
Beware that some cluster systems that run a recent version may not run and/or compile an older version of the code.
When using dynamic faults in parallel with the developer version, we suggest that you set both FAULT_DISPL_VELOC and FAULT_SYNCHRONIZE_ACCEL to .true. in the configuration parameters.
Using the GPU version of the code
SPECFEM3D now supports CUDA and HIP GPU acceleration. When compiling for GPU cards, you can enable the CUDA version with:
./configure --with-cuda ..
or
./configure --with-cuda=cuda9 ..
where for example cuda4, cuda5, cuda6, cuda7, .. specifies the target GPU architecture of your card (e.g., cuda9 refers to Volta V100 cards), rather than the installed version of the CUDA toolkit. Before CUDA version 5, each version supported essentially one new architecture and needed a different kind of compilation. Since version 5, the compilation has stayed the same, but newer versions support newer architectures. However, at the moment, we still have one version linked to one specific architecture:
- CUDA 4 for Tesla, cards like K10, Geforce GTX 650, ..
- CUDA 5 for Kepler, like K20
- CUDA 6 for Kepler, like K80
- CUDA 7 for Maxwell, like Quadro K2200
- CUDA 8 for Pascal, like P100
- CUDA 9 for Volta, like V100
- CUDA 10 for Turing, like GeForce RTX 2080
- CUDA 11 for Ampere, like A100
- CUDA 12 for Hopper, like H100
So even if you have the newer CUDA toolkit version 11 installed but want to run on, say, a K20 GPU, you would still configure with:
./configure --with-cuda=cuda5
The compilation with the cuda5 setting then chooses the right architecture (-gencode=arch=compute_35,code=sm_35 for K20 cards).
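The correspondence between example cards and configure settings in the list above can be written as a small lookup, sketched below. The helper name cuda_option_for_card is hypothetical, and only the example cards named in the list are covered.

```shell
#!/bin/sh
# Hypothetical helper: map one of the example cards listed above to the
# corresponding --with-cuda setting (card names are matched case-insensitively).
cuda_option_for_card() {
    card=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$card" in
        k10)     arch=cuda4 ;;
        k20)     arch=cuda5 ;;
        k80)     arch=cuda6 ;;
        k2200)   arch=cuda7 ;;
        p100)    arch=cuda8 ;;
        v100)    arch=cuda9 ;;
        rtx2080) arch=cuda10 ;;
        a100)    arch=cuda11 ;;
        h100)    arch=cuda12 ;;
        *) echo "unknown card: $1" >&2; return 1 ;;
    esac
    echo "--with-cuda=$arch"
}

cuda_option_for_card K20   # prints --with-cuda=cuda5
```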
The same applies to compilation for AMD cards with HIP:
./configure --with-hip ..
or
./configure --with-hip=MI8 ..
where for example MI8, MI25, MI50, MI100, MI250, .. specifies the target GPU architecture of your card. Additional compilation flags can be added by specifying HIP_FLAGS, for example:
./configure --with-hip=MI250 \
HIP_FLAGS="-fPIC -ftemplate-depth-2048 -fno-gpu-rdc \
-O2 -fdenormal-fp-math=ieee -fcuda-flush-denormals-to-zero -munsafe-fp-atomics" \
..
Using the ADIOS library for I/O
Regular POSIX I/O can be problematic when dealing with large simulations on large clusters (typically more than 10,000 MPI processes). SPECFEM3D can use the ADIOS library (Liu et al. 2013) to take advantage of advanced parallel file system features. To enable ADIOS, follow these steps:
- Install ADIOS (available from https://www.olcf.ornl.gov/center-projects/adios/). Make sure that your environment variables reference it.
- You may want to change ADIOS related values in the setup/constants.h file. The default values probably suit most cases.
- Configure using the --with-adios flag.
ADIOS is currently only usable for meshes generated with meshfem3D (i.e., not for meshes generated with CUBIT). Additional control parameters are discussed in section [cha:Main-Parameter].
Using HDF5 for file I/O
As file I/O can be a bottleneck in large-scale simulations, SPECFEM3D supports file I/O using the HDF5 format for movie snapshots and database files. To use this feature, you will need to compile the code with the corresponding HDF5 flags. The configuration of the package could look, for example, like this:
./configure --with-hdf5 HDF5_INC="/opt/homebrew/include" HDF5_LIBS="-L/opt/homebrew/lib" \
..
In the main Par_file, you will then have to turn on the HDF5 flag HDF5_ENABLED. Note that additional MPI processes can be launched specifically to handle the file I/O in an asynchronous way. The number of these additional MPI processes is specified by the parameter HDF5_IO_NODES, such that the total number of MPI processes to launch the executables becomes NPROC + HDF5_IO_NODES.
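As a small worked example of the process count above, the following sketch computes NPROC + HDF5_IO_NODES and prints a launch command. The helper name total_mpi_procs is hypothetical, and the launcher (mpirun -np) and the executable path ./bin/xspecfem3D are assumptions that depend on your system and setup.

```shell
#!/bin/sh
# Hypothetical helper: total number of MPI processes to launch, i.e.
# NPROC plus the number of dedicated HDF5 I/O processes (0 if omitted).
total_mpi_procs() {
    nproc="$1"; io_nodes="${2:-0}"
    echo $((nproc + io_nodes))
}

# Example: NPROC = 64 and HDF5_IO_NODES = 2 give 66 processes in total.
# "mpirun -np" and the executable path are assumptions; adapt as needed.
echo "mpirun -np $(total_mpi_procs 64 2) ./bin/xspecfem3D"
```

The script only prints the command line, so it can be checked before anything is launched.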
Adding OpenMP support in addition to MPI
OpenMP support can be enabled in addition to MPI. However, in many cases performance will not improve because our pure MPI implementation is already heavily optimized and thus the resulting code will in fact be slightly slower. A possible exception could be IBM BlueGene-type architectures.
To enable OpenMP, add the flag --enable-openmp to the configuration:
./configure --enable-openmp ..
This will add the corresponding OpenMP flag for the chosen Fortran compiler.
The DO-loop using OpenMP threads has a SCHEDULE property. The OMP_SCHEDULE environment variable can set the scheduling policy of that DO-loop. Tests performed by Marcin Zielinski at SARA (The Netherlands) showed that the best scheduling policy is often DYNAMIC with a chunk size equal to the number of OpenMP threads or, preferably, twice the number of OpenMP threads (thus chunk size = 8 for 4 OpenMP threads, etc.). If OMP_SCHEDULE is not set or is empty, the DO-loop will use a generic scheduling policy, which will slow down the job quite a bit.
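The suggested setting can be derived from the number of threads as in the sketch below. The helper name omp_schedule_for is hypothetical; OMP_NUM_THREADS and the "type,chunk" form of OMP_SCHEDULE are standard OpenMP environment settings, and the value of 4 threads is only an example.

```shell
#!/bin/sh
# Hypothetical helper: build an OMP_SCHEDULE value of the form
# "dynamic,<chunk>" with a chunk size of twice the number of threads.
omp_schedule_for() {
    threads="$1"
    echo "dynamic,$((2 * threads))"
}

OMP_NUM_THREADS=4
OMP_SCHEDULE=$(omp_schedule_for "$OMP_NUM_THREADS")
export OMP_NUM_THREADS OMP_SCHEDULE
echo "OMP_SCHEDULE=$OMP_SCHEDULE"   # prints OMP_SCHEDULE=dynamic,8
```

Whether this policy is the fastest on a given machine is best checked by timing a short run with a few different chunk sizes.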
Configuration summary
A summary of the most important configuration variables follows.
F90
Path to the Fortran compiler.
MPIF90
Path to MPI Fortran.
MPI_FLAGS
Some systems require this flag to link to MPI libraries.
FLAGS_CHECK
Compiler flags.
The configuration script automatically creates for each executable a corresponding Makefile in the src/ subdirectory. The Makefile contains a number of suggested entries for various compilers, e.g., Portland, Intel, Absoft, NAG, and Lahey. The software has run on a wide variety of compute platforms, e.g., various PC clusters and machines from Sun, SGI, IBM, Compaq, and NEC. Select the compiler you wish to use on your system and choose the related optimization flags. Note that the default flags in the Makefile are undoubtedly not optimal for your system, so we encourage you to experiment with these flags and to solicit advice from your systems administrator. Selecting the right compiler and optimization flags can make a tremendous difference in terms of performance. We welcome feedback on your experience with various compilers and flags.
Now that you have set the compiler information, you need to select a number of flags in the setup/constants.h file depending on your system:
LOCAL_PATH_IS_ALSO_GLOBAL
Set to .false. on most cluster applications. For reasons of speed, the (parallel) distributed database generator typically writes a (parallel) database for the solver on the local disks of the compute nodes. Some systems have no local disks, e.g., BlueGene or the Earth Simulator, and other systems have a fast parallel file system, in which case this flag should be set to .true. Note that this flag is not used by the database generator or the solver; it is only used for some of the post-processing.
The package can run either in single or in double precision mode. The default is single precision because, for almost all calculations performed using the spectral-element method, single precision is sufficient and gives the same results (i.e., the same seismograms); the single precision code is also faster and requires exactly half as much memory. Select your preference by selecting the appropriate setting in the setup/constants.h file:
CUSTOM_REAL
Set to SIZE_REAL for single precision and SIZE_DOUBLE for double precision.
In the precision.h file:
CUSTOM_MPI_TYPE
Set to MPI_REAL for single precision and MPI_DOUBLE_PRECISION for double precision.
On many current processors (e.g., Intel, AMD, IBM Power), single precision calculations are significantly faster; the difference can typically be 10% to 25%. It is therefore better to use single precision. For the physical problem you want to study, you can run the same calculation once in single precision and once in double precision on your system and compare the seismograms. If they are identical (and in most cases they will be), you can select single precision for your future runs.
If your compiler has problems with the use mpi statements that are used in the code, use the script called replace_use_mpi_with_include_mpif_dot_h.pl in the root directory to replace all of them with include 'mpif.h' automatically.
Compiling on an IBM BlueGene
Installation instructions for IBM BlueGene (from April 2013):
Edit file flags.guess and put this for FLAGS_CHECK:
-g -qfullpath -O2 -qsave -qstrict -qtune=qp -qarch=qp -qcache=auto -qhalt=w
-qfree=f90 -qsuffix=f=f90 -qlanglvl=95pure -Q -Q+rank,swap_all -Wl,-relax
The most relevant are the -qarch and -qtune flags: if these flags are set to “auto”, they are wrongly assigned to the architecture of the front-end node, which is different from that of the compute nodes. You will need to set these flags to the right architecture for your BlueGene compute nodes, which is not necessarily “qp”; ask your system administrator. On some machines it is necessary to use -O2 in these flags instead of -O3 due to a compiler bug in the installed XLF version. We thus suggest first trying -O3 and, if the code does not compile or does not run correctly, switching back to -O2. The debug flags (-g, -qfullpath) do not influence performance but are useful to get at least some insight in case of problems.
Before running configure, select the XL Fortran compiler by typing module load bgq-xl/1.0 or module load bgq-xl (another, less efficient option is to load the GNU compilers using module load bgq-gnu/4.4.6 or similar).
Then, to configure the code, type this:
./configure FC=bgxlf90_r MPIFC=mpixlf90_r CC=bgxlc_r LOCAL_PATH_IS_ALSO_GLOBAL=true
In order for the SCOTCH domain decomposer to compile, on some (but not all) Blue Gene systems you may need to run configure with CC=gcc instead of CC=bgxlc_r.
Older installation instructions for IBM BlueGene, from 2011:
To compile the code on an IBM BlueGene, Laurent Léger from IDRIS, France, suggests the following: compile the code with
FLAGS_CHECK="-O3 -qsave -qstrict -qtune=auto -qarch=450d -qcache=auto \
-qfree=f90 -qsuffix=f=f90 -g -qlanglvl=95pure -qhalt=w -Q \
-Q+rank,swap_all -Wl,-relax"
Option -Wl,-relax must be added on many (but not all) BlueGene systems to be able to link the binaries xmeshfem3D and xspecfem3D because the final link step is done by the GNU ld linker even if one uses FC=bgxlf90_r, MPIFC=mpixlf90_r and CC=bgxlc_r to create all the object files. Conversely, on some BlueGene systems that use the native AIX linker, option -Wl,-relax can lead to problems and must be removed from flags.guess.
Also, AR=ar, ARFLAGS=cru and RANLIB=ranlib are hardwired in all Makefile.in files by default, but to cross-compile on BlueGene/P one needs to change these values to AR=bgar, ARFLAGS=cru and RANLIB=bgranlib. Thus the easiest thing to do is to modify all Makefile.in files and the configure script so that these values are set automatically by configure. One then just needs to pass the right commands to the configure script:
./configure --prefix=/path/to/SPECFEM3DG_SP --host=Babel --build=BGP \
FC=bgxlf90_r MPIFC=mpixlf90_r CC=bgxlc_r AR=bgar ARFLAGS=cru \
RANLIB=bgranlib LOCAL_PATH_IS_ALSO_GLOBAL=false
This trick can be useful for all hosts on which one needs to cross-compile.
On BlueGene, one also needs to run the xcreate_header_file binary file manually rather than in the Makefile:
bgrun -np 1 -mode VN -exe ./bin/xcreate_header_file
Visualizing the subroutine calling tree of the source code
Packages such as Doxywizard can be used to visualize the calling tree of the subroutines of the source code. Doxywizard is a GUI front-end for configuring and running Doxygen.
To visualize the call tree (calling tree) of the source code, you can see the Doxygen tool available in directory doc/call_trees_of_the_source_code.
To generate your own call graphs, you can follow the simple steps below.
- Install Doxygen and graphviz (the two are usually in the package manager of classic Linux distributions).
- Run in the terminal: doxygen -g, which creates a Doxyfile that tells doxygen what you want it to do.
- Edit the Doxyfile. Two Doxyfile-type files have already been committed in the directory specfem3d/doc/Call_trees:
  - Doxyfile_truncated_call_tree will generate call graphs with a maximum of 3 or 4 levels of tree structure,
  - Doxyfile_complete_call_tree will generate call graphs with the complete tree structure.
  The important entries in the Doxyfile are:
  PROJECT_NAME
  OPTIMIZE_FOR_FORTRAN
  Set to YES
  EXTRACT_ALL
  Set to YES
  EXTRACT_PRIVATE
  Set to YES
  EXTRACT_STATIC
  Set to YES
  INPUT
  From the directory specfem3d/doc/Call_trees, it is "../../src/"
  FILE_PATTERNS
  In the SPECFEM case, it is *.f90* *.F90* *.c* *.cu* *.h*
  HAVE_DOT
  Set to YES
  CALL_GRAPH
  Set to YES
  CALLER_GRAPH
  Set to YES
  DOT_PATH
  The path where the dot program of graphviz is located (if it is not in your $PATH)
  RECURSIVE
  This tag can be used to specify whether or not subdirectories should be searched for input files as well. In the case of SPECFEM, set to YES.
  EXCLUDE
  Here, you can exclude: ../../src/specfem3D/older_not_maintained_partial_OpenMP_port ../../src/decompose_mesh/scotch ../../src/decompose_mesh/scotch_5.1.12b
  DOT_GRAPH_MAX_NODES
  Sets the maximum number of nodes that will be shown in the graph. If the number of nodes in a graph becomes larger than this value, doxygen will truncate the graph, which is visualized by representing a node as a red box. Minimum value: 0, maximum value: 10000, default value: 50.
  MAX_DOT_GRAPH_DEPTH
  Sets the maximum depth of the graphs generated by dot. A depth value of 3 means that only nodes reachable from the root by following a path via at most 3 edges will be shown. Using a depth of 0 means no depth restriction. Minimum value: 0, maximum value: 1000, default value: 0.
- Run: doxygen Doxyfile. HTML and LaTeX files are created by default in the html and latex subdirectories.
- To see the call trees, open the file html/index.html in your browser. You will find a lot of information about each subroutine of SPECFEM (not only call graphs), and you can click on every box / subroutine. It shows you the call graph and the caller graph of each subroutine: the subroutines called by the subroutine in question, and the subroutines that call it (the previous path), respectively. In the case of a truncated calling tree, a box with a red border indicates a node that has more arrows than are shown (in other words, the graph is truncated with respect to this node).
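The Doxyfile entries described in the steps above can also be collected in a small settings file, as sketched below. The helper name write_call_tree_settings and the output file name are hypothetical, and PROJECT_NAME, DOT_PATH, EXCLUDE and the graph size limits are left out because they depend on your setup. The commented command uses the documented Doxygen feature of reading a configuration from standard input, where later assignments override earlier ones.

```shell
#!/bin/sh
# Hypothetical helper: write the call-graph settings listed above to a file
# (Doxyfile.call_tree by default). INPUT assumes that doxygen is run from
# the directory specfem3d/doc/Call_trees.
write_call_tree_settings() {
    out="${1:-Doxyfile.call_tree}"
    cat > "$out" <<'EOF'
OPTIMIZE_FOR_FORTRAN = YES
EXTRACT_ALL          = YES
EXTRACT_PRIVATE      = YES
EXTRACT_STATIC       = YES
INPUT                = ../../src/
FILE_PATTERNS        = *.f90* *.F90* *.c* *.cu* *.h*
RECURSIVE            = YES
HAVE_DOT             = YES
CALL_GRAPH           = YES
CALLER_GRAPH         = YES
EOF
}

# Example usage (requires doxygen and a Doxyfile created by "doxygen -g"):
# write_call_tree_settings Doxyfile.call_tree
# ( cat Doxyfile Doxyfile.call_tree ) | doxygen -
```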
Finally, some useful links:
- short summary for the basic utilisation of Doxygen:
- to configure the diagrams:
- the complete alphabetical index of the tags in Doxyfile:
- more generally, the Doxygen manual:
Becoming a developer of the code, or making small modifications in the source code
If you want to develop new features in the code, and/or if you want to make small changes, improvements, or bug fixes, you are very welcome to contribute! You can do so safely, with no need to worry too much about breaking the package: CI tests based on BuildBot, Travis-CI and Jenkins are in place to check and validate all new contributions and changes. To access the development branch of the source code with read/write access, please visit this Web page: https://github.com/SPECFEM/specfem3d/wiki
References
Geuzaine, C., and J. F. Remacle. 2009. “Gmsh: A Three-Dimensional Finite Element Mesh Generator with Built-in Pre- and Post-Processing Facilities.” Int. J. Numer. Methods Eng. 79 (11): 1309–31.
Karypis, George, and Vipin Kumar. 1998a. “A Fast and High-Quality Multilevel Scheme for Partitioning Irregular Graphs.” SIAM Journal on Scientific Computing 20 (1): 359–92.
Karypis, George, and Vipin Kumar. 1998b. “A Parallel Algorithm for Multilevel Graph Partitioning and Sparse Matrix Ordering.” Journal of Parallel and Distributed Computing 48: 71–85.
Karypis, George, and Vipin Kumar. 1998c. “Multilevel k-Way Partitioning Scheme for Irregular Graphs.” Journal of Parallel and Distributed Computing 48 (1): 96–129.
Liu, Qing, Jeremy Logan, Yuan Tian, Hasan Abbasi, Norbert Podhorszki, Jong Youl Choi, Scott Klasky, et al. 2013. “Hello ADIOS: The Challenges and Lessons of Developing Leadership Class I/O Frameworks.” Concurrency and Computation: Practice and Experience. https://doi.org/10.1002/cpe.3125.
Oliveira, S. P., and G. Seriani. 2011. “Effect of Element Distortion on the Numerical Dispersion of Spectral-Element Methods.” Communications in Computational Physics 9 (4): 937–58.
Pellegrini, F., and J. Roman. 1996. “SCOTCH: A Software Package for Static Mapping by Dual Recursive Bipartitioning of Process and Architecture Graphs.” Lecture Notes in Computer Science 1067: 493–98.
This documentation has been automatically generated by pandoc based on the User manual (LaTeX version) in folder doc/USER_MANUAL/ (Sep 26, 2024)