Macromolecular substructure solution with SHELXD
SHELXD was originally written for ab initio
small molecule direct methods,
but it turned out to be even more useful for the location of the
heavy atoms in the experimental phasing of macromolecules by the SAD,
SIRAS, MAD and similar methods. Whereas the ab initio
solution of small molecules requires two dual-space stages
(FIND and PLOP), only the first of these (FIND)
is required for substructure solution.
Running SHELXD for substructure solution
The input to SHELXD consists of two files name_fa.ins
and name_fa.hkl that usually will both
have been set up by SHELXC, possibly
under the control of a GUI.
name_fa.hkl contains FA and σ(FA), plus a
phase shift α that is needed for SHELXE but not SHELXD.
An example of an .ins file for a straightforward
selenomethionine MAD experiment follows:
TITL jia-fa in C222(1)
CELL 0.98000 96.00 120.00 166.13 90.00 90.00 90.00
ZERR 16.00 0.019 0.024 0.033 0.0 0.0 0.0
SYMM -X, -Y, 0.5+Z
SYMM -X, Y, 0.5-Z
SYMM X, -Y, -Z
MIND -3.5 3
Patterson seeding (PATS) and direct methods (FIND) are
used here to find approximately 8 selenium atoms (FIND 8)
that are at least 3.5Å (MIND -3.5 3) from each other.
The second MIND parameter is the shortest distance between
symmetry equivalents of the same atom, and may be used to
ensure that spurious sites on special
positions are ignored. This is recommended for locating selenium or sulfur
atoms, but for heavy atom soaks -0.1 (the default) is advisable
because there may be a heavy atom site on a symmetry axis. 1000 random
starts (NTRY 1000) are overkill here, and it is not
necessary to truncate these data with SHEL. LATT -7
specifies a non-centrosymmetric structure with a C-lattice: the
LATT and SYMM instructions together specify the
space group C222₁. TITL to UNIT should
always be in the above order, and HKLF should be the last
instruction, otherwise the order does not matter. There are a large
number of other options - see the alphabetical list below - but
most are rarely, if ever, needed for substructure solution.
SFAC provides an atom name for writing the sites to the
.res and .pdb output files.
For SAD data, for which the experimental phase information tends to
be weaker than for MAD, the critical parameters are the resolution d at
which to truncate the data (SHEL 999 d), the number of
atoms to search for (FIND), and the number of trials (NTRY).
d should be set to the resolution out to which there is significant
anomalous signal. A surprisingly good first approximation is to take the
highest resolution to which the crystals would have diffracted and add
0.5. In general an optimistic value of d can require many more trials
but may give more accurate heavy atom positions, and it may be worth
trying a range of values, e.g. in steps of 0.1Å.
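The rule of thumb above (diffraction limit plus 0.5, then a scan in steps of 0.1Å) can be written out as a small helper. This is only a sketch: the function name, the width of the scan and the example diffraction limit of 1.8Å are illustrative assumptions, not part of SHELXD or SHELXC.

```python
def candidate_cutoffs(native_limit, offset=0.5, span=0.2, step=0.1):
    """Return trial values of d (in Angstroms) centred on native_limit + offset."""
    centre = native_limit + offset
    n = round(span / step)
    # Round each value so that floating-point drift does not leak into the output.
    return [round(centre + k * step, 2) for k in range(-n, n + 1)]


# Hypothetical crystal diffracting to 1.8 A: try SHEL cutoffs around 2.3 A.
for d in candidate_cutoffs(1.8):
    print(f"SHEL 999 {d:.1f}")
```

Each printed line would go into its own copy of the .ins file, one SHELXD job per cutoff.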
d can be particularly
critical for sulfur-SAD phasing. If d>2.0Å the disulfide bonds
may not be fully resolved, but in the range 2.8>d>2.0 the DSUL
instruction may be used to fit S−S units to the density. This can
dramatically improve the final phase quality. If DSUL is used,
MIND -3.5 3 is recommended and disulfides
should be counted as single (super-sulfur) atoms for FIND. If the
resolution is better than 1.9Å and the sulfurs are being searched
for individually, MIND -1.5 3 should be used.
For the classical case of
rhombohedral insulin, which has two zinc ions on threefold axes and six
disulfides in the asymmetric unit, for CuKα radiation for which
f"(Zn) = 0.68 and f"(S) = 0.56, and significant
anomalous signal at 2.0Å, the instructions between SYMM
and HKLF could be:
SHEL 999 2.0
MIND -1.5 -0.1
In this example, if the anomalous data only extended to about 2.5Å,
it would be better to search for two zincs and six disulfides:
SHEL 999 2.5
MIND -3.5 -0.1
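The number of atoms to request with FIND in the two insulin scenarios follows from simple counting. The sketch below assumes that the zinc ions and the disulfide sulfurs are the only sites being searched for; the function name is hypothetical.

```python
def find_count(single_sites, disulfides, use_dsul):
    """Number of atoms to request with FIND."""
    # With DSUL each disulfide is one super-sulfur peak; otherwise
    # its two sulfur atoms are searched for individually.
    per_disulfide = 1 if use_dsul else 2
    return single_sites + per_disulfide * disulfides


print(find_count(2, 6, use_dsul=True))   # two zincs plus six super-sulfurs
print(find_count(2, 6, use_dsul=False))  # two zincs plus twelve sulfurs
```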
At the zinc absorption edge of 1.283Å, f"(Zn) = 3.89
is so much greater than for sulfur (f" = 0.39) that it might be
advisable just to search for the two zinc sites with FIND 2
and no DSUL, especially if the data are weak. A zinc MAD experiment
would give much better phases in this case, because f' is large (ca.
-7.5) and the low solvent content (36%) would be much more detrimental
for SAD than for MAD phasing, e.g. for density modification with SHELXE.
In difficult cases it may well be worth increasing the number of
trials; some large substructures have been solved only once in 50000 trials or more.
In such cases one should try to use a computer with as many CPUs as
possible, since SHELXD will take full advantage of them!
At the end of the dual-space direct methods SHELXD refines the site
occupancies assuming that all atoms are of the same type. This provides
an adequate approximation in the case where different anomalous scatterers
are present (e.g. Ca2+ and S in trypsin). For a SeMet MAD or sulfur-SAD
experiment there should be a clear drop in occupancy after the last site.
For halide soaks there is often a continuous descent to the noise level,
which is usually assumed to be at an occupancy of about 0.15 relative
to the site with the highest occupancy. This can be used to fine-tune the
number of sites, which should be within about 20% of the true value for
the best results.
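The two ways of judging the number of sites described above (a fixed relative occupancy cutoff, or a clear drop after the last site) can be sketched as follows. The function names are hypothetical, the occupancies in the example are made-up numbers, and all occupancies are assumed to be positive.

```python
def count_sites(occupancies, cutoff=0.15):
    """Count sites whose occupancy relative to the strongest site is >= cutoff."""
    if not occupancies:
        return 0
    top = max(occupancies)
    return sum(1 for occ in occupancies if occ / top >= cutoff)


def largest_drop(occupancies):
    """Return the number of sites that precede the largest relative fall in occupancy."""
    occ = sorted(occupancies, reverse=True)
    ratios = [occ[i + 1] / occ[i] for i in range(len(occ) - 1)]
    return ratios.index(min(ratios)) + 1


# Invented occupancy list with a sharp fall after the fourth site.
example = [1.0, 0.95, 0.9, 0.85, 0.3, 0.2, 0.1]
print(count_sites(example), largest_drop(example))
```

For a halide soak, where the descent is continuous, only the cutoff-based count is likely to be meaningful.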
Alphabetical list of SHELXD instructions
All instructions in the .ins file commence with a four (or fewer) letter
word (which may be an atom name) followed by numbers and other information in free
format, separated by one or more spaces. Upper and lower case input may be freely
mixed. Defaults are given in square brackets; '#' indicates that the program will
generate a suitable default value based on the rest of the available information.
Continuation lines are flagged by '=' at the end of a line, the instruction being
continued on the next line which must start with at least one space. Other lines
beginning with one or more spaces are treated as comments, so blank lines may be
added to improve readability. All characters following '!' or '=' in an instruction
are ignored, except after TITL or SYMM (for which continuation lines
are not allowed).
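The layout rules above can be illustrated with a small reader for .ins text. This is a sketch of one reading of those rules, not the parser used by SHELXD: the function name is hypothetical, and it assumes that any '=' outside TITL and SYMM flags a continuation and that the keyword is separated from its arguments by a space.

```python
def parse_ins(text):
    """Split .ins text into (KEYWORD, argument string) pairs."""
    instructions = []          # list of [KEYWORD, argument string]
    continuing = False
    for raw in text.splitlines():
        if continuing:
            # A continuation line must begin with at least one space.
            if not raw.startswith(" "):
                raise ValueError("continuation line must start with a space")
            body, target = raw, instructions[-1]
        elif not raw.strip() or raw[0] == " ":
            continue           # blank lines and indented lines are comments
        else:
            keyword, _, body = raw.partition(" ")
            target = [keyword.upper(), ""]   # case is not significant
            instructions.append(target)
        continuing = False
        if target[0] not in ("TITL", "SYMM"):
            # Text after '!' is ignored; '=' also requests a continuation.
            body = body.split("!", 1)[0]
            if "=" in body:
                body = body.split("=", 1)[0]
                continuing = True
        target[1] = (target[1] + " " + body.strip()).strip()
    return [tuple(item) for item in instructions]


sample = "find 8 ! eight sites\n   a comment\nPLOP 10 20 =\n  30 40\nHKLF 3\n"
print(parse_ins(sample))
```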
ATOM and HETATM
These instructions define PDB format atoms for use by GROP.
CCWT g
All correlation coefficients (CC) are calculated using weights
w = 1/[1+gσ²(E)]. If the σ(E) values
read from the .hkl file are known to be very unreliable, it might
be better to set g to zero. The correlation coefficients between
Ec and Eo are calculated using the formula:
CC = 100[ΣwEoEcΣw−ΣwEoΣwEc] /
{[ΣwEo²Σw−(ΣwEo)²]⋅[ΣwEc²Σw−(ΣwEc)²]}^½
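A weighted correlation coefficient of this kind can be computed as in the sketch below. It uses the weights and numerator quoted above; the denominator is taken to be the usual normalisation of a weighted correlation coefficient, which is an assumption here. The function name is hypothetical and g must be supplied by the caller.

```python
import math


def weighted_cc(e_obs, e_calc, sigma, g):
    """Weighted correlation coefficient (in percent) between Eo and Ec."""
    w = [1.0 / (1.0 + g * s * s) for s in sigma]      # w = 1/[1 + g*sigma^2(E)]
    sw = sum(w)
    swo = sum(wi * eo for wi, eo in zip(w, e_obs))
    swc = sum(wi * ec for wi, ec in zip(w, e_calc))
    swoc = sum(wi * eo * ec for wi, eo, ec in zip(w, e_obs, e_calc))
    swoo = sum(wi * eo * eo for wi, eo in zip(w, e_obs))
    swcc = sum(wi * ec * ec for wi, ec in zip(w, e_calc))
    numerator = swoc * sw - swo * swc
    denominator = math.sqrt((swoo * sw - swo ** 2) * (swcc * sw - swc ** 2))
    return 100.0 * numerator / denominator
```

Setting g to zero gives every reflection unit weight, which corresponds to the advice above for unreliable σ(E) values.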
CELL λ a b c α β γ
Wavelength and unit-cell dimensions in Ångstroms and degrees.
DSUL nss
Converts the most suitable nss peaks into disulfide units
with S-S distances of 2.06Å. This is an improvement on treating
these atoms as super-sulfurs. Each disulfide counts as a single
peak for FIND, so MIND must be set to avoid both
sulfurs being found in the initial peaksearch (e.g.
MIND -3.5 3).
END
This is the last instruction in the rare cases when the
.ins file is not terminated by the HKLF instruction.
ESEL Emin[#], dlim[1.0]
Minimum E and high-resolution limit for FIND. The E²
values are normalized to 1 in resolution shells, then smoothed.
Emin defaults to 1.2 for ab initio structure solution
and to 1.5 for heavy atom location (the appropriate value is set as
default depending on whether a PLOP instruction is present
or not). It may be necessary to reduce Emin if the resolution
FIND na, ncy[#]
Search for na atoms in ncy dual space cycles. If
WEED is employed, na is the number of atoms remaining
after the random omit procedure. ncy defaults to the larger
of 20 and na or, if PATS is used, to the smaller of
3na and 20. If FIND is absent, PLOP expands
directly from the starting atoms.
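The default for ncy stated above can be written as a two-line function. The function name and the boolean flag are illustrative only.

```python
def default_ncy(na, pats=False):
    """Default number of dual-space cycles for FIND na."""
    if pats:
        return min(3 * na, 20)   # with PATS: the smaller of 3na and 20
    return max(20, na)           # otherwise: the larger of 20 and na


print(default_ncy(8), default_ncy(8, pats=True), default_ncy(4, pats=True))
```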
FRES res
Resolution of all Fourier syntheses (including the PSMF but excluding
the Patterson itself) in terms of the minimum ratio of the number of
grid points along an axis to the maximum reflection index used along
that axis.
GROP nor, Eg[1.5], dg[1.2], ntr
The dual-space direct methods are seeded by a 6D search for a small rigid
group to find a high value (not necessarily the global maximum) of
ΣEc²(Eo²−1) for the reflections with
E > Eg and d > dg, where d is the
resolution in Ångstroms. For each of nor random orientations,
the local maxima of this function are found starting from ntr
random translations, and the atom positions corresponding to the
orientation/translation combination that gives the highest value for this
function are used to initiate the dual-space recycling (FIND).
The search model is read from PDB-format ATOM or HETATM records in the
.ins file. All other PDB records should be removed. The atomic
number is deduced from the atom name applying PDB rules. A short piece of
alpha-helix might be used for solving small proteins and a diglucose
fragment might be suitable for cyclodextrins. In practice, a thorough
6-dimensional search (with a large nor value and
Eg = 0) using GROP is rather slow, but when
used in combination with TRIK, GROP is much faster because
then only a 3-dimensional search is required.
HKLF m
m = 4 for F² in .hkl file,
m = 3 for F (or FA or ΔF).
KEEP nh
nh is the number of (heavy) atoms to retain as fixed atoms
during PLOP expansion. This will normally only be used when
expanding from starting atoms (PLOP without FIND,
GROP or PATS).
LATT N
Lattice type: 1=P, 2=I, 3=rhombohedral obverse on hexagonal axes, 4=F,
5=A, 6=B, 7=C. N must be made negative if the structure is
non-centrosymmetric.
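The lattice codes can be decoded with a simple lookup. The sketch below assumes, in line with the LATT -7 example earlier in this page, that a negative value marks a non-centrosymmetric structure; the names are hypothetical.

```python
LATTICE_TYPES = {
    1: "P",
    2: "I",
    3: "R (obverse, hexagonal axes)",
    4: "F",
    5: "A",
    6: "B",
    7: "C",
}


def decode_latt(n):
    """Return (lattice type, centrosymmetric?) for a LATT value."""
    if abs(n) not in LATTICE_TYPES:
        raise ValueError(f"unknown lattice code: {n}")
    # Assumption: positive means centrosymmetric, negative non-centrosymmetric.
    return LATTICE_TYPES[abs(n)], n > 0


print(decode_latt(-7))
```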
MIND mdis[1.0], mdeq[-0.1]
|mdis| is the shortest distance allowed between atoms for
PATS and FIND. If mdis is negative PATFOM is
calculated, and the crossword table for the best PATFOM value so far
is output to the .lst file. In this case the solution is
passed on to the PLOP stage if either the CC is the
best so far or the PATFOM is the best so far. mdeq is the
minimum distance between symmetry equivalents for FIND (for
PATS the |mdis| distance is used). The default value of
-0.1 for mdeq allows heavy atom sites on special positions,
which is normally recommended for small molecules or for heavy atom
soaks for macromolecular phasing.
For the location of selenium or sulfur in macromolecular phasing
it is advisable to use a value of 3.0 to avoid spurious solutions
such as uranium atom solutions that are incorrect but fit
the tangent formula. For PLOP the PREJ instruction can
be used to control whether peaks on special positions are selected.
MOVE dx dy dz sign
The coordinates of the atoms that follow this instruction
are changed to:
x' = dx + sign⋅x
y' = dy + sign⋅y
z' = dz + sign⋅z
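As a worked example, the transformation can be applied to fractional coordinates as follows. The z component is assumed to follow the same pattern as x and y, and the function name is illustrative.

```python
def move(xyz, dx, dy, dz, sign):
    """Apply x' = dx + sign*x (and likewise for y and z) to one atom."""
    x, y, z = xyz
    return (dx + sign * x, dy + sign * y, dz + sign * z)


# MOVE 1 1 1 -1 inverts the atom through the point (0.5, 0.5, 0.5).
print(move((0.25, 0.5, 0.125), 1, 1, 1, -1))
```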
NTPR ntpr
Maximum number of (largest) TPR (triple phase relations) per reflection.
If ntpr is negative, E is replaced by E/[1+σ²(E)] in
the estimation of probabilities involved in the tangent formula and
minimal function, as recommended by Giacovazzo, Siliqi & Garcia-Rodriguez
NTRY ntry
Number of global tries if starting from random atoms, PATS or
GROP. If ntry is zero or absent, the program runs until
it is interrupted by creating a name.fin file in the current
working directory (e.g. using the UNIX command touch).
PATS +np or -dis , npt[#], nf
Calculates and stores Patterson. A random search is performed for
np two-atom vectors corresponding to Patterson peaks or for a
random orientation vector of length |dis|, using npt
random translations, selecting the one with the best Patterson minimum
function PMF (see PSMF). When selecting a vector from the list
of unique Patterson peaks, special vectors are ignored and the highest
vector is chosen from nf random selections. This favors the
highest peaks but (if nf is not too large) also allows lower
peaks a chance. For example, with the default np = 100
and nf = 5, the chance is 39.5% that one of the first
10 vectors will be chosen and 91.9% that one of the first 50 will be
chosen. The default value of npt is 9999 for space groups with
a floating origin and 99999 for other space groups. When the space
group is P1, an extra atom is placed on the origin in addition to the
two-atom vector employed for the translation search. In the special
case when FIND 1 is specified with PATS, a single
atom Patterson translation search is performed instead of using a
vector. If the first parameter is negative, nf randomly
oriented vectors of length |dis| are compared on the basis of
the corresponding Patterson densities and the best used for the
translation search. If PATS is used together with a second
FIND parameter ncy greater than zero (or FIND
followed by only one number) a full-symmetry Patterson
superposition minimum function (i.e. a superposition based on
the two peaks and all their symmetry equivalents) is used to locate
the starting atoms for the first FIND cycle. PATS and
GROP are mutually exclusive.
PLOP followed by up to 10 numbers
PLOP specifies the number of peaks to start with in each
cycle of the peaklist optimization algorithm of Sheldrick & Gould
(1995). Peaks are then eliminated one at a time until either the
correlation coefficient cannot be increased any more or 50% of the
peaks have been eliminated.
PREJ maxb, dsp[-0.01], mf
PREJ controls the assignment of atoms in the PLOP
stage. maxb is the maximum number of bonds to atoms or higher
peaks; the peak is deleted if there are more. Peaks are also deleted
if they are less than dsp Ångstroms from their
equivalents. Atoms are not output to the final .res file if
they are in a molecule that consists of less than mf atoms.
PSMF pres[3.0], psfac[0.34]
pres is the resolution of the Patterson in terms of minimum
ratio of the number of grid points along an axis and the maximum
reflection index along that axis. If pres is negative a
supersharp Patterson with coefficients √(E³F)
is calculated (in which case a finer grid is advisable, i.e.
PSMF -4), otherwise a normal F² Patterson is used.
psfac is the fraction of the lowest values in the sorted list
of Patterson heights that is summed to get the PMF.
REM
Followed by a comment on the same line. This comment is ignored by the
program but is copied to the results file (.res).
SEED nrand
SEED sets the random number seed so that exactly the same
results can be obtained if the job is repeated on an identical
computer with no changes in the other parameters. Each integer
nrand defines a different sequence of random numbers. If
nrand is omitted or zero, the seed is randomized so a new
sequence is always generated.
SFAC elements
These element symbols define the order of scattering factors to be
employed by the program. The first 94 elements of the periodic system
are recognized. For some options, e.g. substructure solution, only
the first element type is used.
SHEL dmax [infinity], dmin
Resolution limits in Å for all calculations. Both limits must be
specified but it does not matter which is given first.
SKIP min2
During FIND, if the second peak height is less than
min2 times the first, the first peak is rejected (before
applying WEED to reject other peaks). This is sometimes useful
to suppress uranium atom solutions. For large equal-atom
structures in space group P1, where there is a danger of a
uranium-atom pseudo-solution, it might be a good idea to specify
SKIP 0.99 so that the first peak is ALWAYS rejected!
SYMM symmetry operation
Symmetry operators, i.e. coordinates of the general positions as given in the
International Tables, volume A. The operator X, Y, Z is
always assumed, so may NOT be input. If the structure is centrosymmetric, the
origin MUST lie on a center of symmetry. Lattice centering should be indicated
by LATT, not SYMM. The symmetry operators may be specified using
decimal or fractional numbers, e.g. 0.5-x,0.5+y,-z or
Y-X,-X,Z+1/6; the three components are separated by
commas. At least one SYMM instruction must be present unless the
structure is triclinic.
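To make the operator syntax concrete, the sketch below reads a SYMM string into integer coefficients plus a translation and applies it to a point. It is an illustration only, not the reader used by SHELXD: the function names are hypothetical, and only terms of the form ±X, ±Y, ±Z and a decimal or fractional shift are handled.

```python
import re
from fractions import Fraction


def parse_symm(op):
    """Return three (cx, cy, cz, shift) rows for an operator such as '0.5-x,0.5+y,-z'."""
    parts = op.replace(" ", "").upper().split(",")
    if len(parts) != 3:
        raise ValueError("expected three comma-separated components")
    rows = []
    for part in parts:
        coeff = {"X": 0, "Y": 0, "Z": 0}
        shift = Fraction(0)
        # Split into signed terms, e.g. 'Y-X' -> ['Y', '-X'].
        for term in re.findall(r"[+-]?[^+-]+", part):
            sign = -1 if term.startswith("-") else 1
            body = term.lstrip("+-")
            if body in coeff:
                coeff[body] += sign
            else:
                shift += sign * Fraction(body)   # accepts '0.5' and '1/6'
        rows.append((coeff["X"], coeff["Y"], coeff["Z"], shift))
    return rows


def apply_symm(rows, xyz):
    """Apply parsed operator rows to fractional coordinates."""
    x, y, z = xyz
    return tuple(cx * x + cy * y + cz * z + t for cx, cy, cz, t in rows)


print(parse_symm("Y-X,-X,Z+1/6"))
```

Using Fraction keeps translations such as 1/6 exact instead of rounding them to decimals.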
TANG ftan[0.9], fex[0.4]
Fraction |ftan| of the ncy dual space (FIND) cycles are
performed using the tangent formula, the rest using a Sim-weighted E-map.
fex is the fraction of reflections with the largest Ecalc values to
hold fixed when doing tangent expansion to find the remaining phases.
WEED is only applied to the first |ftan|⋅ncy
cycles. If ftan is negative, the occupancies are refined for
the final (1−|ftan|)⋅ncy cycles. This is
particularly useful for the anomalous sites in halide soak experiments,
since these often have partial occupancies, but for other substructure
problems it also provides a good check as to how many heavy atom sites
are present. It is not recommended for normal ab initio
applications of SHELXD because the algorithm employed uses a large
amount of memory (in the interests of speed).
TEST CCmin[#], delCC[#]
After FIND, if CC is less than CCmin, FIND is
repeated with new starting atoms. Otherwise PATFOM is calculated
(if the first MIND parameter was negative) and the PLOP
stage entered. CCmin is reduced by 0.1% each cycle until a
solution passes this test. After PLOP has been entered at
least once, subsequent attempts go on to PATFOM and/or PLOP if CC
is within delCC of best CC value so far. If PATFOM is
calculated, then only solutions with either the best initial CC
(i.e. after FIND) so far or the best PATFOM so far go on
to the PLOP stage. Whether or not PATFOM is calculated, if
PLOP is absent the heavy atom sites with the best initial
CC so far are written to the .res and .pdb files. If
PLOP is specified, then the .res and .pdb files
are written after the PLOP stage. Since these files are closed
and reopened each time, they may be inspected with other programs
without stopping the SHELXD job. The defaults for CCmin and
delCC are 45 and 1 resp. for full ab initio solutions,
and 10 and 5 resp. for substructure solution (i.e. when PLOP
is absent).
TITL [ ]
Title of up to 76 characters, to appear at suitable places in the output.
TRIC (or TRIK)
Expand data to non-centrosymmetric triclinic for all calculations.
UNIT n1 n2 ...
Number of atoms of each type in the cell, in SFAC order.
WEED fr
Randomly omit fraction fr of the atoms in the dual space
recycling (except in the last cycle and the cycles for which no tangent
refinement is performed - see TANG). WEED is not applied in
the PLOP stage.
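The random omit step can be pictured with the following sketch. It only illustrates the idea of discarding a random fraction fr of the current atoms: the function name is hypothetical, and the rounding of the number of omitted atoms is an assumption rather than a description of what SHELXD does internally.

```python
import random


def weed(atoms, fr, rng=random):
    """Return the atoms that survive omitting a random fraction fr."""
    n_omit = round(fr * len(atoms))          # rounding rule is an assumption
    omitted = set(rng.sample(range(len(atoms)), n_omit))
    return [atom for i, atom in enumerate(atoms) if i not in omitted]


# Omit 30% of ten dummy atoms with a fixed seed for repeatability.
print(weed(list(range(10)), 0.3, random.Random(1)))
```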
ZERR Z esd(a) esd(b) esd(c) esd(α) esd(β) esd(γ)
Z-value (number of formula units per cell) followed by the estimated errors in
the unit-cell dimensions. This information is not actually required by SHELXD
but is allowed for compatibility with SHELXL.