Small molecule structure solution with SHELXD

SHELXD solves structures in two dual-space stages. The first stage - (sub)structure solution - is determined by the FIND instruction, plus optionally PATS or GROP, and the second stage - structure completion - by PLOP. When SHELXD is used for macromolecular substructure solution this second stage is omitted. The use of SHELXD to solve small (and not so small) molecule structures is described here.

The input to SHELXD consists of two files, name.ins and name.hkl, both of which can conveniently be created using the Bruker XPREP program and many GUIs. With the exception of a couple of instructions in the .ins file, the input is very similar to that for SHELXS. SHELXD is slower than SHELXS and requires a more realistic estimate of the cell contents, but is more effective than SHELXS for larger equal-atom structures and also for (pseudo-)merohedrally twinned structures. We will use the 1.1Å data for the 17-residue linear polypeptide PN1A (PDB-code 1PEN), kindly provided by Jenny Martin, as an example to illustrate the various approaches to solving larger small-molecule structures with SHELXD. A resolution of 1.1Å is borderline for small-molecule direct methods (0.9Å or better would have been desirable), but if the Friedel pairs had not been merged the structure would be a sitting duck for sulfur-SAD phasing, e.g. using SHELXC/D/E (it was collected in-house with CuKα radiation). The files pn1a.ins and pn1a.hkl generated by XPREP are available as a zip archive. The file pn1a.ins consists of:

TITL pn1a in P2(1)
CELL 1.54178 15.0000 19.8000 16.5000 90.000 113.400 90.000
ZERR 2.00 0.0030 0.0040 0.0033 0.000 0.030 0.000
LATT -1
SYMM -X, 0.5+Y, -Z
SFAC C H N O S
UNIT 130 254 38 44 8
FIND 84
PLOP 112 141 157
MIND 1.0 3
NTRY 1000
HKLF 4
END

This solves the structure about 60 times in one or two minutes using a 4-core desktop computer, so NTRY 100 would have been adequate and faster. Alternatively NTRY can be omitted, in which case the job runs indefinitely unless it is terminated by creating a file name.fin in the same folder. The correct solutions can be recognised by their high final correlation coefficients between the structure and the native data (FCC) of over 80%. A value of at least 70% usually indicates success given data to 1.0Å or better, but FCC is less decisive for lower resolution data. The solution is written in SHELX format to the file pn1a.res and in PDB format to pn1a.pdb. It is rather complete but does not attempt to differentiate between C, N and O. Note that small-molecule direct methods cannot determine the hand of the structure, so there is a 50% chance that the solution will turn out to be a mirror image of the true structure! In some cases (e.g. P41 or P43) correcting the hand will require inverting the space group as well as the atom coordinates. A good way to invert the structure is to rename the .res file as .ins and input it into SHELXE with the -i command line option. For a few space groups the inversion also requires an origin shift, but SHELXE takes this into account.
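Because SHELXD needs a realistic estimate of the cell contents, it can be worth checking the SFAC and UNIT lines of pn1a.ins by hand. The following Python sketch is only an illustration (the function name is hypothetical and not part of any SHELX program): it divides the UNIT counts by the number of symmetry operators, which is 2 for P2(1) (the identity plus the single SYMM line, primitive lattice, non-centrosymmetric).

```python
def atoms_per_asu(sfac, unit, n_symm_ops):
    """Return {element: atoms per asymmetric unit} from SFAC/UNIT contents."""
    return {el: n / n_symm_ops for el, n in zip(sfac, unit)}

sfac = "C H N O S".split()
unit = [130, 254, 38, 44, 8]

per_asu = atoms_per_asu(sfac, unit, n_symm_ops=2)
# Non-hydrogen atoms expected in the asymmetric unit
non_h = sum(n for el, n in per_asu.items() if el != "H")
print(per_asu["S"], non_h)
```

For these values the sketch gives 4 sulfur atoms and 110 non-hydrogen atoms per asymmetric unit, consistent with the two disulfide bonds discussed in the next section.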


Bootstrapping by solving the substructure first

An alternative approach is to use Patterson seeding in the FIND stage instead of random starting atoms. The idea is first to solve the substructure (in this case the four sulfur atoms that make up the two disulfide bonds in the asymmetric unit) and then to expand to the full structure with PLOP. To use the highest Patterson peaks as two-atom translation search fragments, the instructions between UNIT and HKLF in the pn1a.ins file would be replaced by:

PATS
PSMF -4
FIND 4 5
MIND -1.8 3
TEST 10 5
PLOP 50 80 120 160 160
NTRY 20

This uses a super-sharp √(E³F)-Patterson (PSMF -4), five dual-space cycles to find four heavy atoms and a minimum interatomic distance of 1.8Å (MIND -1.8 3). The negative sign for the first MIND parameter causes the PATFOM figure of merit to be calculated (it measures the agreement of the calculated interatomic vectors with the Patterson function). TEST sets the threshold for entering the PLOP stage and PLOP specifies the number of atoms to be assigned in each PLOP cycle. Alternatively the Patterson seeding may be performed with a randomly oriented fixed length vector, e.g. a disulfide bond with a length of 2.06Å:

PATS -2.06
PSMF -4
FIND 4 5
MIND -1.8 3
TEST 10 5
PLOP 50 80 120 160 160
NTRY 20
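To make the idea of a randomly oriented fixed length vector concrete, here is a minimal Python sketch that draws a vector of length 2.06Å with a uniformly distributed direction in Cartesian space. It only illustrates the concept; how SHELXD itself generates the orientations is not described here, and the function name is hypothetical.

```python
import math
import random

def random_vector(length, rng=random):
    """Return a Cartesian vector of the given length with a random direction."""
    while True:
        # Normalizing three Gaussian deviates gives a uniform direction on a sphere
        x, y, z = (rng.gauss(0.0, 1.0) for _ in range(3))
        norm = math.sqrt(x * x + y * y + z * z)
        if norm > 1e-12:  # guard against division by (almost) zero
            scale = length / norm
            return (x * scale, y * scale, z * scale)

ss_vector = random_vector(2.06)  # e.g. a disulfide bond
print(ss_vector)
```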

In principle it is also possible to search for a disulfide unit using GROP, though GROP is really intended for larger fragments (larger 3D fragments would, however, require a larger first GROP parameter):

GROP 99 1.6 1.2
FIND 4 5
MIND -1.8 3
TEST 10 5
PLOP 50 80 120 160 160
NTRY 20
ATOM      1  SG  CYS     1       0.000   0.000   0.000  1.00 10.00
ATOM      2  SG  CYS     2       0.000   0.000   2.060  1.00 10.00


Solving twinned structures with SHELXD

Under favorable conditions, SHELXD is also able to solve (pseudo-)merohedrally twinned structures by ab initio methods. Sometimes this succeeds without special action, but better results may usually be obtained by using the SHELXL instructions TWIN and BASF. The BASF parameter is held at a fixed value (default 0.5) throughout. The Bruker AXS program XPREP can be used to find the TWIN matrix and estimate the BASF parameter value. TWIN and BASF are only applied at the PLOP stage, and are ignored by PATS, GROP and FIND.


Alphabetical list of SHELXD instructions

All instructions in the .ins file commence with a word of four or fewer letters (which may be an atom name) followed by numbers and other information in free format, separated by one or more spaces. Upper and lower case input may be freely mixed. Defaults are given in square brackets; '#' indicates that the program will generate a suitable default value based on the rest of the available information. Continuation lines are flagged by '=' at the end of a line, the instruction being continued on the next line, which must start with at least one space. Other lines beginning with one or more spaces are treated as comments, so blank lines may be added to improve readability. All characters following '!' or '=' in an instruction are ignored, except after TITL or SYMM (for which continuation lines are not allowed).
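The syntax rules above can be sketched as a small Python reader. This is an illustrative reading of the rules as stated in this section, not the parser used by SHELXD; the function name parse_ins is hypothetical.

```python
def parse_ins(text):
    """Split .ins text into (KEYWORD, [fields]) tuples following the rules above."""
    instructions = []
    buffer = ""          # instruction text collected so far (for continuations)
    continuing = False   # True if the previous line ended with '='
    for raw in text.splitlines():
        if continuing:
            if not raw.startswith(" "):
                raise ValueError("continuation line must start with a space")
        elif not raw.strip() or raw.startswith(" "):
            continue     # blank lines and lines starting with a space are comments
        line = raw.strip()
        keyword = (buffer or line).split()[0].upper()
        continuing = False
        if keyword not in ("TITL", "SYMM"):
            # Everything after '!' or '=' is ignored; '=' also flags a continuation
            marks = [i for i in (line.find("!"), line.find("=")) if i >= 0]
            if marks:
                cut = min(marks)
                continuing = line[cut] == "="
                line = line[:cut].rstrip()
        buffer = f"{buffer} {line}".strip()
        if not continuing:
            fields = buffer.split()
            if fields:
                instructions.append((fields[0].upper(), fields[1:]))
            buffer = ""
    return instructions
```

Keywords are converted to upper case because upper and lower case input may be freely mixed.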

ATOM and HETATM
These instructions define PDB format atoms for use by GROP.

CCWT g[0.1]
All correlation coefficients (CC) are calculated using weights w = 1/[1+gσ²(E)]. If the σ(E) values read from the .hkl file are known to be very unreliable, it might be better to set g to zero. The correlation coefficients between Ec and Eo are calculated using the formula:

CC = 100[ΣwEoEc⋅Σw − ΣwEo⋅ΣwEc] / √{[ΣwEo²⋅Σw − (ΣwEo)²]⋅[ΣwEc²⋅Σw − (ΣwEc)²]}
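As an illustration, the weighted correlation coefficient can be evaluated in a few lines of Python. This is a direct transcription of the formula above, with weights w = 1/[1+gσ²(E)]; the function and argument names are hypothetical, and no claim is made that SHELXD organizes the calculation this way.

```python
import math

def weighted_cc(eo, ec, sigma, g=0.1):
    """Weighted correlation coefficient (in %) between Eo and Ec."""
    w = [1.0 / (1.0 + g * s * s) for s in sigma]
    sw = sum(w)
    swo = sum(wi * o for wi, o in zip(w, eo))
    swc = sum(wi * c for wi, c in zip(w, ec))
    swoc = sum(wi * o * c for wi, o, c in zip(w, eo, ec))
    swoo = sum(wi * o * o for wi, o in zip(w, eo))
    swcc = sum(wi * c * c for wi, c in zip(w, ec))
    numerator = swoc * sw - swo * swc
    denominator = math.sqrt((swoo * sw - swo ** 2) * (swcc * sw - swc ** 2))
    return 100.0 * numerator / denominator

# Setting g = 0 gives unit weights, as suggested for unreliable sigma(E) values
print(weighted_cc([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8], [0.1, 0.2, 0.3, 0.4], g=0.0))
```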

CELL λ a b c α β γ
Wavelength and unit-cell dimensions in Ångstroms and degrees.

DSUL nss[0]
Converts the most suitable nss peaks into disulfide units with S-S distances of 2.06Å. This is an improvement on treating these atoms as super-sulfurs. Each disulfide counts as a single peak for FIND, so MIND must be set to avoid both sulfurs being found in the initial peak search (e.g. MIND -3.5 3).

END
This is the last instruction in the rare cases when the .ins file is not terminated by the HKLF instruction.

ESEL Emin[#], dlim[1.0]
Minimum E and high-resolution limit for FIND. The E² values are normalized to 1 in resolution shells, then smoothed. Emin defaults to 1.2 for ab initio structure solution and to 1.5 for heavy atom location (the appropriate value is set as default depending on whether a PLOP instruction is present or not). It may be necessary to reduce Emin if the resolution is low.

FIND na[0], ncy[#]
Search for na atoms in ncy dual space cycles. If WEED is employed, na is the number of atoms remaining after the random omit procedure. ncy defaults to the larger of 20 and na or, if PATS is used, to the smaller of 3na and 20. If FIND is absent, PLOP expands directly from the starting atoms.
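The default for ncy described above can be written as a one-line rule. The Python function below is only a sketch of that rule (the name default_ncy is hypothetical):

```python
def default_ncy(na, pats=False):
    """Default number of dual-space cycles for FIND na, with or without PATS."""
    if pats:
        return min(3 * na, 20)   # Patterson seeding: smaller of 3na and 20
    return max(20, na)           # random starting atoms: larger of 20 and na

print(default_ncy(84))             # FIND 84 without PATS
print(default_ncy(4, pats=True))   # FIND 4 with PATS
```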

FRES res[3.0]
Resolution of all Fourier syntheses (including the PSMF but excluding the Patterson itself) in terms of the minimum ratio of the number of grid points along an axis to the maximum reflection index used along that axis.

GROP nor[99], Eg[1.5], dg[1.2], ntr[99]
The dual-space direct methods procedure is seeded by a 6D search for a small rigid group, which looks for a high value (not necessarily the global maximum) of ΣEc²(Eo²−1) for the reflections with E > Eg and d > dg, where d is the resolution in Ångstroms. For each of nor random orientations, the local maxima of this function are found starting from ntr random translations, and the atom positions corresponding to the orientation/translation combination that gives the highest value for this function are used to initiate the dual-space recycling (FIND). The search model is read from PDB-format ATOM or HETATM records in the .ins file. All other PDB records should be removed. The atomic number is deduced from the atom name applying PDB rules. A short piece of alpha-helix might be used for solving small proteins and a diglucose fragment might be suitable for cyclodextrins. In practice, a thorough 6-dimensional search (with a large nor value and Eg = 0) using GROP is rather slow, but when used in combination with TRIK, GROP is much faster because then only a 3-dimensional search is required.

HKLF m
m = 4 for F² in .hkl file, m = 3 for F (or FA or ΔF).

KEEP nh[0]
nh is the number of (heavy) atoms to retain as fixed atoms during PLOP expansion. This will normally only be used when expanding from starting atoms (PLOP without FIND, GROP or PATS).

LATT N[1]
Lattice type: 1=P, 2=I, 3=rhombohedral obverse on hexagonal axes, 4=F, 5=A, 6=B, 7=C. N must be made negative if the structure is non-centrosymmetric.
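The LATT convention can be decoded as follows. This Python sketch simply tabulates the codes listed above; the names are hypothetical.

```python
LATTICE_TYPES = {
    1: "P", 2: "I", 3: "R (obverse, hexagonal axes)",
    4: "F", 5: "A", 6: "B", 7: "C",
}

def decode_latt(n):
    """Return (lattice type, centrosymmetric?) for a LATT value N."""
    return LATTICE_TYPES[abs(n)], n > 0

print(decode_latt(-1))  # the value used in pn1a.ins
```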

MIND mdis[1.0], mdeq[-0.1]
|mdis| is the shortest distance allowed between atoms for PATS and FIND. If mdis is negative PATFOM is calculated, and the crossword table for the best PATFOM value so far is output to the .lst file. In this case the solution is passed on to the PLOP stage if either the CC is the best so far or the PATFOM is the best so far. mdeq is the minimum distance between symmetry equivalents for FIND (for PATS the |mdis| distance is used). The default value of -0.1 for mdeq allows heavy atom sites on special positions, which is normally recommended for small molecules or for heavy atom soaks for macromolecular phasing. For the location of selenium or sulfur in macromolecular phasing it is advisable to use a value of 3.0 to avoid spurious solutions such as uranium atom solutions that are incorrect but fit the tangent formula. For PLOP the PREJ instruction can be used to control whether peaks on special positions are selected.

MOVE dx[0] dy[0] dz[0] sign[1]
The coordinates of the atoms that follow this instruction are changed to:

x' = dx + sign⋅x
y' = dy + sign⋅y
z' = dz + sign⋅z
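As an illustration, the transformation can be applied to fractional coordinates in Python as follows (the function name is hypothetical). The parameters 1 1 1 -1 used in the example invert the coordinates through the point (0.5, 0.5, 0.5).

```python
def apply_move(xyz, dx=0.0, dy=0.0, dz=0.0, sign=1):
    """Apply x' = dx + sign*x (and likewise for y and z) to one atom."""
    x, y, z = xyz
    return (dx + sign * x, dy + sign * y, dz + sign * z)

# Equivalent of MOVE 1 1 1 -1 applied to a single atom
print(apply_move((0.25, 0.10, 0.40), 1, 1, 1, -1))
```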

NTPR ntpr[100]
Maximum number of (largest) TPR (triple phase relations) per reflection. If ntpr is negative, E is replaced by E/[1+σ²(E)] in the estimation of probabilities involved in the tangent formula and minimal function, as recommended by Giacovazzo, Siliqi & Garcia-Rodriguez (2001).

NTRY ntry[0]
Number of global tries if starting from random atoms, PATS or GROP. If ntry is zero or absent, the program runs until it is interrupted by creating a name.fin file in the current working directory (e.g. using the UNIX command touch).
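On systems without the touch command, the name.fin file can be created in other ways, for example with a few lines of Python. The helper below is a hypothetical convenience function, not part of the SHELX distribution.

```python
from pathlib import Path

def request_stop(job_name, folder="."):
    """Create <job_name>.fin in the working directory of the SHELXD job."""
    fin_file = Path(folder) / f"{job_name}.fin"
    fin_file.touch()  # an empty file is sufficient
    return fin_file
```

For the example in this document, request_stop("pn1a") would be called from the folder in which the job is running.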

PATS +np or -dis [100], npt[#], nf[5]
Calculates and stores Patterson. A random search is performed for np two-atom vectors corresponding to Patterson peaks or for a random orientation vector of length |dis|, using npt random translations, selecting the one with the best Patterson minimum function PMF (see PSMF). When selecting a vector from the list of unique Patterson peaks, special vectors are ignored and the highest vector is chosen from nf random selections. This favors the highest peaks but (if nf is not too large) also allows lower peaks a chance. For example, with the default np = 100 and nf = 5, the chance is about 41% that one of the first 10 vectors will be chosen and about 97% that one of the first 50 will be chosen. The default value of npt is 9999 for space groups with a floating origin and 99999 for other space groups. When the space group is P1, an extra atom is placed on the origin in addition to the two-atom vector employed for the translation search. In the special case when FIND 1 is specified with PATS, a single atom Patterson translation search is performed instead of using a vector. If the first parameter is negative, nf randomly oriented vectors of length |dis| are compared on the basis of the corresponding Patterson densities and the best used for the translation search. If PATS is used together with a second FIND parameter ncy greater than zero (or FIND followed by only one number) a full-symmetry Patterson superposition minimum function (i.e. a superposition based on the two peaks and all their symmetry equivalents) is used to locate the starting atoms for the first FIND cycle. PATS and GROP are mutually exclusive.
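The effect of nf on the vector selection can be estimated with a short calculation. The Python sketch below assumes that the nf selections are independent and uniformly distributed over the np vectors (an assumption; the exact sampling scheme is not specified here), in which case the chance that the chosen vector is among the first k is 1 − (1 − k/np)^nf. The function names are hypothetical.

```python
import random

def chance_in_top(k, n_vectors=100, nf=5):
    """Probability that the best of nf random picks is among the first k vectors."""
    return 1.0 - (1.0 - k / n_vectors) ** nf

def pick_vector_rank(n_vectors=100, nf=5, rng=random):
    """Simulate one selection: the highest-ranked (lowest index) of nf random picks."""
    return min(rng.randrange(n_vectors) for _ in range(nf))

for k in (10, 50):
    print(k, round(100.0 * chance_in_top(k), 1))
```

Increasing nf concentrates the selection on the strongest peaks, whereas nf = 1 gives every vector in the list the same chance.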

PLOP followed by up to 10 numbers
PLOP specifies the number of peaks to start with in each cycle of the peaklist optimization algorithm of Sheldrick & Gould (1995). Peaks are then eliminated one at a time until either the correlation coefficient cannot be increased any more or 50% of the peaks have been eliminated.

PREJ maxb[3], dsp[-0.01], mf[1]
PREJ controls the assignment of atoms in the PLOP stage. maxb is the maximum number of bonds to atoms or higher peaks; the peak is deleted if there are more. Peaks are also deleted if they are less than dsp Ångstroms from their equivalents. Atoms are not output to the final .res file if they are in a molecule that consists of fewer than mf atoms.

PSMF pres[3.0], psfac[0.34]
pres is the resolution of the Patterson in terms of the minimum ratio of the number of grid points along an axis to the maximum reflection index along that axis. If pres is negative a supersharp Patterson with coefficients √(E³F) is calculated (in which case a finer grid is advisable, i.e. PSMF -4), otherwise a normal F² Patterson is used. psfac is the fraction of the lowest values in the sorted list of Patterson heights that is summed to get the PMF.
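The role of psfac can be illustrated with a minimal Python sketch that sums the lowest fraction of a list of Patterson heights. How the number of values is rounded, and the use of at least one value, are assumptions made for this illustration; the function name is hypothetical.

```python
def patterson_minimum_function(heights, psfac=0.34):
    """Sum the lowest fraction psfac of the sorted Patterson heights."""
    ordered = sorted(heights)
    # Assumption: round to the nearest whole number of values, at least one
    n_used = max(1, round(psfac * len(ordered)))
    return sum(ordered[:n_used])

# Heights of the Patterson at the interatomic vectors of a trial solution
print(patterson_minimum_function([5.0, 1.0, 3.0, 9.0, 2.0, 7.0], psfac=0.5))
```

Because only the lowest values are summed, a single vector that falls in an empty region of the Patterson strongly penalizes a trial solution.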

REM
Followed by a comment on the same line. This comment is ignored by the program but is copied to the results file (.res).

SEED nrand[0]
SEED sets the random number seed so that exactly the same results can be obtained if the job is repeated on an identical computer with no changes in the other parameters. Each integer nrand defines a different sequence of random numbers. If nrand is omitted or zero, the seed is randomized so a new sequence is always generated.

SFAC elements
These element symbols define the order of scattering factors to be employed by the program. The first 94 elements of the periodic system are recognized. For some options, e.g. substructure solution, only the first element type is used.

SHEL dmax [infinity], dmin[0]
Resolution limits in Å for all calculations. Both limits must be specified but it does not matter which is given first.

SKIP min2[0.5]
During FIND, if the second peak height is less than min2 times the first, the first peak is rejected (before applying WEED to reject other peaks). This is sometimes useful to suppress uranium atom solutions. For large equal-atom structures in space group P1, where there is a danger of a uranium-atom pseudo-solution, it might be a good idea to specify SKIP 0.99 so that the first peak is ALWAYS rejected!
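The SKIP test amounts to a simple comparison of the two highest peaks, as in the following Python sketch (the function name is hypothetical):

```python
def reject_first_peak(first_height, second_height, min2=0.5):
    """True if the highest peak should be rejected according to SKIP min2."""
    return second_height < min2 * first_height

print(reject_first_peak(100.0, 30.0))        # one dominant peak
print(reject_first_peak(100.0, 60.0))        # two comparable peaks
print(reject_first_peak(100.0, 98.5, 0.99))  # SKIP 0.99
```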

SYMM symmetry operation
Symmetry operators, i.e. coordinates of the general positions as given in the International Tables, volume A. The operator X, Y, Z is always assumed, so may NOT be input. If the structure is centrosymmetric, the origin MUST lie on a center of symmetry. Lattice centering should be indicated by LATT, not SYMM. The symmetry operators may be specified using decimal or fractional numbers, e.g. 0.5-x,0.5+y,-z or Y-X,-X,Z+1/6; the three components are separated by commas. At least one SYMM instruction must be present unless the structure is triclinic.

TANG ftan[0.9], fex[0.4]
Fraction |ftan| of the ncy dual space (FIND) cycles are performed using the tangent formula, the rest using a Sim-weighted E-map. fex is the fraction of reflections with the largest Ecalc values to hold fixed when doing tangent expansion to find the remaining phases. WEED is only applied to the first |ftan|⋅ncy cycles. If ftan is negative, the occupancies are refined for the final (1−|ftan|)⋅ncy cycles. This is particularly useful for the anomalous sites in halide soak experiments, since these often have partial occupancies, but for other substructure problems it also provides a good check as to how many heavy atom sites are present. It is not recommended for normal ab initio applications of SHELXD because the algorithm employed uses a large amount of memory (in the interests of speed).
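The division of the FIND cycles between tangent formula and E-map cycles can be sketched in Python as follows. Rounding to the nearest whole cycle is an assumption made for this illustration, and the function name is hypothetical.

```python
def split_cycles(ncy, ftan=0.9):
    """Return (tangent formula cycles, Sim-weighted E-map cycles)."""
    n_tangent = round(abs(ftan) * ncy)   # assumption: nearest whole cycle
    n_emap = ncy - n_tangent
    refine_occupancies = ftan < 0        # applies to the E-map cycles only
    return n_tangent, n_emap, refine_occupancies

print(split_cycles(20))
print(split_cycles(20, ftan=-0.9))
```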

TEST CCmin[#], delCC[#]
After FIND, if CC is less than CCmin, FIND is repeated with new starting atoms. Otherwise PATFOM is calculated (if the first MIND parameter was negative) and the PLOP stage entered. CCmin is reduced by 0.1% each cycle until a solution passes this test. After PLOP has been entered at least once, subsequent attempts go on to PATFOM and/or PLOP if CC is within delCC of best CC value so far. If PATFOM is calculated, then only solutions with either the best initial CC (i.e. after FIND) so far or the best PATFOM so far go on to the PLOP stage. Whether or not PATFOM is calculated, if PLOP is absent the heavy atom sites with the best initial CC so far are written to the .res and .pdb files. If PLOP is specified, then the .res and .pdb files are written after the PLOP stage. Since these files are closed and reopened each time, they may be inspected with other programs without stopping the SHELXD job. The defaults for CCmin and delCC are 45 and 1 resp. for full ab initio solutions, and 10 and 5 resp. for substructure solution (i.e. when PLOP is absent).

TITL [ ]
Title of up to 76 characters, to appear at suitable places in the output.

TRIC (or TRIK)
Expand data to non-centrosymmetric triclinic for all calculations.

UNIT n1 n2 ...
Number of atoms of each type in the cell, in SFAC order.

WEED fr[0.3]
Randomly omit fraction fr of the atoms in the dual space recycling (except in the last cycle and the cycles for which no tangent refinement is performed - see TANG). WEED is not applied in the PLOP stage.

ZERR Z esd(a) esd(b) esd(c) esd(α) esd(β) esd(γ)
Z-value (number of formula units per cell) followed by the estimated errors in the unit-cell dimensions. This information is not actually required by SHELXD but is allowed for compatibility with SHELXL.