SHELXS - classical direct methods for small molecules

SHELXS is a program of some antiquity for solving small (up to about 100 unique non-hydrogen atoms) structures by direct methods. It is very fast because it was written in the 1980s, when computers were slow. SHELXS is based on the classical tangent formula of Karle and Hauptman, but uses phase annealing and includes information from the weak reflections via the negative quartets. For further details see the TREF, INIT and PHAN instructions in the list below. SHELXS may also be used for automated Patterson interpretation and structure expansion from a partial structure. The much more modern SHELXT outperforms SHELXS for routine small-molecule structure solution and reads the same input files.


Running SHELXS direct methods

SHELXS requires two input files: an instruction file name.ins and a reflection file name.hkl, and outputs the 'best' solution to name.res. A summary is output to the console and a full listing to name.lst. SHELXS is started with the command line:

shelxs name

A SHELXS job that is already running may be terminated gracefully by creating a file name.fin in the directory in which SHELXS is running. Many GUIs can generate the .ins file. The following example ylid.ins was set up, together with ylid.hkl, by the Bruker AXS program XPREP:

TITL ylid in P2(1)2(1)2(1)
CELL 0.71073 5.9645 9.0412 18.3971 90.000 90.000 90.000
ZERR 4.00 0.0004 0.0005 0.0022 0.000 0.000 0.000
LATT -1
SYMM 0.5-X, -Y, 0.5+Z
SYMM -X, 0.5+Y, 0.5-Z
SYMM 0.5+X, 0.5-Y, -Z
SFAC C H O S
UNIT 44 40 8 4
TREF
HKLF 4
END
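
As a rough illustration of the workflow described above, the following Python sketch starts a job and requests a graceful stop by creating name.fin. It assumes that a shelxs executable is on the PATH; the helper names run_shelxs and request_stop are hypothetical and not part of SHELX.

```python
import subprocess
from pathlib import Path


def run_shelxs(name, workdir="."):
    """Start 'shelxs name' in workdir (assumes a shelxs executable on the PATH)."""
    # name.ins and name.hkl must already exist in workdir.
    return subprocess.Popen(["shelxs", name], cwd=workdir)


def request_stop(name, workdir="."):
    """Ask a running job to terminate gracefully by creating name.fin."""
    fin = Path(workdir) / f"{name}.fin"
    fin.touch()
    return fin
```

For example, run_shelxs("ylid") followed later by request_stop("ylid") would correspond to the ylid job shown above.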

The resulting ylid.res file will require editing, to define the atom names and scattering factor numbers and delete noise peaks, before it can be renamed to ylid.ins for SHELXL refinement using the same reflection data file ylid.hkl. Many possible instructions may be included in the .ins file (see list below), but the only options that are really worth trying if the above fails to solve the structure are:

1. Increase the number of tries from the default TREF 100 to say TREF 10000.

2. If more than one solution has good Rα and NQUAL values, it is possible that the structure has been solved but the program has chosen the wrong solution. The list of +/- signs (the seminvariant phases) can then be examined to see which solutions are likely to be equivalent or not. Other solutions can then be generated with TREF -n, where n is the solution code number (actually the random number seed). Because of the influence of small rounding errors the TREF -n job should be run on the same computer and no other input parameters may be changed.

3. If heavy atoms are present, it might be better to try to locate them first by replacing TREF with PATT 10 (see below).
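
The three options above all amount to changing a single line of the .ins file. A minimal Python sketch of such an edit is given here; the function name set_solution_method is hypothetical, and the sketch assumes that the file contains exactly one TREF or PATT line.

```python
def set_solution_method(ins_text, new_instruction):
    """Replace the TREF (or PATT) line of a .ins file by new_instruction."""
    out = []
    for line in ins_text.splitlines():
        # Lines starting with a space are comments or continuations: leave them.
        is_instruction = bool(line.strip()) and not line.startswith(" ")
        keyword = line.split()[0].upper() if is_instruction else ""
        out.append(new_instruction if keyword in ("TREF", "PATT") else line)
    return "\n".join(out) + "\n"
```

Calling set_solution_method(text, "TREF 10000"), set_solution_method(text, "TREF -n") with n replaced by the solution code number, or set_solution_method(text, "PATT 10") gives the input for options 1, 2 and 3 respectively.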


Patterson interpretation and partial structure expansion with SHELXS

For Patterson interpretation, TREF is replaced by e.g. PATT 10. To expand a partial structure using the tangent formula, TREF should be replaced by TEXP followed by atoms in SHELX format. In all cases, the .ins file begins with TITL, CELL, ZERR, LATT, SYMM, SFAC and UNIT in that order, and the last instruction is usually HKLF 4.

The PSEE instruction may be used to prepare files for the small molecule Patterson search program PATSEE written by Ernst Egert. The FRAG, SPIN and MOVE instructions are intended for the input of PATSEE solutions into SHELXS to complete the structure.


Patterson interpretation algorithm

The algorithm used to interpret the Patterson to find the heavier atoms is as follows:

1. One peak is selected from the sharpened Patterson (or input by means of a VECT instruction) and used as a superposition vector. This peak must correspond to a correct heavy-atom to heavy-atom vector otherwise the method will fail. The entire procedure may be repeated any number of times with different superposition vectors by specifying PATT nv, with |nv| > 1, or by including more than one VECT instruction in the same job.

2. The Patterson function is calculated twice, displaced from the origin by +U and -U, where 2U is the superposition vector. At each grid point the lower of the two values is taken, and the resulting superposition minimum function is interpolated to find the peak positions. This is a much cleaner map than the original Patterson and contains only 2N (or 4N etc. if the superposition vector was multiple) peaks rather than N². The superposition map should ideally consist of one image of the structure and its inverse; it has an effective space group of P-1 (or C-1 for a centered lattice etc.).

3. Possible origin shifts are found which place one of the images correctly with respect to the cell origin, i.e. most of the symmetry equivalents can be found in the peak-list. The SYMFOM figure of merit (normalized so that the largest value for a given superposition vector is 99.9) indicates how well the space group symmetry is satisfied for this image.

4. For each acceptable origin shift, atomic numbers are assigned to the potential atoms based on average peak heights, and a crossword table is generated. This gives the minimum distance and Patterson minimum function for each possible pair of unique atoms, taking symmetry into account. This table should be interpreted by hand to find a subset of the atoms making chemically sensible minimum interatomic distances linked by consistently large Patterson minimum function values. The PATFOM figure of merit measures the internal consistency of these minimum function values and is also normalised to a maximum of 99.9 for a given superposition vector. The Patterson values are recalculated from the original Fo data, not from the peak-list. For high symmetry space groups the minimum function is calculated as an average of the two (or more) smallest Patterson densities.
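
The superposition step (step 2) can be illustrated with a one-dimensional toy model in Python. This is only a sketch under stated assumptions, not the SHELXS implementation: the atoms are unit-weight points on a periodic grid, the atom positions are invented, and the two Patterson copies are displaced by plus and minus half of the chosen interatomic vector so that their relative displacement equals that vector.

```python
def patterson_1d(positions, n):
    """Point-atom Patterson on a periodic 1D grid of n points (unit weights)."""
    p = [0] * n
    for xi in positions:
        for xj in positions:
            p[(xi - xj) % n] += 1
    return p


def superposition_minimum(p, vector):
    """Lower of two Patterson copies displaced by +vector/2 and -vector/2."""
    if vector % 2:
        raise ValueError("this toy example needs an even vector (in grid units)")
    n, u = len(p), vector // 2
    return [min(p[(k + u) % n], p[(k - u) % n]) for k in range(n)]


atoms = [3, 11, 20]                  # hypothetical positions in a 40-point cell
patterson = patterson_1d(atoms, 40)
smf = superposition_minimum(patterson, 11 - 3)   # vector between two real atoms
peaks = [k for k, value in enumerate(smf) if value > 0]
```

Here the Patterson has seven non-zero grid points, whereas the superposition minimum function has only four (4, 13, 27 and 36). These are the atom positions relative to the midpoint of the two chosen atoms (grid point 7) together with their inverses; the two atoms defining the vector coincide with their own inverse images.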


Alphabetical list of instructions in the SHELXS .ins file

All instructions in the .ins file commence with a four (or fewer) letter word (which may be an atom name) followed by numbers and other information in free format, separated by one or more spaces. Upper and lower case input may be freely mixed. Defaults are given in square brackets; '#' indicates that the program will generate a suitable default value based on the rest of the available information. Continuation lines are flagged by '=' at the end of a line, the instruction being continued on the next line which must start with at least one space. Other lines beginning with one or more spaces are treated as comments, so blank lines may be added to improve readability. All characters following '!' or '=' in an instruction are ignored, except after TITL or SYMM (for which continuation lines are not allowed). AFIX, RESI and PART instructions may be present in the .ins file for compatibility with SHELXL, but will be ignored.
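
The syntax rules above can be summarised in a short Python sketch. It is an illustration only: the function name parse_ins is hypothetical, the arguments are returned as uninterpreted strings, and the behaviour of SHELXS in edge cases not covered by the rules above is not modelled.

```python
def parse_ins(text):
    """Split .ins text into (KEYWORD, [arguments]) pairs."""
    instructions = []
    pending = None                      # instruction awaiting a continuation line
    for raw in text.splitlines():
        if pending is None:
            if not raw.strip() or raw[0] == " ":
                continue                # blank line or comment
            body = raw.strip()
        else:
            if not raw.startswith(" "):
                raise ValueError("continuation line must start with a space")
            body = pending + " " + raw.strip()
            pending = None
        keyword = body.split()[0].upper()
        if keyword not in ("TITL", "SYMM"):
            body = body.split("!", 1)[0]        # drop trailing comment
            if "=" in body:                     # continued on the next line
                pending = body.split("=", 1)[0].rstrip()
                continue
        instructions.append((keyword, body.split()[1:]))
    return instructions
```

For instance, the two lines 'sfac C H =' and '  O S' are returned as the single instruction ('SFAC', ['C', 'H', 'O', 'S']).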

CELL λ a b c α β γ
Wavelength and unit-cell dimensions in Angstroms and degrees.

EGEN d(min) d(max)
All missing reflections in the resolution range d(min) to d(max) Å (the order of d(min) and d(max) is unimportant) are generated on a statistical basis, assuming that they were skipped during the data collection because a prescan indicated that they were weak (only relevant for a 1-D detector!). These reflections will then be flagged as 'unobserved', but improve the estimation of the remaining E-values and enable an increased number of negative quartets to be identified. d(min) should be safely inside the resolution limit of the data and d(max) should be set so that there is no danger of regenerating strong reflections (as weak) which were cut off by the beam stop etc.

END
This is the last instruction in the rare cases when the .ins file is not terminated by the HKLF instruction.

ESEL Emin[1.2] Emax[5] dU[.005] renorm[.7] axis[0]
Emin sets the minimum E-value for the list of largest E-values that the program normally retains in memory; it should be set so as to give more than enough reflections for TREF etc. It is also the threshold used for tangent expansion and 'peak-list optimisation'. It is advisable to reduce Emin to about 1.0 for triclinic structures and pseudosymmetry problems. If Emin is negative, acentric triclinic data are generated for use in all calculations. The other parameters control the normalisation of the E-values:

new(E) = old(E)⋅exp[8π²⋅dU⋅(sinθ/λ)²] / [old(E)⁻⁴ + Emax⁻⁴]^0.25

renorm is a factor to control the parity group renormalisation; 0.0 implies no renormalisation, 1.0 sets full renormalisation, i.e. the mean value of E² becomes unity for each parity group. If axis is 1, 2 or 3, an additional similar renormalisation is applied for groups defined by the absolute value of the h, k or l index respectively.

FMAP code[#] axis[#] nl[#]
The unique unit of the cell for performing the Fourier calculation is set up automatically unless specified by the user using FMAP and GRID. The program chooses a 53 x 53 x nl or 103 x 103 x nl grid depending on the resolution of the data, provided sufficient memory is available in the latter case. code = 1 (F²-Patterson), 3 (Patterson with coefficients input using HKLF 7; negative coefficients are allowed), 4 (E-map without peak-list optimisation, e.g. because the peaks correspond to unequal atoms), 5 (Fourier with A and B coefficients input using HKLF 3), 6 (EF Patterson), code > 6 (E-map followed by [code-6] cycles of peak-list optimization). Note that the peak-list optimization assigns very strong peaks to heavy atoms (if specified by SFAC) and all remaining peaks to scattering factor type 1, so for many structures this should be specified as carbon on a SFAC instruction. FMAP 4 may be used with atoms but without TEXP etc. for an E-map based on calculated phases.

FRAG code[#] a[1] b[1] c[1] α[90] β[90] γ[90]
FRAG enables the PATSEE search fragment to be read in using the original cell or orthogonal coordinates. This instruction will usually be preceded by SPIN and MOVE commands to give the rotation angles and translation (same conventions as for PATSEE), and followed by a list of atoms. FRAG, SPIN and MOVE instructions remain in force until superseded by another instruction of the same type. code is ignored by SHELXS but is included for compatibility with PATSEE and SHELXL (where it is used for different purposes).

GRID sl[#] sa[#] sd[#] dl[#] da[#] dd[#]
Fourier grid, when not set automatically. Starting points and increments are multiplied by 100. s means starting value, d increment, l is the direction perpendicular to the layers, a is across the paper from left to right, and d is down the paper from top to bottom. Note that the grid is 53 x 53 x nl points and that sl and dl need not be integral. The 103 x 103 x nl grid is only available when it is set automatically by the program (see above).

HKLF n[0] s[1] r11...r33[1 0 0 0 1 0 0 0 1] wt[1] m[0]
Before running SHELXS, a reflection data file name.hkl must usually be prepared. The HKLF command tells the program which format has been chosen for this file, and allows the indices to be reorientated using a 3x3 matrix r11..r33 (which should have a positive determinant). n is negative if reflection data follow, otherwise they are read from the .hkl file. The data are read in fixed format (3I4,2F8.2) (except for n = 1) subject to FORTRAN conventions. The data are terminated by a record with h, k and l all zero (except n = 1, which contains a terminator and checksum). If batch numbers, direction cosines or wavelengths are present in the .hkl file they will be ignored. The multiplicative scale s multiplies both F² and σ(F²) (or F and σ(F) for n = 1 or 3). The multiplicative weight wt multiplies all 1/σ² values and m is an integer offset needed to read condensed data (HKLF 1); both are included only for compatibility with SHELX-76. Usually simply 'HKLF 4' is all that will be required.

n = 1: SHELX-76 condensed data, now deprecated.

n = 3: h k l Fo σ(Fo) or h k l A B depending on the FMAP setting.

n = 4: h k l F² σ(F²). The recommended format for nearly all purposes.

n = 7: h k l E or h k l P (Patterson coefficient) depending on FMAP.

There may only be one HKLF instruction and it must come last!
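
To make the HKLF 4 layout concrete, here is a minimal Python sketch that reads records in the fixed format (3I4,2F8.2) and stops at the terminating 0 0 0 record. The function name read_hklf4 is hypothetical. Blank fields are read as zero, as in FORTRAN, but the FORTRAN rule for numbers written without a decimal point is not modelled.

```python
def read_hklf4(lines):
    """Read h, k, l, F², σ(F²) records in the fixed format (3I4,2F8.2)."""
    reflections = []
    for line in lines:
        record = line.rstrip("\n").ljust(28)
        # Column ranges follow the format: three 4-wide integers, two 8-wide reals.
        fields = [record[0:4], record[4:8], record[8:12],
                  record[12:20], record[20:28]]
        h, k, l = (int(f) if f.strip() else 0 for f in fields[:3])
        f2, sigma = (float(f) if f.strip() else 0.0 for f in fields[3:])
        if h == 0 and k == 0 and l == 0:
            break                       # terminator record
        reflections.append((h, k, l, f2, sigma))
    return reflections
```

Anything beyond column 28 (batch numbers, direction cosines etc.) is simply ignored by this sketch, and records after the terminator are not read.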

INIT nn[#] nf[#] s+[0.8] s-[0.2] wr[0.2]
The first stage involves five cycles of weighted tangent formula refinement (based on triplet phase relations only) starting from nn reflections with random phases and weights of 1. Single phase seminvariants which have Σ1-formula P+ values less than s- or greater than s+ are included with their predicted phases and unit weights. All these reflections are held fixed during the INIT stage but refined freely in the subsequent stages. The remaining reflections also start from random phases with initial weights wr, but both the phases and the weights are allowed to vary. If nf is non-zero, the nf 'best' (based on the negative quartet and triplet consistency) phase sets are retained and the process repeated for (npp-nf) parallel phase sets, where npp is the previous number of phase sets processed in parallel (often 128). This is repeated for nf fewer phase sets each time until only a quarter of the original number are processed in parallel. This rather involved algorithm is required to make efficient use of available computer memory. Typically nf should be 8 or 16 for 128 parallel permutations. The purpose of the INIT stage is to feed the phase annealing stage with relatively self-consistent phase sets, which turns out to be more efficient than starting the phase annealing from purely random phases. If TREF 0 is used to generate partial structure phases for all reflections, the INIT stage is skipped. To save time, only ns reflections and the strongest mtpr triplets for each reflection (or fewer, if not so many can be found) are used in the INIT stage; these numbers are given on the PHAN instruction.

LATT N[1]
Lattice type: 1=P, 2=I, 3=rhombohedral obverse on hexagonal axes, 4=F, 5=A, 6=B, 7=C. N must be made negative if the structure is non-centrosymmetric.
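
The encoding of N can be written out as a small Python lookup. This is a sketch only; the names LATTICE_TYPES and decode_latt are hypothetical.

```python
LATTICE_TYPES = {
    1: "P", 2: "I", 3: "R (obverse, hexagonal axes)",
    4: "F", 5: "A", 6: "B", 7: "C",
}


def decode_latt(n):
    """Return (lattice type, is_centrosymmetric) for the LATT parameter N."""
    if abs(n) not in LATTICE_TYPES:
        raise ValueError("LATT N must have an absolute value from 1 to 7")
    # A negative N marks a non-centrosymmetric structure.
    return LATTICE_TYPES[abs(n)], n > 0
```

For the ylid example, decode_latt(-1) describes a primitive, non-centrosymmetric structure.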

LIST m[0]
If m = 1 or m = 2 writes h, k, l, A and B lists to the name.res file, where A and B are the real and imaginary parts of a point-atom structure factor respectively. If m = 1 the list corresponds to the phased E-values for the 'best' direct methods solution, before partial structure expansion (if any). If m = 2 the list is produced after the final cycle of partial structure expansion, and corresponds to the weighted E-values used for the final Fourier synthesis. These options enable other Fourier programs to be used, e.g. for graphical display of 3D-Fouriers for data to less than atomic resolution. After data reduction and merging equivalent reflections, a list of h, k, l, Fo and σ(Fo) (for m = 3) or h, k, l, Fo² and σ(Fo²) (for m = 4) is written to the name.res file. This provided a useful input file for programs such as DIRDIF and MULTAN that did not provide sort/merge and rejection of systematic absences etc. SHELXS always averages Friedel opposites. In all four cases the output format is (3I4,2F8.2), and the list is terminated by a dummy reflection 0,0,0.

MOLE n[#]
Forces the following atoms, and atoms or peaks that are bonded to them, into molecule n of the PLAN output. n may not be greater than 99.

MORE verbosity[1]
MORE sets the amount of (printer) output; verbosity takes a value in the range 0 (least) to 3 (most verbose).

MOVE dx[0] dy[0] dz[0] sign[1]
The coordinates of the atoms that follow this instruction are changed to:

x' = dx + sign⋅x
y' = dy + sign⋅y
z' = dz + sign⋅z
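
Written as Python, the transformation is a one-liner; the function name apply_move is hypothetical and the coordinates are assumed to be fractional.

```python
def apply_move(xyz, dx=0.0, dy=0.0, dz=0.0, sign=1):
    """Apply x' = dx + sign*x (and likewise for y and z) to one atom."""
    x, y, z = xyz
    return (dx + sign * x, dy + sign * y, dz + sign * z)
```

With sign = -1 and dx = dy = dz = 1, for example, the atoms are inverted through the point (0.5, 0.5, 0.5).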

OMIT s[4] 2θ(lim)[180]
Thresholds for flagging reflections as 'unobserved'. Note that if no OMIT instruction is given, ALL reflections are treated as 'observed'. If F < s⋅σ(F), the reflection is considered to be 'unobserved'. If 2θ(lim) is POSITIVE, it specifies a 2θ value above which the data are treated as 'unobserved'; if it is negative, the absolute value is used as a lower 2θ cutoff.

OMIT h k l
The reflection h k l is flagged as 'unobserved' in the list of merged reflections after data reduction. It will not be used directly in phase refinement or Fourier calculations, but is retained for statistical purposes and as a possible cross-term in a negative quartet. Thus if it is known that a strong reflection has been included accidentally in the .hkl file with a very small intensity (e.g. because it was cut off by the beam stop), it is advisable to delete it from the .hkl file rather than using OMIT (which is intended for imprecisely measured data rather than blunders).

PHAN nsteps[10] cool[0.9] Boltz[#] ns[#] mtpr[40] mnqr[10]
The second stage of phase refinement is based on 'phase annealing' (Sheldrick, 1990). This has proved to be an efficient search method for large structures, and possesses a number of beneficial side-effects. It is based on nsteps cycles of tangent formula refinement (one cycle is a pass through all ns phases), in which a correction is applied to the tangent formula phase. The phase annealing algorithm gives the magnitude of the correction (it is larger when the 'temperature' is higher; this corresponds to a larger value of Boltz), and the sign is chosen to give the best agreement with the negative quartets (if there are no negative quartets involving the reflection in question, a random sign is used instead). After each cycle through all ns phases, a new value for Boltz is obtained by multiplying the old value by cool; this corresponds to a reduction in the 'temperature'. To save time, only ns reflections are refined using the strongest mtpr triplets and mnqr quartets for each reflection (or less, if not so many phase relations can be found). The phase annealing parameters chosen by the program will rarely need to be altered; however if poor convergence is observed, the Boltz value should be reduced; it should usually be in the range 0.2 to 0.5. When the 'TEXP 0 / TREF' method of multisolution partial structure refinement is employed, Boltz should be set at a somewhat higher value (0.4 to 0.7) so that not too many solutions are duplicated.
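
The cooling schedule described above is a simple geometric progression, sketched here in Python. The function name boltz_schedule is hypothetical, and the starting value of Boltz must be supplied because the program default is generated from the data.

```python
def boltz_schedule(boltz, nsteps=10, cool=0.9):
    """Boltz value in force during each of the nsteps annealing cycles."""
    values = []
    for _ in range(nsteps):
        values.append(boltz)
        boltz *= cool                   # 'temperature' reduction after each cycle
    return values
```

Starting from an assumed Boltz of 0.4 with the default nsteps and cool, the value in force during the last cycle is 0.4 x 0.9^9, i.e. roughly 0.15.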

PLAN npeaks[#] d1[0.5] d2[1.5]
If npeaks is positive it is the number of highest unique Fourier peaks that are written to the .res and .lst files; the remaining parameters are ignored. If npeaks is given as negative, the program attempts to arrange the peaks into unique molecules taking the space group symmetry into account, and to 'plot' a projection of each such molecule on the printer (i.e. the .lst file). Distances involving peaks which are less than r1+r2+d1 (the covalent radii r are defined via SFAC; 1 and 2 refer to the two atoms concerned) are considered to be bonds for purposes of the molecule assembly and tables. Distances involving atoms and/or peaks that are less than r1+r2+d2 are considered to be non-bonded interactions. Such interactions are ignored when defining molecules, but the corresponding atoms and distances are included in the line-printer output. Thus an atom may appear in more than one map, or more than once on the same map. Negative d2 includes hydrogen atoms in these non-bonds, otherwise they are ignored (the absolute value of d2 is used in the test). Peaks are always assigned the radius of SFAC type 1, which is usually set to carbon. Peaks appear on the printout as numbers, but in the .res file they are given names beginning with 'Q' and followed by the same numbers. To simplify interpretation of the lineprinter plots, extra symmetry-generated atoms are added, so that atoms or peaks may appear more than once. A table of the appropriate coordinates and symmetry transformations appears at the end of the output. See also MOLE for forcing molecules (and their environments) to be printed separately.
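
The two distance thresholds can be expressed as a short Python sketch. The function name classify_contact is hypothetical, the special treatment of hydrogen atoms is not modelled, and the covalent radius of 0.77 Å used for carbon in the example below is an assumed illustrative value rather than the one built into SHELXS.

```python
def classify_contact(distance, r1, r2, d1=0.5, d2=1.5):
    """Classify a distance (in Å) using the thresholds r1+r2+d1 and r1+r2+|d2|."""
    if distance < r1 + r2 + d1:
        return "bond"
    if distance < r1 + r2 + abs(d2):    # absolute value of d2 is used in the test
        return "non-bonded"
    return "none"
```

With both radii set to 0.77 Å and the default d1 and d2, a distance of 1.54 Å is classified as a bond and a distance of 3.0 Å as a non-bonded interaction.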

PSEE m[200] 2θ(max)[#]
The largest |m| E-values and the complete Patterson map are dumped into the name.res file in fixed format for use by the Patterson search program PATSEE. 2θ(max) should be used to limit the resolution of the E-values generated; the default value corresponds to sinθ=λ/2. The 2θ(max) value is also written to the .res file, so it is possible to restrict the resolution of the E-values actually used by PATSEE to a lower 2θ(max) by editing this file without rerunning SHELXS; of course the E-values with higher 2θ than the value used in SHELXS were not written to the .res file and so cannot be recovered in this way. When m is negative a super-sharp Patterson with coefficients √(E³F) is used; if m is positive a standard sharpened Patterson with coefficients (EF) is employed. The resulting name.res file must be renamed name.inp (or name.pat if the search fragment and encoded Patterson are to be read from separate files) for use by PATSEE. After a PSEE instruction, UNIT is followed by the strongest E-values and the full Patterson map in this output file (which may be rather long!).

REM
Followed by a comment on the same line. This comment is ignored by the program but is copied to the results file (.res).

SFAC elements
These element symbols define the order of scattering factors to be employed by the program. The first 94 elements of the periodic system are recognized. SHELXS uses absorption coefficients from International Tables (1991) volume C. For organic structures the first two SFAC types should be C and H, in that order; the E-Fourier recycling generally assigns the first SFAC type (i.e. C) to peaks.

SFAC a1 b1 a2 b2 a3 b3 a4 b4 c df' df" mu r wt
Scattering factor in the form of an exponential series, followed by real and imaginary corrections, linear absorption coefficient, covalent radius and atomic weight. In addition, a 'label' consisting of up to 4 characters beginning with a letter (e.g. Ca2+) may be included before a1. The two SFAC formats may be used in the same .ins file; the order of the SFAC instructions (and the order of element names in the first type of SFAC instruction) define the scattering factor numbers which are referenced by atom instructions. Not all numbers on this instruction are actually used by SHELXS, but the full data must be given for compatibility with SHELXL.

SPIN phi1[0] phi2[0] phi3[0]
The following fragment (which should begin with a FRAG instruction) is rotated by the specified angles (in radians). This instruction is used to reinput angles from the Patterson search program PATSEE.

SYMM symmetry operation
Symmetry operators, i.e. coordinates of the general positions as given in the International Tables, volume A. The operator X, Y, Z is always assumed, so may NOT be input. If the structure is centrosymmetric, the origin MUST lie on a center of symmetry. Lattice centering should be indicated by LATT, not SYMM. The symmetry operators may be specified using decimal or fractional numbers, e.g. 0.5-x,0.5+y,-z or Y-X,-X,Z+1/6; the three components are separated by commas. At least one SYMM instruction must be present unless the structure is triclinic.
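
As an illustration of the SYMM syntax, the following Python sketch converts an operator string into a 3x3 integer matrix and a translation vector. It is not the SHELXS parser: the name parse_symm is hypothetical, and only terms of the form ±X, ±Y, ±Z and decimal or fractional constants are accepted.

```python
import re
from fractions import Fraction

# One term: optional sign followed by a number (decimal or fraction) or an axis.
TERM = re.compile(r"([+-]?)(\d+(?:\.\d+)?(?:/\d+)?|[XYZ])")


def parse_symm(op):
    """Return (matrix rows, translation) for e.g. '0.5-X, -Y, 0.5+Z'."""
    components = op.upper().replace(" ", "").split(",")
    if len(components) != 3:
        raise ValueError("a SYMM operator needs three comma-separated components")
    rows, shifts = [], []
    for component in components:
        terms = TERM.findall(component)
        if "".join(sign + value for sign, value in terms) != component:
            raise ValueError("cannot interpret " + component)
        row, shift = [0, 0, 0], Fraction(0)
        for sign, value in terms:
            factor = -1 if sign == "-" else 1
            if value in ("X", "Y", "Z"):
                row["XYZ".index(value)] += factor
            else:
                shift += factor * Fraction(value)
        rows.append(row)
        shifts.append(shift)
    return rows, shifts
```

For the first SYMM line of the ylid example, '0.5-X, -Y, 0.5+Z', this gives the diagonal matrix (-1, -1, 1) with translation (1/2, 0, 1/2).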

TIME t[#]
If the time t (measured in seconds from the start of the job) is exceeded, SHELXS performs no further blocks of phase permutations (direct methods), but goes on to the final E-map recycling etc. In the case of Patterson interpretation, no further vector superpositions are performed after this time has expired. This instruction is a relic from the days when a SHELXS job took hours rather than a fraction of a second!

TITL [ ]
Title of up to 76 characters, to appear at suitable places in the output.

TREF np[100] nE[#] kapscal[#] ntan[#] wn[#]
np is the number of direct methods attempts; if negative, only the solution with code number |np| is generated (the code number is in fact a random number seed). Since the random number generation is very machine dependent, this can only be relied upon to generate the same results when run on the same model of computer. This facility is used to generate E-maps for solutions which do not have the 'best' combined figure of merit. No other parameter may be changed if it is desired to repeat a solution in this way. nE reflections are employed in the full tangent formula phase refinement. Values of nE that give fewer than 20 unique phase relations per reflection for the full phase refinement are not recommended. kapscal multiplies the products of the three E-values used in triplet phase relations; it may be regarded as a fudge factor to allow for experimental errors and also to discourage overconsistent (uranium atom) solutions in symmorphic space groups. If it is negative the cross-term criteria for the negative quartets are relaxed (but all three cross-term reflections must still be measured), and more negative quartets are used in the phase refinement, which is also useful for symmorphic space groups. ntan is the number of cycles of full tangent formula refinement, which follows the phase annealing stage and involves all nE reflections; it may be increased (at the cost of CPU time) if there is evidence that the refinement is not converging well.
To avoid overconsistency, cos⁻¹(<α>/α) is added to the modified tangent formula phase when <α> is less than α. α is the weighted sum of the cosines of the triple phase invariants and <α> is its statistically predicted value; the sign of the correction is chosen to give the best agreement with the negative quartets (a random sign is used if there are no negative quartets involving the phase in question). This tends to drive the figures of merit Rα and NQUAL simultaneously to desirable values. If ntan is negative, a penalty function (<Σ1>−Σ1)² is added to CFOM (see below) if and only if Σ1 is less than its estimated value <Σ1>. Σ1 is a weighted sum of the products of the expected and observed signs of one-phase seminvariants, normalized so that it must lie in the range -1 to +1. This is useful (i.e. better than nothing) if no negative quartets have been found. wn is a parameter used in calculating the combined figure of merit CFOM:

CFOM = Rα  (NQUAL < wn)    or    Rα + (wn−NQUAL)²  (NQUAL >= wn)

wn should be about 0.1 more negative than the anticipated value of NQUAL. Only the TREF instruction is essential to specify direct methods; appropriate INIT, PHAN, FMAP, GRID and PLAN instructions are then generated automatically if not given.
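
The CFOM expression above translates directly into Python. This sketch covers only the basic formula; the function name cfom is hypothetical and the optional Σ1 penalty applied for negative ntan is not included.

```python
def cfom(r_alpha, nqual, wn):
    """Combined figure of merit from Rα, NQUAL and the TREF parameter wn."""
    if nqual < wn:
        return r_alpha                  # NQUAL is negative enough: no penalty
    return r_alpha + (wn - nqual) ** 2  # penalty grows as NQUAL rises above wn
```

With wn = -0.8, for example, a solution with NQUAL = -0.9 keeps CFOM = Rα, whereas one with NQUAL = -0.5 is penalised by (0.3)² = 0.09.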

UNIT n1 n2 ...
Number of atoms of each type in the cell, in SFAC order.

ZERR Z esd(a) esd(b) esd(c) esd(α) esd(β) esd(γ)
Z-value (number of formula units per cell) followed by the estimated errors in the unit-cell dimensions. This information is not actually required by SHELXS but is allowed for compatibility with SHELXL.