Citing LGA program, GDT and LCS measures:
Zemla A., "LGA - a Method for Finding 3D Similarities in Protein Structures",
Nucleic Acids Research, 2003, Vol. 31, No. 13, pp. 3370-3374.
Server accessible at:
http://proteinmodel.org/
http://as2ts.llnl.gov/
The LGA program is being developed for comparative analysis of two selected 3D protein structures or fragments of 3D protein structures. By default the calculations are performed on CA atoms. However, the user can select atoms other than CA, or define the position within residues on which the calculations will be made (see options below: "-atom", "-bmo", and "-cb").
Structure comparative analysis can be made in two general modes:
The two novel measures LCS and GDT have been designed and developed by Adam Zemla to serve as a basis for a scoring function of the LGA alignment algorithm. When comparing two protein structures, the LCS procedure localizes (along the sequence) the Longest Continuous Segments of residues that can fit under a selected RMSD cutoff. The Global Distance Test (GDT) algorithm is designed to complement evaluations made with LCS by searching for the largest (not necessarily continuous) set of "equivalent" residues deviating by no more than a specified DISTANCE cutoff. In the structure alignment search procedure, for each calculated superposition and generated list of equivalent residues, the following values are calculated:
LCS_vi - percent of residue pairs from molecule1 and molecule2 (continuous set; relative to molecule2)
that can fit under the RMSD cutoffs of vi Angstroms (for vi = 1.0, 2.0, ...), and
GDT_vi - an estimation of the percent of residue pairs from molecule1 and molecule2 (largest set) that
can fit under the distance cutoffs of vi Angstroms (for vi = 0.5, 1.0, ...)
By combining results (see LGA_S score) from these two techniques (RMSD based and distance based), the LGA program not only identifies the "best" superposition between two proteins (meaning "under certain RMSD and distance cutoffs"), but also identifies the regions of local similarities, and quantifies the level of the overall structure similarity in terms of the percentage of similar residue conformations.
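The distance-based counting behind GDT_vi can be pictured with a minimal Python sketch. It is an illustration only, not the LGA implementation: it counts residue pairs under each cutoff for ONE fixed superposition, whereas GDT searches over many superpositions for the largest set at each cutoff. The function name and the example distances are hypothetical.

```python
def gdt_percentages(distances, n_target, cutoffs=(0.5, 1.0, 2.0, 4.0)):
    """Percent of target residues whose paired distance is <= each cutoff.

    distances: per-pair distances in Angstroms under one fixed superposition
               (assumption: the real GDT search tries many superpositions).
    n_target:  number of residues in molecule2, used as the frame of reference.
    """
    result = {}
    for cutoff in cutoffs:
        count = sum(1 for d in distances if d <= cutoff)
        result[cutoff] = 100.0 * count / n_target
    return result


# Hypothetical distances for five residue pairs.
example_distances = [0.3, 0.8, 1.7, 3.9, 6.2]
print(gdt_percentages(example_distances, n_target=5))
```

Because only one superposition is examined, the percentages from such a count can only underestimate what a full GDT search would report.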
A set of additional new GDT-like measures GDC (Global Distance Calculation) has been developed
to allow detailed structure comparison and evaluation of structure similarity of proteins
using all atoms or a list of selected atom positions (not only Calpha positions).
D. A. Keedy, C. J. Williams, J. J. Headd, W. B. Arendall III, V. B. Chen,
G. J. Kapral, R. A. Gillespie, J. N. Block, A. Zemla, D. C. Richardson,
J. S. Richardson. "The other 90% of the protein: Assessment beyond the Calphas
for CASP8 template-based and high-accuracy models", Proteins: Structure, Function,
Bioinformatics, 2009, 77, pp. 29-49.
The developed numerical measures and algorithms, LCS and GDT (GDT_TS, GDC_sc, GDC_all) for evaluating structural models, and LGA, the structure comparison and alignment program, are routinely used by CASP organizers and assessors to evaluate the accuracy of predicted structural models.
Author: Adam Zemla
US Patent: 8024127
Copyright: CP01155
For licensing instructions please check: License Agreement
Business Development Executive
Lawrence Livermore National Laboratory
7000 East Ave., L-795
Livermore, CA 94551
Phone: (925) 423-9724
Fax: (925) 423-8988
https://ipo.llnl.gov
The data for LGA processing should contain two sets of 3D structure coordinates (molecule1 and molecule2) in the format of the PDB standard ATOM records. As a result of LGA processing the user will get the rotated coordinates of the first structure (molecule1), and (optionally) the coordinates of the second structure (target - molecule2, not changed).
Suggested set of parameters for GDT and LCS structure similarity analysis of structures
with identical residue numbering: -3 -sda -o2 -gdc -stral
Suggested set of parameters for structure alignment LGA searches: -4 -o2 -gdc -lga_m -stral
Details about specific options that may help meet user's needs better:
options: [ -h | -aa | -al | -batch ]
[ -1 | -2 | -3 | -4 | -5 ]
[ -mol1:name1 | -mol2:name2 ]
[ -sda | -sia | -fit:b:gap:res | -stral | -stral:f | -lN:n ]
[ -atom:CA | -bmo:b:m:o | -cb:f | -ah:i | -ch1:c | -ch2:c ]
[ -aa1:n1:n2 | -aa2:n1:n2 | -gap1:n1:n2 | -gap2:n1:n2 ]
[ -er1:s1:s2 | -er2:s1:s2 ]
[ -gdc_set:s1:s2 | -gdc_sup:s1:s2 | -gdc_at:a1,a2 | -gdc_eat:e1:e2 ]
[ -gdc_sc | -gdc | -gdc:n | -gdc_ref | -gdc_ref:n ]
[ -o0 | -o1 | -o2 | -r | -rmsd | -swap | -fp | -ie ]
[ -d:f | -gdt | -lw:n | -lga:f | -lga_m ]
where: -h help information
-1 standard RMSD
-2 RMSD using ISP (Iterative Superposition Procedure)
-3 GDT and LCS analysis
-4 structure alignment analysis
-5 structure best fit analysis: S => S-gap-S , S-gap-S => S
-mol1:name1 name of the molecule1 that will be used in output file
(name1.name2). The alphanumeric characters and '_'
are allowed.
-mol2:name2 name of the molecule2 that will be used in output file
(name1.name2). The alphanumeric characters and '_'
are allowed.
-atom:CA CA (Calpha) atoms will be used for calculations.
NOTE: to specify special character "'" use ",".
For example: use "-atom:CB" to select CB atom,
use "-atom:H5,1" to select H5'1 atom.
-bmo:b:m:o CB and M (M = Mid: C,CA,CB,N) atoms will be calculated.
The coordinates of the point representing amino-acid
position (BMO; backbone model) for LGA processing are
defined by the following vectors:
vector CA-CB: -5.0 <= b <= 5.0
vector CA-M: -5.0 <= m <= 5.0
vector CA-O: -5.0 <= o <= 5.0
For example: CA = -bmo:0:0:0 (default)
-cb:f CB (Cbeta) atom position will be calculated for each amino-acid,
and the coordinates of the point representing amino-acid
position (BMO; backbone model) for LGA processing will be
defined by the vector CA-CB: -5.0 <= f <= 5.0 , (e.g. f=0
corresponds to CA position, and f=1 represents CB position)
NOTE: if "-cb:f" is combined with "-atom:CB" then all
existing CB atoms are leveraged and only missing CB atoms
are calculated. This option is equivalent to -bmo:f:0:0
-ch1:c chain c selected from molecule1
-ch2:c chain c selected from molecule2
-ah:i ATOM or HETATM records are used for calculations:
i=0 both
i=1 ATOM
i=2 HETATM
-lga:f weight LCS and GDT measures: LCS 0.0 <= f <= 1.0 GDT,
default: f = 0.75
-lga_m maximum value of LGA_S (LGA_M) reported in SUMMARY line
-d:f DIST distance cutoff (f Angstroms; default f=5.0)
-opt:n optimization parameter: 0, 1, 2. Default: 1
-gdt can be combined with "-3" option. If used then the
superposition that fits maximum number of residues under
a given distance cutoff is reported. Otherwise standard
superposition calculated using the set of identified N
residues is reported (rotated molecule1)
-lw:n "Lesk window", rms calculated on residue window
(length of the window = 2*n+1)
-lN:n limit on N superimposed residues (if calculated N S-gap-S)
-sia sequence independent analysis,
structure conformation independent analysis (S-gap-S => S)
-fit:b:g:r search for the best fit,
b number of residues below 0.5 A (b - integer 0 <= b <= 9)
g length of the gap (g - integer 0 <= g <= 99)
r residue number after which the gap appears (r - string)
-er1:s1:s2 exact range of residues from the molecule1 used for
calculations (s1 , s2 - strings e.g.: s1 = 13L_A <= s2 = 45_B)
the si pairs (ranges beg:end) can be separated by ',':
-er1:s1:s2,s3:s4,s5:s6,s7:s8,s9:s10
NOTE: single residues or chains can be separated by ','(no beg:end required):
-er1:s1,s2,s3,
Up to 50 er1 parameters are allowed (WARNING: no overlaps)
-er2:s1:s2 exact range of residues from the molecule2 used for
calculations (s1 , s2 - strings e.g.: s1 = 13L_A <= s2 = 45_B)
the si pairs (ranges beg:end) can be separated by ',':
-er2:s1:s2,s3:s4,s5:s6,s7:s8,s9:s10
NOTE: single residues or chains can be separated by ','(no beg:end required):
-er2:s1,s2,s3,
Up to 50 er2 parameters are allowed (WARNING: no overlaps)
-gdc:n n - number of bins used for GDC evaluation of atom pairs from the
corresponding residues (1 <= n <= 20; bins: <0.5, <1.0, ... <10.0).
NOTE: this option changes the default number of "bins" (n=20) for
GDC calculations (GDC_all - all atoms, GDC_mc - main chain atoms,
and GDC_at - selected atoms). The default number n=20 defines bins
from 0.5 to 10.0 Angstroms.
-gdc GDC score is calculated using all identical atoms from the target as
a frame of reference (equivalent to: -gdc_ref:2 -swap)
-gdc_ref:n GDC score is calculated: 0 - requesting a complete set of atoms within
each residue, 1 - using atoms from the target as a frame of reference,
2 - using all identical atoms from the target as a frame of reference.
The default set is -gdc_ref:0
-gdc_sup:s1:s2 exact range of residues from the molecule2 used for
GDC superposition calculations. This additional standard (-1)
superposition is calculated on CA atoms from the set of
amino-acid ranges (s1,s2) defined by s1 and s2 strings.
e.g. -gdc_sup:s1:s2,s3:s4,s5:s6,s7:s8,s9:s10
Format is the same as for er2 parameters.
NOTE: this option is applied to the molecule2 only. Corresponding
residues from molecule1 are automatically determined using main
superposition.
-gdc_sup expands an option "-rmsd". If used then the superposition which is
used for GDC calculations is reported and used to rotate molecule1.
Otherwise the standard LGA superposition is reported.
-gdc_set:s1:s2 exact range of residues from the molecule2 for which the
"Global Distance Calculations" (GDC) will be performed.
e.g. -gdc_set:s1:s2,s3:s4,s5:s6,s7:s8,s9:s10
Format is the same as for er2 parameters.
NOTE: this option is applied to the molecule2 only. Amino-acids
from the molecule2 serve as a frame of reference for GDC evaluation
(corresponding amino-acids or atoms that are missing in molecule1
are counted as 0 scores in GDC calculations).
-gdc_at:a1,a2 amino-acid atom names (one atom per one name of amino-acid) from
the molecule2 for which the GDC calculations (distances and GDC
summary) will be calculated.
Format example (aaname.atom): -gdc_at:a1,a2,a3,a4
where: a1 = V.CG1, a2 = C.SG, a3 = T.OG1, a4 = H.NE2
NOTE: this option is applied to the molecule2 only. The
corresponding atoms from the molecule1 will be detected based
on the calculated alignment. Up to 20 representative atoms
(one atom per each of 20 amino-acid) can be selected for
GDC evaluation. Number of identified identical "amino-acid.atom"
pairs serve as a frame of reference for GDC evaluation.
Results from the GDC-at calculations are reported in Dist_at and
GDC_at columns.
-gdc_at:*.at allows a selection of one mainchain or CB atom (at: N,CA,C,O,CB)
the same for all amino acids (e.g. -gdc_at:*.N).
NOTE: amino-acids from the molecule2 serve as a frame of reference
for GDC evaluation (corresponding amino-acids or atoms that are
missing in molecule1 are counted as 0 scores in GDC calculations).
-gdc_eat:e1:e2 exact atom "e1" from the molecule1 and "e2" from the molecule2 for
which the GDC calculations (distances and GDC summary) will be
calculated. Format example (aanumber_chain.atom):
-gdc_eat:e1:e2,e3:e4,e5:e6
where: for each pair (em:en) em is a selected atom from the
molecule1, and en is an atom from the molecule2.
For example: e1 = 10_A.OD2, e2 = 21_B.ND2
-gdc_sc automated selection of all flags required for GDC_sc calculations:
-swap -gdc:10 -gdc_at:V.CG1,L.CD1,I.CD1,P.CG,M.CE,F.CZ,W.CH2,S.OG
-gdc_at:T.OG1,C.SG,Y.OH,N.OD1,Q.OE1,D.OD2,E.OE2,K.NZ,R.NH2,H.NE2
NOTE: this option changes the default number of "bins" (see the
selection "-gdc:n"; n=10). All GDC calculations (GDC_all - all atoms,
GDC_mc - main chain atoms, and GDC_at - selected atoms) will be
performed using n=10 as a number of bins from 0.5 to 5.0 Angstroms.
Results from the GDC_sc calculations are reported in GDC_at column.
-aa1:n1:n2 range of residues from the molecule1 used for calculations
-9999 < n1 < n2 < 9999 (n1, n2 - integer)
NOTE: only one aa1 parameter is allowed.
-aa2:n1:n2 range of residues from the molecule2 used for calculations
-9999 < n1 < n2 < 9999 (n1, n2 - integer)
NOTE: only one aa2 parameter is allowed.
-gap1:n1:n2 range of residues from the molecule1 removed from calculations
-9999 < n1 < n2 < 9999 (n1, n2 - integer)
NOTE: only one gap1 parameter is allowed.
-gap2:n1:n2 range of residues from the molecule2 removed from calculations
-9999 < n1 < n2 < 9999 (n1, n2 - integer)
NOTE: only one gap2 parameter is allowed.
-aa generates a list of all residues from the molecule1 and
molecule2 (AAMOL* records)
-al calculations will be made only on the set of residues from the
attached AAMOL* or LGA records
-o0 no coordinates are printed out
-o1 only molecule 1 (rotated) is printed out into the
subdirectory TMP
-o2 molecule 1 (rotated) and molecule 2 (target) both are
printed out into the subdirectory TMP
-r the residue ranges of compared structures are reported in the
SUMMARY line: e.g. (1_A:214_A:7_A:196_A)
-rmsd additional RMSD and GDC calculations will be performed on all
aligned CA, MC and ALL atoms.
RMSD values are "rmsd-based" measures: see MC and ALL columns
GDC values are "distance-based" measures: see Dist_max, GDC_mc, and GDC_all
-swap expands an option "-rmsd". RMSD and GDC calculations will be
performed with checking for swapping atoms in amino acids:
ASP, GLU, PHE, and TYR
-fp full print output
-check reports amino acids with missing pre-selected atoms
-ie ignores errors in PDB data (force calculations). If "-ie" is not
present, the calculations are terminated when an ERROR is detected
in the input data
-stral additional information about identified structural SPANS (regions with tight
superpositions) is reported: S_nb - number of SPANS, S_N - combined number
of residues within SPANS, S_Id - average sequence identity within SPANS
(standalone version: two output files in TMP directory are created: *.stral and *.pdb)
-stral:f cutoff for local RMSD for stral calculations (0.01 <= f <= 10.0)
default: f = 0.5
-batch:frun allows several different lga calculations to be run on the
same mol1.mol2 pair of structures. File frun contains a list
of parameters. Maximum number of RUN lines is limited to 400
(see below).
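As a rough picture of the "-cb:f" option described above, the representative point can be read as a position along the CA-CB vector, with f=0 at CA and f=1 at CB. The Python sketch below assumes a plain linear formula CA + f * (CB - CA); the option text does not spell the formula out, and the function name is hypothetical.

```python
def point_on_ca_cb(ca, cb, f):
    """Return CA + f * (CB - CA) for 3D coordinate tuples.

    Assumption: the '-cb:f' position scales linearly along the CA-CB vector,
    which matches the documented end points f=0 (CA) and f=1 (CB).
    """
    if not -5.0 <= f <= 5.0:
        raise ValueError("f must lie in the documented range -5.0 <= f <= 5.0")
    return tuple(a + f * (b - a) for a, b in zip(ca, cb))


# Hypothetical coordinates, chosen only to make the arithmetic easy to follow.
ca = (0.0, 0.0, 0.0)
cb = (1.0, 2.0, 3.0)
print(point_on_ca_cb(ca, cb, 0.0))   # the CA position
print(point_on_ca_cb(ca, cb, 1.0))   # the CB position
print(point_on_ca_cb(ca, cb, 2.0))   # a point beyond CB along the same vector
```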
If two structures from PDB have to be analyzed then please use the following notation:
1cpi_A for PDB entry: 1cpi, chain: 'A'
1akf for PDB entry: 1akf, chain: ' '
and specifying NMR MODEL:
1bve_B_5 for PDB entry: 1bve, chain: 'B', model: 5
1rel___4 for PDB entry: 1rel, chain: ' ', model: 4
If your data (two structures) is already prepared as one file, please check that each of the two 3D structures begins with a MOLECULE record and ends with an END record.
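The entry_chain_model notation above can be decoded with a few lines of Python. This is a sketch under stated assumptions: a four-character PDB entry code, a one-character chain ID, and '_' standing in for a blank chain. The function name is hypothetical and not part of LGA.

```python
def parse_structure_name(name):
    """Split e.g. '1bve_B_5' into (entry, chain, model).

    Assumptions: 4-character entry code, single-character chain ID,
    '_' in the chain position means a blank chain, model is optional.
    """
    entry, rest = name[:4], name[4:]
    chain, model = " ", None
    if rest:
        if rest[0] != "_" or len(rest) < 2:
            raise ValueError(f"unexpected notation: {name!r}")
        chain = " " if rest[1] == "_" else rest[1]
        tail = rest[2:]
        if tail:
            if tail[0] != "_" or not tail[1:].isdigit():
                raise ValueError(f"unexpected notation: {name!r}")
            model = int(tail[1:])
    return entry, chain, model


for example in ("1cpi_A", "1akf", "1bve_B_5", "1rel___4"):
    print(example, parse_structure_name(example))
```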
### Example of usage of the standalone LGA program:
./lga -4 -o2 -gdc -lga_m -stral STR1.STR2
Input: file_name
file_name - the file (e.g.: STR1.STR2) is located inside the subdirectory MOL2, and
contains two structures "STR1" and "STR2" in PDB format.
Each structure for LGA analysis should begin with
MOLECULE and end with END record:
MOLECULE name1
ATOM ........
........
ATOM ........
END
MOLECULE name2
ATOM ........
........
ATOM ........
END
Input files (e.g.: STR1.STR2) are located inside the subdirectory MOL2.
Output: file_name.pdb, file_name.lga
file_name.pdb - contains two superimposed PDB structures: 1 => 2
file_name.lga - contains calculated residue equivalences
NOTE: if options: -mol1:name1 and -mol2:name2 are used
then output file_name = name1.name2
Output files are written into the subdirectory TMP.
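The input layout described above (two structures, each wrapped in MOLECULE and END records) can be assembled with a short Python helper. This is only a sketch: the function name is hypothetical, and keeping HETATM records alongside ATOM records is an assumption based on the "-ah" option.

```python
def build_lga_input(name1, pdb_lines1, name2, pdb_lines2):
    """Join two sets of PDB records into one MOLECULE ... END file body."""
    parts = []
    for name, lines in ((name1, pdb_lines1), (name2, pdb_lines2)):
        parts.append(f"MOLECULE {name}")
        # Keep coordinate records only (assumption: HETATM is kept for '-ah').
        parts.extend(
            line.rstrip("\n")
            for line in lines
            if line.startswith(("ATOM", "HETATM"))
        )
        parts.append("END")
    return "\n".join(parts) + "\n"


# Hypothetical two-record structures, just to show the resulting layout.
str1 = ["HEADER    demo\n",
        "ATOM      1  CA  ASP A  25       8.355   2.887  20.497  1.00  6.13\n"]
str2 = ["ATOM      1  CA  THR B  26       6.153   1.507  23.318  1.00  6.74\n"]
print(build_lga_input("STR1", str1, "STR2", str2), end="")
```

The resulting text would then be saved as MOL2/STR1.STR2 before running lga on STR1.STR2.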
### Example of calculating GDT_HA, GDT_TS or any other combination of GDT scores from LGA (-3) output:
# formula for calculating GDT_HA:
./lga -3 -sda STR1.STR2 | grep "GDT PERCENT_AT" | awk '{ V=($3+$4+$6+$10)/4.0; printf "GDT_HA = %6.2f\n",V; }'
# formula for calculating GDT_TS:
./lga -3 -sda STR1.STR2 | grep "GDT PERCENT_AT" | awk '{ V=($4+$6+$10+$18)/4.0; printf "GDT_TS = %6.2f\n",V; }'
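The same field arithmetic as in the awk one-liners above can be written in Python, indexing the "GDT PERCENT_AT" values by cutoff rather than by field number. The function name is hypothetical; the sample line is the PERCENT_AT record from the "-3" example output in this document.

```python
def gdt_scores(percent_line):
    """Return (GDT_TS, GDT_HA) from one 'GDT PERCENT_AT ...' output line."""
    fields = percent_line.split()
    if fields[:2] != ["GDT", "PERCENT_AT"]:
        raise ValueError("not a GDT PERCENT_AT line")
    percents = [float(x) for x in fields[2:]]   # percents[i] <-> cutoff 0.5*(i+1)

    def at(cutoff):
        return percents[int(round(cutoff / 0.5)) - 1]

    gdt_ts = (at(1.0) + at(2.0) + at(4.0) + at(8.0)) / 4.0
    gdt_ha = (at(0.5) + at(1.0) + at(2.0) + at(4.0)) / 4.0
    return gdt_ts, gdt_ha


line = ("GDT PERCENT_AT 12.50 18.06 19.44 19.44 19.44 20.83 25.00 30.56 36.11 "
        "45.83 59.72 73.61 84.72 95.83 100.00 100.00 100.00 100.00 100.00 100.00")
ts, ha = gdt_scores(line)
print(f"GDT_TS = {ts:6.2f}")
print(f"GDT_HA = {ha:6.2f}")
```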
-------------------------------------------------------------------------------
Example of the output from the LGA program ("-4" - structure alignment search):
LGA-parameters used: -4 -d:2.3 -swap
# Molecule1: number of CA atoms 99 ( 760), selected 22 , name 1sip_A
# Molecule2: number of CA atoms 99 ( 1560), selected 31 , name 1bve_B_5
# PARAMETERS: 1sip_A.1bve_B_5 -4 -d:2.3 -swap -aa1:25:46 -aa2:20:50
# Search for Atom-Atom correspondence
# Structure alignment analysis
# Checking swapping
# possible swapping detected: D 30_A D 30_B
# Molecule1 Molecule2 DISTANCE Mis MC All Dist_max GDC_mc GDC_all
LGA - - K 20_B - - - - - - -
LGA - - E 21_B - - - - - - -
LGA - - A 22_B - - - - - - -
LGA - - L 23_B - - - - - - -
LGA - - L 24_B - - - - - - -
LGA D 25_A D 25_B 1.295 0 0.067 0.282 1.545 81.429 83.750
LGA T 26_A T 26_B 1.342 0 0.076 0.813 3.538 85.952 76.122
LGA G 27_A G 27_B 0.619 0 0.171 0.171 1.071 90.595 90.595
LGA A 28_A A 28_B 0.415 0 0.126 0.113 0.538 97.619 98.095
LGA D 29_A D 29_B 0.335 0 0.195 0.437 1.720 95.238 91.845
LGA D 30_A D 30_B 0.942 0 0.086 0.767 3.322 85.952 74.643
LGA S 31_A T 31_B 0.978 2 0.190 0.214 1.130 85.952 60.748
LGA I 32_A V 32_B 0.885 2 0.131 0.168 1.460 88.214 62.041
LGA V 33_A L 33_B 0.865 3 0.118 0.205 1.350 90.476 55.417
LGA T 34_A E 34_B 1.598 4 0.088 0.081 2.505 69.048 38.783
LGA G 35_A E 35_B - - - - - - -
LGA I 36_A M 36_B 2.065 3 0.040 0.061 2.714 71.190 44.702
LGA E 37_A S 37_B 0.338 1 0.037 0.059 0.938 95.238 78.571
LGA L 38_A L 38_B 0.472 0 0.704 0.627 1.912 88.452 85.060
LGA G 39_A P 39_B # - - - - - -
LGA P 40_A G 40_B 2.563 0 0.616 0.616 5.018 51.310 51.310
LGA H 41_A R 41_B 1.616 6 0.044 0.042 1.726 77.143 34.675
LGA Y 42_A W 42_B 0.919 9 0.095 0.120 1.160 88.214 31.667
LGA T 43_A K 43_B 1.421 4 0.136 0.140 1.477 81.429 45.238
LGA P 44_A P 44_B 1.239 0 0.068 0.278 1.239 81.429 82.721
LGA K 45_A K 45_B 0.583 0 0.288 1.176 2.594 84.048 77.302
LGA I 46_A M 46_B 1.241 3 0.047 0.069 2.020 79.286 47.738
LGA - - I 47_B - - - - - - -
LGA - - G 48_B - - - - - - -
LGA - - G 49_B - - - - - - -
LGA - - I 50_B - - - - - - -
# RMSD_GDC results: CA MC common percent ALL common percent GDC_mc GDC_all
NUMBER_OF_ATOMS_AA: 20 80 80 100.00 155 118 76.13 31
SUMMARY(RMSD_GDC): 1.227 1.374 1.450 53.813 42.291
#CA N1 N2 DIST N RMSD Seq_Id LGA_S GDT_HA4
SUMMARY(LGA) 22 31 2.3 20 1.23 45.00 64.078 48.387
Unitary ROTATION matrix and the SHIFT vector superimpose molecules (1=>2)
X_new = 0.207331 * X + 0.070492 * Y + -0.975728 * Z + 21.289257
Y_new = 0.207127 * X + -0.977951 * Y + -0.026640 * Z + -17.874228
Z_new = -0.956092 * X + -0.196577 * Y + -0.217360 * Z + 14.324877
Euler angles from the ROTATION matrix. Conventions XYZ and ZXZ:
Phi Theta Psi [DEG: Phi Theta Psi ]
XYZ: 0.784907 1.273364 -2.406362 [DEG: 44.9718 72.9584 -137.8744 ]
ZXZ: -1.543500 1.789906 -1.773575 [DEG: -88.4361 102.5540 -101.6183 ]
# END of job
The output (see above) from LGA calculations contains the following information:
1) The residue-residue equivalences are reported in LGA lines,
2) In the DISTANCE column the distances in Angstroms between corresponding residues
are reported when final global superposition is applied ("-" is present when
residues are not aligned under selected distance cutoff DIST).
The "#" in the sequence alignment (DISTANCE column) indicates that the calculated
distance between corresponding residues is above the selected cutoff, and that
these residues could be included in the alignment if the DIST cutoff is changed.
The user may vary the DIST cutoff to calculate tighter (more accurate) or more relaxed
(to recognize overall similarity) superpositions (the default: DIST=5 Angstroms),
3) The option "-rmsd" allows the calculation of RMSD values on aligned CA, MC
(main chain; N,CA,C,O), and ALL atoms. If the option "-swap" is chosen then
"swapping" is considered when calculating RMSD on ALL atoms. It means that in amino
acids where atom names can be switched, i.e.
for ASP: OD1 <-> OD2
for GLU: OE1 <-> OE2
for PHE: CD1 <-> CD2
CE1 <-> CE2
for TYR: CD1 <-> CD2
CE1 <-> CE2
cartesian rmsd is calculated with an option to minimize its value. Sets (CD1, CE1) and
(CD2, CE2) in PHE and TYR, as well as atoms OD1 and OD2 in ASP, OE1 and OE2 in GLU are
exchanged and more favorable contributions to rmsd are taken into account. In the above
example the possible swapping was detected for residue pair: D 30_A - D 30_B
# possible swapping detected: D 30_A D 30_B
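The idea behind "-swap" can be sketched in Python: evaluate the squared deviation for a residue pair with the atom names as given and with the exchangeable names swapped, then keep the smaller value. This is a minimal illustration, not LGA's code; the function names are hypothetical, and the coordinates are assumed to be already superimposed.

```python
import math

# Exchangeable atom names, as listed in the text above.
SWAPS = {
    "ASP": [("OD1", "OD2")],
    "GLU": [("OE1", "OE2")],
    "PHE": [("CD1", "CD2"), ("CE1", "CE2")],
    "TYR": [("CD1", "CD2"), ("CE1", "CE2")],
}


def sq_dev(atoms1, atoms2):
    """Sum of squared distances over atom names present in both residues."""
    return sum(math.dist(atoms1[n], atoms2[n]) ** 2 for n in atoms1 if n in atoms2)


def best_sq_dev(resname, atoms1, atoms2):
    """Return (squared deviation, swapped?) for the more favorable naming."""
    direct = sq_dev(atoms1, atoms2)
    pairs = SWAPS.get(resname)
    if not pairs:
        return direct, False
    rename = {}
    for a, b in pairs:                      # PHE/TYR: both pairs swap together
        rename[a], rename[b] = b, a
    swapped_atoms = {rename.get(n, n): xyz for n, xyz in atoms1.items()}
    swapped = sq_dev(swapped_atoms, atoms2)
    return (swapped, True) if swapped < direct else (direct, False)


# Hypothetical ASP pair whose OD1/OD2 labels are exchanged between structures.
model = {"CG": (0.0, 0.0, 0.0), "OD1": (1.0, 0.0, 0.0), "OD2": (0.0, 1.0, 0.0)}
target = {"CG": (0.0, 0.0, 0.0), "OD1": (0.0, 1.0, 0.0), "OD2": (1.0, 0.0, 0.0)}
print(best_sq_dev("ASP", model, target))
```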
In the "Mis" column the number of missing atoms in a given amino acid is reported. It is
calculated relative to the definition (see "-gdc_ref:0") of the amino acid from the second
molecule (in this example: target=1bve_B_5).
For more options please check the flag: -gdc.
The following atoms are expected for a given amino acid:
aa 1 2 3 4 5 6 7 8 9 10 11 12 13 14
A: N CA C O CB : Alanine
V: N CA C O CB CG1 CG2 : Valine
L: N CA C O CB CG CD1 CD2 : Leucine
I: N CA C O CB CG1 CG2 CD1 : Isoleucine
P: N CA C O CB CG CD : Proline
M: N CA C O CB CG SD CE : Methionine
F: N CA C O CB CG CD1 CD2 CE1 CE2 CZ : Phenylalanine
W: N CA C O CB CG CD1 CD2 NE1 CE2 CE3 CZ2 CZ3 CH2 : Tryptophan
G: N CA C O : Glycine
S: N CA C O CB OG : Serine
T: N CA C O CB OG1 CG2 : Threonine
C: N CA C O CB SG : Cysteine
Y: N CA C O CB CG CD1 CD2 CE1 CE2 CZ OH : Tyrosine
N: N CA C O CB CG OD1 ND2 : Asparagine
Q: N CA C O CB CG CD OE1 NE2 : Glutamine
D: N CA C O CB CG OD1 OD2 : Aspartic acid
E: N CA C O CB CG CD OE1 OE2 : Glutamic acid
K: N CA C O CB CG CD CE NZ : Lysine
R: N CA C O CB CG CD NE CZ NH1 NH2 : Arginine
H: N CA C O CB CG ND1 CD2 CE1 NE2 : Histidine
X: N CA C O CB : Nonstandard (ATOM or HETATM records)
#: N CA C O : Unknown (ATOM records)
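The "Mis" column can be reproduced from the table above with a small Python sketch. The rule used here is inferred from the NOTE on differing amino acids and from the example output (for instance S vs T gives 2, Y vs W gives 9); it is an assumption, not a statement of LGA's internal logic, and the names are hypothetical.

```python
# Expected atoms per amino acid, copied from the table above.
EXPECTED = {aa: atoms.split() for aa, atoms in {
    "A": "N CA C O CB",
    "V": "N CA C O CB CG1 CG2",
    "L": "N CA C O CB CG CD1 CD2",
    "I": "N CA C O CB CG1 CG2 CD1",
    "P": "N CA C O CB CG CD",
    "M": "N CA C O CB CG SD CE",
    "F": "N CA C O CB CG CD1 CD2 CE1 CE2 CZ",
    "W": "N CA C O CB CG CD1 CD2 NE1 CE2 CE3 CZ2 CZ3 CH2",
    "G": "N CA C O",
    "S": "N CA C O CB OG",
    "T": "N CA C O CB OG1 CG2",
    "C": "N CA C O CB SG",
    "Y": "N CA C O CB CG CD1 CD2 CE1 CE2 CZ OH",
    "N": "N CA C O CB CG OD1 ND2",
    "Q": "N CA C O CB CG CD OE1 NE2",
    "D": "N CA C O CB CG OD1 OD2",
    "E": "N CA C O CB CG CD OE1 OE2",
    "K": "N CA C O CB CG CD CE NZ",
    "R": "N CA C O CB CG CD NE CZ NH1 NH2",
    "H": "N CA C O CB CG ND1 CD2 CE1 NE2",
}.items()}

MAINCHAIN_CB = {"N", "CA", "C", "O", "CB"}


def count_missing(model_aa, model_atoms, target_aa):
    """Number of target-defined atoms with no counterpart in the model residue."""
    present = set(model_atoms)
    if model_aa != target_aa:
        # Assumed rule: for different residue types only N,CA,C,O,CB can match.
        present &= MAINCHAIN_CB
    return sum(1 for atom in EXPECTED[target_aa] if atom not in present)


print(count_missing("S", EXPECTED["S"], "T"))   # compare with row S 31_A / T 31_B
print(count_missing("Y", EXPECTED["Y"], "W"))   # compare with row Y 42_A / W 42_B
```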
4) There are three "distance based" values calculated for each selected amino acid: Dist_max,
GDC_mc and GDC_all (GDC - Global Distance Calculation). Dist_max is a maximum distance
between atoms from the corresponding (superimposed, equivalent) amino acids. This measure
can help evaluate how far from each other the side chain ends are for a given amino acid
under calculated superposition. GDC_mc and GDC_all are the measures (range: 0 - 100) which
for each listed and aligned amino acid combine the percentages of atoms (mainchain atoms
and all atoms) that fit under the selected distances: 0.5, 1.0, 1.5, ..., 10.0 (a similar
procedure as in GDT and LGA_S measures; see below).
NOTE: when different amino-acids are superimposed then "rmsd All", "Dist_max", and
"GDC_all" calculations are restricted to provided coordinates of mainchain+CB atoms
only (i.e.: N,CA,C,O,CB). If identical amino-acids are superimposed, then all corresponding
atoms (if provided) are evaluated. For both cases the rmsd "MC" and "GDC_mc" measures are
calculated on mainchain atoms only (i.e.: N,CA,C,O).
5) The SUMMARY(RMSD_GDC) line reports values of RMSD calculated on all aligned CA atoms,
MC atoms, and ALL atoms from aligned amino acids. The GDC_mc and GDC_all values in the
SUMMARY(RMSD_GDC) line are the sums of the per-residue GDC_mc and GDC_all values,
respectively, divided by the number of amino acids selected in the molecule2 (in this example: 31).
NOTE: the option "-rmsd" can be combined with "-lw:n" to specify the length of
sliding window for calculating local RMSDs,
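The averaging in item 5 can be checked directly against the example output. The Python sketch below sums the per-residue GDC_mc and GDC_all values from the 20 aligned LGA records shown earlier and divides by the 31 residues selected in molecule2; the function name is hypothetical.

```python
def summary_gdc(per_residue_values, n_target):
    """Sum of per-residue GDC values divided by the residues selected in molecule2."""
    return sum(per_residue_values) / n_target


# GDC_mc and GDC_all columns of the 20 aligned LGA records in the "-4" example.
gdc_mc = [81.429, 85.952, 90.595, 97.619, 95.238, 85.952, 85.952, 88.214,
          90.476, 69.048, 71.190, 95.238, 88.452, 51.310, 77.143, 88.214,
          81.429, 81.429, 84.048, 79.286]
gdc_all = [83.750, 76.122, 90.595, 98.095, 91.845, 74.643, 60.748, 62.041,
           55.417, 38.783, 44.702, 78.571, 85.060, 51.310, 34.675, 31.667,
           45.238, 82.721, 77.302, 47.738]

print(f"GDC_mc  = {summary_gdc(gdc_mc, 31):.3f}")    # SUMMARY line reports 53.813
print(f"GDC_all = {summary_gdc(gdc_all, 31):.3f}")   # SUMMARY line reports 42.291
```

Residues that are not aligned contribute nothing to the sums but still count in the denominator, which is why the summary values are lower than a plain average over the 20 aligned rows.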
6) In the SUMMARY(LGA) line the following information is reported:
#CA N1 N2 DIST N RMSD Seq_Id LGA_S GDT_HA4
SUMMARY(LGA) 22 31 2.3 20 1.23 45.00 64.078 48.387
where:
N1 number of residues from mol1 (model)
N2 number of residues from mol2 (target)
DIST selected distance cutoff DIST
N number of residues superimposed under distance cutoff DIST
RMSD RMSD calculated on N residues superimposed under the distance DIST
Seq_Id Sequence Identity. Percent of identical residues from
the total of N aligned under the distance DIST
LGA_S LGA_S score (0.00 - 100.00) calculated with reference to the
number of residues in target (name2 - here 31 residues)
GDT_HA4 GDT_HA4 ("high accuracy" version of GDT_TS) score calculated for local
and global residue-residue correspondences established by LGA
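For scripted post-processing, the SUMMARY(LGA) line can be split into named fields. The Python sketch below covers only the eight-column layout shown above; other option combinations may print additional columns, which this sketch does not attempt to handle. The function name is hypothetical.

```python
SUMMARY_FIELDS = ("N1", "N2", "DIST", "N", "RMSD", "Seq_Id", "LGA_S", "GDT_HA4")


def parse_summary_lga(line):
    """Turn one 'SUMMARY(LGA) ...' line into a dict keyed by column name."""
    tokens = line.split()
    if not tokens or tokens[0] != "SUMMARY(LGA)":
        raise ValueError("not a SUMMARY(LGA) line")
    values = tokens[1:]
    if len(values) != len(SUMMARY_FIELDS):
        raise ValueError("unexpected number of columns for this sketch")
    record = dict(zip(SUMMARY_FIELDS, (float(v) for v in values)))
    for key in ("N1", "N2", "N"):           # residue counts are integers
        record[key] = int(record[key])
    return record


record = parse_summary_lga("SUMMARY(LGA) 22 31 2.3 20 1.23 45.00 64.078 48.387")
print(record)
```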
-------------------------------------------------------------------------------
Example of the output from the LGA program ("-3" - LCS and GDT analysis).
LGA-parameters used: -3 -sda -o0 -d:4.0 -ch1:A -ch2:B
# FIXED Atom-Atom correspondence
# GDT and LCS analysis
LCS - RMSD CUTOFF 5.00 length segment l_RMS g_RMS
LONGEST_CONTINUOUS_SEGMENT: 46 26_B - 71_B 4.99 6.22
LONGEST_CONTINUOUS_SEGMENT: 46 27_B - 72_B 4.95 6.14
LCS_AVERAGE: 53.38
LCS - RMSD CUTOFF 2.00 length segment l_RMS g_RMS
LONGEST_CONTINUOUS_SEGMENT: 15 58_B - 72_B 1.56 25.45
LCS_AVERAGE: 13.60
LCS - RMSD CUTOFF 1.00 length segment l_RMS g_RMS
LONGEST_CONTINUOUS_SEGMENT: 14 59_B - 72_B 0.62 25.61
LCS_AVERAGE: 10.28
LCS_GDT MOLECULE-1 MOLECULE-2 LCS_DETAILS GDT_DETAILS TOTAL NUMBER OF RESIDUE PAIRS: 72
LCS_GDT RESIDUE RESIDUE SEGMENT_SIZE GLOBAL DISTANCE TEST COLUMNS: number of residues under the threshold assigned to each residue pair
LCS_GDT NAME NUMBER NAME NUMBER 1.0 2.0 5.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0
LCS_GDT M 1_A M 1_B 3 5 21 3 3 3 6 7 10 14 20 23 33 43 53 61 69 72 72 72 72 72 72
LCS_GDT N 2_A N 2_B 4 9 21 3 4 6 6 9 9 13 19 23 31 41 53 61 69 72 72 72 72 72 72
LCS_GDT I 3_A I 3_B 4 9 21 3 4 6 6 9 9 13 13 18 26 34 53 60 69 72 72 72 72 72 72
LCS_GDT F 4_A F 4_B 6 9 21 3 4 6 6 9 9 10 15 23 32 41 53 61 69 72 72 72 72 72 72
LCS_GDT E 5_A E 5_B 6 9 21 4 5 6 8 11 11 13 21 26 33 43 53 61 69 72 72 72 72 72 72
LCS_GDT M 6_A M 6_B 6 9 21 4 5 6 6 9 9 13 15 23 28 35 53 61 69 72 72 72 72 72 72
LCS_GDT L 7_A L 7_B 6 9 21 4 5 6 6 9 9 10 12 18 26 35 53 61 69 72 72 72 72 72 72
...........................................................................
LCS_GDT K 65_A K 65_B 14 15 46 9 13 14 14 14 15 17 20 26 33 43 53 61 69 72 72 72 72 72 72
LCS_GDT L 66_A L 66_B 14 15 46 6 13 14 14 14 14 14 17 25 33 43 53 61 69 72 72 72 72 72 72
LCS_GDT F 67_A F 67_B 14 15 46 9 13 14 14 14 14 18 22 26 33 43 53 61 69 72 72 72 72 72 72
LCS_GDT N 68_A N 68_B 14 15 46 9 13 14 14 14 14 18 22 26 33 43 53 61 69 72 72 72 72 72 72
LCS_GDT Q 69_A Q 69_B 14 15 46 6 13 14 14 14 15 17 18 25 33 43 53 61 69 72 72 72 72 72 72
LCS_GDT D 70_A D 70_B 14 15 46 9 13 14 14 14 14 14 15 16 27 41 53 61 69 72 72 72 72 72 72
LCS_GDT V 71_A V 71_B 14 15 46 6 13 14 14 14 14 18 22 26 33 43 53 61 69 72 72 72 72 72 72
LCS_GDT D 72_A D 72_B 14 15 46 5 10 14 14 14 15 17 21 26 33 43 53 61 69 72 72 72 72 72 72
LCS_AVERAGE LCS_A: 25.75 ( 10.28 13.60 53.38 )
GLOBAL_DISTANCE_TEST (summary information about detected largest sets of residues (represented by selected AToms) that can fit under specified thresholds)
GDT DIST_CUTOFF 0.50 1.00 1.50 2.00 2.50 3.00 3.50 4.00 4.50 5.00 5.50 6.00 6.50 7.00 7.50 8.00 8.50 9.00 9.50 10.00
GDT NUMBER_AT 9 13 14 14 14 15 18 22 26 33 43 53 61 69 72 72 72 72 72 72
GDT PERCENT_AT 12.50 18.06 19.44 19.44 19.44 20.83 25.00 30.56 36.11 45.83 59.72 73.61 84.72 95.83 100.00 100.00 100.00 100.00 100.00 100.00
GDT RMS_LOCAL 0.33 0.55 0.62 0.62 0.62 1.94 2.70 2.93 3.25 4.01 4.43 5.09 5.26 5.54 5.65 5.65 5.65 5.65 5.65 5.65
GDT RMS_ALL_AT 26.69 25.68 25.61 25.61 25.61 7.05 7.10 7.07 7.08 6.11 6.00 5.81 5.71 5.66 5.65 5.65 5.65 5.65 5.65 5.65
# Molecule1 Molecule2 DISTANCE
LGA M 1_A M 1_A 9.592
LGA N 2_A N 2_A 11.124
LGA I 3_A I 3_A 13.468
LGA F 4_A W 4_A 11.355
LGA E 5_A E 5_A 8.107
LGA M 6_A M 6_A 13.142
LGA L 7_A L 7_A 13.326
LGA R 8_A R 8_A 8.502
LGA I 9_A I 9_A 6.853
LGA D 10_A D 10_A 10.670
LGA E 11_A E 11_A 10.752
LGA G 12_A G 12_A 10.538
LGA L 13_A L 13_A 10.580
LGA R 14_A R 14_A 9.468
LGA L 15_A L 15_A 9.420
LGA K 16_A K 16_A 8.212
.........................................
LGA K 60_A K 60_A 6.946
LGA D 61_A D 61_A 7.011
LGA E 62_A E 62_A 3.782
LGA A 63_A G 63_A 3.027
LGA E 64_A E 64_A 4.870
LGA K 65_A K 65_A 5.735
LGA L 66_A L 66_A 5.332
LGA F 67_A F 67_A 2.681
LGA N 68_A N 68_A 4.077
LGA Q 69_A Q 69_A 8.089
LGA D 70_A D 70_A 7.413
LGA V 71_A A 71_A 2.131
LGA D 72_A D 72_A 7.762
#CA N1 N2 DIST N RMSD GDT_TS LGA_S3 GDT_HA SeqID
SUMMARY(GDT) 72 72 4.0 22 2.93 42.014 33.626 35.297 95.31
LGA_LOCAL RMSD: 2.929 Number of atoms: 22 under DIST: 4.00
LGA_ASGN_ATOMS RMSD: 8.532 Number of assigned atoms: 72
Std_ASGN_ATOMS RMSD: 5.648 Standard rmsd on all 72 assigned CA atoms
Unitary ROTATION matrix and the SHIFT vector superimpose molecules (1=>2)
X_new = 0.407935 * X + -0.032836 * Y + 0.912420 * Z + 11.435461
Y_new = 0.509052 * X + -0.821424 * Y + -0.257154 * Z + 61.613953
Z_new = 0.757928 * X + 0.569372 * Y + -0.318373 * Z + -36.757996
Euler angles from the ROTATION matrix. Conventions XYZ and ZXZ:
Phi Theta Psi [DEG: Phi Theta Psi ]
XYZ: 0.895225 -0.860131 2.080649 [DEG: 51.2926 -49.2818 119.2124 ]
ZXZ: 1.296085 1.894809 0.926514 [DEG: 74.2602 108.5646 53.0853 ]
--------------------------------------------------------------------------------
After setting an option: -lw:3
the LGA records will look like below:
# Molecule1 Molecule2 DISTANCE RMSD(lw:3)
LGA M 1_A M 1_A 9.592 -
LGA N 2_A N 2_A 11.124 -
LGA I 3_A I 3_A 13.468 -
LGA F 4_A W 4_A 11.355 2.541
LGA E 5_A E 5_A 8.107 1.718
LGA M 6_A M 6_A 13.142 1.511
LGA L 7_A L 7_A 13.326 1.622
LGA R 8_A R 8_A 8.502 2.042
LGA I 9_A I 9_A 6.853 2.876
LGA D 10_A D 10_A 10.670 3.337
LGA E 11_A E 11_A 10.752 3.222
where in the last column for each residue an RMSD value is
calculated on a window of 3+1+3=7 residues. This information can be
helpful to detect local similarity of structures when such
a similarity is difficult to capture from the global superposition.
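The window bookkeeping for "-lw:n" can be sketched in Python: each residue is the center of a window of 2*n+1 residues, and residues too close to a chain end have no full window (shown as "-" for residues 1 to 3 in the example above). The treatment of the C-terminal end is an assumption made by symmetry, and the function name is hypothetical; the local RMSD calculation itself is not reproduced here.

```python
def lesk_windows(n_residues, n):
    """For each 0-based residue index return (start, end) of its window, or None.

    The window covers indices start..end inclusive, i.e. 2*n+1 residues.
    Assumption: residues without a full window on either side get no value.
    """
    windows = []
    for i in range(n_residues):
        start, end = i - n, i + n
        if start >= 0 and end < n_residues:
            windows.append((start, end))
        else:
            windows.append(None)
    return windows


# Eleven residues with n=3: the first value appears at the 4th residue.
for index, window in enumerate(lesk_windows(11, 3), start=1):
    print(index, window if window is not None else "-")
```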
-------------------------------------------------------------------------------
There are several ways to select from both structures the set
of residues for calculations. Some of the options are described below with examples:
-sda - amino-acids identical by numbering and chain IDs are selected
-ch2:B - chain B from molecule2 is selected
-aa1:1:317 - residues 1 through 317 from molecule1
-gap1:152:156 - remove residues 152 - 156 from molecule1
-aa2:45:361 - residues 45 through 361 from molecule2
-er2:45_B:50_B - residues 45 through 50 from molecule2 chain B
Let us note that with "-sda" mode the two protein structures have to overlap
by the numbering of amino acids and also by the chain IDs (unless the chains
are specified using parameters: -ch1:A -ch2:B ,...).
The mode "-sia" has to be used for structure comparison of regions where proteins
differ in residue numbering.
Example1:
If the user needs to perform LCS and GDT analysis ("-3" option) of two structures
(mol1 and mol2) in selected regions, then "-sia" mode and the exact range of
residues (-er1:s1:s2 -er2:s1:s2) may be used:
-3 -sia -o1 -d:5.0 -er1:10:23 -er2:45_B:50_B,56_B:63_B
And the following residue correspondence is established:
mol1 mol2
10 45_B
11 46_B
12 47_B
13 48_B
14 49_B
15 50_B
16 56_B
17 57_B
18 58_B
19 59_B
20 60_B
21 61_B
22 62_B
23 63_B
Only residue-pairs above will be used for "-3 -sia" calculations.
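The pairing in Example1 follows from expanding both range lists and matching the residues in order. The Python sketch below assumes consecutive integer residue numbers with no insertion codes (so strings such as 13L_A are not handled), and it raises an error for unequal lengths purely as a simplification. The function names are hypothetical.

```python
def expand_ranges(spec):
    """'45_B:50_B,56_B:63_B' -> ['45_B', ..., '50_B', '56_B', ..., '63_B']."""
    residues = []
    for part in spec.split(","):
        if not part:
            continue                        # tolerate a trailing comma
        beg, _, end = part.partition(":")
        end = end or beg                    # single residue: no beg:end needed
        first, _, chain = beg.partition("_")
        last = end.partition("_")[0]
        suffix = f"_{chain}" if chain else ""
        residues.extend(f"{i}{suffix}" for i in range(int(first), int(last) + 1))
    return residues


def pair_up(er1, er2):
    """Match residues of molecule1 and molecule2 in the order they are listed."""
    r1, r2 = expand_ranges(er1), expand_ranges(er2)
    if len(r1) != len(r2):
        raise ValueError("range lists select different numbers of residues")
    return list(zip(r1, r2))


for mol1_res, mol2_res in pair_up("10:23", "45_B:50_B,56_B:63_B"):
    print(f"{mol1_res:>4} {mol2_res:>6}")
```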
Example2:
The following sets of parameters are equivalent:
-3 -sia -d:5.0 -lw:3 -aa1:1:317 -ch2:B -aa2:45:361 -gap1:152:156
and
-3 -sia -d:5.0 -lw:3 -er1:1:151,157:317 -er2:45_B:361_B
And in both cases the following residue-residue correspondence is established
for "-3 -sia" calculation:
mol1 mol2
1 45_B
2 46_B
--- - ---
151 195_B
157 201_B
--- - ---
316 360_B
317 361_B
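The equivalence in Example2 amounts to rewriting one "-aa" range, an optional "-gap" range, and an optional chain as an "-er" range list. A Python sketch of that rewriting, with a hypothetical function name and integer residue numbers assumed:

```python
def aa_gap_to_er(aa, gap=None, chain=None):
    """Rewrite (-aa, -gap, -ch) selections as an '-er' style range string."""
    first, last = aa
    suffix = f"_{chain}" if chain else ""
    pieces = [(first, last)]
    if gap is not None:
        g1, g2 = gap
        # Keep what lies before and after the gap, clipped to the -aa range.
        pieces = [(first, min(last, g1 - 1)), (max(first, g2 + 1), last)]
        pieces = [(a, b) for a, b in pieces if a <= b]
    return ",".join(f"{a}{suffix}:{b}{suffix}" for a, b in pieces)


print("-er1:" + aa_gap_to_er((1, 317), gap=(152, 156)))   # molecule1 selection
print("-er2:" + aa_gap_to_er((45, 361), chain="B"))       # molecule2 selection
```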
Example3:
Running lga program with an option: -aa
lga -aa mol1.mol2
the following list of amino-acids from both structures is generated:
............
AAMOL1 44 CA PRO A 44 11.895 -3.179 6.411 1.00 0.25 P
AAMOL1 45 CA LYS A 45 10.950 -3.861 9.969 1.00 0.47 K
AAMOL1 46 CA ILE A 46 10.943 -2.854 13.584 1.00 0.23 I
AAMOL1 47 CA VAL A 47 11.713 -5.569 16.139 1.00 0.90 V
AAMOL1 48 CA GLY A 48 11.015 -5.370 19.871 1.00 0.32 G
AAMOL1 49 CA GLY A 49 13.564 -6.389 22.407 1.00 0.35 G
AAMOL1 50 CA ILE A 50 14.197 -5.657 26.148 1.00 0.30 I
AAMOL1 51 CA GLY A 51 14.921 -1.941 26.352 1.00 0.28 G
AAMOL1 52 CA GLY A 52 13.330 -0.914 23.036 1.00 0.37 G
AAMOL1 53 CA PHE A 53 12.838 -1.655 19.390 1.00 0.62 F
AAMOL1 54 CA ILE A 54 15.143 -1.706 16.475 1.00 0.17 I
............
AAMOL2 25 CA ASP B 25 8.355 2.887 20.497 1.00 6.13 D
AAMOL2 26 CA THR B 26 6.153 1.507 23.318 1.00 6.74 T
AAMOL2 27 CA GLY B 27 4.727 -0.899 20.732 1.00 5.25 G
AAMOL2 28 CA ALA B 28 8.095 -2.602 20.027 1.00 4.63 A
AAMOL2 29 CA ASP B 29 9.157 -5.564 22.158 1.00 10.93 D
AAMOL2 30 CA ASP B 30 12.717 -5.124 20.840 1.00 10.93 D
AAMOL2 31 CA THR B 31 15.176 -2.485 19.633 1.00 5.17 T
AAMOL2 32 CA VAL B 32 15.713 -2.539 15.844 1.00 8.25 V
AAMOL2 33 CA LEU B 33 18.305 -0.371 14.098 1.00 8.85 L
AAMOL2 34 CA GLU B 34 18.800 0.083 10.364 1.00 19.16 E
AAMOL2 35 CA GLU B 35 21.637 -1.821 8.658 1.00 23.35 E
AAMOL2 36 CA MET B 36 25.047 -1.128 10.270 1.00 24.89 M
AAMOL2 37 CA ASN B 37 28.299 -3.021 10.681 1.00 39.03 N
AAMOL2 38 CA LEU B 38 28.793 -3.464 14.423 1.00 33.97 L
AAMOL2 39 CA PRO B 39 31.839 -5.455 15.462 1.00 32.47 P
............
User can attach to the file "mol1.mol2" a set of selected AAMOL* records and run lga
with an option "-al". In this case only residues listed in AAMOL* records will be
used for calculations.
Example4:
User can attach to the file "mol1.mol2" a set of selected "LGA" records (see below),
and run lga with an option "-al". In this case only residue pairs for which the
DISTANCE column is different than "-" will be used for calculations.
# Molecule1 Molecule2 DISTANCE
LGA - - A 30_B -
LGA - - A 31_B -
LGA - - I 32_B -
LGA - - A 33_B -
LGA - - K 34_B -
LGA - - E 35_B -
LGA L 39_A L 36_B 0.401
LGA K 40_A K 37_B 0.409
LGA - - L 38_B -
LGA D 42_A D 39_B 0.350
LGA Y 43_A Y 40_B 0.236
LGA E 44_A E 41_B 0.560
LGA L 45_A L 42_B 0.466
LGA K 46_A K 43_B -
LGA P 47_A P 44_B -
LGA M 48_A M 45_B 0.329
LGA D 49_A D 46_B 0.089
LGA F 50_A F 47_B 0.037
LGA S 51_A S 48_B 0.186
LGA G 52_A G 49_B 0.176
LGA I 53_A I 50_B #
LGA I 54_A I 51_B #
LGA P 55_A P 52_B 0.210
LGA A 56_A A 53_B 0.558
LGA L 57_A L 54_B 0.398
LGA Q 58_A - - -
LGA T 59_A - - -
LGA K 60_A K 57_B #
LGA N 61_A N 58_B #
LGA V 62_A V 59_B #
LGA D 63_A D 60_B #
LGA L 64_A L 61_B #
LGA A 65_A A 62_B #
LGA L 66_A L 63_B #
LGA A 67_A A 64_B #
LGA G 68_A G 65_B #
LGA I 69_A I 66_B #
LGA T 70_A T 67_B #
LGA - - I 68_B -
LGA - - T 69_B -
LGA - - D 70_B -
LGA - - E 71_B -
MOLECULE mol1
ATOM 269 N LEU A 39 16.096 -48.145 12.331 1.00 12.81 N
ATOM 270 CA LEU A 39 15.692 -49.459 12.808 1.00 13.11 C
ATOM 271 C LEU A 39 16.406 -50.631 12.156 1.00 16.36 C
----
END
MOLECULE mol2
ATOM 237 N ALA B 30 7.845 28.839 9.911 1.00 16.17 N
ATOM 238 CA ALA B 30 8.434 30.179 9.855 1.00 15.10 C
ATOM 239 C ALA B 30 9.116 30.407 8.502 1.00 17.22 C
ATOM 240 O ALA B 30 8.909 31.432 7.859 1.00 16.39 O
----
ATOM 552 OE1 GLU B 71 -7.284 5.475 5.563 1.00 46.00 O
ATOM 553 OE2 GLU B 71 -6.414 4.507 7.314 1.00 42.95 O
END
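The selection rule from Example4 (a pair is used only when its DISTANCE column differs from "-") can be illustrated as follows. This is a sketch, not the parser used by LGA: the function name selected_pairs is hypothetical, and the six whitespace-separated fields per record are assumed from the listing above.

```python
def selected_pairs(lines):
    """Return (residue1, residue2) for LGA records whose DISTANCE is not '-'."""
    pairs = []
    for line in lines:
        fields = line.split()
        # Assumed record layout: LGA aa1 res1 aa2 res2 distance
        if len(fields) != 6 or fields[0] != "LGA":
            continue
        _, _aa1, res1, _aa2, res2, distance = fields
        if distance != "-":
            pairs.append((res1, res2))
    return pairs


# A few records copied from the example above.
records = """\
LGA - - E 35_B -
LGA L 39_A L 36_B 0.401
LGA K 46_A K 43_B -
LGA I 53_A I 50_B #
LGA Q 58_A - - -
""".splitlines()

print(selected_pairs(records))
```

Records marked with "#" are kept by this rule, since only "-" excludes a pair.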
-------------------------------------------------------------------------------
Remember:
The options -1, -2, -3 work on already established residue-residue
correspondence. The residue-residue correspondence will not be changed
during calculations.
If user needs to find structure alignment (automatically establish the
residue-residue correspondence), then the option "-4" has to be used.
LGA has been designed to search for the best structure superposition of two
protein structures or fragments of protein structures.
Structure comparative analysis can be made in two general modes:
- Fixed residue-residue correspondence (options: -1, -2, -3).
This mode can be used when user knows how to establish residue-residue
correspondence for LGA processing (the residue-residue correspondence will
not be changed during the calculations). For example by using the option
"-3 -sda" the program will select for calculations the residues that are
identical ("-sda") by the numbering of amino acid and chain id, and then
identify the fragments where two structures are similar or structurally
different ("-3": LCS and GDT analysis).
- Search for residue-residue correspondence (option: -4).
This mode can be used for structural comparison of any two proteins.
For example using the option "-4 -sia" the best superposition (according
to the LGA technique) is calculated completely ignoring sequence
relationship ("-sia") between the two proteins, and the suitable amino
acid correspondence (structural alignment) is reported ("-4").
Most of the structure comparison programs are built on the principle that a
suitable scoring function can be defined with its optimum corresponding to the
most significant structural match. Many established comparison techniques
define structural similarity by two numbers, the root mean square deviation
(RMSD) between two superimposed structures together with the number of
"equivalent" (structurally aligned) residues. However, it is impossible
to optimize these two quantities simultaneously, since one can be optimized
at the expense of the other. The structural aligner DALI by L. Holm [1] solves
the optimization problem by combining several numbers into a single quantity,
called the z-score. The ProSup aligner by M. Sippl [2] maximizes the number of
equivalent residues while the RMSD is kept close to a constant value.
Two new measures, LCS and GDT, serve as the basis of the scoring function for the LGA
(Local Global Alignment) program [3]. These two measures, established by A. Zemla for detection
of local and global structure similarities between two proteins, were tested and successfully
verified during the CASP process [4]-[7], providing very good ranking of evaluated protein models.
When comparing two protein structures, the LCS procedure is able to localize
(along the sequence) the Longest Continuous Segments of residues that can fit
under a selected RMSD cutoff. The Global Distance Test (GDT) algorithm is designed to
complement evaluations made with LCS by searching for the largest (not necessarily
continuous) set of "equivalent" residues deviating by no more than a specified
DISTANCE cutoff. In contrast to LCS, which provides numerically exact results,
the generation of maximal sets of residues that are not necessarily continuous along
the main chain is only approximate. The algorithm, however, uses many different
DISTANCE cutoffs to find the best global structural match.
LCS, GDT, and LGA_S description (see [3], [8])
Longest Continuous Segments under specified CA RMSD cutoff (LCS).
The algorithm identifies the longest continuous segments of residues
in the target deviating from the model by not more than specified
CA RMSD cutoff. Each residue in the target is assigned to the longest
of such segments, provided it is a part of that segment (see LCS_GDT records).
For different values of the CA RMSD cutoff (1.0 A, 2.0 A, and 5.0 A) the
longest continuous segments in the target are reported.
Global Distance Test (GDT). The algorithm identifies in the target
the sets of residues deviating from the model by no more than
specified CA DISTANCE cutoff using many different superpositions.
Each residue from the target is assigned to the largest set of residues
(not necessarily continuous) deviating from the model by no more than a
specified distance cutoff (see LCS_GDT records: GDT_DATA_COLUMNS).
For different values of the DISTANCE cutoff (0.5 A, 1.0 A, 1.5 A, ... 10.0 A)
several measures are reported:
NUMBER_CA - the number of CA's from the "largest set" that can fit
under specified distance cutoff
PERCENT_CA - percent of CA's from the "largest set" comparing to the
total number of CA's in target (see GDT_Pn below)
RMS_LOCAL - RMSD (root mean square deviation) calculated on the
"largest set" of CA's
RMS_ALL_CA - RMSD calculated on all CA after superposition of the
prediction structure to the target structure based on
the "largest set" of CA's
GDT_TS = (GDT_P1 + GDT_P2 + GDT_P4 + GDT_P8)/4.0
where GDT_Pn is an estimation of the percent of residues that can
fit under distance cutoff <= n.0 Angstroms
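The GDT_TS formula can be written directly in code. In this sketch the function name gdt_ts and the dictionary keyed by distance cutoff are illustrative choices, and the percentages in the example are made-up numbers, not LGA output.

```python
def gdt_ts(percent_by_cutoff):
    """Average of GDT_P1, GDT_P2, GDT_P4 and GDT_P8 (percent values)."""
    return sum(percent_by_cutoff[c] for c in (1.0, 2.0, 4.0, 8.0)) / 4.0


# Hypothetical PERCENT_CA values for the 1, 2, 4 and 8 Angstrom cutoffs.
example = {1.0: 40.0, 2.0: 60.0, 4.0: 80.0, 8.0: 100.0}
print(gdt_ts(example))
```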
The GDT procedure is the following. Each three-residue segment and each
continuous segment found by LCS is used as a starting point to give
initial equivalences (model-target CA pairs) for a superposition.
The list of equivalences is iteratively extended to produce the largest
set of residues that can fit under considered distance cutoff.
For collecting data about largest sets of residues the Iterative
Superposition Procedure (ISP) is implemented.
The goal of the ISP method is to exclude from the calculations atoms
that are more than some threshold (cutoff) distance between the
model and the target structure after the transform is applied.
Starting from the initial set of atoms (C-alphas) the algorithm is the
following:
a) calculate the transform
b) identify in superimposed structures all atom pairs for which the
distance is not larger than the threshold
c) calculate a new transform on the set of identified atom pairs
d) exclude from that set the atoms for which the distance (after
applying a new transform) is larger than the threshold
e) repeat a) - d) until the set of atoms used in calculations
is the same for two cycles running
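The steps a) - e) above can be sketched as a loop. This is a minimal illustration under stated assumptions, not the LGA implementation: the names isp, fit_translation and pairs_within are hypothetical, and, to stay within the standard library, the "transform" is replaced by a pure translation that matches centroids, whereas a real superposition would also include a rotation. Any other fitting function with the same signature can be passed through the "fit" argument.

```python
import math


def fit_translation(model, target, indices):
    """Stand-in for steps a) and c): a translation that matches centroids."""
    n = len(indices)
    shift = tuple(
        sum(target[i][axis] - model[i][axis] for i in indices) / n
        for axis in range(3)
    )
    return lambda point: tuple(c + s for c, s in zip(point, shift))


def pairs_within(model, target, transform, threshold, candidates):
    """Indices of atom pairs not farther apart than the threshold."""
    return [i for i in candidates
            if math.dist(transform(model[i]), target[i]) <= threshold]


def isp(model, target, start, threshold, fit=fit_translation, max_cycles=100):
    current = sorted(start)
    for _ in range(max_cycles):
        if not current:
            break
        transform = fit(model, target, current)                     # a)
        found = pairs_within(model, target, transform, threshold,
                             range(len(model)))                     # b)
        if not found:
            return []
        transform = fit(model, target, found)                       # c)
        kept = pairs_within(model, target, transform, threshold, found)  # d)
        if kept == current:                                         # e)
            break
        current = kept
    return current


# Toy data: four atoms shifted by (5, 5, 5) and one outlier.
target = [(float(i), 0.0, 0.0) for i in range(5)]
model = [(x + 5.0, y + 5.0, z + 5.0) for x, y, z in target]
model[4] = (9.0, 15.0, 5.0)
result = isp(model, target, start=[0, 1, 2], threshold=3.0)
print(result)
```

Starting from a three-atom seed, the set grows to include the fourth atom and leaves out the outlier, then stays unchanged for two consecutive cycles.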
Results of the analysis given by LCS algorithm show rather local features
of the model compared to the target, while the residues considered in GDT
come from the whole model structure (they do not have to maintain the continuity
along the sequence). From this point of view GDT can detect the kind of GLOBAL
level of structure similarity.
By combining these two techniques (RMSD based and distance based), LGA not only
calculates a "best" superposition between two proteins (meaning "under certain
RMSD and distance cutoffs"), but also identifies the regions of local similarity
between compared structures. In the structure alignment search procedure, for each
generated list of equivalent residues, the following values are calculated:
LCS_vi - percent of residues in target (continuous set) that can fit under an RMSD
cutoff of vi Angstroms (for vi = 1.0, 2.0, ...), and
GDT_vi - an estimation of the percent of residues in target (largest set) that
can fit under the distance cutoff of vi Angstroms (for vi = 0.5, 1.0, ...).
A scoring function (LGA_S - structure similarity score) is defined as a combination
of these values. For a given parameter w (0.0<=w<=1.0), representing a weighting
factor, LGA_S value is calculated by the formula (see [3], [8] for details):
LGA_S = w*S(GDT) + (1-w)*S(LCS)
where S(F) function is defined as follows:
S(F) = 2 * (k*F_v1 + (k-1)*F_v2 +...+ 1*F_vk) / ((k+1)*k)
This formula is used to calculate LGA_S values in both cases: the sequence
dependent ("-3") and in the sequence independent ("-4") modes.
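As a worked illustration of the two formulas above, the following sketch computes S(F) and LGA_S from lists of percentages ordered from the tightest cutoff (v1) to the loosest (vk). The function names, the input values and the weight w=0.75 are assumptions made for the example only; the cutoff lists and the value of w actually used by LGA are described in [3], [8].

```python
def s_score(values):
    """S(F) = 2 * (k*F_v1 + (k-1)*F_v2 + ... + 1*F_vk) / ((k+1)*k)."""
    k = len(values)
    weighted = sum((k - i) * value for i, value in enumerate(values))
    return 2.0 * weighted / ((k + 1) * k)


def lga_s(gdt_values, lcs_values, w):
    """LGA_S = w*S(GDT) + (1-w)*S(LCS) for a weight 0.0 <= w <= 1.0."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie between 0.0 and 1.0")
    return w * s_score(gdt_values) + (1.0 - w) * s_score(lcs_values)


# Hypothetical percentages, tightest cutoff first.
gdt = [50.0, 100.0]
lcs = [20.0, 40.0, 60.0]
print(round(lga_s(gdt, lcs, w=0.75), 3))
```

Because the weights k, k-1, ..., 1 sum to (k+1)*k/2, S(F) is a weighted average of the input percentages in which tighter cutoffs count more.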
NOTE: LGA_S values may slightly differ between "-3" and "-4" calculations even if
performed on the same set of residues. This is because "-3" and "-4" modes use
different procedures to search for the "best" sets of residue pairs to calculate
"optimal" superpositions (to detect maximum number of residues that can fit under
rmsd and distance cutoffs).
In order to distinguish these two cases ("-3" and "-4") the calculated value LGA_S
is named LGA_S3 when the option "-3" is used.
For the purpose of structure similarity search or ordering of models (or PDB templates),
the target (frame of the reference, second molecule) should be fixed and then user may
sort models (see SUMMARY results) by the number of superimposed residues N (under one
selected DIST cutoff), or by the values of GDT_TS (average from four distance cutoffs),
or LGA_S (weighted results from the full set of distance cutoffs). Note that
LGA_S can be used to evaluate the level of structure similarity between proteins in
sequence dependent ("-3") mode as well as in structure alignment search ("-4") mode.
The experiments show that LGA_S3 (which combines both: LCS and GDT measures) is slightly
more sensitive and accurate in scoring structural similarity than GDT_TS alone.
A set of additional GDT-like measures, GDC (Global Distance Calculation), has been developed
to allow detailed structure comparison and evaluation of structure similarity of proteins
using a list of selected atom positions, not only Calpha positions. For example, to apply
superposition-based scoring to the functional ends of protein sidechains, a GDC score
for sidechains ("-gdc_sc") uses a characteristic atom near the end of each sidechain type for
the evaluation of residue-residue distance deviations. The selection of atoms for GDC
calculations can be done with the "-gdc_at" flag in the LGA command line (see [9] for details).
REFERENCES
[1] L. Holm, C. Sander: "Protein structure comparison by alignment of distance
matrices", J Mol Biol, 1993, 233, pp. 123-138.
[2] Z. K. Feng, M. J. Sippl: "Optimum superimposition of protein structures:
ambiguities and implications", Fold Des, 1996, 1, pp. 123-132.
[3] A. Zemla: "LGA - A Method for Finding 3-D Similarities in Protein Structures",
Nucleic Acids Research, 2003, Vol. 31, No. 13, pp. 3370-3374.
[4] A. Zemla, C. Venclovas, A. Reinhardt, K. Fidelis, T. J. Hubbard: "Numerical
criteria for the evaluation of ab initio predictions of protein structure",
PROTEINS: Structure, Function, and Genetics, 1997, Suppl.1, pp. 140-150.
[5] A. Zemla, C. Venclovas, J. Moult, K. Fidelis: "Processing and analysis of CASP3
protein structure predictions", Proteins: Structure, Function, and Genetics,
Volume 37, Issue S3, 1999, pp. 22-29.
[6] A. Zemla, C. Venclovas, J. Moult, K. Fidelis: "Processing and evaluation of
predictions in CASP4", Proteins: Structure, Function, and Genetics, Volume 45,
Issue S5, 2001, pp. 13-21.
[7] S. Cristobal, A. Zemla, D. Fischer, L. Rychlewski, A. Elofsson: "A study
of quality measures for protein threading models", BMC Bioinformatics,
2001 2: 5.
[8] A. Zemla, B. Geisbrecht, J. Smith, M. Lam, B. Kirkpatrick, M. Wagner, T. Slezak,
C. E. Zhou: "STRALCP structure alignment-based clustering of proteins", Nucleic
Acids Research, 2007, Vol. 35, No. 22, p. e150; doi: 10.1093/nar/gkm1049.
[9] D. A. Keedy, C. J. Williams, J. J. Headd, W. B. Arendall III, V. B. Chen,
G. J. Kapral, R. A. Gillespie, J. N. Block, A. Zemla, D. C. Richardson,
J. S. Richardson: "The other 90% of the protein: Assessment beyond the Calphas
for CASP8 template-based and high-accuracy models", Proteins: Structure, Function,
Bioinformatics, 2009; doi: 10.1002/prot.22551.
-------------------------------------------------------------------------------
Changes, improvements, development:
-------------------------------------------------------------------------------
### Date: 15 Oct 1999
First version of the LGA program was tested.
### Date: 21 Mar 2000
An extensive analysis of the structure comparison results from PROSUP and LGA programs
used to evaluate CASP3 models was performed. Evaluation results were compared with Alexey
Murzin's "Fold recognition" CASP3 assessment.
### Date: 10 May 2000
The performance of LGA program and other structure comparison programs was
analysed. Collaborative work with: S. Cristobal, D. Fischer, L. Rychlewski,
and A. Elofsson.
### Date: 29 Aug 2000
The results of the comparison of different measures used for the analysis of the
quality of protein structure predictions were prepared for the manuscript [7]:
S. Cristobal, A. Zemla, D. Fischer, L. Rychlewski, A. Elofsson: "A study
of quality measures for protein threading models", BMC Bioinformatics,
2001 2: 5.
### Date: 20 Mar 2001
Thanks to the suggestion from Daniel Barsky (barsky@llnl.gov) an option to
perform calculation on selected CA atoms was included (AAMOL1 and AAMOL2 records).
### Date: 06 Sep 2001
The "Lesk window" option was added to the program. The RMSD value is calculated
on a residue window of length 2*n+1 (-lw:n).
### Date: 15 Jul 2002
Thanks to the suggestion from Dat H. Nguyen (nguyend@gps01.llnl.gov) an option to
perform calculations on chosen atoms (NOT only CA) was included.
-atom:CB CB atoms will be used for calculations. NOTE (special character
in the PARAMATER-OPTIONS line): use , instead of '
(for example: H5,1 to select H5'1 atom)
-ah:i ATOM or HETATM records are used for calculations:
i=0 both (default)
i=1 ATOM
i=2 HETATM
### Date: 05 Jan 2003
Thanks to the discussions with Michael Levitt (michael.levitt@stanford.edu) the
accuracy of LGA (GDT_TS) calculations was improved, and the problem with erroneous
calculations on "singular structures" (compressed coordinates, very small distances
between atoms) was reduced.
### Date: 02 Mar 2003
Thanks to the discussions with Nick Grishin (grishin@chop.swmed.edu)
LGA_S scoring function was improved.
### Date: 11 Oct 2003
Thanks to the suggestion from Bernhard Rupp (br@llnl.gov) the calculation of Euler
angles has been included:
The convention used (XYZ):
phi is about x-axis
theta is about y-axis
psi is about z-axis
and the formulas translating these angles into the rotation matrix are the following:
c1 = cos(phi); s1 = sin(phi);
c2 = cos(theta); s2 = sin(theta);
c3 = cos(psi); s3 = sin(psi);
r[1][1] = c1 * c2;
r[2][1] = c1 * s2 * s3 - s1 * c3;
r[3][1] = c1 * s2 * c3 + s1 * s3;
r[1][2] = s1 * c2;
r[2][2] = s1 * s2 * s3 + c1 * c3;
r[3][2] = s1 * s2 * c3 - c1 * s3;
r[1][3] = -s2;
r[2][3] = c2 * s3;
r[3][3] = c2 * c3;
LGA reports ROTATION matrix, VECTOR and Euler angles in the following format:
Unitary ROTATION matrix and the SHIFT vector superimpose molecules (1=>2)
X_new = 0.407935 * X + -0.032836 * Y + 0.912420 * Z + 11.435461
Y_new = 0.509052 * X + -0.821424 * Y + -0.257154 * Z + 61.613953
Z_new = 0.757928 * X + 0.569372 * Y + -0.318373 * Z + -36.757996
Euler angles from the ROTATION matrix. Conventions XYZ and ZXZ:
Phi Theta Psi [DEG: Phi Theta Psi ]
XYZ: 0.895225 -0.860131 2.080649 [DEG: 51.2926 -49.2818 119.2124 ]
ZXZ: 1.296085 1.894809 0.926514 [DEG: 74.2602 108.5646 53.0853 ]
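The XYZ formulas can be checked against the numbers reported above. The sketch below rebuilds the matrix r from the XYZ angles (in radians) and compares it with the coefficients of the X_new, Y_new, Z_new equations. One point is an observation from these numbers rather than a documented statement: the values agree when r[i][j] is read as the coefficient in output row j and input column i (that is, r is compared with the transpose of the printed coefficient table). The function name rotation_xyz is hypothetical.

```python
import math


def rotation_xyz(phi, theta, psi):
    """Build r[1..3][1..3] from the XYZ formulas (index 0 is unused)."""
    c1, s1 = math.cos(phi), math.sin(phi)
    c2, s2 = math.cos(theta), math.sin(theta)
    c3, s3 = math.cos(psi), math.sin(psi)
    r = [[0.0] * 4 for _ in range(4)]
    r[1][1] = c1 * c2
    r[2][1] = c1 * s2 * s3 - s1 * c3
    r[3][1] = c1 * s2 * c3 + s1 * s3
    r[1][2] = s1 * c2
    r[2][2] = s1 * s2 * s3 + c1 * c3
    r[3][2] = s1 * s2 * c3 - c1 * s3
    r[1][3] = -s2
    r[2][3] = c2 * s3
    r[3][3] = c2 * c3
    return r


# Coefficients of X, Y, Z in the X_new, Y_new, Z_new rows reported above.
reported = [
    [0.407935, -0.032836, 0.912420],
    [0.509052, -0.821424, -0.257154],
    [0.757928, 0.569372, -0.318373],
]

r = rotation_xyz(0.895225, -0.860131, 2.080649)
max_diff = max(
    abs(r[col + 1][row + 1] - reported[row][col])
    for row in range(3)
    for col in range(3)
)
print(max_diff < 1e-4)
```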
### Date: 21 Dec 2003
Alignment verification module has been improved.
### Date: 11 Jan 2004
New options: -er1:s1:s2 and -er2:s1:s2 have been included. They allow the user to select
the exact ranges of residues from molecule1 and molecule2.
Example: -er1:10_A:16_A -er1:B:B -er2:8_A:20_A -er2:7S_B:7_C
where: -er1:10_A:16_A selects in molecule1 the residues 10-16 (chain A)
-er1:B:B selects in molecule1 all residues from chain B
-er2:8_A:20_A selects in molecule2 the residues 8-20 (chain A)
-er2:7S_B:7_C selects in molecule2 the residues 7S_B (residue 7 insertion S
from chain B) up to 7_C (residue 7 from chain C)
### Date: 05 Aug 2004
To run lga calculation on the selected set of residues defined by the
attached AAMOL* or LGA records, user has to use the parameter: -al
otherwise the attached records are ignored.
### Date: 07 Jan 2006
The residue selection module has been improved.
### Date: 23 Jun 2006
The reported total number of atoms in compared structures has been corrected.
It was calculated based on the number of selected residues, not based on the
actual number of residues in compared structures.
Thanks to Andriy Kryshtafovych (akryshtafovych@ucdavis.edu) for reporting the issue.
### Date: 25 Sept 2006
The residue selection options "-er1:s1:s2" and "-er2:s1:s2" were corrected.
Thanks to Yun He (jarod@spg.biosci.tsinghua.edu.cn) for pointing out the error.
The residue selection options -er1:s1:s2 (s1, s2 - strings) have been upgraded.
Now, if several ranges are needed for "-er1" or "-er2", the si pairs (ranges) can be
separated by ',': -er1:s1:s2,s3:s4,s5:s6,s7:s8,s9:s10
### Date: 15 Oct 2006
The following option has been introduced: -cb:f
The coordinates of the point representing amino-acid position for LGA processing
can be defined by the point f on the CA-CB vector: -5.0 <= f <= 5.0
For example: -cb:0 is equivalent to CA position, and -cb:1 is equivalent to CB position
NOTE: for each amino-acid a complete set of main chain atoms (N,CA,C,O) is required
in the input structures.
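The meaning of the parameter f can be illustrated with a few lines of code. This sketch assumes that both the CA and the CB coordinates are already known and simply moves along the CA-CB vector; how LGA itself obtains the CB direction is not shown here (the NOTE above suggests it is derived from the main chain atoms). The function name point_on_ca_cb and the coordinates are hypothetical.

```python
def point_on_ca_cb(ca, cb, f):
    """Return the point CA + f * (CB - CA) for -5.0 <= f <= 5.0."""
    if not -5.0 <= f <= 5.0:
        raise ValueError("f must lie between -5.0 and 5.0")
    return tuple(a + f * (b - a) for a, b in zip(ca, cb))


# Hypothetical coordinates of one residue.
ca = (0.0, 0.0, 0.0)
cb = (1.0, 2.0, -2.0)
print(point_on_ca_cb(ca, cb, 0.0))   # the CA position
print(point_on_ca_cb(ca, cb, 1.0))   # the CB position
print(point_on_ca_cb(ca, cb, 0.5))   # halfway between CA and CB
```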
### Date: 28 Dec 2007
The following options have been introduced: -rmsd , -swap
They allow calculating RMSD values on aligned CA, MC (main chain), and ALL atoms.
If the option "-swap" is chosen, then "swapping" is considered when calculating RMSD
on ALL atoms. It means that in amino acids where atom names can be switched, i.e.
for ASP: OD1 <-> OD2
for GLU: OE1 <-> OE2
for PHE: CD1 <-> CD2
CE1 <-> CE2
for TYR: CD1 <-> CD2
CE1 <-> CE2
cartesian rmsd is calculated with an option to minimize its value. Sets (CD1, CE1) and
(CD2, CE2) in PHE and TYR, as well as atoms OD1 and OD2 in ASP, OE1 and OE2 in GLU are
exchanged and more favorable contributions to rmsd are taken into account.
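The swapping rule can be sketched for a single pair of aligned residues. This is an illustration under assumptions, not the LGA code: coordinates are taken as already superimposed, atoms are stored in dictionaries keyed by atom name, and all function names are hypothetical. For PHE and TYR the pairs CD1/CD2 and CE1/CE2 are exchanged together, as described above.

```python
import math

SWAP_PAIRS = {
    "ASP": [("OD1", "OD2")],
    "GLU": [("OE1", "OE2")],
    "PHE": [("CD1", "CD2"), ("CE1", "CE2")],
    "TYR": [("CD1", "CD2"), ("CE1", "CE2")],
}


def squared_deviation(atoms1, atoms2):
    """Sum of squared distances over atoms present in both residues."""
    common = atoms1.keys() & atoms2.keys()
    total = sum(math.dist(atoms1[name], atoms2[name]) ** 2 for name in common)
    return total, len(common)


def swap_names(resname, atoms):
    """Copy of the residue with the swappable atom names exchanged."""
    renamed = dict(atoms)
    for a, b in SWAP_PAIRS.get(resname, []):
        if a in atoms and b in atoms:
            renamed[a], renamed[b] = atoms[b], atoms[a]
    return renamed


def best_squared_deviation(resname, atoms1, atoms2):
    """Keep the more favorable of the direct and the swapped contribution."""
    direct, n = squared_deviation(atoms1, atoms2)
    swapped, _ = squared_deviation(swap_names(resname, atoms1), atoms2)
    return min(direct, swapped), n


# Hypothetical ASP residue whose OD1/OD2 labels are exchanged in molecule1.
asp1 = {"CG": (0.0, 0.0, 0.0), "OD1": (1.0, 0.0, 0.0), "OD2": (-1.0, 0.0, 0.0)}
asp2 = {"CG": (0.0, 0.0, 0.0), "OD1": (-1.0, 0.0, 0.0), "OD2": (1.0, 0.0, 0.0)}
total, n = best_squared_deviation("ASP", asp1, asp2)
print(math.sqrt(total / n))
```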
For example, if "-rmsd" option is included (./lga 2gff_A.1lq9_A -4 -rmsd) then program
will produce results in the following format:
# Molecule1 Molecule2 DISTANCE Mis MC All Dist_max GDC_mc GDC_all
..........................
LGA I 52_A N 62_A 0.500 3 0.031 0.038 0.639 92.857 58.929
LGA Y 53_A Y 63_A 0.745 0 0.017 1.384 3.159 88.214 80.040
LGA E 54_A A 64_A 0.907 0 0.095 0.095 1.019 88.214 88.667
LGA A 55_A Q 65_A 1.665 4 0.089 0.104 2.060 79.286 42.434
LGA Y 56_A W 66_A 1.275 9 0.076 0.099 1.556 79.286 28.469
LGA T 57_A E 67_A 1.446 4 0.026 0.030 1.614 81.429 44.286
LGA D 58_A S 68_A 1.400 1 0.070 0.118 1.400 81.429 67.857
LGA E 59_A E 69_A 1.595 0 0.082 1.042 2.146 75.000 77.884
LGA A 60_A Q 70_A 1.584 4 0.033 0.032 1.774 77.143 42.381
..........................
# RMSD_GDC results: CA MC common percent ALL common percent GDC_mc GDC_all
NUMBER_OF_ATOMS_AA: 91 364 364 100.00 700 490 70.00 112
SUMMARY(RMSD_GDC): 2.343 2.349 2.539 56.941 41.648
#CA N1 N2 DIST N RMSD Seq_Id LGA_S LGA_Q
SUMMARY(LGA) 97 112 5.0 91 2.34 18.68 62.085 3.724
where "Mis" column gives the number of missing atoms in a given amino acid (missing atom
pairs; relative to the amino acid defined in Molecule2), "MC" - rmsd calculated on main
chain atoms, and "All" - rmsd on all corresponding (common) atoms from aligned amino acids.
If both options are included "-rmsd -swap" (or just "-swap") then the following results
are reported:
# Checking swapping
# possible swapping detected: Y 53_A Y 63_A
# possible swapping detected: E 59_A E 69_A
# possible swapping detected: E 76_A E 87_A
# Molecule1 Molecule2 DISTANCE Mis MC All Dist_max GDC_mc GDC_all
..........................
LGA I 52_A N 62_A 0.500 3 0.031 0.038 0.639 92.857 58.929
LGA Y 53_A Y 63_A 0.745 0 0.017 0.058 1.037 88.214 88.214
LGA E 54_A A 64_A 0.907 0 0.095 0.095 1.019 88.214 88.667
LGA A 55_A Q 65_A 1.665 4 0.089 0.104 2.060 79.286 42.434
LGA Y 56_A W 66_A 1.275 9 0.076 0.099 1.556 79.286 28.469
LGA T 57_A E 67_A 1.446 4 0.026 0.030 1.614 81.429 44.286
LGA D 58_A S 68_A 1.400 1 0.070 0.118 1.400 81.429 67.857
LGA E 59_A E 69_A 1.595 0 0.082 0.640 1.898 75.000 80.741
LGA A 60_A Q 70_A 1.584 4 0.033 0.032 1.774 77.143 42.381
..........................
# RMSD_GDC results: CA MC common percent ALL common percent GDC_mc GDC_all
NUMBER_OF_ATOMS_AA: 91 364 364 100.00 700 490 70.00 112
SUMMARY(RMSD_GDC): 2.343 2.349 2.524 56.941 41.751
#CA N1 N2 DIST N RMSD Seq_Id LGA_S LGA_Q
SUMMARY(LGA) 97 112 5.0 91 2.34 18.68 62.085 3.724
These options can be combined with "-lw:n" to specify the length of sliding window for
calculating local RMSDs.
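For ordering results by LGA_S, the SUMMARY(LGA) line shown above can be read with a small helper. This is a sketch only: the function names are hypothetical, the eight numeric fields are assumed to follow the column header (N1, N2, DIST, N, RMSD, Seq_Id, LGA_S, LGA_Q), and the two sample lines are copied from outputs in this document purely as input data (they come from different comparisons, so their order carries no meaning).

```python
COLUMNS = ("N1", "N2", "DIST", "N", "RMSD", "Seq_Id", "LGA_S", "LGA_Q")


def parse_summary(line):
    """Return the SUMMARY(LGA) values as a dict, or None for other lines."""
    fields = line.split()
    if not fields or fields[0] != "SUMMARY(LGA)":
        return None
    values = [float(item) for item in fields[1:]]
    if len(values) != len(COLUMNS):
        raise ValueError("unexpected number of SUMMARY(LGA) fields")
    return dict(zip(COLUMNS, values))


def rank_by(results, key="LGA_S"):
    """Sort (name, summary) pairs from the highest to the lowest score."""
    return sorted(results, key=lambda item: item[1][key], reverse=True)


lines = {
    "2gff_A.1lq9_A": "SUMMARY(LGA) 97 112 5.0 91 2.34 18.68 62.085 3.724",
    "1hiv_A.1sip_A": "SUMMARY(LGA) 99 61 5.0 61 1.28 45.90 95.952 4.417",
}
results = [(name, parse_summary(text)) for name, text in lines.items()]
for name, summary in rank_by(results):
    print(name, summary["LGA_S"])
```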
### Date: 02 Jan 2008
The output from the calculations of Euler angles from the ROTATION matrix has been
modified. The calculations for two most popular conventions XYZ and ZXZ (ZXZ is used
in CHIMERA) are now reported:
Unitary ROTATION matrix and the SHIFT vector superimpose molecules (1=>2)
X_new = -0.347115 * X + -0.009255 * Y + 0.937777 * Z + -11.467628
Y_new = -0.754312 * X + -0.591409 * Y + -0.285043 * Z + 10.637938
Z_new = 0.557247 * X + -0.806319 * Y + 0.198306 * Z + -8.800918
Euler angles from the ROTATION matrix. Conventions XYZ and ZXZ:
Phi Theta Psi [DEG: Phi Theta Psi ]
XYZ: -2.002079 -0.591067 -1.329643 [DEG: -114.7107 -33.8656 -76.1829 ]
ZXZ: 1.275714 1.371167 2.536865 [DEG: 73.0930 78.5621 145.3516 ]
The formulas translating the angles into the rotation matrix for the ZXZ convention are the following:
c1 = cos(phi); s1 = sin(phi);
c2 = cos(theta); s2 = sin(theta);
c3 = cos(psi); s3 = sin(psi);
r[1][1] = c1 * c3 - s1 * c2 * s3;
r[1][2] = s1 * c3 + c1 * c2 * s3;
r[1][3] = s2 * s3;
r[2][1] = -c1 * s3 - s1 * c2 * c3;
r[2][2] = -s1 * s3 + c1 * c2 * c3;
r[2][3] = s2 * c3;
r[3][1] = s1 * s2;
r[3][2] = -c1 * s2;
r[3][3] = c2;
Thanks to Bernhard Rupp (bernhardrupp@sbcglobal.net) for suggesting this modification.
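The ZXZ formulas can be checked in the same way against the example printed in this entry. As with the XYZ case, the agreement is obtained when r[i][j] is compared with the coefficient in output row j and input column i; this indexing is an observation from the printed numbers. The function name rotation_zxz is hypothetical.

```python
import math


def rotation_zxz(phi, theta, psi):
    """Build r[1..3][1..3] from the ZXZ formulas (index 0 is unused)."""
    c1, s1 = math.cos(phi), math.sin(phi)
    c2, s2 = math.cos(theta), math.sin(theta)
    c3, s3 = math.cos(psi), math.sin(psi)
    r = [[0.0] * 4 for _ in range(4)]
    r[1][1] = c1 * c3 - s1 * c2 * s3
    r[1][2] = s1 * c3 + c1 * c2 * s3
    r[1][3] = s2 * s3
    r[2][1] = -c1 * s3 - s1 * c2 * c3
    r[2][2] = -s1 * s3 + c1 * c2 * c3
    r[2][3] = s2 * c3
    r[3][1] = s1 * s2
    r[3][2] = -c1 * s2
    r[3][3] = c2
    return r


# Coefficients of X, Y, Z in the X_new, Y_new, Z_new rows of this entry.
reported_zxz = [
    [-0.347115, -0.009255, 0.937777],
    [-0.754312, -0.591409, -0.285043],
    [0.557247, -0.806319, 0.198306],
]

r_zxz = rotation_zxz(1.275714, 1.371167, 2.536865)
max_diff_zxz = max(
    abs(r_zxz[col + 1][row + 1] - reported_zxz[row][col])
    for row in range(3)
    for col in range(3)
)
print(max_diff_zxz < 1e-4)
```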
### Date: 21 Feb 2008
The format of the LCS_GDT lines has been slightly modified to provide a better description
of the results reported in the LCS GDT section:
LCS_GDT MOLECULE-1 MOLECULE-2 LCS_DETAILS GDT_DETAILS ...
LCS_GDT RESIDUE RESIDUE SEGMENT_SIZE GLOBAL DISTANCE TEST COLUMNS: ...
LCS_GDT NAME NUMBER NAME NUMBER 1.0 2.0 5.0 0.5 1.0 1.5 2.0 2.5 3.0 ...
The option "-gdt" has been introduced. It can be combined ONLY with the "-3" option.
If "-3 -gdt" is used then the reported final superposition is the one that fits maximum
number of residues (N) under a given distance cutoff. This is exactly the same superposition
as is reported by default in the previous versions of the LGA program when "-3" option was used.
From now on, the default reported superposition for "-3" mode is the standard superposition
calculated using the set of identified N residues.
NOTE: when the standard superposition is applied, some of the N residues identified by
LGA (GDT algorithm) may no longer fit under the selected distance cutoff DIST.
### Date: 10 July 2008
The option of calculating CB atom positions "-cb:f" can be combined with "-atom:CB".
If two options are combined (e.g. "-cb:1 -atom:CB"), then all existing CB atoms are
leveraged and only missing CB atoms are calculated.
A new option "-check" has been introduced to check and report amino acids with missing
pre-selected atoms ("CA" atoms are pre-selected as default atoms for LGA calculations).
If "-cb:f" option is used, then program will report amino-acids with missing main chain
atoms (N, CA, C, or O).
### Date: 18 July 2008
Two new options "-gdc_sup" and "-gdc_set" have been introduced to allow calculating
an additional superposition on a selected set of amino acids and using this superposition
to evaluate distances between atoms from another set of selected amino acids.
Thanks to Yun He (jarodpardon@gmail.com) and Daniel Barsky (barsky@llnl.gov) for
suggesting this modification.
When "-swap" or "-rmsd" options are used, then the GDC (Global Distance Calculations)
analysis (as default) is performed on all amino acids that are used for regular LGA
calculations.
To define a set of amino acids for calculating additional superposition for GDC analysis
we can make amino acids selection using an option "-gdc_sup:s1:s2,s3:s4".
To evaluate a selected set of amino acids we can use an option "-gdc_set:s5:s6,s7:s8".
For example, if we run the LGA program as:
./lga model.target -3 -sda -d:4 -swap -gdc_sup:s1:s2 -gdc_set:s5:s6,s7:s8
then the SUMMARY(GDT) results (GDT_TS, LGA_S3, N, ...) will be calculated as before
(using all (in common) amino acids from both structures (model and target)), but the
GDC results (Dist_max and GDC columns in LGA records, and SUMMARY(RMSD_GDC)) will be
calculated for s5:s6,s7:s8 ranges only using the superposition created based on the
amino acids from the range s1:s2.
Another example:
./lga 1hiv_A.1sip_A -4 -er2:10_A:70_A -gdc_sup:14_A:50_A -gdc_set:24_A:33_A
# Molecule1 Molecule2 DISTANCE Mis MC All Dist_max GDC_mc GDC_all
..........................
LGA E 21_A E 21_A 0.828 0 0.109 0.345 - - -
LGA A 22_A V 22_A 0.377 2 0.057 0.109 - - -
LGA L 23_A L 23_A 0.409 0 0.075 0.255 - - -
LGA L 24_A L 24_A 0.296 0 0.123 0.142 0.714 100.000 96.429
LGA D 25_A D 25_A 0.242 0 0.136 0.346 0.787 100.000 96.429
LGA T 26_A T 26_A 0.393 0 0.074 0.236 0.501 100.000 98.639
LGA G 27_A G 27_A 0.181 0 0.032 0.032 0.273 100.000 100.000
LGA A 28_A A 28_A 0.481 0 0.103 0.203 0.681 97.619 96.190
LGA D 29_A D 29_A 0.355 0 0.121 0.157 0.563 100.000 98.810
LGA D 30_A D 30_A 0.484 0 0.075 0.531 2.046 100.000 88.869
LGA T 31_A S 31_A 0.726 1 0.025 0.059 0.762 97.619 80.159
LGA V 32_A I 32_A 0.473 3 0.095 0.149 0.857 100.000 61.310
LGA L 33_A V 33_A 0.287 2 0.086 0.096 0.722 97.619 68.707
LGA E 34_A T 34_A 0.791 2 0.095 0.102 - - -
LGA E 35_A G 35_A 3.617 0 0.609 0.609 - - -
LGA M 36_A I 36_A 2.135 3 0.044 0.095 - - -
LGA S 37_A E 37_A 1.098 4 0.029 0.042 - - -
..........................
# RMSD_GDC results: CA MC common percent ALL common percent GDC_mc GDC_all
NUMBER_OF_ATOMS_AA: 61 244 244 100.00 457 361 78.99 10
SUMMARY(RMSD_GDC): 1.281 1.245 1.560 99.286 88.554
#CA N1 N2 DIST N RMSD Seq_Id LGA_S LGA_Q
SUMMARY(LGA) 99 61 5.0 61 1.28 45.90 95.952 4.417
In the example above the main superposition and the distances between CA atoms (DISTANCE
column) were calculated using selected set of CA atoms (see range: -er2:10_A:70_A) from
the target (molecule2; 1sip_A). MC and All columns contain "local" RMSD values calculated
on mainchain (MC) and all (All) atoms from the given aligned amino acids. The GDC columns
(Dist_max, GDC_mc and GDC_all) contain results from distance calculations using an additional
superposition which is calculated as a standard CA-based superposition applied to the
restricted set (see range "-gdc_sup:14_A:50_A" from molecule2) of residue-residue pairs
(correspondences) identified by the main LGA superposition. The additional superposition is
used for GDC calculations applied to the set of residue-residue pairs from the range defined
by "-gdc_set:24_A:33_A". The row SUMMARY(RMSD_GDC) contains an average value from all 10 (in
this example) calculated GDC_mc and 10 GDC_all values. Dist_max is a maximum distance between
corresponding atoms from the aligned (equivalent) amino acids.
For each amino acid from the set "-gdc_set:24_A:33_A" the values of GDC_mc and GDC_all are
calculated by the following GDC algorithm:
1) superposition is calculated using the range "-gdc_sup:14_A:50_A" of amino acids from
the molecule2
2) the distances between corresponding atoms (model.target) from each selected amino acid
are assigned to the k=20 distance bins: 0.5A, 1.0A, 1.5A, 2.0A, 2.5A, ...
(NOTE: the lowest distance deviation bin is defined as a range: 0.0 - 0.5 Angstroms,
the second bin is defined as: 0.0 - 1.0 Angstroms, the third: 0.0 - 1.5A, etc.)
3) for each bin_i (i=1 ... 20) the percentages Pa_i of assigned atoms are calculated
4) all percentages are added by the formula:
GDC_all = 100.0 * 2 * (k*Pa_1 + (k-1)*Pa_2 +...+ 1*Pa_k) / ((k+1)*k), where k=20.
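Steps 2) - 4) can be sketched for one amino acid as follows. The sketch makes two assumptions that are not stated explicitly above: Pa_i is taken as a fraction between 0 and 1 (so that the factor 100.0 gives a percentage), and a distance equal to a bin limit counts as inside the bin. The function name gdc_score and the list of distances are hypothetical.

```python
def gdc_score(distances, k=20, bin_width=0.5):
    """GDC = 100 * 2 * (k*Pa_1 + (k-1)*Pa_2 + ... + 1*Pa_k) / ((k+1)*k)."""
    n = len(distances)
    if n == 0:
        raise ValueError("no atom pairs to evaluate")
    total = 0.0
    for i in range(1, k + 1):
        cutoff = i * bin_width          # bins are cumulative: 0.0 - cutoff
        fraction = sum(1 for d in distances if d <= cutoff) / n
        total += (k - i + 1) * fraction
    return 100.0 * 2.0 * total / ((k + 1) * k)


# Hypothetical distances (Angstroms) for the eight atoms of one residue:
# five fit under 0.5 A and all eight fit under 1.0 A.
distances = [0.1, 0.2, 0.3, 0.4, 0.45, 0.6, 0.7, 0.714]
print(round(gdc_score(distances), 3))
```

With all atoms under 0.5 A the score is 100.0; every atom that needs a wider bin lowers the score, and atoms in the tightest bins carry the largest weights.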
NOTE: The ranges defined by the options "-gdc_sup" and "-gdc_set" have to be the subsets
of the list of residues used for main superposition. It is because the LGA program needs
to identify residue-residue correspondences (equivalences) before GDC evaluation of the
selected residues and atoms can be performed.
If ranges "-gdc_sup:s1:s2" and "-gdc_set:s3:s4" are not specified, then the GDC calculations are
performed on the same set of amino acids as is used for regular LGA calculations (main
superposition).
### Date: 31 July 2008
Many thanks to Jane Richardson (dcrjsr@kinemage.biochem.duke.edu) and the members of
the Richardson Lab. A number of improvements and new options have been introduced to
the LGA program. Details are below.
A new option "-gdc_sup" has been introduced to report and rotate molecule1 using the
superposition that is used for GDC calculations (e.g. defined by "-gdc_sup:s1:s2").
If "-gdc_sup" is not specified then the standard LGA superposition is reported.
A new option: -gdc_at:a1,a2,a3,a4 has been implemented. It allows selecting atoms (one
atom per amino-acid type) from the molecule2 for which the GDC calculations
(distances and GDC summary) will be performed.
Format example (aa.atom): a1 = V.CG1, a2 = C.SG, a3 = T.OG1, a4 = H.NE2
NOTE: this option is applied to the molecule2 only. The corresponding atoms from the
molecule1 will be detected based on the calculated alignment. Up to 20 representative
atoms (one atom per each of 20 amino-acids) can be selected for GDC evaluation.
The following "aa.atom" naming scheme is allowed:
aa atom
A: N CA C O CB
V: N CA C O CB CG1 CG2
L: N CA C O CB CG CD1 CD2
I: N CA C O CB CG1 CG2 CD1
P: N CA C O CB CG CD
M: N CA C O CB CG SD CE
F: N CA C O CB CG CD1 CD2 CE1 CE2 CZ
W: N CA C O CB CG CD1 CD2 NE1 CE2 CE3 CZ2 CZ3 CH2
G: N CA C O
S: N CA C O CB OG
T: N CA C O CB OG1 CG2
C: N CA C O CB SG
Y: N CA C O CB CG CD1 CD2 CE1 CE2 CZ OH
N: N CA C O CB CG OD1 ND2
Q: N CA C O CB CG CD OE1 NE2
D: N CA C O CB CG OD1 OD2
E: N CA C O CB CG CD OE1 OE2
K: N CA C O CB CG CD CE NZ
R: N CA C O CB CG CD NE CZ NH1 NH2
H: N CA C O CB CG ND1 CD2 CE1 NE2
X: N CA C O CB
NOTE: if a selected atom is not present in the coordinates of the superimposed amino acids
in both molecules (molecule1 and molecule2), then that amino-acid position will
not be evaluated.
Example of the complete list of atoms (side chain ends) selected for each amino-acid:
-gdc_at:G.CA,A.CB,V.CG1,L.CD1,I.CD1,M.CE,S.OG,T.OG1,C.SG,N.OD1,Q.OE1,D.OD2,E.OE2,K.NZ
-gdc_at:R.NH2,P.CG,W.CH2,H.NE2,F.CZ,Y.OH
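A "-gdc_at" selection can be checked against the naming table above before LGA is run. The
sketch below is only an illustration: the helper name parse_gdc_at is hypothetical (it is not
part of LGA), and it covers only the "aa.atom" pairs listed in the table above.

```python
# Atom names allowed for each one-letter amino-acid code (copied from the table above).
_TABLE = """
A: N CA C O CB
V: N CA C O CB CG1 CG2
L: N CA C O CB CG CD1 CD2
I: N CA C O CB CG1 CG2 CD1
P: N CA C O CB CG CD
M: N CA C O CB CG SD CE
F: N CA C O CB CG CD1 CD2 CE1 CE2 CZ
W: N CA C O CB CG CD1 CD2 NE1 CE2 CE3 CZ2 CZ3 CH2
G: N CA C O
S: N CA C O CB OG
T: N CA C O CB OG1 CG2
C: N CA C O CB SG
Y: N CA C O CB CG CD1 CD2 CE1 CE2 CZ OH
N: N CA C O CB CG OD1 ND2
Q: N CA C O CB CG CD OE1 NE2
D: N CA C O CB CG OD1 OD2
E: N CA C O CB CG CD OE1 OE2
K: N CA C O CB CG CD CE NZ
R: N CA C O CB CG CD NE CZ NH1 NH2
H: N CA C O CB CG ND1 CD2 CE1 NE2
X: N CA C O CB
"""
ALLOWED_ATOMS = {}
for _line in _TABLE.strip().splitlines():
    _aa, _names = _line.split(":")
    ALLOWED_ATOMS[_aa.strip()] = set(_names.split())


def parse_gdc_at(selection):
    """Parse 'G.CA,A.CB,...' into {aa: atom} (hypothetical helper, not part of LGA)."""
    chosen = {}
    for item in selection.split(","):
        aa, sep, atom = item.strip().partition(".")
        if not sep or aa not in ALLOWED_ATOMS:
            raise ValueError(f"unknown amino-acid code in {item!r}")
        if atom not in ALLOWED_ATOMS[aa]:
            raise ValueError(f"atom {atom!r} is not listed for amino acid {aa!r}")
        if aa in chosen:
            raise ValueError(f"more than one atom selected for amino acid {aa!r}")
        chosen[aa] = atom
    return chosen


# The complete "side chain ends" list from the example above.
SIDE_CHAIN_ENDS = ("G.CA,A.CB,V.CG1,L.CD1,I.CD1,M.CE,S.OG,T.OG1,C.SG,N.OD1,"
                   "Q.OE1,D.OD2,E.OE2,K.NZ,R.NH2,P.CG,W.CH2,H.NE2,F.CZ,Y.OH")
selected = parse_gdc_at(SIDE_CHAIN_ENDS)
print(len(selected), "atoms selected")
```

The check enforces the "one atom per amino acid" rule by rejecting a second atom for the same
one-letter code.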
Example of the command line for running LGA program (the same example as shown above):
./lga 1hiv_A.1sip_A -4 -er2:10_A:70_A -gdc_sup:14_A:50_A -gdc_set:24_A:33_A -gdc_at:G.CA,A.CB,V.CG1,L.CD1,I.CD1,M.CE,S.OG,T.OG1,C.SG,N.OD1,Q.OE1,D.OD2,E.OE2,K.NZ,R.NH2,P.CG,W.CH2,H.NE2,F.CZ,Y.OH
The LGA program will produce the following output:
# Molecule1 Molecule2 DISTANCE Mis MC All Dist_max GDC_mc GDC_all Dist_at
................................................
LGA E 21_A E 21_A 0.828 0 0.109 0.345 - - - -
LGA A 22_A V 22_A 0.377 2 0.057 0.109 - - - -
LGA L 23_A L 23_A 0.409 0 0.075 0.255 - - - -
LGA L 24_A L 24_A 0.296 0 0.123 0.142 0.714 100.000 96.429 0.714
LGA D 25_A D 25_A 0.242 0 0.136 0.346 0.787 100.000 96.429 0.787
LGA T 26_A T 26_A 0.393 0 0.074 0.236 0.501 100.000 98.639 0.501
LGA G 27_A G 27_A 0.181 0 0.032 0.032 0.273 100.000 100.000 0.216
LGA A 28_A A 28_A 0.481 0 0.103 0.203 0.681 97.619 96.190 0.681
LGA D 29_A D 29_A 0.355 0 0.121 0.157 0.563 100.000 98.810 0.563
LGA D 30_A D 30_A 0.484 0 0.075 0.531 2.046 100.000 88.869 2.046
LGA T 31_A S 31_A 0.726 1 0.025 0.059 0.762 97.619 80.159 -
LGA V 32_A I 32_A 0.473 3 0.095 0.149 0.857 100.000 61.310 -
LGA L 33_A V 33_A 0.287 2 0.086 0.096 0.722 97.619 68.707 -
LGA E 34_A T 34_A 0.791 2 0.095 0.102 - - - -
LGA E 35_A G 35_A 3.617 0 0.609 0.609 - - - -
LGA M 36_A I 36_A 2.135 3 0.044 0.095 - - - -
LGA S 37_A E 37_A 1.098 4 0.029 0.042 - - - -
................................................
# RMSD_GDC results: CA MC common percent ALL common percent GDC_mc GDC_all GDC_at
NUMBER_OF_ATOMS_AA: 61 244 244 100.00 457 361 78.99 10 7
SUMMARY(RMSD_GDC): 1.281 1.245 1.560 99.286 88.554 88.163
#CA N1 N2 DIST N RMSD Seq_Id LGA_S LGA_Q
SUMMARY(LGA) 99 61 5.0 61 1.28 45.90 95.952 4.417
Another example of the command line for running LGA program:
./lga 1m2f_A_2.1m2e_A -3 -gdc_at:G.CA,A.CB,V.CG1,L.CD1,I.CD1,M.CE,S.OG,T.OG1,C.SG,N.OD1,Q.OE1,D.OD2,E.OE2,K.NZ,R.NH2,P.CG,W.CH2,H.NE2,F.CZ,Y.OH -gdc_set:100_A:110_A
The LGA program will produce the following output:
# Molecule1: number of CA atoms 135 ( 2092), selected 135 , name 1m2f_A_2
# Molecule2: number of CA atoms 135 ( 2091), selected 135 , name 1m2e_A
# PARAMETERS: 1m2f_A_2.1m2e_A -3 -gdc_at:G.CA,A.CB,V.CG1,L.CD1,I.CD1,M.CE,S.OG,T.OG1,C.SG,N.OD1,Q.OE1,D.OD2,E.OE2,K.NZ,R.NH2,P.CG,W.CH2,H.NE2,F.CZ,Y.OH -gdc_set:100_A:110_A
# FIXED Atom-Atom correspondence
# GDT and LCS analysis
................................................
# Molecule1 Molecule2 DISTANCE Mis MC All Dist_max GDC_mc GDC_all Dist_at
................................................
LGA K 95_A K 95_A 0.975 0 0.443 1.011 - - - -
LGA E 96_A E 96_A 1.543 0 0.128 0.130 - - - -
LGA Q 97_A Q 97_A 1.169 0 0.056 0.702 - - - -
LGA L 98_A L 98_A 0.808 0 0.067 0.162 - - - -
LGA Y 99_A Y 99_A 0.356 0 0.024 0.128 - - - -
LGA H 100_A H 100_A 0.720 0 0.024 0.144 0.887 90.476 90.476 0.509
LGA S 101_A S 101_A 1.141 0 0.006 0.611 1.420 83.690 82.937 1.073
LGA A 102_A A 102_A 1.001 0 0.015 0.016 1.022 85.952 85.048 1.022
LGA E 103_A E 103_A 0.627 0 0.060 0.777 1.947 90.476 89.630 1.475
LGA L 104_A L 104_A 0.499 0 0.016 0.050 0.796 100.000 96.429 0.796
LGA H 105_A H 105_A 0.458 0 0.002 0.222 0.949 100.000 94.286 0.817
LGA L 106_A L 106_A 0.403 0 0.046 0.088 0.708 97.619 97.619 0.502
LGA G 107_A G 107_A 0.486 0 0.027 0.027 0.486 100.000 100.000 0.486
LGA I 108_A I 108_A 0.561 0 0.035 0.075 0.904 90.476 90.476 0.861
LGA H 109_A H 109_A 0.765 0 0.046 1.005 6.852 90.476 59.190 6.852
LGA Q 110_A Q 110_A 0.374 0 0.029 0.460 1.399 100.000 94.815 1.238
LGA L 111_A L 111_A 0.381 0 0.006 0.042 - - - -
LGA E 112_A E 112_A 0.468 0 0.029 0.160 - - - -
LGA Q 113_A Q 113_A 0.475 0 0.015 0.630 - - - -
................................................
# RMSD_GDC results: CA MC common percent ALL common percent GDC_mc GDC_all GDC_at
NUMBER_OF_ATOMS_AA: 135 540 540 100.00 1054 1054 100.00 11 11
SUMMARY(RMSD_GDC): 0.914 0.949 1.486 93.561 89.173 81.039
#CA N1 N2 DIST N RMSD GDT_TS LGA_S3 LGA_Q
SUMMARY(GDT) 135 135 5.0 135 0.91 96.296 98.268 13.314
LGA_LOCAL RMSD: 0.914 Number of atoms: 135 under DIST: 5.00
LGA_ASGN_ATOMS RMSD: 0.914 Number of assigned atoms: 135
Std_ASGN_ATOMS RMSD: 0.914 Standard rmsd on all 135 assigned CA atoms
The "Dist_at" column reports the distances between corresponding atoms
(model:1m2f_A_2 - target:1m2e_A) calculated using the standard LGA (-3)
superposition.
The "GDC_at" column shows the number of amino acids for which "Dist_at"
values are calculated, and the summary value GDC_at is calculated using an
algorithm similar to the one used for GDC_mc and GDC_all:
1) the distances (Dist_at) between corresponding atoms (model.target) from each
selected amino acid are assigned to the k=20 distance bins: 0.5A, 1.0A, 1.5A,
2.0A, 2.5A, ...
2) for each bin_i (i=1 ... 20) the percentages Pa_i of assigned atoms are calculated
3) all percentages are added by the formula:
GDC_at = 100.0 * 2 * (k*Pa_1 + (k-1)*Pa_2 +...+ 1*Pa_k) / ((k+1)*k), where k=20.
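The three steps above can be sketched in Python. This is an illustrative reading of the
formula, not code taken from LGA. It assumes that Pa_i is the fraction (0 to 1) of selected
atoms whose Dist_at is strictly below the i-th cutoff (i * 0.5 A), and that an empty selection
scores 0. The function name gdc_score and its parameters are hypothetical.

```python
def gdc_score(distances, k=20, bin_width=0.5):
    """GDC-style summary (0..100) from atom-atom distances given in Angstroms.

    Assumption: Pa_i is the fraction of atoms with distance < i * bin_width.
    """
    n = len(distances)
    if n == 0:
        return 0.0  # assumption: nothing to evaluate gives a zero score
    weighted = 0.0
    for i in range(1, k + 1):
        cutoff = i * bin_width
        pa_i = sum(1 for d in distances if d < cutoff) / n
        weighted += (k - i + 1) * pa_i  # weights k, k-1, ..., 1
    return 100.0 * 2.0 * weighted / ((k + 1) * k)


# Dist_at values for residues 100_A ... 110_A from the output above.
dist_at = [0.509, 1.073, 1.022, 1.475, 0.796, 0.817,
           0.502, 0.486, 0.861, 6.852, 1.238]
print(f"{gdc_score(dist_at):.3f}")  # 81.039
```

Under these assumptions the sketch gives 81.039 for the eleven Dist_at values, which matches
the GDC_at value reported in the SUMMARY(RMSD_GDC) line above.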
A new option: -gdc_eat:e1:e2,e3:e4 has been implemented. It allows the user to select exact
atoms from molecule1 and molecule2 for the GDC calculations (distances and GDC
summary).
Format example (aanumber.atom): e1 = 132_A.CG1, e2 = 124_B.SG, e3 = 400.FE, e4 = 300.FE
NOTE1: this option allows the distances between any atoms from molecule1
and molecule2 to be calculated. The distances are calculated after the superposition is applied.
NOTE2: "-gdc_eat:e1:e2" reports the distances between exact atom
positions (as they are loaded from the PDB file), so in this case the "-swap" option
does not resolve a possible ambiguity in atom names. See the example below:
Example of the command line:
./lga 1m2f_A_2.1m2e_A -4 -gdc_set:20_A:30_A -swap -gdc_at:D.OD1 -gdc_eat:27_A.OD1:27_A.OD1,27_A.OD1:27_A.OD2,27_A.OD2:27_A.OD1,27_A.OD2:27_A.OD2
Created output:
# Molecule1: number of CA atoms 135 ( 2092), selected 135 , name 1m2f_A_2
# Molecule2: number of CA atoms 135 ( 2091), selected 135 , name 1m2e_A
# PARAMETERS: 1m2f_A_2.1m2e_A -4 -gdc_set:20_A:30_A -swap -gdc_at:D.OD1 -gdc_eat:27_A.OD1:27_A.OD1,27_A.OD1:27_A.OD2,27_A.OD2:27_A.OD1,27_A.OD2:27_A.OD2
# Search for Atom-Atom correspondence
# Structure alignment analysis
# Checking swapping
# possible swapping detected: D 27_A D 27_A
................................................
# Molecule1 Molecule2 DISTANCE Mis MC All Dist_max GDC_mc GDC_all Dist_at
................................................
LGA Q 18_A Q 18_A 0.271 0 0.082 0.430 - - - -
LGA D 19_A D 19_A 0.644 0 0.046 0.155 - - - -
LGA C 20_A C 20_A 0.405 0 0.013 0.062 0.505 97.619 98.413 -
LGA Q 21_A Q 21_A 0.448 0 0.024 0.087 0.871 95.238 92.593 -
LGA R 22_A R 22_A 0.871 0 0.031 0.841 4.423 90.476 68.052 -
LGA A 23_A A 23_A 0.767 0 0.025 0.029 0.778 90.476 90.476 -
LGA L 24_A L 24_A 0.453 0 0.027 0.054 0.593 92.857 96.429 -
LGA S 25_A S 25_A 0.746 0 0.067 0.108 0.916 90.476 90.476 -
LGA A 26_A A 26_A 0.550 0 0.037 0.046 0.647 90.476 92.381 -
LGA D 27_A D 27_A 0.720 0 0.020 0.231 0.846 90.476 90.476 0.818
LGA R 28_A R 28_A 0.613 0 0.026 0.293 1.315 90.476 91.385 -
LGA Y 29_A Y 29_A 0.562 0 0.025 0.627 1.799 90.476 88.413 -
LGA Q 30_A Q 30_A 0.857 0 0.009 1.029 2.645 90.476 81.905 -
LGA L 31_A L 31_A 0.970 0 0.072 0.437 - - - -
LGA Q 32_A Q 32_A 0.471 0 0.043 0.113 - - - -
................................................
GDC_eat: ASP 27_A.OD1 ASP 27_A.OD1 distance: 2.386
GDC_eat: ASP 27_A.OD1 ASP 27_A.OD2 distance: 0.846
GDC_eat: ASP 27_A.OD2 ASP 27_A.OD1 distance: 0.818
GDC_eat: ASP 27_A.OD2 ASP 27_A.OD2 distance: 1.985
# RMSD_GDC results: CA MC common percent ALL common percent GDC_mc GDC_all GDC_at GDC_eat
NUMBER_OF_ATOMS_AA: 135 540 540 100.00 1054 1054 100.00 11 1 4
SUMMARY(RMSD_GDC): 0.914 0.949 1.461 91.775 89.182 90.476 79.643
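The "GDC_eat:" lines in the output above can be collected with a few lines of Python. The
sketch below is an illustration only: the helper name parse_gdc_eat is hypothetical, and the
field layout is assumed from the four lines shown above (whitespace-separated fields).

```python
def parse_gdc_eat(output_text):
    """Return [(atom1, atom2, distance)] from 'GDC_eat:' lines (hypothetical helper)."""
    pairs = []
    for line in output_text.splitlines():
        fields = line.split()
        # Assumed layout: GDC_eat: RES1 ATOM1 RES2 ATOM2 distance: VALUE
        if len(fields) == 7 and fields[0] == "GDC_eat:" and fields[5] == "distance:":
            pairs.append((fields[2], fields[4], float(fields[6])))
    return pairs


sample = """\
GDC_eat: ASP 27_A.OD1 ASP 27_A.OD1 distance: 2.386
GDC_eat: ASP 27_A.OD1 ASP 27_A.OD2 distance: 0.846
GDC_eat: ASP 27_A.OD2 ASP 27_A.OD1 distance: 0.818
GDC_eat: ASP 27_A.OD2 ASP 27_A.OD2 distance: 1.985
"""
pairs = parse_gdc_eat(sample)
same_name = [d for a, b, d in pairs if a == b]    # OD1-OD1 and OD2-OD2
cross_name = [d for a, b, d in pairs if a != b]   # OD1-OD2 and OD2-OD1
print("same-name pairs:", same_name)
print("cross-name pairs:", cross_name)
```

For this ASP 27_A example the cross-name pairs (OD1 with OD2) are closer than the same-name
pairs, which is the naming ambiguity that "-gdc_eat" reports as loaded rather than resolving.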
The "GDC_eat:" lines report the distances between the selected
atoms (model:1m2f_A_2 - target:1m2e_A) calculated using the standard LGA (-4) superposition.
The "GDC_eat" column of the "# RMSD_GDC results:" section summarizes these distance
calculations. It shows the number of compared atom pairs (4) and the summary value
GDC_eat, which is calculated using an algorithm similar to the one used for
"GDC_at" (see above).
### Date: 07 August 2008
The following addition has been introduced to the option: -gdc_at:a1,a2,a3,a4
Now the selection of the CB position for glycine is allowed: G.CB (the CB coordinates will be
calculated automatically based on the main-chain atom positions).
NOTE: a complete set of main chain atoms (N,CA,C,O) is required for both input structures.
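The text above does not say how LGA places the glycine CB. As an illustration only, the sketch
below uses one widely used ideal-geometry construction that builds a virtual CB from the N, CA
and C coordinates. It is an assumption that this resembles what LGA does; the function name
virtual_cb, the coefficients and the test coordinates are not taken from LGA.

```python
import math


def virtual_cb(n, ca, c):
    """Place an idealized CB from backbone N, CA, C coordinates (3-tuples, Angstroms).

    Common ideal-geometry construction; NOT necessarily the method used by LGA.
    """
    b = [ca[i] - n[i] for i in range(3)]    # N -> CA
    cc = [c[i] - ca[i] for i in range(3)]   # CA -> C
    a = [b[1] * cc[2] - b[2] * cc[1],       # cross product b x cc
         b[2] * cc[0] - b[0] * cc[2],
         b[0] * cc[1] - b[1] * cc[0]]
    return tuple(-0.58273431 * a[i] + 0.56802827 * b[i] - 0.54067466 * cc[i] + ca[i]
                 for i in range(3))


# Made-up backbone with roughly ideal bond lengths and N-CA-C angle.
n_xyz = (1.458, 0.0, 0.0)
ca_xyz = (0.0, 0.0, 0.0)
c_xyz = (-0.5465, 1.4237, 0.0)
cb_xyz = virtual_cb(n_xyz, ca_xyz, c_xyz)
print("CA-CB distance: %.2f" % math.dist(ca_xyz, cb_xyz))
```

This construction needs only N, CA and C, whereas LGA requires the complete set of main-chain
atoms (N,CA,C,O), so LGA's own calculation may differ.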
### Date: 28 August 2008
The following addition to the option "-gdc_at" has been introduced: -gdc_at:*.atom
A single main-chain or CB atom (N,CA,C,O,CB), the same for all amino acids ('*'), can now be
selected (e.g. -gdc_at:*.N).
NOTE: amino acids from molecule2 serve as a frame of reference for GDC evaluation
(corresponding amino acids or atoms that are missing in molecule1 are counted as 0 scores
in GDC calculations). If the option "-gdc_at:*.CB" is selected, then for the "Dist_at" and "GDC_at"
calculations the coordinates of the CB positions are automatically calculated for glycines only
(the CB coordinates for amino acids other than GLY have to be present in the provided files).
### Date: 14 March 2009
A new option "-gdc:n" has been introduced to define the number of bins used for GDC evaluation
of atom pairs from the corresponding residues (1 <= n <= 20; bins: <0.5, <1.0, ... <10.0).
If "-gdc:n" is not specified, then n=20 (default).
Many thanks to Jane Richardson (dcrjsr@kinemage.biochem.duke.edu) and the members of the
Richardson Lab for introducing a new GDT-like score called GDC_sc (global distance calculation
for sidechains). Instead of comparing residue positions on the basis of Calphas, GDC_sc uses a
characteristic atom near the end of each sidechain type for the evaluation of residue-residue
distance deviations. The list of 18 atoms is given by the -gdc_at flags in the LGA command shown
below, where each one-letter amino-acid code is followed by the PDB-format atom name to be used.
List of flags to perform GDC_sc calculations:
-swap -gdc:10 -gdc_at:V.CG1,L.CD1,I.CD1,P.CG,M.CE,F.CZ,W.CH2,S.OG
-gdc_at:T.OG1,C.SG,Y.OH,N.OD1,Q.OE1,D.OD2,E.OE2,K.NZ,R.NH2,H.NE2
Gly and Ala are not included, since their positions are directly determined by the backbone.
The -swap flag takes care of the possible ambiguity in Asp or Glu terminal oxygen naming.
For GDC_sc, the "optimal" LGA superposition is used to calculate percentages of corresponding
model-target atom pairs that fit under 10 distance-limit values from 0.5A to 5A.
The procedure assigns each reference atom to the relevant bin for its model vs target distance:
<0.5A, <1.0A, ... <4.5A, <5.0A; for each bin_i, the fraction (Pa_i) of assigned atoms is calculated;
finally the fractions are added and scaled to give a GDC_sc value between 0 and 100, by the formula:
GDC_sc = 100*2*(k*Pa_1 + (k-1)*Pa_2 + ... + 1*Pa_k) / ((k+1)*k), where k=10.
A new flag: "-gdc_sc" has been introduced to the LGA program to facilitate GDC_sc calculations.
This new flag selects all parameters required for GDC_sc calculations (see list of GDC_sc flags
above).
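The GDC_sc formula can also be read as an average of per-atom scores, which shows how much a
single side-chain atom contributes at a given deviation. The sketch below is an illustration,
not LGA code. It assumes that Pa_i is the fraction of reference atoms with a distance strictly
below the i-th cutoff (i * 0.5 A), in which case an atom whose first satisfied bin is b
contributes (k-b+1)*(k-b+2) / (k*(k+1)). The function names are hypothetical.

```python
def gdc_sc_atom_score(distance, k=10, bin_width=0.5):
    """Contribution (0..100) of one model-target atom pair, under the stated assumption."""
    # b = index of the first bin whose cutoff exceeds the distance (k+1 if none does).
    b = int(distance // bin_width) + 1
    if b > k:
        return 0.0  # at or beyond the largest cutoff (5.0 A for k=10)
    return 100.0 * (k - b + 1) * (k - b + 2) / (k * (k + 1))


def gdc_sc(distances, k=10):
    """Average of the per-atom contributions; equals the weighted-sum formula."""
    if not distances:
        return 0.0  # assumption: nothing to evaluate gives a zero score
    return sum(gdc_sc_atom_score(d, k) for d in distances) / len(distances)


for d in (0.3, 0.7, 2.2, 4.9, 5.0):
    print(f"{d:.1f} A -> {gdc_sc_atom_score(d):.3f}")
```

With k=10, an atom within 0.5 A scores 100, one between 0.5 A and 1.0 A scores about 81.8, and
any atom at 5.0 A or more scores 0, so the score falls off quickly with side-chain deviation.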
### Date: 21 April 2009
A new option "-gdc_ref:n" has been introduced to allow GDC evaluation using atoms from the target
as a frame of reference (missing atoms in compared amino acids are calculated relative to the
reference structure: second molecule).
-gdc_ref:0 - requesting a complete set of atoms within each residue from both structures.
The score is calculated by referring to the definition of the amino acid from the
target structure (second molecule). Missing atoms lower the GDC scores.
-gdc_ref:1 - using existing atoms from the target as a frame of reference. Atoms that are
missing in the model structure (first molecule) lower the GDC scores.
-gdc_ref:2 - using existing atoms from the target as a frame of reference. When identical
residues are aligned, the atoms that are missing in the model structure (first
molecule) lower the score. When different residues are aligned, only
the main-chain and CB atoms are taken into account.
The shortcut flag "-gdc" corresponds to "-gdc_ref:2 -swap".
### Date: 16 September 2011
The residue selection option -er1:s1:s1,s2:s2,s3:s3 (si - strings: single residues or chains) has
been improved. Now, if several "single" residues or chains need to be selected, the si pairs
(ranges si:si) can be simplified to: -er1:s1,s2,s3 (single residues or chains can be separated
by ','; no beg:end is required).
The format of the output from the option "-aa", which lists the selected residues, has been improved.
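The relation between the short and the long form of the "-er1" selection described above can
be sketched as follows. The helper name expand_er_selection is hypothetical, and the sketch
assumes that single items and explicit beg:end ranges may be mixed in one list, which the text
above does not state.

```python
def expand_er_selection(value):
    """Expand 's1,s2:s3' into explicit (begin, end) pairs (hypothetical helper)."""
    ranges = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # ignore empty items such as a trailing comma
        begin, sep, end = item.partition(":")
        # A single residue or chain "s" is treated as the range "s:s".
        ranges.append((begin, end if sep else begin))
    return ranges


print(expand_er_selection("10_A,20_A:30_A,B"))
# [('10_A', '10_A'), ('20_A', '30_A'), ('B', 'B')]
```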
### Date: 01 September 2019
Performance of the program has been improved.
### Date: 20 February 2024
The LGA_Q scores reported in the SUMMARY lines have been replaced by the GDT_HA scores.
For example, when the similarity between two PDB structures 1sip_A 1cpi_B is evaluated using
"GDT and LCS analysis, FIXED Atom-Atom correspondence" (option "-3"):
runlga.mol_mol.pl 1sip_A 1cpi_B -3
the following scores in the SUMMARY lines are reported:
#CA N1 N2 DIST N RMSD GDT_TS LGA_S3 GDT_HA Seq_Id
SUMMARY(GDT) 99 99 5.0 99 1.06 93.182 96.934 79.040 50.51
When the similarity between the two PDB structures 1sip_A 1cpi_B is evaluated using
"Structure alignment analysis, Search for Atom-Atom correspondence" (option "-4"):
runlga.mol_mol.pl 1sip_A 1cpi_B -4
the following scores in the SUMMARY lines are reported:
#CA N1 N2 DIST N RMSD Seq_Id LGA_S GDT_HA4
SUMMARY(LGA) 99 99 5.0 99 1.06 50.51 97.089 79.545
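Because the column set differs between the two modes (and changed with this release), scripts
that read the SUMMARY lines may be more robust if they pair each value with its header name
instead of relying on a fixed position. A minimal sketch, assuming whitespace-separated
fields as shown above; the helper name parse_summary is hypothetical:

```python
def parse_summary(header_line, summary_line):
    """Map column names from the '#CA ...' header to values of a SUMMARY line."""
    names = header_line.split()[1:]    # drop the leading "#CA" marker
    fields = summary_line.split()
    label, values = fields[0], fields[1:]
    if len(names) != len(values):
        raise ValueError("header and SUMMARY line have different numbers of columns")
    return label, {name: float(v) for name, v in zip(names, values)}


label3, scores3 = parse_summary(
    "#CA N1 N2 DIST N RMSD GDT_TS LGA_S3 GDT_HA Seq_Id",
    "SUMMARY(GDT) 99 99 5.0 99 1.06 93.182 96.934 79.040 50.51",
)
label4, scores4 = parse_summary(
    "#CA N1 N2 DIST N RMSD Seq_Id LGA_S GDT_HA4",
    "SUMMARY(LGA) 99 99 5.0 99 1.06 50.51 97.089 79.545",
)
print(label3, scores3["GDT_HA"])
print(label4, scores4["GDT_HA4"])
```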
GDT_HA is sometimes called a "high accuracy" version of GDT_TS because it is computed using
smaller cutoff distances (half the size of those used for GDT_TS). The conventional GDT_TS total
score is the average of the results at cutoffs of 1, 2, 4, and 8 Å, while GDT_HA uses 0.5, 1, 2, and 4 Å.
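The difference between the two cutoff sets can be illustrated with a simplified calculation.
The sketch below is not the LGA algorithm: it takes one fixed list of per-residue distances,
whereas GDT searches for the largest set of residues under each cutoff, so values reported by
LGA can be higher. The use of "<=" at the cutoff, the function name gdt_average and the example
distances are assumptions made for illustration.

```python
GDT_TS_CUTOFFS = (1.0, 2.0, 4.0, 8.0)
GDT_HA_CUTOFFS = (0.5, 1.0, 2.0, 4.0)


def gdt_average(distances, cutoffs, n_reference):
    """Average, over the cutoffs, of the percent of reference residues within each cutoff."""
    percents = [100.0 * sum(1 for d in distances if d <= c) / n_reference
                for c in cutoffs]
    return sum(percents) / len(percents)


# Made-up distances (in Angstroms) for a five-residue target.
example = [0.3, 0.8, 1.5, 3.0, 9.0]
print("GDT_TS-like:", gdt_average(example, GDT_TS_CUTOFFS, len(example)))
print("GDT_HA-like:", gdt_average(example, GDT_HA_CUTOFFS, len(example)))
```

For the same distances the GDT_HA-like value cannot exceed the GDT_TS-like value, because each
of its cutoffs is half of the corresponding GDT_TS cutoff.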
The user should be aware that the calculated scores LGA_S3 and GDT_HA (from option "-3") and the
corresponding scores LGA_S and GDT_HA4 (from option "-4") may differ. This is because with option "-3"
the local and global structure similarities are evaluated using fixed residue-residue correspondences.
With option "-4" the LGA processing starts by establishing residue-residue correspondences based on the
calculated "optimal" structure-based alignment (for different distance cutoffs), i.e. without taking the
sequence similarities into account. This means that if we are interested in evaluating the similarity
between structure conformations of two proteins for which we know the correct residue-residue
correspondence (e.g. different models of the same protein), then option "-3" can be used. However, when
we are interested in the similarity between the structural folds of two protein structures, then option
"-4" can be used, as it will first establish "optimal" local and global structure-based residue-residue
correspondences.