LMC Chemical Information Services at http://c...

CACTVS Hashcodes and Hashcode-Based Identifiers
CACTVS can calculate hashcodes from a molecular structure [more precisely: from the molecular "ensemble", i.e. everything that is in one CTAB block in an SD file, which may include several molecules such as solute and solvent]. The hashcodes are 64-bit unsigned integers, represented as 16-digit hexadecimal strings. Several variants can be calculated, depending on whether sensitivity to certain molecular or atomic features is turned on. For example, a variant can be sensitive to stereochemistry, so that different stereoisomers yield different hashcodes, or stereochemistry can be ignored.
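The relation between the 64-bit unsigned integer and its 16-digit hexadecimal string can be sketched as follows. This is a minimal illustration, not part of CACTVS: the function names are hypothetical, and it assumes the string carries no "0x" prefix and is written in upper case (as in the identifier 0C21C07B196E1B00 used elsewhere on this page).

```python
import re

# Assumption: a hashcode string is exactly 16 hex digits with no "0x" prefix.
_HASHCODE_RE = re.compile(r"[0-9A-Fa-f]{16}")


def hashcode_to_int(text: str) -> int:
    """Convert a 16-digit hexadecimal hashcode string to an unsigned integer."""
    if not _HASHCODE_RE.fullmatch(text):
        raise ValueError(f"not a 16-digit hexadecimal hashcode: {text!r}")
    return int(text, 16)


def int_to_hashcode(value: int) -> str:
    """Format an unsigned 64-bit integer as a zero-padded 16-digit hex string."""
    if not 0 <= value < 2**64:
        raise ValueError(f"value does not fit into 64 unsigned bits: {value}")
    # Upper-case output is an assumption based on the example identifier.
    return f"{value:016X}"


if __name__ == "__main__":
    number = hashcode_to_int("0C21C07B196E1B00")
    print(number, int_to_hashcode(number))
```

Storing the integer form can save space in a database, but the zero-padded string form is what appears in the files, so round-tripping through both functions should reproduce the original string.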
Eight such hashcodes are included in the canonicalized SD files posted here. Two of those are included in the .index files. They are shown in bold in the following table.
Hashcodes

| Property Name | Fragment-sensitive (1) | Isotope-sensitive (2) | Charge-sensitive (3) | Tautomer-sensitive (4) | Stereochemistry-sensitive (5) | Tag |
|---|---|---|---|---|---|---|
| E_HASHY | yes | yes | yes | yes | no | FICTu |
| E_HASHSY | yes | yes | yes | yes | yes | FICTS |
| E_TAUTO_HASH (6) | yes | yes | yes | no | no | FICuu |
| E_STEREO_TAUTO_HASH (7) | yes | yes | yes | no | yes | FICuS |
| E_MAXFRAG_HASHY | no | no | no | yes | no | uuuTu |
| E_MAXFRAG_HASHSY | no | no | no | yes | yes | uuuTS |
| E_MAXFRAG_HASHTY (8) | no | no | no | no | no | uuuuu |
| E_MAXFRAG_HASHSTY | no | no | no | no | yes | uuuuS |
Technical notes:
(1) Realized by a special CACTVS preprocessing procedure (part of the script), which extracts the maximum fragment from the ensemble.
(2) Turned on by setting the "useisotope" flag in the mhash command.
(3) Turned on by use of input ensemble property E_PARENT_STRUCTURE [this does more than just uncharge the ensemble].
Note: Sensitivities "F", "I" and "C" could be turned on and off separately. However, in the current script they are coupled by being applied in the same procedure.
(4) Requested by selection of hashcode type.
(5) Requested by selection of hashcode type.
(6) Formerly also called E_HASHTY.
(7) Formerly also called E_HASHSTY. This (with tag appended) is also called the "FICuS" identifier.
(8) This (with tag appended) is also called the "uuuuu" identifier.
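The tag column of the table follows a simple positional pattern: one character per feature in the order fragment, isotope, charge, tautomer, stereochemistry, with the letters F, I, C, T, S marking sensitivity and "u" marking insensitivity. A minimal sketch of decoding such a tag is shown below; the function name and the dictionary keys are hypothetical, and the positional reading is inferred from the table rather than taken from a formal specification.

```python
# Feature names (hypothetical labels) paired with the letter that marks
# sensitivity at each of the five tag positions, as read off the table.
TAG_POSITIONS = (
    ("fragment", "F"),
    ("isotope", "I"),
    ("charge", "C"),
    ("tautomer", "T"),
    ("stereo", "S"),
)


def decode_tag(tag: str) -> dict:
    """Map a five-character type tag such as 'FICuS' to sensitivity flags."""
    if len(tag) != len(TAG_POSITIONS):
        raise ValueError(f"expected a 5-character tag, got {tag!r}")
    flags = {}
    for (feature, letter), char in zip(TAG_POSITIONS, tag):
        if char == letter:
            flags[feature] = True       # sensitive to this feature
        elif char == "u":
            flags[feature] = False      # "u": not sensitive to this feature
        else:
            raise ValueError(f"unexpected character {char!r} in tag {tag!r}")
    return flags


if __name__ == "__main__":
    print(decode_tag("FICuS"))
    print(decode_tag("uuuuu"))
```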
These hashcodes span the range from very lenient to very strict in terms of distinguishing between chemicals. E_MAXFRAG_HASHTY is the most lenient one, and will equate structures that may contain different isotopes, be the salts of different metals (since only the ensemble's largest fragment is evaluated), be charged or uncharged, be different stereoisomers, and be represented in different tautomeric forms. E_HASHSY, conversely, is the strictest one, distinguishing between all those cases. However, it will interpret different tautomers of the same compound as different chemicals, which is not usually desirable from a chemical point of view when, e.g., comparing databases. E_STEREO_TAUTO_HASH, being tautomer-invariant but sensitive to everything else, is probably the most useful of the hashcodes included, since it is essentially a calculable, unique identifier for any small-molecule chemical. Note that this hashcode will, e.g., distinguish between O, O-, O2-, OH., OH-, H2O, D2O, H3O+ etc., species that many other systems and databases (especially if they don't include hydrogens) will project onto the same entity "O".
Since all CACTVS hashcode variants are represented as 16-digit hexadecimal numbers, a hashcode by itself does not reveal what type, flags, preprocessing procedure(s) etc. were used in its calculation. It is generally meaningless to compare different types of hashcodes, e.g. to use two different variants in an overlap analysis between two databases. To make the hashcodes usable as independent identifiers, we have therefore appended a composite tag to every hashcode in the files posted here. The first part of the tag denotes the type of the hashcode (in the example 0C21C07B196E1B00-FICuS-3.256, the string "FICuS"). The second part is the version of CACTVS with which the hashcode was calculated (here: "3.256"). The hashcode and the two tag parts are connected by a single dash each. The syntax of the type part of the tag is explained in the hashcode table.
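Splitting an identifier into its three dash-separated parts, and refusing to compare hashcodes of different types, can be sketched as follows. This is only an illustration under stated assumptions: the class and function names are hypothetical, and the code assumes that neither the type tag nor the version string contains a dash.

```python
import string
from typing import NamedTuple


class HashIdentifier(NamedTuple):
    hashcode: str   # 16-digit hexadecimal hashcode
    tag: str        # type part of the tag, e.g. "FICuS"
    version: str    # CACTVS version part of the tag, e.g. "3.256"


def parse_identifier(text: str) -> HashIdentifier:
    """Split 'HASHCODE-TYPE-VERSION' into its three parts."""
    parts = text.strip().split("-")
    if len(parts) != 3:
        raise ValueError(f"expected HASHCODE-TYPE-VERSION, got {text!r}")
    hashcode, tag, version = parts
    if len(hashcode) != 16 or any(c not in string.hexdigits for c in hashcode):
        raise ValueError(f"not a 16-digit hexadecimal hashcode: {hashcode!r}")
    return HashIdentifier(hashcode.upper(), tag, version)


def same_hashcode(a: HashIdentifier, b: HashIdentifier) -> bool:
    """Compare two identifiers, but only if they are of the same type."""
    if a.tag != b.tag:
        # Comparing different hashcode variants is generally meaningless.
        raise ValueError(f"cannot compare types {a.tag!r} and {b.tag!r}")
    return a.hashcode == b.hashcode


if __name__ == "__main__":
    ident = parse_identifier("0C21C07B196E1B00-FICuS-3.256")
    print(ident.hashcode, ident.tag, ident.version)
```

Raising an error on a type mismatch, rather than returning False, is a design choice made here so that an accidental mix of variants does not silently look like "no overlap".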
While lookups, comparisons, counts etc. can in principle be conducted on the hashcode part alone of each identifier, we strongly recommend keeping the entire identifier in all applications, files and databases. We strive to make sure that all the hashcodes remain constant, i.e. will stay the same for the same structure when calculated with future versions of CACTVS. However, it is possible that changes in the algorithm (which may become necessary as more and more of small-molecule chemistry space is explored) may cause a change of a structure's hashcode in some rare cases. The addition of the CACTVS version to the identifier string will help identify, and deal with, such cases. We will post a list of changed hashcodes, as far as we become aware of them, on this web site.
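One way to follow this recommendation in an overlap analysis is to match on the hashcode and type parts while carrying the full identifiers along, so that matches computed with different CACTVS versions can be flagged for review. The sketch below is a hypothetical helper, not an existing tool; the second version number and the placeholder hashcodes in the usage example are made up for illustration.

```python
def split_identifier(identifier: str) -> tuple:
    """Split 'HASHCODE-TYPE-VERSION' (assumes no dash inside any part)."""
    hashcode, tag, version = identifier.split("-")
    return hashcode, tag, version


def overlap_report(ids_a, ids_b):
    """Return (shared, version_mismatches) for two lists of identifiers.

    Identifiers match when hashcode and type agree. Matches whose CACTVS
    versions differ are additionally listed as (identifier_a, identifier_b).
    """
    # Index the second list by (hashcode, type), remembering the version.
    index_b = {}
    for identifier in ids_b:
        hashcode, tag, version = split_identifier(identifier)
        index_b[(hashcode, tag)] = version

    shared, version_mismatches = [], []
    for identifier in ids_a:
        hashcode, tag, version = split_identifier(identifier)
        other_version = index_b.get((hashcode, tag))
        if other_version is None:
            continue  # not present in the second list with the same type
        shared.append(identifier)
        if other_version != version:
            version_mismatches.append(
                (identifier, f"{hashcode}-{tag}-{other_version}")
            )
    return shared, version_mismatches


if __name__ == "__main__":
    # Made-up data: "3.300" and the placeholder hashcode are illustrative only.
    database_a = ["0C21C07B196E1B00-FICuS-3.256", "0000000000000001-FICuS-3.256"]
    database_b = ["0C21C07B196E1B00-FICuS-3.300", "0000000000000001-uuuuu-3.256"]
    print(overlap_report(database_a, database_b))
```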