The M2OR data collection comprises three tables:
- Experiment Table: Gathers all OR-molecule experiments, with a detailed description of the molecule and the receptor as well as the response and the assay performed.
- Blast Table: Aggregated view of the OR sequences, built with the BLAST algorithm.
- Pair Table: Aggregated view of the OR-molecule pairs, with only one entry per pair.
The Experiment Table can be split into five parts: molecule description, receptor information, response of the experiment, description of the bioassay, and resources.
Molecule:
- Molecule Name: Name of the molecule as reported in the source. [1]
- CID: Compound Identifier from PubChem. [2]
- CAS: CAS Registry Number. [1]
- InChIKey: International Chemical Identifier Key from PubChem. [2,3]
- SMILES: Simplified Molecular-Input Line-Entry System, a text representation of the structure of the molecule. [3,4]
- Mixture: Distinguishes between a blend of multiple molecules (“mixture”), a combination of isomers (“sum of isomers”), and a mono molecular compound (“mono molecular”). [5]
[1] Name and CAS are preferably retained as they appear in the source, although inconsistencies may occur because CAS is a proprietary identifier and names are not standardized.
[2] The International Chemical Identifier Key (InChIKey) serves as a standardized molecular identifier in the M2OR database. It is retrieved from PubChem using the molecular identifiers provided in the articles. Identifiers used by the authors were preferentially kept as stated in the original publication, and other identifiers, such as CID, were inferred from PubChem using the InChIKey.
[3] For mixtures, both InChIKey and SMILES are space-separated lists with one entry per component.
[4] If only the molecular structure is available, a SMILES representation is inferred and then used to search for the corresponding InChIKey. In one study, the authors used a newly synthesized molecule that is identified only by its SMILES string.
[5] The composition of the tested compounds is described by three classes: 'mixture', 'sum of isomers' and 'mono molecular'. The term 'mixture' indicates that the experiments were carried out using a blend of multiple molecules (e.g. essential oils or artificial compositions); mixtures are identified by a space-separated list of the InChIKeys of their components. For all molecules, information on isomerism is carefully researched. The number of chiral centers and geometric isomerism are automatically determined from SMILES using the RDKit package. If a molecule has at least one chiral center and no specific information is provided about the enantiomer or diastereomer tested, the molecule is treated as a combination of isomers and labeled 'sum of isomers'. When the authors explicitly specified the isomer of a tested molecule, it is labeled 'mono molecular'. All achiral molecules are also labeled 'mono molecular'.
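The labeling rule in note [5] can be sketched as a small function. This is only an illustration under stated assumptions: the function name and its parameters are hypothetical, the number of chiral centers is taken as a precomputed input (in M2OR it is derived from SMILES with RDKit), and `stereo_specified` stands for whether the authors stated which isomer was tested.

```python
def classify_composition(inchikey_field, n_chiral_centers, stereo_specified):
    """Return 'mixture', 'sum of isomers' or 'mono molecular' (hypothetical helper)."""
    # A mixture lists one InChIKey per component, separated by spaces.
    if len(inchikey_field.split()) > 1:
        return "mixture"
    # Chiral molecule for which the source gives no stereochemical information.
    if n_chiral_centers > 0 and not stereo_specified:
        return "sum of isomers"
    # Achiral molecules, or molecules whose isomer was explicitly stated.
    return "mono molecular"
```

The check on the InChIKey field comes first because a mixture is labeled as such regardless of the stereochemistry of its components.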
Receptor:
- Species: Taxon name of the species from which the OR originates.
- Gene Name: Gene name of the given OR, following the nomenclature of Glusman et al.
- Uniprot ID: Identifier in the UniProtKB database.
- Mutation: Mutation introduced by the authors in a given protein sequence. [1]
- Sequence: Protein sequence of the given OR. [2]
[1] For mutated receptors, the wild-type identifier and sequence are retrieved, and the mutation is indicated separately using XpositionY format, where amino acid X from the wild-type sequence is mutated to amino acid Y at the given position. Additions are indicated by setting X to "+" and deletions by setting Y to "-".
[2] The protein sequence serves as a standardized OR identifier in M2OR. If it is not available in the original publication, it is retrieved from UniProt using the name or another identifier provided by the authors.
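As an illustration of the XpositionY notation, the sketch below parses a single mutation token and applies a substitution to a wild-type sequence. It rests on assumptions that are not stated in the source: positions are taken to be 1-based, residues are single uppercase letters, and the function names are hypothetical. Additions and deletions are only recognized, not applied, because the source does not specify exactly where an added residue is placed.

```python
import re

# X, position, Y. X may be "+" (addition); Y may be "-" (deletion).
MUTATION_RE = re.compile(r"^([A-Z+])(\d+)([A-Z-])$")

def parse_mutation(token):
    """Split an XpositionY token into (kind, X, position, Y)."""
    match = MUTATION_RE.match(token)
    if match is None:
        raise ValueError(f"not in XpositionY format: {token!r}")
    x, position, y = match.groups()
    if x == "+":
        kind = "addition"
    elif y == "-":
        kind = "deletion"
    else:
        kind = "substitution"
    return kind, x, int(position), y

def apply_substitution(wild_type, token):
    """Return the mutant sequence for a substitution (assumes 1-based positions)."""
    kind, x, position, y = parse_mutation(token)
    if kind != "substitution":
        raise ValueError("only substitutions are handled in this sketch")
    if wild_type[position - 1] != x:
        raise ValueError(f"wild-type residue at {position} is not {x}")
    return wild_type[:position - 1] + y + wild_type[position:]
```

Checking the wild-type residue before substituting helps catch a mutation that was recorded against a different reference sequence.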
Response:
- Responsive: Binary-encoded response of the pair, with 0 representing non-agonists and 1 representing agonists. [1]
- Parameter: Type of value reported in the Value column: “EC50” for dose-response measurements, “Raw” if the raw response of the receptor is available, or “Norm_rec”, “Norm_pair” and “Norm_other” for responses normalized by the receptor’s baseline, by the response of a given pair, or by an unknown normalization denominator, respectively.
- Value: The value reported by the source, or the EC50 value for dose-response measurements. [2]
- Unit: Unit of the value reported in Value column.
- Concentration: The concentration used for a given screening experiment or the maximum concentration used for a dose-response experiment.
- ConcentrationUnit: Unit of the value reported in Concentration column.
- Nbr. Measurement: Number of repetitions for a given experiment.
[1] Decisions about the response in the Responsive column are made solely by the respective authors. If the authors did not draw a conclusion on responsiveness, a specific workflow is used to determine the response.
[2] Value can be “n.d.” for non-responsive pairs determined in dose-response measurements. When a mixture is tested, the concentration of each of its components is given, when available, as a space-separated list.
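When reading these columns programmatically, the “n.d.” placeholder and the space-separated concentrations of mixtures need special handling. The following is a minimal sketch with hypothetical helper names. It assumes that, when component concentrations are available, they are listed in the same order as the InChIKeys of the mixture, which the source does not state explicitly.

```python
def parse_value(raw):
    """Return the Value as a float, or None for the 'n.d.' placeholder."""
    text = raw.strip()
    if text.lower() == "n.d.":
        return None
    return float(text)

def component_concentrations(inchikey_field, concentration_field):
    """Map each mixture component to its concentration (order is an assumption)."""
    keys = inchikey_field.split()
    concentrations = [float(c) for c in concentration_field.split()]
    if len(keys) != len(concentrations):
        raise ValueError("number of components and concentrations differ")
    return dict(zip(keys, concentrations))
```

A mono molecular entry is simply the one-component case: `component_concentrations("AAA", "30")` gives a single-entry mapping.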
Bioassay:
- Type: The measured quantity, which can be a concentration of cAMP (“CAMP”), Ca2+ (“Ca2+”), or secreted embryonic alkaline phosphatase (“SEAP”); the fluorescence of green fluorescent protein (“GFP”) or the luminescence of luciferase (“Luc”); or the membrane activity measured by intensity (“Intensity”) or conductance (“conductance”).
- Cell line: Type of cell line used for OR expression: HEK293T cells (“HEK”), OR-specific engineered derivatives such as Hana3A cells (“H3A”) and ScL21 (“SCL21”), yeast-based systems, oocytes, olfactory sensory neurons (“OSN”), neuroblastoma x glioma hybrid cells (“NxG108CC15”), or human cancer cells (“HeLa/Olf”).
- Gprotein: The type of G protein used: “Golf”, “Galpha16”, “Galpha15/16”, “Galpha q”.
- Co-transfection: Any protein co-transfected with the olfactory receptors, as mentioned in the source.
- Tag: N-terminal modifications, known as tags: “Il-6-Halotag”, “Flag”, “Rho”, “GFP”, “MYC”, “Rho Lucy”.
- Delivery: Delivery method, such as “liquid” or “gas”.
- Assay: Experimental conditions, such as “in vitro” or “ex vivo”.
- Assay System: The tools used in the assay system.
Resources:
- Reference: Bibliographic reference of the source.
- DOI: Digital Object Identifier (DOI) of the source.
- Reference Position: Specific location of the information about the pair in the source (e.g., Table1, Fig S1...).
The human genome includes approximately 1000 olfactory receptor genes, of which around 60% are considered pseudogenes. Each OR has a distinct recognition spectrum, and alterations in one or more amino acids can significantly change its response. Multiple variants and mutants of the same gene have been tested in the literature, and they are gathered in the M2OR database. For instance, 42 different sequences share more than 99% sequence identity with the sequence annotated as OR1A1 in the UniProt database (i.e. with the reference sequence), and some of these sequences show different responses compared to the reference.
To facilitate the comparison of such cases, similar sequences are grouped under the same reference sequence. The BLAST algorithm is used to compare each sequence in M2OR against the UniProt database (2023 release 03). The name and Uniprot ID of the best match in terms of identity are then associated with each sequence. Sequences are subsequently grouped by the name of their best match, resulting in the Blast Table.
- Gene Name: Gene name of the given OR, following the nomenclature of Glusman et al.
- Uniprot ID: Identifier in the UniProtKB database.
- % Seq Identity: Percentage of identity between the sequence in the Experiment Table and the sequence in UniProt.
- Species: Taxon name of the species from which the OR originates.
- Mutation: Differences in amino acids between the sequence in UniProt and the sequence in M2OR. [1]
- Sequence: Protein sequence of the given OR in the Experiment Table.
- Ref. Sequence: Protein sequence of the given OR in UniProt.
[1] The mutation is indicated using XpositionY format, where amino acid X from the UniProt sequence is mutated to amino acid Y at the given position. Additions are indicated by setting X to "+" and deletions by setting Y to "-".
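The grouping step behind the Blast Table (keep the best match by identity for each sequence, then group sequences by the name of that match) can be sketched as follows. The BLAST search itself is not reproduced here: the sketch assumes the hits are already available as `(query_id, reference_name, percent_identity)` tuples, a layout chosen for illustration only, and the function name is hypothetical.

```python
from collections import defaultdict

def group_by_best_match(hits):
    """Group query sequences under the reference with the highest identity."""
    best = {}
    for query_id, reference_name, identity in hits:
        # Keep only the highest-identity hit for each query sequence.
        if query_id not in best or identity > best[query_id][1]:
            best[query_id] = (reference_name, identity)
    groups = defaultdict(list)
    for query_id, (reference_name, identity) in best.items():
        groups[reference_name].append((query_id, identity))
    return dict(groups)

# Made-up identifiers and identity values, for illustration only.
hits = [
    ("seq1", "OR1A1", 99.7),
    ("seq1", "OR1A2", 84.0),
    ("seq2", "OR1A1", 100.0),
    ("seq3", "OR2W1", 98.1),
]
groups = group_by_best_match(hits)
```

Here `seq1` and `seq2` end up under the same reference name, which is the situation described above for variants of a single gene.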
Multiple experiments for the same OR-molecule pair can be found in the Experiment Table. However, users are often interested in an aggregated view of the OR-molecule pairs for applications such as analysis of the combinatorial code or prediction of new active pairs. The Pair Table is created to provide a consensus responsiveness for each OR-molecule pair.
Molecule:
- Molecule Name: Name of the molecule as reported in the source. [1]
- CID: Compound Identifier from PubChem. [2]
- CAS: CAS Registry Number. [1]
- InChIKey: International Chemical Identifier Key from PubChem. [2,3]
- SMILES: Simplified Molecular-Input Line-Entry System, a text representation of the structure of the molecule. [3,4]
- Mixture: Distinguishes between a blend of multiple molecules (“mixture”), a combination of isomers (“sum of isomers”), and a mono molecular compound (“mono molecular”). [5]
[1] Name and CAS are preferably retained as they appear in the source, although inconsistencies may occur because CAS is a proprietary identifier and names are not standardized.
[2] The International Chemical Identifier Key (InChIKey) serves as a standardized molecular identifier in the M2OR database. It is retrieved from PubChem using the molecular identifiers provided in the articles. Identifiers used by the authors were preferentially kept as stated in the original publication, and other identifiers, such as CID, were inferred from PubChem using the InChIKey.
[3] For mixtures, both InChIKey and SMILES are space-separated lists with one entry per component.
[4] If only the molecular structure is available, a SMILES representation is inferred and then used to search for the corresponding InChIKey. In one study, the authors used a newly synthesized molecule that is identified only by its SMILES string.
[5] The composition of the tested compounds is described by three classes: 'mixture', 'sum of isomers' and 'mono molecular'. The term 'mixture' indicates that the experiments were carried out using a blend of multiple molecules (e.g. essential oils or artificial compositions); mixtures are identified by a space-separated list of the InChIKeys of their components. For all molecules, information on isomerism is carefully researched. The number of chiral centers and geometric isomerism are automatically determined from SMILES using the RDKit package. If a molecule has at least one chiral center and no specific information is provided about the enantiomer or diastereomer tested, the molecule is treated as a combination of isomers and labeled 'sum of isomers'. When the authors explicitly specified the isomer of a tested molecule, it is labeled 'mono molecular'. All achiral molecules are also labeled 'mono molecular'.
Receptor:
- Species: Taxon name of the species from which the OR originates.
- Gene Name: Gene name of the given OR, following the nomenclature of Glusman et al.
- Uniprot ID: Identifier in the UniProtKB database.
- Mutation: Mutation introduced by the authors in a given protein sequence. [1]
- Sequence: Protein sequence of the given OR. [2]
[1] For mutated receptors, the wild-type identifier and sequence are retrieved, and the mutation is indicated separately using XpositionY format, where amino acid X from the wild-type sequence is mutated to amino acid Y at the given position. Additions are indicated by setting X to "+" and deletions by setting Y to "-".
[2] The protein sequence serves as a standardized OR identifier in M2OR. If it is not available in the original publication, it is retrieved from UniProt using the name or another identifier provided by the authors.
Response:
- Responsive: Consensus binary-encoded responsiveness of the pair, with 0 representing non-agonists and 1 representing agonists. [1]
- Data Quality: Reliability of the responsiveness decision. Dose-response experiments (“EC50”) are the most reliable. For screening data, “secondaryScreening” indicates that the pair was tested at two or more distinct concentrations and is less reliable than “EC50”. Finally, “primaryScreening” is the least reliable and indicates that the pair was tested at a single concentration.
- Number of Unique Value Screen: The number of distinct concentrations tested for a given pair.
[1] Consensus responsiveness relies on the following decisions: (1) Dose-response measurements are prioritized over screening data. When multiple dose-response measurements for a given pair report contradictory responsiveness, the pair is discarded. (2) For screening data, pairs that are responsive at low concentrations but not at higher concentrations are excluded, as are pairs that exhibit inconsistencies at the same concentration. (3) A consistent screening pair is considered responsive if it is responsive at one or more concentrations.
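The decision rules in note [1], together with the Data Quality labels, can be sketched for a single OR-molecule pair as shown below. This is an illustrative reading of the rules, not the code used to build M2OR. It assumes each experiment is given as a hypothetical `(parameter, concentration, responsive)` tuple, that dose-response experiments are those with parameter "EC50", and that all concentrations of a pair are expressed in the same unit.

```python
def consensus_responsiveness(experiments):
    """Return (responsive, data_quality), or None if the pair is discarded."""
    if not experiments:
        raise ValueError("at least one experiment is required")

    # (1) Dose-response measurements take priority over screening data.
    dose_calls = {resp for param, _, resp in experiments if param == "EC50"}
    if dose_calls:
        if len(dose_calls) > 1:
            return None  # contradictory dose-response measurements
        return dose_calls.pop(), "EC50"

    # (2) Screening: collect the calls made at each concentration.
    calls_by_conc = {}
    for _, conc, resp in experiments:
        calls_by_conc.setdefault(conc, set()).add(resp)
    if any(len(calls) > 1 for calls in calls_by_conc.values()):
        return None  # inconsistent at the same concentration

    # Discard pairs responsive at a low concentration but not at a higher one.
    ordered = [next(iter(calls_by_conc[c])) for c in sorted(calls_by_conc)]
    seen_responsive = False
    for resp in ordered:
        if resp == 1:
            seen_responsive = True
        elif seen_responsive:
            return None

    # (3) Responsive if responsive at one or more concentrations.
    quality = "secondaryScreening" if len(calls_by_conc) >= 2 else "primaryScreening"
    return int(any(ordered)), quality
```

Returning `None` for discarded pairs keeps them distinguishable from non-responsive pairs, which are returned with a responsiveness of 0.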