Improve meta-data standardization of multiple Massbank records #141

Merged
meier-rene merged 28 commits into MassBank:bachi55_fix_entry_format from bachi55:fix_entry_format
Sep 29, 2020

Conversation

@bachi55 bachi55 commented Sep 28, 2020

Hi,

I am opening this pull request to contribute consistency fixes I made to multiple MassBank records. As the changes concern many entries, I want to explain some of them:

  1. Specify SOLVENT A and SOLVENT B based on the SOLVENT information. Some of the raw data does not specify the A and B solvents separately. However, sometimes this information is available, and I suggest changes like this (see original diff):
 AC$CHROMATOGRAPHY: COLUMN_NAME Symmetry C18 Column, Waters
 AC$CHROMATOGRAPHY: FLOW_GRADIENT 0min:5%, 24min:95%, 28min:95%, 28.1:5% (acetonitrile)
 AC$CHROMATOGRAPHY: FLOW_RATE 0.3 ml/min 
 AC$CHROMATOGRAPHY: RETENTION_TIME 508.262 s
-AC$CHROMATOGRAPHY: SOLVENT CH3CN(0.1%HCOOH)/ H2O(0.1%HCOOH)
+AC$CHROMATOGRAPHY: SOLVENT A H2O(0.1%HCOOH)
+AC$CHROMATOGRAPHY: SOLVENT B CH3CN(0.1%HCOOH)

PS: Here, I assume that for a reversed-phase LC column, water (H2O) is typically denoted as solvent A. Please correct me if I am wrong. I think it does not make much of a difference, as long as the gradient information specifies which percentages / fractions belong to which solvent.
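
A minimal sketch of how the solvent split above could be scripted. It is not the script used in this PR; the function name `split_solvent_line` is hypothetical, and it assumes exactly two solvents separated by `/`, with the one starting with `H2O` taken as solvent A (the same assumption as in the PS). Any line that does not fit this pattern is returned unchanged.

```python
import re

PREFIX = "AC$CHROMATOGRAPHY: SOLVENT "


def split_solvent_line(line):
    """Split a combined SOLVENT line into SOLVENT A / SOLVENT B lines.

    Assumption (not from the MassBank spec): two solvents separated by '/',
    and the one starting with 'H2O' is solvent A.
    Returns a list of lines; the input line alone if nothing applies.
    """
    if not line.startswith(PREFIX):
        return [line]
    value = line[len(PREFIX):]
    # Leave lines alone that already name solvent A or B.
    if re.match(r"[AB]\s", value):
        return [line]
    parts = [p.strip() for p in value.split("/")]
    if len(parts) != 2:
        return [line]
    aqueous = [p for p in parts if p.upper().startswith("H2O")]
    if len(aqueous) != 1:
        # Ambiguous: none or both solvents look aqueous.
        return [line]
    solvent_a = aqueous[0]
    solvent_b = parts[1 - parts.index(solvent_a)]
    return [PREFIX + "A " + solvent_a, PREFIX + "B " + solvent_b]
```

Ambiguous lines (no aqueous part, or more than two parts) are deliberately skipped so that they can be checked by hand against the original publication.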

  2. Fix external link for PubChem. It seems that in the majority of MassBank records the PubChem CID is specified as follows: CH$LINK: PUBCHEM CID:392323. Some of the entries, however, lacked the : or had a space after it. I normalized these to simplify downstream parsing. For example, in EQ302601 or EQ305901:
-CH$LINK: PUBCHEM CID: 10130527
+CH$LINK: PUBCHEM CID:10130527

or

-CH$LINK: PUBCHEM CID 107807
+CH$LINK: PUBCHEM CID:107807
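
A regular expression is probably enough for this kind of fix. The sketch below is an illustration, not the script used in this PR; `normalize_pubchem_cid` is a hypothetical name, and it only touches lines that consist of a single CID, leaving every other line as it is.

```python
import re

# "CID" followed by an optional colon and optional spaces, then digits only.
_CID_LINE = re.compile(r"^(CH\$LINK: PUBCHEM CID)\s*:?\s*(\d+)\s*$")


def normalize_pubchem_cid(line):
    """Rewrite 'CID 123' and 'CID: 123' as 'CID:123'; keep other lines."""
    match = _CID_LINE.match(line)
    if match is None:
        return line
    return f"{match.group(1)}:{match.group(2)}"
```
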
  3. Unify LC column names within each contributor. Let's consider, for example, the Eawag_additional_species contributor and take a look at the LC column specifications for two records:
  • ET260104: AC$CHROMATOGRAPHY: COLUMN_NAME XBridge C18 3.5 um, 2.1x50 mm, Waters with guard column
  • ETS00128: AC$CHROMATOGRAPHY: COLUMN_NAME X-bridge C18, 3.5um, 2.1x50mm, Waters

For both records, apparently the same LC column was used (only one is specified with a "guard column"). However, there are small inconsistencies in the naming, e.g. X-bridge vs. XBridge. I fixed this in the following way (ET260104 and ETS00128, respectively):

-AC$CHROMATOGRAPHY: COLUMN_NAME XBridge C18 3.5 um, 2.1x50 mm, Waters with guard column
+AC$CHROMATOGRAPHY: COLUMN_NAME XBridge C18 3.5um, 2.1x50mm, Waters with guard column

and

-AC$CHROMATOGRAPHY: COLUMN_NAME X-bridge C18, 3.5um, 2.1x50mm, Waters
+AC$CHROMATOGRAPHY: COLUMN_NAME XBridge C18 3.5um, 2.1x50mm, Waters

As a result, both column names are equal, "XBridge C18 3.5um, 2.1x50mm, Waters" (ignoring the "with guard column" addition).
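
The rewrite rules visible in the two diffs above can be sketched as follows. This is only an illustration covering this one column family, not a general column-name normalizer and not the script used in this PR; both function names are hypothetical.

```python
import re

GUARD_SUFFIX = " with guard column"


def normalize_column_name(name):
    """Apply the three rewrites seen in the ET260104 / ETS00128 diffs."""
    # "X-bridge", "X-Bridge", "Xbridge" -> "XBridge"
    name = re.sub(r"\bX-?bridge\b", "XBridge", name, flags=re.IGNORECASE)
    # "3.5 um" -> "3.5um", "2.1x50 mm" -> "2.1x50mm"
    name = re.sub(r"(\d)\s+(um|mm)\b", r"\1\2", name)
    # "C18, 3.5um" -> "C18 3.5um"
    name = re.sub(r"\b(C18),", r"\1", name)
    return name


def column_group_key(name):
    """Key for grouping records by column, ignoring the guard-column note."""
    key = normalize_column_name(name)
    if key.endswith(GUARD_SUFFIX):
        key = key[: -len(GUARD_SUFFIX)]
    return key
```

With these rules, `column_group_key` returns the same string for both records, which is what a downstream grouping by column would need.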

  4. Add the retention time unit. Where it was not provided, I added the retention time unit for some entries. Here, I assume that retention times (without a unit) over 100 are most likely in seconds. See, for example, PN000023:
-AC$CHROMATOGRAPHY: RETENTION_TIME 260.387
+AC$CHROMATOGRAPHY: RETENTION_TIME 260.387 sec
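
The heuristic above can be written down in a few lines. This is a sketch under the stated assumption (unitless values over 100 are seconds), not the script used in this PR; the name `add_rt_unit` and the `threshold` parameter are hypothetical. Values at or below the threshold stay untouched, since they cannot be assigned a unit without asking the data provider.

```python
import re

# Matches only a bare number after RETENTION_TIME, i.e. no unit present.
_RT_NO_UNIT = re.compile(
    r"^(AC\$CHROMATOGRAPHY: RETENTION_TIME) (\d+(?:\.\d+)?)\s*$"
)


def add_rt_unit(line, threshold=100.0):
    """Append 'sec' to unitless retention times above the threshold."""
    match = _RT_NO_UNIT.match(line)
    if match and float(match.group(2)) > threshold:
        return f"{match.group(1)} {match.group(2)} sec"
    return line
```
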
  5. Deprecate some KWR records. I found some inconsistencies, e.g. a PubChem ID mismatch with the provided molecule name or CAS. For example, in KW103103:
+DEPRECATED: 2020-09-09 wrong annotation (PubChem ID, Name and CAS are not consistent)

General comment:
I understand that the proposed changes touch a great many files, which makes this pull request challenging to review. However, my main goal was to build a local MassBank database that allows me to work with MS2 spectra and RTs. I am therefore specifically interested in knowing the experimental conditions (LC and MS). For that, column names, for example, need to be standardized so that records can be grouped properly and the meta-data is less noisy. Eventually, future releases could enforce some kind of standard format, e.g. for column names (I often found that the column was most likely the same, but slight variations in the naming made matching harder).

Best regards,

Eric

meier-rene and others added 27 commits May 5, 2020 11:59
@bachi55 bachi55 changed the title Fix entry format Improve meta-data standardization of multiple Massbank records Sep 28, 2020
@meier-rene
Contributor

Thank you for your contribution. I will contact you in the next few days, once I'm back in the office.

@meier-rene
Contributor

First, I will merge your PR into a different branch and then cherry-pick the commits I really like. For the remaining ones, I will start a discussion here soon.

@meier-rene meier-rene changed the base branch from main to bachi55_fix_entry_format September 29, 2020 08:46
@meier-rene meier-rene merged commit d60bdaa into MassBank:bachi55_fix_entry_format Sep 29, 2020
@tsufz
Member

tsufz commented Oct 2, 2020

@bachi55, thanks a lot for your curation efforts! However, I am a little critical of changing records for which we don't have the original information. Data providers should curate their records themselves as far as possible, or help with it. Assumptions about the experimental conditions and the mass spectral data itself cannot be made without interacting with the data providers.

If they are not available, we need to decide whether this is a minor issue (such as seconds vs. minutes in the gradients) or a major one, i.e. an obvious error in the MS data. In the latter case, the only way is to deprecate the record.

Other things can be curated automatically such as links to other databases etc.

Best,
Tobias

@meier-rene
Contributor

@tsufz That's why I'm reviewing the changes and making sure that the original literature is taken into account. When I'm done, I will give some more information. @bachi55 I really appreciate your effort. More on this topic later...

@meier-rene
Contributor

Hi @bachi55,
I promised to give you some feedback. Most of the modifications you proposed are now integrated, and I really appreciate your effort. Next time :) please do a few things differently to make my work a little easier.

  1. Always make your pull requests against the dev branch.
  2. Make your pull requests smaller, and put just one particular type of change into each PR if you make scripted changes.
  3. Please do not put any scripts in the repo. Instead, just put the script in the pull request comment. I will delete all scripts in the data repo anyway. If you have looked up anything in the original literature, please also put the reference either in the commit message or in the PR comments.
  4. I will not mark any records as deprecated because a PubChem ID is not correct. There are hundreds or thousands of errors of this or a similar type. As long as the InChI, SMILES and chemical formula are in agreement, it's OK to me.

@tsufz
Member

tsufz commented Oct 19, 2020

Dear @bachi55,
I fully support @meier-rene's opinion. The only reason to deprecate a record is concerns about the quality of the mass spectral information. Everything else can be fixed or enriched, or is of minor importance (such as missing scan ranges or information on chromatography conditions, which are nice to have but not as substantial as the mass spectral data quality).

Best,
Tobias

@tsufz
Member

tsufz commented Oct 19, 2020

Ah, sorry, I was imprecise. Of course the identity of a record must be clear, and the mass spectra should fit the structure. However, a wrong ID from an external database or mismatches in the meta-data are not a reason to deprecate a record. Of course, they are a reason to take care of the records and to curate the issues, either through our own curation if possible or by involving the data providers (as said before). Your work is really appreciated, @bachi55, but please open issues to discuss your findings before starting such activities. This is also important with regard to transparency and the FAIR principles.

@bachi55
Author

bachi55 commented Oct 21, 2020

Hei,

I apologize that the pull request was very hard to review, as it included so many changes. Thanks for making the effort and including some of the proposed changes!

I can comment on the deprecation of entries. I agree that this is quite a strict choice. However, I use MassBank to develop machine learning algorithms, and if I see an obvious inconsistency in the data, then I might simply remove the record. Having two conflicting identifiers, e.g. PubChem vs. ChemSpider ID or PubChem ID vs. molecule name, makes me lose trust in all structural annotations, as I cannot know which source was used to derive the SMILES, InChI etc. But of course, I can also exclude these spectra from my personal pipelines in another way, one that does not affect the whole MassBank repository :).

Regarding the FAIR principle: I think the process we have here is transparent and invites people to correct or reject my changes.

Again, thanks for your effort, @meier-rene. I will take the advice from both of you (@meier-rene and @tsufz) into account when proposing changes in the future.

Best regards,

Eric

@tsufz
Member

tsufz commented Oct 21, 2020

Dear @bachi55,
Thanks a lot for your last comment. Your work is very much appreciated! The training aspect (real training, not ML) is also of importance. Therefore, it is good to also raise issues with the contributors, to give them a chance to learn about their mass spectra and more about data science.

Best wishes,
Tobias

4 participants