Improve meta-data standardization of multiple Massbank records by bachi55 · Pull Request #141 · MassBank/MassBank-data

bachi55 · 2020-09-28T09:18:37Z

Hi,

I am opening this pull request to include consistency fixes I made to multiple MassBank records. As the changes concern many entries, I want to explain some of them:

Specifying SOLVENT A and SOLVENT B based on the SOLVENT information. Some of the raw data do not specify the A and B solvents separately. However, sometimes this information is available, and I suggest changes like this (see original diff):

 AC$CHROMATOGRAPHY: COLUMN_NAME Symmetry C18 Column, Waters
 AC$CHROMATOGRAPHY: FLOW_GRADIENT 0min:5%, 24min:95%, 28min:95%, 28.1:5% (acetonitrile)
 AC$CHROMATOGRAPHY: FLOW_RATE 0.3 ml/min 
 AC$CHROMATOGRAPHY: RETENTION_TIME 508.262 s
-AC$CHROMATOGRAPHY: SOLVENT CH3CN(0.1%HCOOH)/ H2O(0.1%HCOOH)
+AC$CHROMATOGRAPHY: SOLVENT A H2O(0.1%HCOOH)
+AC$CHROMATOGRAPHY: SOLVENT B CH3CN(0.1%HCOOH)

PS: Here, I assume that on a reversed-phase LC column water (H2O) is typically denoted as solvent A. Please correct me if I am wrong. I think it does not make much of a difference, as long as the gradient information specifies which percentages / fractions belong to which solvent.

Fix external link for PubChem. In the majority of MassBank records the PubChem CID is specified as follows: CH$LINK: PUBCHEM CID:392323. Some entries, however, lacked the : or had a space after it. I normalized these to simplify downstream parsing. For example, in EQ302601 or EQ305901:
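A minimal sketch of how such a split could be scripted, under the same assumption that the aqueous component is solvent A. The function name `split_solvent` and the list of aqueous markers are hypothetical and not taken from the scripts used for this pull request. Lines that cannot be split unambiguously are returned unchanged so that they can be checked by hand.

```python
import re

PREFIX = "AC$CHROMATOGRAPHY: SOLVENT "

def split_solvent(line, aqueous_markers=("h2o", "water")):
    """Split a combined SOLVENT line into SOLVENT A / SOLVENT B lines.

    Assumption: the aqueous part is solvent A. The line is returned
    unchanged if it is not a SOLVENT line, is already split, or does
    not consist of exactly two parts of which exactly one is aqueous.
    """
    if not line.startswith(PREFIX):
        return [line]
    value = line[len(PREFIX):].strip()
    if re.match(r"[AB]\s", value):  # already "SOLVENT A ..." or "SOLVENT B ..."
        return [line]
    parts = [p.strip() for p in value.split("/")]
    if len(parts) != 2:
        return [line]
    aqueous = [p for p in parts if p.lower().startswith(aqueous_markers)]
    if len(aqueous) != 1:
        return [line]
    solvent_a = aqueous[0]
    solvent_b = parts[1] if parts[0] == solvent_a else parts[0]
    return [f"{PREFIX}A {solvent_a}", f"{PREFIX}B {solvent_b}"]

for out in split_solvent("AC$CHROMATOGRAPHY: SOLVENT CH3CN(0.1%HCOOH)/ H2O(0.1%HCOOH)"):
    print(out)
```

Returning ambiguous lines untouched is deliberate: a solvent description containing something like "v/v" yields more than two parts and is left for manual curation.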

-CH$LINK: PUBCHEM CID: 10130527
+CH$LINK: PUBCHEM CID:10130527

or

-CH$LINK: PUBCHEM CID 107807
+CH$LINK: PUBCHEM CID:107807
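The two variants above can be normalized with a single regular expression. This is only a sketch; the function name `normalize_pubchem_link` is hypothetical, and it deliberately leaves any line it does not recognize (for example a PubChem SID link) untouched.

```python
import re

# Matches "CID" followed by a colon or whitespace, optional spaces, then digits.
_PUBCHEM_CID = re.compile(r"^(CH\$LINK: PUBCHEM) CID[:\s]\s*(\d+)\s*$")

def normalize_pubchem_link(line):
    """Rewrite a PubChem CID link line to the 'CID:<number>' form."""
    match = _PUBCHEM_CID.match(line)
    if match is None:
        return line
    return f"{match.group(1)} CID:{match.group(2)}"

print(normalize_pubchem_link("CH$LINK: PUBCHEM CID: 10130527"))
print(normalize_pubchem_link("CH$LINK: PUBCHEM CID 107807"))
```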

Unify LC column names within each contributor. Let's consider, for example, the Eawag_Additional_Specs contributor and take a look at the LC column specifications for two records:

ET260104: AC$CHROMATOGRAPHY: COLUMN_NAME XBridge C18 3.5 um, 2.1x50 mm, Waters with guard column
ETS00128: AC$CHROMATOGRAPHY: COLUMN_NAME X-bridge C18, 3.5um, 2.1x50mm, Waters

Apparently the same LC column was used for both records (only one is specified with a "guard column"). However, there are small inconsistencies in the naming, e.g. X-bridge vs. XBridge. I fixed this in the following way (ET260104 and ETS00128, respectively):

-AC$CHROMATOGRAPHY: COLUMN_NAME XBridge C18 3.5 um, 2.1x50 mm, Waters with guard column
+AC$CHROMATOGRAPHY: COLUMN_NAME XBridge C18 3.5um, 2.1x50mm, Waters with guard column

respectively

-AC$CHROMATOGRAPHY: COLUMN_NAME X-bridge C18, 3.5um, 2.1x50mm, Waters
+AC$CHROMATOGRAPHY: COLUMN_NAME XBridge C18 3.5um, 2.1x50mm, Waters

As a result, both column names are equal: "COLUMN_NAME XBridge C18 3.5um, 2.1x50mm, Waters" (not considering the "with guard column" suffix).

Adding the retention time unit. Where not provided, I added the retention time unit for some entries. Here, I assume that retention times (without unit) over 100 are most likely seconds. See for example PN000023:
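The three rewrites visible in these two diffs (spelling of XBridge, no space between a number and um/mm, no comma after C18) could be expressed as a few substitutions. This is a sketch covering only these cases, not a general column-name normalizer; the function name `normalize_column_name` is hypothetical.

```python
import re

def normalize_column_name(name):
    """Apply the three rewrites shown in the diffs above to a column name."""
    # "X-bridge", "X-Bridge", "Xbridge" -> "XBridge"
    name = re.sub(r"\bX-?bridge\b", "XBridge", name, flags=re.IGNORECASE)
    # "3.5 um" -> "3.5um", "2.1x50 mm" -> "2.1x50mm"
    name = re.sub(r"(\d)\s+(um|mm)\b", r"\1\2", name)
    # "C18, 3.5um" -> "C18 3.5um"
    name = re.sub(r"\bC18,", "C18", name)
    return name

print(normalize_column_name("XBridge C18 3.5 um, 2.1x50 mm, Waters with guard column"))
print(normalize_column_name("X-bridge C18, 3.5um, 2.1x50mm, Waters"))
```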

-AC$CHROMATOGRAPHY: RETENTION_TIME 260.387
+AC$CHROMATOGRAPHY: RETENTION_TIME 260.387 sec
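The heuristic can be sketched as follows. The threshold of 100 is the assumption stated above, not a MassBank rule, and the function name `add_rt_unit` is hypothetical. Unit-less values at or below the threshold are left unchanged because they remain ambiguous.

```python
import re

# A RETENTION_TIME line whose value is a bare number without any unit.
_RT_NO_UNIT = re.compile(r"^(AC\$CHROMATOGRAPHY: RETENTION_TIME) (\d+(?:\.\d+)?)\s*$")

def add_rt_unit(line, threshold=100.0):
    """Append 'sec' to unit-less retention times above the threshold."""
    match = _RT_NO_UNIT.match(line)
    if match and float(match.group(2)) > threshold:
        return f"{match.group(1)} {match.group(2)} sec"
    return line

print(add_rt_unit("AC$CHROMATOGRAPHY: RETENTION_TIME 260.387"))
```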

Deprecate some KWR records. I found some inconsistencies, e.g. a PubChem ID mismatch with the provided molecule name or CAS. For example, in KW103103:

+DEPRECATED: 2020-09-09 wrong annotation (PubChem ID, Name and CAS are not consistent)

General comment:
I understand that the proposed changes concern very many files, making this pull request challenging to review. However, my main aim was to build a local database of MassBank that allows me to work with MS2 and RTs. Therefore, I am specifically interested in knowing the experimental conditions (LC and MS). For that, column names, for example, need to be standardized in order to perform proper grouping and to have less noisy meta-data. Eventually, future releases could enforce some kind of standard format, e.g. for column names (I often found that the column was most likely the same, but slight variations in the naming make matching harder).
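To illustrate the grouping use case, here is a small sketch that groups records by their exact COLUMN_NAME string, which only works well once the names are standardized. The input format (a dict from accession to record text), the helper names, and the accession HYPO0001 are assumptions made for this example.

```python
from collections import defaultdict

COLUMN_PREFIX = "AC$CHROMATOGRAPHY: COLUMN_NAME "

def column_name(record_text):
    """Return the COLUMN_NAME value of a record, or None if it has none."""
    for line in record_text.splitlines():
        if line.startswith(COLUMN_PREFIX):
            return line[len(COLUMN_PREFIX):].strip()
    return None

def group_by_column(records):
    """Group accessions by their exact column name string."""
    groups = defaultdict(list)
    for accession, text in records.items():
        groups[column_name(text)].append(accession)
    return dict(groups)

# HYPO0001 is a made-up accession used only for illustration.
records = {
    "ETS00128": "AC$CHROMATOGRAPHY: COLUMN_NAME XBridge C18 3.5um, 2.1x50mm, Waters\n",
    "HYPO0001": "AC$CHROMATOGRAPHY: COLUMN_NAME XBridge C18 3.5um, 2.1x50mm, Waters\n",
}
print(group_by_column(records))
```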

Best regards,

Eric

meier-rene · 2020-09-28T10:46:05Z

Thank you for your contribution. I will contact you in the next few days once I'm back in the office.

meier-rene · 2020-09-29T08:41:40Z

First, I will merge your PR into a different branch and then cherry-pick the commits I really like. For the remaining ones I will start a discussion here soon.

tsufz · 2020-10-02T08:56:24Z

@bachi55, thanks a lot for your curation efforts! However, I am a little critical of modifying records for which we don't have the original information. Contributors should curate their records themselves as far as possible, or help with it. Assumptions about the experimental conditions and the mass spectral data itself cannot be made without interaction with the data providers.

If they are not available, we need to decide whether this is a minor issue (such as the seconds / minutes in the gradients) or a major one, i.e. an obvious error in the MS data. In the latter case, the only way is to deprecate the record.

Other things can be curated automatically such as links to other databases etc.

Best,
Tobias

meier-rene · 2020-10-02T09:00:57Z

@tsufz That's why I'm reviewing the changes and making sure that the original literature is taken into account. When I'm done I will give some more information. @bachi55 I really appreciate your effort. More on this topic later...

meier-rene · 2020-10-19T12:05:54Z

Hi @bachi55,
I promised to give you some feedback. Most of the modifications you proposed are now integrated and I really appreciate your effort. Next time :) please do things a little differently to make my work a bit easier:

Always make your pull requests against the dev branch.
Make your pull requests smaller and, if you make scripted changes, put just one particular type of change into each PR.
Please do not put any scripts in the repo. Instead, just put the script in the pull request comment. I will delete all scripts in the data repo anyway.
If you have looked up anything in the original literature, please also put the reference either in the commit message or in the PR comments.
I will not mark any records as deprecated because a PubChem ID is not correct. There are hundreds or thousands of errors of this or a similar type. As long as the InChI, SMILES and chemical formula are in agreement, it's OK to me.

tsufz · 2020-10-19T12:12:20Z

Dear @bachi55,
I fully support @meier-rene's opinion. The only reason to deprecate a record is concern about the quality of the mass spectral information. Everything else can be fixed or enriched, or is of minor importance (such as missing scan ranges or information on chromatography conditions, which are nice to have but not as substantial as the mass spectral data quality).

Best,
Tobias

tsufz · 2020-10-19T12:59:49Z

Ah, sorry, I was imprecise. Of course the identity of a record must be clear and the mass spectra should fit the structure. However, a wrong ID from an external database or a mismatch in the meta-data is not a reason to deprecate a record. Of course, it is a reason to take care of the records and to curate the issues, either through our own curation if possible or by involving the data providers (as said before). Your work is really appreciated @bachi55, but please open issues to discuss your findings before starting activities. This is also important with regard to transparency and the FAIR principles.

bachi55 · 2020-10-21T13:29:36Z

Hei,

I apologize that the pull request was very hard to review, as it included very many changes. Thanks for making the effort and including some of the proposed changes!

I can comment on the deprecation of entries. I agree that this is quite a strict choice. However, I use MassBank to develop machine learning algorithms, and if I see an obvious inconsistency in the data, then I might simply remove it. Having two inconsistent identifiers, e.g. PubChem vs. ChemSpider ID or PubChem vs. molecule name, makes me lose trust in all structural annotations, as I cannot know which source was used to input the SMILES, InChIs etc. But of course, I can also exclude these spectra from my personal pipelines in another way, which does not affect the whole MassBank repository :).

Regarding the FAIR principles: I think the process we have here is transparent and invites people to correct or reject my changes.

Again, thanks for your effort @meier-rene. I will take the advice from both of you (@meier-rene and @tsufz) into account when proposing changes in the future.

Best regards,

Eric

tsufz · 2020-10-21T13:36:00Z

Dear @bachi55,
Thanks a lot for your last comment. Your work is very much appreciated! The training aspect (real, not ML) is also of importance. Therefore it is good to also raise issues with the contributors, in order to give them a chance to learn about their mass spectra and more about data science.

Best wishes,
Tobias

meier-rene and others added 27 commits May 5, 2020 11:59

Bumped version number to 2020.05

6e5430e

Merge pull request MassBank#126 from MassBank/release-2020.05

dd6c34e

Release version 2020.05

Bumped version number to 2020.06

f38bcf4

Merge branch 'master' into release-2020.06

a702902

Merge pull request MassBank#128 from MassBank/release-2020.06

e07dc9f

Release version 2020.06

Bumped version number to 2020.09

312d8cd

Merge branch 'main' into release-2020.09

ab33ad2

Merge pull request MassBank#138 from MassBank/release-2020.09

48aac71

Release version 2020.09

Fix 'Athens_Univ'

fa412ed

Remove execution bit for some spec-files in 'MSSJ'

8c78cea

Fix encoding issue in 'MSJ00148.txt'

3f87a58

Specify solvent A and B for 'MPI_for_Chemical_Ecology'

2970fd0

Add 'ETS' prefix for the 'Eawag_Additional_Specs' contributor

ef7aaf3

Unify column descriptor string for 'Eawag_Additional_Specs'

7c65aa3

Specify solvent A and B for 'Fiocruz'

1e522d8

Specify solvent A and B for 'Fukuyama_Univ'

e0c3ebc

Specify solvent A and B for 'NAIST'

015b704

Unify column name for 'NaToxAq'

a588a72

Specify solvent A and B for 'IPB_Halle' and add time unit for the PN* entries (sec)

45c10fb

Specify solvent A and B for 'RIKEN'

6b33ccc

Specify solvent A and B for 'Univ_Toyama' and add time unit for the some entries (sec)

fe8fc37

Specify solvent A and B for 'MSSJ'

e9db524

Fix solvent information in 'Athens_Univ'

2cda10e

fix pubchem link line for some entries

a2cf5be

fix pubchem link line in one Univ_Toyama entry

ac6af46

fix some Eawag entries: pubchem link

6c85954

[KWR] Add deprecation information: Annotation seems to be not correct.

e913a54

bachi55 changed the title ~~Fix entry format~~ Improve meta-data standardization of multiple Massbank records Sep 28, 2020

meier-rene changed the base branch from main to bachi55_fix_entry_format September 29, 2020 08:46

Merge branch 'bachi55_fix_entry_format' into fix_entry_format

d0bc386

meier-rene merged commit d60bdaa into MassBank:bachi55_fix_entry_format Sep 29, 2020

meier-rene merged 28 commits into MassBank:bachi55_fix_entry_format from bachi55:fix_entry_format
