Improve meta-data standardization of multiple Massbank records #141
Release version 2020.05
Release version 2020.06
Release version 2020.09
…ome entries (sec)
|
Thank you for your contribution. I will contact you in the next few days once I'm back in the office. |
|
First I will merge your PR into a different branch and then cherry-pick the commits I really like. For the remaining ones I will start a discussion here soon. |
|
@bachi55, thanks a lot for your curation efforts! However, I am a little critical of grabbing records where we don't have the original information. People should curate their records as much as possible by themselves or help with it. Assumptions about the experimental conditions and the mass spectral data itself cannot be made without interaction with the data providers. If they are not available, we need to decide whether this is a minor issue (such as seconds vs. minutes in the gradients) or a major one, that is, an obvious error in the MS data. In the latter case, the only way is to deprecate the record. Other things, such as links to other databases, can be curated automatically. Best, |
|
Hi @bachi55,
|
|
Dear @bachi55, Best, |
|
Ah, sorry, I was imprecise. Of course the identity of a record must be clear and the mass spectra should fit the structure. However, a wrong ID from an external database or a mismatch in the metadata is not a reason to deprecate a record. Of course, it is a reason to take care of the records and to curate the issues, either by our own curation if possible or by involving the data providers (as said before). Your work is really appreciated @bachi55, but please open issues to discuss your findings before starting activities. This is also important with regard to transparency and the FAIR principles. |
|
Hei, I apologize that the pull request was very hard to review, as it included very many changes. Thanks for taking the effort and including some of the proposed changes! I can comment on the deprecation of entries. I agree that this is quite a strict choice. However, I use Massbank to develop machine learning algorithms, and if I see an obvious inconsistency in the data, then I might simply remove it. Having two inconsistent identifiers, e.g. PubChem vs. ChemSpider ID or PubChem vs. molecule name, makes me lose trust in all structural annotations, as I cannot know which source was used to input the SMILES, InChIs etc. But of course, I can also exclude these spectra from my personal pipelines in another way, which does not affect the whole Massbank repository :). Regarding the FAIR principles: I think the process we have here is transparent and invites people to correct or reject my changes. Again, thanks for your effort @meier-rene. I will take the advice from both of you (@meier-rene and @tsufz) into account when proposing changes in the future. Best regards, Eric |
|
Dear @bachi55, Best wishes, |
Hi,
I opened this pull request to include consistency fixes I made to multiple Massbank records. As the changes concern many entries, I want to explain some of them:
SOLVENT A and SOLVENT B based on the SOLVENT information. Some of the raw data does not specify the A and B solvents separately. However, sometimes this information is available, and I suggest changes like this (see original diff):
PS: Here, I assume that for a reversed-phase LC column water (H2O) is typically denoted as solvent A. Please correct me if I am wrong. I think that it does not make much of a difference, as long as the gradient information specifies which percentages / fractions belong to which solvent.
CH$LINK: PUBCHEM CID:392323. Some of the entries, however, lacked the ":". I added it to simplify the downstream parsing. For example, in EQ302601 or EQ305901.
Eawag_additional_species contributor and take a look at the LC column specifications for two records:
AC$CHROMATOGRAPHY: COLUMN_NAME XBridge C18 3.5 um, 2.1x50 mm, Waters with guard column
AC$CHROMATOGRAPHY: COLUMN_NAME X-bridge C18, 3.5um, 2.1x50mm, Waters
For both records apparently the same LC column was used (only one is specified with a "guard column"). However, there are small inconsistencies in the naming, e.g. X-bridge vs. XBridge. I fixed this in the following way (ET260104 respectively ETS00128):
respectively
As a result, both column names are equal "COLUMN_NAME XBridge C18 3.5um, 2.1x50mm, Waters" (not considering the addition with the "guard column").
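The column-name fix above can be expressed as a few rewrite rules. The following is only a minimal Python sketch, not the script used for this pull request: the function name `normalize_column_name` and the three regular expressions are assumptions chosen to reproduce the XBridge example, and other column names would need their own rules.

```python
import re


def normalize_column_name(name: str) -> str:
    """Normalize a COLUMN_NAME value (hypothetical helper, illustration only)."""
    # Unify the product spelling: "X-bridge", "X-Bridge", "Xbridge" -> "XBridge".
    name = re.sub(r"X-?[Bb]ridge", "XBridge", name)
    # Remove whitespace between a number and its unit: "3.5 um" -> "3.5um".
    name = re.sub(r"(\d)\s+(um|mm)\b", r"\1\2", name)
    # Drop the comma directly after the stationary phase: "C18," -> "C18".
    name = re.sub(r"\b(C\d+),", r"\1", name)
    return name


with_guard = normalize_column_name(
    "XBridge C18 3.5 um, 2.1x50 mm, Waters with guard column"
)
without_guard = normalize_column_name("X-bridge C18, 3.5um, 2.1x50mm, Waters")

print(with_guard)
print(without_guard)
```

Both values then share the prefix "XBridge C18 3.5um, 2.1x50mm, Waters", so records can be grouped by column while the "with guard column" suffix is kept as extra information.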
+DEPRECATED: 2020-09-09 wrong annotation (PubChem ID, Name and CAS are not consistent)
General comment:
I understand that the proposed changes concern very many files, making this pull request challenging to review. However, my main concern was to build a local database of Massbank that allows me to work with MS2 spectra and RTs. Therefore, I am specifically interested in knowing the experimental conditions (LC and MS). For that, column names, for example, need to be standardized in order to perform proper grouping and have less noisy metadata. Eventually, future releases could enforce some kind of standard format, e.g. for column names (I often found that the column was most likely the same, but slight variations in the naming make matching harder).
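As an illustration of what enforcing a standard format could look like, here is a small check for the PubChem link format mentioned above. It is a sketch under assumptions, not part of the Massbank tooling: the function name `find_malformed_pubchem_links` is hypothetical, the accepted pattern is only the `CID:<number>` form shown in this description, and the malformed example line is made up for the demonstration.

```python
import re

# Accepted form, taken from the example in this description (assumption:
# a PUBCHEM link consists of exactly "CID:" followed by digits).
PUBCHEM_OK = re.compile(r"^CH\$LINK: PUBCHEM CID:\d+$")
# Any PUBCHEM link line, well-formed or not.
PUBCHEM_ANY = re.compile(r"^CH\$LINK: PUBCHEM\b")


def find_malformed_pubchem_links(lines):
    """Return (line_number, line) pairs for PUBCHEM links not in the accepted form."""
    malformed = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip()
        if PUBCHEM_ANY.match(line) and not PUBCHEM_OK.match(line):
            malformed.append((number, line))
    return malformed


# Hypothetical record fragment; the third line lacks the ":" after CID.
record = [
    "CH$LINK: CAS 50-00-0",
    "CH$LINK: PUBCHEM CID:392323",
    "CH$LINK: PUBCHEM CID 392323",
]
for number, line in find_malformed_pubchem_links(record):
    print(f"line {number}: {line}")
```

A check like this could run over all records before a release, so that deviations are reported to the data providers instead of being fixed silently.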
Best regards,
Eric