The Data Aggregator is a web application for integrating biodiversity data into a Darwin Core-compatible format. It is built with the Phoenix and Ash frameworks, both written in Elixir, and is designed to be modular and extensible so that new features and functionality can be added as needed. It is a project of the Swiss Academy of Sciences (SCNAT) and is developed by Zebbra.
If you think parts are missing, or if you detect issues, please create an issue on GitHub or consider contributing by submitting a PR.
We will describe the high- to mid-level concepts of the application. The goal is to give you a good understanding of the application's architecture and of the main modules that cover the processes needed for import, enrichment, and publication of biodiversity data.
Throughout the documentation, we will use the for_coders: tag to indicate sections that are more technical in nature and contain hints to code sections where one can find the corresponding Elixir code modules. These sections are intended for developers and technical people who are interested in the inner workings of the application.
for_coders: Check out the Development section for more detailed information on how to work with the application as a developer and set up your local development environment. If you are a DevOps person, you might want to check out the Deployment section as well.
Also consider playing around with some tutorials or getting-started guides for the Ash Framework and the Phoenix Framework, which form the core of the application.
The Data Aggregator uses a comprehensive data model that follows Darwin Core standards and includes additional fields for Swiss-specific requirements. The system's data structure is defined in detail in the Entity Relationship Diagram, which shows all entities, their attributes, and relationships.
This section details the main workflows and functionalities provided by the application.
Note: The term "Collection" is used throughout the codebase and resource definitions, but often corresponds to the concept of a "Dataset" from a user perspective.
Collections (Datasets) are the top-level containers for managing biodiversity records through their lifecycle. The following information is stored on the collection item:

- Metadata: Metadata such as name, code, description, and type, but also information from external resources like GrSciColl (`grscicoll_reference`, `grscicoll_institution_key`, etc.) and/or GBIF (`gbif_dataset_key`, `gbif_doi`).
- Mapping: The import mapping is also stored on the dataset.
- State Management: The state of a collection (dataset) is stored on the object (`:idle`, `:mapping`, `:importing`, `:encoding`, `:exporting`, `:publishing`, `:validating`, `:deleting`).
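The state handling above can be pictured as a state machine on the Collection resource. The following is an illustrative sketch using the AshStateMachine extension; the module, action, and transition names are assumptions, not copied from the actual codebase:

```elixir
# Hypothetical sketch of the Collection state machine.
defmodule MyApp.CollectionSketch do
  use Ash.Resource, extensions: [AshStateMachine]

  state_machine do
    initial_states [:idle]
    default_initial_state :idle

    transitions do
      # Only an idle collection may start a long-running process.
      transition :start_mapping, from: :idle, to: :mapping
      transition :start_import, from: :idle, to: :importing
      transition :start_encoding, from: :idle, to: :encoding
      transition :start_export, from: :idle, to: :exporting
      transition :start_publication, from: :idle, to: :publishing
      transition :start_validation, from: :idle, to: :validating
      transition :start_deletion, from: :idle, to: :deleting
      # Every process returns the collection to :idle when it finishes.
      transition :finish, from: :*, to: :idle
    end
  end
end
```

Modelling each process as an `:idle -> busy -> :idle` round trip is what guarantees that only one long-running process can act on a dataset at a time.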
The import module handles uploading tabular data files (e.g., CSV, TSV), mapping source columns to the system's data model, validating data, and creating/updating Record resources within a Collection (Dataset).
- Upload tabular data files (CSV, TSV).
- Guided column mapping UI with mandatory field highlighting.
- Option to reuse previous mappings for a Collection.
- Automatic column detection and row counting.
- File type and size validation.
- Background processing for validation and record creation/update (upsert).
- Real-time progress monitoring (status, row counts).
- Detailed error logging for failed imports.
- File Upload & Validation
  - User uploads a tabular data file (CSV, TSV)
  - System performs initial validation:
    - File format check (must be CSV, TSV, TXT, ARROW, IPC, PARQUET, PQT)
    - File size validation (max 200MB)
    - Column header presence
    - Line separator compatibility (Unix-style `\n`)
    - Column separator validation (`,` or `;` for CSV, `\t` for TSV)
    - Minimum data requirements (at least one row, two columns)
    - UTF-8 character validation
  - File is stored in S3 storage with unique identifier
  - Initial import object is created in database with metadata
- Column Mapping
  - User maps source columns to Darwin Core fields through UI
  - System validates field type compatibility:
    - String fields must map to string data
    - Date fields must map to date-compatible data
    - Numeric fields must map to numeric data
  - Mapping configuration is stored with import
  - System tracks mandatory field mappings
  - Option to reuse previous mappings for efficiency
- Background Processing
  - Import job is enqueued via Oban worker (`DataAggregator.Records.Import.Workers.Importer`)
  - Records are processed in configurable batch sizes (default: 1000 records)
  - Progress is tracked and updated in real-time:
    - Row count validation
    - Record creation progress
    - Error accumulation
  - Errors are logged and can be reviewed in UI
  - Process can be monitored through collection state
- Record Creation
  - Valid records are created/updated in database using upsert logic
  - Each record is linked to its collection
  - Initial state is set to `:imported`
  - Error log is generated for failed records
  - Collection state is updated to reflect import status
for_coders: The import process is handled by the lib/data_aggregator/records/import modules. Key components include:
- `Import.Changes.ImportRecords`: Handles the actual record creation
- `Import.Workers.Importer`: Manages background processing
- `Import.Calculations.AttachmentData`: Processes file data
- Configuration options in `config/runtime.exs` control batch sizes and timeouts
The Import resource tracks the progress of a single import job.
- `pending`: Initial state after resource creation, before job enqueueing.
- `import_queued`: The `Importer` background job has been scheduled.
- `importing`: The `Importer` worker is actively validating rows and creating/updating records.
- `imported`: The import job completed successfully.
- `failed`: An error occurred during the import process or it was cancelled.
```mermaid
stateDiagram-v2
    [*] --> pending: Create Import
    pending --> import_queued: Enqueue Import Job
    import_queued --> importing: Worker Starts Job
    importing --> imported: Success
    importing --> failed: Error / Cancel
    imported --> [*]
    failed --> [*]
```
The encoding module standardizes and enriches raw imported Record data using a sequence of defined strategies (e.g., date conversion, geocoding, taxonomic lookups). Results are stored in a corresponding EncodedRecord resource, aligned with Darwin Core standards.
- Standardizes data formats (e.g., dates to ISO 8601).
- Enriches records with external/internal data (Taxonomy, Geocoordinates, IUCN status, Image URLs).
- Populates Darwin Core fields in `EncodedRecord`.
- Calculates MIDS (Minimum Information about a Digital Specimen) levels.
- Uses modular, extensible encoding strategies.
- Logs detailed results (success/failure/unchanged) per strategy per record via `RecordEncodingResult`.
- Executes asynchronously via the `Encoder` Oban worker.
- Creates/updates `EncodedRecord` using upsert logic.
The encoding process sequentially applies the following strategies, controlled by `DataAggregator.Taxonomy.Catalog.get_catalogs()` and dispatched via `DataAggregator.Records.Encoding.Strategy`:
- `:col_taxonomy`: Looks up taxonomic names against the GBIF Backbone Taxonomy. Crucially, this first step also initializes/resets the `EncodedRecord` based on the source `Record` data before applying its specific logic.
- `:swiss_species`: Looks up taxonomic information in the integrated Swiss Species catalog.
- `:geo_reverse`: Performs reverse geocoding (coordinates to administrative levels like country, canton) using an external service (likely OpenCage).
- `:geo_forward`: Performs forward geocoding (place names to coordinates) using an external service (likely OpenCage).
- `:iucn_redlist`: Determines the IUCN Red List conservation status, likely querying GBIF.
- `:relate_images`: Associates URLs of linked `Image` attachments with the `EncodedRecord`.
- `:convert_dates`: Parses various date/time formats found in `eventDate`, `dateIdentified`, etc., into standardized formats.
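This kind of sequential strategy pipeline can be sketched as a behaviour plus a dispatcher. The following is an illustrative pattern, not the actual `DataAggregator.Records.Encoding.Strategy` implementation; all names are assumptions:

```elixir
# Hypothetical sketch of sequential encoding-strategy dispatch.
defmodule EncodingStrategySketch do
  # Every strategy returns one of three outcomes, mirroring the
  # success/failure/unchanged results logged per strategy per record.
  @callback encode(encoded_record :: map()) ::
              {:ok, map()} | {:unchanged, map()} | {:error, term()}

  @strategies [
    :col_taxonomy,
    :swiss_species,
    :geo_reverse,
    :geo_forward,
    :iucn_redlist,
    :relate_images,
    :convert_dates
  ]

  # Apply every strategy in order, accumulating per-strategy results
  # so they can be stored (cf. RecordEncodingResult).
  def run(record) do
    Enum.reduce(@strategies, {record, []}, fn strategy, {acc, results} ->
      case apply_strategy(strategy, acc) do
        {:ok, updated} -> {updated, [{strategy, :ok} | results]}
        {:unchanged, same} -> {same, [{strategy, :unchanged} | results]}
        {:error, reason} -> {acc, [{strategy, {:error, reason}} | results]}
      end
    end)
  end

  # Placeholder dispatch; each real strategy would live in its own module.
  defp apply_strategy(_strategy, record), do: {:unchanged, record}
end
```

A failing strategy leaves the record as the previous step produced it, so one broken enrichment does not discard the others' work.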
- Initialization
  - User triggers encoding for selected records through UI
  - Collection state is set to `:encoding`
  - Records are enqueued for processing via Oban
  - System validates collection state (must be `:idle`)
  - Encoding job is created and tracked
- Sequential Strategy Application
  - Records are processed through each strategy in order:
    - CoL Taxonomy (`:col_taxonomy`):
      - Initializes/resets EncodedRecord from source Record
      - Queries GBIF Backbone Taxonomy API
      - Updates taxonomic information
      - Minimum confidence level: 80%
    - Swiss Species (`:swiss_species`):
      - Queries integrated Swiss Species catalog
      - Updates taxonomic information
      - Records registration status
    - Reverse Geocoding (`:geo_reverse`):
      - Processes coordinates to administrative levels
      - Updates country, canton, municipality
      - Uses external geocoding service
    - Forward Geocoding (`:geo_forward`):
      - Processes place names to coordinates
      - Updates coordinate information
      - Uses external geocoding service
    - IUCN Red List (`:iucn_redlist`):
      - Queries GBIF for conservation status
      - Updates IUCN category information
    - Image Association (`:relate_images`):
      - Links related image URLs
      - Updates image metadata
    - Date Standardization (`:convert_dates`):
      - Parses various date formats
      - Converts to ISO 8601 standard
      - Updates event dates and identification dates
- Progress Tracking
  - Each strategy execution is logged via `RecordEncodingResult`
  - Results include:
    - Success/failure status
    - Input values used
    - Output values generated
    - Error messages if applicable
  - Collection state is monitored through polling
  - Encoding status is updated per record
  - Real-time progress updates in UI
- Completion
  - Records are marked as `:encoded` or `:failed`
  - Collection returns to `:idle` state
  - Results can be reviewed in UI
  - Error logs are available for failed records
  - Encoding statistics are updated
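The date-standardization step can be pictured as trying a list of known input formats until one parses, then emitting ISO 8601. This sketch uses only the Elixir standard library; the formats shown are assumptions, not the application's actual list:

```elixir
# Hypothetical sketch of :convert_dates-style normalization.
defmodule DateSketch do
  # Try each known format in turn; the first success wins.
  def to_iso8601(value) do
    with :error <- iso(value),
         :error <- euro_slash(value) do
      {:error, :unparseable}
    end
  end

  defp iso(value) do
    case Date.from_iso8601(value) do
      {:ok, date} -> {:ok, Date.to_iso8601(date)}
      _ -> :error
    end
  end

  # e.g. day/month/year: "01/05/2023" -> "2023-05-01"
  defp euro_slash(value) do
    with [d, m, y] <- String.split(value, "/"),
         {day, ""} <- Integer.parse(d),
         {month, ""} <- Integer.parse(m),
         {year, ""} <- Integer.parse(y),
         {:ok, date} <- Date.new(year, month, day) do
      {:ok, Date.to_iso8601(date)}
    else
      _ -> :error
    end
  end
end

# DateSketch.to_iso8601("01/05/2023") #=> {:ok, "2023-05-01"}
```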
for_coders: The encoding process is handled by the lib/data_aggregator/records/encoding modules. Key components include:
- `Strategy`: Main module for strategy selection and execution
- `Workers.Encoder`: Manages background processing
- Individual strategy modules in the `strategies/` directory
- Configuration in `config/runtime.exs` for timeouts and batch sizes
- State management through `Record` and `Collection` resources
The encoding status is reflected in the Record resource's state machine.
- `imported`: Initial state after successful import, before encoding.
- `queued`: Encoding job has been enqueued for this record.
- `encoding`: The `Encoder` worker is actively processing the record.
- `encoded`: All encoding strategies completed successfully.
- `failed`: An error occurred during encoding.
```mermaid
stateDiagram-v2
    [*] --> imported : Record Created (via Import)
    imported --> queued: Enqueue Encoding Job
    queued --> encoding: Worker Starts Job
    encoding --> encoded: Success
    encoding --> failed: Error / Cancel
    encoded --> [*] %% Or back to imported if re-encoding needed? Needs clarification
    failed --> [*] %% Or back to imported for retry? Needs clarification
```
The export module allows users to generate downloadable files (e.g., CSV, TSV) of selected record data from a Collection. Users can filter records, choose between raw imported or standardized encoded data, and select the source for output file headers.
- Generates downloadable data files (likely CSV/TSV).
- Allows filtering of records to be exported using current UI filters (`records_query`).
- Option to export "Raw" (original imported) or "Encoded" (standardized) data.
- Choice of header source: "Dataset Mapping" (import mapping) or "DWC Attributes" (standard Darwin Core).
- Background processing via the `Exporter` Oban worker.
- Real-time progress monitoring and status updates.
- Downloadable export file upon completion.
- Export Configuration
  - User selects records to export (using UI filters)
  - User chooses data type (Raw/Encoded)
  - User selects header source (Dataset Mapping/DWC Attributes)
  - System validates configuration
  - Export job is created
- Background Processing
  - Export job is enqueued via Oban worker
  - Records are processed in batches
  - Progress is tracked in real-time
  - Export file is generated
  - File is stored in S3
- Completion
  - Export file is made available for download
  - Export status is updated
  - User is notified of completion
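Batched file generation as described above can be sketched with a streaming CSV pipeline. NimbleCSV is a common Elixir CSV library; whether the application actually uses it, and the record/header shapes shown, are assumptions:

```elixir
# Hypothetical sketch of streaming CSV export in batches.
NimbleCSV.define(ExportCSV, separator: ",", escape: "\"")

defmodule ExportSketch do
  @batch_size 1000

  # `records` is an enumerable of maps; `headers` is the chosen header
  # list (from the dataset mapping or standard DwC terms).
  def write(records, headers, path) do
    rows =
      records
      |> Stream.chunk_every(@batch_size)  # process in batches
      |> Stream.flat_map(fn batch ->
        Enum.map(batch, fn record ->
          Enum.map(headers, &Map.get(record, &1, ""))
        end)
      end)

    [headers]
    |> Stream.concat(rows)
    |> ExportCSV.dump_to_stream()
    |> Stream.into(File.stream!(path))
    |> Stream.run()
  end
end
```

Because everything is a stream, the export file can be written (and then uploaded to S3) without holding the whole record set in memory.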
for_coders: The export process is handled by the lib/data_aggregator/records/export modules. Key components include:
- `Export.Workers.Exporter`: Manages background processing
- `Export.Calculations.ExportData`: Processes record data
- Configuration in `config/runtime.exs` for batch sizes and timeouts
The publication module handles the process of publishing data to GBIF through the SwissNatColl portal. It generates a Darwin Core Archive (DwC-A) and submits it to GBIF's registration API.
- Generates Darwin Core Archive (DwC-A) from encoded records
- Submits data to GBIF through SwissNatColl portal
- Tracks publication status and metadata
- Handles GBIF dataset registration
- Manages DOI assignment
- Provides publication history
- Pre-publication Checks
  - Validates record requirements
  - Checks for mandatory fields
  - Verifies data quality
  - Ensures proper encoding
- Darwin Core Archive Generation
  - Creates DwC-A structure
  - Includes metadata.xml
  - Generates occurrence.txt
  - Packages files into archive
  - Stores archive in S3
- GBIF Submission
  - Registers dataset with GBIF
  - Submits DwC-A
  - Tracks submission status
  - Handles response
- Completion
  - Updates collection state
  - Stores GBIF metadata
  - Records DOI
  - Updates publication history
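A Darwin Core Archive is, at its core, a zip file bundling the occurrence data with its descriptor and metadata files. The sketch below uses Erlang's built-in `:zip` module; the file names follow the common DwC-A convention (`meta.xml` descriptor plus `eml.xml`), and the contents are placeholders, so treat the whole shape as an assumption rather than the application's actual packaging code:

```elixir
# Hypothetical sketch of DwC-A packaging.
defmodule DwcaSketch do
  # All three arguments are binaries with the already-generated content.
  def build(archive_path, occurrence_csv, meta_xml, eml_xml) do
    files = [
      {~c"occurrence.txt", occurrence_csv},
      {~c"meta.xml", meta_xml},
      {~c"eml.xml", eml_xml}
    ]

    # :zip.create/2 accepts {name, binary} tuples and writes the archive.
    :zip.create(String.to_charlist(archive_path), files)
  end
end
```

The resulting archive is what gets stored in S3 and submitted to the GBIF registration endpoint.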
for_coders: The publication process is handled by the lib/data_aggregator/records/publication modules. Key components include:
- `Publication.Workers.Publisher`: Manages background processing
- `DarwinCore.Publication`: Handles DwC-A generation
- Configuration in `config/runtime.exs` for GBIF API settings
The validation module manages the process of having data validated by the InfoSpecies Switzerland team before publication to GBIF. It handles the creation of validation requests, notification of validators, and processing of validation responses.
- Creates validation requests for selected records
- Generates Darwin Core Archive for validation
- Notifies InfoSpecies team
- Processes validation responses
- Updates record validation status
- Maintains validation history
- Validation Request Initiation
  - User selects records for validation
  - System identifies target InfoSpecies data center
  - Validation request is created
  - Collection state is set to `:validating`
- Data Package Preparation
  - Selected records are extracted
  - Darwin Core Archive is generated
  - Package is stored in S3
  - Download link is generated
- Notification
  - Email notification is generated
  - Sent to InfoSpecies center
  - Includes download link
  - Contains validation request details
- Response Handling
  - API endpoint for validation responses
  - Processes validation results
  - Updates records based on validation
  - Notifies users of completion
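The response-handling endpoint can be sketched as a thin Phoenix controller that accepts the callback and defers the heavy lifting to a background job. The controller name, route, payload shape, and the `ValidationResponseWorker` module are all hypothetical:

```elixir
# Hypothetical sketch of the validation-response endpoint.
defmodule MyAppWeb.ValidationResponseController do
  use Phoenix.Controller, formats: [:json]

  def create(conn, %{"validation_request_id" => id, "results_url" => url}) do
    # Hand off to a background job so the HTTP request returns quickly;
    # the worker downloads the results file from `results_url` and
    # updates the validation_status of the affected records.
    %{validation_request_id: id, results_url: url}
    |> ValidationResponseWorker.new()
    |> Oban.insert()

    json(conn, %{status: "accepted"})
  end
end
```

Accepting the callback immediately and processing asynchronously keeps the external center's request from timing out on large result files.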
for_coders: The validation process is handled by the lib/data_aggregator/records/validation modules. Key components include:
- `Validation.Workers.ValidationRequestHandler`: Manages validation requests
- `Validation.Changes.CreateDwCA`: Generates validation packages
- Configuration in `config/runtime.exs` for validation settings
The image upload module manages the process of uploading and associating images with records. It handles file validation, storage, and linking to records based on catalog numbers and identifiers.
- Bulk image upload support
- Automatic record association
- Image metadata extraction
- S3 storage integration
- Progress tracking
- Error handling and reporting
- Image Upload Initiation
  - User selects images for upload
  - System validates:
    - File types (JPEG, PNG, etc.)
    - File sizes
    - Image dimensions
    - Metadata presence
  - Upload session is created
- Background Processing
  - Images are processed in batches
  - Files are stored in S3
  - Metadata is extracted
  - Progress is tracked
- Record Association
  - Images are linked to records
  - Association metadata is stored
  - Links are verified
  - Statistics are updated
- Completion
  - Upload statistics are updated
  - Success/error summary is generated
  - User is notified of completion
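Associating images with records by catalog number can be sketched as parsing an identifier out of each filename and looking it up. The filename convention shown (catalog number before an underscore suffix) is an assumption for illustration:

```elixir
# Hypothetical sketch of image-to-record association.
defmodule ImageMatchSketch do
  # "NMB-12345_1.jpg" -> "NMB-12345"
  def catalog_number(filename) do
    filename
    |> Path.basename()
    |> Path.rootname()
    |> String.split("_")
    |> hd()
  end

  # Partition filenames into matched {file, record} pairs and
  # unmatched files, given a lookup map keyed by catalog number.
  def associate(filenames, records_by_catalog_number) do
    Enum.reduce(filenames, %{matched: [], unmatched: []}, fn file, acc ->
      case Map.fetch(records_by_catalog_number, catalog_number(file)) do
        {:ok, record} -> %{acc | matched: [{file, record} | acc.matched]}
        :error -> %{acc | unmatched: [file | acc.unmatched]}
      end
    end)
  end
end

# ImageMatchSketch.catalog_number("NMB-12345_1.jpg") #=> "NMB-12345"
```

Keeping the unmatched list explicit is what makes the success/error summary at the end of the upload possible.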
for_coders: The image upload process is handled by the lib/data_aggregator/records/image modules. Key components include:
- `Image.Workers.ImageProcessor`: Manages background processing
- `Image.Changes.ProcessImage`: Handles image processing
- Configuration in `config/runtime.exs` for upload settings
The deletion module manages the process of removing collections, records, and their associated data from the system. This is a complex process that involves handling cascading deletions, cleaning up external storage, and maintaining data integrity.
- Cascading deletion of all related resources
- Cleanup of external storage (S3)
- Partition management for database tables
- State tracking during deletion
- Background processing for large deletions
- Audit trail preservation
- Collection Deletion
  - Sets collection state to `:deleting`
  - Triggers database partition cleanup
  - Cascades deletion to all related resources:
    - Records and their versions
    - Encoded records and their versions
    - Validated records
    - Published records
    - Import/Export files
    - Validation requests/responses
    - Image uploads and attachments
  - Cleans up S3 storage:
    - Deletes all associated media files
    - Removes import/export files
    - Cleans up validation packages
    - Removes image attachments
- Record Deletion
  - Cascades to related resources:
    - Encoded records
    - Validated records
    - Published records
    - Image attachments
  - Updates collection record count
  - Removes from GBIF on next publication
  - Preserves audit trail
- Media Deletion
  - Handles cleanup of S3 storage
  - Removes file attachments
  - Updates record associations
  - Maintains referential integrity
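The S3 cleanup step can be sketched with ExAws, a widely used AWS client for Elixir; whether the application uses ExAws, and the bucket/prefix layout shown, are assumptions:

```elixir
# Hypothetical sketch of S3 media cleanup for a collection.
defmodule MediaCleanupSketch do
  @bucket "data-aggregator-media"  # hypothetical bucket name

  def delete_collection_media(collection_id) do
    @bucket
    |> ExAws.S3.list_objects(prefix: "collections/#{collection_id}/")
    |> ExAws.stream!()               # paginates through all objects
    |> Stream.map(& &1.key)
    |> Stream.chunk_every(1000)      # S3 allows at most 1000 keys per delete call
    |> Enum.each(fn keys ->
      @bucket
      |> ExAws.S3.delete_multiple_objects(keys)
      |> ExAws.request!()
    end)
  end
end
```

Deleting by prefix is what lets one call remove media files, import/export files, validation packages, and image attachments for a collection in a single sweep.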
for_coders: The deletion process is implemented across several modules:
- `Collection.Changes.DeleteAllMedia`: Handles S3 cleanup
- `Collection.Changes.SetDeleting`: Manages deletion state
- Database triggers for partition management
- Cascading foreign key constraints
- Configuration in `config/runtime.exs`
The deletion status is reflected in the Collection resource's state machine.
- `idle`: Normal state
- `deleting`: Collection is being deleted
- `deleted`: Collection has been removed
```mermaid
stateDiagram-v2
    [*] --> idle: Create Collection
    idle --> deleting: Start Deletion
    deleting --> deleted: Success
    deleting --> failed: Error
    deleted --> [*]
    failed --> [*]
```
The system utilizes the AshAuthentication extension for user management and authentication, combined with Ash's built-in policy authorization framework for permissions.
- Password-based strategy with email
- Case-insensitive email identity
- Sign-in token management
- Terms acceptance tracking
- Session handling
Authorization is primarily based on roles assigned to users. Roles are stored as an array of strings in the `User.roles` attribute:

- `admin`: Has broad access across the system (often bypassing specific policy checks).
- `collection_administrator`: Manages users and resources (like Collections, Publications, etc.) within a specific institution. Their permissions are scoped by the `institution_id` associated with their user account.
- `data_digitizer`: Has read access to resources within their associated institution.
- Framework: Permissions are enforced using `Ash.Policy.Authorizer`, defined on resources.
- Checks: Policies use built-in Ash checks (e.g., `action_type/1`) and custom checks defined in `lib/data_aggregator/checks/`:
  - `with_role(role_or_roles)`: Checks if the current user (actor) has at least one of the specified roles.
  - `it_is_myself()`: Checks if the actor is the same as the user resource being accessed.
  - `it_is_admin()`: Checks if the user resource being accessed has the `admin` role.
  - `relates_to_institution_check(foreign_key)`: Checks if the actor and the resource being acted upon belong to the same institution (via `institution_id`).
  - `relates_to_institution_filter(foreign_key)`: Applies a filter to queries to restrict results to the actor's institution.
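To make the role and institution checks concrete, here is a sketch of how they might combine inside a resource's `policies` block. The check names mirror those listed above, but the exact policy composition is an assumption, not the application's actual rules:

```elixir
# Hypothetical policies fragment (embedded in an Ash resource).
policies do
  # Admins bypass the remaining checks entirely.
  bypass with_role(:admin) do
    authorize_if always()
  end

  # Reads are filtered down to the actor's own institution.
  policy action_type(:read) do
    authorize_if relates_to_institution_filter(:institution_id)
  end

  # Writes require the collection_administrator role, scoped to the
  # same institution as the resource being changed.
  policy action_type([:create, :update, :destroy]) do
    forbid_unless with_role(:collection_administrator)
    authorize_if relates_to_institution_check(:institution_id)
  end
end
```

Note the distinction between the two institution checks: the `_filter` variant narrows query results, while the `_check` variant passes or fails a single action.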
for_coders: The user management system is implemented using:
- `AshAuthentication` for authentication
- `Ash.Policy.Authorizer` for permissions
- Custom checks in `lib/data_aggregator/checks/`
- Configuration in `config/runtime.exs`
Most long-running background processes (Import, Export, Publication, Validation Request Send, Image Mapping) support cancellation.
- Mechanism: Each process resource (`Import`, `Export`, etc.) typically has a specific cancel action (e.g., `cancel_import`, `cancel_export`).
- Effect & Job Handling:
  - Standard cancellation actions (`cancel_import`, etc.) primarily transition the state machine to `:failed` and set `finished_at`. They do not actively stop the running Oban job; the job may continue until completion, error, or timeout.
  - The admin-only `Collection.cancel_action` is more forceful. It finds the active process, calls its specific cancel action, and also actively signals Oban to kill the associated running job(s) (the `ash_cancel_all_jobs` helper in the `CancelAction` change).
- Outcome: A cancelled job's corresponding resource is marked as `:failed`.
The system provides an audit trail for changes made to core data resources using the AshPaperTrail extension.
- Tracked Resources: Versioning is enabled for the `Record` and `EncodedRecord` resources.
- Tracking Mode: Only the changes made during an update are stored, not the full record state for each version (`change_tracking_mode :changes_only`).
- Recorded Information: Each version typically stores:
  - The changes made to tracked attributes.
  - The action that triggered the change (`store_action_name? true`).
  - A reference to the user (actor) who performed the action (`belongs_to_actor :user`).
  - A reference back to the original record (`reference_source? true`).
  - Copies of certain key attributes for easier querying (e.g., `mte_catalog_number`, `tax_scientific_name` on `Record` versions).
- Scope:
  - For `Record`, versioning specifically tracks changes made via the `:update_publication_status` and `:update_validation_status` actions.
  - For `EncodedRecord`, versioning tracks updates but ignores create/destroy actions.
  - Certain attributes (like timestamps and internal state fields) are explicitly ignored.
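The options above map onto the AshPaperTrail DSL roughly as follows. This is a fragment that would live inside the `Record` resource; the actor destination module and the ignored attribute names are illustrative assumptions:

```elixir
# Hypothetical paper_trail fragment (embedded in the Record resource).
paper_trail do
  change_tracking_mode :changes_only   # store only the diff, not full state
  store_action_name? true              # which action triggered the change
  belongs_to_actor :user, MyApp.Accounts.User  # hypothetical user module
  reference_source? true               # foreign key back to the record
  # Per the scope above, only these actions create Record versions:
  on_actions [:update_publication_status, :update_validation_status]
  # Timestamps and internal state fields are not tracked (names assumed):
  ignore_attributes [:updated_at, :state]
end
```

Storing diffs instead of full snapshots keeps the versions table compact, at the cost of having to replay changes to reconstruct a historical state.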
Users can search for records within a Collection using PostgreSQL's full-text search capabilities, integrated via the AshPagify.Tsearch extension.
- Target: Full-text search operates on the data stored in the `EncodedRecord` resource (the standardized version of the record), specifically targeting its pre-calculated text search vector (`tsv`) column.
- Mechanism: Queries entered by the user are converted into PostgreSQL `ts_query` format and executed efficiently against the indexed `tsv` column.
- Filtering: In addition to full-text search, predefined filter scopes are available (via `AshPagify` configuration) to easily filter records based on their status, such as `:not_encoded`, `:not_published`, and `:not_validated`.
for_coders: Additional features are implemented using:
- `AshPaperTrail` for versioning
- `AshPagify.Tsearch` for search
- Custom cancellation actions
- Configuration in `config/runtime.exs`
The Data Aggregator interacts with several external biodiversity informatics platforms and services:
The Global Biodiversity Information Facility (GBIF) is the primary target for data publication and a source for taxonomic and institutional information.
- Dataset Publication: The core publication workflow generates Darwin Core Archives (DwC-A) and submits them to the GBIF registry endpoint associated with a Collection's registered `gbif_dataset_key`.
- Dataset Registration: Provides functionality (the `Collection.register_at_gbif` action) to register the dataset entity with GBIF.
- Taxonomic Resolution: Uses the GBIF Backbone Taxonomy during the Encoding process (`:col_taxonomy` strategy) to standardize scientific names and classifications.
- IUCN Status: Queries GBIF during encoding (`:iucn_redlist` strategy) to enrich records with IUCN Red List conservation status.
- GrSciColl Data Source: Leverages GBIF's API to retrieve information about institutions and collections registered in the Global Registry of Scientific Collections (GrSciColl).
- Publication Verification: Includes a background job (`PublicationVerifier`) to periodically check via the GBIF API whether records marked as published are actually discoverable on the GBIF portal.
The Global Registry of Scientific Collections (GrSciColl) is used for institutional context.
- Metadata Association: Collections within the Data Aggregator are linked to GrSciColl entries by storing `grscicoll_reference` (for the collection) and the associated `grscicoll_institution_key`, `_code`, and `_name`.
- Metadata Retrieval: Uses the GBIF API to fetch and populate institution details based on the provided GrSciColl identifiers during Collection creation.
- Validation: May include specific validation rules related to GrSciColl identifiers (`GrSciCollValidator`).
The system facilitates an external data review process with Swiss InfoSpecies data centers.
- Validation Workflow: Manages sending selected record data (as DwC-A) to the appropriate InfoSpecies center for expert review.
- Email Notification: Sends an email with a download link for the DwC-A to the relevant center (contacts managed via the `InfospeciesCenters` catalog).
- Results Ingestion: Processes a results file (provided back by the center via a URL) to update the `validation_status` of records and store corrected/validated data in `ValidatedRecord` resources.
for_coders: Integration points are implemented in:
- `lib/data_aggregator/gbif`
- `lib/data_aggregator/grscicoll`
- `lib/data_aggregator/infospecies`
- Configuration in `config/runtime.exs`
The application provides a JSON REST API to interact with the data. The API is built using the Ash Framework and provides a set of endpoints to access and manipulate the data. Read the full REST API documentation here.