This project is a data pipeline that extracts, transforms, and loads Marvel comic book data, then analyzes and visualizes it. It is built with Python, SQLite, dbt, and Jupyter Notebook.
Here's how the project files and folders are organized (generated with the Linux `tree` command):

```
MarvelDataPipeline/
├── Comic_Character_Analysis.ipynb   # Jupyter Notebook for data analysis
├── DataLayer/
│   └── DataService.py               # Handles SQLite database interactions
├── ETLApp.py                        # Executes the ETL pipeline
├── LICENSE                          # Project license
├── Models/
│   └── Comic.py                     # Data model for comic book info
├── README.md                        # This file
├── ResultSQL/
│   └── ResultSQL.sql                # SQL result data
├── SSLcheck.py                      # Checks SSL certificate issues with API calls
├── Services/
│   └── ComicService.py              # Interacts with comic data
├── Utils/
│   └── Config.py                    # Manages configuration settings
├── architecture/
│   ├── count.png                    # Character count result
│   ├── drawing.png                  # Data Flow Diagram
│   └── eclidrawpipeline0.png        # Project Data Pipeline Diagram
├── cronjob.sh                       # Shell script to run ETL (with venv)
├── data_service.log                 # Log file for data service operations
├── logs/
│   └── dbt.log                      # dbt log file
├── marvel_dbt/
│   ├── README.md                    # dbt project README
│   ├── analyses/                    # dbt analysis files
│   ├── dbt_project.yml              # dbt project configuration
│   ├── macros/                      # dbt macro files
│   ├── models/
│   │   └── example/
│   │       ├── my_first_dbt_model.sql
│   │       ├── my_second_dbt_model.sql
│   │       └── schema.yml
│   ├── seeds/                       # dbt seed files
│   ├── snapshots/                   # dbt snapshot files
│   └── tests/                       # dbt test files
├── my_marvel.db                     # SQLite database file
├── sql/
│   ├── backup_001.db                # Database backup
│   └── sql.sql                      # General SQL scripts
└── sqls/                            # Additional SQL scripts
```
- **ETL Pipeline** (`ETLApp.py`): Handles the extraction, transformation, and loading of data into `my_marvel.db` using `sqlite3`.
- **Data Modeling** (`Models/Comic.py`): Defines the structure of our comic data.
- **Data Access** (`DataLayer/DataService.py`): Manages interactions with the SQLite database.
- **Data Transformation** (`marvel_dbt/`): Uses dbt for data transformations.
- **Data Analysis** (`Comic_Character_Analysis.ipynb`): Analyzes and visualizes data in a Jupyter Notebook, including character and comic plots and character counts (see `architecture/count.png`).
- **Configuration** (`Utils/Config.py`): Manages settings for the pipeline.
- **Services** (`Services/ComicService.py`): Provides the business logic for interacting with comic data.
- **Database** (`my_marvel.db`): Stores the comic data.
- **Logging**: Logs operations for both the data service and dbt.
- **Cron Job** (`cronjob.sh`): Automates the ETL process on a schedule.
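The load step of the pipeline above can be sketched with plain `sqlite3`. This is a hypothetical illustration only: the `comics` table, its columns, and the `load_comics` helper are assumptions, and the real schema and logic live in `Models/Comic.py` and `DataLayer/DataService.py`.

```python
import sqlite3

def load_comics(db_path, comics):
    """Insert transformed comic records into SQLite and return the row count.

    Hypothetical sketch of the load step in ETLApp.py; the real schema is
    defined by Models/Comic.py and DataLayer/DataService.py.
    """
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS comics "
        "(id INTEGER PRIMARY KEY, title TEXT, character_count INTEGER)"
    )
    con.executemany(
        "INSERT OR REPLACE INTO comics (id, title, character_count) VALUES (?, ?, ?)",
        [(c["id"], c["title"], c["character_count"]) for c in comics],
    )
    con.commit()
    loaded = con.execute("SELECT COUNT(*) FROM comics").fetchone()[0]
    con.close()
    return loaded

# Transform step: drop records without a usable title before loading.
raw = [
    {"id": 1, "title": "Hulk (2008) #1", "character_count": 3},
    {"id": 2, "title": None, "character_count": 0},
]
clean = [c for c in raw if c["title"]]
print(load_comics(":memory:", clean))  # ETLApp.py would target my_marvel.db instead
```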
The dbt models in `marvel_dbt/models/` are still being developed. We're aiming to:
- Check for null values in the data.
- Identify and handle empty records.
- Create a materialized view to join character and comic data.
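Until those dbt tests land, the null and empty-record checks can be prototyped directly with `sqlite3`. The `characters` table and `name` column below are illustrative assumptions, not the project's real schema.

```python
import sqlite3

def count_bad_rows(con, table, column):
    # Count rows where the column is NULL or an empty string.
    sql = f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL OR {column} = ''"
    return con.execute(sql).fetchone()[0]

con = sqlite3.connect(":memory:")  # the real check would open my_marvel.db
con.execute("CREATE TABLE characters (id INTEGER, name TEXT)")
con.executemany(
    "INSERT INTO characters VALUES (?, ?)",
    [(1, "Hulk"), (2, None), (3, "")],
)
print(count_bad_rows(con, "characters", "name"))  # flags the NULL and empty rows
```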
- Clone the Repository: `git clone https://github.com/dadenewyyt/MarvelDataPipeline.git && cd MarvelDataPipeline`
- Install Dependencies: `pip install pandas requests dbt-core dbt-sqlite notebook` (add others as needed). Note that `logging` and `sqlite3` ship with the Python standard library and need no pip install, and DB Browser for SQLite is a desktop application rather than a pip package.
- Database Setup:
  - Make sure SQLite is installed.
  - Run `ETLApp.py` to create or update `my_marvel.db`.
- dbt Setup:
  - Navigate to `marvel_dbt/`.
  - Run `dbt deps`.
  - Configure `profiles.yml` for SQLite.
  - Run `dbt run`.
- Cron Setup (Linux/macOS):
  - Make `cronjob.sh` executable: `chmod +x cronjob.sh`
  - Open crontab: `crontab -e`
  - Add a cron entry (e.g., daily at midnight): `0 0 * * * /path/to/MarvelDataPipeline/cronjob.sh` (adjust the path).
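The "Configure `profiles.yml`" step needs a SQLite target. Below is a hypothetical `~/.dbt/profiles.yml` entry; the key names (`schemas_and_paths`, `schema_directory`) come from the dbt-sqlite plugin and can vary between versions, so treat this as a starting point and check the plugin's README for your release.

```yaml
marvel_dbt:            # must match the profile name in dbt_project.yml
  target: dev
  outputs:
    dev:
      type: sqlite
      threads: 1
      database: database
      schema: main
      schemas_and_paths:
        main: /path/to/MarvelDataPipeline/my_marvel.db
      schema_directory: /path/to/MarvelDataPipeline/sql
```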
- Run ETL: `./cronjob.sh` (runs the shell script that executes the ETL).
- Run dbt: `cd marvel_dbt && dbt run`
- Analyze Data:
  - Open `Comic_Character_Analysis.ipynb`.
  - Run the notebook cells.
- Use Services:
  - Use `ComicService.py` for data interactions.
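The analysis step can pull straight from the database with pandas. The `characters` table and `appearances` column below are hypothetical stand-ins for the real schema; the notebook would connect to `my_marvel.db` rather than an in-memory database.

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(":memory:")  # the notebook would use my_marvel.db
con.execute("CREATE TABLE characters (name TEXT, comic_id INTEGER)")
con.executemany(
    "INSERT INTO characters VALUES (?, ?)",
    [("Hulk", 1), ("Hulk", 2), ("Thor", 1)],
)
con.commit()

# Character counts, in the spirit of the chart saved as architecture/count.png
df = pd.read_sql(
    "SELECT name, COUNT(*) AS appearances FROM characters GROUP BY name", con
)
print(df)
# df.plot.bar(x="name", y="appearances") would render the chart in the notebook
```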
- We're using `sqlite3` directly for database operations.
- dbt handles data transformations (still in progress).
- Jupyter Notebook is used for data analysis and visualization.
- Logging is in place for both data-service and dbt operations.
This document gives you an overview of the project; check out the code and individual files for more details.

Developed by: Daniel Wondyifraw


