
Marvel Data Pipeline

(Character count visualization: architecture/count.png)

This project is a data pipeline that extracts, transforms, and loads Marvel comic book data, then analyzes and visualizes it. It is built with Python and SQLite3, with dbt for transformations and Jupyter Notebook for analysis.
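At a high level, the ETL flow looks like the sketch below. This is a simplified illustration, not the actual ETLApp.py code: the `comics` table name, the record fields, and the transform logic are all assumptions.

```python
import sqlite3

def transform(raw):
    """Normalize a raw API record into an (id, title, page_count) tuple."""
    return (raw["id"], raw["title"].strip(), raw.get("pageCount", 0))

def load(records, db_path="my_marvel.db"):
    """Create the comics table if needed and insert the records."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS comics "
        "(id INTEGER PRIMARY KEY, title TEXT, page_count INTEGER)"
    )
    con.executemany("INSERT OR REPLACE INTO comics VALUES (?, ?, ?)", records)
    con.commit()
    con.close()

# The real extract step calls the Marvel Comics API with API keys; a
# hard-coded sample record keeps this sketch self-contained and runnable.
raw_records = [{"id": 1, "title": " Amazing Fantasy #15 ", "pageCount": 36}]
load([transform(r) for r in raw_records], db_path=":memory:")
```

The three stages are kept as separate functions so each can be tested and logged independently, which also matches the project's split between services and the data layer.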

Project Structure

Here's how the project files and folders are organized (generated with the Linux `tree` command):

MarvelDataPipeline/
├── Comic_Character_Analysis.ipynb     # Jupyter Notebook for data analysis
├── DataLayer/
│   ├── DataService.py                  # Handles SQLite database interactions
├── ETLApp.py                         # Executes the ETL pipeline
├── LICENSE                           # Project license
├── Models/
│   └── Comic.py                      # Data model for comic book info
├── README.md                          # This file
├── ResultSQL/
│   └── ResultSQL.sql                 # SQL result data
├── SSLcheck.py                       # Checks SSL certificate issues with API calls 
├── Services/
│   ├── ComicService.py                 # Interacts with comic data
├── Utils/
│   ├── Config.py                     # Manages configuration settings
├── architecture/
│   ├── count.png                     # Character count result
│   ├── drawing.png                   # Data Flow Diagram
│   └── eclidrawpipeline0.png         # Project Data Pipeline Diagram
├── cronjob.sh                        # Shell script to run ETL (with venv)
├── data_service.log                  # Log file for data service operations
├── logs/
│   └── dbt.log                       # dbt log file
├── marvel_dbt/
│   ├── README.md                     # dbt project README
│   ├── analyses/                     # dbt analysis files
│   ├── dbt_project.yml                # dbt project configuration
│   ├── macros/                       # dbt macro files
│   ├── models/
│   │   └── example/
│   │       ├── my_first_dbt_model.sql
│   │       ├── my_second_dbt_model.sql
│   │       └── schema.yml
│   ├── seeds/                        # dbt seed files
│   ├── snapshots/                    # dbt snapshot files
│   └── tests/                        # dbt test files
├── my_marvel.db                      # SQLite database file
├── sql/
│   ├── backup_001.db                 # Database backup
│   └── sql.sql                       # General SQL scripts
└── sqls/                             # Additional SQL scripts

Key Components

  • ETL Pipeline (ETLApp.py): Handles the extraction, transformation, and loading of data into my_marvel.db using sqlite3.
  • Data Modeling (Models/Comic.py): Defines the structure of our comic data.
  • Data Access (DataLayer/DataService.py): Manages interactions with the SQLite database.
  • Data Transformation (marvel_dbt/): Uses dbt for data transformations.
  • Data Analysis (Comic_Character_Analysis.ipynb): Analyzes and visualizes data using Jupyter Notebook, including character and comic plots and character counts (see count.png).
  • Configuration (Utils/Config.py): Manages settings for the pipeline.
  • Services (Services/ComicService.py): Provides the business logic for interacting with comic data.
  • Database (my_marvel.db): Stores the comic data.
  • Logging: Logs operations for both the data service and dbt.
  • Cron Job (cronjob.sh): Automates the ETL process on a schedule.
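As an illustration of how the model and data-access components fit together, here is a minimal round-trip sketch. The `Comic` field names and the `save_comic` helper are assumptions for illustration; see Models/Comic.py and DataLayer/DataService.py for the real definitions.

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class Comic:
    """Illustrative stand-in for Models/Comic.py; field names are guesses."""
    id: int
    title: str
    character_count: int = 0

def save_comic(con, comic):
    """Persist one Comic row, in the spirit of DataLayer/DataService.py."""
    con.execute(
        "INSERT OR REPLACE INTO comics (id, title, character_count) "
        "VALUES (?, ?, ?)",
        (comic.id, comic.title, comic.character_count),
    )

con = sqlite3.connect(":memory:")  # stand-in for my_marvel.db
con.execute(
    "CREATE TABLE comics (id INTEGER PRIMARY KEY, title TEXT, "
    "character_count INTEGER)"
)
save_comic(con, Comic(42, "Secret Wars", 12))
row = con.execute(
    "SELECT title, character_count FROM comics WHERE id = 42"
).fetchone()
```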

DBT Notes (In Progress)

The dbt models in marvel_dbt/models/ are still being developed. We're aiming to:

  • Check for null values in the data.
  • Identify and handle empty records.
  • Create a materialized view to join character and comic data.
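In dbt, the null checks would normally live as `not_null` tests in schema.yml. As a self-contained illustration of what those checks catch, here is the equivalent query run directly against SQLite; the `characters` table and `name` column are assumptions.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # stand-in for my_marvel.db
con.execute("CREATE TABLE characters (id INTEGER, name TEXT)")
con.executemany(
    "INSERT INTO characters VALUES (?, ?)",
    [(1, "Spider-Man"), (2, None), (3, "")],
)

# Rows a dbt not_null test would flag...
nulls = con.execute(
    "SELECT COUNT(*) FROM characters WHERE name IS NULL"
).fetchone()[0]
# ...and empty-string records, which a plain not_null test would miss.
empties = con.execute(
    "SELECT COUNT(*) FROM characters WHERE name = ''"
).fetchone()[0]
```

Distinguishing NULL from empty string is the reason the second bullet exists as a separate goal: `not_null` alone would pass the `''` row.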

Setup and Installation

  1. Clone the Repository:
    git clone https://github.com/dadenewyyt/MarvelDataPipeline.git
    cd MarvelDataPipeline
  2. Install Dependencies:
    pip install pandas requests dbt-core dbt-sqlite notebook
    (Note: logging is part of the Python standard library, and DB Browser for SQLite is a desktop app installed separately, not via pip.)
  3. Database Setup:
    • Make sure SQLite is installed.
    • Run ETLApp.py to create or update my_marvel.db.
  4. dbt Setup:
    • Navigate to marvel_dbt/.
    • Run dbt deps.
    • Configure profiles.yml for SQLite.
    • Run dbt run.
  5. Cron Setup (Linux/macOS):
    • Make cronjob.sh executable: chmod +x cronjob.sh
    • Open crontab: crontab -e
    • Add a cron entry (e.g., daily at midnight): 0 0 * * * /path/to/MarvelDataPipeline/cronjob.sh (adjust to your actual clone path).
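After step 3, you can sanity-check that the database was created by listing its tables. The snippet below uses an in-memory database with one sample table so it runs on its own; point the connection at my_marvel.db to inspect the real file.

```python
import sqlite3

# Replace ":memory:" with "my_marvel.db" after running ETLApp.py.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE comics (id INTEGER PRIMARY KEY, title TEXT)")

# sqlite_master is SQLite's built-in catalog of schema objects.
tables = [row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
)]
print(tables)
```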

Usage

  1. Run ETL:
    ./cronjob.sh  # Run the shell script to execute the ETL.
  2. Run dbt:
    cd marvel_dbt
    dbt run
  3. Analyze Data:
    • Open Comic_Character_Analysis.ipynb.
    • Run the notebook cells.
  4. Use Services:
    • Use ComicService.py for data interactions.
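The analysis in step 3 boils down to reading SQLite tables into pandas and aggregating. Below is a sketch of the kind of character-count query the notebook performs; the `characters` table, its columns, and the sample rows are assumptions, and an in-memory database stands in for my_marvel.db.

```python
import sqlite3

import pandas as pd

con = sqlite3.connect(":memory:")  # stand-in for my_marvel.db
con.execute("CREATE TABLE characters (name TEXT, comic_count INTEGER)")
con.executemany(
    "INSERT INTO characters VALUES (?, ?)",
    [("Iron Man", 120), ("Thor", 95), ("Hulk", 88)],
)
con.commit()

# Pull the table into a DataFrame and rank characters by appearances.
df = pd.read_sql_query("SELECT name, comic_count FROM characters", con)
top = df.sort_values("comic_count", ascending=False).head(2)
```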

Key Notes

  • We're using sqlite3 directly for database operations.
  • dbt is used for data transformations (still in progress).
  • Jupyter Notebook is used for data analysis and visualization.
  • Logging is in place for service and dbt operations.

(Data pipeline diagram: architecture/eclidrawpipeline0.png; data flow drawing: architecture/drawing.png)

This document gives you an overview of the project; check out the code and individual files for more details.

Developed by: Daniel Wondyifraw
