
Refine and analyze NYC green and yellow taxi datasets. Dockerized workflow for portability, PostgreSQL integration for easy access. PySpark utilized for yellow taxi data. Airflow in Docker streamlines task organization.

omar-sherif9992/Data-Engineering-Pipeline


Welcome to Data Engineering Pipeline


clean, impute, handle outliers, feature engineer, visualize, analyze, containerize, parallelize workload, and build a pipeline

The 4 Milestones aim to build a Data Engineering Pipeline

View the Screenshots »
· Demo Video · Report Bug · Be a Contributor

💡 Description

This project focuses on NYC taxi data. It began by exploring and cleaning the green taxi dataset: organizing, visualizing, and preparing the data for downstream analysis and machine learning. The workflow was then packaged with Docker for portability and the results were loaded into a PostgreSQL database for easy access. Similar preprocessing was applied to the yellow taxi data using PySpark. Finally, the tasks were orchestrated with Airflow running in Docker, streamlining the cleaning, transformation, and loading steps. Overall, the project demonstrates data cleaning, feature engineering, containerization, and workflow orchestration.

Note: Each milestone folder has its own README explaining how to use it

Milestone 1 (Data Preparation and Exploration (Green Taxis))

The goal of this milestone is to load a CSV file, perform exploratory data analysis with visualization, extract additional data, perform feature engineering, and preprocess the data for downstream use cases such as ML and data analysis. The dataset you will be working on is the NYC green taxis dataset. It contains records about trips conducted in NYC by green taxis.

There are multiple datasets for this case study (one dataset for each month). Download the dataset from here.

My dataset was 10/2016; the code is reproducible and can work with any month/year.
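A minimal sketch of this kind of preparation with pandas, on a tiny made-up stand-in frame (the column names mirror the green taxi schema, but the values and the exact imputation/outlier rules here are illustrative assumptions, not the project's actual code):

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the green taxi CSV (illustrative values only).
df = pd.DataFrame({
    "trip_distance": [1.2, np.nan, 250.0, 3.4],
    "fare_amount": [6.5, 8.0, 900.0, 12.0],
    "lpep_pickup_datetime": pd.to_datetime(
        ["2016-10-01 08:15", "2016-10-01 09:00",
         "2016-10-02 17:30", "2016-10-03 23:45"]
    ),
})

# Impute missing distances with the median.
df["trip_distance"] = df["trip_distance"].fillna(df["trip_distance"].median())

# Handle outliers in the fare with the 1.5*IQR rule.
q1, q3 = df["fare_amount"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["fare_amount"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Feature engineering: pickup hour and a weekend flag.
df["pickup_hour"] = df["lpep_pickup_datetime"].dt.hour
df["is_weekend"] = df["lpep_pickup_datetime"].dt.dayofweek >= 5
```

The same steps (impute, clip, derive features) apply unchanged to any month's file, which is what keeps the notebook reproducible across datasets.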

Milestone 2 (Docker Packaging and PostgreSQL Integration)

The objective of this milestone is to package your milestone 1 code in a Docker image that can be run anywhere. In addition, you will load your cleaned and prepared dataset, as well as your lookup table, into a PostgreSQL database, which acts as your data warehouse.
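Loading the two tables into the warehouse can be sketched with pandas and SQLAlchemy. The table names below are assumptions, and an in-memory SQLite engine stands in for PostgreSQL so the snippet is self-contained; against the real warehouse the connection URL would instead look like `postgresql://user:password@localhost:5432/taxi` (hypothetical credentials):

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite stands in for the PostgreSQL warehouse in this sketch.
engine = create_engine("sqlite://")

# Stand-ins for the cleaned dataset and the lookup table.
cleaned = pd.DataFrame({"trip_id": [1, 2], "fare_amount": [6.5, 12.0]})
lookup = pd.DataFrame({"code": [1], "zone": ["Astoria"]})

# Write both as separate tables, replacing any previous load.
cleaned.to_sql("green_taxis_cleaned", engine, if_exists="replace", index=False)
lookup.to_sql("lookup_table", engine, if_exists="replace", index=False)

# Read back to confirm the load.
back = pd.read_sql("SELECT COUNT(*) AS n FROM green_taxis_cleaned", engine)
```

`if_exists="replace"` makes the load idempotent, which matters once the same step is re-run by a container or, later, by Airflow.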

Milestone 3 (Preprocessing Yellow Taxis Data with PySpark)

The goal of this milestone is to preprocess the 'New York yellow taxis' dataset using PySpark, performing basic data preparation and basic analysis to gain a better understanding of the data. Use the same month and year you used for the green taxis in milestone 1. Datasets (download the yellow taxis dataset).

Milestone 4 (Airflow Orchestration of Tasks)

For this milestone, we were required to orchestrate the tasks performed in milestones 1 and 2 using Airflow in Docker. We work primarily on the green dataset and preprocess it using pandas only, for simplicity. The task chain from milestones 1 and 2 is: read CSV (green_taxis) >> clean and transform >> load to CSV (both the cleaned dataset and the lookup table) >> extract additional resources (GPS coordinates) >> integrate with the cleaned dataset and load back to CSV >> load both CSV files (lookup and cleaned dataset) to the Postgres database as two separate tables.
