
Refine and analyze NYC green and yellow taxi datasets. Dockerized workflow for portability, PostgreSQL integration for easy access. PySpark utilized for yellow taxi data. Airflow in Docker streamlines task organization.

omar-sherif9992/Data-Engineering-Pipeline


Welcome to Data Engineering Pipeline


clean, impute, handle outliers, feature engineer, visualize, analyze, containerize, parallelize workload, and build a pipeline

The 4 Milestones aim to build a Data Engineering Pipeline

View the Screenshots »
· Demo Video · Report Bug · Be a Contributor

💡 Description

This project focuses on NYC taxi data. It began by exploring and cleaning the green taxi dataset: organizing, visualizing, and preparing the data for downstream analysis and machine learning. The workflow was then packaged with Docker for portability and the results were loaded into a PostgreSQL database for easy access. Similar preprocessing was applied to the yellow taxi data using PySpark. Finally, the tasks were orchestrated with Airflow running in Docker, streamlining the cleaning, transformation, and loading steps. Overall, the project demonstrates data cleaning, feature engineering, containerization, and workflow orchestration.

Note: Each milestone folder has its own README explaining how to use it

Milestone 1 (Data Preparation and Exploration (Green Taxis))

The goal of this milestone is to load a CSV file, perform exploratory data analysis with visualization, extract additional data, perform feature engineering, and preprocess the data for downstream use cases such as ML and data analysis. The dataset you will be working on is the NYC green taxis dataset. It contains records about trips conducted in NYC by green taxis.

There are multiple datasets for this case study (one dataset for each month). Download the dataset from here.

My dataset was 10/2016; the code is reproducible and can work with any month/year.
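A minimal sketch of this kind of preparation with pandas, on a tiny made-up stand-in frame (the column names mirror the green taxi schema, but the values and the exact imputation/outlier rules here are illustrative assumptions, not the project's actual code):

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the green taxi CSV (illustrative values only).
df = pd.DataFrame({
    "trip_distance": [1.2, np.nan, 250.0, 3.4],
    "fare_amount": [6.5, 8.0, 900.0, 12.0],
    "lpep_pickup_datetime": pd.to_datetime(
        ["2016-10-01 08:15", "2016-10-01 09:00",
         "2016-10-02 17:30", "2016-10-03 23:45"]
    ),
})

# Impute missing distances with the median.
df["trip_distance"] = df["trip_distance"].fillna(df["trip_distance"].median())

# Handle outliers in the fare with the 1.5*IQR rule.
q1, q3 = df["fare_amount"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["fare_amount"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Feature engineering: pickup hour and a weekend flag.
df["pickup_hour"] = df["lpep_pickup_datetime"].dt.hour
df["is_weekend"] = df["lpep_pickup_datetime"].dt.dayofweek >= 5
```

The same steps (impute, clip, derive features) apply unchanged to any month's file, which is what keeps the notebook reproducible across datasets.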

Milestone 2 (Docker Packaging and PostgreSQL Integration)

The objective of this milestone is to package your milestone 1 code in a Docker image that can be run anywhere. In addition, you will load your cleaned and prepared dataset, as well as your lookup table, into a PostgreSQL database, which acts as your data warehouse.
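Loading the two tables into the warehouse can be sketched with pandas and SQLAlchemy. The table names below are assumptions, and an in-memory SQLite engine stands in for PostgreSQL so the snippet is self-contained; against the real warehouse the connection URL would instead look like `postgresql://user:password@localhost:5432/taxi` (hypothetical credentials):

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite stands in for the PostgreSQL warehouse in this sketch.
engine = create_engine("sqlite://")

# Stand-ins for the cleaned dataset and the lookup table.
cleaned = pd.DataFrame({"trip_id": [1, 2], "fare_amount": [6.5, 12.0]})
lookup = pd.DataFrame({"code": [1], "zone": ["Astoria"]})

# Write both as separate tables, replacing any previous load.
cleaned.to_sql("green_taxis_cleaned", engine, if_exists="replace", index=False)
lookup.to_sql("lookup_table", engine, if_exists="replace", index=False)

# Read back to confirm the load.
back = pd.read_sql("SELECT COUNT(*) AS n FROM green_taxis_cleaned", engine)
```

`if_exists="replace"` makes the load idempotent, which matters once the same step is re-run by a container or, later, by Airflow.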

Milestone 3 (Preprocessing Yellow Taxis Data with PySpark)

The goal of this milestone is to preprocess the 'New York yellow taxis' dataset using PySpark, performing basic data preparation and basic analysis to gain a better understanding of the data. Use the same month and year you used for the green taxis in milestone 1. Datasets (download the yellow taxis dataset).

Milestone 4 (Airflow Orchestration of Tasks)

For this milestone, we were required to orchestrate the tasks performed in milestones 1 and 2 using Airflow in Docker. We work primarily on the green dataset and preprocess it using pandas only, for simplicity. The task chain from milestones 1 and 2 is: read CSV (green_taxis) >> clean and transform >> load to CSV (both the cleaned dataset and the lookup table) >> extract additional resources (GPS coordinates) >> integrate with the cleaned dataset and load back to CSV >> load both CSV files (lookup and cleaned dataset) to the Postgres database as two separate tables.
