This repository contains an implementation of an end-to-end fake news classifier.
It is the final capstone project for the MLOps Zoomcamp course from DataTalks.Club.
Misinformation is widespread. The aim of the project is to train and deploy a model that detects fake claims in articles.
Emphasis is largely placed on the MLOps pipeline.
Data source: Fake and real news dataset
The data consists of about 40,000 articles split across two separate datasets, one per news category (fake and real), with each dataset containing around 20,000 articles.
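A quick way to inspect the data (a minimal sketch; the file names Fake.csv and True.csv and the label encoding are assumptions about the Kaggle dataset layout):

```python
import pandas as pd

# Load and combine the two datasets; file names and label encoding are assumed.
fake = pd.read_csv("data/Fake.csv")
real = pd.read_csv("data/True.csv")
fake["label"], real["label"] = 0, 1  # 0 = fake, 1 = real (assumed encoding)

df = pd.concat([fake, real], ignore_index=True)
print(df.shape)  # roughly 40,000 articles in total
```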
Tools used include:
- Terraform is the Infrastructure as Code (IaC) tool used for creating resources.
- MLflow for experiment tracking and as a model registry.
- Docker for containerization.
- Prefect 2.0 for workflow orchestration.
- AWS Lambda for cloud deployment and inference.
- Flask for local deployment and inference.
- Evidently AI for monitoring.
- GitHub Actions for Continuous Integration and Continuous Delivery.
The model builds on ideas from Madhav Mathur's notebook.
Words are represented using GloVe embeddings, a word vector technique that incorporates global statistics (word co-occurrence) to obtain word vectors. More info about GloVe here.
An LSTM model with 5 layers was trained using TensorFlow and Keras.
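A minimal sketch of this architecture (not the exact training code; vocabulary size, sequence length, and layer sizes are illustrative assumptions):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size, embedding_dim, max_len = 10_000, 100, 256  # assumed values
embedding_matrix = np.zeros((vocab_size, embedding_dim))  # filled from a GloVe file in practice

model = Sequential([
    # Frozen embedding layer initialized with pre-trained GloVe vectors
    Embedding(vocab_size, embedding_dim, weights=[embedding_matrix],
              input_length=max_len, trainable=False),
    LSTM(128, return_sequences=True),
    LSTM(64),
    Dense(1, activation="sigmoid"),  # binary output: fake vs. real
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```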
Optional - Create a VM with about 8 GB of RAM. This allows for faster training, downloading the fairly large dataset (~2 GB), and pulling and pushing the required Docker images.
Set up an AWS account.
Set up a Kaggle account for getting the data.
Install Python 3.9
wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.12.0-Linux-x86_64.sh
chmod +x Miniconda3-py39_4.12.0-Linux-x86_64.sh
./Miniconda3-py39_4.12.0-Linux-x86_64.sh
rm Miniconda3-py39_4.12.0-Linux-x86_64.sh
Install aws-cli
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
sudo apt install unzip
unzip awscliv2.zip
sudo ./aws/install
rm -r awscliv2.zip aws/
Create an AWS user with administrator access. Note the `AWS_SECRET_ACCESS_KEY` and `AWS_ACCESS_KEY_ID`.
https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users_create.html
Install docker
https://docs.docker.com/desktop/install/linux-install/
Allow docker to run without sudo
https://docs.docker.com/engine/install/linux-postinstall/
Install docker-compose
sudo apt install docker-compose
Install Terraform
https://learn.hashicorp.com/tutorials/terraform/install-cli
If interested in testing the automated deploy capabilities using GitHub Actions, fork the repository and clone the fork to your local machine.
OR
To test locally or deploy manually, clone the repository to your local machine.
git clone https://github.com/IzicTemi/e2e_fake_news_classifier.git
cd e2e_fake_news_classifier
Edit set_env.sh in the scripts folder.
- Get `KAGGLE_USERNAME` and `KAGGLE_KEY` following the instructions here.
- `DATA_PATH` is the path in which to store the data. Preferably "data".
- `MODEL_BUCKET` is the intended name of the S3 bucket for storing MLflow artifacts.
- `PROJECT_ID` is the tag added to created resources to ensure uniqueness.
- `MLFLOW_TRACKING_URI` is the tracking server URL. The default is http://127.0.0.1:5000 for a local MLflow setup. Leave it empty if you want to set up MLflow on an AWS EC2 instance.
- `TFSTATE_BUCKET` is the intended name of the S3 bucket for storing Terraform state files.
- `AWS_SECRET_ACCESS_KEY` and `AWS_ACCESS_KEY_ID` are from the user created above.
- `AWS_DEFAULT_REGION` is the default region in which resources are created.
- `ECR_REPO_NAME` is the intended name of the ECR registry for storing Docker images.
- `MODEL_NAME` is the name under which to register the trained models.
- Optional: `MLFLOW_TRACKING_USERNAME` and `MLFLOW_TRACKING_PASSWORD` if using an authenticated MLflow server.
Run command:
source scripts/set_env.sh
This can be done from the console or by running
aws s3api create-bucket --bucket $TFSTATE_BUCKET \
--region $AWS_DEFAULT_REGION
make setup_tf_vars
Manually set up MLflow on an EC2 instance by following the instructions here.
OR
Run
make mlflow_server
- The above command creates a free-tier-eligible t2.micro EC2 instance and installs MLflow on it using Terraform.
- It also creates a key pair called webserver_key and downloads the private key to the ec2 module folder in the infrastructure folder. This allows Terraform to interact with the EC2 instance.
- A SQLite DB is used as the backend store. A better implementation would be a managed RDS instance; this could be added later.
find . -type f -exec sed -i "s/us-east-1/$AWS_DEFAULT_REGION/g" {} \;
make setup
- The above command installs pipenv, which in turn sets up the virtual environment.
- It also installs the pre-commit hooks.
pipenv shell
make create_bucket
python get_data.py
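A minimal sketch of what get_data.py might do (the dataset slug and the use of DATA_PATH are assumptions about the actual script):

```python
import os
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads KAGGLE_USERNAME and KAGGLE_KEY from the environment

# Download and unzip the Kaggle dataset into DATA_PATH (assumed slug).
api.dataset_download_files(
    "clmentbisaillon/fake-and-real-news-dataset",
    path=os.getenv("DATA_PATH", "data"),
    unzip=True,
)
```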
mlflow server --host 0.0.0.0 --backend-store-uri sqlite:///mlflow.db \
--default-artifact-root $ARTIFACT_LOC --serve-artifacts
Navigate to http://<IP>:5000
- <IP> is localhost or 127.0.0.1 if running on your PC; otherwise, it's the VM's public IP address.
The model training process performs a hyperparameter search to find the best parameters. This can take a very long time and is memory-intensive. If interested in the full training process, run:
python train.py
For testing purposes, set a small number of optimization trials and lower the number of epochs used to train the model:
python train.py --n_evals 2 --epochs 3
- On completion of the optimization and training process, the best run is registered as a model and promoted to Production. This is implemented in the register_best_model function, sketched below.
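A minimal sketch of what register_best_model might look like (the experiment name and metric key are illustrative assumptions; the MLflow calls themselves are standard):

```python
import os
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Pick the best run from the experiment by a validation metric (assumed name/key).
experiment = client.get_experiment_by_name("fake-news-classifier")
best_run = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    order_by=["metrics.val_accuracy DESC"],
    max_results=1,
)[0]

# Register the best run's model under MODEL_NAME and promote it to Production.
model_name = os.environ["MODEL_NAME"]
version = mlflow.register_model(f"runs:/{best_run.info.run_id}/model", model_name)
client.transition_model_version_stage(
    name=model_name, version=version.version, stage="Production"
)
```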
- Deploy the web service locally using Flask.

cd web_service_local
./run.sh

- To make inferences, make a `POST` request to http://127.0.0.1:9696/classify.
- The content of the `POST` request should be of the format:

{
    'text': text
}

OR

Edit and run test.py in the web_service_local folder.

python web_service_local/test.py

- Manually deploy the web service to AWS Lambda.

Note: Ensure you're using a hosted MLflow Server when running this. See step 4 in Preparing your Workspace above.

make publish

- The above command uses Terraform to deploy the model to AWS Lambda and exposes it using an API Gateway endpoint.
- The script outputs the endpoint of the Lambda function.
- To make inferences, make a `POST` request to the output URL.
- The content of the `POST` request should be of the format:

{
    'text': text
}

- If you get a {'message': 'Endpoint request timed out'} error, retry the request; the initial model loading takes time.

OR

Edit and run test.py in the web_service folder.

python web_service/test.py
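For reference, a minimal sketch of such a request using Python's requests library (the example article text is made up; swap in the API Gateway URL from make publish when testing the Lambda deployment):

```python
import requests

url = "http://127.0.0.1:9696/classify"  # or the Lambda endpoint output by `make publish`
payload = {"text": "Breaking: scientists confirm the moon is made of cheese."}

response = requests.post(url, json=payload)
print(response.json())  # the predicted label/score for the article
```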
A Production Environment is simulated to get insights into model metrics and behavior. To implement this, follow the steps below:
make monitor_setup
- The above command pulls the MongoDB docker image and runs it on port 27017.
- It also starts up the web service from web_service_local on port 9696.
2. Run send_data.py to simulate requests to the model web service.
python monitoring/send_data.py
- The above script creates a shuffled dataframe from the dataset and makes a `POST` request with the text from each row to the model service for prediction.
- It saves the real values and an id to target.csv.
- To generate enough data, let this run for at least 30 minutes.
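A minimal sketch of what send_data.py does (the file paths, column names, and request shape are assumptions about the actual script):

```python
import uuid
import requests
import pandas as pd

df = pd.read_csv("data/combined.csv").sample(frac=1)  # shuffled dataframe (assumed path)

with open("target.csv", "a") as f_target:
    for _, row in df.iterrows():
        row_id = str(uuid.uuid4())
        # Send each article to the model service for prediction.
        requests.post(
            "http://127.0.0.1:9696/classify",
            json={"id": row_id, "text": row["text"]},
        )
        # Save the id and real label for later comparison against predictions.
        f_target.write(f"{row_id},{row['label']}\n")
```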
python prefect_monitoring.py
- The above command sets up a Prefect workflow which uses Evidently AI to calculate data drift, target drift, and classification performance.
- This generates an HTML report, evidently_report.html, showing the metrics.
- It also checks the performance of the Production model against the reference and triggers the training flow if performance is poor (a threshold of a 10% difference is set).
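A minimal sketch of generating such a report, assuming Evidently's Report API with metric presets (the reference/current dataframes and their columns are illustrative; the actual flow pulls current data from MongoDB):

```python
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset

reference = pd.read_csv("data/reference.csv")  # assumed reference dataset
current = pd.read_csv("data/current.csv")      # assumed current predictions

report = Report(metrics=[DataDriftPreset(), TargetDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("evidently_report.html")  # the HTML report mentioned above
```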
A sample report is shown below
make stop_monitor
To automate getting the data, training the model, and running the monitoring analysis on a schedule, we use Prefect's deployment capabilities.
python prefect_deploy.py
prefect agent start --work-queue "main"
- The above script uses Prefect to automate the deployments using a Cron scheduler.
- Two deployments are currently set up:
  - One runs the training workflow, scheduled weekly at 00:00 on Monday.
  - Another runs the model analysis workflow, scheduled weekly at 00:00 on Thursday.
- The second command sets up the agent to look for scheduled work and run it at the appointed time.
- To change the schedule, edit the prefect_deploy.py file and change the Cron schedule (a sketch of the deployment script follows these steps).
- To view the scheduled deployments, run:
prefect orion start --host 0.0.0.0
Navigate to http://<IP>:4200
- <IP> is localhost or 127.0.0.1 if running on your PC; otherwise, it's the VM's public IP address.
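A minimal sketch of how prefect_deploy.py might build the two Cron-scheduled deployments (assuming Prefect 2.x's Deployment.build_from_flow API; the flow imports and deployment names are illustrative):

```python
from prefect.deployments import Deployment
from prefect.orion.schemas.schedules import CronSchedule

from train import main_flow                    # assumed training flow
from prefect_monitoring import batch_analyze   # assumed monitoring flow

training = Deployment.build_from_flow(
    flow=main_flow,
    name="model_training",
    schedule=CronSchedule(cron="0 0 * * 1"),  # 00:00 every Monday
    work_queue_name="main",
)
monitoring = Deployment.build_from_flow(
    flow=batch_analyze,
    name="model_monitoring",
    schedule=CronSchedule(cron="0 0 * * 4"),  # 00:00 every Thursday
    work_queue_name="main",
)

if __name__ == "__main__":
    training.apply()
    monitoring.apply()
```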
An example of scheduled runs is shown below
This runs linting and unit tests on the code. It also builds the web service and ensures that inferences can be successfully made.
Ensure you're in the base folder to run these.
make test
make integration_test
This allows for automatic tests and deployment by making and pushing changes to the repository.
git checkout test-branch
3. Perform all steps in Preparing your workspace above and steps 1 - 6 from Instructions.
- On the GitHub repo, navigate to Settings -> Secrets -> Actions.
- Add new secrets by clicking on "New repository secret".
- Copy the output of the command below and set it as the value of `SSH_PRIVATE_KEY`. This allows Terraform to interact with the MLflow Server.
cat infrastructure/modules/ec2/webserver_key.pem
5. Edit ci-tests.yaml and cd-deploy.yml in the .github/workflows folder.
- Replace the env variable `MODEL_NAME` in ci-tests.yaml and cd-deploy.yml.
- Replace the env variable `ECR_REPO_NAME` in ci-tests.yaml.
- This triggers the Continuous Integration workflow, which runs the unit tests and integration test and validates the Terraform configuration.
- This triggers the Continuous Deployment workflow, which applies the Terraform configuration and deploys the infrastructure.
On completing the steps above, destroy all the setup infrastructure by running:
make destroy
Note: This destroys all created infrastructure except the Terraform state bucket, including the MLflow Server and the models bucket. To prevent destruction of the models bucket, edit the s3 module in the Terraform configuration and set:
force_destroy = false
Empty and delete the Terraform state bucket from the console or by running:
aws s3 rm s3://$TFSTATE_BUCKET --recursive
aws s3api delete-bucket --bucket $TFSTATE_BUCKET
- The instructors of the MLOps Zoomcamp course, who taught most of the concepts used in the project.
- The DataTalksClub community.
This project is licensed under the MIT License - see the LICENSE.md file for details.