This project aims to detect phishing emails on Android using federated learning. The application extracts features from emails and uses them as dynamically created datasets for phishing email classification. It also supports training and retraining the model on new data, evaluating models, and includes a federated server for managing model weights.
To install and set up the Android application, follow these steps:
- Clone the repository:
    git clone https:/your-username/phishing-emails-detection.git
- Install the app through Android Studio:
  - Open the cloned project in Android Studio.
  - Set up the debug key:
    - Open File -> Project Structure.
    - Navigate to SDK Location -> Debug keystore.
    - Set the path to the debug.keystore file in the root directory.
  - Build and run the app:
    - Click Run -> Run 'app'.
    - Choose your device or an emulator.
Note: This app is currently in development mode and limited to test users.
For test access, contact [email protected].
To set up the federated server, follow these steps:
- Prerequisites:
  - Python 3.8
  - pip3
- Create and activate a Python virtual environment:
    cd server
    python3.8 -m venv env_server
    source ./env_server/bin/activate
- Install dependencies and run the server:
    pip install -r requirements.txt
    python server.py
The app can import emails from various sources and process them for feature extraction.
- Gmail Import: Users can use their Google account to import emails directly from Gmail.
- EML Import: Users can import individual .eml files.
- MBOX Import: Users can import .mbox files containing multiple emails.
When importing, users are asked to label the emails as phishing or safe.
- Email Packaging: Users can combine multiple emails into packages for processing.
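As a rough illustration of the .eml and .mbox import paths described above, the following sketch parses both formats with Python's standard `email` and `mailbox` modules and attaches the user's label. The function names (`load_eml`, `load_mbox`, `label_emails`) and the dictionary layout are assumptions made for this example and are not taken from the app's code.

```python
import mailbox
from email import message_from_string, policy


def load_eml(raw_text):
    """Parse the text of a single .eml file into a message object."""
    return message_from_string(raw_text, policy=policy.default)


def load_mbox(path):
    """Return every message stored in an .mbox file as a list."""
    box = mailbox.mbox(path)
    try:
        return list(box)
    finally:
        box.close()


def label_emails(messages, label):
    """Attach the user's label ('phishing' or 'safe') to each message.

    The returned dictionaries are a hypothetical layout for illustration.
    """
    if label not in ("phishing", "safe"):
        raise ValueError("label must be 'phishing' or 'safe'")
    return [{"subject": str(m["Subject"]), "label": label} for m in messages]
```

A package of emails could then be built by concatenating the labeled lists produced from several imports.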
The app provides several features for machine learning, including feature extraction, training, and retraining.
- Feature Extraction: Users can extract features from emails using Python integration.
- Training: Users can train the model on the extracted features.
- Retraining: Users can retrain the model with new data.
- Model Evaluation: Users can evaluate the performance of the trained model.
- Phishing Detection: Users can use the selected model to classify a single email as phishing or safe using logistic regression.
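To make the training and detection steps above more concrete, here is a minimal logistic regression sketch in plain Python. It assumes each email has already been reduced to a numeric feature vector; the function names, the batch gradient descent loop, the learning rate, and the 0.5 threshold are all illustrative assumptions rather than the project's actual training code.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights, bias, features):
    """Probability that a feature vector belongs to a phishing email."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def train(samples, labels, epochs=500, lr=0.1):
    """Fit weights with batch gradient descent on the log loss.

    labels use 1 for phishing and 0 for safe (an assumption of this sketch).
    """
    n_features = len(samples[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(samples, labels):
            error = predict_proba(weights, bias, x) - y
            grad_b += error
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
        bias -= lr * grad_b / len(samples)
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad_w)]
    return weights, bias


def classify(weights, bias, features, threshold=0.5):
    """Map a probability to the 'phishing' / 'safe' labels used by the app."""
    if predict_proba(weights, bias, features) >= threshold:
        return "phishing"
    return "safe"
```

Retraining, in this simplified picture, would amount to calling `train` again with the new samples, possibly starting from previously learned weights instead of zeros.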
The federated server handles weight management for federated learning.
- Upload Weights: Users can upload the local model weights.
- Download Global Weights: Users can download the globally averaged weights.
- Check Server Status: Users can ping the server to check its status.
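The upload, download, and status operations above can be sketched as a small in-memory store that averages whatever weights clients have uploaded. This is only an illustration: the class name `WeightStore`, its methods, and the optional weighting by sample count are assumptions, and the real server may average differently or persist weights elsewhere.

```python
def federated_average(client_weights, client_sizes=None):
    """Average weight vectors from several clients.

    If client_sizes is given, each client's vector is weighted by its
    number of training samples; otherwise a plain mean is used.
    """
    if not client_weights:
        raise ValueError("no client weights uploaded")
    length = len(client_weights[0])
    if any(len(w) != length for w in client_weights):
        raise ValueError("all clients must upload vectors of the same length")
    if client_sizes is None:
        client_sizes = [1] * len(client_weights)
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(length)
    ]


class WeightStore:
    """Hypothetical in-memory stand-in for the server's weight management."""

    def __init__(self):
        self._uploads = {}

    def upload(self, client_id, weights, n_samples=1):
        # A later upload from the same client replaces the earlier one.
        self._uploads[client_id] = (list(weights), n_samples)

    def global_weights(self):
        vectors = [w for w, _ in self._uploads.values()]
        sizes = [n for _, n in self._uploads.values()]
        return federated_average(vectors, sizes)

    def status(self):
        return {"status": "ok", "clients": len(self._uploads)}
```

Weighting by sample count keeps a client with very little data from pulling the global model as strongly as a client with a large local dataset.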
- Google Login: Users can log in using their Google account.
- Logout: Users can log out from their account.
- Integration with Gmail API: Seamless integration with Gmail API for importing emails.
- Email Import: Users can import emails from Gmail, .eml, and .mbox files.
- Email Labeling: Users can label imported emails as phishing or safe.
- Email Packaging: Combine multiple emails into packages for processing.
- Feature Extraction: Extract features from emails using integrated Python scripts.
- Machine Learning:
  - Training: Train the model on extracted features.
  - Retraining: Retrain the model with new data.
  - Model Evaluation: Evaluate the performance of trained models.
  - Phishing Detection: Classify individual emails as phishing or safe using logistic regression.
- Federated Learning:
  - Upload Weights: Upload local model weights to the federated server.
  - Download Weights: Download globally averaged weights from the server.
  - Server Status: Check the operational status of the federated server.
- Set Federated Server IP: Dynamically set the IP address of the federated server.
The project is structured to separate concerns and ensure modularity. Below is an overview of the main directories and their purposes:
- Data: Contains data-related classes, repositories, and entities for handling email data.
  - Local: Local data sources and caches.
  - Remote: Manages remote data sources, such as API calls.
  - Repositories: Interfaces for data access and management.
  - Auth: Handles user authentication.
  - DB: Database configurations and access.
  - Entity: Entity classes representing different data models such as EmailFull, EmailMinimal, EmailPackageMetadata, etc.
- Python: Contains Python scripts and modules for machine learning and data processing.
  - DataProcessing: Scripts for processing email data.
  - EvaluateModel: Scripts for evaluating models.
  - Prediction: Scripts for making predictions.
  - Retraining: Scripts for retraining models.
  - Training: Scripts for training models.
  - WeightManager: Manages model weights.
  - PythonSingleton: Singleton class that starts and holds the Python interpreter.
- DI: Dependency injection modules.
  - AppModule: Provides application-wide dependencies.
  - DatabaseModule: Provides database-related dependencies.
  - NetworkModule: Provides network-related dependencies.
- UI: User interface components.
  - Base: Base classes for UI components.
  - component: Specific UI components for authentication, email detection, machine learning, and settings.
- App: Main application class.
- MainActivity: Main activity of the application.
- Utils: Utility classes and functions.
Our phishing detection uses several feature finders, each responsible for extracting specific elements from emails that are commonly used in phishing attempts:
- HTMLFormFinder: Identifies HTML forms within emails, a common phishing vector to solicit user information.
- IFrameFinder: Detects the use of IFrames, potentially embedding malicious content invisibly.
- FlashFinder: Searches for Flash content links, which could execute harmful scripts.
- AttachmentFinder: Counts email attachments, which may contain malicious payloads.
- HTMLContentFinder: Looks for specific HTML content indicative of phishing.
- URLsFinder: Extracts and evaluates URLs found within emails for malicious links.
- ExternalResourcesFinder: Identifies external resources linked within emails that could be harmful.
- JavascriptFinder: Detects JavaScript, which can be used in phishing for malicious activities.
- CssFinder: Searches for custom CSS that might be used to disguise phishing attempts.
- IPsInURLs: Checks for IP addresses in URLs, a technique used to bypass domain name suspicion.
- AtInURLs: Identifies '@' symbols in URLs, which can be a sign of deceptive links.
- EncodingFinder: Analyzes the content encoding for signs of obfuscation or unusual patterns.
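A few of the finders listed above can be approximated with regular expressions and the standard library. The sketch below is a simplified stand-in, not the project's implementation: the function names, the regex patterns, and the shape of the returned feature dictionary are assumptions chosen for illustration, and only forms, iframes, URLs, IPs in URLs, and '@' in URLs are covered.

```python
import re
from urllib.parse import urlparse

# Deliberately loose URL pattern; good enough for a sketch.
URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)
IPV4_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def find_urls(body):
    """Return all http(s) URLs found in the email body."""
    return URL_RE.findall(body)


def count_ips_in_urls(urls):
    """Number of URLs whose host is a raw IPv4 address."""
    return sum(1 for u in urls if IPV4_RE.match(urlparse(u).hostname or ""))


def count_at_in_urls(urls):
    """Number of URLs containing an '@' symbol."""
    return sum(1 for u in urls if "@" in u)


def count_tag(body, tag):
    """Count opening occurrences of an HTML tag such as 'form' or 'iframe'."""
    return len(re.findall(r"<" + re.escape(tag) + r"\b", body, re.IGNORECASE))


def extract_features(body):
    """Build a feature dictionary for one email body (hypothetical layout)."""
    urls = find_urls(body)
    return {
        "html_forms": count_tag(body, "form"),
        "iframes": count_tag(body, "iframe"),
        "urls": len(urls),
        "ips_in_urls": count_ips_in_urls(urls),
        "at_in_urls": count_at_in_urls(urls),
    }
```

Each count becomes one column of the feature vector passed to the classifier; the remaining finders would add further columns in the same way.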
This project builds upon and extends the work found at MachineLearningPhishing by Diego Ocampo.
The data used for training the phishing detection model were sourced from two main repositories, which provided a rich dataset of phishing emails:
- Phishing Pot Dataset by rf-peixoto (converted .eml to mbox using scripts in this repo)
- Phishing Dataset by jose at monkey.org (downloaded mbox files)
If you want to contribute to this project, please follow these guidelines:
- Fork the repository.
- Create a new branch.
- Make your changes and commit them.
- Push your changes to your fork.
- Create a pull request.
This project is licensed under the MIT License. See the LICENSE file for details.