pre-commit: standardize pre-commit across repos
DonHaul committed Oct 10, 2024
1 parent d59c610 commit 8124217
Showing 18 changed files with 129 additions and 104 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/build.yml
@@ -54,4 +54,4 @@ jobs:
 type=ref,event=pr
 type=ref,event=tag
 username: ${{ secrets.HARBOR_USERNAME }}
-password: ${{ secrets.HARBOR_SECRET }}
\ No newline at end of file
+password: ${{ secrets.HARBOR_SECRET }}
2 changes: 1 addition & 1 deletion .github/workflows/pre-commit.yml
@@ -23,4 +23,4 @@ jobs:
 python-version: "3.11"

 - name: Run pre-commit
-uses: pre-commit/[email protected]
\ No newline at end of file
+uses: pre-commit/[email protected]
2 changes: 1 addition & 1 deletion .github/workflows/pull-request-master.yml
@@ -9,4 +9,4 @@ jobs:
 uses: ./.github/workflows/pre-commit.yml
 with:
 ref: ${{ github.event.pull_request.head.sha }}
-secrets: inherit
\ No newline at end of file
+secrets: inherit
2 changes: 1 addition & 1 deletion .gitignore
@@ -165,4 +165,4 @@ cython_debug/
 *.csv
 *.h5

-!/tests/integration/fixtures/inspire_test_data.df
\ No newline at end of file
+!/tests/integration/fixtures/inspire_test_data.df
25 changes: 14 additions & 11 deletions .pre-commit-config.yaml
@@ -1,14 +1,17 @@
 repos:
-  - repo: https://github.com/psf/black
-    rev: '24.4.2'
+  - repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v4.6.0
     hooks:
-      - id: black
-  - repo: https://github.com/pycqa/isort
-    rev: '5.13.2'
+      - id: check-yaml
+      - id: end-of-file-fixer
+      - id: trailing-whitespace
+      - id: fix-byte-order-marker
+      - id: mixed-line-ending
+      - id: name-tests-test
+        args: [ --pytest-test-first ]
+        exclude: '^(?!factories/)'
+  - repo: https://github.com/astral-sh/ruff-pre-commit
+    rev: v0.6.9
     hooks:
-      - id: isort
-  - repo: https://github.com/pycqa/flake8
-    rev: '7.1.0'
-    hooks:
-      - id: flake8
-        args: ['--config=setup.cfg']
+      - id: ruff
+        args: [ --fix]
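With this standardized configuration, the same checks run locally and in CI: `pre-commit install` registers the git hook once, and `pre-commit run --all-files` replays every hook against the whole tree (standard pre-commit commands, noted here for convenience; they are not part of the diff).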
12 changes: 5 additions & 7 deletions README.md
@@ -1,9 +1,9 @@
# Inspire Classifier

## About
INSPIRE module aimed at automatically classifying the new papers that are added to INSPIRE, such as whether they are core or not, or which arXiv category corresponds to each of them.

The current implementation uses the ULMFiT approach. Universal Language Model Fine-tuning is a method for training text classifiers by first pretraining a language model on a large corpus to learn general language features (in this case a pre-loaded model trained on the WikiText-103 dataset is used). The pretrained model is then fine-tuned on the titles and abstracts of the INSPIRE dataset before the classifier is trained on top.
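As a rough illustration of the ULMFiT recipe described above, a minimal fastai sketch follows. The dataframe `df`, its column names, and the epoch counts are placeholders, not the project's real configuration; the actual training code lives in `inspire_classifier/domain/models.py`.

```
from fastai.text.all import (
    AWD_LSTM,
    TextDataLoaders,
    language_model_learner,
    text_classifier_learner,
)

# Assumes a dataframe `df` with a "text" column (title + abstract)
# and a "label" column (0: Rejected, 1: Non-Core, 2: Core).

# 1. Fine-tune the WikiText-103-pretrained language model on domain text.
dls_lm = TextDataLoaders.from_df(df, text_col="text", is_lm=True)
lm_learn = language_model_learner(dls_lm, AWD_LSTM, pretrained=True)
lm_learn.fit_one_cycle(1, 1e-2)
lm_learn.save_encoder("finetuned_language_model_encoder")

# 2. Train the classifier on top of the fine-tuned encoder.
dls_clf = TextDataLoaders.from_df(df, text_col="text", label_col="label")
clf_learn = text_classifier_learner(dls_clf, AWD_LSTM)
clf_learn.load_encoder("finetuned_language_model_encoder")
clf_learn.fit_one_cycle(1, 2e-2)
```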



@@ -28,7 +28,7 @@ poetry run python scripts/create_dataset.py --year-from $YEAR_FROM --month-from


### 2. Run training and validate model
The [`train_classifier.py`](scripts/train_classifier.py) script runs the commands to train and validate a new model. Configuration changes such as the number of training epochs and the train-test split can be adjusted here. In short, the script first splits the pkl file from the first step into a training and a test dataset inside the `classifier/data` folder. The training set is then used to train the model, while the test set is used to evaluate the model after training is finished. The model will be saved to `classifier/models/language_model/finetuned_language_model_encoder.h5`.

```
poetry run python scripts/train_classifier.py
```

@@ -49,15 +49,13 @@ poetry run python scripts/upload_to_s3.py

1. Build docker image: `docker build -t inspirehep/classifier:<NEW TAG> .`
2. Login with inspirehep user on dockerhub: `docker login`
3. Push image to dockerhub: `docker push inspirehep/classifier:<NEW TAG>`
4. Change `newTag` in the `kustomization.yml` file in the [k8s repo](https:/cern-sis/kubernetes/tree/master/classifier).




## How to run
For testing, the CLI of the classifier can be used via `poetry run inspire-classifier 'example title' 'example abstract'`. With the `-b` flag, the base path to check for the training data can be passed (which currently should be `-b classifier`).

In production, the API is used to predict the 'coreness' of records via the `/api/predict/coreness` endpoint, passing `title` and `abstract` as JSON fields in a POST request (see [this file](inspire_classifier/app.py) for details).
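For illustration, a minimal client call against that endpoint could look like the sketch below; the host and port are assumptions, only the path and the JSON fields come from the paragraph above.

```
import requests

response = requests.post(
    "http://localhost:5000/api/predict/coreness",  # host/port assumed
    json={"title": "example title", "abstract": "example abstract"},
)
print(response.json())
```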


4 changes: 3 additions & 1 deletion inspire_classifier/__init__.py
@@ -20,4 +20,6 @@
 # granted to it by virtue of its status as an Intergovernmental Organization
 # or submit itself to any jurisdiction.

-"""INSPIRE module aimed at automatically classifying the new papers that are added to INSPIRE, such as if they are core or not, or the arXiv category corresponding to each of them."""
+"""INSPIRE module aimed at automatically classifying the new papers that are added to
+INSPIRE, such as if they are core or not, or the arXiv category corresponding to each
+of them."""
12 changes: 6 additions & 6 deletions inspire_classifier/api.py
@@ -56,9 +56,9 @@ def split_data():
         )
     except IOError as error:
         raise IOError(
-            "Training dataframe not found. Make sure the file is present in the right directory. "
-            "Please use the path specified in config.py for CLASSIFIER_DATAFRAME_PATH relative to the "
-            "CLASSIFIER_BASE_PATH."
+            "Training dataframe not found. Make sure the file is present in the right "
+            "directory. Please use the path specified in config.py for "
+            "CLASSIFIER_DATAFRAME_PATH relative to the CLASSIFIER_BASE_PATH."
         ) from error


@@ -93,8 +93,8 @@ def finetune_and_save_language_model():
         )
     except IOError as error:
         raise IOError(
-            "Unable to save the finetuned language model. Please check that the language model data directory "
-            "exists."
+            "Unable to save the finetuned language model. Please check that the "
+            "language model data directory exists."
         ) from error


@@ -182,7 +182,7 @@ def predict_coreness(classifier, title, abstract):

     predicted_class = categories[np.argmax(class_probabilities)]
     output_dict = {"prediction": predicted_class}
-    output_dict["scores"] = dict(zip(categories, class_probabilities))
+    output_dict["scores"] = dict(zip(categories, class_probabilities, strict=False))

     return output_dict
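For context, the dictionary built here pairs each category with its probability, so a response carries one top-level prediction plus a per-class score. Assuming the usual three coreness labels (an assumption, not shown in this hunk), it looks roughly like:

```
# Illustrative only; labels and scores are made up.
{
    "prediction": "core",
    "scores": {"rejected": 0.05, "non_core": 0.15, "core": 0.80},
}
```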

9 changes: 5 additions & 4 deletions inspire_classifier/app.py
@@ -28,13 +28,13 @@
 from prometheus_flask_exporter.multiprocess import GunicornInternalPrometheusMetrics
 from webargs.flaskparser import use_args

+from inspire_classifier import serializers
 from inspire_classifier.api import initialize_classifier, predict_coreness

-from . import serializers
-

 class JsonResponse(Response):
-    """ "By creaitng this Response class, we force the response to always be in json, getting rid of the jsonify function."""
+    """By creating this Response class, we force the response to always be in json,
+    getting rid of the jsonify function."""

     @classmethod
     def force_type(cls, rv, environ=None):
@@ -60,7 +60,8 @@ def create_app():

     @app.route("/api/health")
     def date():
-        """Basic endpoint that returns the date, used to check if everything is up and working."""
+        """Basic endpoint that returns the date, used to check if everything is up
+        and working."""
         now = datetime.datetime.now()
         return jsonify(now)
34 changes: 16 additions & 18 deletions inspire_classifier/cli.py
@@ -43,11 +43,10 @@ def inspire_classifier():
     "-b", "--base-path", type=click.Path(exists=True), required=False, nargs=1
 )
 def predict(title, abstract, base_path):
-    with click_spinner.spinner():
-        with current_app.app_context():
-            if base_path:
-                current_app.config["CLASSIFIER_BASE_PATH"] = base_path
-            click.echo(predict_coreness(title, abstract))
+    with click_spinner.spinner(), current_app.app_context():
+        if base_path:
+            current_app.config["CLASSIFIER_BASE_PATH"] = base_path
+        click.echo(predict_coreness(title, abstract))


@inspire_classifier.command("train")
@@ -58,19 +57,18 @@ def predict(title, abstract, base_path):
     "-b", "--base-path", type=click.Path(exists=True), required=False, nargs=1
 )
 def train_classifier(language_model_epochs, classifier_epochs, base_path):
-    with click_spinner.spinner():
-        with current_app.app_context():
-            if language_model_epochs:
-                current_app.config["CLASSIFIER_LANGUAGE_MODEL_CYCLE_LENGTH"] = (
-                    language_model_epochs
-                )
-            if classifier_epochs:
-                current_app.config["CLASSIFIER_CLASSIFIER_CYCLE_LENGTH"] = (
-                    classifier_epochs
-                )
-            if base_path:
-                current_app.config["CLASSIFIER_BASE_PATH"] = base_path
-            train()
+    with click_spinner.spinner(), current_app.app_context():
+        if language_model_epochs:
+            current_app.config["CLASSIFIER_LANGUAGE_MODEL_CYCLE_LENGTH"] = (
+                language_model_epochs
+            )
+        if classifier_epochs:
+            current_app.config["CLASSIFIER_CLASSIFIER_CYCLE_LENGTH"] = (
+                classifier_epochs
+            )
+        if base_path:
+            current_app.config["CLASSIFIER_BASE_PATH"] = base_path
+        train()


@inspire_classifier.command("validate")
5 changes: 4 additions & 1 deletion inspire_classifier/domain/models.py
@@ -124,8 +124,11 @@ def initialize_learner(
         self,
         dropout_multiplier=0.5,
         weight_decay=1e-6,
-        learning_rates=np.array([1e-4, 1e-4, 1e-4, 1e-3, 1e-2]),
+        learning_rates=None,
     ):
+        if learning_rates is None:
+            learning_rates = np.array([1e-4, 1e-4, 1e-4, 1e-3, 1e-2])
+
         self.learner = text_classifier_learner(
             self.dataloader,
             AWD_LSTM,
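The `models.py` change above is the standard sentinel-default pattern: a function call such as `np.array(...)` in a default argument is evaluated once, at definition time, and is flagged by flake8-bugbear (rule B008, part of the `B` family selected in the new ruff config). A minimal self-contained sketch of the before/after:

```
import numpy as np

def learn(rates=np.array([1e-3, 1e-2])):  # flagged: call in default argument
    return rates

def learn_fixed(rates=None):  # sentinel default, value built per call
    if rates is None:
        rates = np.array([1e-3, 1e-2])
    return rates
```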
8 changes: 5 additions & 3 deletions inspire_classifier/domain/preprocessor.py
@@ -31,9 +31,11 @@
 def split_and_save_data_for_training(dataframe_path, dest_dir, val_fraction=0.1):
     """
     Args:
-        dataframe_path: The path to the pandas dataframe containing the records. The dataframe should have one
-                        column containing the title and abstract text appended (title + abstract). The second
-                        column should contain the label as an integer (0: Rejected, 1: Non-Core, 2: Core).
+        dataframe_path: The path to the pandas dataframe containing the records.
+                        The dataframe should have one column containing the title and
+                        abstract text appended (title + abstract). The second column
+                        should contain the label as an integer
+                        (0: Rejected, 1: Non-Core, 2: Core).
         dest_dir: Directory to save the training/validation csv.
         val_fraction: the fraction of data to use as the validation set.
     """
7 changes: 0 additions & 7 deletions pyproject.toml
@@ -35,14 +35,12 @@ mock = "^5.1.0"


[tool.poetry.group.dev.dependencies]
-black = "^24.4.2"
pre-commit = "*"
elasticsearch-dsl = "^7.4.0"
elasticsearch = "<7.14.0"
inspire-utils = "3.0.22"


-isort = "^5.13.2"
boto3 = "^1.34.130"
[build-system]
requires = ["poetry-core"]
@@ -56,8 +54,3 @@ inspire-classifier = 'inspire_classifier.cli:inspire_classifier'
testpaths = [
"tests",
]

-[tool.isort]
-profile = "black"
-multi_line_output = 3
-atomic = true
28 changes: 28 additions & 0 deletions ruff.toml
@@ -0,0 +1,28 @@
+target-version = "py311"
+[lint.flake8-tidy-imports]
+ban-relative-imports = "all"
+
+[lint]
+select = [
+    # pycodestyle
+    "E",
+    # Pyflakes
+    "F",
+    # flake8-bugbear
+    "B",
+    # flake8-simplify
+    "SIM",
+    # isort
+    "I",
+    # flake8-tidy-imports
+    "TID",
+    # flake8-pytest-style
+    "PT",
+]
+ignore = ["B904"]
+
+[lint.pycodestyle]
+ignore-overlong-task-comments = true
+
+[lint.pydocstyle]
+convention = "google"
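For reference: `select` opts into the pycodestyle (E), Pyflakes (F), flake8-bugbear (B), flake8-simplify (SIM), isort (I), flake8-tidy-imports (TID), and flake8-pytest-style (PT) rule families, with the `I` rules covering what the dropped isort hook previously did. `ignore = ["B904"]` disables the rule requiring `raise ... from err` inside `except` blocks, and `ban-relative-imports = "all"` is what drives the absolute-import change in `inspire_classifier/app.py` above.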
4 changes: 3 additions & 1 deletion scripts/train_classifier.py
@@ -46,7 +46,9 @@ def train_classifier(
     print("-----------------")

     os.system(
-        f"inspire-classifier train -b classifier --classifier-epochs {number_of_classifier_epochs} --language-model-epochs {number_of_lanuage_model_epochs}"
+        f"inspire-classifier train -b classifier "
+        f"--classifier-epochs {number_of_classifier_epochs} "
+        f"--language-model-epochs {number_of_lanuage_model_epochs}"
     )
     print("training finished successfully!")
     os.system(
18 changes: 0 additions & 18 deletions setup.cfg

This file was deleted.

11 changes: 6 additions & 5 deletions tests/integration/conftest.py
@@ -46,7 +46,7 @@ def app():
     yield app


-@pytest.fixture()
+@pytest.fixture
 def app_client(app):
     return app.test_client()

@@ -55,9 +55,10 @@ class Mock_Learner(Learner):
     """
     Mocks the fit method of the Learner.
-    This is done to reduce the model training time during testing by making the fit run once (as opposed to 2 times and
-    3 times for the LanguageModel and Classifier respectively). It stores the result of the first run and then returns
-    the same result for the other times fit is run.
+    This is done to reduce the model training time during testing by making the fit
+    run once (as opposed to 2 times and 3 times for the LanguageModel and Classifier
+    respectively). It stores the result of the first run and then returns the same
+    result for the other times fit is run.
     """

def fit(self, *args, **kwargs):
@@ -70,7 +71,7 @@ def fit(self, *args, **kwargs):

 @pytest.fixture(scope="session")
 @patch("fastai.text.learner.text_classifier_learner", Mock_Learner)
-def trained_pipeline(app, tmp_path_factory):
+def _trained_pipeline(app, tmp_path_factory):
     app.config["CLASSIFIER_BASE_PATH"] = tmp_path_factory.getbasetemp()
     create_directories()
     shutil.copy(