[CAI-118] Presidio #1183

Merged on Oct 11, 2024 (43 commits)
Commits
9e43b53
Update webapp
mdciri Aug 8, 2024
851a1a7
Update modules
mdciri Aug 8, 2024
8dfe310
Add handler
mdciri Aug 8, 2024
64f7070
Add presidio
mdciri Aug 8, 2024
9723cb6
Update config
mdciri Aug 8, 2024
e9a24f1
Update poetry
mdciri Aug 8, 2024
d96da8b
Merge remote-tracking branch 'origin' into presidio/chatbot/cai-118
mdciri Oct 3, 2024
29b0b52
Update modules
mdciri Oct 3, 2024
7fa36df
Update webapp
mdciri Oct 3, 2024
a0c958d
Update poetry files
mdciri Oct 3, 2024
ec224cc
Remove handler script
mdciri Oct 3, 2024
3069e21
Update config
mdciri Oct 3, 2024
088b303
Update modules
mdciri Oct 3, 2024
980a25b
Update poetry files
mdciri Oct 3, 2024
92b2e65
Update config
mdciri Oct 3, 2024
fb7975c
Update config
mdciri Oct 3, 2024
fafffb2
Update modules
mdciri Oct 4, 2024
d2f1e44
Update config
mdciri Oct 7, 2024
2720399
Update modules
mdciri Oct 7, 2024
d6bb2ea
Update README
mdciri Oct 8, 2024
41ba85a
Update README
mdciri Oct 8, 2024
cfb00ab
Update README
mdciri Oct 8, 2024
f2c75e4
Update Redis tunnel bash script
mdciri Oct 8, 2024
10c8cca
Update modules
mdciri Oct 8, 2024
0bf8fce
Update changeset
mdciri Oct 8, 2024
c78a9d5
Update env variables
mdciri Oct 8, 2024
66e043d
Merge branch 'main' into presidio/chatbot/cai-118
christian-calabrese Oct 8, 2024
c79799f
Update env vars example
mdciri Oct 10, 2024
a06814c
Update config
mdciri Oct 10, 2024
0e110e4
Update modules
mdciri Oct 10, 2024
1ea0e67
Update redis tunnel
mdciri Oct 10, 2024
1c8c242
Update modules
mdciri Oct 10, 2024
7c33a96
Update env vars example
mdciri Oct 10, 2024
bba50d0
Update modules
mdciri Oct 10, 2024
902a129
Update config
mdciri Oct 10, 2024
0edc9da
feat: added index_id ssm parameter
christian-calabrese Oct 10, 2024
a23a0cb
fix: efs name
christian-calabrese Oct 10, 2024
44bbf1c
Merge branch 'main' into presidio/chatbot/cai-118
christian-calabrese Oct 10, 2024
f4a10ac
chore: ran terraform fmt
christian-calabrese Oct 10, 2024
92a8643
fix: ssm parameter type
christian-calabrese Oct 10, 2024
59a6fb3
Update webapp
mdciri Oct 10, 2024
4f50684
Update presidio model to medium size
mdciri Oct 11, 2024
028b827
feat: added presidio models caching
christian-calabrese Oct 11, 2024
5 changes: 5 additions & 0 deletions .changeset/long-camels-sell.md
@@ -0,0 +1,5 @@
---
"chatbot": minor
---

"Add Presidio to detect and mask PII entities"
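As a rough illustration of what "detect and mask" means here, the sketch below replaces detected spans with `<ENTITY_TYPE>` placeholders. It is plain Python and does not call Presidio: the `mask_pii` helper and the hard-coded spans are hypothetical stand-ins for the analyzer output, which this diff does not show.

```python
from typing import List, Tuple

# (start, end, entity_type): a hypothetical stand-in for analyzer results.
Span = Tuple[int, int, str]


def mask_pii(text: str, spans: List[Span]) -> str:
    """Replace each detected span with a <ENTITY_TYPE> placeholder."""
    masked = text
    # Work from the end of the string so that earlier offsets stay valid.
    for start, end, entity in sorted(spans, reverse=True):
        masked = masked[:start] + f"<{entity}>" + masked[end:]
    return masked


if __name__ == "__main__":
    message = "Mi chiamo Mario Rossi e vivo a Roma"
    spans = [(10, 21, "PERSON"), (31, 35, "LOCATION")]
    print(mask_pii(message, spans))
```

The sketch assumes the spans do not overlap; overlapping detections would need to be merged before masking.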
6 changes: 5 additions & 1 deletion apps/chatbot/.env.example
@@ -4,12 +4,16 @@ PYTHONPATH=app-path
LOG_LEVEL=DEBUG
CHB_AWS_ACCESS_KEY_ID=...
CHB_AWS_SECRET_ACCESS_KEY=...
-CHB_AWS_DEFAULT_REGION=eu-west-3
+CHB_AWS_DEFAULT_REGION=eu-south-1
CHB_AWS_BEDROCK_REGION=eu-west-3
CHB_AWS_S3_BUCKET=...
CHB_AWS_GUARDRAIL_ID=...
CHB_AWS_GUARDRAIL_VERSION=...
CHB_REDIS_URL=...
CHB_WEBSITE_URL=...
CHB_REDIS_INDEX_NAME=...
CHB_LLAMAINDEX_INDEX_ID=...
CHB_DOCUMENTATION_DIR=...
CHB_GOOGLE_API_KEY=...
CHB_PROVIDER=...
CHB_MODEL_ID=...
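The example file now keeps the default AWS region and the Bedrock region in separate variables. A minimal sketch of reading both at startup follows; the `read_regions` helper is hypothetical, and treating both variables as required is an assumption rather than something this diff confirms.

```python
import os
from typing import Dict, Mapping, Optional

# Variable names come from .env.example; treating them as required is an assumption.
REGION_VARS = ("CHB_AWS_DEFAULT_REGION", "CHB_AWS_BEDROCK_REGION")


def read_regions(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return the default and Bedrock regions, failing fast if one is unset."""
    env = os.environ if env is None else env
    missing = [name for name in REGION_VARS if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {
        "default": env["CHB_AWS_DEFAULT_REGION"],
        "bedrock": env["CHB_AWS_BEDROCK_REGION"],
    }


if __name__ == "__main__":
    # Values taken from .env.example; a real run would read os.environ instead.
    example = {
        "CHB_AWS_DEFAULT_REGION": "eu-south-1",
        "CHB_AWS_BEDROCK_REGION": "eu-west-3",
    }
    print(read_regions(example))
```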
40 changes: 11 additions & 29 deletions apps/chatbot/README.md
@@ -1,10 +1,15 @@
# PagoPA Chatbot

-This folder contains all the details to build a RAG using the documentation provided in [`PagoPA Developer Portal`](https://developer.pagopa.it/). The retriever chosen is the `Auto Merging Retriever`, implemented using [`llama-index`](https://docs.llamaindex.ai/en/stable/). Check out `src/modules/retriever.py`.
+This folder contains all the details to build a RAG using the documentation provided in [`PagoPA Developer Portal`](https://developer.pagopa.it/).

-This chatbot uses [`AWS Bedrock`](https://aws.amazon.com/bedrock/) as provider, so be sure to have installed [`aws-cli`](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) and stored your credentials in `~/.aws/credentials`.
+This chatbot uses [Google](https://ai.google.dev/) or [AWS Bedrock](https://aws.amazon.com/bedrock/) as provider.
+Even when Google is the provider, its API key is stored in AWS, so make sure that [aws-cli](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) is installed and that your credentials are stored in `~/.aws/credentials`.

-All the parameters and prompts used to build the Retrieval-Augmented Generation (RAG) are available in `config`.
+The Retrieval-Augmented Generation (RAG) was implemented using [llama-index](https://docs.llamaindex.ai/en/stable/). All the parameters and prompts used are stored in `config`.

## Environment Variables

Create a `.env` file inside this folder and store the environment variables listed in `.env.example`.

## Virtual environment

@@ -27,40 +32,17 @@ The working directory is `/developer-portal/apps/chatbot`. So, to set the `PYTHO

In this way, `PYTHONPATH` points to where the Python packages and modules are, not where your checkouts are.

-## File for Environment Variables
-
-Create a `.env` file inside the folder and write to the file the following environment variables:
-
-CHB_AWS_ACCESS_KEY_ID=...
-CHB_AWS_SECRET_ACCESS_KEY=...
-CHB_AWS_DEFAULT_REGION=...
-CHB_AWS_S3_BUCKET=...
-CHB_AWS_GUARDRAIL_ID=...
-CHB_AWS_GUARDRAIL_VERSION=...
-CHB_REDIS_URL=...
-CHB_REDIS_INDEX_NAME=...
-CHB_WEBSITE_URL=...
-CHB_GOOGLE_API_KEY=...
-CHB_PROVIDER=...
-CHB_MODEL_ID=...
-CHB_MODEL_TEMPERATURE=...
-CHB_MODEL_MAXTOKENS=...
-CHB_EMBED_MODEL_ID=...
-CHB_ENGINE_SIMILARITY_TOPK=...
-CHB_ENGINE_SIMILARITY_CUTOFF=...
-CHB_ENGINE_USE_ASYNC=...
-CHB_ENGINE_USE_STREAMING=...
-
-## Knowledge vector database
+## Knowledge index vector database

To reach the remote Redis instance, it is necessary to open a tunnel:

```
./scripts/redis-tunnel.sh
```

Verify that the HTML files that compose the Developer Portal documentation exist in a directory; otherwise, create the documentation. Once the documentation directory is ready, put its path in `params` and then create the vector index by running:

```
python src/modules/create_vector_index.py --params config/params.yaml
```

52 changes: 50 additions & 2 deletions apps/chatbot/config/params.yaml
@@ -5,9 +5,57 @@ vector_index:
  path: index
  chunk_sizes: [2816, 704, 176]
  chunk_overlap: 20
  use_redis: True
  use_s3: False

engine:
  response_mode: compact
  verbose: False

config_presidio:
  nlp_engine_name: spacy
  models:
    -
      lang_code: en
      model_name: en_core_web_md
    -
      lang_code: it
      model_name: it_core_news_md
    # -
    #   lang_code: de
    #   model_name: de_core_news_md
    # -
    #   lang_code: es
    #   model_name: es_core_news_md
    # -
    #   lang_code: fr
    #   model_name: fr_core_news_md
  ner_model_configuration:
    labels_to_ignore:
      - ORDINAL
      - QUANTITY
      - ORGANIZATION
      - ORG
      - LANGUAGE
      - PRODUCT
      - MONEY
      - PERCENT
      - O
      - CARDINAL
      - EVENT
      - WORK_OF_ART
      - LAW
      - MISC
    model_to_presidio_entity_mapping:
      PER: PERSON
      PERSON: PERSON
      LOC: LOCATION
      LOCATION: LOCATION
      GPE: LOCATION
      ORG: ORGANIZATION
      DATE: DATE_TIME
      TIME: DATE_TIME
      NORP: NRP
    low_confidence_score_multiplier: 0.4
    low_score_entity_names:
      - ORGANIZATION
      - ORG
    default_score: 0.8
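To make the `ner_model_configuration` settings above more concrete, here is a simplified sketch of how such settings are typically applied to the labels produced by an NER model. It is plain Python, not Presidio's actual code: the `to_presidio_entities` helper is hypothetical, and passing unmapped labels through unchanged is an assumption.

```python
from typing import List, Tuple

# Values copied from the config_presidio section of config/params.yaml.
LABELS_TO_IGNORE = {
    "ORDINAL", "QUANTITY", "ORGANIZATION", "ORG", "LANGUAGE", "PRODUCT", "MONEY",
    "PERCENT", "O", "CARDINAL", "EVENT", "WORK_OF_ART", "LAW", "MISC",
}
ENTITY_MAPPING = {
    "PER": "PERSON", "PERSON": "PERSON", "LOC": "LOCATION", "LOCATION": "LOCATION",
    "GPE": "LOCATION", "ORG": "ORGANIZATION", "DATE": "DATE_TIME",
    "TIME": "DATE_TIME", "NORP": "NRP",
}
LOW_SCORE_ENTITY_NAMES = {"ORGANIZATION", "ORG"}
LOW_CONFIDENCE_SCORE_MULTIPLIER = 0.4
DEFAULT_SCORE = 0.8


def to_presidio_entities(ner_spans: List[Tuple[str, str]]) -> List[Tuple[str, str, float]]:
    """Map (text, model_label) pairs to (text, presidio_entity, score) triples."""
    results = []
    for text, label in ner_spans:
        if label in LABELS_TO_IGNORE:
            continue  # ignored labels never reach the recognizers
        # Assumption: labels without a mapping are passed through unchanged.
        entity = ENTITY_MAPPING.get(label, label)
        score = DEFAULT_SCORE
        if entity in LOW_SCORE_ENTITY_NAMES:
            score *= LOW_CONFIDENCE_SCORE_MULTIPLIER
        results.append((text, entity, score))
    return results


if __name__ == "__main__":
    spans = [("Mario Rossi", "PER"), ("Roma", "GPE"), ("PagoPA", "ORG"), ("primo", "ORDINAL")]
    for triple in to_presidio_entities(spans):
        print(triple)
```

With the values in this file, `ORG` and `ORGANIZATION` appear in `labels_to_ignore`, so under this reading organizations are dropped before the low-confidence multiplier could apply to them.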
2 changes: 1 addition & 1 deletion apps/chatbot/config/prompts.yaml
@@ -1,6 +1,6 @@
qa_prompt_str: |
You are a customer service chatbot.
-Your name is Discovery and your duty is to assist the user with the PagoPA DevPortal documentation!
+Your name is Discovery and your duty is to assist the user with the PagoPA DevPortal documentation, homepage: https://dev.developer.pagopa.it!
--------------------
Context information:
{context_str}
2 changes: 2 additions & 0 deletions apps/chatbot/docker/app.Dockerfile
@@ -14,4 +14,6 @@ RUN poetry install

COPY . ${LAMBDA_TASK_ROOT}
RUN python ./scripts/nltk_download.py
RUN python ./scripts/spacy_download.py

CMD ["src.app.main.handler"]
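The contents of `scripts/spacy_download.py` are not part of this diff. A minimal sketch of what such a script might do, assuming it fetches the models listed under `config_presidio.models` through spaCy's `python -m spacy download` command, could look like the following. The helper names and the hard-coded model list are hypothetical.

```python
import subprocess
import sys
from typing import Callable, List

# Model names taken from config/params.yaml; hard-coding them here is an assumption.
MODELS = ["en_core_web_md", "it_core_news_md"]


def build_download_commands(models: List[str]) -> List[List[str]]:
    """Build one `python -m spacy download <model>` command per model."""
    return [[sys.executable, "-m", "spacy", "download", name] for name in models]


def download_models(models: List[str], run: Callable = subprocess.run) -> None:
    """Run each download command, stopping at the first failure."""
    for command in build_download_commands(models):
        run(command, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them; call download_models(MODELS)
    # to perform the downloads (requires spaCy and network access).
    for command in build_download_commands(MODELS):
        print(" ".join(command))
```

Downloading at image build time, as the `RUN` line does, avoids fetching the models on every cold start.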