Merge pull request #457 from umccr/feat/filemanager-api-matching
feat: filemanager API wildcards
mmalenic authored Aug 9, 2024
2 parents 4cced9a + bce0135 commit 71ac639
Showing 14 changed files with 1,556 additions and 211 deletions.
5 changes: 3 additions & 2 deletions lib/workload/stateless/stacks/filemanager/README.md
@@ -86,7 +86,7 @@ Alternatively, just `brew install dbeaver-community` to easily browse the databa

## Local API server

To use the local API server, run:
For more details on the filemanager API, see the [`API_GUIDE.md`][api-guide]. To use the local API server, run:

```sh
make api
@@ -125,7 +125,7 @@ docker system prune -a --volumes
## Architecture

The filemanager ingest functionality operates to ensure eventual consistency in the database records. See the
[ARCHITECTURE.md][architecture] for more details.
[`ARCHITECTURE.md`][architecture] for more details.

## Project Layout

@@ -142,6 +142,7 @@ The project is divided into multiple crates that serve different functionality.
* [database]: Database migration files and queries.

[architecture]: docs/ARCHITECTURE.md
[api-guide]: docs/API_GUIDE.md
[filemanager]: filemanager
[filemanager-api-lambda]: filemanager-api-lambda
[filemanager-api-server]: filemanager-api-server
169 changes: 169 additions & 0 deletions lib/workload/stateless/stacks/filemanager/docs/API_GUIDE.md
@@ -0,0 +1,169 @@
# Filemanager API

The filemanager API gives access to S3 object records for all [S3 file events][s3-events] which are recorded in the database.

To start a local API server and view the OpenAPI documentation, run the following:

```sh
make api
```

This serves Swagger OpenAPI docs at `http://localhost:8000/swagger_ui` when using default settings.

The deployed instance of the filemanager API can be reached using the desired stage at `https://file.<stage>.umccr.org`
using the orcabus API token. To retrieve the token, run:

```sh
export TOKEN=$(aws secretsmanager get-secret-value --secret-id orcabus/token-service-jwt --output json --query SecretString | jq -r 'fromjson | .id_token')
```

## Querying records

The API is designed to have a standard set of REST routes which can be used to query for records. The API is versioned with a
`/api/v1` route prefix, and S3 object records can be reached under `/api/v1/s3_objects`.

For example, to query a single record, use the `s3_object_id` in the path, which returns the JSON record:

```sh
curl -H "Authorization: Bearer $TOKEN" "https://file.dev.umccr.org/api/v1/s3_objects/0190465f-68fa-76e4-9c36-12bdf1a1571d" | jq
```

Multiple records can be reached using the same route, which returns an array of JSON records:

```sh
curl -H "Authorization: Bearer $TOKEN" "https://file.dev.umccr.org/api/v1/s3_objects" | jq
```

This route is paginated, and by default returns 1000 records from the first page in a JSON list response:

```json
{
  "next_page": 1,
  "results": [
    "<first 1000 s3_object records>..."
  ]
}

Use the `page` and `page_size` query parameters to control the pagination:

```sh
curl -H "Authorization: Bearer $TOKEN" "https://file.dev.umccr.org/api/v1/s3_objects?page=10&page_size=50" | jq
```
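
To read every record rather than a single page, a client can keep requesting pages until the response stops advertising another one. The following is a minimal sketch of that loop in Python, not part of filemanager itself. It assumes that pages are numbered from 0 (inferred from the first response returning `"next_page": 1`) and that the last page has a null or absent `next_page`. `iter_records`, `fetch_page` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the HTTP call so that the example runs offline.

```python
from typing import Any, Callable, Dict, Iterator, Optional

# A page fetcher takes (page, page_size) and returns the decoded JSON body,
# e.g. {"next_page": 1, "results": [...]}.
PageFetcher = Callable[[int, int], Dict[str, Any]]


def iter_records(
    fetch_page: PageFetcher, page_size: int = 1000, start_page: int = 0
) -> Iterator[Any]:
    """Yield every record by following `next_page` until it is missing or null.

    Assumption: the last page signals the end with a null or absent `next_page`.
    """
    page: Optional[int] = start_page
    while page is not None:
        body = fetch_page(page, page_size)
        yield from body.get("results", [])
        page = body.get("next_page")


def fake_fetch(page: int, page_size: int) -> Dict[str, Any]:
    """Hypothetical stand-in for a GET on /api/v1/s3_objects?page=..&page_size=.."""
    data = [f"record-{i}" for i in range(5)]
    start = page * page_size
    chunk = data[start:start + page_size]
    has_more = start + page_size < len(data)
    return {"next_page": page + 1 if has_more else None, "results": chunk}


records = list(iter_records(fake_fetch, page_size=2))
print(records)
```

In real use, `fetch_page` would wrap an authenticated request with the `page` and `page_size` query parameters shown above.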

Records can be filtered on their own fields by naming the field in a query parameter.
For example, query all records for a certain bucket:

```sh
curl -H "Authorization: Bearer $TOKEN" "https://file.dev.umccr.org/api/v1/s3_objects?bucket=umccr-temp-dev" | jq
```

Since the filemanager database keeps a copy of all S3 events that it receives, old records for deleted objects
are also kept in the database. In order to retrieve only current objects, that is, objects that are still in S3 and
don't have an associated `Deleted` event, use the `current_state` query parameter:

```sh
curl -H "Authorization: Bearer $TOKEN" "https://file.dev.umccr.org/api/v1/s3_objects?current_state=true" | jq
```

### Attributes

The filemanager can save JSON attributes on any record. Attributes can be queried in the same way that record fields
are filtered. The syntax for attribute querying uses square brackets to access nested JSON fields, similar
to the syntax defined by the [qs] npm package. Brackets should be percent-encoded in URLs.

For example, query for a previously set `portal_run_id`:

```sh
curl --get -H "Authorization: Bearer $TOKEN" --data-urlencode "attributes[portal_run_id]=202405212aecb782" \
"https://file.dev.umccr.org/api/v1/s3_objects" | jq
```
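
When a client builds the URL itself instead of relying on `curl --data-urlencode`, the brackets need the same percent-encoding. As a rough illustration, Python's standard `urllib.parse.urlencode` does this for parameter names as well as values. The variable names here are only for the example.

```python
from urllib.parse import urlencode

base_url = "https://file.dev.umccr.org/api/v1/s3_objects"
params = {"attributes[portal_run_id]": "202405212aecb782"}

# urlencode percent-encodes the square brackets in the parameter name:
# "[" becomes %5B and "]" becomes %5D.
query = urlencode(params)
url = f"{base_url}?{query}"
print(url)
```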

> [!NOTE]
> Attributes on filemanager records start empty. They need to be added to the record to query on them later.
> See [updating records](#updating-records).

### Wildcard matching

The API supports using wildcards to match multiple characters in a value for most fields. Use `%` to match multiple characters
and `_` to match one character. These queries get converted to postgres `like` queries under the hood. For example, query
on a key prefix:

```sh
curl --get -H "Authorization: Bearer $TOKEN" --data-urlencode "key=temp\_data%" \
"https://file.dev.umccr.org/api/v1/s3_objects" | jq
```

Case-insensitive wildcard matching, which gets converted to a postgres `ilike` statement, is supported by setting `case_sensitive=false`:

```sh
curl --get -H "Authorization: Bearer $TOKEN" --data-urlencode "key=temp\_data%" \
"https://file.dev.umccr.org/api/v1/s3_objects?case_sensitive=false" | jq
```

Wildcard matching is also supported on attributes:

```sh
curl --get -H "Authorization: Bearer $TOKEN" --data-urlencode "attributes[portal_run_id]=20240521%" \
"https://file.dev.umccr.org/api/v1/s3_objects" | jq
```
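
To make the wildcard semantics concrete, the sketch below mimics how a `like` pattern behaves by translating it to a regular expression. It is an illustration of standard postgres `like`/`ilike` rules with the default backslash escape, not the filemanager's implementation (which delegates to the database). `like_to_regex` and `like` are hypothetical helper names.

```python
import re


def like_to_regex(pattern: str, case_sensitive: bool = True) -> "re.Pattern":
    """Translate a SQL LIKE pattern into an equivalent regex.

    `%` matches any run of characters, `_` matches exactly one character,
    and a backslash makes the next character literal.
    """
    parts = []
    chars = iter(pattern)
    for ch in chars:
        if ch == "\\":
            # Escaped character: consume the next one and match it literally.
            escaped = next(chars, None)
            if escaped is None:
                raise ValueError("LIKE pattern must not end with the escape character")
            parts.append(re.escape(escaped))
        elif ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    flags = re.DOTALL | (0 if case_sensitive else re.IGNORECASE)
    return re.compile("".join(parts), flags)


def like(value: str, pattern: str, case_sensitive: bool = True) -> bool:
    """Return True if the whole value matches the LIKE pattern."""
    return like_to_regex(pattern, case_sensitive).fullmatch(value) is not None


# `\_` only matches a literal underscore, while a bare `_` matches any character.
print(like("temp_data/file.txt", "temp\\_data%"))
print(like("tempXdata/file.txt", "temp\\_data%"))
print(like("tempXdata/file.txt", "temp_data%"))
print(like("TEMP_DATA/file.txt", "temp\\_data%", case_sensitive=False))
```

This is why the examples above escape the underscore in `temp\_data%`: without the backslash, the pattern would also match keys such as `tempXdata/...`.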

## Updating records

As part of allowing filemanager to link and query on attributes, attributes can be updated using PATCH requests.
Each of the above endpoints and queries supports a PATCH request which can be used to update attributes on a set
of records, instead of listing records. All query parameters except pagination are supported for updates.
Attributes are updated using [JSON patch][json-patch].

For example, update attributes on a single record:

```sh
curl -X PATCH -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
--data '{ "attributes": [ { "op": "add", "path": "/portal_run_id", "value": "202405212aecb782" } ] }' \
"https://file.dev.umccr.org/api/v1/s3_objects/0190465f-68fa-76e4-9c36-12bdf1a1571d" | jq
```

Or, update attributes for multiple records whose keys contain the same value, using wildcards (`%25` is a percent-encoded `%`):

```sh
curl -X PATCH -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
--data '{ "attributes": [ { "op": "add", "path": "/portal_run_id", "value": "202405212aecb782" } ] }' \
"https://file.dev.umccr.org/api/v1/s3_objects?key=%25202405212aecb782%25" | jq
```
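
The effect of the `add` operation used in these requests can be pictured with a small model of JSON patch. This is a simplified sketch of the `add` semantics from RFC 6902, restricted to object members. It is not how filemanager applies patches, and `apply_add_ops` is a hypothetical name.

```python
import copy
from typing import Any, Dict, List


def apply_add_ops(attributes: Dict[str, Any], patch: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Apply JSON patch `add` operations to a copy of `attributes`.

    Simplification: only `add` on object members is handled. Arrays, the
    other operations, and JSON Pointer escapes (`~0`, `~1`) are not.
    """
    result = copy.deepcopy(attributes)
    for op in patch:
        if op["op"] != "add":
            raise ValueError(f"unsupported op: {op['op']}")
        path = op["path"]
        if not path.startswith("/"):
            raise ValueError(f"invalid path: {path}")
        *parents, leaf = path[1:].split("/")
        target = result
        for name in parents:
            # Per RFC 6902, the parent of the added member must already exist.
            target = target[name]
        # `add` creates the member, or replaces it if it is already present.
        target[leaf] = op["value"]
    return result


# The same patch document that is sent under the "attributes" key above.
patch = [{"op": "add", "path": "/portal_run_id", "value": "202405212aecb782"}]
updated = apply_add_ops({}, patch)
print(updated)
```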

## Count objects

There is an API route which counts the total number of records in the database, and which supports
similar query parameters to the regular list operations.

For example, count the total records:

```sh
curl -H "Authorization: Bearer $TOKEN" "https://file.dev.umccr.org/api/v1/s3_objects/count" | jq
```

## The `objects` record

There is a similar record kept in the filemanager database called `object`. A similar REST API
is available for these records under `/api/v1/objects`, however `object` records currently don't serve
a purpose in filemanager. They were initially included to support attribute linking, however they will likely
be removed because attribute linking can be accomplished using the `attributes` column on `s3_object`.

## Some missing features

There are some missing features in the query API which are planned, namely:

* There is no way to compare values with `>`, `>=`, `<`, `<=`.
* There is no way to express `and` or `or` conditions in the API.

There are also some features missing for attribute linking. For example, there is no way
to capture matching wildcard groups which can later be used in the JSON patch body.

There is also no way to POST an attribute linking rule, which can be used to update S3 records
as they are received by filemanager. See [ATTRIBUTE_LINKING.md][attribute-linking] for a discussion on some approaches
for this. The likely solution will involve merging the above wildcard matching logic with attribute rules.

[json-patch]: https://jsonpatch.com/
[qs]: https://github.com/ljharb/qs
[s3-events]: https://docs.aws.amazon.com/AmazonS3/latest/userguide/EventNotifications.html
[attribute-linking]: ATTRIBUTE_LINKING.md
145 changes: 145 additions & 0 deletions lib/workload/stateless/stacks/filemanager/docs/ATTRIBUTE_LINKING.md
@@ -0,0 +1,145 @@
> [!IMPORTANT]
> This document is a discussion on potential designs for filemanager attribute linking. The current implementation
> isn't exactly like this, although it will probably contain components of the design here.

# FileManager Attribute Linking

The filemanager needs to be able to store data from other microservices on `s3_object` records in order to perform some
logic, e.g. querying on the stored data. Ideally the filemanager should only deal with object records, without having
to know the domain of other microservices. This means that the filemanager needs a mechanism to store arbitrary
data, and be told what data to store for a given record.

## Storing attributes

The first part of the problem, i.e. storing data on records, is solved by using the `attributes` column on `s3_object`.
The `attributes` column can store arbitrary JSON data, which can be queried using postgres JSON operators.
For example, fetching all objects whose attributes contain a given `attribute_id` looks like:

```sql
select * from s3_object where attributes @> '{ "attribute_id": "some_id" }';
```

From the API a nested style syntax can be used:

```
/api/v1/s3_objects?attributes[attribute_id]=some_id
```
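
The relationship between the nested query parameter and the SQL containment operand can be sketched as follows. This is only an illustration of the bracket convention, assuming a qs-style parse where each bracketed name adds one level of nesting. `nest_bracket_param` is a hypothetical helper and does not reflect the filemanager's actual parser.

```python
import json
import re
from typing import Any, Dict


def nest_bracket_param(name: str, value: str) -> Dict[str, Any]:
    """Turn `a[b][c]` and a value into {"a": {"b": {"c": value}}}."""
    # Split "attributes[attribute_id]" into ["attributes", "attribute_id"].
    keys = [k for k in re.split(r"[\[\]]+", name) if k]
    nested: Any = value
    for key in reversed(keys):
        nested = {key: nested}
    return nested


parsed = nest_bracket_param("attributes[attribute_id]", "some_id")
# The inner object is the right-hand operand of the `@>` containment query.
operand = json.dumps(parsed["attributes"])
print(operand)
```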

Attributes can be used by other services to perform logic, e.g. the UI could fetch all objects where
the `portal_run_id` equals `<some_id>`.

## Knowing what attributes to store

The filemanager needs to be told by other services what attributes to store. However, it only receives native S3 events
as input which cannot be edited.

A generalisable way to accomplish this is to allow the filemanager to accept a set of 'rules', which given
an S3 event as input, determine the attribute. For example, a rule could be:

* For all events where the bucket equals 'umccr-temp-dev', and the key starts with a prefix 'analysis_data/.../.../.../...'
extract the 4th path segment and add the attribute: `{ "portal_run_id": "<4th path segment>" }`.
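
Written directly as code, the example rule above might look like the sketch below. The segment numbering is an assumption: segments are counted from 1 and include the `analysis_data` prefix, and the key must have at least one further segment after the fourth. `portal_run_id_attribute` and the sample key are hypothetical.

```python
from typing import Dict, Optional


def portal_run_id_attribute(bucket: str, key: str) -> Optional[Dict[str, str]]:
    """Return the attribute for a matching event, or None if the rule does not apply."""
    segments = key.split("/")
    # Assumption: segments are counted from 1 and include the `analysis_data`
    # prefix, and at least one more segment follows the fourth one.
    if bucket != "umccr-temp-dev" or segments[0] != "analysis_data" or len(segments) < 5:
        return None
    return {"portal_run_id": segments[3]}


# Hypothetical key, used only to exercise the sketch.
attribute = portal_run_id_attribute(
    "umccr-temp-dev", "analysis_data/a/b/202405212aecb782/out.bam"
)
print(attribute)
```

Hard-coding each rule like this does not generalise, which is the motivation for expressing rules as data.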

### Rules engine

The microservice which knows about the rule could tell the filemanager about it. The rules could be published on the
event bus and use a JSON rules engine, similar to the way EventBridge rules are parsed.

For example, the workflow manager could tell filemanager a rule about matching buckets/keys using the following event message:

```json
{
  "detail-type": [
    "FileManagerAttributeRule"
  ],
  "source": [
    "orcabus.workflowmanager"
  ],
  "detail": {
    "rule": {
      "bucket": "umccr-temp-dev",
      "key": "some_prefix/(*)/*"
    },
    "apply": {
      "some_attribute_id": "<first_wildcard_capture_group>"
    },
    "start_from": "<apply_to_events_after_this_date>",
    "expires": "<date_where_rule_no_longer_applies>"
  }
}
```

This rule would match all S3 events that have 'umccr-temp-dev' as the bucket, and keys that start with 'some_prefix/'
followed by a regex-style capture group. The rule only applies to events received between `start_from` and `expires`.

Filemanager would store this rule, and check existing rules for each S3 event to see if it needs to add
attributes. If the rule is received by filemanager after an event has already fired, that's okay, the filemanager can
apply the rule retroactively to its database records.
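
One possible shape for the rule check is sketched below. Since this design is still under discussion, everything here is an assumption: `*` is taken to match any run of characters, `(*)` is the same but captured as briefly as possible, the dates are ISO 8601 strings, and every attribute in `apply` receives the first captured group. `glob_to_regex` and `apply_rule` are hypothetical names, and the dates in the sample rule are made up.

```python
import re
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def glob_to_regex(pattern: str) -> "re.Pattern":
    """Convert a rule key such as `some_prefix/(*)/*` into a regex.

    Assumption: `*` matches any run of characters, `(*)` is the same but
    captured, and every other character is literal.
    """
    parts = []
    for token in re.split(r"(\(\*\)|\*)", pattern):
        if token == "(*)":
            parts.append("(.*?)")
        elif token == "*":
            parts.append(".*?")
        else:
            parts.append(re.escape(token))
    return re.compile("".join(parts), re.DOTALL)


def apply_rule(
    rule: Dict[str, Any], bucket: str, key: str, event_time: datetime
) -> Optional[Dict[str, str]]:
    """Return the attributes to add for an event, or None if the rule does not match."""
    detail = rule["detail"]
    start = datetime.fromisoformat(detail["start_from"])
    expires = datetime.fromisoformat(detail["expires"])
    if not (start <= event_time < expires):
        return None
    if bucket != detail["rule"]["bucket"]:
        return None
    match = glob_to_regex(detail["rule"]["key"]).fullmatch(key)
    if match is None:
        return None
    # Assumption: every attribute in `apply` takes the first captured group.
    return {name: match.group(1) for name in detail["apply"]}


rule = {
    "detail": {
        "rule": {"bucket": "umccr-temp-dev", "key": "some_prefix/(*)/*"},
        "apply": {"some_attribute_id": "<first_wildcard_capture_group>"},
        # Hypothetical dates, chosen only for the example.
        "start_from": "2024-01-01T00:00:00+00:00",
        "expires": "2025-01-01T00:00:00+00:00",
    }
}
event_time = datetime(2024, 8, 9, tzinfo=timezone.utc)
attributes = apply_rule(rule, "umccr-temp-dev", "some_prefix/run1/a/b.txt", event_time)
print(attributes)
```

Applying a rule retroactively would amount to running the same check over stored records instead of incoming events.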

The advantage of this approach is that it is quite general, and it means that the filemanager doesn't need to know any
details about other microservices' logic/domains. Rules also don't need to be emitted by services to be used. For example,
statically derived attributes that only need information from the S3 event could be initialized into the filemanager
database as it's deployed.

Rules could be updated if required. There could also be different operators that merge attributes in different ways,
e.g. 'append attribute', 'append if not exists', 'update attribute', 'overwrite attribute', etc.

A potential disadvantage is that the rules engine may not be flexible enough to accommodate all attribute requirements.
E.g. it's not possible to execute arbitrary code to compute the attribute.

### Technical challenges

It doesn't seem like there are many implementations of JSON rules engines. In Rust there is [json-rules-engine-rs],
which seems to be based on the javascript [json-rules-engine]. Notably, there is
[Event Ruler][aws-event-ruler] which is a Java library and what AWS EventBridge rules uses. Calling a Java library
from Rust would require some FFI bindings.

An existing library doesn't have to be used. Since the format of S3 events is known in advance,
a simpler approach would probably involve implementing the rules manually in filemanager, leveraging something like [serde_json].

[json-rules-engine-rs]: https://github.com/GopherJ/json-rules-engine-rs
[json-rules-engine]: https://github.com/CacheControl/json-rules-engine
[aws-event-ruler]: https://github.com/aws/event-ruler
[serde_json]: https://github.com/serde-rs/json

### Architecture

The architecture of this approach could look something like this, where each service emits rules for the filemanager to
consume:

![filemanager_attribute_linking](./filemanager_attributes.drawio.svg)

Here the filemanager stores rules in its database and processes them directly.

Alternatively, the linking logic could be a separate microservice (FileManagerAttributeManager? ThePreFileManagerManager?):

![filemanager_attribute_linking_service](./filemanager_attributes_alt.drawio.svg)

Here the filemanager ingests events that contain additional attributes from another SQS queue, and the
attribute manager consumes events from the S3 queue. In order to update existing records, the filemanager could
accept a POST request to update a set of records that the attribute manager knows about.

An advantage of this approach is that it can use different languages, which would be useful if using rules libraries like
Event Ruler.

The disadvantage is that it adds more complexity, and more latency in the S3 event processing, because now
the filemanager is no longer directly consuming S3 events.

## Alternatives

Instead of microservices pushing rules into the event bus, the filemanager could query the microservices to decide what
to do with the events. However, this adds many API calls if the filemanager has to query on a per-event basis.

Instead of reading/parsing rules in JSON, there could be a filemanager extension/plugin system which runs on each S3
event to determine attributes. This could be separate from the filemanager code, and would work well for
statically derived attributes. However, it may also introduce many API calls if the filemanager has to query other microservices
on a per-event basis.

A combination of these approaches is also possible, where there is some rule matching, and an extension which can
query other microservices/perform complex logic on the matched events only.

## Questions

1. Is a rule-based regex-style approach enough to cover all use-cases for generating attributes, or does more complicated
logic need to happen?
2. Are expiry/start dates for rules flexible enough to deal with changes in the rules over time?

@@ -1,10 +1,10 @@
# filemanager-api-server

An instance of the filemanager api which can be launched as a webserver. The default address which the webserver uses
is `localhost:8080`. Set the `FILEMANAGER_API_SERVER_ADDR` environment variable to change this. To run the local server:
is `0.0.0.0:8000`. Set the `FILEMANAGER_API_SERVER_ADDR` environment variable to change this. To run the local server:

```sh
make api
```

Then, check out the OpenAPI docs at: `http://localhost:8080/swagger_ui`.
Then, check out the OpenAPI docs at: `http://localhost:8000/swagger_ui`.
@@ -37,7 +37,7 @@ pub enum Error {
QueryError(String),
#[error("invalid input: `{0}`")]
InvalidQuery(String),
#[error("expected some value for id: `{0}`")]
#[error("expected record for id: `{0}`")]
ExpectedSomeValue(Uuid),
}
