Merge pull request #457 from umccr/feat/filemanager-api-matching
feat: filemanager API wildcards
Showing 14 changed files with 1,556 additions and 211 deletions.
169 changes: 169 additions & 0 deletions
lib/workload/stateless/stacks/filemanager/docs/API_GUIDE.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,169 @@ | ||
# Filemanager API | ||
|
||
The filemanager API gives access to S3 object records for all [S3 file events][s3-events] which are recorded in the database. | ||
|
||
To start a local API server and view the OpenAPI documentation, run the following: | ||
|
||
```sh | ||
make api | ||
``` | ||
|
||
This serves Swagger OpenAPI docs at `http://localhost:8000/swagger_ui` when using default settings. | ||
|
||
The deployed instance of the filemanager API can be reached at `https://file.<stage>.umccr.org` for the desired stage, | ||
authenticating with the orcabus API token. To retrieve the token, run: | ||
|
||
```sh | ||
export TOKEN=$(aws secretsmanager get-secret-value --secret-id orcabus/token-service-jwt --output json --query SecretString | jq -r 'fromjson | .id_token') | ||
``` | ||
|
||
## Querying records | ||
|
||
The API exposes a standard set of REST routes which can be used to query for records. The API is versioned with a | ||
`/api/v1` route prefix, and S3 object records can be reached under `/api/v1/s3_objects`. | ||
|
||
For example, to query a single record, use the `s3_object_id` in the path, which returns the JSON record: | ||
|
||
```sh | ||
curl -H "Authorization: Bearer $TOKEN" "https://file.dev.umccr.org/api/v1/s3_objects/0190465f-68fa-76e4-9c36-12bdf1a1571d" | jq | ||
``` | ||
|
||
Multiple records can be listed using the same route without an `s3_object_id`, which returns a list of JSON records: | ||
|
||
```sh | ||
curl -H "Authorization: Bearer $TOKEN" "https://file.dev.umccr.org/api/v1/s3_objects" | jq | ||
``` | ||
|
||
This route is paginated, and by default returns 1000 records from the first page in a JSON list response: | ||
|
||
```json | ||
{ | ||
"next_page": 1, | ||
"results": [ | ||
"<first 1000 s3_object records>..." | ||
] | ||
} | ||
``` | ||
|
||
Use the `page` and `page_size` query parameters to control the pagination: | ||
|
||
```sh | ||
curl -H "Authorization: Bearer $TOKEN" "https://file.dev.umccr.org/api/v1/s3_objects?page=10&page_size=50" | jq | ||
``` | ||
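As a rough sketch of how a client might walk through every page, the Python loop below keeps requesting pages until `next_page` is missing. It rests on two assumptions that this guide does not confirm: that the first page is numbered `0` (inferred from `"next_page": 1` in the response above) and that `next_page` is `null` on the last page. `iter_records`, `fetch_page` and `fake_fetch` are hypothetical names; `fetch_page` stands in for an authenticated GET request.

```python
from typing import Any, Callable, Dict, Iterator, Optional


def iter_records(
    fetch_page: Callable[[int, int], Dict[str, Any]],
    page_size: int = 1000,
) -> Iterator[Any]:
    """Yield records from every page of a paginated list response.

    `fetch_page(page, page_size)` is a hypothetical helper that performs the
    GET request and returns the decoded JSON body with `results` and
    `next_page` keys. Assumption: `next_page` is null once pages run out.
    """
    page: Optional[int] = 0
    while page is not None:
        body = fetch_page(page, page_size)
        yield from body["results"]
        page = body.get("next_page")


def fake_fetch(page: int, page_size: int) -> Dict[str, Any]:
    """In-memory stand-in for the API, used only to demonstrate the loop."""
    data = list(range(5))
    start = page * page_size
    chunk = data[start:start + page_size]
    has_more = start + page_size < len(data)
    return {"next_page": page + 1 if has_more else None, "results": chunk}


records = list(iter_records(fake_fetch, page_size=2))
```

Passing the fetch function in as an argument keeps the paging logic separate from the HTTP client and the token handling.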
|
||
Records can be filtered on their own fields by naming the field in a query parameter. | ||
For example, query all records for a certain bucket: | ||
|
||
```sh | ||
curl -H "Authorization: Bearer $TOKEN" "https://file.dev.umccr.org/api/v1/s3_objects?bucket=umccr-temp-dev" | jq | ||
``` | ||
|
||
Since the filemanager database keeps a copy of all S3 events that it receives, old records for deleted objects | ||
are also kept in the database. In order to retrieve only current objects, that is, objects that are still in S3 and | ||
don't have an associated `Deleted` event, use the `current_state` query parameter: | ||
|
||
```sh | ||
curl -H "Authorization: Bearer $TOKEN" "https://file.dev.umccr.org/api/v1/s3_objects?current_state=true" | jq | ||
``` | ||
|
||
### Attributes | ||
|
||
The filemanager can save JSON attributes on any record. Attributes can be queried in much the same way as | ||
record fields. The syntax for attribute querying uses square brackets to access nested JSON fields, similar | ||
to the syntax defined by the [qs] npm package. Brackets should be percent-encoded in URLs. | ||
|
||
For example, query for a previously set `portal_run_id`: | ||
|
||
```sh | ||
curl --get -H "Authorization: Bearer $TOKEN" --data-urlencode "attributes[portal_run_id]=202405212aecb782" \ | ||
"https://file.dev.umccr.org/api/v1/s3_objects" | jq | ||
``` | ||
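To make the percent-encoding of brackets concrete, here is a minimal Python sketch that builds the same query URL with the standard library. `build_query_url` is a hypothetical helper, not part of filemanager; the URL and attribute value are taken from the example above.

```python
from urllib.parse import urlencode

BASE_URL = "https://file.dev.umccr.org/api/v1/s3_objects"


def build_query_url(base_url: str, params: dict) -> str:
    """Append percent-encoded query parameters to a base URL."""
    return f"{base_url}?{urlencode(params)}"


# `[` and `]` in the parameter name are encoded as %5B and %5D.
url = build_query_url(BASE_URL, {"attributes[portal_run_id]": "202405212aecb782"})
print(url)
```

This is the same encoding that `curl --get --data-urlencode` applies to the parameter.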
|
||
> [!NOTE] | ||
> Attributes on filemanager records start empty. They need to be added to the record to query on them later. | ||
> See [updating records](#updating-records). | ||
|
||
### Wildcard matching | ||
|
||
The API supports using wildcards to match multiple characters in a value for most fields. Use `%` to match any number of characters | ||
and `_` to match one character. These queries get converted to postgres `like` queries under the hood. For example, query | ||
on a key prefix: | ||
|
||
```sh | ||
curl --get -H "Authorization: Bearer $TOKEN" --data-urlencode "key=temp\_data%" \ | ||
"https://file.dev.umccr.org/api/v1/s3_objects" | jq | ||
``` | ||
|
||
Case-insensitive wildcard matching, which gets converted to a postgres `ilike` statement, is supported by setting `case_sensitive=false`: | ||
|
||
```sh | ||
curl --get -H "Authorization: Bearer $TOKEN" --data-urlencode "key=temp\_data%" \ | ||
"https://file.dev.umccr.org/api/v1/s3_objects?case_sensitive=false" | jq | ||
``` | ||
|
||
Wildcard matching is also supported on attributes: | ||
|
||
```sh | ||
curl --get -H "Authorization: Bearer $TOKEN" --data-urlencode "attributes[portal_run_id]=20240521%" \ | ||
"https://file.dev.umccr.org/api/v1/s3_objects" | jq | ||
``` | ||
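To get a feel for which values a pattern would match before sending a request, the wildcard rules can be approximated locally. The Python sketch below translates a `like`-style pattern into a regular expression. It is only an approximation of the postgres behaviour, assuming a backslash escapes the following character (as in `temp\_data%` above); `like_to_regex` and `like_match` are hypothetical names, not filemanager functions.

```python
import re


def like_to_regex(pattern: str, case_sensitive: bool = True) -> re.Pattern:
    """Translate a LIKE-style pattern into a regular expression.

    `%` matches any run of characters, `_` matches exactly one character,
    and a backslash makes the following character literal.
    """
    parts = []
    chars = iter(pattern)
    for char in chars:
        if char == "\\":
            # Escaped character: treat whatever follows as a literal.
            parts.append(re.escape(next(chars, "\\")))
        elif char == "%":
            parts.append(".*")
        elif char == "_":
            parts.append(".")
        else:
            parts.append(re.escape(char))
    flags = re.DOTALL | (0 if case_sensitive else re.IGNORECASE)
    return re.compile("".join(parts), flags)


def like_match(pattern: str, value: str, case_sensitive: bool = True) -> bool:
    """Return True when the whole value matches the LIKE-style pattern."""
    return like_to_regex(pattern, case_sensitive).fullmatch(value) is not None
```

For instance, `like_match(r"temp\_data%", "temp_data/a.txt")` is true, while the same pattern rejects `tempXdata/a.txt` because the escaped underscore is literal.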
|
||
## Updating records | ||
|
||
As part of allowing filemanager to link and query on attributes, attributes can be updated using PATCH requests. | ||
Each of the above endpoints and queries supports a PATCH request which can be used to update attributes on a set | ||
of records, instead of listing records. All query parameters except pagination are supported for updates. | ||
Attributes are updated using [JSON patch][json-patch]. | ||
|
||
For example, update attributes on a single record: | ||
|
||
```sh | ||
curl -X PATCH -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \ | ||
--data '{ "attributes": [ { "op": "add", "path": "/portal_run_id", "value": "202405212aecb782" } ] }' \ | ||
"https://file.dev.umccr.org/api/v1/s3_objects/0190465f-68fa-76e4-9c36-12bdf1a1571d" | jq | ||
``` | ||
|
||
Or, update attributes for multiple records whose keys contain the same value: | ||
|
||
```sh | ||
curl -X PATCH -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \ | ||
--data '{ "attributes": [ { "op": "add", "path": "/portal_run_id", "value": "202405212aecb782" } ] }' \ | ||
"https://file.dev.umccr.org/api/v1/s3_objects?key=%25202405212aecb782%25" | jq | ||
``` | ||
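To illustrate what the `add` operation in these request bodies does to a record's attributes, here is a minimal Python sketch of JSON Patch `add` restricted to object members. It is not the filemanager implementation, and it assumes that empty attributes behave like an empty JSON object; `apply_add_ops` is a hypothetical name.

```python
import copy
from typing import Any, Dict, List


def apply_add_ops(document: Dict[str, Any], patch: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Apply JSON Patch `add` operations to a copy of a JSON object.

    Only `add` on object members is handled; arrays, other operations and
    escaped path tokens (`~0`, `~1`) are deliberately left out of this sketch.
    """
    result = copy.deepcopy(document)
    for op in patch:
        if op["op"] != "add":
            raise ValueError(f"unsupported operation: {op['op']}")
        # "/a/b" splits into ["", "a", "b"]; drop the empty leading token.
        *parents, leaf = op["path"].split("/")[1:]
        target = result
        for token in parents:
            # As in JSON Patch, the parent of the added member must already exist.
            target = target[token]
        target[leaf] = op["value"]
    return result


patch = [{"op": "add", "path": "/portal_run_id", "value": "202405212aecb782"}]
updated = apply_add_ops({}, patch)
```

After this runs, `updated` holds a `portal_run_id` member, which is the shape the attribute queries in the earlier sections filter on.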
|
||
## Count objects | ||
|
||
There is an API route which counts the total number of records in the database, and which supports | ||
query parameters similar to those of the regular list operations. | ||
|
||
For example, count the total records: | ||
|
||
```sh | ||
curl -H "Authorization: Bearer $TOKEN" "https://file.dev.umccr.org/api/v1/s3_objects/count" | jq | ||
``` | ||
|
||
## The `objects` record | ||
|
||
There is a similar record kept in the filemanager database called `object`. A similar REST API | ||
is available for these records under `/api/v1/objects`, however `object` records currently don't serve | ||
a purpose in filemanager. They were initially included to support attribute linking, however they will likely | ||
be removed because attribute linking can be accomplished using the `attributes` column on `s3_object`. | ||
|
||
## Some missing features | ||
|
||
There are some missing features in the query API which are planned, namely: | ||
|
||
* There is no way to compare values with `>`, `>=`, `<`, `<=`. | ||
* There is no way to express `and` or `or` conditions in the API. | ||
|
||
There are also some features missing for attribute linking. For example, there is no way | ||
to capture matching wildcard groups which can later be used in the JSON patch body. | ||
|
||
There is also no way to POST an attribute linking rule, which can be used to update S3 records | ||
as they are received by filemanager. See [ATTRIBUTE_LINKING.md][attribute-linking] for a discussion on some approaches | ||
for this. The likely solution will involve merging the above wildcard matching logic with attribute rules. | ||
|
||
[json-patch]: https://jsonpatch.com/ | ||
[qs]: https://github.com/ljharb/qs | ||
[s3-events]: https://docs.aws.amazon.com/AmazonS3/latest/userguide/EventNotifications.html | ||
[attribute-linking]: ATTRIBUTE_LINKING.md |
145 changes: 145 additions & 0 deletions
lib/workload/stateless/stacks/filemanager/docs/ATTRIBUTE_LINKING.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,145 @@ | ||
> [!IMPORTANT] | ||
> This document is a discussion on potential designs for filemanager attribute linking. The current implementation | ||
> isn't exactly like this, although it will probably contain components of the design here. | ||
# FileManager Attribute Linking | ||
|
||
The filemanager needs to be able to store data from other microservices on `s3_object` records in order to perform some | ||
logic, e.g. querying on the stored data. Ideally the filemanager should only deal with object records, without having | ||
to know the domain of other microservices. This means that the filemanager needs a mechanism to store arbitrary | ||
data, and be told what data to store for a given record. | ||
|
||
## Storing attributes | ||
|
||
The first part of the problem, i.e. storing data on records, is solved by using the `attributes` column on `s3_object`. | ||
The `attributes` column can store arbitrary JSON data, which can be queried using postgres JSON operators. | ||
For example, fetching all objects where the attributes contain an `attribute_id` looks like: | ||
|
||
```sql | ||
select * from s3_object where attributes @> '{ "attribute_id": "some_id" }'; | ||
``` | ||
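For readers less familiar with the `@>` containment operator, the following Python sketch approximates its behaviour for JSON objects and scalar values. It is illustrative only: postgres applies additional rules to arrays, which are not modelled here, and `contains` is a hypothetical name.

```python
from typing import Any


def contains(document: Any, query: Any) -> bool:
    """Approximate `document @> query` for JSON objects and scalars.

    Arrays are not handled here; PostgreSQL applies additional rules to them.
    """
    if isinstance(query, dict):
        # Every key in the query must be present with a contained value.
        return isinstance(document, dict) and all(
            key in document and contains(document[key], value)
            for key, value in query.items()
        )
    # Python treats True == 1, but JSON booleans and numbers are distinct.
    if isinstance(document, bool) != isinstance(query, bool):
        return False
    return document == query
```

Extra keys in the stored attributes do not prevent a match, so `contains({"attribute_id": "some_id", "other": 1}, {"attribute_id": "some_id"})` is true.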
|
||
From the API a nested style syntax can be used: | ||
|
||
``` | ||
/api/v1/s3_objects?attributes[attribute_id]=some_id | ||
``` | ||
|
||
Attributes can be used by other services to perform logic, e.g. the UI could fetch all objects where | ||
the `portal_run_id` equals `<some_id>`. | ||
|
||
## Knowing what attributes to store | ||
|
||
The filemanager needs to be told by other services what attributes to store. However, it only receives native S3 events | ||
as input which cannot be edited. | ||
|
||
A generalisable way to accomplish this is to allow the filemanager to accept a set of 'rules', which given | ||
an S3 event as input, determine the attribute. For example, a rule could be: | ||
|
||
* For all events where the bucket equals 'umccr-temp-dev', and the key starts with a prefix 'analysis_data/.../.../.../...' | ||
extract the 4th path segment and add the attribute: `{ "portal_run_id": "<4th path segment>" }`. | ||
|
||
### Rules engine | ||
|
||
The microservice which knows about the rule could tell the filemanager about it. The rules could be published on the | ||
event bus and use a JSON rules engine, similar to the way EventBridge rules are parsed. | ||
|
||
For example, the workflow manager could tell filemanager a rule about matching buckets/keys using the following event message: | ||
|
||
```json | ||
{ | ||
"detail-type": [ | ||
"FileManagerAttributeRule" | ||
], | ||
"source": [ | ||
"orcabus.workflowmanager" | ||
], | ||
"detail": { | ||
"rule": { | ||
"bucket": "umccr-temp-dev", | ||
"key": "some_prefix/(*)/*" | ||
}, | ||
"apply": { | ||
"some_attribute_id": "<first_wildcard_capture_group>" | ||
}, | ||
"start_from": "<apply_to_events_after_this_date>", | ||
"expires": "<date_where_rule_no_longer_applies>" | ||
} | ||
} | ||
``` | ||
|
||
This rule would match all S3 events that have 'umccr-temp-dev' as the bucket, and keys under 'some_prefix' where the first | ||
wildcard acts like a regex capture group. The rule only applies to events received between `start_from` and `expires`. | ||
|
||
Filemanager would store this rule, and check existing rules for each S3 event to see if it needs to add | ||
attributes. If the rule is received by filemanager after an event has already fired, that's okay: the filemanager can | ||
apply the rule retroactively to its database records. | ||
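The matching step described above could be sketched as follows. This is not the current implementation, and several details are assumptions made for illustration only: `(*)` captures a single path segment, `*` matches any remaining characters, the placeholder syntax `<capture_N>` stands in for `<first_wildcard_capture_group>`, and the event is a simplified dictionary rather than a real S3 event. `key_pattern_to_regex` and `apply_rule` are hypothetical names.

```python
import re
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def key_pattern_to_regex(pattern: str) -> re.Pattern:
    """Turn a wildcard key pattern such as `some_prefix/(*)/*` into a regex.

    Assumption: `(*)` captures one path segment and `*` matches anything.
    """
    parts = []
    for token in re.split(r"(\(\*\)|\*)", pattern):
        if token == "(*)":
            parts.append("([^/]+)")
        elif token == "*":
            parts.append(".*")
        else:
            parts.append(re.escape(token))
    return re.compile("".join(parts))


def apply_rule(rule: Dict[str, Any], event: Dict[str, Any]) -> Optional[Dict[str, str]]:
    """Return the attributes a rule produces for an event, or None if no match.

    `rule` mirrors the `detail` object above; `event` is a simplified,
    hypothetical S3 event with `bucket`, `key` and a timezone-aware `time`.
    """
    if not rule["start_from"] <= event["time"] < rule["expires"]:
        return None
    if event["bucket"] != rule["rule"]["bucket"]:
        return None
    match = key_pattern_to_regex(rule["rule"]["key"]).fullmatch(event["key"])
    if match is None:
        return None
    # Hypothetical `<capture_N>` placeholders are replaced by the Nth capture.
    return {
        name: re.sub(r"<capture_(\d+)>", lambda m: match.group(int(m.group(1))), template)
        for name, template in rule["apply"].items()
    }


rule = {
    "rule": {"bucket": "umccr-temp-dev", "key": "some_prefix/(*)/*"},
    "apply": {"some_attribute_id": "<capture_1>"},
    "start_from": datetime(2024, 1, 1, tzinfo=timezone.utc),
    "expires": datetime(2025, 1, 1, tzinfo=timezone.utc),
}
event = {
    "bucket": "umccr-temp-dev",
    "key": "some_prefix/202405212aecb782/output.txt",
    "time": datetime(2024, 6, 1, tzinfo=timezone.utc),
}
attributes = apply_rule(rule, event)
```

Because the function takes only a stored rule and an event, the same code path could serve both newly received events and retroactive application to existing records.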
|
||
The advantage of this approach is that it is quite general, and it means that the filemanager doesn't need to know any | ||
details about other microservices' logic/domains. Rules also don't need to be emitted by services to be used. For example, | ||
statically derived attributes that only need information from the S3 event could be initialized into the filemanager | ||
database as it's deployed. | ||
|
||
Rules could be updated if required. There could also be different operators that merge attributes in different ways, | ||
e.g. 'append attribute', 'append if not exists', 'update attribute', 'overwrite attribute', etc. | ||
|
||
A potential disadvantage is that the rules engine may not be flexible enough to accommodate all attribute requirements. | ||
E.g. it's not possible to execute arbitrary code to compute the attribute. | ||
|
||
### Technical challenges | ||
|
||
It doesn't seem like there are many implementations of JSON rules engines. In Rust there is [json-rules-engine-rs], | ||
which seems to be based on the javascript [json-rules-engine]. Notably, there is | ||
[Event Ruler][aws-event-ruler], which is a Java library and is what AWS EventBridge rules use. Calling a Java library | ||
from Rust would require some FFI bindings. | ||
|
||
An existing library doesn't have to be used. Since the format of S3 events is known in advance, | ||
a simpler approach would probably involve implementing the rules manually in filemanager, leveraging something like [serde_json]. | ||
|
||
[json-rules-engine-rs]: https://github.com/GopherJ/json-rules-engine-rs | ||
[json-rules-engine]: https://github.com/CacheControl/json-rules-engine | ||
[aws-event-ruler]: https://github.com/aws/event-ruler | ||
[serde_json]: https://github.com/serde-rs/json | ||
|
||
### Architecture | ||
|
||
The architecture of this approach could look something like this, where each service emits rules for the filemanager to | ||
consume: | ||
|
||
![filemanager_attribute_linking](./filemanager_attributes.drawio.svg) | ||
|
||
Here the filemanager stores rules in its database and processes them directly. | ||
|
||
Alternatively, the linking logic could be a separate microservice (FileManagerAttributeManager? ThePreFileManagerManager?): | ||
|
||
![filemanager_attribute_linking_service](./filemanager_attributes_alt.drawio.svg) | ||
|
||
Here the filemanager ingests events that contain additional attributes from another SQS queue, and the | ||
attribute manager consumes events from the S3 queue. In order to update existing records, the filemanager could | ||
accept a POST request to update a set of records that the attribute manager knows about. | ||
|
||
An advantage of this approach is that it can use different languages, which would be useful if using rules libraries like | ||
Event Ruler. | ||
|
||
The disadvantage is that it adds more complexity, and more latency in the S3 event processing, because now | ||
the filemanager is no longer directly consuming S3 events. | ||
|
||
## Alternatives | ||
|
||
Instead of microservices pushing rules into the event bus, the filemanager could query the microservices to decide what | ||
to do with the events. However, this adds many API calls if the filemanager has to query on a per-event basis. | ||
|
||
Instead of reading/parsing rules in JSON, there could be a filemanager extension/plugin system which runs on each S3 | ||
event to determine attributes. This could be separate from the filemanager code, and would work well for | ||
statically derived attributes. However, it may also introduce many API calls if the filemanager has to query other microservices | ||
on a per-event basis. | ||
|
||
A combination of these approaches is also possible, where there is some rule matching, and an extension which can | ||
query other microservices/perform complex logic on the matched events only. | ||
|
||
## Questions | ||
|
||
1. Is a rule-based regex-style approach enough to cover all use-cases for generating attributes, or does more complicated | ||
logic need to happen? | ||
2. Are expiry/start dates for rules flexible enough to deal with changes in the rules over time? | ||
|
4 changes: 2 additions & 2 deletions
lib/workload/stateless/stacks/filemanager/filemanager-api-server/README.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,10 +1,10 @@ | ||
# filemanager-api-server | ||
|
||
An instance of the filemanager api which can be launched as a webserver. The default address which the webserver uses | ||
- is `localhost:8080`. Set the `FILEMANAGER_API_SERVER_ADDR` environment variable to change this. To run the local server: | ||
+ is `0.0.0.0:8000`. Set the `FILEMANAGER_API_SERVER_ADDR` environment variable to change this. To run the local server: | ||
|
||
```sh | ||
make api | ||
``` | ||
|
||
- Then, checkout the OpenAPI docs at: `http://localhost:8080/swagger_ui`. | ||
+ Then, checkout the OpenAPI docs at: `http://localhost:8000/swagger_ui`. |