Merge pull request #457 from umccr/feat/filemanager-api-matching

feat: filemanager API wildcards
umccr · Aug 9, 2024 · 71ac639 · 71ac639
2 parents 4cced9a + bce0135
commit 71ac639
Showing 14 changed files with 1,556 additions and 211 deletions.
diff --git a/lib/workload/stateless/stacks/filemanager/README.md b/lib/workload/stateless/stacks/filemanager/README.md
@@ -86,7 +86,7 @@ Alternatively, just `brew install dbeaver-community` to easily browse the databa
 
 ## Local API server
 
-To use the local API server, run:
+For more details on the filemanager API, see the [`API_GUIDE.md`][api-guide]. To use the local API server, run:
 
 ```sh
 make api
@@ -125,7 +125,7 @@ docker system prune -a --volumes
 ## Architecture
 
 The filemanager ingest functionality operates to ensure eventual consistency in the database records. See the 
-[ARCHITECTURE.md][architecture] for more details.
+[`ARCHITECTURE.md`][architecture] for more details.
 
 ## Project Layout
 
@@ -142,6 +142,7 @@ The project is divided into multiple crates that serve different functionality.
 * [database]: Database migration files and queries.
 
 [architecture]: docs/ARCHITECTURE.md
+[api-guide]: docs/API_GUIDE.md
 [filemanager]: filemanager
 [filemanager-api-lambda]: filemanager-api-lambda
 [filemanager-api-server]: filemanager-api-server

diff --git a/lib/workload/stateless/stacks/filemanager/docs/API_GUIDE.md b/lib/workload/stateless/stacks/filemanager/docs/API_GUIDE.md
@@ -0,0 +1,169 @@
+# Filemanager API
+
+The filemanager API gives access to S3 object records for all [S3 file events][s3-events] which are recorded in the database.
+
+To start a local API server and view the OpenAPI documentation, run the following:
+
+```sh
+make api
+```
+
+This serves Swagger OpenAPI docs at `http://localhost:8000/swagger_ui` when using default settings.
+
+The deployed instance of the filemanager API can be reached using the desired stage at `https://file.<stage>.umccr.org`
+using the orcabus API token. To retrieve the token, run:
+
+```sh
+export TOKEN=$(aws secretsmanager get-secret-value --secret-id orcabus/token-service-jwt --output json --query SecretString | jq -r 'fromjson | .id_token')
+```
+
+## Querying records
+
+The API is designed to have a standard set of REST routes which can be used to query for records. The API is versioned with a
+`/api/v1` route prefix, and S3 object records can be reached under `/api/v1/s3_objects`.
+
+For example, to query a single record, use the `s3_object_id` in the path, which returns the JSON record:
+
+```sh
+curl -H "Authorization: Bearer $TOKEN" "https://file.dev.umccr.org/api/v1/s3_objects/0190465f-68fa-76e4-9c36-12bdf1a1571d" | jq
+```
+
+Multiple records can be reached using the same route, which returns an array of JSON records:
+
+```sh
+curl -H "Authorization: Bearer $TOKEN" "https://file.dev.umccr.org/api/v1/s3_objects" | jq
+```
+
+This route is paginated, and by default returns 1000 records from the first page in a JSON list response:
+
+```json
+{
+ "next_page": 1,
+ "results": [
+ "<first 1000 s3_object records>..."
+ ]
+}
+```
+
+Use the `page` and `page_size` query parameters to control the pagination:
+
+```sh
+curl -H "Authorization: Bearer $TOKEN" "https://file.dev.umccr.org/api/v1/s3_objects?page=10&page_size=50" | jq
+```
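A minimal sketch of walking every page. It assumes, and the guide does not confirm, that pages are zero-indexed (the `"next_page": 1` in the first response suggests so) and that `next_page` is absent or null on the last page. `Page` and `collect_all` are hypothetical names, and the `fetch` closure stands in for a GET request to `/api/v1/s3_objects?page=<n>&page_size=<m>`:

```rust
/// Hypothetical shape of one paginated response. `next_page` and `results`
/// mirror the JSON above; `None` is ASSUMED to mean "no more pages".
pub struct Page {
    pub next_page: Option<u64>,
    pub results: Vec<String>,
}

/// Collect the records from every page, starting at page 0 (assumed first page).
/// `fetch` stands in for an HTTP request for the given page number.
pub fn collect_all<F>(mut fetch: F) -> Vec<String>
where
    F: FnMut(u64) -> Page,
{
    let mut records = Vec::new();
    let mut page = Some(0);
    // Follow `next_page` until the server stops returning one.
    while let Some(number) = page {
        let response = fetch(number);
        records.extend(response.results);
        page = response.next_page;
    }
    records
}
```

Passing the request as a closure keeps the loop independent of any particular HTTP client.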
+
+The records can be filtered using the same fields from the record by naming the field in a query parameter.
+For example, query all records for a certain bucket:
+
+```sh
+curl -H "Authorization: Bearer $TOKEN" "https://file.dev.umccr.org/api/v1/s3_objects?bucket=umccr-temp-dev" | jq
+```
+
+Since the filemanager database keeps a copy of all S3 events that it receives, old records for deleted objects
+are also kept in the database. In order to retrieve only current objects, that is, objects that are still in S3 and
+don't have an associated `Deleted` event, use the `current_state` query parameter:
+
+```sh
+curl -H "Authorization: Bearer $TOKEN" "https://file.dev.umccr.org/api/v1/s3_objects?current_state=true" | jq
+```
+
+### Attributes
+
+The filemanager can save JSON attributes on any record. Attributes can be queried in a similar way to
+filtering on record fields. The syntax for attribute querying uses square brackets to access nested JSON fields, similar
+to the syntax defined by the [qs] npm package. Brackets should be percent-encoded in URLs.
+
+For example, query for a previously set `portal_run_id`:
+
+```sh
+curl --get -H "Authorization: Bearer $TOKEN" --data-urlencode "attributes[portal_run_id]=202405212aecb782" \
+"https://file.dev.umccr.org/api/v1/s3_objects" | jq
+```
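When building the URL by hand rather than with `curl --data-urlencode`, the brackets need to be percent-encoded. A minimal sketch using only the standard library. `percent_encode` and `attribute_query` are hypothetical helpers, not part of filemanager, and the encoder is deliberately conservative (it encodes everything except RFC 3986 unreserved characters):

```rust
/// Percent-encode every byte except RFC 3986 unreserved characters
/// (ASCII letters, digits, `-`, `.`, `_`, `~`).
pub fn percent_encode(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for byte in input.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(byte as char)
            }
            _ => out.push_str(&format!("%{:02X}", byte)),
        }
    }
    out
}

/// Build an `attributes[<name>]=<value>` query pair with encoded brackets
/// (hypothetical helper for illustration).
pub fn attribute_query(name: &str, value: &str) -> String {
    format!(
        "attributes%5B{}%5D={}",
        percent_encode(name),
        percent_encode(value)
    )
}
```

For instance, `attribute_query("portal_run_id", "202405212aecb782")` yields `attributes%5Bportal_run_id%5D=202405212aecb782`, which can be appended after `?` in the request URL.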
+
+> [!NOTE] 
+> Attributes on filemanager records start empty. They need to be added to the record to query on them later.
+> See [updating records](#updating-records).
+
+### Wildcard matching
+
+The API supports using wildcards to match multiple characters in a value for most fields. Use `%` to match multiple characters
+and `_` to match one character. These queries get converted to postgres `like` queries under the hood. For example, query
+on a key prefix:
+
+```sh
+curl --get -H "Authorization: Bearer $TOKEN" --data-urlencode "key=temp\_data%" \
+"https://file.dev.umccr.org/api/v1/s3_objects" | jq
+```
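To make the pattern semantics concrete, here is an illustrative matcher for `like`-style patterns. It is not how filemanager evaluates queries (those are delegated to postgres); it is only a sketch of the rules: `%` matches any run of characters, `_` matches exactly one, and a backslash makes the next character literal, which is why the example above writes `temp\_data%`. The function names are hypothetical:

```rust
/// Illustrative matcher for Postgres-style `like` patterns.
/// Simple backtracking, so worst-case cost grows quickly with many `%`.
pub fn like(pattern: &str, text: &str) -> bool {
    let p: Vec<char> = pattern.chars().collect();
    let t: Vec<char> = text.chars().collect();
    like_chars(&p, &t)
}

fn like_chars(p: &[char], t: &[char]) -> bool {
    match p.split_first() {
        // Pattern exhausted: succeed only if the text is exhausted too.
        None => t.is_empty(),
        // `%`: try every possible split point of the remaining text.
        Some((&'%', rest)) => (0..=t.len()).any(|i| like_chars(rest, &t[i..])),
        // `_`: consume exactly one character.
        Some((&'_', rest)) => !t.is_empty() && like_chars(rest, &t[1..]),
        // `\x`: match `x` literally (a trailing backslash never matches here).
        Some((&'\\', rest)) => match (rest.split_first(), t.split_first()) {
            (Some((c, rest)), Some((d, t))) => c == d && like_chars(rest, t),
            _ => false,
        },
        // Any other character must match itself.
        Some((c, rest)) => match t.split_first() {
            Some((d, t)) => c == d && like_chars(rest, t),
            None => false,
        },
    }
}
```

With this sketch, `like("temp\\_data%", "temp_data/file.txt")` matches, while the same pattern does not match `"tempXdata/file.txt"`; without the backslash, `_` would accept the `X`.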
+
+Case-insensitive wildcard matching, which gets converted to a postgres `ilike` statement, is supported by using `case_sensitive`:
+
+```sh
+curl --get -H "Authorization: Bearer $TOKEN" --data-urlencode "key=temp\_data%" \
+"https://file.dev.umccr.org/api/v1/s3_objects?case_sensitive=false" | jq
+```
+
+Wildcard matching is also supported on attributes:
+
+```sh
+curl --get -H "Authorization: Bearer $TOKEN" --data-urlencode "attributes[portal_run_id]=20240521%" \
+"https://file.dev.umccr.org/api/v1/s3_objects" | jq
+```
+
+## Updating records
+
+As part of allowing filemanager to link and query on attributes, attributes can be updated using PATCH requests.
+Each of the above endpoints and queries supports a PATCH request which can be used to update attributes on a set
+of records, instead of listing records. All query parameters except pagination are supported for updates.
+Attributes are updated using [JSON patch][json-patch].
+
+For example, update attributes on a single record:
+
+```sh
+curl -X PATCH -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
+--data '{ "attributes": [ { "op": "add", "path": "/portal_run_id", "value": "202405212aecb782" } ] }' \
+"https://file.dev.umccr.org/api/v1/s3_objects/0190465f-68fa-76e4-9c36-12bdf1a1571d" | jq
+```
+
+Or, update attributes for multiple records whose keys contain the same substring:
+
+```sh
+curl -X PATCH -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
+--data '{ "attributes": [ { "op": "add", "path": "/portal_run_id", "value": "202405212aecb782" } ] }' \
+"https://file.dev.umccr.org/api/v1/s3_objects?key=%25202405212aecb782%25" | jq
+```
+
+## Count objects
+
+There is an API route which counts the total number of records in the database, which supports
+similar query parameters as the regular list operations.
+
+For example, count the total records:
+
+```sh
+curl -H "Authorization: Bearer $TOKEN" "https://file.dev.umccr.org/api/v1/s3_objects/count" | jq
+```
+
+## The `objects` record
+
+There is a similar record kept in the filemanager database called `object`. A similar REST API
+is available for these records under `/api/v1/objects`, however `object` records currently don't serve
+a purpose in filemanager. They were initially included to support attribute linking, however they will likely
+be removed because attribute linking can be accomplished using the `attributes` column on `s3_object`.
+
+## Some missing features
+
+There are some missing features in the query API which are planned, namely:
+
+* There is no way to compare values with `>`, `>=`, `<`, `<=`.
+* There is no way to express `and` or `or` conditions in the API.
+
+There are also some features missing for attribute linking. For example, there is no way
+to capture matching wildcard groups which can later be used in the JSON patch body.
+
+There is also no way to POST an attribute linking rule, which can be used to update S3 records
+as they are received by filemanager. See [ATTRIBUTE_LINKING.md][attribute-linking] for a discussion on some approaches
+for this. The likely solution will involve merging the above wildcard matching logic with attribute rules.
+
+[json-patch]: https://jsonpatch.com/
+[qs]: https://github.com/ljharb/qs
+[s3-events]: https://docs.aws.amazon.com/AmazonS3/latest/userguide/EventNotifications.html
+[attribute-linking]: ATTRIBUTE_LINKING.md
diff --git a/lib/workload/stateless/stacks/filemanager/docs/ATTRIBUTE_LINKING.md b/lib/workload/stateless/stacks/filemanager/docs/ATTRIBUTE_LINKING.md
@@ -0,0 +1,145 @@
+> [!IMPORTANT] 
+> This document is a discussion on potential designs for filemanager attribute linking. The current implementation
+> isn't exactly like this, although it will probably contain components of the design here.
+
+# FileManager Attribute Linking
+
+The filemanager needs to be able to store data from other microservices on `s3_object` records in order to perform some
+logic, e.g. querying on the stored data. Ideally the filemanager should only deal with object records, without having
+to know the domain of other microservices. This means that the filemanager needs a mechanism to store arbitrary
+data, and be told what data to store for a given record.
+
+## Storing attributes
+
+The first part of the problem, i.e. storing data on records, is solved by using the `attributes` column on `s3_object`.
+The `attributes` can store arbitrary JSON data, which can be queried for using postgres JSON extensions.
+For example, fetching all objects where the attributes contain an `attribute_id` looks like:
+
+```sql
+select * from s3_object where attributes @> '{ "attribute_id": "some_id" }';
+```
+
+From the API a nested style syntax can be used:
+
+```
+/api/v1/s3_objects?attributes[attribute_id]=some_id
+```
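As a rough illustration of the nested syntax, the sketch below splits a bracketed key into path segments, so that `attributes[attribute_id]` addresses the `attribute_id` field inside `attributes`. `bracket_path` is a hypothetical helper; it covers only the simple `name[a][b]` form and is not the parser that filemanager or the qs package uses:

```rust
/// Split a bracketed key such as `attributes[a][b]` into its path segments.
/// Returns `None` when a bracket is missing or out of place.
pub fn bracket_path(key: &str) -> Option<Vec<&str>> {
    // Everything before the first `[` is the top-level field name.
    let (head, mut rest) = match key.find('[') {
        Some(index) => key.split_at(index),
        None => return Some(vec![key]),
    };
    let mut path = vec![head];
    while !rest.is_empty() {
        // Each remaining segment must look like `[name]`.
        let inner = rest.strip_prefix('[')?;
        let close = inner.find(']')?;
        path.push(&inner[..close]);
        rest = &inner[close + 1..];
    }
    Some(path)
}
```

Here `bracket_path("attributes[a][b]")` gives the segments `attributes`, `a`, `b`, which would correspond to the JSON path `{ "a": { "b": ... } }` inside the `attributes` column.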
+
+Attributes can be used by other services to perform logic, e.g. the UI could fetch all objects where
+the `portal_run_id` equals `<some_id>`.
+
+## Knowing what attributes to store
+
+The filemanager needs to be told by other services what attributes to store. However, it only receives native S3 events
+as input which cannot be edited.
+
+A generalisable way to accomplish this is to allow the filemanager to accept a set of 'rules', which given
+an S3 event as input, determine the attribute. For example, a rule could be:
+
+* For all events where the bucket equals 'umccr-temp-dev', and the key starts with a prefix 'analysis_data/.../.../.../...'
+extract the 4th path segment and add the attribute: `{ "portal_run_id": "<4th path segment>" }`.
+
+### Rules engine
+
+The microservice which knows about the rule could tell the filemanager about it. The rules could be published on the
+event bus and use a JSON rules engine, similar to the way EventBridge rules are parsed.
+
+For example, the workflow manager could tell filemanager a rule about matching buckets/keys using the following event message:
+
+```json
+{
+ "detail-type": [
+ "FileManagerAttributeRule"
+ ],
+ "source": [
+ "orcabus.workflowmanager"
+ ],
+ "detail": {
+ "rule": {
+ "bucket": "umccr-temp-dev",
+ "key": "some_prefix/(*)/*"
+ },
+ "apply": {
+ "some_attribute_id": "<first_wildcard_capture_group>"
+ },
+ "start_from": "<apply_to_events_after_this_date>",
+ "expires": "<date_where_rule_no_longer_applies>"
+ }
+}
+```
+
+This rule would match all S3 events that have 'umccr-temp-dev' as the bucket, and keys under 'some_prefix' containing a
+regex-style capture group. The rule only applies to events received between `start_from` and `expires`.
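A minimal sketch of how such a key pattern might be matched. The document does not define the wildcard semantics, so this assumes that `*` matches any run of characters (shortest match first) and that `(*)` is a wildcard whose match is captured. `match_key` and the `Token` type are hypothetical and only illustrate the idea:

```rust
#[derive(Clone, Copy, PartialEq)]
enum Token {
    Literal(char),
    Wildcard,         // `*`
    CapturedWildcard, // `(*)`
}

/// Turn a pattern like `some_prefix/(*)/*` into tokens.
fn tokenize(pattern: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut rest = pattern;
    while let Some(c) = rest.chars().next() {
        if let Some(after) = rest.strip_prefix("(*)") {
            tokens.push(Token::CapturedWildcard);
            rest = after;
        } else {
            tokens.push(if c == '*' { Token::Wildcard } else { Token::Literal(c) });
            rest = &rest[c.len_utf8()..];
        }
    }
    tokens
}

/// Backtracking matcher that records the text matched by each `(*)`.
fn walk(tokens: &[Token], text: &[char], captures: &mut Vec<String>) -> bool {
    let (token, rest) = match tokens.split_first() {
        Some(pair) => pair,
        None => return text.is_empty(),
    };
    match token {
        Token::Literal(c) => text.first() == Some(c) && walk(rest, &text[1..], captures),
        Token::Wildcard | Token::CapturedWildcard => {
            // ASSUMPTION: try the shortest match first, then grow it.
            for end in 0..=text.len() {
                let mark = captures.len();
                if *token == Token::CapturedWildcard {
                    captures.push(text[..end].iter().collect());
                }
                if walk(rest, &text[end..], captures) {
                    return true;
                }
                captures.truncate(mark); // undo captures from the failed attempt
            }
            false
        }
    }
}

/// Returns the captured groups if `key` matches `pattern`, otherwise `None`.
pub fn match_key(pattern: &str, key: &str) -> Option<Vec<String>> {
    let tokens = tokenize(pattern);
    let text: Vec<char> = key.chars().collect();
    let mut captures = Vec::new();
    if walk(&tokens, &text, &mut captures) {
        Some(captures)
    } else {
        None
    }
}
```

Under these assumptions, `match_key("some_prefix/(*)/*", "some_prefix/run1/a/b.txt")` captures `run1`, which could then fill `<first_wildcard_capture_group>` in the `apply` section of the rule.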
+
+Filemanager would store this rule, and check existing rules for each S3 event to see if it needs to add
+attributes. If the rule is received by filemanager after an event has already fired, the filemanager can
+apply the rule retroactively to its database records.
+
+The advantage of this approach is that it is quite general, and it means that the filemanager doesn't need to know any
+details about other microservices' logic/domains. Rules also don't need to be emitted by services to be used. For example,
+statically derived attributes that only need information from the S3 event could be initialized into the filemanager
+database as it's deployed.
+
+Rules could be updated if required. There could also be different operators that merge attributes in different ways,
+e.g. 'append attribute', 'append if not exists', 'update attribute', 'overwrite attribute', etc.
+
+A potential disadvantage is that the rules engine may not be flexible enough to accommodate all attribute requirements.
+E.g. it's not possible to execute arbitrary code to compute the attribute.
+
+### Technical challenges
+
+It doesn't seem like there are many implementations of JSON rules engines. In Rust there is [json-rules-engine-rs],
+which seems to be based on the JavaScript [json-rules-engine]. Notably, there is
+[Event Ruler][aws-event-ruler], which is a Java library and is what AWS EventBridge rules use. Calling a Java library
+from Rust would require some FFI bindings.
+
+An existing library doesn't have to be used. Since the format of S3 events is known in advance, 
+a simpler approach would probably involve implementing the rules manually in filemanager, leveraging something like [serde_json].
+
+[json-rules-engine-rs]: https://github.com/GopherJ/json-rules-engine-rs
+[json-rules-engine]: https://github.com/CacheControl/json-rules-engine
+[aws-event-ruler]: https://github.com/aws/event-ruler
+[serde_json]: https://github.com/serde-rs/json
+
+### Architecture
+
+The architecture of this approach could look something like this, where each service emits rules for the filemanager to
+consume:
+
+![filemanager_attribute_linking](./filemanager_attributes.drawio.svg)
+
+Here the filemanager stores rules in its database and processes them directly.
+
+Alternatively, the linking logic could be a separate microservice (FileManagerAttributeManager? ThePreFileManagerManager?):
+
+![filemanager_attribute_linking_service](./filemanager_attributes_alt.drawio.svg)
+
+Here the filemanager ingests events that contain additional attributes from another SQS queue, and the
+attribute manager consumes events from the S3 queue. In order to update existing records, the filemanager could
+accept a POST request to update a set of records that the attribute manager knows about.
+
+An advantage of this approach is that it can use different languages, which would be useful if using rules libraries like
+Event Ruler.
+
+The disadvantage is that it adds more complexity, and more latency in the S3 event processing, because now
+the filemanager is no longer directly consuming S3 events.
+
+## Alternatives
+
+Instead of microservices pushing rules into the event bus, the filemanager could query the microservices to decide what
+to do with the events. However, this adds many API calls if the filemanager has to query on a per-event basis.
+
+Instead of reading/parsing rules in JSON, there could be a filemanager extension/plugin system which runs on each S3
+event to determine attributes. This could be separate from the filemanager code, and would work well for
+statically derived attributes. However, it may also introduce many API calls if the filemanager has to query other microservices
+on a per-event basis.
+
+A combination of these approaches is also possible, where there is some rule matching, and an extension which can 
+query other microservices/perform complex logic on the matched events only.
+
+## Questions
+
+1. Is a rule-based regex-style approach enough to cover all use-cases for generating attributes, or does more complicated
+ logic need to happen?
+2. Are expiry/start dates for rules flexible enough to deal with changes in the rules over time?
+
diff --git a/lib/workload/stateless/stacks/filemanager/filemanager-api-server/README.md b/lib/workload/stateless/stacks/filemanager/filemanager-api-server/README.md
@@ -1,10 +1,10 @@
 # filemanager-api-server
 
 An instance of the filemanager api which can be launched as a webserver. The default address which the webserver uses
-is `localhost:8080`. Set the `FILEMANAGER_API_SERVER_ADDR` environment variable to change this. To run the local server:
+is `0.0.0.0:8000`. Set the `FILEMANAGER_API_SERVER_ADDR` environment variable to change this. To run the local server:
 
 ```sh
 make api
 ```
 
-Then, checkout the OpenAPI docs at: `http://localhost:8080/swagger_ui`.
+Then, check out the OpenAPI docs at: `http://localhost:8000/swagger_ui`.
diff --git a/lib/workload/stateless/stacks/filemanager/filemanager/src/error.rs b/lib/workload/stateless/stacks/filemanager/filemanager/src/error.rs
@@ -37,7 +37,7 @@ pub enum Error {
  QueryError(String),
  #[error("invalid input: `{0}`")]
  InvalidQuery(String),
- #[error("expected some value for id: `{0}`")]
+ #[error("expected record for id: `{0}`")]
  ExpectedSomeValue(Uuid),
 }