Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SPICE-0009: External Readers #10

Open
wants to merge 10 commits into
base: main
Choose a base branch
from
Open

Conversation

@HT154 HT154 marked this pull request as ready for review August 7, 2024 01:26
Copy link
Contributor

@bioball bioball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some high level thoughts:

Using a persistent process with message passing isn't a bad idea--for processes with a higher startup cost (e.g. JVM-based processes), you pay that penalty once for the life of the evaluator, rather than every time a read() happens. However, there's an argument to be made for having the external reader binary be short-lived. For example, this makes it possible to write an external reader that's simply a bash script. And making these things bash-script-able is quite nice:

#!/usr/bin/env bash

set -e

URI="$1"

if [[ "$URI" != ldap:* ]]; then
  >&2 echo "Unexpected URI: $URI. Expected ldap scheme."
  exit 1
fi

ldapsearch -D "${URI:5}"

If it's a persistent process, it should probably be spawned the first time a module or resource is read, then kept alive while the evaluator session is alive (e.g. until it is closed).

I don't know if we need a DiscoverReadersRequest and DiscoverReadersResponse? Wouldn't the evaluator already know about the external readers it needs?

spices/SPICE-0009-external-readers.adoc Outdated Show resolved Hide resolved
spices/SPICE-0009-external-readers.adoc Outdated Show resolved Hide resolved
spices/SPICE-0009-external-readers.adoc Show resolved Hide resolved
Copy link

@stackoverflow stackoverflow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm missing an example that shows the whole process. Your example just shows a Pkl file using the reader and a cli call, but it's not clear to me how these things glue together.

Also think having the processes being long-running would put too much hurdles for people who want to contribute a reader, so let's keep that as a future improvement.

For the package support:
That seems to complicated the design quite a bit. Now Pkl has to understand how to properly process packages-as-readers. If my reader is a Ruby script and I want to deploy it as a package, do I have to bundle the script together? Gets even more complicated with executables and multiplatform support. Or maybe I misunderstood you and package readers have to be Pkl-only.
I'm not sure how useful writing readers in Pkl instead of general languages will be. Though perhaps I'm wrong. My point is, maybe pkl packages are not the best distribution mechanism for readers, even though I can't think of a good solution for this.

spices/SPICE-0009-external-readers.adoc Outdated Show resolved Hide resolved
spices/SPICE-0009-external-readers.adoc Outdated Show resolved Hide resolved
spices/SPICE-0009-external-readers.adoc Outdated Show resolved Hide resolved
@HT154
Copy link
Author

HT154 commented Aug 8, 2024

Thanks for the review @bioball!

Using a persistent process with message passing isn't a bad idea--for processes with a higher startup cost (e.g. JVM-based processes), you pay that penalty once for the life of the evaluator, rather than every time a read() happens. However, there's an argument to be made for having the external reader binary be short-lived. For example, this makes it possible to write an external reader that's simply a bash script.

I definitely agree that there's utility in enabling the super simple scripting route! My primary motivation for a persistent subprocess is, as you've identified, startup cost.

Here's a real world example: I would like to be able to use an external reader to mediate access to a secret store in a large monorepo. I'd estimate that there are 80+ unique resource URIs that would be read during a "world eval". The process that would implement the external reader currently takes ~0.25 seconds to invoke from the command line (in the happy path), which would add ~20 seconds to the overall evaluation time. Most of that quarter second is process and initialization overhead and the actual read takes ~0.05 seconds, so if the reader process were persistent it would save ~15 seconds from the total evaluation time.

If persistent reader processes are implemented, it also becomes fairly trivial to implement one-shot readers. Spitballing, it might look something like this:

pkl eval <module.pkl> --external-reader 'pkl-cmd ldap=path/to/pkl-ldap.sh ldaps=path/to/pkl-ldap.sh' --allowed-resources ...

Where the pkl-cmd executable implements the external reader message passing API and registers a ResourceReader for each specified scheme->script mapping. I'd be interested in shipping pkl-cmd as part of the Pkl project.

If it's a persistent process, it should probably be spawned the first time a module or resource is read, then kept alive while the evaluator session is alive (e.g. until it is closed).

I don't know if we need a DiscoverReadersRequest and DiscoverReadersResponse? Wouldn't the evaluator already know about the external readers it needs?

The ergonomic motivation for spawning a persistent process on startup and having the readers be discovered at runtime would be to decouple which reader binaries are used from which schemes they provide. If one-shot reader processes are used, the scheme->binary registration is unavoidable, but persistent readers may implement many schemes. Having to perform the same mapping for persistent processes would be very verbose (since setting allowed modules is also required), but would be able to spawn the process on first use instead of during evaluator initialization. Another consideration here is user error: if a user registers a reader for a scheme that it does not provide, how should pkl respond?

@HT154
Copy link
Author

HT154 commented Aug 8, 2024

Thanks for the review @stackoverflow!

I'm missing an example that shows the whole process. Your example just shows a Pkl file using the reader and a cli call, but it's not clear to me how these things glue together.

I think I've addressed this in other comments, but please let me know if you still have questions.

Also think having the processes being long-running would put too much hurdles for people who want to contribute a reader, so let's keep that as a future improvement.

My main concerns with only targeting "one-shot" reader processes at this time are:

  • It's not clear how to support globbing (i.e. ListResourcesRequest/ListModulesRequest)
  • It may make "persistent" readers more complicated to implement in the future from a configuration/CLI perspective
  • Performance, as outlined above

Above I proposed a pkl-cmd persistent external reader that could function as an "adapter" of sorts to simpler one-shot reader processes. I definitely agree that the simplicity of writing a simple script should be a goal here, and I don't think the more involved persistent reader design rules that out.

For the package support: That seems to complicated the design quite a bit. Now Pkl has to understand how to properly process packages-as-readers.

I may be misinterpreting this concern, but I'm not sure this is so complicated. Reader executables would need to be supplied as package file URIs including the fragment #/path/to/binary.

If my reader is a Ruby script and I want to deploy it as a package, do I have to bundle the script together?

My thought here is to primarily target languages for which (message-passing) bindings already exist. Both Golang and Swift (at least on linux) can produce statically linked binaries that can be packaged dependency-free. If there are bindings for interpreted languages in the future, distribution would likely need to be via the language package manager, which this design does not rule out.

Gets even more complicated with executables and multiplatform support.

Definitely agree, but I think pkl:platform makes this reasonable. I show what this might look like in a PklProject here: #10 (comment)

@stackoverflow
Copy link

I think I've addressed this in other comments, but please let me know if you still have questions.

You addressed it in the comments, but someone reading the SPICE shouldn't have to read comments to understand it, so it would be nice if that was clear in the SPICE itself.

@stackoverflow
Copy link

About having readers being persistent:

I think you are approaching this from the wrong side. For Pkl, the reader should just be a process it spawns once it finds a read. If you want it to be persistent (which I consider an edge case) you can run a persistent background process and write a simple script for Pkl to call which would pipe the reads.
This would simplify your proposal and there wouldn't be a need for a pkl-cmd.

Also it's not clear to me how errors and the communication happens. I imagine we can keep the unix tradition of using return codes so 0 would mean success and other numbers would be errors. stderr could contain the error message in case things went wrong and stdout would contain the actual result of the read. But anyway, all of this should be in the SPICE.

@bioball
Copy link
Contributor

bioball commented Aug 9, 2024

Josh actually brings up a good point--the process of resolving a read can involve multiple calls to a reader, which can make a short-lived execution quite complicated. For example, resolving a glob star import (e.g. import*("foo:/**.pkl") is linear to the number of directories needed to read.

Additionally, Pkl wants to know three facts about each scheme:

  • Is it globbable?
  • Is it hierarchical?
  • Is it local?

It's quite a rough experience to provide all this to Pkl as CLI args: this is quite long: --external-reader ldap=pkl-ldap --globbable-reader=pkl-ldap --hierarchical-reader=pkl-ldap --local-reader=pkl-ldap.

I echo Islon's concerns about distributing binaries using Pkl packages. I see the value here, but this isn't what packages are designed for. Besides, there are already many ways to distribute binaries; languages typically already offer their own mechanisms for installing executables (go install, npm isntall -g, pip install, etc), and so do OS-level tools (apt-get, rpm, winget, yum, brew). Also: a tarball isn't a sufficient way to distribute executables; some executables need to link to libraries (e.g. glibc), and others require a runtime (e.g. Node.js). It's better to delegate this task to something else.

Some additional thoughts:

  • Right now, the message passing API is held outside of pkl-core. We did this to reduce the number of dependencies (no msgpack), and to minimize the amount of core that gets shipped in our core library. This feature implies that much of our message passing logic (e.g. MessageTransport) gets moved to pkl-core. However, I think we want this anyway--we've talked about moving BinaryEvaluator into pkl-core, and providing a serializer/deserializer to our Java library users (and perhaps even in-language).
  • I don't know if we want shelling out to executables to become the norm for Pkl. This feature makes sense to bring feature parity to the CLI, but library users may not want Pkl to spawn executables. However, it still makes sense that library users should be able to configure Pkl to do this if the CLI can do this too.

@HT154
Copy link
Author

HT154 commented Aug 17, 2024

I've taken another pass on this. Here are the major changes:

  • Moved URI-based reader executables to future directions
  • Addressed exec-per-read and up-front scheme->reader registration under alternatives considered
  • Added significant detail to the detailed design section

spices/SPICE-0009-external-readers.adoc Outdated Show resolved Hide resolved
spices/SPICE-0009-external-readers.adoc Outdated Show resolved Hide resolved
spices/SPICE-0009-external-readers.adoc Outdated Show resolved Hide resolved
spices/SPICE-0009-external-readers.adoc Outdated Show resolved Hide resolved
spices/SPICE-0009-external-readers.adoc Outdated Show resolved Hide resolved
HT154 added a commit to HT154/pkl that referenced this pull request Sep 9, 2024
This is preparatory work for [SPICE-0009](apple/pkl-evolution#10).
It is being contributed in a separate pull request to ease review.

The Message, Message(En|De)coder, and MessageTransport types have been ported to Java and moved to a new `org.pkl.core.messaging` package.
HT154 added a commit to HT154/pkl that referenced this pull request Sep 9, 2024
This is preparatory work for [SPICE-0009](apple/pkl-evolution#10).
It is being contributed in a separate pull request to ease review.

The Message, Message(En|De)coder, and MessageTransport types have been ported to Java and moved to a new `org.pkl.core.messaging` package.
HT154 added a commit to HT154/pkl that referenced this pull request Sep 9, 2024
This is preparatory work for [SPICE-0009](apple/pkl-evolution#10).
It is being contributed in a separate pull request to ease review.

The Message, Message(En|De)coder, and MessageTransport types have been ported to Java and moved to a new `org.pkl.core.messaging` package.
HT154 added a commit to HT154/pkl that referenced this pull request Sep 9, 2024
This is preparatory work for [SPICE-0009](apple/pkl-evolution#10).
It is being contributed in a separate pull request to ease review.

The Message, Message(En|De)coder, and MessageTransport types have been ported to Java and moved to a new `org.pkl.core.messaging` package.
HT154 added a commit to HT154/pkl that referenced this pull request Sep 9, 2024
This is preparatory work for [SPICE-0009](apple/pkl-evolution#10).
It is being contributed in a separate pull request to ease review.

The Message, Message(En|De)coder, and MessageTransport types have been ported to Java and moved to a new `org.pkl.core.messaging` package.
HT154 added a commit to HT154/pkl that referenced this pull request Sep 9, 2024
This is preparatory work for [SPICE-0009](apple/pkl-evolution#10).
It is being contributed in a separate pull request to ease review.

The Message, Message(En|De)coder, and MessageTransport types have been ported to Java and moved to a new `org.pkl.core.messaging` package.
HT154 added a commit to HT154/pkl that referenced this pull request Sep 15, 2024
This is preparatory work for [SPICE-0009](apple/pkl-evolution#10).
It is being contributed in a separate pull request to ease review.

This is required to allow deferred launch of external reader processes without requiring up-front configuration of scheme->process mappings.
More details on why this is required here: apple/pkl-evolution#10 (comment)

One concern here (which applies before this change as well) is that the semantics of module key factories and resource readers is reversed.
For module key factories, the first factory that answers for a URI scheme "wins", while for resource readers, the _last_ reader that answers for a URI scheme "wins".
This PR preserves that behavior, but it may be worth reconsidering this design at this time.
This behavior can be observed in practice in `ResourceReadersEvaluatorTest.\`module path\`` where `ResourceReaders.modulePath` is added after the pre-configured `ResourceReaders.classPath` and takes precedence.
HT154 added a commit to HT154/pkl that referenced this pull request Sep 15, 2024
This is preparatory work for [SPICE-0009](apple/pkl-evolution#10).
It is being contributed in a separate pull request to ease review.

This is required to allow deferred launch of external reader processes without requiring up-front configuration of scheme->process mappings.
More details on why this is required here: apple/pkl-evolution#10 (comment)

One concern here (which applies before this change as well) is that the semantics of module key factories and resource reader factories is reversed.
For module key factories, the first factory that answers for a URI scheme "wins", while for resource reader factories, the _last_ reader factory that answers for a URI scheme "wins".
This PR preserves that behavior, but it may be worth reconsidering this design at this time.

This behavior can be observed in practice in <code>ResourceReadersEvaluatorTest.\`module path\`</code> where `ResourceReaders.modulePath` is added after the pre-configured `ResourceReaders.classPath` and takes precedence.

This is a fairly large breaking API change, but in practice most clients will only need to replace a few lines in their calls to `EvaluatorBuilder`, eg.
```java
.addResourceReader(ResourceReaders.environmentVariable())
// becomes
.addResourceReaderFactory(ResourceReaderFactories.environmentVariable())
```
HT154 added a commit to HT154/pkl that referenced this pull request Sep 16, 2024
This is preparatory work for [SPICE-0009](apple/pkl-evolution#10).
It is being contributed in a separate pull request to ease review.

This is required to allow deferred launch of external reader processes without requiring up-front configuration of scheme->process mappings.
More details on why this is required here: apple/pkl-evolution#10 (comment)

One concern here (which applies before this change as well) is that the semantics of module key factories and resource reader factories is reversed.
For module key factories, the first factory that answers for a URI scheme "wins", while for resource reader factories, the _last_ reader factory that answers for a URI scheme "wins".
This PR preserves that behavior, but it may be worth reconsidering this design at this time.

This behavior can be observed in practice in <code>ResourceReadersEvaluatorTest.\`module path\`</code> where `ResourceReaders.modulePath` is added after the pre-configured `ResourceReaders.classPath` and takes precedence.

This is a fairly large breaking API change, but in practice most clients will only need to replace a few lines in their calls to `EvaluatorBuilder`, eg.
```java
.addResourceReader(ResourceReaders.environmentVariable())
// becomes
.addResourceReaderFactory(ResourceReaderFactories.environmentVariable())
```
HT154 added a commit to HT154/pkl-go that referenced this pull request Sep 23, 2024
[SPICE-0009](apple/pkl-evolution#10)

* Add `EvaluatorOptions.ExternalModuleReaders` and `EvaluatorOptions.ExternalResourceReaders`.
* Add `ExternalReaderRuntime` to host the child process side of the external reader workflow.
HT154 added a commit to HT154/pkl-go that referenced this pull request Sep 23, 2024
[SPICE-0009](apple/pkl-evolution#10)

* Add `EvaluatorOptions.ExternalModuleReaders` and `EvaluatorOptions.ExternalResourceReaders`.
* Add `ExternalReaderRuntime` to host the child process side of the external reader workflow.
HT154 added a commit to HT154/pkl that referenced this pull request Sep 23, 2024
HT154 added a commit to HT154/pkl-go that referenced this pull request Sep 23, 2024
[SPICE-0009](apple/pkl-evolution#10)

* Add `EvaluatorOptions.ExternalModuleReaders` and `EvaluatorOptions.ExternalResourceReaders`.
* Add `ExternalReaderRuntime` to host the child process side of the external reader workflow.
HT154 added a commit to HT154/pkl-go that referenced this pull request Sep 24, 2024
[SPICE-0009](apple/pkl-evolution#10)

* Add `EvaluatorOptions.ExternalModuleReaders` and `EvaluatorOptions.ExternalResourceReaders`.
* Add `ExternalReaderRuntime` to host the child process side of the external reader workflow.
HT154 added a commit to HT154/pkl-go that referenced this pull request Sep 24, 2024
[SPICE-0009](apple/pkl-evolution#10)

* Add `EvaluatorOptions.ExternalModuleReaders` and `EvaluatorOptions.ExternalResourceReaders`.
* Add `ExternalReaderRuntime` to host the child process side of the external reader workflow.
HT154 added a commit to HT154/pkl-go that referenced this pull request Sep 24, 2024
[SPICE-0009](apple/pkl-evolution#10)

* Add `EvaluatorOptions.ExternalModuleReaders` and `EvaluatorOptions.ExternalResourceReaders`.
* Add `ExternalReaderRuntime` to host the child process side of the external reader workflow.
Comment on lines 169 to 180
externalModuleReaders: Mapping<String, ExternalReader>?

externalResourceReaders: Mapping<String, ExternalReader>?

class ExternalReader {
/// May be specified as an absolute path to an executable
/// May also be specified as just an executable name, in which case it will be resolved according to the PATH environment variable
executable: String

/// Command line arguments that will be passed to the reader process
arguments: Listing<String>
}
Copy link
Contributor

@bioball bioball Sep 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think let's be consistent with clientModuleReaders and clientResourceReaders, and make each of these a List<ExternalReader>, where ExternalReader is:

class ExternalReader {
  executable: String

  arguments: Listing<String>

  scheme: String
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this just be for CreateEvaluatorRequest? Or would it apply to pkl:EvaluatorSettings as well? I've got some usability concerns about the latter (mostly around easy reuse of reader configs for differ schemes or between resources/modules).

For CreateEvaluatorRequest specifically: what should the precedence order be? Should the first reader with the listed scheme win or the last? Twist: this behavior is currently different between clientModuleReaders (last wins) and clientResourceReaders (first wins), similar to how it works with EvaluatorBuilder.

The arguments are passed as `=`-delimited key-value pairs where the key is the reader's URI scheme.
The argument values may be passed as space-separated strings where the first element becomes `executable` and any remainder becomes `arguments`.

TBD: It might be best if the argument value is link:https://docs.python.org/3/library/shlex.html#shlex.split[shlex'd] instead of split to support passing arguments to the reader process that contain spaces.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this look like?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's say that an external reader is configured like so:

--external-resource myScheme='myExecutable arg1 "arg2 with spaces"'

A naive white space trim+split yields

executable = "myExecutable"
arguments {
  "arg1"
  "\"arg2"
  "with"
  "spaces\""
}

While a "shlex" over the input behaves much more similarly to what a user might expect and results in

arguments {
  "arg1"
  "arg2 with spaces"
}

So without the shlex operator for the CLI args it's actually not possible to express the same configs on the CLI as it is via PklProject.

requestId: Int

/// The scheme of the resource to discover the spec for
/// The scheme of the resource to initialize.
scheme: String
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reader should already know what schemes it should read when it receives a Read*Request. Do we need scheme here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The initialize messages only request the spec for a single scheme at a time. It would be possible to do an "all at once" sort of thing like the original proposal, but this method is actually super convenient from an implementation standpoint as it enables using the battle-tested ConcurrentHashMap response memoization already widely employed by the message passing API. My initial pass at the implementation that did the "all at once" discovery needed some messy locking; it would be great to pass the buck to the standard library instead.

HT154 added a commit to HT154/pkl that referenced this pull request Sep 24, 2024
HT154 added a commit to HT154/pkl-go that referenced this pull request Sep 24, 2024
[SPICE-0009](apple/pkl-evolution#10)

* Add `EvaluatorOptions.ExternalModuleReaders` and `EvaluatorOptions.ExternalResourceReaders`.
* Add `ExternalReaderRuntime` to host the child process side of the external reader workflow.
HT154 added a commit to HT154/pkl-swift that referenced this pull request Sep 25, 2024
HT154 added a commit to HT154/pkl-swift that referenced this pull request Sep 25, 2024
HT154 added a commit to HT154/pkl-swift that referenced this pull request Sep 25, 2024
HT154 added a commit to HT154/pkl-swift that referenced this pull request Sep 25, 2024
HT154 added a commit to HT154/pkl that referenced this pull request Oct 3, 2024
This is preparatory work for [SPICE-0009](apple/pkl-evolution#10).
It is being contributed in a separate pull request to ease review.

The Message, Message(En|De)coder, and MessageTransport types have been ported to Java and moved to a new `org.pkl.core.messaging` package.
HT154 added a commit to HT154/pkl that referenced this pull request Oct 3, 2024
HT154 added a commit to HT154/pkl that referenced this pull request Oct 7, 2024
This is preparatory work for [SPICE-0009](apple/pkl-evolution#10).
It is being contributed in a separate pull request to ease review.

The Message, Message(En|De)coder, and MessageTransport types have been ported to Java and moved to a new `org.pkl.core.messaging` package.
HT154 added a commit to HT154/pkl that referenced this pull request Oct 7, 2024
HT154 added a commit to HT154/pkl-go that referenced this pull request Oct 10, 2024
[SPICE-0009](apple/pkl-evolution#10)

* Add `EvaluatorOptions.ExternalModuleReaders` and `EvaluatorOptions.ExternalResourceReaders`.
* Add `ExternalReaderRuntime` to host the child process side of the external reader workflow.
HT154 added a commit to HT154/pkl-swift that referenced this pull request Oct 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants