Curated list of models #716

Closed
juliohm opened this issue Nov 29, 2020 · 8 comments

@juliohm
Contributor

juliohm commented Nov 29, 2020

I am opening this issue to discuss the possibility of a curated list of models.

Right now, end users are forced to rely on a non-trivial macro, @load, which fails depending on the scope (local vs. global) and can be considered advanced for newcomers.

My opinion is that a curated list should be the recommended workflow where users don't need to bother installing dependencies manually:

using MLJ

# well-tested models available
m1 = DecisionTreeClassifier()
m2 = KNeighborsClassifier()
...

This curated list could be made a dependency of the umbrella package. I don't think users would complain about too many dependencies given that any modern ML pipeline nowadays runs dozens of models at least.

cc: @DilumAluthge

@DilumAluthge
Member

DilumAluthge commented Nov 29, 2020

> This curated list could be made a dependency of the umbrella package. I don't think users would complain about too many dependencies given that any modern ML pipeline nowadays runs dozens of models at least.

Can you clarify what the "umbrella package" is?

If the "umbrella package" is MLJ.jl, then I would definitely complain. I don't want ] add MLJ to install the entire kitchen sink.

What's wrong with asking users to install MLJCuratedModels.jl if they want the curated list?

@DilumAluthge
Member

> If the "umbrella package" is MLJ.jl, then I would definitely complain. I don't want ] add MLJ to install the entire kitchen sink.

For example, the ensemble functionality lives inside MLJ.jl. I would be quite annoyed if I had to install a whole bunch of unrelated packages just so I could use MLJ's ensemble functionality.

@DilumAluthge
Member

Now, on the other hand, if we first moved ALL of the functionality out of MLJ.jl into other repos, then I would have no problem adding a whole bunch of dependencies to MLJ.jl.

But as long as there is functionality in MLJ.jl that is not available in another package (MLJBase.jl, etc.), then I am opposed to adding lots of dependencies to MLJ.jl.

@DilumAluthge
Member

So I guess the two options are:

  1. Keep MLJ.jl the way it is, and put the curated list in a separate MLJCuratedModels.jl package.
  2. Move ALL of the actual features/functionality out of MLJ.jl into separate packages. Once this process is done, we can add MLJCuratedModels.jl as a dependency of MLJ.jl.

@ablaom
Member

ablaom commented Nov 30, 2020

For the record, MLJ is not intended to load any code, but still has the ensemble.jl stuff. The plan has always been to remove this. Maybe there are a few other small things too, I forget.

Also, @load has been recently improved to eliminate some possible strange behaviour. And - after JuliaAI/MLJModels.jl#244 is complete (almost there!) - @load should work from within packages for any model (only KNN models still use Requires.jl).

I very much like @DilumAluthge's proposal JuliaAI/MLJModels.jl#346 to address the beginner's problem.

@juliohm What do you think?

@ablaom
Member

ablaom commented Nov 30, 2020

Also, if you want to load a model directly (no macros), you can call load_path to find out where it lives and then import it yourself:

julia> load_path("PCA")
"MLJMultivariateStatsInterface.PCA"

julia> load_path("RandomForestRegressor")
ERROR: ArgumentError: Ambiguous model name. Use pkg=... .
The model RandomForestRegressor is provided by these packages:
 ["DecisionTree", "ScikitLearn"].

Stacktrace:
 [1] info(::String; pkg::Nothing) at /Users/anthony/.julia/packages/MLJModels/GyILf/src/model_search.jl:80
 [2] load_path(::String; pkg::Nothing) at /Users/anthony/.julia/packages/MLJModels/GyILf/src/loading.jl:32
 [3] load_path(::String) at /Users/anthony/.julia/packages/MLJModels/GyILf/src/loading.jl:32
 [4] top-level scope at REPL[16]:1

julia> load_path("RandomForestRegressor", pkg="ScikitLearn")
"MLJScikitLearnInterface.RandomForestRegressor"

julia> using MLJScikitLearnInterface

julia> import MLJScikitLearnInterface.RandomForestRegressor

julia> RandomForestRegressor()
RandomForestRegressor(
    n_estimators = 100,
    criterion = "mse",
    max_depth = nothing,
    min_samples_split = 2,
    min_samples_leaf = 1,
    min_weight_fraction_leaf = 0.0,
    max_features = "auto",
    max_leaf_nodes = nothing,
    min_impurity_decrease = 0.0,
    bootstrap = true,
    oob_score = false,
    n_jobs = nothing,
    random_state = nothing,
    verbose = 0,
    warm_start = false,
    ccp_alpha = 0.0,
    max_samples = nothing) @245

@juliohm
Contributor Author

juliohm commented Nov 30, 2020

I think my concern is twofold:

  1. We still need manual intervention to get a new model into an existing session. This could be addressed with a yes/no installation prompt triggered by @load whenever a package is missing, so that the user could just press ENTER.
  2. We have too many implementations of the same model, and the user doesn't know which one to use. This could be solved with a curated list of the "best" well-maintained, pure Julia implementations. For example, DecisionTree.jl is quite mature now, and it doesn't make much sense to load sklearn trees or tree implementations from other languages. I guess we can find similar examples where a single best pure Julia implementation could be promoted to new Julia users.

Keep in mind that a beginner just wants to load a decision tree, no matter where it comes from and no matter the internal implementation details. They just want something well-tested that works.
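
To make the first point concrete, here is a rough sketch of what such a prompt could look like. It is only an illustration and not how @load works: `accepts` and `ensure_installed` are hypothetical helper names, and the `input`/`output` keywords are my own addition so the prompt can be exercised without a terminal. The only library calls used are `Base.find_package` and `Pkg.add`.

```julia
using Pkg

# Hypothetical helper (not part of MLJ): does the reply mean "yes"?
# An empty reply, i.e. just pressing ENTER, counts as yes.
accepts(reply::AbstractString) = lowercase(strip(reply)) in ("", "y", "yes")

# Hypothetical helper (not part of MLJ): return true if `pkg` is available,
# asking for permission to install it when it is missing.
function ensure_installed(pkg::AbstractString; input::IO = stdin, output::IO = stdout)
    # Base.find_package returns `nothing` when the package cannot be
    # located in the current load path.
    Base.find_package(String(pkg)) === nothing || return true

    print(output, "Package $pkg is not installed. Install it now? [Y/n] ")
    if accepts(readline(input))
        Pkg.add(String(pkg))  # installs into the active environment
        return true
    end
    return false
end
```

A loader could call something like `ensure_installed("DecisionTree")` before importing the model code, and bail out with a clear message if it returns false.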

@juliohm
Contributor Author

juliohm commented Dec 13, 2020

> For the record, MLJ is not intended to load any code, but still has the ensemble.jl stuff. The plan has always been to remove this. Maybe there are a few other small things too, I forget.

I fully support this idea. MLJ.jl would then provide a more user-friendly installation for users who are not writing packages, but are writing ML pipelines to solve their problems with various models from a curated list. Advanced users seeking a more lightweight dependency for their own packages could use a subpackage of the MLJ.jl stack, such as MLJBase.jl or MLJModelInterface.jl, and possibly an MLJEnsemble.jl.

In summary, one must always keep in mind two types of users:

  1. Users who want to write ML pipelines with well-tested and readily available models, who don't care about a long list of dependencies in their final application or Pluto notebook.
  2. Package writers who want to interface with the MLJ stack and use a subset of the functionality found in subpackages like MLJBase.jl, but cannot afford a dependency on model packages like DecisionTree.jl.

juliohm closed this as completed Apr 2, 2024