feat: add augurs-clustering crate with DBSCAN algorithm #100

sd2k · 2024-07-19T13:27:28Z

This PR adds a new crate, augurs-clustering, which provides time series clustering using the DBSCAN algorithm.

Summary by CodeRabbit

New Features
- Introduced a DBSCAN clustering algorithm with documentation and benchmarks.
- Added a new module for clustering, providing Python bindings for the DBSCAN algorithm and flexibility in input formats.
Documentation
- Updated README with information about the new augurs-clustering module and its functionality.
- Added CHANGELOG for tracking changes in the augurs-clustering crate.
Chores
- Simplified npm publishing process by removing unnecessary tasks related to the Grafana Labs registry.

github-actions · 2024-07-19T14:15:01Z

Bencher

Report	Wed, August 21, 2024 at 19:35:57 UTC
Project	augurs
Branch	100/merge
Testbed	ubuntu-latest

Benchmark	Latency	Latency Results nanoseconds (ns) | (Δ%)	Latency Upper Boundary nanoseconds (ns) | (%)
auto_fit/air_passengers	✅ (view plot)	1,893,600.00 (-0.91%)	1,965,151.35 (96.36%)
dbscan	✅ (view plot)	1,659,300.00 (+17.75%)	1,796,559.48 (92.36%)
distance_euclidean/None	✅ (view plot)	202,120.00 (-0.01%)	202,704.63 (99.71%)
distance_euclidean/Some(10)	✅ (view plot)	15,607.00 (-2.05%)	17,127.70 (91.12%)
distance_euclidean/Some(2)	✅ (view plot)	3,587.10 (+2.41%)	3,588.48 (99.96%)
distance_euclidean/Some(20)	✅ (view plot)	31,401.00 (-0.39%)	31,951.78 (98.28%)
distance_euclidean/Some(5)	✅ (view plot)	7,783.00 (-0.57%)	7,920.91 (98.26%)
distance_euclidean/Some(50)	✅ (view plot)	75,451.00 (-0.20%)	76,535.21 (98.58%)
distance_matrix_euclidean/window: Some(10), parallelize: false	✅ (view plot)	2,994,900,000.00 (-0.21%)	3,016,233,475.48 (99.29%)
distance_matrix_euclidean/window: Some(10), parallelize: true	✅ (view plot)	2,994,900,000.00 (+31.90%)	3,760,216,959.23 (79.65%)
distance_matrix_euclidean/window: Some(2), parallelize: false	✅ (view plot)	537,280,000.00 (+0.30%)	543,438,068.04 (98.87%)
distance_matrix_euclidean/window: Some(2), parallelize: true	✅ (view plot)	536,900,000.00 (+27.25%)	651,343,075.99 (82.43%)
fit/air_passengers	✅ (view plot)	423,880.00 (-2.18%)	448,634.01 (94.48%)
forecast/air_passengers	✅ (view plot)	1,360.30 (-2.53%)	1,466.70 (92.75%)
season_eight	✅ (view plot)	22,023.00 (-0.48%)	22,858.72 (96.34%)
vic_elec	✅ (view plot)	39,214,000.00 (+0.62%)	39,905,454.04 (98.27%)
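
The last column of the table appears to express each measured latency as a percentage of its upper boundary; that reading is an assumption inferred from the numbers, not something the report states. A minimal Rust sketch of the arithmetic, using the `dbscan` row (the function name `percent_of_boundary` is hypothetical):

```rust
/// Measured latency as a percentage of its upper boundary.
/// Assumption: this is how the report's final column is derived.
fn percent_of_boundary(latency_ns: f64, upper_boundary_ns: f64) -> f64 {
    latency_ns / upper_boundary_ns * 100.0
}

fn main() {
    // Values copied from the `dbscan` row of the report.
    let pct = percent_of_boundary(1_659_300.0, 1_796_559.48);
    println!("dbscan latency is {pct:.2}% of its upper boundary");
}
```

Under this reading, a value close to 100% means the benchmark is near the alert threshold even though it still passes.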

coderabbitai · 2024-07-19T19:58:16Z

Warning

Rate limit exceeded

@sd2k has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 18 minutes and 39 seconds before requesting another review.

Commits

Files that changed from the base of the PR and between e1cf67e and 6c4ffa3.

Walkthrough

This update enhances the project's functionality by introducing a new augurs-clustering crate, which implements the DBSCAN clustering algorithm for time series analysis. The changes include updates to documentation, benchmarks, and various Rust modules, as well as integration with Python bindings. These improvements streamline the benchmarking process and expand the library's capabilities, fostering a more robust framework for data analysis.

Changes

Files	Change Summary
`.github/workflows/run_benchmarks.yaml`	Modified benchmarking command to include `--all-features`.
`Cargo.toml` (for multiple crates)	Added `augurs-clustering` as a new dependency in the workspace.
`README.md`	Updated with new entry for `augurs-clustering`, detailing its purpose and status.
`crates/augurs-clustering/*`	Introduced new crate with various files, including `CHANGELOG.md`, implementation of DBSCAN, and benchmarks.
`crates/augurs-js/*`	Added clustering module and bindings for the DBSCAN algorithm.
`crates/pyaugurs/*`	Introduced Python bindings for clustering algorithms, including new `Dbscan` class.
`justfile`	Removed npm-related tasks for Grafana Labs registry, simplifying the npm publishing process.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Clustering
    participant DBSCAN
    participant DistanceMatrix

    User->>Clustering: Create Dbscan instance
    Clustering->>DBSCAN: Initialize with parameters
    User->>Clustering: Call fit with DistanceMatrix
    Clustering->>DistanceMatrix: Process input distance matrix
    DBSCAN->>DistanceMatrix: Perform clustering
    DBSCAN-->>Clustering: Return cluster assignments
    Clustering-->>User: Output cluster results
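
The flow in the diagram (construct with parameters, then fit on a precomputed distance matrix) can be illustrated with a small standalone sketch. This is not the crate's code: the free function `dbscan`, the helper `distance_matrix_1d`, the `Vec<Vec<f64>>` matrix representation, and the use of `-1` for noise are all assumptions made for illustration. Only the parameter names `epsilon` and `min_cluster_size` come from the review.

```rust
/// Builds a dense distance matrix for points on a line (illustrative helper).
fn distance_matrix_1d(points: &[f64]) -> Vec<Vec<f64>> {
    points
        .iter()
        .map(|a| points.iter().map(|b| (a - b).abs()).collect())
        .collect()
}

/// Textbook DBSCAN over a precomputed distance matrix.
/// Returns one label per point: `-1` for noise, `0..` for cluster ids
/// (this label encoding is an assumption of the sketch).
fn dbscan(dist: &[Vec<f64>], epsilon: f64, min_cluster_size: usize) -> Vec<isize> {
    let n = dist.len();
    let mut labels = vec![-1isize; n];
    let mut visited = vec![false; n];
    let mut cluster = 0isize;

    // All points within `epsilon` of point `i`, including `i` itself.
    let neighbours =
        |i: usize| -> Vec<usize> { (0..n).filter(|&j| dist[i][j] <= epsilon).collect() };

    for i in 0..n {
        if visited[i] {
            continue;
        }
        visited[i] = true;
        let seeds = neighbours(i);
        if seeds.len() < min_cluster_size {
            // Not a core point: stays noise unless a later cluster reaches it.
            continue;
        }
        labels[i] = cluster;
        let mut queue = seeds;
        while let Some(j) = queue.pop() {
            if labels[j] == -1 {
                // Unlabelled (possibly earlier noise) points join as members.
                labels[j] = cluster;
            }
            if visited[j] {
                continue;
            }
            visited[j] = true;
            let reachable = neighbours(j);
            if reachable.len() >= min_cluster_size {
                // `j` is itself a core point, so the cluster keeps expanding.
                queue.extend(reachable);
            }
        }
        cluster += 1;
    }
    labels
}

fn main() {
    let dist = distance_matrix_1d(&[0.0, 1.0, 2.0, 10.0, 11.0, 50.0]);
    let labels = dbscan(&dist, 1.5, 2);
    println!("{labels:?}");
}
```

Working directly on the distance matrix means the caller decides the metric (for time series, typically DTW) and the clustering step never needs the raw series.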

Poem

🐰 In the meadow where data flows,
A rabbit hops where the clustering grows.
With DBSCAN now in our paws,
We find the clusters without a pause.
Hooray for changes, let’s take a chance,
To analyze data, and make it dance! 🌼

coderabbitai

Actionable comments posted: 1

Outside diff range, codebase verification and nitpick comments (5)

crates/augurs-clustering/CHANGELOG.md (1)
9-9: Consider using "Others" instead of "Other".

The LanguageTool suggests that the plural noun "others" might fit better in this context.
 ### Other
- Add `augurs-clustering` crate
+ - Add `augurs-clustering` crate
Tools

LanguageTool

[misspelling] ~9-~9: It seems that the plural noun “others” fits better in this context.
Context: ...pec/v2.0.0.html). ## [Unreleased] ### Other - Add augurs-clustering crate

(OTHER_OTHERS)
crates/augurs-clustering/benches/dbscan.rs (1)
18-19: Consider parameterizing the DBSCAN parameters.

The parameters 10.0 and 3 are hardcoded. Consider parameterizing them to allow flexibility in benchmarking different configurations.
let eps = 10.0;
let min_points = 3;
Dbscan::new(eps, min_points).fit(&distance_matrix);
crates/augurs-clustering/README.md (3)
4-4: Add a comma for clarity.

Consider adding a comma after "time series" for better readability.

Use this diff to improve the sentence:
 This crate contains algorithms for clustering time series.
-So far only DBSCAN is implemented, and the distance matrix must be passed directly.
+So far, only DBSCAN is implemented, and the distance matrix must be passed directly.
Tools

LanguageTool

[typographical] ~4-~4: It seems that a comma is missing.
Context: ...algorithms for clustering time series. So far only DBSCAN is implemented, and the dis...

(SO_COMMA)

30-30: Correct the phrase for clarity.

The phrase "based heavily on to the implementation" should be corrected to "based heavily on the implementation."

Use this diff to correct the phrase:
 This implementation based heavily on to the implementation in [`linfa-clustering`] and [`scikit-learn`].
-This implementation based heavily on to the implementation in [`linfa-clustering`] and [`scikit-learn`].
+This implementation is based heavily on the implementation in [`linfa-clustering`] and [`scikit-learn`].
Tools

LanguageTool

[uncategorized] ~30-~30: “to the” seems less likely than “the”.
Context: ...s This implementation based heavily on to the implementation in [linfa-clustering] ...

(AI_HYDRA_LEO_CP_TO_THE_THE)

31-31: Correct the verb agreement.

The verb "is" should be changed to "are" to match the plural subject "these."

Use this diff to correct the verb agreement:
 The main difference between these is that we operate directly on the distance matrix rather than calculating
-The main difference between these is that we operate directly on the distance matrix rather than calculating
+The main difference between these are that we operate directly on the distance matrix rather than calculating
Tools

LanguageTool

[grammar] ~31-~31: The verb ‘is’ is singular. Did you mean: “this is” or “these are”?
Context: ...it-learn`]. The main difference between these is that we operate directly on the distanc...

(SINGULAR_VERB_AFTER_THESE_OR_THOSE)

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between d7a3dbb and e1cf67e.

Files ignored due to path filters (1)

crates/augurs-clustering/data/dist.csv is excluded by !**/*.csv

Files selected for processing (18)

.github/workflows/run_benchmarks.yaml (1 hunks)
Cargo.toml (1 hunks)
README.md (2 hunks)
crates/augurs-clustering/CHANGELOG.md (1 hunks)
crates/augurs-clustering/Cargo.toml (1 hunks)
crates/augurs-clustering/LICENSE-APACHE (1 hunks)
crates/augurs-clustering/LICENSE-MIT (1 hunks)
crates/augurs-clustering/README.md (1 hunks)
crates/augurs-clustering/benches/dbscan.rs (1 hunks)
crates/augurs-clustering/src/lib.rs (1 hunks)
crates/augurs-js/Cargo.toml (1 hunks)
crates/augurs-js/src/clustering.rs (1 hunks)
crates/augurs-js/src/dtw.rs (1 hunks)
crates/augurs-js/src/lib.rs (1 hunks)
crates/pyaugurs/Cargo.toml (1 hunks)
crates/pyaugurs/src/clustering.rs (1 hunks)
crates/pyaugurs/src/lib.rs (2 hunks)
justfile (1 hunks)

Files skipped from review due to trivial changes (4)

crates/augurs-clustering/Cargo.toml
crates/augurs-clustering/LICENSE-APACHE
crates/augurs-clustering/LICENSE-MIT
justfile

Additional comments not posted (21)

crates/pyaugurs/Cargo.toml (1)

19-19: Dependency addition is consistent with workspace structure.

The addition of augurs-clustering.workspace = true aligns with the integration of the new crate and the workspace configuration.

crates/augurs-js/Cargo.toml (1)

22-22: Addition of augurs-clustering dependency approved.

The inclusion of augurs-clustering as a dependency aligns with the PR objectives of integrating clustering functionality into the project.
crates/augurs-js/src/clustering.rs (1)
1-49: Implementation of DBSCAN bindings approved.

The implementation of the DBSCAN algorithm with JavaScript bindings is well-structured and aligns with the PR objectives. The use of serde, tsify_next, and wasm_bindgen is appropriate for handling data and exposing Rust functionality to JavaScript.

However, ensure that the DistanceMatrix class is correctly implemented and compatible with the DBSCAN algorithm.

Run the following script to verify the implementation of DistanceMatrix:

Verification successful

DistanceMatrix Implementation Verified

The DistanceMatrix struct is correctly implemented and compatible with the DBSCAN algorithm. It provides necessary methods and is used consistently across the codebase.

DistanceMatrix is defined in crates/augurs-js/src/dtw.rs and wraps augurs_core::DistanceMatrix.

Methods like inner() ensure compatibility with DBSCAN.
Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the implementation of `DistanceMatrix`.

# Test: Search for the implementation of `DistanceMatrix`. Expect: Correct implementation and compatibility with DBSCAN.
ast-grep --lang rust --pattern 'struct DistanceMatrix { $$$ }'
Length of output: 64

Script:
#!/bin/bash
# Search for the definition of the `DistanceMatrix` struct in the codebase.
ast-grep --lang rust --pattern 'struct DistanceMatrix { $$$ }'

# Search for any implementations of methods or traits for `DistanceMatrix`.
ast-grep --lang rust --pattern 'impl DistanceMatrix { $$$ }'

# Search for any usages of `DistanceMatrix` to understand its context and compatibility.
rg 'DistanceMatrix' -A 5
Length of output: 23945
Cargo.toml (1)

25-25: Dependency Addition Approved.

The addition of augurs-clustering as a dependency aligns with the PR objectives and enhances the project's functionality.

.github/workflows/run_benchmarks.yaml (1)

40-40: Benchmark Command Enhancement Approved.

The inclusion of --all-features in the benchmarking command is a beneficial change, ensuring a comprehensive performance assessment.

crates/augurs-js/src/lib.rs (1)

17-17: New Module Addition Approved.

The addition of the clustering module expands the library's capabilities and aligns with the PR objectives.
crates/pyaugurs/src/clustering.rs (3)
11-19: LGTM: Flexible input representation.

The InputDistanceMatrix enum provides a flexible way to represent distance matrices, supporting lists, numpy arrays, and augurs core distance matrices.

21-41: LGTM: Robust conversion implementation.

The TryFrom implementation effectively converts different input types into an augurs_core::DistanceMatrix, with proper error handling.
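
The pattern described in the two comments above, an enum of accepted input shapes plus a fallible conversion into one validated type, might look roughly like the following in plain Rust. Every name here (`Input`, `SquareMatrix`, `NotSquare`) is hypothetical; the real bindings use PyO3 types and `augurs_core::DistanceMatrix`, whose API is not reproduced.

```rust
use std::convert::TryFrom;
use std::fmt;

/// Hypothetical validated matrix type (stand-in for the core type).
#[derive(Debug, PartialEq)]
struct SquareMatrix(Vec<Vec<f64>>);

/// Error returned when a row's length differs from the number of rows.
#[derive(Debug, PartialEq)]
struct NotSquare {
    row: usize,
    expected: usize,
    found: usize,
}

impl fmt::Display for NotSquare {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "row {} has {} entries, expected {}",
            self.row, self.found, self.expected
        )
    }
}

impl std::error::Error for NotSquare {}

/// Hypothetical input enum: either raw nested lists or an already
/// validated matrix.
enum Input {
    Nested(Vec<Vec<f64>>),
    Validated(SquareMatrix),
}

impl TryFrom<Input> for SquareMatrix {
    type Error = NotSquare;

    fn try_from(input: Input) -> Result<Self, Self::Error> {
        match input {
            // Already validated: pass through unchanged.
            Input::Validated(matrix) => Ok(matrix),
            // Raw rows: every row must be as long as the row count.
            Input::Nested(rows) => {
                let n = rows.len();
                for (i, row) in rows.iter().enumerate() {
                    if row.len() != n {
                        return Err(NotSquare { row: i, expected: n, found: row.len() });
                    }
                }
                Ok(SquareMatrix(rows))
            }
        }
    }
}

fn main() {
    let ok = SquareMatrix::try_from(Input::Nested(vec![vec![0.0, 1.0], vec![1.0, 0.0]]));
    println!("{ok:?}");

    let passthrough = SquareMatrix::try_from(Input::Validated(SquareMatrix(vec![vec![0.0]])));
    println!("{passthrough:?}");

    if let Err(e) = SquareMatrix::try_from(Input::Nested(vec![vec![0.0, 1.0], vec![1.0]])) {
        println!("rejected: {e}");
    }
}
```

Funnelling every input shape through a single `TryFrom` keeps validation in one place, so the clustering code only ever sees a well-formed matrix.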

50-92: LGTM: Well-structured Dbscan class.

The Dbscan class is well-implemented, providing clear methods for initialization and clustering. Ensure that the integration with the rest of the codebase is verified.

Run the following script to verify the integration:

Verification successful

Dbscan class is well-integrated across the codebase.

The Dbscan class is utilized in various modules, including tests and benchmarks, and is part of both Python and JavaScript bindings. This indicates that it is effectively integrated and its functionality is being verified across different environments.

Locations:

crates/augurs-clustering/src/lib.rs: Implementation and tests.

crates/pyaugurs/src/clustering.rs: Python bindings.

crates/augurs-js/src/clustering.rs: JavaScript bindings.
Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the integration of the `Dbscan` class in the codebase.

# Test: Search for the usage of the `Dbscan` class. Expect: Proper integration and usage.
rg --type rust -A 5 $'Dbscan'
Length of output: 11195
crates/pyaugurs/src/lib.rs (2)

17-17: LGTM: New clustering module added.

The clustering module has been successfully added, enhancing the library's functionality.

117-117: LGTM: Dbscan class added to Python module.

The Dbscan class is correctly added to the Python module, expanding the library's capabilities in clustering.

README.md (1)

24-24: LGTM: Documentation for augurs-clustering added.

The README update clearly describes the new augurs-clustering module, enhancing the project's documentation.

crates/augurs-js/src/dtw.rs (4)

83-84: Change to inner field type is appropriate.

The change from Vec<Vec<f64>> to augurs_core::DistanceMatrix likely enhances performance or functionality.

86-89: Addition of inner method is appropriate.

This method provides necessary encapsulation for accessing the underlying augurs_core::DistanceMatrix.

93-94: Simplification of from method is appropriate.

Directly assigning the inner field simplifies the conversion process.

100-100: Update to from method is appropriate.

Calling into_inner() on the inner field reflects the new structure and ensures proper conversion.
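
The four dtw.rs comments above describe a wrapper that holds the core matrix type directly, exposes it through `inner()`, and converts in both directions with `From`. A minimal sketch of that shape, using hypothetical stand-in types (`CoreMatrix` for the core type and `Wrapper` for the binding-side struct), since the real signatures are not shown in this review:

```rust
/// Hypothetical stand-in for the core library's distance matrix.
#[derive(Debug, Clone, PartialEq)]
struct CoreMatrix {
    rows: Vec<Vec<f64>>,
}

impl CoreMatrix {
    /// Consumes the matrix and returns the underlying rows.
    fn into_inner(self) -> Vec<Vec<f64>> {
        self.rows
    }
}

/// Hypothetical binding-side wrapper that stores the core type directly
/// instead of a raw `Vec<Vec<f64>>`.
#[derive(Debug, Clone)]
struct Wrapper {
    inner: CoreMatrix,
}

impl Wrapper {
    /// Borrows the wrapped core matrix, e.g. to pass it to a clustering call.
    fn inner(&self) -> &CoreMatrix {
        &self.inner
    }
}

impl From<CoreMatrix> for Wrapper {
    // Wrapping is a plain field assignment; no copying or reshaping.
    fn from(inner: CoreMatrix) -> Self {
        Self { inner }
    }
}

impl From<Wrapper> for Vec<Vec<f64>> {
    // Unwrapping delegates to the core type's own `into_inner`.
    fn from(wrapper: Wrapper) -> Self {
        wrapper.inner.into_inner()
    }
}

fn main() {
    let core = CoreMatrix { rows: vec![vec![0.0, 2.0], vec![2.0, 0.0]] };
    let wrapper = Wrapper::from(core);
    println!("borrowed: {:?}", wrapper.inner());
    let rows: Vec<Vec<f64>> = wrapper.into();
    println!("unwrapped: {rows:?}");
}
```

Holding the core type means other bindings can borrow it via `inner()` without converting back and forth through nested vectors.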

crates/augurs-clustering/src/lib.rs (5)

13-18: Definition of Dbscan struct is appropriate.

The fields epsilon and min_cluster_size are well-defined and relevant for the DBSCAN algorithm.

20-33: Initialization method new is appropriate.

The method correctly initializes the Dbscan struct with the provided parameters.

47-99: Implementation of fit method is robust.

The method effectively implements the DBSCAN clustering algorithm, handling clustering and noise identification.

101-111: Implementation of find_neighbours method is efficient.

The method efficiently identifies neighbors within the specified epsilon distance.

114-192: Test module is comprehensive.

The tests cover various scenarios for the DBSCAN algorithm, ensuring robustness.

sd2k changed the title ~~clustering~~ feat: add augurs-clustering crate with DBSCAN algorithm Jul 19, 2024

sd2k changed the base branch from main to dtw July 19, 2024 13:34

sd2k force-pushed the clustering branch from 284bd06 to cb07e12 Compare July 19, 2024 14:04

sd2k force-pushed the dtw branch from aa6f9fd to 6e4bd7f Compare July 19, 2024 20:20

sd2k force-pushed the clustering branch from 4a78102 to 74e2df6 Compare July 19, 2024 20:20

sd2k force-pushed the dtw branch from 6e4bd7f to 39509cf Compare July 30, 2024 15:32

sd2k force-pushed the clustering branch 2 times, most recently from a030c55 to c39d876 Compare August 4, 2024 02:34

Base automatically changed from dtw to main August 21, 2024 15:43

sd2k and others added 6 commits August 21, 2024 13:22

feat: add augurs-clustering crate with DBSCAN algorithm

0688834

Add JS bindings for clustering crate

d46965d

Fix import in benchmarks

1315d1c

Pass --all-features when running benchmarks in CI

194ecb2

Add augurs-clustering to main README

d55b3f1

Fix table in README.md

e1cf67e

sd2k force-pushed the clustering branch from def498d to e1cf67e Compare August 21, 2024 16:23

sd2k marked this pull request as ready for review August 21, 2024 18:58

coderabbitai bot reviewed Aug 21, 2024

Correct grammar in README

6c4ffa3

sd2k merged commit 6dcc641 into main Sep 4, 2024
21 checks passed

sd2k deleted the clustering branch September 4, 2024 12:11

This was referenced Sep 4, 2024

chore: release v0.4.0 #106

Closed

chore: release v0.4.0 #109

Closed

coderabbitai bot mentioned this pull request Sep 9, 2024

feat: parallel DTW calculations in augurs-js #111

Merged

sd2k mentioned this pull request Sep 22, 2024

chore: release v0.4.0 #116

Closed

coderabbitai bot mentioned this pull request Sep 24, 2024

feat: add 'augurs' convenience crate, re-exporting other crates #117

Merged

sd2k mentioned this pull request Oct 10, 2024

chore: release v0.4.0 #123

Closed

coderabbitai bot mentioned this pull request Oct 10, 2024

feat: add cmdstan-based optimizer for augurs-prophet #121

Merged

This was referenced Oct 16, 2024

chore: release v0.4.0 #128

Merged

chore: release v0.4.1 #132

Merged

chore: release v0.4.2 #133

Merged

chore: release v0.4.3 #134

Merged
