Multicolumn ANN indexes on a hypertable #134

Open
hamishc opened this issue Sep 23, 2024 · 2 comments
Labels
question Further information is requested

Comments


hamishc commented Sep 23, 2024

Hi! I want to perform ANN search on time-series data, so I'm trying to index my tables on multiple columns (the embedding column and the timestamp column) to take full advantage of Timescale hypertable functionality. I can't find any documentation on how to do this.

e.g. I would like something like

CREATE INDEX my_index ON my_hypertable USING diskann (timestamp, embedding);

It seems like pgvector only supports partial (conditional) indexes (e.g. CREATE INDEX ON items USING hnsw (embedding vector_l2_ops) WHERE (category_id = 123);), but for obvious reasons this isn't workable for time-based partitions.

It would be a major advantage for us to be able to query on long-term timeseries data, so we'd love to see this added if it's not already available. If it isn't, is this functionality possible or on the roadmap as an enhancement at some point?

Collaborator

cevian commented Sep 23, 2024

@hamishc Vector indexes cannot be multi-column right now. What you want to do instead is use time-based table partitioning with Timescale's hypertables and then create a regular diskann index on the embedding column. Query execution will then proceed roughly as follows:

  1. the query planner will exclude any chunks (partitions) that cannot have any data based on the time-based constraints in your query
  2. for each chunk that matched, the index on that chunk will get the rows with the closest vectors
  3. the executor will then filter out any rows that don't match the time filter

Step 1 makes sure most of the data that falls outside the time constraints is thrown away quickly. Step 2 uses the full power of the vector index. Step 3 does the final cleanup.
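
As a rough illustration of the setup described above, here is a minimal sketch. It assumes the timescaledb and vectorscale (pgvectorscale) extensions are installed. The table name comes from the original question; the column name `ts`, the 3-dimensional vector, the date range, and the query vector are hypothetical placeholders. The operator class on the diskann index may need to be changed or dropped depending on the pgvectorscale version and the distance metric in use.

```sql
-- Sketch only: column names, dimensions, and literal values are illustrative assumptions.
CREATE EXTENSION IF NOT EXISTS timescaledb;
CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;  -- CASCADE also installs pgvector

CREATE TABLE my_hypertable (
    ts        timestamptz NOT NULL,
    embedding vector(3)   NOT NULL  -- tiny dimension, for illustration only
);

-- Partition the table by time; skip the default time index, as in the thread.
SELECT create_hypertable('my_hypertable', 'ts', create_default_indexes => FALSE);

-- Single-column vector index on the embedding only (no time column).
CREATE INDEX my_index ON my_hypertable USING diskann (embedding vector_cosine_ops);

-- The time predicate drives chunk exclusion (step 1); the ORDER BY ... LIMIT
-- is what the per-chunk vector index can serve (step 2); the time filter is
-- re-applied to the returned rows (step 3).
SELECT ts, embedding
FROM my_hypertable
WHERE ts >= '2024-09-01' AND ts < '2024-09-08'
ORDER BY embedding <=> '[0.1, 0.2, 0.3]'
LIMIT 10;
```

Because the time filter in step 3 is applied after the index returns its nearest neighbours, a query whose time range covers only a small part of a chunk may return fewer than `LIMIT` rows.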

cevian added the question label Sep 23, 2024

Author

hamishc commented Sep 24, 2024

Oh, so hypertables don't actually need indexes on the time column in order to use the partitions? When I created the hypertable I ran it with create_default_indexes => FALSE - so I assumed any indexing had to be on both the desired column and the time column (this is what the timescale docs seem to suggest).

I've validated with the query planner that it's using the indexes and only running on the requested partition, so it's working either way! Thanks for your help!
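
A check along these lines could look like the sketch below. It assumes a hypertable named `my_hypertable` with a time column `ts` and a `vector(3)` column `embedding`; the column names and literal values are hypothetical and not taken from the thread.

```sql
-- Sketch only: assumes my_hypertable(ts timestamptz, embedding vector(3)) already exists
-- as a hypertable with a diskann index on embedding.
EXPLAIN
SELECT ts, embedding
FROM my_hypertable
WHERE ts >= '2024-09-01' AND ts < '2024-09-08'
ORDER BY embedding <=> '[0.1, 0.2, 0.3]'
LIMIT 10;
```

In the resulting plan, only chunks whose time range overlaps the `WHERE` clause should be listed, and each listed chunk should show an index scan on its copy of the vector index rather than a sequential scan.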


2 participants