
Why does it take more time to search the entire xq than to search sequentially for a subset of all the shards of the xq? #836

Closed
chenyihang1993 opened this issue May 23, 2019 · 8 comments

@chenyihang1993

There are 1B samples in xq. I get the search results in two different ways:

  1. Using an IVF65536_HNSW32,PQ64 index: train it, add the whole of xq, then search.
  2. Using the same IVF65536_HNSW32,PQ64 index type: train it, cut xq into 10 subsets (100M samples each), add each subset to its own index, search the 10 indexes separately, and finally merge the results.

The first way takes more time than the second. Please tell me the reason. THX.
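The merge step in way 2 can be sketched with plain NumPy, assuming each shard returns its own (distances, ids) arrays and the ids have already been offset to be globally unique (the function name `merge_knn_results` is a hypothetical helper, not a Faiss API):

```python
import numpy as np

def merge_knn_results(D_list, I_list, k):
    """Merge per-shard top-k results into a global top-k.
    Assumes L2-style distances (smaller is better) and globally unique ids."""
    D = np.concatenate(D_list, axis=1)    # (nq, k * num_shards) distances
    I = np.concatenate(I_list, axis=1)    # matching ids
    order = np.argsort(D, axis=1)[:, :k]  # indices of the k smallest distances per query
    rows = np.arange(D.shape[0])[:, None]
    return D[rows, order], I[rows, order]
```

For inner-product metrics the sort direction would have to be flipped, since larger scores are better there.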
@mdouze
Contributor

mdouze commented May 24, 2019

Taking case 1 as the baseline:

  • Searching a dataset of 100M that is a subset of the 1B should be faster, because the inverted lists are shorter.
  • Searching 10 datasets of 100M each should be slower, because it performs the same number of comparisons with a higher overhead.
  • Doing the same in parallel (e.g. via an IndexShards) could be faster if you are otherwise searching in a single thread.

If you are not in one of those cases, please comment.

@chenyihang1993
Author

chenyihang1993 commented May 27, 2019

The strange thing is that searching 10 datasets of 100M is faster than searching the dataset of 1B. This is my case.

@chenyihang1993
Author

Here is a demo. The length of xb is 1M, and the index is 'IVF100_HNSW32,PQ64'.

import numpy as np
import faiss
import time

d = 64        # dimension
nb = 1000000  # database size
nq = 1000     # number of queries
nt = 10000    # number of training vectors
np.random.seed(1234)  # make reproducible
xb = np.random.random((nb, d)).astype('float32')
xq = np.random.random((nq, d)).astype('float32')
xt = np.random.random((nt, d)).astype('float32')
k = 100

# Way 1: a single index over the full database
index_1 = faiss.index_factory(d, 'IVF100_HNSW32,PQ64')
index_1.train(xt)
index_1.add(xb)
faiss.ParameterSpace().set_index_parameter(index_1, 'nprobe', 64)
start = time.time()
index_1.search(xq, k)
print('time of searching: ' + str(time.time() - start))

# Reuse the trained (now emptied) index as a template for the subsets
index_1.reset()
faiss.write_index(index_1, 'base.index')

# Way 2: 10 indexes of 100k vectors each, searched sequentially
subset_len = 100000
sum_time = 0
for i in range(0, nb, subset_len):
    index = faiss.read_index('base.index')
    faiss.ParameterSpace().set_index_parameter(index, 'nprobe', 64)
    index.add(xb[i:i + subset_len])
    start = time.time()
    index.search(xq, k)
    sum_time += time.time() - start

print('time of searching separately: ' + str(sum_time))

The result is

time of searching: 18.017865657806396
time of searching separately:4.433526039123535

@mdouze
Contributor

mdouze commented May 27, 2019

Thanks for the demo. I can reproduce the issue.
It seems that this is a threading problem. When running in a single thread, the times are:

time of searching: 38.52920603752136
time of searching separately:42.64698934555054

The issue is in the inverted list scanning; the quantization time is constant.
Looking further...

@beauby beauby added bug and removed question labels May 30, 2019
@beauby
Contributor

beauby commented May 30, 2019

This will be fixed soon.

@chenyihang1993
Author

Thank you very much.

@beauby
Contributor

beauby commented Jun 7, 2019

This was fixed upstream, and the fix will be available in the next release.

@beauby beauby closed this as completed Jun 7, 2019
beauby pushed a commit that referenced this issue Jun 19, 2019
Bugfixes:
- slow scanning of inverted lists (#836).

Features:
- add basic support for 6 new metrics in CPU `IndexFlat` and `IndexHNSW` (#848);
- add support for `IndexIDMap`/`IndexIDMap2` with binary indexes (#780).

Misc:
- throw python exception for OOM (#758);
- make `DistanceComputer` available for all random access indexes;
- gradually moving from `long` to `int64_t` for portability.
@ucasiggcas

Hi, dear,
If I want to use xb = 100M or larger, how should I set the index? I see the code you set:
The length of xb is 1M, and the index is 'IVF100_HNSW32,PQ64'.

index_1 = faiss.index_factory(d, 'IVF100_HNSW32,PQ64')
index_1.train(xt)
index_1.add(xb)
faiss.ParameterSpace().set_index_parameter(index_1, 'nprobe', 64)

thx
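Not an official answer, but as a rough starting point the Faiss guidelines suggest scaling nlist with the dataset size, on the order of a few times sqrt(N) (hence index strings like 'IVF65536_HNSW32,PQ64' at the ~100M–1B scale, as in the original question). A hypothetical helper under that assumption, not a Faiss API:

```python
import math

def suggest_nlist(n):
    """Hypothetical heuristic: nlist ~ 4 * sqrt(n), rounded to a power of two.
    A starting point only; tune nlist/nprobe against your own recall/speed trade-off."""
    return 2 ** round(math.log2(4 * math.sqrt(n)))

print(suggest_nlist(10**6))  # 1M vectors
print(suggest_nlist(10**8))  # 100M vectors
```

The resulting nlist would go into the factory string, e.g. f'IVF{nlist}_HNSW32,PQ64'.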
