Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add approximate filters #607

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions osbenchmark/utils/dataset.py
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,7 @@ class Context(Enum):
MAX_DISTANCE_NEIGHBORS = 4
finnroblin marked this conversation as resolved.
Show resolved Hide resolved
MIN_SCORE_NEIGHBORS = 5
PARENTS = 6
ATTRIBUTES = 7


class DataSet(ABC):
Expand Down Expand Up @@ -133,6 +134,7 @@ def size(self):
def reset(self):
self.current = self.BEGINNING

# pylint: disable=R0911
@staticmethod
def parse_context(context: Context) -> str:
if context == Context.NEIGHBORS:
Expand All @@ -152,6 +154,9 @@ def parse_context(context: Context) -> str:
if context == Context.MIN_SCORE_NEIGHBORS:
return "min_score_neighbors"

if context == Context.ATTRIBUTES:
return "attributes"

raise Exception("Unsupported context")


Expand Down
16 changes: 15 additions & 1 deletion osbenchmark/utils/parse.py
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,7 @@ def parse_string_parameter(key: str, params: dict, default: str = None) -> str:

def parse_int_parameter(key: str, params: dict, default: int = None) -> int:
if key not in params:
if default:
if default is not None:
Copy link
Contributor Author

@finnroblin finnroblin Aug 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is technically not related to this PR, but the previous if statement throws a ConfigurationError if default == 0. All unit tests pass in both configurations.

return default
raise ConfigurationError(
"Value cannot be None for param {}".format(key)
Expand All @@ -46,3 +46,17 @@ def parse_float_parameter(key: str, params: dict, default: float = None) -> floa
return params[key]

raise ConfigurationError("Value must be a float for param {}".format(key))


def parse_bool_parameter(key: str, params: dict, default: bool = None) -> bool:
if key not in params:
if default is not None:
return default
raise ConfigurationError(
"Value cannot be None for param {}".format(key)
)

if isinstance(params[key], bool):
return params[key]

raise ConfigurationError("Value must be a bool for param {}".format(key))
26 changes: 25 additions & 1 deletion osbenchmark/worker_coordinator/runner.py
Original file line number Diff line number Diff line change
Expand Up @@ -1254,6 +1254,17 @@ def _get_field_value(content, field_name):
return _get_field_value(content["_source"], field_name)
return None

def binary_search_for_last_negative_1(neighbors):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather use a Python utility here but I couldn't find one in the standard library. Any suggestions?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Im not aware of anything. Also, given that the list may not be sorted, it might be harder to do. That being said, I think this is good.

low = 0
high = len(neighbors)
while low < high:
mid = (low + high) // 2
if neighbors[mid] == "-1":
high = mid
else:
low = mid + 1
return low - 1

def calculate_topk_search_recall(predictions, neighbors, top_k):
"""
Calculates the recall by comparing top_k neighbors with predictions.
Expand All @@ -1270,7 +1281,20 @@ def calculate_topk_search_recall(predictions, neighbors, top_k):
self.logger.info("No neighbors are provided for recall calculation")
return 0.0
min_num_of_results = min(top_k, len(neighbors))
last_neighbor_is_negative_1 = int(neighbors[min_num_of_results-1]) == -1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic is necessary when the dataset is first generated and then neighbors are filtered out. This occurs in some of the perf tools datasets like here. These datasets first generate true neighbors for the test vector, such as [5, 2, 3], and then post-filter the neighbors with attributes that do not pass the filter. The documents without the attributes are replaced by -1. However looking more closely at the dataset generation code I'm not sure that all of the -1 entries are at the end of the array. @martin-gaievski I see you wrote the filtering code, could you please clarify? Thanks

Copy link
Member

@martin-gaievski martin-gaievski Aug 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they should be at the end of collection of nearest neighbors for each query, although I'm not clearly remember sorting them. Can you share example where -1 are in the row, but they are mixed with some other ids?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see any in testing, but wanted to ask since I couldn't find a sort. I took another look at the code for filtering neighbors and it looks like the initial array with all -1 entries is overwritten from left to right, so any remaining -1 ids are at the end. I don't think we'll use these filter datasets anyways (since Heemin's nested fields script generates k true neighbors after filtering instead of generating k true neighbors and then post-filtering to have <k neighbors), but I'll call out the assumption that all -1 are at the end in a comment.

truth_set = neighbors[:min_num_of_results]
if last_neighbor_is_negative_1:
self.logger.debug("Last neighbor is -1")
last_neighbor_idx = binary_search_for_last_negative_1(truth_set)

# Note: we do - 1 since list indexing is inclusive, and we want to ignore the first '-1' in neighbors.
truth_set = truth_set[:last_neighbor_idx-1]
if not truth_set:
self.logger.info("No true neighbors after filtering, returning recall = 1.\n"
"Total neighbors in prediction: [%d].", len(predictions))
return 1.0


for j in range(min_num_of_results):
if j >= len(predictions):
self.logger.info("No more neighbors in prediction to compare against ground truth.\n"
Expand All @@ -1280,7 +1304,7 @@ def calculate_topk_search_recall(predictions, neighbors, top_k):
if predictions[j] in truth_set:
correct += 1.0

return correct / min_num_of_results
return correct / len(truth_set)

def calculate_radial_search_recall(predictions, neighbors, enable_top_1_recall=False):
"""
Expand Down
Loading
Loading