Skip to content
This repository has been archived by the owner on Mar 19, 2024. It is now read-only.

ngram hash function relies on compiler specific behavior #553

Closed
leezu opened this issue Jun 21, 2018 · 2 comments
Closed

ngram hash function relies on compiler specific behavior #553

leezu opened this issue Jun 21, 2018 · 2 comments

Comments

@leezu
Copy link

leezu commented Jun 21, 2018

fastText uses std::string to represent text data, which is an alias for std::basic_string<char>. It is compiler defined if char is signed or unsigned. However, the result of

uint32_t Dictionary::hash(const std::string& str) const {
  uint32_t h = 2166136261;
  for (size_t i = 0; i < str.size(); i++) {
    h = h ^ uint32_t(str[i]);
    h = h * 16777619;
  }
  return h;
}

depends on if char is signed or unsigned.

To guarantee portability of binary fasttext models, std::basic_string<signed char> should be used throughout the library. This should match the current behavior on Linux and OS X (possibly more), but would change the behavior on Android systems for which char defaults apparently often to unsigned char. Ie. if anyone was using a pretrained binary fasttext model there it was giving wrong results.

@EdouardGrave
Copy link
Contributor

Hi @leezu,

Thank you for reporting this issue.

You are right, and we will fix this issue by the end of the week.

Best,
Edouard.

@leezu
Copy link
Author

leezu commented Aug 9, 2018

Thanks @EdouardGrave for fixing. I'm closing the issue. For reference, the commit containing the fix is 9c9b9b2

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants