Non-ASCII key sometimes fails to work in version 1.12.0 and dev #3397

Closed
shrektan opened this issue Feb 13, 2019 · 3 comments · Fixed by #3451

shrektan commented Feb 13, 2019

My production environment was previously using data.table 1.11.4. After upgrading to version 1.12.0 (the CRAN version), I found that non-ASCII strings sometimes cannot be matched. It was very difficult to reproduce, but I finally managed to put together this reproducible example.

Note that this only happens on Windows, and only when the keyed column is in the native encoding. Strangely, I cannot reproduce it on Windows 10, only on Windows 7 (reproduced on two computers, so it should not be specific to my machine).

Moreover, at first it only occurred when options(stringsAsFactors = FALSE) was set. However, that option is unrelated in the example code below.

I'll try to debug and fix it...

library(data.table)

x <- '借:Cash|借:损益类-交易费用|借:损益类-价差收入|借:损益类-公允价值变动损益|贷:资产类-公允价值变动|贷:资产类-成本'
v <- c(
  x, 
  "借:Cash|借:损益类-交易费用|借:损益类-价差收入|借:损益类-公允价值变动损益|贷:资产类-公允价值变动|贷:资产类-应计利息|贷:资产类-成本", 
  "借:Cash|借:损益类-利息收入|贷:资产类-成本", "借:Cash|借:损益类-利息收入|贷:资产类-成本|贷:资产类-应计利息", 
  "借:Cash|借:损益类-利息收入|贷:资产类-成本|贷:资产类-应计利息|贷:资产类-折溢价", 
  "借:Cash|借:损益类-利息收入|贷:资产类-成本|贷:资产类-应计利息|借:资产类-折溢价", 
  "借:Cash|借:损益类-利息收入|贷:资产类-应计利息", "借:Cash|借:损益类-利息收入|贷:资产类-应计利息|贷:资产类-成本", 
  "借:Cash|借:损益类-利息收入|借:权益类-资本公积|贷:资产类-公允价值变动|贷:资产类-应计利息|贷:资产类-成本"
)

tmp <- data.table(a = v, b = 1, key = 'a')
print(tmp[J(x), b])  # returns NA on 1.12.0 and dev; returns 1 on 1.11.4
print(tmp[, b[v == x]]) # always returns 1
@shrektan

Confirmed: commit e59ba14 introduced the bug.

jangorecki added this to the 1.12.2 milestone Feb 20, 2019

shrektan commented Mar 10, 2019

This is the smallest reprex I could come up with (the sort orders differ; again, it is only reproducible on Windows 7 with Chinese as the default language). Hopefully I'll have some time this week to sort this out.

library(data.table)
x <- '借:Cash|借:损益类-交易费用|借:损益类-价差收入|借:损益类-公允价值变动损益|贷:资产类-公允价值变动|贷:资产类-成本'
v <- c(x, '借:Cash|借:损益类-交易费用|借:损益类-价差收入|借:损益类-公允价值变动损益|贷:资产类-公允价值变动|贷:资产类-应计利息|贷:资产类-成本')
v <- c(v, rep('a' , 4))
data.table(a = v, b = 1, key = 'a')[, tail(a, 2)]
#> [1] "借:Cash|借:损益类-交易费用|借:损益类-价差收入|借:损益类-公允价值变动损益|贷:资产类-公允价值变动|贷:资产类-应计利息|贷:资产类-成本"
#> [2] "借:Cash|借:损益类-交易费用|借:损益类-价差收入|借:损益类-公允价值变动损益|贷:资产类-公允价值变动|贷:资产类-成本"
data.table(a = c(v, 'a'), b = 1, key = 'a')[, tail(a, 2)]
#> [1] "借:Cash|借:损益类-交易费用|借:损益类-价差收入|借:损益类-公允价值变动损益|贷:资产类-公允价值变动|贷:资产类-成本"                   
#> [2] "借:Cash|借:损益类-交易费用|借:损益类-价差收入|借:损益类-公允价值变动损益|贷:资产类-公允价值变动|贷:资产类-应计利息|贷:资产类-成本"

Update:

This smaller example only reproduces the bug when more than one thread is used. In other words, it does not occur after setDTthreads(1). The original example, however, reproduces the bug in all cases.

@shrektan

I'm pretty sure the following line causes the bug. At that point, s may not yet be UTF-8 encoded, so LENGTH(s) can differ from the length of the UTF-8 version of the string. This results in a different ustr_maxlen, which is then used in cradix_r().

if (LENGTH(s)>ustr_maxlen) ustr_maxlen=LENGTH(s);
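
To illustrate why the encoding matters for that length, here is a minimal standalone sketch. It is not data.table's code. It assumes the native encoding is GBK, where U+501F (the first character of the strings above) takes 2 bytes instead of the 3 it takes in UTF-8. `track_maxlen` and `tie_within` are hypothetical helpers: the first mimics the max-length bookkeeping, and the second stands in for a byte-wise sort that looks at no more than that many bytes.

```c
#include <stddef.h>
#include <string.h>

/* U+501F followed by one ASCII letter, with the bytes written out by hand.
   Assumption: native encoding is GBK, where U+501F is 0xBD 0xE8 (2 bytes);
   in UTF-8 it is 0xE5 0x80 0x9F (3 bytes). The literals are split so the
   trailing letter is not parsed as part of the hex escape. */
static const char NATIVE_A[] = "\xBD\xE8" "a";
static const char UTF8_A[]   = "\xE5\x80\x9F" "a";
static const char UTF8_B[]   = "\xE5\x80\x9F" "b";

/* Hypothetical stand-in for the max-length bookkeeping:
   keep the largest byte length seen so far. */
static size_t track_maxlen(size_t maxlen, const char *s) {
    size_t len = strlen(s);
    return len > maxlen ? len : maxlen;
}

/* Hypothetical stand-in for a byte-wise sort bounded by maxlen:
   returns nonzero when x and y cannot be told apart within maxlen bytes. */
static int tie_within(const char *x, const char *y, size_t maxlen) {
    return strncmp(x, y, maxlen) == 0;
}
```

With these definitions, `tie_within(UTF8_A, UTF8_B, track_maxlen(0, NATIVE_A))` is nonzero, because the bound taken from the native bytes stops one byte short of where the UTF-8 strings differ. Using `track_maxlen(0, UTF8_A)` as the bound tells them apart. Whether cradix_r() fails in exactly this way is an assumption on my part, but it is consistent with the two long strings above, which share a long prefix and only differ near the end.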
