chin and chmatch length 1 speedup #4121

Merged 3 commits on Dec 17, 2019
2 changes: 2 additions & 0 deletions NEWS.md
@@ -6,6 +6,8 @@

## NEW FEATURES

+1. `%chin%` and `chmatch(x, table)` are faster when `x` is length 1, `table` is long, and `x` occurs near the start of `table`. Thanks to Michael Chirico for the suggestion, [#4117](https://github.com/Rdatatable/data.table/pull/4117#discussion_r358378409).

## BUG FIXES

1. A NULL timezone on POSIXct was interpreted by `as.IDate` and `as.ITime` as UTC rather than the session's default timezone (`tz=""`) , [#4085](https://github.com/Rdatatable/data.table/issues/4085).
4 changes: 2 additions & 2 deletions R/utils.R
@@ -119,8 +119,8 @@ do_patterns = function(pat_sub, all_cols) {

# check UTC status
is_utc = function(tz) {
-  # via grep('UTC|GMT', OlsonNames(), value = TRUE)
-  utc_tz = c("Etc/GMT", "Etc/UTC", "GMT", "GMT-0", "GMT+0", "GMT0", "UTC")
+  # via grep('UTC|GMT', OlsonNames(), value = TRUE); ordered by "prior" frequency
+  utc_tz = c("UTC", "GMT", "Etc/UTC", "Etc/GMT", "GMT-0", "GMT+0", "GMT0")
if (is.null(tz)) tz = Sys.timezone()
return(tz %chin% utc_tz)
}
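
Why the reordering of `utc_tz` helps: with the length-1 fast path added in `src/chmatch.c`, `tz %chin% utc_tz` scans `utc_tz` from the front and stops at the first match, so placing the likeliest values first shortens the scan. The sketch below is illustrative only. It uses plain C strings and `strcmp` rather than R's CHARSXP pointers, and `scan_steps` is a hypothetical helper, not a data.table function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not data.table code: returns how many entries of
   `table` a front-to-back scan inspects before it finds `key`.
   Returns n when `key` is absent, since every entry was inspected. */
static size_t scan_steps(const char *key, const char *const *table, size_t n) {
  assert(key != NULL && table != NULL);
  for (size_t i = 0; i < n; ++i) {
    if (strcmp(table[i], key) == 0) return i + 1;  /* stop at the first hit */
  }
  return n;
}

/* The two orderings of utc_tz from the R/utils.R hunk. */
static const char *const utc_old[] = {"Etc/GMT", "Etc/UTC", "GMT", "GMT-0", "GMT+0", "GMT0", "UTC"};
static const char *const utc_new[] = {"UTC", "GMT", "Etc/UTC", "Etc/GMT", "GMT-0", "GMT+0", "GMT0"};
```

With these arrays, `scan_steps("UTC", utc_old, 7)` inspects all 7 entries, while `scan_steps("UTC", utc_new, 7)` stops after 1. A non-UTC timezone still costs a full scan under either ordering.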
29 changes: 21 additions & 8 deletions src/chmatch.c
@@ -4,17 +4,31 @@ static SEXP chmatchMain(SEXP x, SEXP table, int nomatch, bool chin, bool chmatch
if (!isString(x) && !isNull(x)) error("x is type '%s' (must be 'character' or NULL)", type2char(TYPEOF(x)));
if (!isString(table) && !isNull(table)) error("table is type '%s' (must be 'character' or NULL)", type2char(TYPEOF(table)));
if (chin && chmatchdup) error("Internal error: either chin or chmatchdup should be true not both"); // # nocov
-  // allocations up front before savetl starts
-  SEXP ans = PROTECT(allocVector(chin?LGLSXP:INTSXP, length(x)));
-  if (!length(x)) { UNPROTECT(1); return ans; } // no need to look at table when x is empty
+  const int xlen = length(x);
+  const int tablelen = length(table);
+  // allocations up front before savetl starts in case allocs fail
+  SEXP ans = PROTECT(allocVector(chin?LGLSXP:INTSXP, xlen));
+  if (xlen==0) { UNPROTECT(1); return ans; } // no need to look at table when x is empty
int *ansd = INTEGER(ans);
-  if (!length(table)) { const int val=(chin?0:nomatch), n=LENGTH(x); for (int i=0; i<n; ++i) ansd[i]=val; UNPROTECT(1); return ans; }
+  if (tablelen==0) { const int val=(chin?0:nomatch), n=xlen; for (int i=0; i<n; ++i) ansd[i]=val; UNPROTECT(1); return ans; }
// Since non-ASCII strings may be marked with different encodings, it only make sense to compare
// the bytes under a same encoding (UTF-8) #3844 #3850
const SEXP *xd = STRING_PTR(PROTECT(coerceUtf8IfNeeded(x)));
const SEXP *td = STRING_PTR(PROTECT(coerceUtf8IfNeeded(table)));
+  const int nprotect = 3; // ans, xd, td
+  if (xlen==1) {
+    ansd[0] = nomatch;
+    for (int i=0; i<tablelen; ++i) {
+      if (td[i]==xd[0]) {
+        ansd[0] = chin ? 1 : i+1;
+        break; // short-circuit early; if there are dups in table the first is returned
+      }
+    }
+    UNPROTECT(nprotect);
+    return ans;
+  }
+  // else xlen>1; nprotect is const above since no more R allocations should occur after this point
savetl_init();
-  const int xlen = length(x);
for (int i=0; i<xlen; i++) {
SEXP s = xd[i];
const int tl = TRUELENGTH(s);
@@ -31,7 +45,6 @@ static SEXP chmatchMain(SEXP x, SEXP table, int nomatch, bool chin, bool chmatch
// # nocov end
}
}
-  const int tablelen = length(table);
int nuniq=0;
for (int i=0; i<tablelen; ++i) {
SEXP s = td[i];
@@ -89,8 +102,8 @@ static SEXP chmatchMain(SEXP x, SEXP table, int nomatch, bool chin, bool chmatch
for (int i=0; i<tablelen; i++)
SET_TRUELENGTH(td[i], 0); // reinstate 0 rather than leave the -i-1
savetl_end();
-  UNPROTECT(3); // ans, xd, td
-  return(ans);
+  UNPROTECT(nprotect); // ans, xd, td
+  return ans;
}
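
For readers who want to try the `xlen==1` branch outside R, here is a minimal standalone sketch of the same idea. It is not the data.table implementation: `match_one` is a hypothetical name, and plain `const char *` pointers stand in for CHARSXPs under the assumption that equal strings share one address, so pointer comparison is enough (as `td[i]==xd[0]` does in the hunk above).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the xlen==1 branch of chmatchMain.
   Assumption: `key` and the entries of `table` are interned, so two equal
   strings are the same pointer and no byte-wise comparison is needed.
   Returns the 1-based position of the first match (or 1 when `chin` is
   true), and `nomatch` when `key` does not occur in `table`. */
static int match_one(const char *key, const char *const *table, int n,
                     int nomatch, bool chin) {
  assert(n >= 0);
  for (int i = 0; i < n; ++i) {
    if (table[i] == key) {
      return chin ? 1 : i + 1;  /* short-circuit: first duplicate wins */
    }
  }
  return nomatch;
}
```

The scan costs at most `n` pointer comparisons and needs no setup or cleanup, which is why it pays off when the match sits near the start of a long `table`. When the match is late or absent it is still a single pass over `table`.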

// for internal use from C :