
memrecycle no longer errors when attempting to coerce CPLXSXP to STRSXP #4203

Merged
merged 11 commits into master from memrecycle-cplx-str
Feb 18, 2020

Conversation

sritchie73
Contributor

@sritchie73 sritchie73 commented Jan 27, 2020

Closes #4202

The error was due to the internal allNA function in utils.c not handling CPLXSXP. This PR adds that handling, treating a complex element as missing when ISNAN is true for either its real or imaginary component.

Demonstration that this is the correct way of handling NAs for complex vectors:

x = c(complex(real=1, imaginary=0), NA, complex(real=NA, imaginary=1), 
      complex(real=1, imaginary=NA), complex(real=NA, imaginary=NA), 
      NaN, complex(real=NaN, imaginary=1), complex(real=0, imaginary=NaN),
      complex(real=NaN, imaginary=NaN))
x
# [1]   1+0i       NA       NA       NA       NA NaN+0i NaN+1i   0+NaNi NaN+NaNi
sapply(x, is.na)
# [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
sapply(x, is.nan)
# [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
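For reference, the C side of the fix can be sketched in standalone form. This is a model, not data.table's actual utils.c: `Rcomplex` is re-declared locally, `isnan` from math.h stands in for R's `ISNAN` (which also catches `NA_REAL`, itself a NaN payload), and `all_na_complex` is a hypothetical name for the CPLXSXP branch of allNA.

```c
#include <math.h>      /* isnan, NAN */
#include <stdbool.h>
#include <stddef.h>

/* Local stand-in for R's Rcomplex: real and imaginary doubles. */
typedef struct { double r, i; } Rcomplex;

/* The macro added by this PR: a complex element counts as missing if
 * EITHER component is NA or NaN, matching base R's is.na() behaviour. */
#define ISNAN_COMPLEX(x) (isnan((x).r) || isnan((x).i))

/* Sketch of the CPLXSXP branch of allNA: true only when every element
 * has an NA/NaN real or imaginary part. */
static bool all_na_complex(const Rcomplex *x, size_t n) {
  for (size_t j = 0; j < n; ++j) {
    if (!ISNAN_COMPLEX(x[j])) return false;
  }
  return true;
}
```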

@codecov

codecov bot commented Jan 27, 2020

Codecov Report

Merging #4203 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4203      +/-   ##
==========================================
+ Coverage   99.61%   99.61%   +<.01%     
==========================================
  Files          72       72              
  Lines       13875    13916      +41     
==========================================
+ Hits        13821    13862      +41     
  Misses         54       54
Impacted Files Coverage Δ
R/merge.R 100% <ø> (ø) ⬆️
src/subset.c 100% <ø> (ø) ⬆️
R/as.data.table.R 100% <ø> (ø) ⬆️
src/assign.c 100% <ø> (ø) ⬆️
src/fwriteR.c 100% <100%> (ø) ⬆️
src/dogroups.c 100% <100%> (ø) ⬆️
src/frank.c 100% <100%> (ø) ⬆️
R/frank.R 100% <100%> (ø) ⬆️
src/freadR.c 100% <100%> (ø) ⬆️
src/fread.c 99.52% <100%> (ø) ⬆️
... and 15 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 016df30...c80a0d6. Read the comment docs.

src/data.table.h Outdated
@@ -47,6 +47,9 @@ typedef R_xlen_t RLEN;
#define NA_INTEGER64 INT64_MIN
#define MAX_INTEGER64 INT64_MAX

// for use with CPLXSXP, no macro provided by R internals
#define ISNAN_COMPLEX(x) (ISNAN(x.r) || ISNAN(x.i)) // TRUE if either real or imaginary component is NA or NaN
Member


A few other places this could be used if we're going to add it:

src/coalesce.c:128:        if (ISNAN(tt.r) && ISNAN(tt.i)) continue;
src/coalesce.c:134:    const bool final = !ISNAN(finalVal.r) && !ISNAN(finalVal.i);
src/coalesce.c:138:      if (!ISNAN(val.r) && !ISNAN(val.i)) continue;
src/coalesce.c:139:      int j=0; while (ISNAN(val.r) && ISNAN(val.i) && j<k) val=((Rcomplex *)valP[j++])[i];
src/coalesce.c:140:      if (!ISNAN(val.r) || !ISNAN(val.i)) xP[i]=val; else if (final) xP[i]=finalVal;
src/frank.c:59:      for (int j=0; j<n; ++j) ians[j] |= (ISNAN(COMPLEX(v)[j].r) || ISNAN(COMPLEX(v)[j].i));
src/frank.c:195:      while (j < n && !ISNAN(COMPLEX(v)[j].r) && !ISNAN(COMPLEX(v)[j].i)) j++;
src/gsumm.c:320:          if (ISNAN(elem.r) && ISNAN(elem.i)) my_anyNA = true;
src/gsumm.c:327:          if (ISNAN(elem.r) && ISNAN(elem.i)) my_anyNA = true;
src/gsumm.c:557:            if (!ISNAN(elem.r)) _ans[my_low[i]].r += elem.r;
src/gsumm.c:642:      if (ISNAN(xd[ix].r) || ISNAN(xd[ix].i)) continue;  // || otherwise we'll need two counts in two c's too?
src/assign.c:1039:      else          BODY(double,  REAL, double, ISNAN(val)?(im=NA_REAL,NA_REAL):(im=0.0,val),         td[i].r=cval;td[i].i=im)

Contributor Author

@sritchie73 sritchie73 Jan 28, 2020


It strikes me as a bit odd that there aren't macros for these in Rinternals.h. I'll see how R-devel feels about adding them to base R.

Member

@MichaelChirico MichaelChirico left a comment


LGTM

@2005m
Contributor

2005m commented Jan 28, 2020

I am wondering if we shouldn't generalise the C function allNA to 64-bit vectors and use xlength?

@MichaelChirico
Member

@2005m IIRC Matt has supporting 64-bit row indices generally in data.table as a high priority for the near-ish future; I would group these under that bigger issue.

@2005m
Contributor

2005m commented Jan 28, 2020

Is there an open issue regarding that subject? Because yes, I also think it is important, especially if we want to move to long vectors in data.table. I am happy to start going through the C code and move what can be moved to 64 bits. Are you referring to issue 3957?

@MichaelChirico
Member

I looked briefly but forget where I saw this written down

^ cc @mattdowle

@MichaelChirico
Member

MichaelChirico commented Jan 28, 2020

ah, found it, it was in the Q&A at the H2O talk (~20 minutes in)

https://youtu.be/fZpA_cU0SPg

@sritchie73
Contributor Author

Just found this:

src/datatable.h:

extern Rcomplex NA_CPLX;  // initialized in init.c; see there for comments

src/init.c:

  NA_CPLX.r = NA_REAL;  // NA_REAL is defined as R_NaReal which is not a strict constant and thus initializer {NA_REAL, NA_REAL} can't be used in .h
  NA_CPLX.i = NA_REAL;  // https://github.com/Rdatatable/data.table/pull/3689/files#r304117234

This is used in a few places:
https://github.com/Rdatatable/data.table/search?q=NA_CPLX&unscoped_q=NA_CPLX

But it looks like it's not strictly correct: as demonstrated by the R code above, is.na() requires only one of the real or imaginary components to be NA before returning TRUE, not both.

@sritchie73
Contributor Author

Question RE: long vector support. Should we be using R_xlen_t rather than int64_t?

@jangorecki
Member

@sritchie73 For long vector support R_xlen_t should be fine, unless we want to isolate logic from R C API, then int64_t could be used. For now don't bother about long vector. Matt has working branch on that issue, so no need to duplicate work and create merge conflicts.

@mattdowle mattdowle added this to the 1.12.9 milestone Feb 18, 2020
@mattdowle
Member

mattdowle commented Feb 18, 2020

Great PR, thanks!

Answers to the various questions above (I tried to reply within each comment but I'd have to edit to do that which'd be messy, so here goes):

But it looks like [data.table's NA_CPLX] is not strictly correct: as demonstrated by the R code above, is.na() requires only one of the real or imaginary components to be NA before returning TRUE, not both.

NA_CPLX is only used to assign an NA value to a complex (like when filling with NA). When assigning we may as well assign both real and imaginary to avoid leaving uninitialized random content there. The ISNAN() code exists in data.table precisely because we aren't using ==NA_CPLX (which would indeed be wrong if we did).
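The distinction can be sketched in standalone C (`Rcomplex` re-declared locally, `NAN` standing in for `NA_REAL`; `make_na_cplx` and `wrong_na_test` are hypothetical names for illustration): assignment writes both components, detection ORs `ISNAN` over them, and an `==` comparison would be wrong by construction.

```c
#include <math.h>      /* isnan, NAN */
#include <stdbool.h>

typedef struct { double r, i; } Rcomplex;

/* Detection: NA if either component is NA/NaN (matches is.na()). */
#define ISNAN_COMPLEX(x) (isnan((x).r) || isnan((x).i))

/* Fill value in the spirit of data.table's NA_CPLX: both components
 * are set, so assigning it never leaves uninitialized content behind. */
static Rcomplex make_na_cplx(void) {
  Rcomplex z = { NAN, NAN };
  return z;
}

/* A comparison-based test like ==NA_CPLX would be wrong twice over:
 * NaN never compares equal to anything, and is.na() only needs ONE
 * missing component, not both. */
static bool wrong_na_test(Rcomplex x, Rcomplex na) {
  return x.r == na.r && x.i == na.i;
}
```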

I am wondering if we shouldn't generalise 64 bit vectors in C function allNA and use xlength ?
Should we be using R_xlen_t rather than int64_t?

Writing out loud ...

I always thought that R had defined and used R_len_t so they could change its definition to int64_t and all vectors would just become 64bit. That was the whole point of that up-front effort to use an extra layer of R_len_t rather than just int, surely. But then, instead, they introduced R_xlen_t and xlength, and everyone has to change their code, and it takes years and years. It's really desperately sad and such a shame. There must have been good reasons and deal-breakers preventing that course. I seem to remember that one problem was that we package creators had used int in many places, and so long vectors wouldn't have worked in those packages. But there's the thing: I would have put the onus back onto packages to change their int to R_len_t, as they should have done in the first place. Where packages did use R_len_t (and data.table did), we now have to change R_len_t to R_xlen_t, length to xlength, and LENGTH to XLENGTH (and anything else?).

There are big advantages to retaining both integer and integer64 (32bit integer is half as much memory as integer64), so I'm not advocating that the integer type should have been made 64bit in size. I'm just advocating that all vectors and accessors should have been made 64bit, so that those who had used int would have had to update to R_len_t (putting the workload onto the appropriate people; but maybe they were the ones who shouted the loudest and got their way). But there again, R never added integer64 support anyway.

So, with the rant over and that said, I currently prefer int64_t because it's a good name with "64" in it, it's a C99 standard type that many more people understand, and, as Jan says, we'd like to make the internals R-independent wherever possible (e.g. with pydatatable in mind). R_xlen_t is not a great name really: it's just one letter different from R_len_t, and what does 'x' mean anyway, and what's the difference between R_len_t and R_xlen_t? It raises all those questions. R_len_t was an inspired idea, but then the opportunity to change it to 64bit wasn't possible apparently. That horse bolted. Now's the time to give up on that abstraction and use the standard types everyone understands. For example, #4213 was easier, I think, for dealing with those standard types; had R_len_t and R_xlen_t been in use there, it might have hindered that PR a little. I can be persuaded, but there's quite some momentum in the code using int64_t now, and my mental model has become tuned to seeing "64" in the type or not. In other (quite hairy) places I do use and rely on 16bit types (int16_t), and it's easier to use the standard types directly in those places without a layer in the way.

Further, long vectors aren't 64bit (or 63bit) long. They're up to 2^52 because R doesn't have a 64bit integer type, so a double is used; all integers in the range 1:2^52 are precise in a double. [Give or take a bit, I'm writing quickly.] So actually, double comes into all this and it's a 3-way decision: R_xlen_t | int64_t | double. I don't think data.frame can have >2bn rows in R yet, but for both data.frame and data.table, nrow(DT) will return a double type. The question is whether in data.table we want to attach the "integer64" attribute to the nrow, or whether we want to move back towards the likely base-R way, which will be up to 2^52 stored in double with no special attribute/class for integer64. That's much more pressing than R_xlen_t in the source, because it's the user experience.
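The "give or take a bit" is the mantissa: an IEEE double has 52 mantissa bits, so every integer up to 2^53 is exactly representable and the first collision happens at 2^53 + 1, which puts R's long-vector range of up to 2^52 safely inside the exact region. A quick standalone check (`two_pow` is a hypothetical helper for illustration):

```c
/* Doubles have a 52-bit mantissa: integers with absolute value up to
 * 2^53 are exactly representable; above 2^53, consecutive integers
 * start to collide onto the same double. */
static double two_pow(int k) {
  double x = 1.0;
  for (int i = 0; i < k; ++i) x *= 2.0;  /* exact: only the exponent grows */
  return x;
}
```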

@sritchie73
Contributor Author

That makes sense, I'm happy to go through my PRs and change R_xlen_t to int64_t in light of the above.

The only gotcha is that the max value of R_xlen_t is less than that of int64_t. From memory, the R internals documentation says long vectors can only be something like 2^52 in length, but when I was playing around with implementing checks for this (before I found R_xlen_t), an integer vector that long would require something like >600TB of memory, so this is not something we really need to worry about.

@mattdowle
Member

mattdowle commented Feb 18, 2020

Great, glad. Yes, agreed: I added the note about 2^52 to the end of my long comment as you were writing. I welcome everyone's thoughts on whether nrow(DT) should return a base-R plain double or a bit64::integer64. I'm leaning towards a plain double, for consistency with base; and actually, if you increase the digits option in R, integers up to 2^52 do print out accurately and nicely. data.table could check and warn if the digits option was too low when it knows it has a big nrow.

Another thing is that R_xlen_t appears to be defined as ptrdiff_t in R, so on 32bit it'll be smaller. Long vectors couldn't be allocated with 32bit addressability anyway, so that seems fair enough at first glance. But when it comes to writing C loops, bounds checks, and overflow, my head can't cope with the uncertainty over what the size and bounds of the type are. We do have places that use INT64_MAX, for example, and we're confident in those because the type is int64_t for sure. Even on 32bit, int64_t is still 64bit, and 64bit integer arithmetic works on 32bit.
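Those width guarantees can be checked directly: int64_t is an exact-width type, 8 bytes wherever it exists, while ptrdiff_t follows the pointer width of the build on mainstream ABIs (so it shrinks to 4 bytes on a 32-bit build). A standalone sketch (`int64_width` and `ptrdiff_width` are hypothetical helper names):

```c
#include <stdint.h>   /* int64_t, INT64_MAX */
#include <stddef.h>   /* ptrdiff_t, size_t */

/* int64_t is an exact-width type: 8 bytes on every platform that
 * provides it, so bounds like INT64_MAX mean the same thing
 * everywhere, even on 32-bit builds. */
static size_t int64_width(void) { return sizeof(int64_t); }

/* ptrdiff_t (the basis of R_xlen_t) tracks the platform's pointer
 * width on mainstream ABIs, so its bounds differ between 32-bit and
 * 64-bit builds. */
static size_t ptrdiff_width(void) { return sizeof(ptrdiff_t); }
```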

@sritchie73
Contributor Author

Plain double makes more sense, I think. Otherwise we'd have to make bit64 a hard dependency instead of a Suggests.

@mattdowle mattdowle merged commit ade4178 into master Feb 18, 2020
@mattdowle mattdowle deleted the memrecycle-cplx-str branch February 18, 2020 06:15
@jangorecki jangorecki modified the milestones: 1.12.11, 1.12.9 May 26, 2020
Successfully merging this pull request may close these issues.

memrecycle fails to coerce complex to character