
memrecycle no longer errors when attempting to coerce CPLXSXP to STRSXP #4203

Merged
merged 11 commits into master from memrecycle-cplx-str
Feb 18, 2020

Conversation

sritchie73
Contributor

@sritchie73 sritchie73 commented Jan 27, 2020

Closes #4202

The error was due to the internal allNA function in utils.c not handling CPLXSXP. This PR adds that handling, treating a complex element as missing when ISNAN is true for either its real or imaginary component.

Demonstration that this is the correct way of handling NAs for complex vectors:

x = c(complex(real=1, imaginary=0), NA, complex(real=NA, imaginary=1), 
      complex(real=1, imaginary=NA), complex(real=NA, imaginary=NA), 
      NaN, complex(real=NaN, imaginary=1), complex(real=0, imaginary=NaN),
      complex(real=NaN, imaginary=NaN))
x
# [1]   1+0i       NA       NA       NA       NA NaN+0i NaN+1i   0+NaNi NaN+NaNi
sapply(x, is.na)
# [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
sapply(x, is.nan)
# [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
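For reference, the C side of the fix can be sketched in standalone form. This is a model, not data.table's actual utils.c: `Rcomplex` is re-declared locally, `isnan` from math.h stands in for R's `ISNAN` (which also catches `NA_REAL`, itself a NaN payload), and `all_na_complex` is a hypothetical name for the CPLXSXP branch of allNA.

```c
#include <math.h>      /* isnan, NAN */
#include <stdbool.h>
#include <stddef.h>

/* Local stand-in for R's Rcomplex: real and imaginary doubles. */
typedef struct { double r, i; } Rcomplex;

/* The macro added by this PR: a complex element counts as missing if
 * EITHER component is NA or NaN, matching base R's is.na() behaviour. */
#define ISNAN_COMPLEX(x) (isnan((x).r) || isnan((x).i))

/* Sketch of the CPLXSXP branch of allNA: true only when every element
 * has an NA/NaN real or imaginary part. */
static bool all_na_complex(const Rcomplex *x, size_t n) {
  for (size_t j = 0; j < n; ++j) {
    if (!ISNAN_COMPLEX(x[j])) return false;
  }
  return true;
}
```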

@codecov

codecov bot commented Jan 27, 2020

Codecov Report

Merging #4203 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4203      +/-   ##
==========================================
+ Coverage   99.61%   99.61%   +<.01%     
==========================================
  Files          72       72              
  Lines       13875    13916      +41     
==========================================
+ Hits        13821    13862      +41     
  Misses         54       54
Impacted Files Coverage Δ
R/merge.R 100% <ø> (ø) ⬆️
src/subset.c 100% <ø> (ø) ⬆️
R/as.data.table.R 100% <ø> (ø) ⬆️
src/assign.c 100% <ø> (ø) ⬆️
src/fwriteR.c 100% <100%> (ø) ⬆️
src/dogroups.c 100% <100%> (ø) ⬆️
src/frank.c 100% <100%> (ø) ⬆️
R/frank.R 100% <100%> (ø) ⬆️
src/freadR.c 100% <100%> (ø) ⬆️
src/fread.c 99.52% <100%> (ø) ⬆️
... and 15 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 016df30...c80a0d6. Read the comment docs.

src/data.table.h Outdated
@@ -47,6 +47,9 @@ typedef R_xlen_t RLEN;
#define NA_INTEGER64 INT64_MIN
#define MAX_INTEGER64 INT64_MAX

// for use with CPLXSXP, no macro provided by R internals
#define ISNAN_COMPLEX(x) (ISNAN(x.r) || ISNAN(x.i)) // TRUE if either real or imaginary component is NA or NaN
Member


A few other places this could be used if we're going to add it:

src/coalesce.c:128:        if (ISNAN(tt.r) && ISNAN(tt.i)) continue;
src/coalesce.c:134:    const bool final = !ISNAN(finalVal.r) && !ISNAN(finalVal.i);
src/coalesce.c:138:      if (!ISNAN(val.r) && !ISNAN(val.i)) continue;
src/coalesce.c:139:      int j=0; while (ISNAN(val.r) && ISNAN(val.i) && j<k) val=((Rcomplex *)valP[j++])[i];
src/coalesce.c:140:      if (!ISNAN(val.r) || !ISNAN(val.i)) xP[i]=val; else if (final) xP[i]=finalVal;
src/frank.c:59:      for (int j=0; j<n; ++j) ians[j] |= (ISNAN(COMPLEX(v)[j].r) || ISNAN(COMPLEX(v)[j].i));
src/frank.c:195:      while (j < n && !ISNAN(COMPLEX(v)[j].r) && !ISNAN(COMPLEX(v)[j].i)) j++;
src/gsumm.c:320:          if (ISNAN(elem.r) && ISNAN(elem.i)) my_anyNA = true;
src/gsumm.c:327:          if (ISNAN(elem.r) && ISNAN(elem.i)) my_anyNA = true;
src/gsumm.c:557:            if (!ISNAN(elem.r)) _ans[my_low[i]].r += elem.r;
src/gsumm.c:642:      if (ISNAN(xd[ix].r) || ISNAN(xd[ix].i)) continue;  // || otherwise we'll need two counts in two c's too?
src/assign.c:1039:      else          BODY(double,  REAL, double, ISNAN(val)?(im=NA_REAL,NA_REAL):(im=0.0,val),         td[i].r=cval;td[i].i=im)

Contributor Author

@sritchie73 sritchie73 Jan 28, 2020


It strikes me as a bit odd that there aren't macros for these in Rinternals.h. I'll see how R-devel feels about adding them to base R.

Member

@MichaelChirico MichaelChirico left a comment


LGTM

@2005m
Contributor

2005m commented Jan 28, 2020

I am wondering if we shouldn't generalise the C function allNA to 64-bit vectors and use xlength?

@MichaelChirico
Member

@2005m IIRC Matt has supporting 64-bit row indices generally in data.table as a high priority for the near-ish future; I would group these under that bigger issue.

@2005m
Contributor

2005m commented Jan 28, 2020

Is there an open issue regarding that subject? Because yes, I also think it is important, especially if we want to move to long vectors in data.table. I am happy to start going through the C code and move what can be moved to 64 bits. Are you referring to issue 3957?

@MichaelChirico
Member

I looked briefly but forget where I saw this written down

^ cc @mattdowle

@MichaelChirico
Member

MichaelChirico commented Jan 28, 2020

ah, found it, it was in the Q&A at the H2O talk (~20 minutes in)

https://youtu.be/fZpA_cU0SPg

@sritchie73
Contributor Author

Just found this:

src/datatable.h:

extern Rcomplex NA_CPLX;  // initialized in init.c; see there for comments

src/init.c:

  NA_CPLX.r = NA_REAL;  // NA_REAL is defined as R_NaReal which is not a strict constant and thus initializer {NA_REAL, NA_REAL} can't be used in .h
  NA_CPLX.i = NA_REAL;  // https://github.com/Rdatatable/data.table/pull/3689/files#r304117234

This is used in a few places:
https://github.com/Rdatatable/data.table/search?q=NA_CPLX&unscoped_q=NA_CPLX

But it looks like it's not strictly correct: as demonstrated by the R code above, is.na() requires only one of the real or imaginary components to be NA before returning TRUE, not both.

@sritchie73
Contributor Author

Question RE: long vector support. Should we be using R_xlen_t rather than int64_t?

@jangorecki
Member

@sritchie73 For long vector support R_xlen_t should be fine, unless we want to isolate logic from R C API, then int64_t could be used. For now don't bother about long vector. Matt has working branch on that issue, so no need to duplicate work and create merge conflicts.

@mattdowle mattdowle added this to the 1.12.9 milestone Feb 18, 2020
@mattdowle
Member

mattdowle commented Feb 18, 2020

Great PR, thanks!

Answers to the various questions above (I tried to reply within each comment but I'd have to edit to do that which'd be messy, so here goes):

But it looks like [data.table's NA_CPLX] is not strictly correct: as demonstrated by the R code above, is.na() requires only one of the real or imaginary components to be NA before returning TRUE, not both.

NA_CPLX is only used to assign an NA value to a complex (like when filling with NA). When assigning we may as well assign both real and imaginary to avoid leaving uninitialized random content there. The ISNAN() code exists in data.table precisely because we aren't using ==NA_CPLX (which would indeed be wrong if we did).
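The distinction can be sketched in standalone C (`Rcomplex` re-declared locally, `NAN` standing in for `NA_REAL`; `make_na_cplx` and `wrong_na_test` are hypothetical names for illustration): assignment writes both components, detection ORs `ISNAN` over them, and an `==` comparison would be wrong by construction.

```c
#include <math.h>      /* isnan, NAN */
#include <stdbool.h>

typedef struct { double r, i; } Rcomplex;

/* Detection: NA if either component is NA/NaN (matches is.na()). */
#define ISNAN_COMPLEX(x) (isnan((x).r) || isnan((x).i))

/* Fill value in the spirit of data.table's NA_CPLX: both components
 * are set, so assigning it never leaves uninitialized content behind. */
static Rcomplex make_na_cplx(void) {
  Rcomplex z = { NAN, NAN };
  return z;
}

/* A comparison-based test like ==NA_CPLX would be wrong twice over:
 * NaN never compares equal to anything, and is.na() only needs ONE
 * missing component, not both. */
static bool wrong_na_test(Rcomplex x, Rcomplex na) {
  return x.r == na.r && x.i == na.i;
}
```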

I am wondering if we shouldn't generalise 64 bit vectors in C function allNA and use xlength ?
Should we be using R_xlen_t rather than int64_t?

Writing out loud ...

I always thought that R had defined and used R_len_t so they could change its definition to int64_t and all vectors would just become 64bit. That was the whole point of that up-front effort to use an extra layer of R_len_t rather than just int, surely. But then, instead, they introduced R_xlen_t and xlength, and everyone has to change their code, and it takes years and years. It's really desperately sad and such a shame. There must have been good reasons and deal-breakers preventing that course. I seem to remember that one problem was that we package creators had used int in many places, and so long vectors wouldn't have worked in those packages. But there's the thing: I would have put the onus back onto packages to change their int to R_len_t, as they should have done in the first place. Where packages did use R_len_t (and data.table did), we now have to change R_len_t to R_xlen_t, length to xlength, and LENGTH to XLENGTH (and anything else?).

There are big advantages to retaining both integer and integer64 (32bit integer is half as much memory as integer64), so I'm not advocating that the integer type should have been made 64bit in size. I'm just advocating that all vectors and accessors should have been made 64bit, so that those who had used int would have had to update to R_len_t (putting the workload onto the appropriate people; but maybe they were the ones who shouted the loudest and got their way). But there again, R never added integer64 support anyway.

So, with the rant over and that said, I currently prefer int64_t because it's a good name with "64" in it, it's a C99 standard type that many more people understand, and, as Jan says, we'd like to make the internals R-independent wherever possible (e.g. with pydatatable in mind). R_xlen_t is not a great name really: it's just one letter different from R_len_t, and what does 'x' mean anyway, and what's the difference between R_len_t and R_xlen_t? It raises all those questions. R_len_t was an inspired idea, but then the opportunity to change it to 64bit wasn't possible apparently. That horse bolted. Now's the time to give up on that abstraction and use the standard types everyone understands. For example, #4213 was easier, I think, for dealing with those standard types; had R_len_t and R_xlen_t been in use there, it might have hindered that PR a little. I can be persuaded, but there's quite some momentum in the code using int64_t now, and my mental model has become tuned to seeing "64" in the type or not. In other (quite hairy) places I do use and rely on 16bit types (int16_t), and it's easier to use the standard types directly in those places without a layer in the way.

Further, long vectors aren't 64bit (or 63bit) long. They're up to 2^52 because R doesn't have a 64bit integer type, so a double is used; all integers in the range 1:2^52 are precise in a double. [Give or take a bit, I'm writing quickly.] So actually, double comes into all this and it's a 3-way decision: R_xlen_t | int64_t | double. I don't think data.frame can have >2bn rows in R yet, but for both data.frame and data.table, nrow(DT) will return a double type. The question is whether in data.table we want to attach the "integer64" attribute to the nrow, or whether we want to move back towards the likely base-R way, which will be up to 2^52 stored in double with no special attribute/class for integer64. That's much more pressing than R_xlen_t in the source, because it's the user experience.
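The "give or take a bit" is the mantissa: an IEEE double has 52 mantissa bits, so every integer up to 2^53 is exactly representable and the first collision happens at 2^53 + 1, which puts R's long-vector range of up to 2^52 safely inside the exact region. A quick standalone check (`two_pow` is a hypothetical helper for illustration):

```c
/* Doubles have a 52-bit mantissa: integers with absolute value up to
 * 2^53 are exactly representable; above 2^53, consecutive integers
 * start to collide onto the same double. */
static double two_pow(int k) {
  double x = 1.0;
  for (int i = 0; i < k; ++i) x *= 2.0;  /* exact: only the exponent grows */
  return x;
}
```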

@sritchie73
Contributor Author

That makes sense, I'm happy to go through my PRs and change R_xlen_t to int64_t in light of the above.

The only gotcha is that the max value of R_xlen_t is less than that of int64_t. From memory, the R internals documentation says long vectors can only be something like 2^52 in length, but when I was playing around with implementing checks for this (before I found R_xlen_t), an integer vector that long would require something like >600TB of memory, so this is not something we really need to worry about.

@mattdowle
Member

mattdowle commented Feb 18, 2020

Great, glad. Yes, agreed: I added the note about 2^52 to the end of my long comment as you were writing. I welcome everyone's thoughts on whether nrow(DT) should return a base-R plain double or a bit64::integer64. I'm leaning towards a plain double, for consistency with base; and actually, if you increase the digits option in R, integers up to 2^52 do print out accurately and nicely. data.table could check and warn if the digits option was too low when it knows it has a big nrow.

Another thing is that R_xlen_t appears to be defined as ptrdiff_t in R, so on 32bit it'll be smaller. Long vectors couldn't be allocated with 32bit addressability anyway, so that seems fair enough at first glance. But when it comes to writing C loops, bounds checks, and overflow, my head can't cope with the uncertainty over what the size and bounds of the type are. We do have places that use INT64_MAX, for example, and we're confident in those because the type is int64_t for sure. Even on 32bit, int64_t is still 64bit, and 64bit integer arithmetic works on 32bit.
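Those width guarantees can be checked directly: int64_t is an exact-width type, 8 bytes wherever it exists, while ptrdiff_t follows the pointer width of the build on mainstream ABIs (so it shrinks to 4 bytes on a 32-bit build). A standalone sketch (`int64_width` and `ptrdiff_width` are hypothetical helper names):

```c
#include <stdint.h>   /* int64_t, INT64_MAX */
#include <stddef.h>   /* ptrdiff_t, size_t */

/* int64_t is an exact-width type: 8 bytes on every platform that
 * provides it, so bounds like INT64_MAX mean the same thing
 * everywhere, even on 32-bit builds. */
static size_t int64_width(void) { return sizeof(int64_t); }

/* ptrdiff_t (the basis of R_xlen_t) tracks the platform's pointer
 * width on mainstream ABIs, so its bounds differ between 32-bit and
 * 64-bit builds. */
static size_t ptrdiff_width(void) { return sizeof(ptrdiff_t); }
```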

@sritchie73
Contributor Author

Plain double makes more sense, I think. Otherwise we'd have to make bit64 a hard dependency instead of a Suggests.

@mattdowle mattdowle merged commit ade4178 into master Feb 18, 2020
@mattdowle mattdowle deleted the memrecycle-cplx-str branch February 18, 2020 06:15
@jangorecki jangorecki modified the milestones: 1.12.11, 1.12.9 May 26, 2020
Successfully merging this pull request may close these issues.

memrecycle fails to coerce complex to character