PDEP-14: Dedicated string data type for pandas 3.0 #58551

jorisvandenbossche · 2024-05-03T15:18:13Z

Following the discussion in #57073, this proposes a possible solution to get a string dtype in pandas 3.0 (essentially writing out my compromise attempt at #57073 (comment) as a formal proposal).
This also covers the issue tracking the required work for the string dtype in #54792.

Abstract

This PDEP proposes to introduce a dedicated string dtype that will be used by default in pandas 3.0:

In pandas 3.0, enable a string dtype ("str") by default, using PyArrow if available or otherwise the numpy object-dtype alternative.
The default string dtype will use missing value semantics using NaN consistent with the other default data types.

This will give users a long-awaited proper string dtype for 3.0, while 1) not (yet) making PyArrow a hard dependency, but still a dependency used by default, and 2) leaving room for future improvements (different missing value semantics, using NumPy 2.0 or nanoarrow, etc).
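As a rough illustration of the "PyArrow if available, otherwise object-dtype" rule in the abstract, here is a minimal stdlib-only sketch. The helper names `pyarrow_installed` and `choose_string_storage` are hypothetical and are not pandas API; the storage names "pyarrow" and "python" follow the existing `string[pyarrow]` / `string[python]` spelling.

```python
from importlib.util import find_spec


def pyarrow_installed() -> bool:
    """Return True if the pyarrow package can be found (without importing it)."""
    return find_spec("pyarrow") is not None


def choose_string_storage(pyarrow_available: bool) -> str:
    """Hypothetical helper: pick the storage backing the default string dtype.

    "pyarrow" is preferred when available; "python" stands for the
    numpy object-dtype fallback.
    """
    return "pyarrow" if pyarrow_available else "python"


# The selection depends only on whether pyarrow is present in the environment.
storage = choose_string_storage(pyarrow_installed())
print(storage)
```

In this sketch the user-facing dtype would be the same in both cases; only the storage behind it differs.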

Sub-discussions:

Default string dtype (PDEP-14): naming convention to distinguish the dtype variants #58613

cc @pandas-dev/pandas-core @pandas-dev/pandas-triage

bashtage

A good attempt at providing the compromise that is being asked for.

Some possible names that spring to mind: pyarrow_legacy, pyarrow_nan

bashtage · 2024-05-03T15:45:20Z

web/pandas/pdeps/00xx-string-dtype.md

+default in pandas 3.0:
+
+* In pandas 3.0, enable a "string" dtype by default, using PyArrow if available
+ or otherwise the numpy object-dtype alternative.


Should you allow the possibility of a NumPy 2 improved type for pandas 3? With a hierarchy arrow -> np 2 -> np object?

This proposal does not preclude any further improvements for the numpy-based string dtype using numpy 2.0. A few lines below I explicitly mention it as a future improvement and in the "Object-dtype "fallback" implementation" section as well.

I just don't want to explicitly commit to anything for pandas 3.0 related to that, given it is hard to judge right now how well it will work / how much work it is to get it ready (not only our own implementation, but also support in the rest of the ecosystem). If it is ready by 3.0, then we can evaluate that separately, but this proposal doesn't stand or fall with it.

Regardless of whether to also use numpy 2.0, we have to agree on 1) making a "string" dtype the default for 3.0, 2) the missing value behaviour to use for this dtype, and 3) whether to provide an alternative for PyArrow (in which case we need the object-dtype version anyway since we also can't require numpy 2.0). I would like the proposal to focus on those aspects.

bashtage · 2024-05-03T15:48:34Z

web/pandas/pdeps/00xx-string-dtype.md

+After acceptance of PDEP-10, two aspects of the proposal have been under
+reconsideration:
+
+- Based on user feedback, it has been considered to relax the new `pyarrow`


Is it worth mentioning why this has been objected to? As far as I am aware virtually all objections are due to the installation size effect, and not performance or compatibility.

I can certainly mention something, but would prefer to keep that brief to focus here on the strings context and not trigger discussion here about the merits of those objections.
(for example, it's not only installation size, but also the difficulty to install from source in case there are no wheels)

Added "(mostly around installation complexity and size)"

WillAyd · 2024-05-03T16:58:23Z

web/pandas/pdeps/00xx-string-dtype.md

+reconsideration:
+
+- Based on user feedback, it has been considered to relax the new `pyarrow`
+ requirement to not be a _hard_ runtime dependency. In addition, NumPy 2.0 can


I don't think NumPy 2.0 will reduce the need to make pyarrow a dependency for strings; as far as I am aware it is not natively returned by any I/O operation and it has a completely different string architecture than pyarrow, so there is no zero-copy capability. Those seem like they either will require a large amount of string copying or a hefty amount of updates to make it natively work with our I/O, as well as with the larger Arrow ecosystem. That's a huge amount of things to gloss over

I don't think NumPy 2.0 will reduce the need to make pyarrow a dependency for strings

I think it can do that if your motivation for wanting pyarrow is the better performance compared to object-dtype. In that case, numpy 2.0's StringDType can give you a part of the speedup, without requiring pyarrow.
The discussion in #57073 also started from that point of view, mentioning numpy 2.0 as an alternative to requiring pyarrow, so based on that my feeling is that what I wrote here is correct (or at least seen as such by some people).

But you are completely right that there are a lot of things that would need to be implemented to make it fully usable for us. That's also the reason that this PDEP does not say to use numpy 2.0, but defers that as a possible future enhancement, to discuss later. And you are also right that it has drawbacks compared to an Arrow-based solution (using the Arrow memory layout, but not necessarily using pyarrow the package), another reason for me personally to again defer that to a separate discussion.

I just wanted to mention it for the complete context of the string dtype history and discussion. Now, I already mention its existence in the previous paragraph, so could keep it shorter here.
(and if you have any concrete suggestions to word this better, I am all ears!)

WillAyd · 2024-05-03T17:09:38Z

web/pandas/pdeps/00xx-string-dtype.md

+topic.
+
+In the first place, we need to acknowledge that most users should not need to
+use storage-specific options. Users are expected to specify `pd.StringDtype()`


So we are reusing pd.StringDtype() in this case right? Is that going to break existing use cases where users have relied on that using pd.NA as a sentinel?

So we are reusing pd.StringDtype() in this case right?

Yes, and that is what already happens since pandas 2.1 with future.infer_string enabled

Is that going to break existing use cases where users have relied on that using pd.NA as a sentinel?

Yes, I mentioned that in the "Backwards compatibility" section

Ah thanks - sorry for overlooking that. So I think it goes without saying then that if we go this route we no longer will declare pd.StringDtype() experimental? Or are we still trying to keep that reservation knowing even this is not considered a long term design decision?

So I think it goes without saying then that if we go this route we no longer will declare pd.StringDtype() experimental?

Yep, given the proposal is to enable this by default, I think that is indeed saying to remove the experimental label (I can mention that somewhere explicitly if that helps)

Or are we still trying to keep that reservation knowing even this is not considered a long term design decision?

Once we have a "string", we will always have one, I think. That aspect is the long term decision this PDEP is proposing. We might later change the missing value semantics, but that doesn't mean the string dtype proposed here is still experimental (just like our default "int64" dtype is not experimental). At the time that we would decide to enable new missing value semantics by default, then "string" will "simply" start meaning something different.

jbrockmendel · 2024-05-03T21:31:12Z

ValueError: Could not find PDEP number in 'PDEP: Dedicated string data type for pandas 3.0'. Please make sure to write the title as: 'PDEP-num: PDEP: Dedicated string data type for pandas 3.0'.

jbrockmendel · 2024-05-03T21:39:03Z

web/pandas/pdeps/00xx-string-dtype.md

+Currently, the `StringDtype(storage="pyarrow_numpy")` is used, where
+"pyarrow_numpy" is a rather confusing option.
+
+TODO see if we can come up with a better naming scheme


StringDtype(storage="pyarrow", semantics="numpy")? Or, instead of semantics, could use na_value=np.nan

If I'm understanding correctly about the motivation for the change in dtype (improved overall user experience), then moving forward I suspect that when we can have improved/native dtypes for other data types (nested, date, etc) that the same logic would need to apply, i.e. we would need to have variants of these with NumPy semantics.

Now this probably falls under PDEP-13 but if we have semantics as an argument (that users would see and use) we could still end up with columns using different missing value indicators?

StringDtype(storage="pyarrow", semantics="numpy")? Or, instead of semantics, could use na_value=np.nan

or maybe "nullable=[True|False]"

However, at the moment, we distinguish the nullable data types for the other dtypes (int, float, etc) with capitalization and so for consistency could also consider string/String as the dtypes.

PDEP-13 proposes StringDtype(backend="pyarrow", na_marker=np.nan). I think the repr should just be updated to reflect that; trying to sift through the meaning of int versus Int versus int[pyarrow] compared to string versus string[pyarrow] versus string[pyarrow_numpy] I think would be a distraction for this proposal

StringDtype(storage="pyarrow", semantics="numpy")? Or, instead of semantics, could use na_value=np.nan

@jbrockmendel good point that we can also use other keywords than just storage to make the distinction

if we have semantics as an argument (that users would see and use) we could still end up with columns using different missing value indicators?

Only if users explicitly specify a non-default value for this, and never by default. This is the same with whatever option we come up with (eg also when using dtype_backend="pyarrow" or explicitly asking for one of the masked dtypes with dtype=Int64 or .. you can end up with a DataFrame with columns with mixed semantics)

we distinguish the nullable data types for the other dtypes (int, float, etc) with capitalization and so for consistency could also consider string/String as the dtypes.

Yeah, only unfortunately to be consistent with the other dtypes where we use capitalization, it would need to be "string" for the new NaN-based dtype, and "String" for the "nullable" NA-based variant. And so that doesn't help with backwards compatibility, because "string" right now means the nullable dtype. Given that, I would personally not use capitalization here (which also only is a solution for the string alias naming, not for the StringDtype(..) API)

To keep the sub-discussions manageable, I moved this specific topic out of this inline comment thread, and into its own issue: #58613
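To make the naming problem in this thread concrete, the sketch below enumerates the four storage / missing-value combinations and the alias each one has according to the comments above. `StringVariant` and `_CURRENT_ALIASES` are hypothetical names for illustration only, not pandas API, and the missing-value sentinel is spelled as text to keep the example stdlib-only.

```python
from dataclasses import dataclass
from typing import Optional

# Aliases mentioned in this thread. The object-dtype + NaN fallback proposed by
# the PDEP has no agreed name yet, so it maps to None here.
_CURRENT_ALIASES = {
    ("python", "NA"): "string[python]",
    ("pyarrow", "NA"): "string[pyarrow]",
    ("pyarrow", "NaN"): "string[pyarrow_numpy]",
    ("python", "NaN"): None,
}


@dataclass(frozen=True)
class StringVariant:
    """Hypothetical description of one string dtype variant."""

    storage: str   # "python" (numpy object-dtype) or "pyarrow"
    na_value: str  # "NA" or "NaN"

    def current_alias(self) -> Optional[str]:
        """Look up the alias this combination has today, if any."""
        return _CURRENT_ALIASES[(self.storage, self.na_value)]


variants = [
    StringVariant(storage, na_value)
    for storage in ("python", "pyarrow")
    for na_value in ("NA", "NaN")
]
for variant in variants:
    print(variant, "->", variant.current_alias())
```

Two independent axes give four variants, which is why a single `storage=` keyword ends up with combined values like "pyarrow_numpy".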

simonjayhawkins · 2024-05-04T10:03:05Z

web/pandas/pdeps/00xx-string-dtype.md

+
+- Created: May 3, 2024
+- Status: Under discussion
+- Discussion:


I see no reason not to use #57073 as the discussion issue as any further discussion will be here and #57073 can now focus on whether to reject PDEP-10 and what to do about the planned improvements to other dtypes.

My assumption is that approval of this PDEP should not, in itself, be a justification to overturn the PDEP-10 decision even though they are very much related and the implementation of the fallback option is only applicable if PDEP-10 is formally rejected.

rhshadrach · 2024-05-04T13:09:46Z

@jorisvandenbossche - I've renamed this PDEP-14 to fix the doc build job. The docs build automatically picks up added PDEP PRs for the website, and they need a number for that to succeed.

lithomas1 · 2024-05-04T16:00:54Z

web/pandas/pdeps/00xx-string-dtype.md

+[introduced in pandas 2.1](https://pandas.pydata.org/docs/whatsnew/v2.1.0.html#whatsnew-210-enhancements-infer-strings)
+that is still backed by PyArrow but follows the default missing values semantics
+pandas uses for all other default data types (and using `NaN` as the missing
+value sentinel) ([GH-54792](https://github.com/pandas-dev/pandas/issues/54792)).


The pyarrow_numpy StringArray also returns numpy arrays as results for some operations.

I think this is also important to mention.

At this point, I haven't yet mentioned that the original StringDtype returns masked arrays from operations (only that it uses pd.NA). I only mention that when going more in detail on this topic in the "Missing value semantics" subsection. Given that, I would also leave it here to the generic "missing value semantics" for the new variant as well (to not make the background section even longer. I can certainly expand the "Missing value semantics" section if needed)
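For readers less familiar with the difference being discussed, the following sketch contrasts an object-dtype column with the existing opt-in `"string"` dtype. It assumes pandas is installed with default options (no `future.infer_string`), and it only shows the two behaviours that exist today, not the proposed new default.

```python
import pandas as pd

values = ["a", None, "ccc"]

# Explicit object dtype: the NumPy-based representation of strings.
obj = pd.Series(values, dtype=object)
# Existing opt-in StringDtype, which uses pd.NA as its missing value.
nullable = pd.Series(values, dtype="string")

# Comparisons: a plain bool result where the missing entry compares as False,
# versus a masked "boolean" result that propagates the missing value.
obj_eq = obj == "a"
nullable_eq = nullable == "a"

# String methods: float64 with NaN versus nullable Int64 with <NA>.
obj_len = obj.str.len()
nullable_len = nullable.str.len()

print(obj_eq.dtype, nullable_eq.dtype)
print(obj_len.dtype, nullable_len.dtype)
```

The "missing value semantics" wording in the PDEP covers both aspects shown here: the sentinel itself and the type of array that operations return.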

lithomas1 · 2024-05-04T16:03:19Z

web/pandas/pdeps/00xx-string-dtype.md

+
+To avoid a hard dependency on PyArrow for pandas 3.0, this PDEP proposes to keep
+a "fallback" option in case PyArrow is not installed. The original `StringDtype`
+backed by a numpy object-dtype array of Python strings can be used for this, and


It would be nice to clarify that this is a separate dtype from the original string[python] dtype, just to make it clear that the original StringDtype is not changing (and still will return masked arrays, and use pd.NA as its missing sentinel)

I tried to clarify in the text that it is indeed a new variant of the string dtype and uses a subclass to reuse most code

lithomas1 · 2024-05-04T16:05:40Z

web/pandas/pdeps/00xx-string-dtype.md

+
+For pandas 3.0, this is the most realistic option given this implementation is
+already available for a long time. Beyond 3.0, we can still explore further
+improvements such as using nanoarrow or NumPy 2.0, but at that point that is an


I would drop this bit about nanoarrow (given it is not explained/introduced in the paragraphs beforehand).

If you want to add an explanation above, that's also fine with me.

I added a link to the discussion issues for both numpy 2.0 and nanoarrow, so people can find more explanation there if they want.

lithomas1 · 2024-05-04T16:59:01Z

web/pandas/pdeps/00xx-string-dtype.md

+flag in pandas 2.1 (by `pd.options.future.infer_string = True`).
+
+Some small enhancements or fixes (or naming changes) might still be needed and
+can be backported to pandas 2.2.x.


This part of the plan worries me a little.

Maybe it would be better to cut off a 2.3 from 2.2.x.

I think there's a significant proportion of the downloads for 2.2 that aren't on the latest patch release.
I think there's ~ 1/3 of the downloads that are fetching 2.2.0.

Also, it would be good to mention which version of pandas is expected to have infer_string be able to infer to the object fallback option.

a 2.3 release (maybe around the same time as 3.0rc) sounds reasonable.

If the features/bugfixes added to 2.3 are limited to the string dtype then we shouldn't need many patch releases. We may not need to fix any string dtype related issues that are fixed for 3.0 as these will be behind a flag in 2.3 and so shouldn't break existing code.

On the other hand, as these features are behind a flag, maybe releasing a 2.3 would not gain the field testing we hope for.

And therefore, instead of doing a 2.3, planning for at least a couple of release candidates for 3.0 would better achieve this.

@jorisvandenbossche

Thoughts on this?

Maybe it would be better to cut off a 2.3 from 2.2.x.

Yes, if we still plan to add a deprecation warning and change the naming scheme in StringDtype, calling that 2.3.0 sounds like the best option (I had been planning to propose doing a 2.3.0 (from the 2.2.x branch) anyway to bump the warning for CoW from DeprecationWarning to FutureWarning).

lithomas1 · 2024-05-04T17:22:29Z

web/pandas/pdeps/00xx-string-dtype.md

+
+1. Delaying has a cost: it further postpones introducing a dedicated string
+ dtype that has massive benefits for our users, both in usability as (for the
+ significant part of the user base that has PyArrow installed) in performance.


I don't think we can just claim this. I don't disagree, but this should be backed up more.

At least from the feedback received from #57073 and the other issue, there's at least a significant part of the user base that doesn't use strings.

There's also a significant chunk of the population that can't install pyarrow (due to size requirements or exotic platforms or whatever).

I am not sure this argument is that convincing either, although for slightly different reasons. I don't think we need to feel rushed for the next release

I don't think we can just claim this. I don't disagree, but this should be backed up more.

@lithomas1 can you clarify which part of the paragraph you think requires more backing up?
The fact that I say a "significant" part of our user base has pyarrow installed?

I don't think we can ever know exact numbers for this, but one data point is that pandas currently has 210M monthly downloads and pyarrow has 120M monthly downloads. Of course not all of those pyarrow users are also using pandas, but let's just assume that half of those pyarrow downloads come from people using pandas; that would mean that around 30% of our users already have pyarrow installed, which I would consider a "significant part".
(and my guess is that for people working with larger datasets, where the speed of pyarrow becomes more important, this percentage will be higher, for example because of using the parquet IO)

But anyway, we are never going to know this exact number, but IMO we do know that a significant part of our userbase has pyarrow and will benefit from using that by default.
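The back-of-the-envelope estimate above can be written out as follows. The download counts are the ones quoted in the comment, and the 50% overlap is the comment's own assumption rather than a measured figure.

```python
# Monthly download counts quoted in the comment above.
pandas_monthly_downloads = 210_000_000
pyarrow_monthly_downloads = 120_000_000

# Assumption from the comment: half of the pyarrow downloads come from
# people who also use pandas.
assumed_pandas_share_of_pyarrow = 0.5

estimated_share = (
    pyarrow_monthly_downloads * assumed_pandas_share_of_pyarrow
) / pandas_monthly_downloads

print(f"{estimated_share:.1%}")  # 28.6%, i.e. "around 30%"
```

The result scales linearly with the assumed overlap, so the estimate is only as good as that one guess.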

there's at least a significant part of the user base that doesn't use strings.

Yes, and then this PDEP is not relevant for them. But the fact that some users don't use strings doesn't mean we shouldn't improve the life of those users that do use strings (so I don't really understand how this is a relevant argument).

There's also a significant chunk of the population that can't install pyarrow

Yes, and this PDEP addresses that by allowing a fallback when pyarrow is not installed.

I am not sure this argument is that convincing either, although for slightly different reasons.

@WillAyd can you then clarify which other reasons?

My other reason is that I don't think there is ever a rush to get a release out; we have historically never operated that way

I don't think there is ever a rush to get a release out; we have historically never operated that way

For the last six years, we have roughly released a new feature release every six months. We indeed never rush a specific release if there is something holding it up for a bit, but historically we have been releasing somewhat regularly.

At this point, a next feature release will be 3.0 given the amount of changes we already made on the main branch that require the next release cut from main to be 3.0 and not 2.3 (enforced deprecations etc).
(we can cut a 2.3 release from the 2.2.x maintenance branch, which we might want to do for several reasons, but I am not counting that as a feature release for this discussion, as that will not actually contain features)

So I would say there is not necessarily a rush to do a release with a default "string" dtype (that is up for debate, i.e. this PDEP), but there is some rush to get a 3.0 release out, in the sense that I think we don't want to delay 3.0 for half a year or longer.

So for me delaying the string dtype, essentially means not including it in 3.0 but postponing it to pandas 4.0 (I should maybe be clearer in the paragraph above about that).

And then I try to argue in the text here that postponing it to 4.0 has a cost (or, missed benefit), because we have an implementation we could use for a default string dtype in pandas 3.0, and postponing its introduction means that users will use the sub-optimal object dtype for longer, for (IMO) no good reason.

I don't think we can just claim this. I don't disagree, but this should be backed up more.

@lithomas1 can you clarify which part of the paragraph you think requires more backing up? The fact that I say a "significant" part of our user base has pyarrow installed?

It'd be nice to add how much perf benefits Arrow strings are expected to bring (e.g. 20%? 2x? 10x?).
Putting in the part about how many users have pyarrow would also help.

It'd also be good to elaborate on the usability part. IIUC, the main benefit here is not having to manually check elements to see whether your object dtype'd column contains strings (since I think all the string methods work on object dtype'd columns).

I think it's also fair to amend this part to say "massive benefits to users that use strings" (instead of in general).

Benchmarks are going to be highly dependent on usage and context. If working in an Arrow native ecosystem, the speedup of strings may be a factor over 100x. If working in a space where you have to copy back and forth a lot with NumPy, that number goes way down.

I think trying to set expectations on one number / benchmark for performance is futile, but generally Arrow only helps, and makes it so that we as developers don't need to write custom I/O solutions (eg: ADBC Drivers, parquet, read_csv with pyarrow all work with Arrow natively with no extra pandas dev effort)

It'd be nice to add how much perf benefits Arrow strings are expected to bring (e.g. 20%? 2x? 10x?).

Benchmarks are going to be highly dependent on usage and context.

Indeed, for single operations you can easily get a >10x speedup, but of course a typical workflow does not consist of just string operations, and the overall speedup depends a lot on the workload (see this slide for one small example comparison (https://phofl.github.io/pydata-berlin/pydata-berlin-2023/intro.html#74) and this blogpost from Patrick showing the benefit in a dask example workflow (https://towardsdatascience.com/utilizing-pyarrow-to-improve-pandas-and-dask-workflows-2891d3d96d2b)).

but generally Arrow only helps, and makes it so that we as developers don't need to write custom I/O solutions

That is often true, but except for strings ;).
For strings, the faster compute kernels will still give a lot of value even if your IO wasn't done through Arrow (and give a lot more value compared to using pyarrow for numeric data)

simonjayhawkins

Thanks @jorisvandenbossche for the PDEP.

I am generally in agreement with the motivation for this PDEP on the proviso that any approval is not rejecting PDEP-10. The motivation of accepting PDEP-10 by the team members could have been related to the perceived maintenance burden, a more performant string dtype, interoperability, having better default inference for other data types or maybe some other reason. This current PDEP only addresses one aspect of that decision.

One other aspect that is not mentioned here and was not mentioned in PDEP-10 is the consequences of choosing PyArrow as a backend. Bearing in mind, that it was felt that the implications of using nullable semantics for default dtypes was not discussed, I wonder whether we should have a section that discusses the other implications of choosing PyArrow in this PDEP, e.g. implications of choosing 1d immutable arrays as the backend.

simonjayhawkins · 2024-05-05T09:43:44Z

web/pandas/pdeps/00xx-string-dtype.md

+4. We update installation guidelines to clearly encourage users to install
+ pyarrow for the default user experience.


and do we consider adding a performance warning to the fallback also?

and do we consider adding a performance warning to the fallback also?

I personally wouldn't do that always / for each method, because that would be super noisy (and in some cases, like smallish data, it doesn't matter that much, so getting those warnings would be annoying).

If we wanted to warn users to gently push them towards installing pyarrow, I think we could do a warning but only 1) raise it once, and 2) only when doing one of the string operations on a big enough dataset (with some threshold).
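A minimal sketch of the "warn once, and only above a size threshold" idea described above, using only the standard library. Every name here (`FallbackPerformanceWarning`, `SIZE_THRESHOLD`, `maybe_warn_fallback`) and the threshold value are hypothetical; nothing like this is part of the proposal or of pandas.

```python
import warnings


class FallbackPerformanceWarning(UserWarning):
    """Hypothetical warning category for the object-dtype fallback."""


# Hypothetical cut-off: below this many elements, never warn.
SIZE_THRESHOLD = 1_000_000

_already_warned = False


def maybe_warn_fallback(n_elements: int, pyarrow_available: bool) -> bool:
    """Emit the fallback warning at most once, and only for large inputs.

    Returns True if a warning was emitted by this call.
    """
    global _already_warned
    if pyarrow_available or _already_warned or n_elements < SIZE_THRESHOLD:
        return False
    _already_warned = True
    warnings.warn(
        "String operations are using the object-dtype fallback; "
        "installing pyarrow may improve performance.",
        FallbackPerformanceWarning,
        stacklevel=2,
    )
    return True
```

A string method would call `maybe_warn_fallback(len(values), pyarrow_available)` before doing its work; small inputs and repeated calls stay silent.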

Now, your question reminds me that the current pyarrow-backed string dtype has those fallback warnings for very specific cases, which I personally think we should stop doing when it becomes the default dtype. Given this is already for the existing implementation (and to keep the many discussion lines here a bit more limited), I opened a separate issue for this: #58581.
(but if there is agreement on that other issue, can of course briefly mention that here later)

Fair point. Given the recent user feedback on adding the deprecation warning for the PyArrow requirement, maybe not having any warnings is wise.

that the current pyarrow-backed string dtype has those fallback warnings for very specific cases, which I personally think we should stop doing when it becomes the default dtype.

+1

simonjayhawkins · 2024-05-05T10:28:03Z

web/pandas/pdeps/00xx-string-dtype.md

+ dtype that has massive benefits for our users, both in usability as (for the
+ significant part of the user base that has PyArrow installed) in performance.


Suggested change

dtype that has massive benefits for our users, both in usability as (for the

significant part of the user base that has PyArrow installed) in performance.

dtype that has massive benefits for our users, both in usability and, for users that already have PyArrow installed or have no issues installing PyArrow, in performance.

Co-authored-by: Simon Hawkins <[email protected]>

WillAyd · 2024-05-06T15:05:51Z

web/pandas/pdeps/00xx-string-dtype.md

+1. Delaying has a cost: it further postpones introducing a dedicated string
+ dtype that has massive benefits for our users, both in usability as (for the
+ significant part of the user base that has PyArrow installed) in performance.
+2. In case we eventually transition to use `pd.NA` as the default missing value


the challenges around this will not be unique to the string dtype and
therefore not a reason to delay this.

I might be missing the intent but I don't understand why the larger issue of NA handling means we should be faster to implement this

I don't understand why the larger issue of NA handling means we should be faster to implement this

It's not a reason to do it "faster", but I meant to say that the discussion regarding NA is not a reason to do it "slower" (to delay introducing a dedicated string dtype)

I think the flip side is that if we aren't careful about the NA handling we can introduce some new keywords / terminology that makes it very confusing in the long run (which is essentially one of the problems with our strings naming conventions)

As a practical example, if we decided we wanted semantics= as a keyword argument to StringDtype in this PDEP to move the NA discussion along, that might be counter-productive when we look at more data types and decide semantics= was not a clear way to allow datetime data types to support pd.NaT as the missing value.

(not saying the above is necessarily the truth, just cherry picking from conversation so far)

That's one reason that I personally would prefer not introducing a keyword specifically for the missing value semantics, for now (just for this PDEP / the string dtype). I just listed some options in #58613, and I think we can do without it.

WillAyd · 2024-05-06T15:09:47Z

web/pandas/pdeps/00xx-string-dtype.md

+
+Wouldn't adding even more variants of the string dtype will make things only more
+confusing? Indeed, this proposal unfortunately introduces more variants of the
+string dtype. However, the reason for this is to ensure the actual default user


This just retroactively clarifies the reasoning for string[pyarrow_numpy] to have existed in the first place right? Or is it supposed to be hinting at some other feature that the implementation details of the PDEP is proposing?

Yes, it's indeed explaining why we did this, which is of course "retroactively" given I was asked to write this PDEP partly for changes that have already been released. So a big part of the PDEP is retroactive in that sense (which does not necessarily help to write it clearly ..).

Or is it supposed to be hinting at some other feature that the implementation details of the PDEP is proposing?

However, more importantly, the PDEP makes this (the already added dtype) the default in 3.0. It would remain behind the future flag for the next release if enough people feel we are not ready.

WillAyd · 2024-05-06T15:12:53Z

web/pandas/pdeps/00xx-string-dtype.md

+One other backwards incompatible change is present for early adopters of the
+existing `StringDtype`. In pandas 3.0, calling `pd.StringDtype()` will start
+returning the new default string dtype, while up to now this returned the
+experimental string dtype using `pd.NA` introduced in pandas 1.0. Those users


Historically you would get this by using dtype="string" too right? I'm a little wary that we are underestimating the scope of how breaking this could be; I didn't even realize we considered that dtype experimental all this time

This has been available (as pyarrow backed) since 1.3, so almost three years (July 2, 2021). Even though considered experimental, if the new string dtype is not accepted for 3.0, then maybe a deprecation warning should be added? (We could also do this if decided a 2.3 release is needed?)

A deprecation warning about what exactly?

I'm a little wary that we are underestimating the scope of how breaking this could be

The scope of changing NaN to NA for all users is much bigger though (essentially what was decided in PDEP-10 if we would follow it strictly to the letter).
And similarly if we would in the future change NaN/NaT semantics to NA for all dtypes, the scope will be much bigger (because once that is enabled by default, for example a user that was doing dtype="float64" will probably get the new NA behaviour while now it uses NaN), but we are still considering that (granted, it's exactly those details that we have to discuss a lot more in detail (elsewhere) and figure out, though).

I know that this is not necessarily a good argument to justify this breaking change (because we certainly should be wary of the scope of those breaking changes), but I do want to point out again that the choice in this PDEP to use NaN semantics is to reduce the scope of the breaking changes for most users (at the expense of increasing the scope of breaking changes for the smaller subset of users that was already using dtype="string").
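To make the NaN-versus-NA difference discussed here concrete, here is a minimal pure-Python sketch. It does not use pandas: `ToyNA` is a hypothetical stand-in that only mimics two documented behaviours of `pd.NA` (comparisons propagate the missing value, and its truth value is ambiguous), while `math.nan` stands in for the NaN-based default.

```python
import math


class ToyNA:
    """Hypothetical stand-in for pd.NA (illustration only, not pandas code)."""

    def __eq__(self, other):
        # Comparisons propagate "missing" instead of returning a bool.
        return self

    def __bool__(self):
        # An unknown value has no unambiguous truth value.
        raise TypeError("boolean value of NA is ambiguous")

    def __repr__(self):
        return "<NA>"


NA = ToyNA()


def equals_mask(values, target):
    """Element-wise `value == target`, returned as a plain list."""
    return [v == target for v in values]


nan_values = ["a", math.nan, "b"]
na_values = ["a", NA, "b"]

nan_mask = equals_mask(nan_values, "a")  # every entry is a plain bool
na_mask = equals_mask(na_values, "a")    # the middle entry is NA, not a bool

# With NaN semantics the mask can be used for filtering directly.
nan_filtered = [v for v, keep in zip(nan_values, nan_mask) if keep]
```

Under these toy semantics, code that filters on a comparison keeps working unchanged with NaN, whereas the NA mask has to be handled explicitly (for example by deciding how to treat the missing entry) before it can be used as a condition.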

If we don't want to make dtype="string" breaking, then either we need to come up with a different name for the dtype (not using "string", but something like "utf8" or "text"), or we need to delay introducing a default string dtype until after we have agreement on the NA discussions.

And personally I think "string" is by far the best name (and I find the small breakage worth it for being able to use that name), and as I argued elsewhere (and in the "Why not delay introducing a default string dtype?" section in the PDEP text), I think it is valuable for our users to not wait with adding a dedicated string dtype until we are ready with the NA discussion and implementation.

at the expense of increasing the scope of breaking changes for the smaller subset of users that was already using dtype="string"

This is where I am a little uncomfortable - I don't know how to measure the size of that, but I am wary of assuming it is not a significant number of users. The fact that "string" returns NA as a missing value is a documented difference in our code base:

https://pandas.pydata.org/docs/dev/user_guide/text.html#behavior-differences

And its usage has been promoted for quite some time:

https://stackoverflow.com/a/60553529/621736
https://towardsdatascience.com/why-we-need-to-use-pandas-new-string-dtype-instead-of-object-for-textual-data-6fd419842e24
https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.1.0.html#all-dtypes-can-now-be-converted-to-stringdtype

If we don't want to make dtype="string" breaking, then either we need to come up with a different name for the dtype (not using "string", but something like "utf8" or "text"), or we need to delay introducing a default string dtype until after we have agreement on the NA discussions.

Yeah, none of these options are great... but out of them I would still probably prefer waiting. I think right now we are marching down a path where the missing value for "string" goes through three stages:

Returning pd.NA today

Returning np.nan with this PDEP (granted those changes are already in main)

Going back to returning pd.NA with the NA PDEP

But personally I think dtype="string" meaning something different than the default string dtype you get without specifying the dtype is going to be very confusing.

I think we have to carefully specify what the user specifies in a dtype argument and how that gets interpreted, versus what we return as the dtype when they look at Series.dtype.

So we could have a mapping that says

| User specifies `dtype=` | pandas returns `Series.dtype` |
|---|---|
| Unspecified | `"string[pyarrow_numpy]"` OR `"string[python]"` |
| `"string"` | `"string[pyarrow]"` |
| `StringDtype("pyarrow")` | `"string[pyarrow]"` |
| `StringDtype("python")` | `"string[python]"` |
| `StringDtype("pyarrow_numpy")` | `"string[pyarrow_numpy]"` |

The first row depends on whether pyarrow is installed.
For the second, third and fifth rows, if pyarrow is not installed, we raise an Exception.
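The mapping above can be sketched as a small lookup, purely for illustration. `proposed_series_dtype` is a hypothetical helper, not a pandas function; the dtype specifications are represented as plain strings, and `ImportError` is an assumed choice for the unspecified "Exception" mentioned above.

```python
def proposed_series_dtype(spec, pyarrow_installed):
    """Return the Series.dtype string the table above assigns to `spec`.

    `spec` is None when no dtype is given, otherwise one of the spellings
    in the first column of the table. Hypothetical helper for illustration.
    """
    if spec is None:
        # Row 1: the default depends on whether pyarrow is installed.
        return "string[pyarrow_numpy]" if pyarrow_installed else "string[python]"

    table = {
        "string": "string[pyarrow]",
        'StringDtype("pyarrow")': "string[pyarrow]",
        'StringDtype("python")': "string[python]",
        'StringDtype("pyarrow_numpy")': "string[pyarrow_numpy]",
    }
    result = table[spec]

    # Rows 2, 3 and 5 need pyarrow; ImportError is an assumption here.
    if "pyarrow" in result and not pyarrow_installed:
        raise ImportError(f"{spec} requires pyarrow, which is not installed")
    return result
```

For example, `proposed_series_dtype("string", pyarrow_installed=False)` raises under this table, which is exactly the row that is contested further down in the thread.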

Separately, we can then debate what the values in the second column should look like in #58613 . I personally am not a fan of "pyarrow_numpy"

No, my answer to your example snippet was trying to explain how I would ensure this does not break (returning a bool column instead of an object-dtype column with True/False/NaN will ensure that filtering keeps working).

Ah OK - I didn't realize you were proposing that this change be part of this PDEP; I thought it was just an idea you had for the future. But that's completely new behavior... and it then raises the question of whether we go back and change dtype=object to have that same behavior or have only dtype="string" behave that way. Ultimately we end up with the same issue.

Yeah, I also agree with Will that it's not fair to change this without warning for people already using "string".
(pd.NA is also a big selling point of dtype="string".)

Maybe a good compromise would be to use string[pyarrow] under the hood for those users (if they had it installed)?

If we were to move ahead with the move to nullable dtypes in general, I worry that this changing of the na value for dtype="string" from pd.NA -> np.nan -> pd.NA will cause a lot of confusion.

If we were to do 2.3 (like I suggested below), this might be addressable there (with a deprecation).

Still adding some deprecation warnings in 2.x for current users of StringDtype is something we certainly could do. I am personally ambivalent about it, but fine with adding it if others think that is better (I do think it might become quite noisy, and it also does not change the fact that 3.0 would switch from NA to NaN)

The warning message could then point people to enable pd.options.future.infer_string = True in case they only care about having the (faster) string dtype, or otherwise update their dtype specification if they want the NA instead of NaN version.
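As a rough sketch of what such a warning might look like, using only the standard `warnings` module: the function name and the message wording are hypothetical and not taken from pandas. Only `pd.options.future.infer_string` comes from the comment above.

```python
import warnings


def warn_string_dtype_na_change():
    """Emit a hypothetical 2.x warning; the wording is illustrative only."""
    warnings.warn(
        'In pandas 3.0, dtype="string" will use NaN instead of pd.NA as its '
        "missing value. Set pd.options.future.infer_string = True to opt in "
        "to the new string dtype now, or update the dtype specification to "
        "keep the pd.NA variant.",
        FutureWarning,
        # Point the warning at the caller rather than at this helper.
        stacklevel=2,
    )
```

`FutureWarning` is used here because it is shown to end users by default, which fits a behaviour change rather than a removal. Whether that is the right category is a separate design choice.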

I think we have to carefully specify what the user specifies in a dtype argument and how that gets interpreted, versus what we return as the dtype when they look at Series.dtype.

So we could have a mapping that says

I created a variant of that table #58613 (comment) with a concrete proposal

For the second, third and fifth rows, if pyarrow is not installed, we raise an Exception.

(for clarity, this "second" row referred to specifying a dtype with "string")
If you explicitly ask for pyarrow, then yes raising an exception is fine and expected. But a generic "string" (or StringDtype()) has to mean "whatever string dtype that is the default" and so cannot raise an exception if pyarrow is not installed, but should return the object-dtype based fallback.
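The rule described in this comment can be sketched as follows, as an illustration rather than the actual pandas implementation. `choose_storage` and its parameters are hypothetical names; detecting pyarrow via `importlib.util.find_spec` and raising `ImportError` are assumptions made for the sketch.

```python
import importlib.util


def pyarrow_is_installed():
    """Check for pyarrow without importing it."""
    return importlib.util.find_spec("pyarrow") is not None


def choose_storage(requested=None, have_pyarrow=None):
    """Pick the string storage for a dtype request.

    requested=None models a generic "string" / StringDtype() request;
    "pyarrow" or "python" models an explicit request. Hypothetical helper.
    """
    if requested not in (None, "pyarrow", "python"):
        raise ValueError(f"unknown storage: {requested!r}")
    if have_pyarrow is None:
        have_pyarrow = pyarrow_is_installed()

    if requested is None:
        # Generic request: never raise, fall back to the object-dtype variant.
        return "pyarrow" if have_pyarrow else "python"
    if requested == "pyarrow" and not have_pyarrow:
        # Explicit request for pyarrow: raising is fine and expected.
        raise ImportError("pyarrow storage was requested but pyarrow is not installed")
    return requested
```

The `have_pyarrow` override exists only so that both branches can be exercised on any machine, regardless of whether pyarrow is actually installed.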

jorisvandenbossche · 2024-05-07T11:29:13Z

One of the concrete discussion points is the API design of the StringDtype(..) constructor and the way to distinguish the various variants of the dtype (i.e. the current "pyarrow_numpy" naming we introduced in #54533 / #54792).
To keep that sub-discussion manageable, I opened a dedicated issue for that specific topic: #58613

jbrockmendel · 2024-05-07T14:51:08Z

I'm with Joris pretty much across the board on this. I'm pretty sure @phofl will be too.

jorisvandenbossche · 2024-06-27T19:59:59Z

A reminder that, as mentioned on the mailing list, we will start a vote on this next week if there are no more substantial comments.

jorisvandenbossche · 2024-07-01T18:24:50Z

I started a vote at #59160

phofl · 2024-07-22T20:37:16Z

Update the status to accepted

jorisvandenbossche · 2024-07-24T15:41:41Z

Thanks for the update @phofl

Given the vote passed, I am going to merge this now.

simonjayhawkins · 2024-07-24T15:55:09Z

Thanks @jorisvandenbossche for the significant effort on this. Very worthwhile outcome.

Co-authored-by: Simon Hawkins <[email protected]>
Co-authored-by: Irv Lustig <[email protected]>
Co-authored-by: William Ayd <[email protected]>
Co-authored-by: Richard Shadrach <[email protected]>
Co-authored-by: Patrick Hoefler <[email protected]>

PDEP: Dedicated string data type for pandas 3.0

fbeb69d

jorisvandenbossche added the PDEP pandas enhancement proposal label May 3, 2024

jorisvandenbossche requested a review from datapythonista as a code owner May 3, 2024 15:18

jorisvandenbossche mentioned this pull request May 3, 2024

DISC: Consider not requiring PyArrow in 3.0 #57073

Open

bashtage reviewed May 3, 2024

WillAyd mentioned this pull request May 3, 2024

DISC: nanoarrow-backed ArrowStringArray #58552

Open

3 tasks

small textual edits and typos

f03f54d

WillAyd reviewed May 3, 2024

jbrockmendel reviewed May 3, 2024

jbrockmendel reviewed May 3, 2024

jbrockmendel reviewed May 3, 2024

simonjayhawkins reviewed May 4, 2024

rhshadrach changed the title ~~PDEP: Dedicated string data type for pandas 3.0~~ PDEP-14: Dedicated string data type for pandas 3.0 May 4, 2024

lithomas1 reviewed May 4, 2024

simonjayhawkins approved these changes May 5, 2024

jorisvandenbossche and others added 2 commits May 5, 2024 13:55

address part of the feedback

561de87

Update web/pandas/pdeps/00xx-string-dtype.md

86f4e51

Co-authored-by: Simon Hawkins <[email protected]>

Dr-Irv reviewed May 6, 2024

Dr-Irv reviewed May 6, 2024

WillAyd requested changes May 6, 2024

jorisvandenbossche mentioned this pull request May 7, 2024

Default string dtype (PDEP-14): naming convention to distinguish the dtype variants #58613

Closed

jorisvandenbossche added 3 commits June 14, 2024 20:00

Merge remote-tracking branch 'upstream/main' into pdep-string-dtype

af5ad3c

tiny edit

bd52f39

mismatched quote

f8fbc61

jorisvandenbossche mentioned this pull request Jul 1, 2024

VOTE: Voting issue for PDEP-14: Dedicated string data type for pandas 3.0 #59160

Closed

flying-sheep mentioned this pull request Jul 9, 2024

Nullable string columns scverse/anndata#679

Closed

Update 0014-string-dtype.md

d78462d

phofl approved these changes Jul 22, 2024

Merge remote-tracking branch 'upstream/main' into pdep-string-dtype

4de20d1

jorisvandenbossche merged commit 29b0b28 into pandas-dev:main Jul 24, 2024
15 of 16 checks passed

jorisvandenbossche deleted the pdep-string-dtype branch July 24, 2024 16:08

jorisvandenbossche mentioned this pull request Aug 22, 2024

ENH: Need API support and __repr__ to discover the storage used for strings #59342

Open

mroeschke mentioned this pull request Aug 30, 2024

RLS: 2.3 #59664

Open

6 tasks

		4. We update installation guidelines to clearly encourage users to install
		pyarrow for the default user experience.

		dtype that has massive benefits for our users, both in usability as (for the
		significant part of the user base that has PyArrow installed) in performance.
