PDEP-14: Dedicated string data type for pandas 3.0
- Created: May 3, 2024
- Status: Accepted
- Discussion: https://github.com/pandas-dev/pandas/pull/58551
- Author: Joris Van den Bossche
- Revision: 1
Abstract
This PDEP proposes to introduce a dedicated string dtype that will be used by default in pandas 3.0:
- In pandas 3.0, enable a string dtype ("str") by default, using PyArrow if available or otherwise a string dtype using numpy object-dtype under the hood as fallback.
- The default string dtype will use missing value semantics (using NaN) consistent with the other default data types.
This will give users a long-awaited proper string dtype for 3.0, while 1) not (yet) making PyArrow a hard dependency, but only a dependency used by default, and 2) leaving room for future improvements (different missing value semantics, using NumPy 2.0 strings, etc).
Background
Currently, pandas by default stores text data in an object-dtype NumPy array. The current implementation has two primary drawbacks. First, object dtype is not specific to strings: any Python object can be stored in an object-dtype array, not just strings, and seeing object as the dtype for a column with strings is confusing for users. Second, it is not efficient: all string methods on a Series eventually call Python methods on the individual string objects.
To solve the first issue, a dedicated extension dtype for string data has already been added in pandas 1.0. So far this has been opt-in, requiring users to explicitly request the dtype (with dtype="string" or dtype=pd.StringDtype()). The array backing this string dtype was initially almost the same as the default implementation, i.e. an object-dtype NumPy array of Python strings.
To solve the second issue (performance), pandas contributed to the development of string kernels in the PyArrow package, and a variant of the string dtype backed by PyArrow was added in pandas 1.3. This could be specified with the storage keyword in the opt-in string dtype (pd.StringDtype(storage="pyarrow")).
Since its introduction, the StringDtype has always been opt-in, and has used the experimental pd.NA sentinel for missing values (which was also introduced in pandas 1.0). However, up to this date, pandas has not yet taken the step to use pd.NA for any default dtype, and thus the StringDtype deviates in missing value behaviour compared to the default data types.
In 2023, PDEP-10 proposed to start using a PyArrow-backed string dtype by default in pandas 3.0 (i.e. infer this type for string data instead of object dtype). To ensure we could use the variant of StringDtype backed by PyArrow instead of Python objects (for better performance), it proposed to make pyarrow a new required runtime dependency of pandas.
In the meantime, NumPy has also been working on a native variable-width string data type, which was made available starting with NumPy 2.0. This can provide a potential alternative to PyArrow for implementing a string data type in pandas that is not backed by Python objects.
After acceptance of PDEP-10, two aspects of the proposal have been under reconsideration:
- Based on feedback from users and maintainers from other packages (mostly around installation complexity and size), it has been considered to relax the new pyarrow requirement to not be a hard runtime dependency. In addition, NumPy 2.0 could in the future potentially reduce the need to make PyArrow a required dependency specifically for a dedicated pandas string dtype.
- PDEP-10 did not consider the usage of the experimental pd.NA as a consequence of adopting one of the existing implementations of the StringDtype.
For the second aspect, another variant of the StringDtype was introduced in pandas 2.1 that is still backed by PyArrow but follows the default missing value semantics pandas uses for all other default data types, using NaN as the missing value sentinel (GH-54792). At the time, the storage option for this new variant was called "pyarrow_numpy" to disambiguate from the existing "pyarrow" option using pd.NA (but this PDEP proposes a better naming scheme, see the "Naming" subsection below).
This last dtype variant is what users currently (pandas 2.2) get for string data when enabling the future.infer_string option (to enable the behaviour which is intended to become the default in pandas 3.0).
Proposal
To be able to move forward with a string data type in pandas 3.0, this PDEP proposes:
- For pandas 3.0, a "str" string dtype is enabled by default, i.e. this string dtype will be used as the default dtype for text data when creating pandas objects (e.g. inference in constructors, I/O functions).
- This default string dtype will follow the same behaviour for missing values as other default data types, and use NaN as the missing value sentinel.
- The string dtype will use PyArrow if installed, and otherwise fall back to an in-house functionally-equivalent (but slower) version. This fallback can reuse (with minor code additions) the existing numpy object-dtype backed StringArray for its implementation.
- Installation guidelines are updated to clearly encourage users to install pyarrow for the default user experience.
Those string dtypes enabled by default will then no longer be considered as experimental.
Default inference of a string dtype
By default, pandas will infer this new string dtype instead of object dtype for string data (when creating pandas objects, such as in constructors or IO functions).
In pandas 2.2, the existing future.infer_string option can be used to opt in to the future default behaviour:
>>> pd.options.future.infer_string = True
>>> pd.Series(["a", "b", None])
0 a
1 b
2 NaN
dtype: string
Right now (pandas 2.2), the existing option only enables the PyArrow-based future dtype. For the remaining 2.x releases, this option will be expanded to also work when PyArrow is not installed to enable the object-dtype fallback in that case.
Missing value semantics
As mentioned in the background section, the original StringDtype has always used the experimental pd.NA sentinel for missing values. In addition to using pd.NA as the scalar for a missing value, this essentially means that:
- String columns follow "NA-semantics" for missing values, where NA propagates in boolean operations such as comparisons or predicates.
- Operations on the string column that give a numeric or boolean result use the nullable Integer/Float/Boolean data types (e.g. ser.str.len() returns the nullable "Int64" / pd.Int64Dtype() dtype instead of the numpy int64 dtype (or float64 in case of missing values)).
However, up to this date, all other default data types still use NaN semantics for missing values. Therefore, this proposal says that a new default string dtype should also still use the same default missing value semantics and return default data types when doing operations on the string column, to be consistent with the other default dtypes at this point.
In practice, this means that the default string dtype will use NaN as the missing value sentinel, and:
- String columns will follow NaN-semantics for missing values, where NaN gives False in boolean operations such as comparisons or predicates.
- Operations on the string column that give a numeric or boolean result will use the default data types (i.e. numpy int64 / float64 / bool).
Because the original StringDtype implementations already use pd.NA and return masked integer and boolean arrays in operations, a new variant of the existing dtypes that uses NaN and default data types was needed. The original variant of StringDtype using pd.NA will continue to be available for those who were already using it.
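The difference between the two sets of semantics can be illustrated with dtypes that exist in every pandas release since 1.0. This is a minimal sketch rather than part of the proposal: it assumes an explicit object dtype as a stand-in for the NaN semantics that the new default dtype keeps, and uses the opt-in "string" dtype to show NA semantics. All variable names are made up for the example.

```python
import pandas as pd

values = ["apple", None, "fig"]

# Explicit object dtype: NaN semantics, used here as a stand-in for the
# behaviour the proposed default string dtype is meant to keep.
nan_style = pd.Series(values, dtype=object)

# Opt-in "string" dtype: NA semantics with pd.NA as the sentinel.
na_style = pd.Series(values, dtype="string")

# Numeric results: numpy float64 (because of the missing value) vs nullable Int64.
nan_len = nan_style.str.len()
na_len = na_style.str.len()

# Comparisons: the missing value compares as False vs propagating as <NA>.
nan_eq = nan_style == "apple"
na_eq = na_style == "apple"

print(nan_len.dtype, na_len.dtype)  # float64 Int64
print(nan_eq.dtype, na_eq.dtype)    # bool boolean
```

With NaN semantics the comparison result can be used directly as a boolean mask, while with NA semantics the missing entry has to be filled first (for example with `na_eq.fillna(False)`).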
Object-dtype "fallback" implementation
To avoid a hard dependency on PyArrow for pandas 3.0, this PDEP proposes to keep a "fallback" option in case PyArrow is not installed. The original StringDtype backed by a numpy object-dtype array of Python strings can be mostly reused for this (adding a new variant of the dtype) and a new StringArray subclass only needs minor changes to follow the above-mentioned missing value semantics (GH-58451).
For pandas 3.0, this is the most realistic option given this implementation has already been available for a long time. Beyond 3.0, further improvements such as using NumPy 2.0 (GH-58503) or nanoarrow (GH-58552) can still be explored, but at that point that is an implementation detail that should not have a direct impact on users (except for performance).
For the original variant of StringDtype using pd.NA, currently the default storage is "python" (the object-dtype based implementation). Also for this variant, it is proposed to follow the same logic for determining the default storage, i.e. default to "pyarrow" if available, and otherwise fall back to "python".
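The rule for choosing the default storage can be sketched in a few lines. The helper below is hypothetical (`default_string_storage` is not a pandas function), and it assumes that "PyArrow is available" simply means the `pyarrow` package can be found by the import system; pandas itself may apply further checks, such as a minimum version.

```python
from importlib.util import find_spec
from typing import Optional


def default_string_storage(pyarrow_installed: Optional[bool] = None) -> str:
    """Return "pyarrow" when PyArrow can be imported, otherwise "python".

    Hypothetical helper for illustration only. Passing an explicit boolean
    skips the environment check, which is convenient for testing.
    """
    if pyarrow_installed is None:
        # find_spec returns None when no top-level package of that name exists.
        pyarrow_installed = find_spec("pyarrow") is not None
    return "pyarrow" if pyarrow_installed else "python"


print(default_string_storage())
```

The same rule applies to both the NaN variant and the pd.NA variant, so only the missing value sentinel distinguishes the two defaults.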
Naming
Given the long history of this topic, naming the dtypes is difficult.
In the first place, it should be acknowledged that most users should not need to use storage-specific options. Users are expected to specify a generic name (such as "str" or "string"), and that will give them their default string dtype (which depends on whether PyArrow is installed or not).
For the generic string alias to specify the dtype, "string" is already used for the StringDtype using pd.NA. This PDEP proposes to use "str" for the new default StringDtype using NaN. This ensures backwards compatibility for code using dtype="string", and was also chosen because dtype="str" or dtype=str currently already works to ensure your data is converted to strings (only using object dtype for the result).
But for testing purposes and advanced use cases that want control over the exact variant of the StringDtype, we need some way to specify this and distinguish them from the other string dtypes.
Currently (pandas 2.2), StringDtype(storage="pyarrow_numpy") is used for the new variant using NaN, where the "pyarrow_numpy" storage was used to disambiguate from the existing "pyarrow" option using pd.NA. However, "pyarrow_numpy" is a rather confusing option and doesn't generalize well. Therefore, this PDEP proposes a new naming scheme as outlined below, and "pyarrow_numpy" will be deprecated as an alias in pandas 2.3 and removed in pandas 3.0.
The storage keyword of StringDtype is kept to disambiguate the underlying storage of the string data (using pyarrow or python objects), but an additional na_value keyword is introduced to disambiguate the variants using NA semantics and NaN semantics.
Overview of the different ways to specify a dtype and the resulting concrete dtype of the data:
User specification | Concrete dtype | String alias | Note |
---|---|---|---|
Unspecified (inference) | StringDtype(storage="pyarrow"\|"python", na_value=np.nan) | "str" | (1) |
"str" or StringDtype(na_value=np.nan) | StringDtype(storage="pyarrow"\|"python", na_value=np.nan) | "str" | (1) |
StringDtype("pyarrow", na_value=np.nan) | StringDtype(storage="pyarrow", na_value=np.nan) | "str" | |
StringDtype("python", na_value=np.nan) | StringDtype(storage="python", na_value=np.nan) | "str" | |
StringDtype("pyarrow") | StringDtype(storage="pyarrow", na_value=pd.NA) | "string[pyarrow]" | |
StringDtype("python") | StringDtype(storage="python", na_value=pd.NA) | "string[python]" | |
"string" or StringDtype() | StringDtype(storage="pyarrow"\|"python", na_value=pd.NA) | "string[pyarrow]" or "string[python]" | (1) |
StringDtype("pyarrow_numpy") | StringDtype(storage="pyarrow", na_value=np.nan) | "string[pyarrow_numpy]" | (2) |
Notes:
- (1) You get "pyarrow" or "python" depending on pyarrow being installed.
- (2) "pyarrow_numpy" is kept temporarily because this is already in a released version, but it will be deprecated in 2.x and removed for 3.0.
For the new default string dtype, only the "str" alias can be used to specify the dtype as a string, i.e. pandas would not provide a way to make the underlying storage (pyarrow or python) explicit through the string alias. This string alias is only a convenience shortcut and for most users "str" is sufficient (they don't need to specify the storage), and the explicit pd.StringDtype(storage=..., na_value=np.nan) is still available for more fine-grained control.
Also for the existing variant using pd.NA, specifying the storage through the string alias could be deprecated, but that is left for a separate decision.
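The mapping in the overview table can be read as a small resolution function. The sketch below models it in plain Python without importing pandas; `ResolvedStringDtype`, `resolve` and `string_alias` are hypothetical names invented for this illustration, and the missing value sentinel is kept as text ("np.nan" or "pd.NA") purely to keep the example dependency-free.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResolvedStringDtype:
    """Hypothetical stand-in for a concrete StringDtype (not a pandas class)."""

    storage: str   # "pyarrow" or "python"
    na_value: str  # "np.nan" or "pd.NA", kept as text so pandas is not needed


def resolve(storage: Optional[str], na_value: str, pyarrow_installed: bool) -> ResolvedStringDtype:
    """Mirror the "Concrete dtype" column of the overview table."""
    if storage == "pyarrow_numpy":
        # Legacy spelling from released versions: PyArrow storage, NaN semantics.
        return ResolvedStringDtype("pyarrow", "np.nan")
    if storage is None:
        # Note (1): the default storage depends on PyArrow being installed.
        storage = "pyarrow" if pyarrow_installed else "python"
    return ResolvedStringDtype(storage, na_value)


def string_alias(dtype: ResolvedStringDtype) -> str:
    """Mirror the "String alias" column (the legacy alias is not modelled)."""
    if dtype.na_value == "np.nan":
        return "str"  # the storage is deliberately not exposed in this alias
    return f"string[{dtype.storage}]"


# dtype="str" on a machine without PyArrow
default_dtype = resolve(None, "np.nan", pyarrow_installed=False)
# dtype="string" on a machine with PyArrow
nullable_dtype = resolve(None, "pd.NA", pyarrow_installed=True)

print(default_dtype, string_alias(default_dtype))
print(nullable_dtype, string_alias(nullable_dtype))
```

The asymmetry in `string_alias` reflects the choice described above: the NaN variant always reports "str", while the pd.NA variant still encodes its storage in the alias.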
Alternatives
Why not delay introducing a default string dtype?
To avoid introducing a new string dtype while other discussions and changes are in flux (eventually making pyarrow a required dependency? adopting pd.NA as the default missing value sentinel? using the new NumPy 2.0 capabilities? overhauling all our dtypes to use a logical data type system?), introducing a default string dtype could also be delayed until there is more clarity in those other discussions. Specifically, it would avoid temporarily switching to use NaN for the string dtype, while in a future version we might switch back to pd.NA by default.
However:
- Delaying has a cost: it further postpones introducing a dedicated string dtype that has significant benefits for users, both in usability and (for the part of the user base that has PyArrow installed) in performance.
- In case pandas eventually transitions to use pd.NA as the default missing value sentinel, a migration path for all pandas data types will be needed, and thus the challenges around this will not be unique to the string dtype and therefore not a reason to delay this.
Making this change now for 3.0 will benefit the majority of users, and the PDEP author believes this is worth the cost of the added complexity around "yet another dtype" (also for other data types we already have multiple variants).
Why not use the existing StringDtype with pd.NA?
Wouldn't adding even more variants of the string dtype make things only more confusing? Indeed, this proposal unfortunately introduces more variants of the string dtype. However, the reason for this is to ensure the actual default user experience is less confusing, and the new string dtype fits better with the other default data types.
If the new default string data type would use pd.NA, then after some operations, a user can easily end up with a DataFrame that mixes columns using NaN semantics and columns using NA semantics (and thus a DataFrame that could have columns with two different int64, two different float64, two different bool, etc dtypes). This would lead to a very confusing default experience.
The proposed new variant of the StringDtype ensures that, for the default experience, a user will see only one kind of integer dtype, only one kind of bool dtype, etc. For now, a user should only get columns using pd.NA when explicitly opting into this.
Naming alternatives
An initial version of this PDEP proposed to use the "string" alias and the default pd.StringDtype() class constructor for the new default dtype. However, that caused a lot of discussion around backwards compatibility for existing users of dtype=pd.StringDtype() and dtype="string", which use pd.NA to represent missing values.
During the discussion, several alternatives were brought up, both alternative keyword names and the use of a different constructor. In the end, this PDEP proposes to use a different string alias ("str") but to keep using the existing pd.StringDtype (with the existing storage keyword but with an additional na_value keyword) for now to keep the changes as minimal as possible, leaving a larger overhaul of the dtype system (potentially including different constructor functions or namespace) for a future discussion. See GH-58613 for the full discussion.
One consequence is that when using the class constructor for the default dtype, it has to be used with non-default arguments, i.e. a user needs to specify pd.StringDtype(na_value=np.nan) to get the default dtype using NaN. Therefore, the pandas documentation will focus on the usage of dtype="str".
Backward compatibility
The most visible backwards incompatible change will be that columns with string data will no longer have an object dtype. Therefore, code that assumes object dtype (such as ser.dtype == object) will need to be updated. This change is done as a hard break in a major release, as warning in advance for the changed inference is deemed too noisy.
To allow testing code in advance, the pd.options.future.infer_string = True option is available for users.
Otherwise, the actual string-specific functionality (such as the .str accessor methods) should generally all keep working as is.
By preserving the current missing value semantics, this proposal is also mostly backwards compatible on this aspect. When storing strings in object dtype, pandas however did allow using None as the missing value indicator as well (and in certain cases such as the shift method, pandas even introduced this itself). For all the cases where currently None was used as the missing value sentinel, this will change to consistently use NaN.
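One way to write code that behaves the same before and after the change is sketched below. It is an illustration under stated assumptions, not a recommendation from this PDEP: `may_hold_text` is a hypothetical helper that accepts both object dtype and any StringDtype variant, and `pd.isna` is used so that the check does not depend on whether the missing value is None or NaN.

```python
import pandas as pd


def may_hold_text(ser: pd.Series) -> bool:
    """Accept both the old object dtype and any StringDtype variant.

    Hypothetical helper: object dtype can hold arbitrary Python objects,
    so True only means that the column may contain strings.
    """
    return ser.dtype == object or isinstance(ser.dtype, pd.StringDtype)


names = pd.Series(["ada", "grace"])
shifted = names.shift(1)  # introduces a missing value in the first position

# pd.isna recognises both None and NaN, so the mask does not depend on
# which sentinel a given pandas version places in the shifted column.
missing_mask = pd.isna(shifted)

print(may_hold_text(names), missing_mask.tolist())
```

A direct `ser.dtype == object` check, by contrast, would start returning False for string columns once the new dtype is inferred by default.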
For existing users of StringDtype
Existing code that already opted in to use the StringDtype using pd.NA should generally keep working as is. The latest version of this PDEP preserves the behaviour of dtype="string" or dtype=pd.StringDtype() to mean the pd.NA variant of the dtype.
It does propose to change the default storage to "pyarrow" (if available) for the opt-in pd.NA variant as well, but this should have limited, if any, user-visible impact.
Timeline
The future PyArrow-backed string dtype was already made available behind a feature flag in pandas 2.1 (enabled by pd.options.future.infer_string = True).
The variant using numpy object-dtype can also be backported to the 2.2.x branch to allow easier testing. It is proposed to release this as 2.3.0 (created from the 2.2.x branch, given that the main branch already includes many other changes targeted for 3.0), together with the changes to the naming scheme.
The 2.3.0 release would then have all future string functionality available (both the pyarrow and object-dtype based variants of the default string dtype).
For pandas 3.0, this future.infer_string flag becomes enabled by default.
PDEP-14 History
- 3 May 2024: Initial version