pyspark.sql.functions.split

pyspark.sql.functions.split(str, pattern, limit=-1)

Splits str around matches of the given pattern.

New in version 1.5.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters
str : Column or column name

a string expression to split

pattern : Column or literal string

a string representing a regular expression. The regex string should be a Java regular expression.

Changed in version 4.0.0: pattern now accepts a Column. It does not accept a column name, because a plain string remains interpreted as a regular expression for backwards compatibility. In addition to int, limit now accepts a Column or a column name.

limit : Column or column name or int

an integer which controls the number of times pattern is applied.

  • limit > 0: The resulting array’s length will not be more than limit, and the
    resulting array’s last entry will contain all input beyond the last matched pattern.
  • limit <= 0: pattern will be applied as many times as possible, and the resulting
    array can be of any size.

Changed in version 3.0: split now takes an optional limit argument. If not provided, the default limit value is -1.

Returns
Column

array of separated strings.

Examples

Example 1: Split with a constant pattern

>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame([('oneAtwoBthreeC',)], ['s',])
>>> df.select('*', sf.split(df.s, '[ABC]')).show()
+--------------+-------------------+
|             s|split(s, [ABC], -1)|
+--------------+-------------------+
|oneAtwoBthreeC|[one, two, three, ]|
+--------------+-------------------+
>>> df.select('*', sf.split(df.s, '[ABC]', 2)).show()
+--------------+------------------+
|             s|split(s, [ABC], 2)|
+--------------+------------------+
|oneAtwoBthreeC| [one, twoBthreeC]|
+--------------+------------------+
>>> df.select('*', sf.split('s', '[ABC]', -2)).show()
+--------------+-------------------+
|             s|split(s, [ABC], -2)|
+--------------+-------------------+
|oneAtwoBthreeC|[one, two, three, ]|
+--------------+-------------------+
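
Because pattern is parsed as a Java regular expression, regex metacharacters such as the dot must be escaped to split on them literally. A minimal illustrative sketch, not part of the original examples, reusing the spark session assumed by the doctests above:

>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame([('192.168.0.1',)], ['s'])  # illustrative data
>>> df.select('*', sf.split('s', r'\.')).show()  # escape '.' to split on it literally
+-----------+----------------+
|          s|split(s, \., -1)|
+-----------+----------------+
|192.168.0.1|[192, 168, 0, 1]|
+-----------+----------------+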

Example 2: Split with columns containing different patterns and limits

>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame([
...     ('oneAtwoBthreeC', '[ABC]', 2),
...     ('1A2B3C', '[1-9]+', 1),
...     ('aa2bb3cc4', '[1-9]+', -1)], ['s', 'p', 'l'])
>>> df.select('*', sf.split(df.s, df.p)).show()
+--------------+------+---+-------------------+
|             s|     p|  l|    split(s, p, -1)|
+--------------+------+---+-------------------+
|oneAtwoBthreeC| [ABC]|  2|[one, two, three, ]|
|        1A2B3C|[1-9]+|  1|        [, A, B, C]|
|     aa2bb3cc4|[1-9]+| -1|     [aa, bb, cc, ]|
+--------------+------+---+-------------------+
>>> df.select(sf.split('s', df.p, 'l')).show()
+-----------------+
|   split(s, p, l)|
+-----------------+
|[one, twoBthreeC]|
|         [1A2B3C]|
|   [aa, bb, cc, ]|
+-----------------+
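
A common follow-up, shown here as an assumed usage sketch rather than part of this page, is to pull individual tokens out of the returned array with Column.getItem:

>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame([('2024-05-17',)], ['d'])  # illustrative data, not from the upstream docs
>>> parts = sf.split('d', '-')
>>> df.select(
...     parts.getItem(0).alias('year'),
...     parts.getItem(1).alias('month'),
...     parts.getItem(2).alias('day')).show()
+----+-----+---+
|year|month|day|
+----+-----+---+
|2024|   05| 17|
+----+-----+---+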