pyspark.sql.functions.split
pyspark.sql.functions.split(str, pattern, limit=-1)
Splits str around matches of the given pattern.
New in version 1.5.0.
Changed in version 3.4.0: Supports Spark Connect.
- Parameters
- str : Column or column name
a string expression to split.
- pattern : Column or literal string
a string representing a regular expression. The regex string should be a Java regular expression.
Changed in version 4.0.0: pattern now accepts a column. It does not accept a column name, since a plain string is still interpreted as a regular expression for backwards compatibility. In addition to int, limit now accepts a column and a column name.
- limit : Column or column name or int
an integer which controls the number of times pattern is applied.
limit > 0: The resulting array's length will not be more than limit, and the resulting array's last entry will contain all input beyond the last matched pattern.
limit <= 0: pattern will be applied as many times as possible, and the resulting array can be of any size.
Changed in version 3.0: split now takes an optional limit field. If not provided, the default limit value is -1.
- Returns
Column
array of separated strings.
Examples
Example 1: Split with a constant pattern
>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame([('oneAtwoBthreeC',)], ['s',])
>>> df.select('*', sf.split(df.s, '[ABC]')).show()
+--------------+-------------------+
|             s|split(s, [ABC], -1)|
+--------------+-------------------+
|oneAtwoBthreeC|[one, two, three, ]|
+--------------+-------------------+

>>> df.select('*', sf.split(df.s, '[ABC]', 2)).show()
+--------------+------------------+
|             s|split(s, [ABC], 2)|
+--------------+------------------+
|oneAtwoBthreeC| [one, twoBthreeC]|
+--------------+------------------+

>>> df.select('*', sf.split('s', '[ABC]', -2)).show()
+--------------+-------------------+
|             s|split(s, [ABC], -2)|
+--------------+-------------------+
|oneAtwoBthreeC|[one, two, three, ]|
+--------------+-------------------+
Example 2: Split with a column containing different patterns and limits
>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame([
...     ('oneAtwoBthreeC', '[ABC]', 2),
...     ('1A2B3C', '[1-9]+', 1),
...     ('aa2bb3cc4', '[1-9]+', -1)], ['s', 'p', 'l'])
>>> df.select('*', sf.split(df.s, df.p)).show()
+--------------+------+---+-------------------+
|             s|     p|  l|    split(s, p, -1)|
+--------------+------+---+-------------------+
|oneAtwoBthreeC| [ABC]|  2|[one, two, three, ]|
|        1A2B3C|[1-9]+|  1|        [, A, B, C]|
|     aa2bb3cc4|[1-9]+| -1|     [aa, bb, cc, ]|
+--------------+------+---+-------------------+

>>> df.select(sf.split('s', df.p, 'l')).show()
+-----------------+
|   split(s, p, l)|
+-----------------+
|[one, twoBthreeC]|
|         [1A2B3C]|
|   [aa, bb, cc, ]|
+-----------------+
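Additional sketches (illustrative additions, not taken from the upstream docstring; the data and column names below are made up, and the outputs shown are what these inputs should produce).

Because pattern is a Java regular expression, regex metacharacters such as a literal dot must be escaped or wrapped in a character class before splitting on them:

>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame([('192.168.0.1',)], ['ip'])
>>> df.select('*', sf.split('ip', '[.]')).show()
+-----------+------------------+
|         ip|split(ip, [.], -1)|
+-----------+------------------+
|192.168.0.1|  [192, 168, 0, 1]|
+-----------+------------------+

The returned array column can be indexed like any array column, or flattened into one row per element with explode:

>>> df2 = spark.createDataFrame([('a,b,c',)], ['s'])
>>> df2.select(sf.split('s', ',')[0].alias('first')).show()
+-----+
|first|
+-----+
|    a|
+-----+
>>> df2.select(sf.explode(sf.split('s', ',')).alias('piece')).show()
+-----+
|piece|
+-----+
|    a|
|    b|
|    c|
+-----+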