For Spark >= 2.4, we can use higher-order functions to work with arrays, and they cover this problem as well. Let's say df is the dataframe.
df = spark.createDataFrame([
    ("Barkley likes people. Barkley likes treats. Barkley likes everything.", [22, 22, 25]),
    ("A sentence. Another sentence.", [13, 18]),
    ("One sheep. Two sheep. Three sheep. Four sheep.", [11, 12, 13, 12])],
    "col1:string, col2:array<int>")
df.show()
# +--------------------+----------------+
# |                col1|            col2|
# +--------------------+----------------+
# |Barkley likes peo...|    [22, 22, 25]|
# |A sentence. Anoth...|        [13, 18]|
# |One sheep. Two sh...|[11, 12, 13, 12]|
# +--------------------+----------------+
To slice the sentences from col1, the substring function is used; it takes a start position and a length as arguments. col2 holds the length of every sentence in the string. The start position of each sentence is the cumulative sum of the col2 entries before it (elements 0 to n-1 for the sentence at index n), as hinted in the question. To get that, use the higher-order functions transform and aggregate. After that, extract every sentence and use map_from_entries to create a map from each sentence's index to the sentence. This is an example of how to do so.
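import pyspark.sql.functions as F

df = (df
    # start[i] = sum of the first i lengths in col2, i.e. the running offset of sentence i
    .withColumn("start", F.expr("transform(transform(col2, (v1,i) -> slice(col2, 1, i)), v2 -> aggregate(v2, 0, (a,b) -> a + b))"))
    # pair each index (i+1) with the substring of col1 taken at start[i] with length col2[i]
    .withColumn("sentences", F.expr("transform(col2, (v, i) -> struct(i+1 as index, substring(col1, start[i], col2[i]) as sentence))"))
    # turn the array of (index, sentence) structs into a map
    .selectExpr("col1", "map_from_entries(sentences) as sentences")
)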
df.show(truncate=False)
# +---------------------------------------------------------------------+------------------------------------------------------------------------------------------+
# |col1 |sentences |
# +---------------------------------------------------------------------+------------------------------------------------------------------------------------------+
# |Barkley likes people. Barkley likes treats. Barkley likes everything.|[1 -> Barkley likes people. , 2 -> Barkley likes treats., 3 -> Barkley likes everything]|
# |A sentence. Another sentence. |[1 -> A sentence. A, 2 -> Another sentence.] |
# |One sheep. Two sheep. Three sheep. Four sheep. |[1 -> One sheep. , 2 -> Two sheep. , 3 -> Three sheep. , 4 -> Four sheep.] |
# +---------------------------------------------------------------------+------------------------------------------------------------------------------------------+
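
To check the cumulative-sum idea without a Spark session, the same logic can be sketched in plain Python. This is only an illustration: split_by_lengths is a hypothetical helper, not part of PySpark, and it assumes the lengths include any trailing space. Python slices are 0-based while Spark's substring counts positions from 1, so the boundaries here need not match the table above character for character.

```python
from itertools import accumulate


def split_by_lengths(text, lengths):
    """Hypothetical helper: return {1-based index: sentence} for the given lengths."""
    # 0-based start offsets: the running sum of the lengths that precede each
    # element, i.e. the values the nested transform/aggregate expression computes.
    starts = [0] + list(accumulate(lengths))[:-1]
    # Slice each sentence out of the text and key it by its 1-based index,
    # mirroring struct(i+1 as index, ...) followed by map_from_entries.
    return {i + 1: text[s:s + n] for i, (s, n) in enumerate(zip(starts, lengths))}


row = "Barkley likes people. Barkley likes treats. Barkley likes everything."
print(split_by_lengths(row, [22, 22, 25]))
```

For the first sample row the offsets come out as [0, 22, 44], which is what the start column holds in the Spark version.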