2

I'm having a little trouble formulating this question but I'll try to explain. I understand how to explode a single column of an array, but I have multiple array columns where the arrays line up with each other in terms of index-values. In my dataframe, exploding each column basically just does a useless cross join resulting in dozens of invalid rows. So I'm going to start here by showing the data.

This shows some results from SparkNLP with some text and four sets of features of the text. Each column from tr to nr contains an array. Each of these arrays lines up with the others.

+--+---------------------+---------------------+----------------------+--------------------+--------------------+
|ID|                 text|                   tr|                    lr|                  pr|                  nr|
+--+---------------------+---------------------+----------------------+--------------------+--------------------+
|10|  thing: MacKay rolls|[thing, :, MacKay,...|[thing, :, MacKay, ...|   [NN, :, NNP, NNS]|    [O, O, I-PER, O]|
|11|thing: MacKay roll...|[thing, :, MacKay,...|[thing, :, MacKay, ...|[NN, :, NNP, NNS,...|[O, O, I-PER, O, ...|
|12| * I would like to...| [*, I, would, lik...|  [*, I, would, lik...|[NN, PRP, MD, VB,...|[O, O, O, O, O, O...|
+--+---------------------+---------------------+----------------------+--------------------+--------------------+

What I want is a new dataframe with the ID and text plus each ith item in all arrays on a single row, like shown below for the above dataframe:

+--+---------------------+---------------------+----------------------+--------------------+--------------------+------+-------+---+-----+
|ID|                 text|                   tr|                    lr|                  pr|                  nr| token|  lemma|pos|  ner|
+--+---------------------+---------------------+----------------------+--------------------+--------------------+------+-------+---+-----+
|10|  thing: MacKay rolls|[thing, :, MacKay,...|[thing, :, MacKay, ...|   [NN, :, NNP, NNS]|    [O, O, I-PER, O]| thing|  thing| NN|    O|
|10|  thing: MacKay rolls|[thing, :, MacKay,...|[thing, :, MacKay, ...|   [NN, :, NNP, NNS]|    [O, O, I-PER, O]|     :|      :|  :|    O|
|10|  thing: MacKay rolls|[thing, :, MacKay,...|[thing, :, MacKay, ...|   [NN, :, NNP, NNS]|    [O, O, I-PER, O]|MacKay| MacKay|NNP|I-PER|
|10|  thing: MacKay rolls|[thing, :, MacKay,...|[thing, :, MacKay, ...|   [NN, :, NNP, NNS]|    [O, O, I-PER, O]| rolls|   roll|NNS|    O|
|11|thing: MacKay roll...|[thing, :, MacKay,...|[thing, :, MacKay, ...|[NN, :, NNP, NNS,...|[O, O, I-PER, O, ...| thing|  thing| NN|    O|
|11|thing: MacKay roll...|[thing, :, MacKay,...|[thing, :, MacKay, ...|[NN, :, NNP, NNS,...|[O, O, I-PER, O, ...|     :|      :|  :|    O|
|11|thing: MacKay roll...|[thing, :, MacKay,...|[thing, :, MacKay, ...|[NN, :, NNP, NNS,...|[O, O, I-PER, O, ...|MacKay| MacKay|NNP|I-PER|
|11|thing: MacKay roll...|[thing, :, MacKay,...|[thing, :, MacKay, ...|[NN, :, NNP, NNS,...|[O, O, I-PER, O, ...|  roll|   roll|NNS|    O|
|11|...
...
|12| * I would like to...| [*, I, would, lik...|  [*, I, would, lik...|[NN, PRP, MD, VB,...|[O, O, O, O, O, O...|     *|      *| NN|    O|
|12| * I would like to...| [*, I, would, lik...|  [*, I, would, lik...|[NN, PRP, MD, VB,...|[O, O, O, O, O, O...|     I|      I|PRP|    O|
|12| * I would like to...| [*, I, would, lik...|  [*, I, would, lik...|[NN, PRP, MD, VB,...|[O, O, O, O, O, O...| would|  would| MD|    O|
|12| * I would like to...| [*, I, would, lik...|  [*, I, would, lik...|[NN, PRP, MD, VB,...|[O, O, O, O, O, O...|  like|   like| VB|    O|
|12| * I would like to...| [*, I, would, lik...|  [*, I, would, lik...|[NN, PRP, MD, VB,...|[O, O, O, O, O, O...|    to|    ...|...|    O|
|12|...
...
+--+---------------------+---------------------+----------------------+--------------------+--------------------+------+-------+---+-----+

I don't need the tr through nr columns in the output but left them for clarity.

Is there a way to accomplish this?

Additionally, would there also be a way to extract the array index at the same time (add to the output row)?

  • you can use the spark sql [`pos_explode`](https://spark.apache.org/docs/latest/api/sql/index.html#posexplode) function. – Vamsi Prabhala Feb 04 '20 at 20:46
  • @Vamsi Prabhala, the results from posexplode are not what I'm looking for unless I coded it incorrectly. From the first row in my example, it generates four rows, with one row for each column and the index of each column shown, rather than the array (0 for tr, 1 for lr, etc.). I used `ann3.select($"ID", $"text", posexplode(array($"tr", $"lr", $"pr", $"nr")))` – caileagleisg Feb 04 '20 at 22:11
  • you have to `posexplode` each column and `join` the result on index. – Vamsi Prabhala Feb 04 '20 at 22:12
  • Does this answer your question? [Explode (transpose?) multiple columns in Spark SQL table](https://stackoverflow.com/questions/33220916/explode-transpose-multiple-columns-in-spark-sql-table) – 10465355 Feb 04 '20 at 22:35

1 Answers1

0

In this scenario, what you would like to do is explode individual columns using the withColumn expression. Lets assume that you have the dataset loaded as an initial data frame df. Now you would like to achieve something like below.

      val df = <load initial dataset>
      val df1  = df.select($"id", $"text",$"tr", $"lr", $"pr", $"nr").withColumn("tr", explode($"tr")).withColumn("lr",explode($"lr")).withColumn("pr",explode($"pr")).withColumn("nr",explode($"nr"))

This would result in adding the array values to records, tagged with the ID & Text. A drawback of this approach will be the increase in the record count and duplication of non-array column.