Questions tagged [foundry-python-transform]
33 questions
7
votes
2 answers
Why is my build hanging / taking a long time to generate my query plan with many unions?
I notice when I run the same code as my example over here but with a union or unionByName or unionAll instead of the join, my query planning takes significantly longer and can result in a driver OOM.
Code included here for reference, with a slight…

vanhooser
- 1,497
- 3
- 19
5
votes
1 answer
How do I parse xml documents in Palantir Foundry?
I have a set of .xml documents that I want to parse.
I have previously tried to parse them using methods that take the file contents and dump them into a single cell; however, I've noticed this doesn't work in practice since I'm seeing slower and…

vanhooser
- 1,497
- 3
- 19
5
votes
2 answers
How to create Python libraries and import them in Palantir Foundry
To generalize my Python functions, I want to add them to Python libraries so that I can use them across multiple repositories. Could anyone please answer the questions below?
1) How to create our own Python libraries
2) how…

Gavisha BN
- 141
- 1
- 8
3
votes
1 answer
Shuffle Stage Failing Due To Executor Loss
I get the following error when my Spark job fails: **"org.apache.spark.shuffle.FetchFailedException: The relative remote executor(Id: 21), which maintains the block data to fetch is dead."**
Overview of my Spark job:
Input size is ~35 GB
I have…

Arun Mohan
- 349
- 4
- 13
3
votes
1 answer
How can I merge an incremental dataset and a snapshot dataset while retaining deleted rows?
I have a data connection source that creates two datasets:
Dataset X (Snapshot)
Dataset Y (Incremental)
The two datasets pull from the same source. Dataset X consists of the current state of all rows in the source table. Dataset Y pulls all rows…

tomwhittaker
- 331
- 2
- 8
3
votes
1 answer
Palantir Foundry incremental testing is hard to iterate on, how do I find bugs faster?
I have a pipeline set up in my Foundry instance that uses incremental computation but for some reason isn't doing what I expect. Namely, I want to read the previous output of my transform and get the maximum value of a date, then read the input…

vanhooser
- 1,497
- 3
- 19
3
votes
1 answer
Is there a tool available within Foundry that can automatically populate column descriptions? If so, what is it called?
We are looking to see if there is a tool within the Foundry platform that lets us maintain a list of field descriptions and populates those descriptions automatically when the dataset builds. Does this exist and if so what is the tool…

Robert F
- 187
- 5
2
votes
1 answer
PySpark Serialized Results too Large OOM for loop in Spark
I am having serious difficulty understanding why my transform, after running for many minutes (sometimes hours), fails with the error "Serialized Results too large".
In the transform I have a list of dates that I am iterating in a for…

Jresearcher
- 297
- 3
- 13
2
votes
1 answer
Why is my Code Repo warning me about using withColumn in a for/while loop?
I'm noticing my code repo is warning me that using withColumn in a for/while loop is an antipattern. Why is this not recommended? Isn't this a normal use of the PySpark API?

vanhooser
- 1,497
- 3
- 19
2
votes
2 answers
How do I parse large compressed csv files in Foundry?
I have a large gzipped csv file (.csv.gz) uploaded to a dataset that's about 14GB in size and 40GB when uncompressed. Is there a way to decompress, read, and write it out to a dataset using Python Transforms without causing the executor to OOM?

vanhooser
- 1,497
- 3
- 19
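The memory concern in the question above comes from decompressing the whole file at once. A minimal sketch of the alternative, reading the gzip stream row by row with only the Python standard library, might look like the following. The helper name `iter_gzip_csv_rows` and the sample data are hypothetical, and how the binary file handle is obtained inside a Foundry transform is assumed rather than shown.

```python
import csv
import gzip
import io


def iter_gzip_csv_rows(binary_stream):
    """Yield one dict per CSV row, decompressing lazily as rows are consumed."""
    # gzip.open accepts an existing binary file object; "rt" wraps it in a
    # text layer so csv can read lines without loading the whole file.
    with gzip.open(binary_stream, mode="rt", encoding="utf-8", newline="") as text:
        yield from csv.DictReader(text)


# Tiny in-memory stand-in for a large .csv.gz file (hypothetical data).
sample = io.BytesIO(gzip.compress(b"id,value\n1,a\n2,b\n"))
rows = list(iter_gzip_csv_rows(sample))
```

Because the function is a generator, a caller can process or write out rows in batches instead of collecting them all with `list(...)` as the small demo does.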
1
vote
1 answer
Why is my Code Repo warning me not to use union and instead use unionByName?
My repository is warning me against using union and recommends unionByName instead. Aren't these the same thing? Why should I care which one I use?

vanhooser
- 1,497
- 3
- 19
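The distinction behind the question above is that PySpark's `union` matches columns by position while `unionByName` matches them by name. The sketch below models that difference in plain Python, without Spark, so it can run anywhere. The functions `union_by_position` and `union_by_name` and the sample rows are hypothetical illustrations, not Spark APIs.

```python
def union_by_position(cols_a, rows_a, rows_b):
    """Model of a positional union: b's values land under a's column names as-is."""
    return [dict(zip(cols_a, row)) for row in rows_a + rows_b]


def union_by_name(cols_a, rows_a, cols_b, rows_b):
    """Model of a by-name union: b's values are reordered to a's column order first."""
    order = [cols_b.index(col) for col in cols_a]
    reordered_b = [tuple(row[i] for i in order) for row in rows_b]
    return [dict(zip(cols_a, row)) for row in rows_a + reordered_b]


# Two hypothetical tables with the same columns in a different order.
cols_a, rows_a = ["name", "age"], [("ben", 30)]
cols_b, rows_b = ["age", "name"], [(41, "amy")]

positional = union_by_position(cols_a, rows_a, rows_b)
by_name = union_by_name(cols_a, rows_a, cols_b, rows_b)
```

In the positional result the second row silently ends up with the age under `name` and the name under `age`, which is the kind of mix-up the warning is meant to prevent.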
1
vote
1 answer
Does a count() over a DataFrame materialize the data to the driver / increase a risk of OOM?
I want to run df.count() on my DataFrame, but I know my total dataset size is pretty large. Does this run the risk of materializing the data back to the driver / increasing my risk of driver OOM?

vanhooser
- 1,497
- 3
- 19
1
vote
1 answer
How do I add a column indicating the row number from a file on disk?
I want to parse a series of .csv files using spark.read.csv, but I want to include the row number of each line inside the file.
I know that Spark typically doesn't order DataFrames unless explicitly told to do so, and I don't want to write my own…

vanhooser
- 1,497
- 3
- 19
1
vote
1 answer
How to throw a warning if a threshold value is exceeded in Foundry Code Repositories
I took an input dataset, applied some transformations to it, and wrote the result to an output dataset.
I have built this output dataset, and now I need to take the time it took to build and compare it with a threshold time…

Monica Gaddipati
- 69
- 2
1
vote
1 answer
How do I compute a range of statuses from a daily indicator?
I have a df in the format of:
| name | status    | date  |
|------|-----------|-------|
| ben  | active    | 01/01 |
| ben  | active    | 01/02 |
| ben  | active    | 01/03 |
| ben  | in-active | 01/04 |
| ben  | in-active | 01/05 |
| ben | active …

vanhooser
- 1,497
- 3
- 19
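One way to read the question above is as collapsing consecutive days with the same status into a single start/end range per person. The following is a minimal sketch of that idea in plain Python, using only the five complete rows visible in the excerpt. The function name `status_ranges` is hypothetical, and the sketch assumes zero-padded MM/DD dates within one year and no missing days.

```python
from itertools import groupby

# Only the complete rows shown in the excerpt; the truncated row is left out.
rows = [
    ("ben", "active", "01/01"),
    ("ben", "active", "01/02"),
    ("ben", "active", "01/03"),
    ("ben", "in-active", "01/04"),
    ("ben", "in-active", "01/05"),
]


def status_ranges(rows):
    """Collapse consecutive rows with the same (name, status) into one range."""
    # Sort by name, then date; zero-padded MM/DD strings sort chronologically.
    ordered = sorted(rows, key=lambda r: (r[0], r[2]))
    ranges = []
    # groupby only merges adjacent rows, so a status that recurs later
    # starts a new range instead of being merged with the earlier one.
    for (name, status), run in groupby(ordered, key=lambda r: (r[0], r[1])):
        run = list(run)
        ranges.append((name, status, run[0][2], run[-1][2]))
    return ranges


ranges = status_ranges(rows)
```

The same grouping is usually expressed in Spark with window functions; this version only shows the logic and does not check for gaps between dates.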