Questions tagged [foundry-python-transform]

33 questions
7
votes
2 answers

Why is my build hanging / taking a long time to generate my query plan with many unions?

I notice that when I run the same code as in my example over here, but with a union, unionByName, or unionAll instead of the join, my query planning takes significantly longer and can result in a driver OOM. Code included here for reference, with a slight…
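As a rough illustration of why planning can blow up, here is a minimal sketch (hypothetical data) of the pattern: repeatedly unioning DataFrames produces one very deep logical plan that the driver must analyse in a single pass.

```python
# Minimal sketch (hypothetical data) of the pattern that can stall planning:
# every union adds another branch to a single logical plan on the driver.
from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

parts = [
    spark.range(100).withColumnRenamed("id", "value")
    for _ in range(200)  # hundreds of branches -> very large logical plan
]

# All 200 branches end up in one plan; analysis cost grows with the number of
# branches and is paid entirely on the driver before any task runs.
combined = reduce(lambda a, b: a.unionByName(b), parts)

# One common mitigation (not Foundry-specific) is to truncate the lineage part
# way through, e.g. with localCheckpoint(), so the planner never sees the
# whole chain at once.
```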
5
votes
1 answer

How do I parse XML documents in Palantir Foundry?

I have a set of .xml documents that I want to parse. I have previously tried to parse them using methods that take the file contents and dump them into a single cell; however, I've noticed this doesn't work in practice since I'm seeing slower and…
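For reference, a hedged, non-Foundry-specific sketch of parsing each document on the executors instead of dumping file contents into a single cell; the path and element names are hypothetical.

```python
# Each file is parsed on the executors with the standard library, and each
# <record> element becomes one output row. Paths and tag names are hypothetical.
import xml.etree.ElementTree as ET
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def parse_document(path_and_content):
    path, content = path_and_content
    root = ET.fromstring(content)
    for record in root.iter("record"):  # hypothetical element name
        yield (path, record.findtext("id"), record.findtext("value"))

# wholeTextFiles yields (path, full file content) pairs; parsing happens per
# file on the executors rather than being collected to the driver.
rows = spark.sparkContext.wholeTextFiles("/data/xml/*.xml").flatMap(parse_document)
df = rows.toDF(["source_file", "id", "value"])
```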
5
votes
2 answers

How to create Python libraries and import them in Palantir Foundry

In order to generalize my Python functions, I wanted to add them to Python libraries so that I can use these functions across multiple repositories. Could anyone please answer the questions below: 1) How to create our own Python libraries 2) how…
3
votes
1 answer

Shuffle Stage Failing Due To Executor Loss

I get the following error when my Spark job fails: **"org.apache.spark.shuffle.FetchFailedException: The relative remote executor(Id: 21), which maintains the block data to fetch is dead."** Overview of my Spark job: input size is ~35 GB. I have…
3
votes
1 answer

How can I merge an incremental dataset and a snapshot dataset while retaining deleted rows?

I have a data connection source that creates two datasets: Dataset X (Snapshot) Dataset Y (Incremental) The two datasets pull from the same source. Dataset X consists of the current state of all rows in the source table. Dataset Y pulls all rows…
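One way to think about this (a hedged sketch, with hypothetical column and variable names): keep the current snapshot, then append any previously-seen keys that no longer appear in it, flagged as deleted.

```python
# Hedged sketch: retain deleted rows by comparing the snapshot against the
# previously merged output. Column and dataset names are hypothetical.
from pyspark.sql import SparkSession, functions as F

def merge_with_deletes(snapshot_df, previous_merged_df, key="id"):
    # Rows we have seen before but that are gone from the current snapshot.
    deleted = (
        previous_merged_df
        .join(snapshot_df.select(key), on=key, how="left_anti")
        .withColumn("is_deleted", F.lit(True))
    )
    current = snapshot_df.withColumn("is_deleted", F.lit(False))
    return current.unionByName(deleted)

spark = SparkSession.builder.getOrCreate()
snapshot = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
previous = spark.createDataFrame([(1, "a"), (3, "c")], ["id", "val"])
merged = merge_with_deletes(snapshot, previous)  # id 3 is kept, flagged deleted
```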
3
votes
1 answer

Palantir Foundry incremental testing is hard to iterate on, how do I find bugs faster?

I have a pipeline setup in my Foundry instance that is using incremental computation but for some reason isn't doing what I expect. Namely, I want to read the previous output of my transform and get the maximum value of a date, then read the input…
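For context, a hedged sketch of the pattern being described, assuming the transforms.api @incremental decorator and its 'previous' output read mode; the paths, columns, and schema are hypothetical.

```python
# Hedged sketch (hypothetical paths/columns): read the previous output, take
# the maximum date already processed, and only keep newer input rows.
from pyspark.sql import functions as F, types as T
from transforms.api import transform, incremental, Input, Output

SCHEMA = T.StructType([
    T.StructField("id", T.StringType()),
    T.StructField("event_date", T.DateType()),
])

@incremental()
@transform(
    out=Output("/path/to/output"),   # hypothetical paths
    source=Input("/path/to/input"),
)
def compute(out, source):
    previous = out.dataframe("previous", SCHEMA)
    max_date = previous.agg(F.max("event_date")).collect()[0][0]

    new_rows = source.dataframe()
    if max_date is not None:
        new_rows = new_rows.filter(F.col("event_date") > F.lit(max_date))

    out.write_dataframe(new_rows)
```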
3
votes
1 answer

Is there a tool available within Foundry that can automatically populate column descriptions? If so, what is it called?

We are looking to see if there is a tool within the Foundry platform that will allow us to keep a list of field descriptions so that, when the dataset builds, it can populate those descriptions automatically. Does this exist, and if so, what is the tool…
2
votes
1 answer

PySpark "Serialized Results too Large" OOM in a for loop in Spark

I'm having serious difficulty understanding why I cannot run a transform that, after many minutes (sometimes hours), returns the error "Serialized Results too large". In the transform I have a list of dates that I am iterating over in a for…
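A hedged sketch of the anti-pattern and one alternative (hypothetical column names): collecting per-date results in a loop accumulates rows on the driver, whereas a single aggregation keeps the data distributed.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("2023-01-01", 10.0), ("2023-01-01", 5.0), ("2023-01-02", 7.0)],
    ["date", "amount"],
)

# Anti-pattern: one collect() per date ships every iteration's rows back to
# the driver, which is what the "serialized results too large" limit guards
# against.
# results = []
# for d in [r["date"] for r in df.select("date").distinct().collect()]:
#     results.append(df.filter(F.col("date") == d).collect())

# Alternative: express the per-date work as one distributed aggregation so
# only the small summary ever reaches the driver or the output dataset.
summary = df.groupBy("date").agg(
    F.count("*").alias("n_rows"),
    F.sum("amount").alias("total_amount"),
)
```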
2
votes
1 answer

Why is my Code Repo warning me about using withColumn in a for/while loop?

I'm noticing my code repo is warning me that using withColumn in a for/while loop is an antipattern. Why is this not recommended? Isn't this a normal use of the PySpark API?
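A minimal sketch of the reason behind the warning: each withColumn call adds another projection to the logical plan, so a loop grows the plan with every iteration, while a single select builds one projection.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2), (3, 4)], ["a", "b"])
cols_to_double = ["a", "b"]

# Anti-pattern: one extra plan node per loop iteration.
looped = df
for c in cols_to_double:
    looped = looped.withColumn(c, F.col(c) * 2)

# Equivalent single projection, one plan node regardless of column count.
selected = df.select([(F.col(c) * 2).alias(c) for c in cols_to_double])
```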
2
votes
2 answers

How do I parse large compressed csv files in Foundry?

I have a large gzipped csv file (.csv.gz) uploaded to a dataset that's about 14 GB in size and 40 GB when uncompressed. Is there a way to decompress, read, and write it out to a dataset using Python Transforms without causing the executor to OOM?
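For reference, a hedged sketch (hypothetical path and options) of the usual starting point: gzip is not a splittable codec, so the file arrives in one partition and should be repartitioned immediately after the read.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read
    .option("header", "true")
    .csv("/data/raw/big_file.csv.gz")  # gzip is decompressed but not split: one task
    .repartition(200)                  # spread rows across executors before heavy work
)
# In a Foundry transform this DataFrame would then be handed to the output's
# write_dataframe rather than saved to a path directly.
```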
1
vote
1 answer

Why is my Code Repo warning me not to use union and instead use unionByName?

I see my repository is warning me about using union and suggesting I use unionByName instead. Aren't these the same thing? Why would I care which one I use?
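A tiny sketch of the difference: union matches columns purely by position, while unionByName matches them by name.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
a = spark.createDataFrame([(1, "x")], ["id", "label"])
b = spark.createDataFrame([("y", 2)], ["label", "id"])

positional = a.union(b)     # "y" silently lands in id, 2 lands in label
by_name = a.unionByName(b)  # columns matched by name, as intended
```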
1
vote
1 answer

Does a count() over a DataFrame materialize the data to the driver / increase a risk of OOM?

I want to run df.count() on my DataFrame, but I know my total dataset size is pretty large. Does this run the risk of materializing the data back to the driver / increasing my risk of driver OOM?
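A minimal sketch of the distinction: count() runs as a distributed aggregation and returns a single number to the driver, unlike collect() or toPandas().

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10_000_000)

n = df.count()        # executors count their partitions; driver receives one long
# rows = df.collect() # this is what actually materializes data on the driver
```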
1
vote
1 answer

How do I add a column indicating the row number from a file on disk?

I want to parse a series of .csv files using spark.read.csv, but I want to include the row number of each line inside the file. I know that Spark typically doesn't order DataFrames unless explicitly told to do so, and I don't want to write my own…
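One hedged approach (naive comma splitting, hypothetical path and schema): zipWithIndex preserves the read order of lines within a file, so the index can be attached before the fields are split out.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumes a single two-column CSV file; real parsing would need the csv module
# and per-file handling if several files are read together.
lines = spark.sparkContext.textFile("/data/raw/file1.csv")
numbered = lines.zipWithIndex().map(
    lambda pair: pair[0].split(",") + [pair[1]]  # fields + 0-based line number
)
df = numbered.toDF(["col_a", "col_b", "line_number"])
```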
1
vote
1 answer

How to throw a warning if a threshold value is exceeded in Foundry Code Repositories

I have taken an input dataset, applied some transformations to it, and written the result to an output dataset. I have built this output dataset, and now I need to compare the time taken to build it against a threshold time…
1
vote
1 answer

How do I compute a range of statuses from a daily indicator?

I have a df in the format of:
| name | status    | date  |
|------|-----------|-------|
| ben  | active    | 01/01 |
| ben  | active    | 01/02 |
| ben  | active    | 01/03 |
| ben  | in-active | 01/04 |
| ben  | in-active | 01/05 |
| ben  | active …
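A hedged sketch of the usual "gaps and islands" approach to this shape of data: flag the rows where the status changes, take a running sum of the flags to label each consecutive run, then aggregate each run into a start/end range. Dates are kept as strings only to keep the example short.

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("ben", "active", "01/01"), ("ben", "active", "01/02"),
     ("ben", "active", "01/03"), ("ben", "in-active", "01/04"),
     ("ben", "in-active", "01/05"), ("ben", "active", "01/06")],
    ["name", "status", "date"],
)

w = Window.partitionBy("name").orderBy("date")
runs = (
    df
    # 1 where the status differs from the previous row, null on the first row
    .withColumn("changed", (F.col("status") != F.lag("status").over(w)).cast("int"))
    .withColumn("changed", F.coalesce(F.col("changed"), F.lit(1)))
    # running sum of change flags labels each consecutive run
    .withColumn("run_id", F.sum("changed").over(w))
)
ranges = runs.groupBy("name", "status", "run_id").agg(
    F.min("date").alias("start_date"),
    F.max("date").alias("end_date"),
)
```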