How can I get the full dataset records through a Foundry API call?
I want to use the dataset in another Python application outside Foundry, using requests, but only the first 300 rows of records come back. The API endpoint I have uses the Contour dataset-preview.
- Could you better describe what you are trying to do? Are you trying to export data from Foundry to another app, or trying to download the dataset? If the dataset is 100 GB of data, are you intending to download all of it? – fmsf Feb 09 '21 at 07:58
- It's only 100k records, and I want to parse the whole of the data into another application. I intend to use Python requests to parse the data into a pandas DataFrame for use within my application. – Asher Feb 09 '21 at 08:26
- @Asher While it's possible to do this, can't you just do whatever transformation you need in Foundry? That's kind of the point of using it in the first place. It supports loading data into pandas DataFrames, loading arbitrary Python libraries, plotting, etc., so it seems very likely that what you need to do is possible without downloading any data. The drawback to downloading data is that any results you derive are disconnected from the provenance/causality chain visible in Foundry, are not updated automatically, and there might be legal/compliance issues (like GDPR deletion). – Jonathan Ringstad Feb 09 '21 at 14:46
- @JonathanRingstad Thank you for that valuable insight. One of the main reasons for this is that Foundry is IP-address restricted, which currently makes visualization on mobile a challenge for us. If you have a better way to get Foundry data into a visualization tool like Power BI on mobile, please do share it with us. – Asher Feb 10 '21 at 03:38
- @Asher I'd say ask your Palantir rep about Foundry mobile, or whether the IP restrictions can be loosened. Regardless, if you do want to download the data, I'd recommend just going into the dataset's "Details" tab and downloading the files (likely parquet files; that's the default for most). If there are too many files, you can write a transform that just does a `repartition(1)` or so to reduce it down to a single (large) parquet file. These parquet files are then pretty easy to load into pandas DataFrames in Python (see the sketch at the bottom of this page), and you won't need to hardcode tokens etc. into your script (security issues). – Jonathan Ringstad Feb 10 '21 at 09:55
- @JonathanRingstad I didn't know there was a Foundry mobile; I'll check with Palantir on that. Do you have transform code for the repartition? I have rarely worked with raw files in Code Authoring. – Asher Feb 10 '21 at 10:13
- @Asher To repartition, you can just create the empty Python transform from the template (the one that just calls `identity` on the df) and then fill out the function as something like `return df.repartition(10)` (to turn it into 10 files); a sketch follows these comments. You probably don't want to turn it into so few files that they end up being many gigabytes each. I don't know if Palantir can offer Foundry mobile to your org at this point; I think it's still in beta with selective roll-out or some such (I'm not quite in the loop on that). – Jonathan Ringstad Feb 10 '21 at 18:35
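For reference, here is a minimal sketch of the repartition transform described in the comments above. It assumes the standard `transforms.api` decorators available in Code Authoring; the input and output dataset paths are placeholders you would replace with your own.

from transforms.api import transform_df, Input, Output


@transform_df(
    Output("/path/to/output/dataset_single_file"),  # placeholder output dataset path
    df=Input("/path/to/input/dataset"),             # placeholder input dataset path
)
def compute(df):
    # Write the dataset out as a single (possibly large) parquet file so it
    # is easy to download from the dataset's "Details" tab.
    return df.repartition(1)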
1 Answer
There are different possibilities for querying datasets in Foundry, depending on the dataset size and the use case. Probably the easiest to start with is the data-proxy SQL query endpoint, because you don't have to worry about the underlying file format of the dataset.
import requests
import pandas as pd


def query_foundry_sql(query, token, branch='master', base_url='https://foundry-instance.com') -> (list, list):
    """
    Queries the data proxy query API with Spark SQL.

    Example: query_foundry_sql("SELECT * FROM `/path/to/dataset` LIMIT 5000", "ey...")

    Args:
        query: the SQL query
        token: a bearer token with access to the dataset
        branch: the branch of the dataset / query
        base_url: the base URL of the Foundry instance

    Returns:
        (columns, data) tuple. data contains the data matrix, columns the list of column names.
        Can be converted to a pandas DataFrame with pd.DataFrame(data, columns=columns).
    """
    response = requests.post(f"{base_url}/foundry-data-proxy/api/dataproxy/queryWithFallbacks",
                             headers={'Authorization': f'Bearer {token}'},
                             params={'fallbackBranchIds': [branch]},
                             json={'query': query})
    response.raise_for_status()
    json_response = response.json()
    columns = [e['name'] for e in json_response['foundrySchema']['fieldSchemaList']]
    return columns, json_response['rows']


columns, data = query_foundry_sql("SELECT * FROM `/Global/Foundry Operations/Foundry Support/iris` LIMIT 5000",
                                  "ey...",
                                  base_url="https://foundry-instance.com")

df = pd.DataFrame(data=data, columns=columns)
df.head(5)

nicornk
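If you follow the file-download route from the comments instead (repartition, then download the parquet file from the dataset's "Details" tab), the file can be read straight into pandas. A minimal sketch, assuming a hypothetical local file path and that pyarrow or fastparquet is installed:

import pandas as pd

# Hypothetical path to a parquet file downloaded from the dataset's "Details" tab.
df = pd.read_parquet("downloads/part-00000.snappy.parquet")
print(df.head(5))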