When using a Hugging Face tokenizer with return_overflowing_tokens=True, the result can contain multiple token sequences per input string. Therefore, when doing a Dataset.map from strings to token sequences, you need to remove the original columns (since the output rows are no longer 1:1 with the input rows).
For my application, I need to continue to reference the original dataset's columns. How can I copy them over to the tokenized dataset?
For example:
# Pseudocode
ds['txt'] == ['The quick brown fox', 'jumped over the lazy hens']
ds['src'] == ['Nursery rhyme 1', 'Nursery rhyme 2']
tokenize(ds['txt'], return_overflowing_tokens=True) =>
    [tokens for 'The quick brown'],
    [tokens for 'fox'],
    [tokens for 'jumped over'],
    [tokens for 'the lazy hens'],
# I'd like tokenized_ds to look like this:
tokenized_ds[0] == {'txt': 'The quick brown fox', 'src': 'Nursery rhyme 1', 'tokens': [tokens for 'The quick brown']}
tokenized_ds[1] == {'txt': 'The quick brown fox', 'src': 'Nursery rhyme 1', 'tokens': [tokens for 'fox']}
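
For concreteness, here is roughly the mapping I'm using today. The checkpoint, max_length, and padding choice are just placeholders for this sketch, and the exact chunk boundaries depend on the tokenizer; the remove_columns step is exactly where 'txt' and 'src' get lost:

# Rough sketch of my current mapping; checkpoint and max_length are placeholders.
from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint

ds = Dataset.from_dict({
    "txt": ["The quick brown fox", "jumped over the lazy hens"],
    "src": ["Nursery rhyme 1", "Nursery rhyme 2"],
})

def tokenize_fn(batch):
    # One input string can overflow into several chunks, so the output rows
    # no longer line up 1:1 with the input rows.
    return tokenizer(
        batch["txt"],
        truncation=True,
        max_length=5,
        padding="max_length",
        return_overflowing_tokens=True,
    )

# Without remove_columns, map() fails because the original columns no longer
# match the (longer) tokenized output.
tokenized_ds = ds.map(tokenize_fn, batched=True, remove_columns=ds.column_names)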
Some clarifications:
Some of the columns I need to preserve are strings, which are harder to carry along once the dataset's format is set to tensors.
The dataset will be batched via a DataLoader, and only the batches are handed off for processing, so the original, full dataset will not necessarily be available at that point. That makes it hard to map back to the original dataset on demand, which is why I want to carry the preserved columns inside the transformed dataset itself.
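
To make that last point concrete, this is roughly the loop the batches end up in. It assumes tokenized_ds already carries the copied 'txt'/'src' columns, which is exactly the part I don't know how to do:

from torch.utils.data import DataLoader

# Token columns as torch tensors, but still return the (string) metadata columns.
tokenized_ds.set_format(type="torch", columns=["input_ids", "attention_mask"],
                        output_all_columns=True)

loader = DataLoader(tokenized_ds, batch_size=2)
for batch in loader:
    # By the time a batch arrives here, the full original dataset is out of
    # reach, so 'src' has to travel with the tokenized rows themselves.
    print(batch["input_ids"].shape, batch["src"])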