I'm looking for guidance on the best way to structure an extensive ETL process. My pipeline has a reasonably sleek extract section and loads into a designated file succinctly, but the only way I can think of to do the transformation steps is a series of variable assignments:

a = ['some','form','of','petl','data']
b = petl.addfield(a, 'NewStrField', str(a))
c = petl.addrownumbers(b)
d = petl.rename(c, 'row', 'ID')
.......

Reassigning the same variable name makes some sense, but doesn't aid readability:

a = ['some','form','of','petl','data']
a = petl.addfield(a, 'NewStrField', str(a))
a = petl.addrownumbers(a)
a = petl.rename(a, 'row', 'ID')
.......

I've read up on chaining method calls like this:

a = ['some','form','of','data']

result = petl.addfield(a, 'NewStrField', str(a))
    .addrownumbers(a)
    .rename(a, 'row', 'ID')
.......

but that won't work, as the functions require the table as the first parameter passed.

Is there something fundamental I am missing? I'm loath to believe that the right way of doing this commercially involves 1000+ LOC.

jukedl

1 Answer

Create a list of partially applied functions, then loop over that list.

transforms = [
    lambda x: petl.addfield(x, 'NewStrField', str(x)),
    petl.addrownumbers,
    lambda x: petl.rename(x, 'row', 'ID')
]

a = ['some', 'form', 'of', 'petl', 'data']
for f in transforms:
    a = f(a)
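
To make the pattern concrete without depending on petl, here is a minimal sketch that runs on plain lists of rows. The functions add_field, add_row_numbers and rename are hypothetical stand-ins written for this example (they are not petl's API), and functools.partial is used in place of the lambdas by binding the extra arguments by keyword.

```python
from functools import partial

# Hypothetical stand-ins for table transforms; these are NOT petl functions.
# A "table" here is a list of rows, with the header as the first row.
def add_field(table, name, value):
    header, *rows = table
    return [header + [name]] + [row + [value] for row in rows]

def add_row_numbers(table):
    header, *rows = table
    return [["row"] + header] + [[i] + row for i, row in enumerate(rows, start=1)]

def rename(table, old, new):
    header, *rows = table
    return [[new if h == old else h for h in header]] + rows

# Every entry takes a table and returns a table. partial binds the extra
# arguments by keyword, leaving the table as the only open parameter.
transforms = [
    partial(add_field, name="NewStrField", value="x"),
    add_row_numbers,
    partial(rename, old="row", new="ID"),
]

table = [["name"], ["alice"], ["bob"]]
for step in transforms:
    table = step(table)

print(table)
```

Because the steps are ordinary values in a list, they can be reordered, reused across pipelines, or tested one at a time.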

Your "total" transformation is the composition of the transformations in the list transforms. You can build that composition upfront (at the cost of some additional function calls) using a library that provides function composition, or by rolling your own.

def compose(*f):
    if not f:
        return lambda x: x  # Identity function, the identity for function composition
    return lambda x: f[0](compose(*f[1:])(x))  # unpack the tail so each function is passed separately

# Note the reversed order of the functions compared to 
# the list above.
transform = compose(
    lambda x: petl.rename(x, 'row', 'ID'),
    petl.addrownumbers,
    lambda x: petl.addfield(x, 'NewStrField', str(x)),
)


a = ['some', 'form', 'of', 'petl', 'data']
result = transform(a)
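
If the reversed argument order is a nuisance, one alternative is a left-to-right helper built on functools.reduce. The sketch below is an illustration rather than part of petl: the name pipe and the three string-cleaning steps are hypothetical, chosen only so the example runs on its own. It avoids recursion and applies the steps in the order they are listed.

```python
from functools import reduce

def pipe(*funcs):
    """Return a function that applies funcs left to right.

    With no functions, the result behaves as the identity function.
    """
    return lambda x: reduce(lambda acc, f: f(acc), funcs, x)

# Hypothetical stand-in steps; each takes a list of strings and returns a new list.
def strip_all(rows):
    return [r.strip() for r in rows]

def drop_empty(rows):
    return [r for r in rows if r]

def upper_all(rows):
    return [r.upper() for r in rows]

# Steps run in the order written: strip, then drop empties, then upper-case.
transform = pipe(strip_all, drop_empty, upper_all)
result = transform([" a ", "  ", "b"])
print(result)
```

The same helper accepts the lambdas wrapping petl calls from the list above, since each of them also takes a table and returns a table.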
chepner