1

I have an XLS file of district-level data, where each district is in a region. But the region column is only filled for the first district in the region. Something like:

District  Region Population
Foo       Bar    1234
          Baz    345
          Qux    69
Quux      Zob    1234
          Bax    42

and so on. How can I process it so that the empty District fields are filled with the correct District value (for eventual output to a CSV, but I can do that bit) to get:

District  Region Population
Foo       Bar    1234
Foo       Baz    345
Foo       Qux    69
Quux      Zob    1234
Quux      Bax    42

Am actually doing this with GeoKettle 2.0 - eventually eventually this is all going in a Shapefile.

Spacedman
  • 92,590
  • 12
  • 140
  • 224
  • I don't know how to do this directly in Kettle. The reason being, I have a set of Python scripts that handle this kind of data. When I ran into a similar problem, I just pre-processed the files with my Python scripts using a Shell task. Doesn't sound too helpful, but -- who knows. – Brian.D.Myers Jan 30 '13 at 21:13
  • Thanks - I could easily do all my processing in Python, but I partly wanted to see how powerful these ETL systems were, and also wanted to see if I could construct something portable to other people without them having to install Python or R. – Spacedman Jan 30 '13 at 22:47
  • Yeah, the problem here is that PDI is geared towards working with tabular data, and XLS files aren't always truly tabular. FWIW, I don't think you could do it easily in SSIS or Informatica for the same reason. Also, you realize it's just as hard to install PDI as it is Python right? – Brian.D.Myers Jan 31 '13 at 01:26

1 Answers1

0

I don't think you will have a generic step for such transformation once, as @Brian said, most of PDI steps works with truly tabular data. But in those cases you can solve it with a simple Modified Java Script step.

What we're gonna do here is to write a script that stores the last non empty District as a variable lastDistict and uses it on the empty rows.

First insert a Modified Java Script and connect it in your already existing stream. Open the step and right click on the title of the script (tab) and select "Add new". At this new script only insert:

var lastDistrict;

Right click on the title of this script and select "Set start script". Doing this you will tell Kettle to execute the var creation only the first time the step is used, otherwise you would loose the value of lastDistrict on every new row.

On the other script at the same step (usually Script 1) insert:

if ((District == null) || (trim(Distict) == "")) {
    newDistrict = lastDistrict;
} else {
    newDistrict = District;
    lastDistrict = District;
}

At the bottom of the step you will see a grid called Fields. Create a new field called newDistrict and set it's type to String.

Now you'll have a new field newDistrict with values on all rows. If you don't want to create a new column, just fill Rename to with Disctrict and set Replace value 'Fieldname' or 'Rename to' to Y.

RFVoltolini
  • 390
  • 1
  • 7