Split CSV file based on characters in specific column

Question

I have a csv file, with a layout as follows:

Easting	Northing	Elevation_OD	Chainage	FC	Profile	Reg_ID
245932.499	200226.982	5.464	148.997	CE	N/A	8c12.11_021
245932.460	200225.448	5.214	150.530	S	N/A	8c12.11_021
245932.485	200224.993	5.111	153.222	S	N/A	8c12.1_001

The spreadsheet has ~55,000 rows and I need to be able to split the CSV based on specific research units - the '8c12.11' specifies a particular unit and the '_021' specifies a particular transect line and can be ignored. There are numerous units within this file and the character length of the units can vary e.g. '8c12.1' and I'm uncertain as to how to compensate for this alteration. Essentially the split needs to be able to identify the correct unit:

8c12.11 - The characters prior to the transect line (e.g _021)

and then compile all of the associated rows that correspond to this unit, into a new table.

Adriaan · Answer 1 · 2023-02-28T14:49:36.753

If your data is in data, you can loop over all rows. In each row, call a strsplit() on the seventh column, splitting on the underscore. Then, either collect all unique units, or simply verify it against your desired unit.

Here's a quick mock-up of a code that does the latter:

idx_keep = zeros(size(data,1),1); % Initialise output indices
for ii = 1:size(data,1)
    B = strsplit(data(ii, 7), '_');  % Split the string
    if strcmp(B{1}, '8c12.11')
        idx_keep(ii) = 1;  % keep those indices that match
    end
end

my_unit = data(idx_keep,:);  % Now contains only the rows with unit == 8c12.11

In case you also want to check for specific transect lines, also check B{2} == '021'.

score 2 · Accepted Answer · answered Mar 01 '23 at 08:33

2

The easiest way to do this is using extractBefore

data.Unit = extractBefore( data.Reg_ID, '_' );

answered Mar 01 '23 at 08:33

Wolfie

27,562
7
28
55

score 1 · Answer 3 · edited Feb 28 '23 at 20:29

data = 
readtable('Input/Wales_Interim_23/Pembrokeshire_23_Finished.csv');
unit_str = string(data.Reg_ID);
split = cellfun(@(x) strsplit(x, '_'), unit_str, 'UniformOutput', 
false);
split_unit_cell = cell2table(split);
split_unit = splitvars(split_unit_cell)
unit = table2array(split_unit(:,1));

data.Unit = unit;
Ids = unique(data.Unit);
NID = numel(Ids);

This allows me to identify the amount of unique units within the CSV file, after splitting the unit name from the Reg_Id column.

Split CSV file based on characters in specific column

3 Answers3

Linked