Split any file into multiple files after a specific text string

Question

I have a single .dat file that consists of a lot of webp images. I would like to split that file every time before the string WEBM and save that portion as a .webm file.

I would love to do this with Automator or AppleScript, but I couldn't get my head around it. Can anyone help me out?

Why do you want to split on `WEBM`? If you look at a WEBP image, you will see it starts with `RIFF`. Try `xxd SOMEIMAGE.WEBP | head` — Mark Setchell, Aug 07 '23 at 00:24
Or are your images actually `WEBM` despite your saying they are `WEBP`? In that case the signature at the start is `1A 45 DF A3`. How about using Google Drive or Dropbox to share your file? — Mark Setchell, Aug 08 '23 at 09:55

Mark Setchell · Answer 1 · 2023-08-08T20:46:41.673

On splitting concatenated images...

It's not clear to me whether you actually have WEBP files or WEBM files, so let's do this:

- synthesize a WEBP file, examine it and work out what we are looking for
- concatenate a few together to synthesize your DAT file
- write code to split it

Then we'll:

- repeat the above steps with WEBM files
- look at calling the above from AppleScript

So, let's create a WEBP file with 20 frames that morph from lime green to blue using ImageMagick:

magick -size 180x100 xc:lime xc:blue -morph 18 a.webp

And now inspect the beginning of the file with xxd and see it begins with RIFF:

xxd a.webp | head -1
00000000: 5249 4646 8e0a 0000 5745 4250 5650 3858  RIFF....WEBPVP8X
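
As a side note on that header, the four bytes after RIFF (8e 0a 00 00) are a little-endian length field that counts everything after the first 8 bytes. The sketch below recreates just the 16 header bytes shown above in a temporary file (the file is a stand-in, not a real image) and decodes the field; the result matches the 2710-byte size of a.webp in the directory listing later in this answer.

```shell
# Recreate the 16 header bytes from the xxd output in a throwaway file
hdr=$(mktemp)
printf 'RIFF\216\012\000\000WEBPVP8X' > "$hdr"

# Read bytes 4..7 as unsigned decimals: 142 10 0 0
read -r b0 b1 b2 b3 <<EOF
$(od -An -tu1 -j4 -N4 "$hdr")
EOF

# Assemble the little-endian value, then add the 8 bytes it does not count
payload=$(( b0 + (b1 << 8) + (b2 << 16) + (b3 << 24) ))
total=$(( payload + 8 ))
echo "RIFF size field: $payload, whole file: $total bytes"
rm -f "$hdr"
```

This length field is one way to tell a genuine image boundary from a stray RIFF that happens to occur inside image data.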

Now concatenate 5 copies together to simulate your DAT file:

cat a.webp a.webp a.webp a.webp a.webp > BigBoy.dat

Note that if you want to concatenate 100 files together you'd be there typing all day, so in that case use a loop:

for i in {1..100}; do cat a.webp ; done > BigBoy.dat
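
Before splitting, it can be worth counting how many signatures the DAT file contains, since that is how many pieces a split on that string will produce. This is a minimal sketch on a tiny stand-in file (sample.dat and its contents are made up for illustration); on real data the count would also include any RIFF that happens to occur inside an image.

```shell
# Build a stand-in DAT file: three fake "images" that each start with RIFF
work=$(mktemp -d)
printf 'RIFFaaaa'   >  "$work/sample.dat"
printf 'RIFFbbbbbb' >> "$work/sample.dat"
printf 'RIFFcc'     >> "$work/sample.dat"

# -a treats binary data as text, -o prints every match on its own line
count=$(LC_ALL=C grep -a -o 'RIFF' "$work/sample.dat" | wc -l | tr -d ' ')
echo "signatures found: $count"
rm -rf "$work"
```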

Now we'll split it using Perl, which is included with macOS:

#!/usr/bin/env perl
use strict;
use warnings;

# Derived from https://superuser.com/a/405495

my $magic = "RIFF";   # signature that starts every image
my $ext   = "webp";   # extension for the extracted files

my $i = 0;
my $buffer;

# Slurp the entire file from standard input as raw bytes
binmode STDIN;
{ local $/ = undef; $buffer = <STDIN>; }

# Split buffer on every occurrence of $magic (\Q...\E matches it literally)
my @images = split /\Q$magic\E/, $buffer;

# Write each image to disk, restoring the signature that split removed
for my $image (@images) {
    next if $image eq '';
    my $filename = sprintf("image-%04d.%s", $i++, $ext);
    open(my $fh, ">", $filename) or die "open $filename: $!";
    binmode $fh;
    print {$fh} $magic, $image   or die "print $filename: $!";
    close($fh)                   or die "close $filename: $!";
}

And if we check what we got, we'll see the original WEBP file and the five extracted copies, each the same size as the original:

ls -l *webp
-rw-r--r--     1 mark  staff        2710  8 Aug 13:37 a.webp
-rw-r--r--     1 mark  staff        2710  8 Aug 16:23 image-0000.webp
-rw-r--r--     1 mark  staff        2710  8 Aug 16:23 image-0001.webp
-rw-r--r--     1 mark  staff        2710  8 Aug 16:23 image-0002.webp
-rw-r--r--     1 mark  staff        2710  8 Aug 16:23 image-0003.webp
-rw-r--r--     1 mark  staff        2710  8 Aug 16:23 image-0004.webp
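
Matching sizes are a good sign, but a stricter check is to glue the pieces back together and compare the result byte for byte with the original DAT file. The sketch below assumes the image-NNNN naming used by the script and uses small made-up stand-in files so that it runs on its own.

```shell
work=$(mktemp -d)

# Stand-ins for the splitter's output
printf 'RIFFaaaa'   > "$work/image-0000.webp"
printf 'RIFFbbbbbb' > "$work/image-0001.webp"
printf 'RIFFcc'     > "$work/image-0002.webp"

# Stand-in for the original concatenated file
printf 'RIFFaaaaRIFFbbbbbbRIFFcc' > "$work/BigBoy.dat"

# Zero-padded numbers make the glob expand in the original order;
# cmp -s is silent and reports the outcome through its exit status
if cat "$work"/image-*.webp | cmp -s - "$work/BigBoy.dat"; then
    result="identical"
else
    result="different"
fi
echo "$result"
rm -rf "$work"
```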

Now we'll do the same thing for WEBM:

# Create one file
magick -size 180x100 xc:lime xc:blue -morph 18 a.webm

# Check its magic number
xxd a.webm | head -1
00000000: 1a45 dfa3 9f42 8681 0142 f781 0142 f281  .E...B...B...B..

And adapt the code to look for the new magic and extract with WEBM extension:

#!/usr/bin/env perl
use strict;
use warnings;

# Derived from https://superuser.com/a/405495

my $magic = "\x1a\x45\xdf\xa3";   # signature that starts every WEBM file
my $ext   = "webm";               # extension for the extracted files
my $i = 0;
my $buffer;

# Slurp the entire file from standard input as raw bytes
binmode STDIN;
{ local $/ = undef; $buffer = <STDIN>; }

# Split buffer on every occurrence of $magic (\Q...\E matches it literally)
my @images = split /\Q$magic\E/, $buffer;

# Write each image to disk, restoring the signature that split removed
for my $image (@images) {
    next if $image eq '';
    my $filename = sprintf("image-%04d.%s", $i++, $ext);
    open(my $fh, ">", $filename) or die "open $filename: $!";
    binmode $fh;
    print {$fh} $magic, $image   or die "print $filename: $!";
    close($fh)                   or die "close $filename: $!";
}
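
Once the pieces are written, a quick sanity check is to confirm that a piece really starts with the expected signature. This sketch uses od rather than xxd and a made-up stand-in file; on real output you would point it at image-0000.webm instead.

```shell
# Stand-in for an extracted file: the 4 signature bytes plus some filler
piece=$(mktemp)
printf '\032\105\337\243rest-of-file' > "$piece"

# Dump the first 4 bytes as hex and strip the spacing od adds
sig=$(od -An -tx1 -N4 "$piece" | tr -d ' \n')
echo "$sig"
rm -f "$piece"
```

For a WEBM piece the value printed should be 1a45dfa3, the same bytes as in the xxd output above.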

If you are unfamiliar with running shell scripts, you would save the above script as $HOME/splitter and then, in Terminal, make it executable (only necessary once) using:

chmod +x $HOME/splitter

Then you run it, piping BigBoy.dat into it as the file to be processed like this:

$HOME/splitter < $HOME/BigBoy.dat

I am not sure if you really need to run it as AppleScript, or if you are just unfamiliar with the Terminal, but if you do really need to run it from AppleScript, you can just do this:

do shell script "cd && ./splitter < BigBoy.dat"

This changes to your HOME directory and runs the splitter script you saved there on BigBoy.dat, which must also be in your HOME directory.

Note that this technique should be equally applicable to concatenated TIFF, JPEG, PNG, GIF or any other image format, provided you set $magic to that format's signature and $ext to its extension.
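
For reference, here is a small lookup of the leading bytes for those formats, written as a hypothetical shell helper (magic_for is a made-up name; only the byte values are standard). Each hex pair becomes one \x escape in the Perl $magic string, for example "\x89\x50\x4e\x47" for the start of a PNG. Short signatures are more likely to occur by chance inside image data, and JPEG files often embed a thumbnail that starts with the same bytes, so treat the split results with some care.

```shell
# Print the leading signature bytes (as hex) for a few image formats
magic_for() {
    case "$1" in
        webp)     echo "52494646" ;;              # "RIFF"
        webm)     echo "1a45dfa3" ;;              # EBML header
        png)      echo "89504e470d0a1a0a" ;;      # 8-byte PNG signature
        gif)      echo "47494638" ;;              # "GIF8" (GIF87a and GIF89a)
        jpeg|jpg) echo "ffd8ff" ;;                # start-of-image marker
        tiff)     echo "49492a00 or 4d4d002a" ;;  # little- or big-endian
        *)        echo "unknown"; return 1 ;;
    esac
}

for fmt in webp webm png gif jpeg tiff; do
    printf '%-5s %s\n' "$fmt" "$(magic_for "$fmt")"
done
```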

Note that if you used TextEdit to create the Perl scripts, you will need to ensure you save them as plain text rather than RTF (in TextEdit, Format > Make Plain Text).

score 0 · Answer 2 · answered Aug 01 '23 at 04:55

This script will read a file and split its text on the string 'WEBM', making a list of those strings (or text items). It will then loop through that list and write each of the strings to a new file on the desktop with a name like 'image1.webm'. If a file with such a name already exists, it will overwrite it. I used an example text of three phrases here (i.e. fileText).

set sourceDat to "longtext.dat"
set fPath to (path to desktop as text) & sourceDat -- path to DAT file
--> "MacHD:Users:username:Desktop:longtext.dat"
set fileText to read file fPath as «class utf8»
--> WEBMStack Overflow at WeAreDevelopers World Congress in Berlin;WEBMTemporary policy: Generative AI (e.g., ChatGPT) is banned;WEBMPreview of Search and Question-Asking Powered by GenAI;

set AppleScript's text item delimiters to "WEBM"
set fileBlock to (get rest of text items of fileText)
--> {"Stack Overflow at WeAreDevelopers World Congress in Berlin;", "Temporary policy: Generative AI (e.g., ChatGPT) is banned;", "Preview of Search and Question-Asking Powered by GenAI;"}

repeat with cc from 1 to count of fileBlock
    set fileString to "WEBM" & text of item cc of fileBlock -- content of each new webm
    --> "WEBMPreview of Search and Question-Asking Powered by GenAI;"
    set newPath to ((path to desktop as text) & "image" & cc & ".webm")
    -- open first: this creates the file if it does not exist yet
    set np to open for access file newPath with write permission
    try
        set eof of np to 0 -- truncate so an existing file is overwritten
        write fileString to np as «class utf8»
        close access np
    on error errMsg
        close access np -- release the file even if the write fails
        error errMsg
    end try
end repeat
set AppleScript's text item delimiters to "" -- restore the default

To gain an understanding of how it works and how it might be modified, it might be helpful to read up on text item delimiters and the write command and perhaps other commands in the read/write suite as well.

Did you test this code with real WEBM/P images? It seems unlikely it would work with binary files such as images when you treat them as strings... — Mark Setchell, Aug 08 '23 at 11:07
No, I simply followed the OP's question and assumed that these DAT files would be amenable to splitting. — Mockman, Aug 08 '23 at 14:39
