Can I use Perl's unpack to break up a string into vars?

Question

I have an image file name that consists of four parts:

$Directory (the directory where the image exists)
$Name (for an art site, this is the painting's name reference #)
$File (the image's file name minus extension)
$Extension (the image's extension)

$example = "100020003000.png"

which I would like broken down as follows:

$dir=1000 $name=2000 $file=3000 $ext=.png

I was wondering whether substr is the best option for breaking up the incoming $example so that I can work with the four variables: validation/error checking, grabbing the verbose name from its $Name assignment, and so on. I found this post:

is unpack faster than substr?

So, in my beginner's "stone tool" approach:

my $example = "100020003000.png";
my $dir = substr($example, 0, 4);
my $name = substr($example, 4, 4);
my $file = substr($example, 8, 4);
my $ext = substr($example, 13, 3); # will add the "." later

So, can I use unpack, or maybe even another approach that would be more efficient?

I would also like to avoid loading any modules unless doing so would use fewer resources for some reason. Modules are great tools and I luv 'em, but I don't think they are necessary here.

I realize I should probably push the vars into an array/hash, but I am really a beginner here and I would need further instruction on how to do that and how to pull them back out.

Thanks to everyone at stackoverflow.com!

I believe in Perl you can use any function to do anything as long as you try hard enough. :-) — tvanfosson, Oct 07 '09 at 21:57
Yes, I am finding that out. All you have to say is "You can't do that with that" and bam, here comes the solutions! I also really prefer Perl over php by far. — Jim_Bo, Oct 07 '09 at 22:15
As for performance: `pack` is probably the fastest by a hair, but `pack`, `substr`, and regexes should *all* be fast enough that you don't need to worry. And if performance is really a concern, don't guess, benchmark with `Benchmark`. — hobbs, Oct 08 '09 at 05:58
Well I'm glad I said "don't guess, benchmark". Results at http://gist.github.com/204800 . They're all blazing fast but `substr` wins. — hobbs, Oct 08 '09 at 06:12
@hobbs Thank you. I guess I should test all my subs/routines in this manner. Thanks for the informative comment. — Jim_Bo, Oct 08 '09 at 14:00
Wow, that really is an amazing difference. Why would my initial approach (as in my question) be so much faster than a one liner? I am Apache version1.3.41 (Unix) / OS Linux . I assume prices will vary depending. ;-) — Jim_Bo, Oct 08 '09 at 14:09
@Jim_Bo The direct substring version does very minimal work. — Sinan Ünür, Oct 08 '09 at 18:47
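The "don't guess, benchmark" advice in the comments above can be tried with the core Benchmark module. The following is only a sketch: the sub names and the one-second run time are illustrative assumptions, it is not the code from the linked gist, and the numbers will differ by machine and Perl version.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $example = "100020003000.png";

# Three candidate splitters; each returns (dir, name, file, ext with dot).
sub via_substr {
    my ($s) = @_;
    return (substr($s, 0, 4), substr($s, 4, 4), substr($s, 8, 4), substr($s, 12, 4));
}

sub via_unpack {
    my ($s) = @_;
    return unpack 'A4' x 4, $s;
}

sub via_regex {
    my ($s) = @_;
    return $s =~ /(.{4})/g;
}

# Run each candidate for at least 1 CPU second and print a comparison table.
cmpthese(-1, {
    substr => sub { my @f = via_substr($example) },
    unpack => sub { my @f = via_unpack($example) },
    regex  => sub { my @f = via_regex($example) },
});
```

cmpthese prints a table of rates and relative percentages, so the three approaches can be compared on the actual target system.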

Sinan Ünür · Accepted Answer · 2009-10-08T00:33:58.373

12

Absolutely:

my $example = "100020003000.png";
my ($dir, $name, $file, $ext) = unpack 'A4' x 4, $example;

print "$dir\t$name\t$file\t$ext\n";

Output:

1000    2000    3000    .png
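The question also asks how to keep the pieces in a hash and pull them back out. One possible way is a hash slice, sketched here under the same fixed four-character layout; the key names are chosen only for illustration.

```perl
use strict;
use warnings;

my $example = "100020003000.png";

# Assign the four unpacked fields to four hash keys in one step (a hash slice).
my %part;
@part{qw(dir name file ext)} = unpack 'A4' x 4, $example;

# Pull individual values back out by key.
print "Directory: $part{dir}\n";
print "Extension: $part{ext}\n";

# Or walk all of them in a fixed order.
for my $key (qw(dir name file ext)) {
    print "$key => $part{$key}\n";
}
```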

edited Oct 08 '09 at 00:33

answered Oct 07 '09 at 22:00

Sinan Ünür


+1 for actually providing a good pack-based answer, even if I like the idea of using regexes more. :-) – C. K. Young Oct 07 '09 at 22:01
@Sinan Very nice. That explains to me how unpack works very well. I read docs on it today but, could not find an example that made sense to me. Thanks so much! KUDOS! – Jim_Bo Oct 07 '09 at 22:08

score 5 · Answer 2 · answered Oct 07 '09 at 21:57

I'd just use a regex for that:

my ($dir, $name, $file, $ext) = $path =~ m:(.*)/(.*)/(.*)\.(.*):;

Or, to match your specific example:

my ($dir, $name, $file, $ext) = $example =~ m:^(\d{4})(\d{4})(\d{4})\.(.{3})$:;
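Since the question mentions validation, one side benefit of the anchored regex is that a failed match is easy to detect. A minimal sketch follows; parse_image_name is a hypothetical helper, and the stricter \w{3} extension and \z anchor are assumptions about what counts as valid input.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (dir, name, file, ext), or an empty list if the
# input is not 12 digits, a dot, and a 3-character extension.
sub parse_image_name {
    my ($input) = @_;
    # \z (unlike $) does not allow a trailing newline.
    my @parts = $input =~ m/^(\d{4})(\d{4})(\d{4})\.(\w{3})\z/;
    return @parts;
}

for my $candidate ("100020003000.png", "1000-bad-name.png") {
    # A list assignment is true only if the right side returned something.
    if (my ($dir, $name, $file, $ext) = parse_image_name($candidate)) {
        print "$candidate: dir=$dir name=$name file=$file ext=$ext\n";
    }
    else {
        print "$candidate: rejected\n";
    }
}
```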

C. K. Young

Excellent! Another great answer. Thanks very much. Now, which one to use? – Jim_Bo Oct 07 '09 at 22:11
I wonder which approach would actually be faster? I struggled with which answer to check but, I guess since the title is "can I use unpack instead...." the check goes there. I'll give you useful bumps here for sure though! ;-) – Jim_Bo Oct 07 '09 at 22:23
It depends. If you have a list of a whole lot of file names in this fixed format (as it seems from your question), use `unpack`. Otherwise, use regex. – Sinan Ünür Oct 07 '09 at 22:23
Yeah, this one is a one hit wonder. Only one $example will need to be processed. So, regex is prob the best choice. However, when my friend wants to view all his paintings in the directory, or view his hit data, I will try to utilize unpack. Believe me, I will be back here to see if I did it right! – Jim_Bo Oct 07 '09 at 22:35

score 3 · Answer 3 · answered Oct 07 '09 at 22:22

Using unpack is good, but since the elements are all the same width, the regex is very simple as well:

my $example = "100020003000.png";
my ($dir, $name, $file, $ext) = $example =~ /(.{4})/g;

FMc

Man, keeps getting smaller and smaller! Thanks @FM ! That one goes in my Perl bible too. I kept each area (var) in sets of four chars because I knew it would benefit somehow. You just showed me why! Thanks! – Jim_Bo Oct 07 '09 at 22:28

score 2 · Answer 4 · edited Oct 08 '09 at 18:49

It isn't unpack, but since you have groups of 4 characters, you could use a limited split, with a capture:

my ($dir, $name, $file, $ext) = grep length, split /(....)/, $filename, 4;

This is pretty obfuscated, so I probably wouldn't use it, but the capture in a split is an often overlooked ability.

So, here's an explanation of what this code does:

Step 1. split with capturing parentheses adds the values captured by the pattern to its output stream. The stream contains a mix of fields and delimiters.

qw( a 1 b 2 c 3 ) == split /(\d)/, 'a1b2c3';

Step 2. split with 3 args limits how many times the string is split.

qw( a b2c3 ) == split /\d/, 'a1b2c3', 2;

Step 3. Now, when we use a delimiter pattern that matches pretty much anything, such as /(.)/ or /(....)/, we get a bunch of empty (0 length) strings. I've marked delimiters with D characters, and fields with F:

 ( '', 'a', '', '1', '', 'b', '', '2' ) == split /(.)/, 'a1b2';
   F    D   F    D   F    D   F    D

Step 4. So if we limit the number of fields to 3 we get:

 ( '', 'a', '', '1', 'b2' ) == split /(.)/, 'a1b2', 3;
   F    D   F    D   F

Step 5. Putting it all together we can do this (I used a .jpeg extension so that the extension would be longer than 4 characters):

 ( '', 1000, '', 2000, '', 3000, '.jpeg' ) == split /(....)/, '100020003000.jpeg', 4;
   F   D     F   D     F   D     F

Step 6. Step 5 is almost perfect, all we need to do is strip out the null strings and we're good:

( 1000, 2000, 3000, '.jpeg' ) == grep length, split /(....)/, '100020003000.jpeg', 4;

This code works, and it is interesting. But it's not any more compact than any of the other solutions. I haven't benchmarked, but I'd be very surprised if it wins any speed or memory efficiency prizes.

But the real issue is that it is too tricky to be good for real code. Using split to capture delimiters (and maybe one final field), while throwing out the field data is just too weird. It's also fragile: if one field changes length the code is broken and has to be rewritten.

So, don't actually do this.

At least it provided an opportunity to explore some lesser known features of split.

score 0 · Answer 5 · edited Oct 08 '09 at 18:51

Both substr and unpack bias your thinking toward fixed-layout, while regex solutions are more oriented toward flexible layouts with delimiters.

The example you gave appeared to be fixed layout, but directories are usually separated from file names by a delimiter (e.g. slash for POSIX-style file systems, backslash for MS-DOS, etc.). So you might actually have a case for both: a regex solution to split directory and file name apart (or even directory/name/extension), and then a fixed-length approach for the name part by itself.
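A rough sketch of that two-stage idea, assuming a POSIX-style path whose base name still uses the 4+4+4 layout from the question (the path and variable names here are made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical input: a delimited path whose base name uses the fixed layout.
my $path = "/var/www/images/100020003000.png";

# Stage 1: a regex handles the delimited parts (directory, base name, extension).
my ($directory, $base, $ext) = $path =~ m{^(.*)/([^/]+)\.([^./]+)$}
    or die "Unrecognized path: $path\n";

# Stage 2: unpack handles the fixed-width fields inside the base name.
my ($dir_id, $name_id, $file_id) = unpack 'A4 A4 A4', $base;

print "$directory | $dir_id $name_id $file_id | $ext\n";
```

With the sample path this prints `/var/www/images | 1000 2000 3000 | png`. If a field later changes width, only the unpack template needs adjusting, while the regex keeps coping with directories of any depth.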
