
I have a collection of files with a certain structure:

COMPANY_DE-Actual-Contents-of-File-RGB-ENG.pdf

Breakdown:

  • COMPANY -> Company name, fixed
  • DE -> Office location, fixed options: '_DE', '_BE', or absent for files that do not depend on location; if present, it is always preceded by an underscore and the company name
  • Actual-Contents-of-File -> Description of the contents, a string glued together with dashes
  • RGB -> Colormode, fixed options: 'RGB', 'CMYK', 'PMS', or absent for non-color related files
  • ENG -> Language of file, fixed options: 'GER', 'ENG', or absent for non-text related files
  • pdf -> Extension, can be anything

Ideally my result would be an array holding the above info under named keys, but I wouldn't know where to start.

Help would be greatly appreciated!

Thanks, Knal


Sorry to have been so unclear, but a few variables are not always present in the filename:

  • DE -> Office location, fixed options: '_DE', '_BE', or absent
  • RGB -> Colormode, fixed options: 'RGB', 'CMYK', 'PMS', or absent
  • ENG -> Language of file, fixed options: 'GER', 'ENG', or absent

knalpiap

5 Answers

1

Try

$string = "COMPANY_DE-Actual-Contents-of-File-RGB-ENG.pdf";
$array = preg_split('/[-_\.]/', $string);

$len = count($array);
$struct = array($array[0], $array[1], '', $array[$len-3], $array[$len-2], $array[$len-1]);
unset($array[0], $array[1], $array[$len-3], $array[$len-2], $array[$len-1]);
$struct[2] = implode('-', $array);
var_dump($struct);

Output:

array
  0 => string 'COMPANY' (length=7)
  1 => string 'DE' (length=2)
  2 => string 'Actual-Contents-of-File' (length=23)
  3 => string 'RGB' (length=3)
  4 => string 'ENG' (length=3)
  5 => string 'pdf' (length=3)
Jack
  • split(perl) or explode(php) are best in this situation, yeah! – gaussblurinc Apr 19 '12 at 08:22
  • Thanks for your help, but if the company's name, language, etc. is anywhere else in the filename it breaks. I feel it should use the fixed positions of the given variables... – knalpiap Apr 19 '12 at 08:45
  • What do you mean? Can you give an example string of where it breaks? You mean like in the contents? – Jack Apr 19 '12 at 08:49
  • Ok well if the format is the same for all, I guess you can just use fixed positions. Updated code should work. – Jack Apr 19 '12 at 09:06
  • Thanks Jack, I now have a fail-safe solution based mostly on Armatus' answer and slightly on yours. I'll post it when I can. Thank you very much for your effort and help! – knalpiap Apr 19 '12 at 10:06
1

Try not to use regular expressions if possible, or keep them as simple as it gets.

$text = "COMPANY_DE-Actual-Contents-of-File-RGB-ENG.pdf";
$options_location = array('DE','BE');
$options_color = array('RGB','CMYK','PMS');
$options_language = array('ENG','GER');

// If $text holds several filenames, one per line, split it and process each line:
$lines = explode("\n", $text);
foreach ($lines as $line) {
    $parts = preg_split('/[-_.]/', $line);
    $data = array(); // result array
    $data['company'] = array_shift($parts); // the first element is always the company
    $data['filetype'] = array_pop($parts); // the last element is always the file type
    foreach ($parts as $part) { // test each remaining part for what it is
        if (in_array($part, $options_location)) {
            $data['location'] = $part;
        } elseif (in_array($part, $options_color)) {
            $data['color'] = $part;
        } elseif (in_array($part, $options_language)) {
            $data['lang'] = $part;
        } else {
            // none of the reserved words, so append it to the content
            $data['content'] = isset($data['content']) ? $data['content'].' '.$part : $part;
        }
    }
    print_r($data);
}

This is easier to understand as well, instead of having to figure out what exactly a regex is doing.

Note that this assumes that no part of the content can be one of the words which are reserved for location, color or language. If it is possible for these to occur within the contents, you will have to add conditions like isset($data['location']) to check if there was already another location found and if so add the correct one to the content instead of storing it as the location.
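The guard described in the note could look roughly like this for the location. This is only a sketch: classify_parts() is a hypothetical helper name, and "the first location wins" is an assumption that fits because the location always sits right after the company. The same first-wins guard would not be enough for color and language, since those sit at the end of the name.

```php
<?php
// Hypothetical helper (not part of the answer above): classify the parts of one
// filename, keeping only the first location code and treating repeats as content.
function classify_parts(string $filename): array
{
    $options_location = array('DE', 'BE');
    $options_color    = array('RGB', 'CMYK', 'PMS');
    $options_language = array('ENG', 'GER');

    $parts = preg_split('/[-_.]/', $filename);
    $data  = array(
        'company'  => array_shift($parts), // first element is the company
        'filetype' => array_pop($parts),   // last element is the extension
    );
    $content = array();

    foreach ($parts as $part) {
        if (in_array($part, $options_location) && !isset($data['location'])) {
            $data['location'] = $part;     // first location code wins
        } elseif (in_array($part, $options_color)) {
            $data['color'] = $part;
        } elseif (in_array($part, $options_language)) {
            $data['lang'] = $part;
        } else {
            $content[] = $part;            // also catches a repeated location code
        }
    }
    $data['content'] = implode(' ', $content);

    return $data;
}

print_r(classify_parts('COMPANY_DE-Made-in-BE-RGB.pdf'));
```

Here the second location code, BE, ends up in the content ("Made in BE") instead of overwriting DE. A name without a location but with BE somewhere in its contents would still be misread, so a position-based check remains the safer route.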

Armatus
0

Something like this:

preg_match('#^([^_]+)(_[^-]+)?-([\w-]+)-(\w+)-(\w+)(\.\w+)$#i', 'COMPANY_DE-Actual-Contents-of-File-RGB-ENG.pdf', $m);

preg_match('#^([^_]+)(_[^-]+)?-([\w-]+)-(\w+)[_-]([^_]+)(\.\w+)$#i', 'COMPANY_DE-Actual-Contents-of-File-RGB-ENG.pdf', $m); // for both '_' and '-'

preg_match('#^(\p{Lu}+)(-\p{Lu}+)?-([\w]+)(\-(\p{Lu}+))?(\-(\p{Lu}+))?(\.\w+)$#', 'COMPANY-NL-Actual_Contents_of_File-RGB-ENG.pdf', $m); // if filename parts divider is strictly '-'

var_dump($m);

In the last variant, as you were asking, if there is no country code (-NL) its entry will be NULL. For the color and language codes that is not the case. Try it yourself and you'll figure out how it works!
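By default preg_match() fills an optional group that did not match with an empty string rather than NULL. One option for real NULL values at fixed positions is the PREG_UNMATCHED_AS_NULL flag (available since PHP 7.2). The sketch below is not one of the patterns above: it assumes '-' separates every field, the contents use '_', and the allowed codes are the ones listed in the question. parse_dashed() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: returns the match array, or null if the name does not fit.
// Index 1 = company, 2 = location, 3 = contents, 4 = color, 5 = language, 6 = extension.
function parse_dashed(string $file): ?array
{
    // Assumed layout: COMPANY[-LOCATION]-Contents_with_underscores[-COLOR][-LANG].ext
    $pattern = '#^(\p{Lu}+)(?:-(DE|BE))?-(\w+)(?:-(RGB|CMYK|PMS))?(?:-(ENG|GER))?\.(\w+)$#';

    // PREG_UNMATCHED_AS_NULL reports optional groups that did not match as NULL
    if (!preg_match($pattern, $file, $m, PREG_UNMATCHED_AS_NULL)) {
        return null;
    }
    return $m;
}

var_dump(parse_dashed('COMPANY-DE-Actual_Contents_of_File-RGB-ENG.pdf'));
var_dump(parse_dashed('COMPANY-Actual_Contents_of_File-ENG.pdf'));
```

With this layout the language is always at index 5, whether or not location and color are present.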

s.webbandit
  • I can't reproduce this, my result is: Array ( [0] => DESIGN_NL-Actual-Contents-of-File-RGB_ENG.docx [1] => DESIGN [2] => _NL [3] => Actual-Contents-of [4] => File [5] => RGB_ENG [6] => .docx ) So the filename is cut up, and the language and colormode are combined... – knalpiap Apr 19 '12 at 08:24
  • in your first example you gave `COMPANY_DE-Actual-Contents-of-File-RGB-ENG.pdf` in second `DESIGN_NL-Actual-Contents-of-File-RGB_ENG.docx`. Different dividers between `RGB` and `ENG` - `_` and `-`. Which one would you use? Both? – s.webbandit Apr 19 '12 at 08:38
  • If both see new regexp in answer – s.webbandit Apr 19 '12 at 08:40
  • Sorry, my sloppy mistake. It should always be a dash... If I remove the colormode from the filename (which could happen), the match breaks again... Secondly, would it be possible to give the results fixed positions in the array? E.g. $m[5] would always be the language, and if no language is set, it's NULL... – knalpiap Apr 19 '12 at 08:51
  • give new string (with colormode removed) – s.webbandit Apr 19 '12 at 08:53
  • COMPANY_DE-Actual-Contents-of-File-RGB-ENG.pdf could become COMPANY_DE-Actual-Contents-of-File-ENG.pdf (no color), or COMPANY_DE-Actual-Contents-of-File-RGB.pdf (no language), or COMPANY-Actual-Contents-of-File-RGB-ENG.pdf (no location), or a combination. – knalpiap Apr 19 '12 at 09:06
  • maybe you can divide the `Actual-Contents-of-File` and `RGB` parts with `_`, not with `-`? – s.webbandit Apr 19 '12 at 09:35
  • That's still possible. In that case it would seem logical that all possible variables are separated by a dash, and `Actual-contents-of-file` becomes `Actual_contents_of_file` – knalpiap Apr 19 '12 at 09:44
  • And the country code would also be `COUNTRY-DE`, not `COUNTRY_DE`? If so it's pretty easy – s.webbandit Apr 19 '12 at 09:47
  • Thanks webbandit, I've found a final fail-safe solution based on Armatus' answer. I'll post it if the site allows me to. Thank you very much for your effort and help! – knalpiap Apr 19 '12 at 10:05
0

How about:

$files = array(
    'COMPANY_DE-Actual-Contents-of-File-RGB-ENG.pdf',
    'COMPANY_BE-Actual-Contents-of-File-CMYK-ENG.pdf',
    'COMPANY_DE-Actual-Contents-of-File-PMS-GER.doc',
    'COMPANY-Actual-Contents-of-File-PMS-GER.doc',
    'COMPANY-Actual-Contents-of-File-GER.doc',
    'COMPANY-Actual-Contents-of-File.doc',
);

foreach($files as $file) {
    preg_match('/^(?<COMPANY>.*?)_?(?<LOCATION>DE|BE)?-(?<CONTENT>.*?)-?(?<COLOR>RGB|CMYK|PMS)?-?(?<LANG>ENG|GER)?\.(?<EXT>[^.]+)$/', $file, $m);
    echo "\nfile=$file\n";
    echo "COMPANY: ",$m['COMPANY'],"\n";
    echo "LOCATION: ",$m['LOCATION'],"\n";
    echo "CONTENT: ",$m['CONTENT'],"\n";
    echo "COLOR: ",$m['COLOR'],"\n";
    echo "LANG: ",$m['LANG'],"\n";
    echo "EXT: ",$m['EXT'],"\n";
}

output:

file=COMPANY_DE-Actual-Contents-of-File-RGB-ENG.pdf
COMPANY: COMPANY
LOCATION: DE
CONTENT: Actual-Contents-of-File
COLOR: RGB
LANG: ENG
EXT: pdf

file=COMPANY_BE-Actual-Contents-of-File-CMYK-ENG.pdf
COMPANY: COMPANY
LOCATION: BE
CONTENT: Actual-Contents-of-File
COLOR: CMYK
LANG: ENG
EXT: pdf

file=COMPANY_DE-Actual-Contents-of-File-PMS-GER.doc
COMPANY: COMPANY
LOCATION: DE
CONTENT: Actual-Contents-of-File
COLOR: PMS
LANG: GER
EXT: doc

file=COMPANY-Actual-Contents-of-File-PMS-GER.doc
COMPANY: COMPANY
LOCATION:
CONTENT: Actual-Contents-of-File
COLOR: PMS
LANG: GER
EXT: doc

file=COMPANY-Actual-Contents-of-File-GER.doc
COMPANY: COMPANY
LOCATION:
CONTENT: Actual-Contents-of-File
COLOR:
LANG: GER
EXT: doc

file=COMPANY-Actual-Contents-of-File.doc
COMPANY: COMPANY
LOCATION:
CONTENT: Actual-Contents-of-File
COLOR:
LANG:
EXT: doc
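Since the question asks for an array with named keys, the same pattern can be wrapped in a small function. This is a sketch only: parse_filename() is a hypothetical name, and turning empty matches into NULL is an assumption about what the caller wants.

```php
<?php
// Hypothetical wrapper around the pattern above: returns only the named keys,
// with NULL for pieces that are absent, or null if the name does not match.
function parse_filename(string $file): ?array
{
    $pattern = '/^(?<COMPANY>.*?)_?(?<LOCATION>DE|BE)?-(?<CONTENT>.*?)-?(?<COLOR>RGB|CMYK|PMS)?-?(?<LANG>ENG|GER)?\.(?<EXT>[^.]+)$/';

    if (!preg_match($pattern, $file, $m)) {
        return null;
    }

    // preg_match() stores every named group under its numeric index as well;
    // keep only the string keys.
    $named = array_filter($m, 'is_string', ARRAY_FILTER_USE_KEY);

    // Optional groups that did not match come back as '', so map those to NULL.
    return array_map(function ($value) {
        return $value === '' ? null : $value;
    }, $named);
}

print_r(parse_filename('COMPANY-Actual-Contents-of-File-GER.doc'));
```

The keys are always COMPANY, LOCATION, CONTENT, COLOR, LANG and EXT, so callers can rely on them being present even when a piece is missing from the filename.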
Toto
0

Inspired by @Armatus, I've constructed the following, which appears to be fail-safe:

$string = "COMPANY_DE-Actual-Contents+of-File-RGB-ENG.pdf";
$options_location = array('DE','BE');
$options_color = array('RGB','CMYK','PMS');
$options_language = array('ENG','GER');
$parts = preg_split('/[._-]/', $string, -1, PREG_SPLIT_NO_EMPTY);

$data = array();
$data['company'] = array_shift($parts);
$data['filetype'] = array_pop($parts);

// Location, if present, is always the first remaining part
if (isset($parts[0]) && in_array($parts[0], $options_location)) {
    $data['location'] = array_shift($parts);
} else {
    $data['location'] = NULL;
}

// Language, if present, is always the last remaining part
if (in_array(end($parts), $options_language)) {
    $data['language'] = array_pop($parts);
} else {
    $data['language'] = NULL;
}

// Colormode, if present, is the last part once the language is removed
if (in_array(end($parts), $options_color)) {
    $data['colormode'] = array_pop($parts);
} else {
    $data['colormode'] = NULL;
}

$data['content'] = implode(' ', $parts);
print_r($data);
knalpiap