
I have some files that all share a format like the following:

Nearest Location:
771 km S 43° E of Quezon City
051 km S 66° E of Surigao City 
007 km S 51° W of Socorro (Surigao Del Norte)
049 km N 70° E of PFZ EAST MINDANAO SEGMENT

I load each file with file_get_contents() and split it with explode("\n",...). Then I loop with foreach($rows_location as $row_location) and store each row in $content_location = $row_location;. Skipping the first line, I split each row into 2 parts, the distance and bearing info and the location, as follows:

$distance_bearing = substr($content_location,0,18);
$location = substr($content_location, 18);

The problem is that it only works for the first row. When I saved the two parts separately, I got

771 km S 43° E of

for $distance_bearing and

Quezon City
051 km S 66° E of Surigao City 
007 km S 51° W of Socorro (Surigao Del Norte)
049 km N 70° E of PFZ EAST MINDANAO SEGMENT

for $location.

I tried passing each row through utf8_decode(), since it contains a degree symbol, but it returned the same text with the degree symbol replaced by ?. I also printed $content_location to check whether it stores the rows properly and got:

Nearest Location:
771 km S 43° E of Quezon City
051 km S 66° E of Surigao City 
007 km S 51° W of Socorro (Surigao Del Norte)
049 km N 70° E of PFZ EAST MINDANAO SEGMENT

so I guess it stores it correctly.

I don't know what the problem is, so any ideas will help. Thank you in advance for the help.


Sorry guys, I forgot to include the desired output. It should be:

771 km S 43° E of
051 km S 66° E of
007 km S 51° W of
049 km N 70° E of

for $distance_bearing and

Quezon City
Surigao City 
Socorro (Surigao Del Norte)
PFZ EAST MINDANAO SEGMENT

for $location. Thanks again!

mickmackusa
phiv
  • substr() only extracts 18 characters in what you have, was that what you wanted? – Jenny Holder Nov 22 '17 at 02:16
  • thanks for the input guys, I did some editing to the question for clarification – phiv Nov 22 '17 at 02:28
  • It looks like it is only recognizing the eol/`\n` after `Quezon City`, and not after the other locations. – Sean Nov 22 '17 at 02:28
  • Thanks @Sean. Yeah I thought so too so I checked $content_location by printing it and it returns it row by row. That's why I'm confused why it cannot parse even though it explodes it properly. – phiv Nov 22 '17 at 02:31
  • Thanks @mickmackusa, yeah I do want to keep it – phiv Nov 22 '17 at 02:41
  • Can you show your actual code? The simplified code you reference is working here - https://3v4l.org/9eMOv – Sean Nov 22 '17 at 02:42

1 Answer

preg_match_all() will handle the block of text simply.

Code:

$string='Nearest Location:
771 km S 43° E of Quezon City
051 km S 66° E of Surigao City
007 km S 51° W of Socorro (Surigao Del Norte)
049 km N 70° E of PFZ EAST MINDANAO SEGMENT';

if(preg_match_all('/^(.*? of) \K.+/m',$string,$out)){
    list($location,$distance_bearing)=$out;
}
var_export($distance_bearing);
echo "\n\n";
var_export($location);

Output:

array (
  0 => '771 km S 43° E of',
  1 => '051 km S 66° E of',
  2 => '007 km S 51° W of',
  3 => '049 km N 70° E of',
)

array (
  0 => 'Quezon City',
  1 => 'Surigao City',
  2 => 'Socorro (Surigao Del Norte)',
  3 => 'PFZ EAST MINDANAO SEGMENT',
)

Pattern Explanation:

/         # start the pattern
^         # match from the start of the line
(.*? of)  # match the leading text up to and including the first occurrence of `of`
 \K.+     # match 1 space, then restart the fullstring match using \K then match the rest of the string
/         # end the pattern
m         # force ^ character to match every line start instead of string start
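
One caveat about the `m` flag is worth spelling out: with PCRE's usual default newline setting, `^` only treats the position after `\n` as a line start, and the dot happily matches `\r`. If the rows are separated by a bare `\r`, the pattern above will see them as one long line. The following is a minimal sketch of a possible workaround, not part of the tested answer: it prefixes the pattern with the PCRE newline verb `(*ANYCRLF)`, and the sample string with mixed line endings is a hypothetical input.

```php
<?php
// Hypothetical input: LF after the header, bare CR between the rows.
$string = "Nearest Location:\n"
        . "771 km S 43° E of Quezon City\r"
        . "051 km S 66° E of Surigao City\r"
        . "007 km S 51° W of Socorro (Surigao Del Norte)\r"
        . "049 km N 70° E of PFZ EAST MINDANAO SEGMENT";

$distance_bearing = [];
$location = [];

// (*ANYCRLF) has to be the very first thing after the delimiter. It asks PCRE
// to treat CR, LF and CRLF as line breaks for ^, $ and the dot.
if (preg_match_all('/(*ANYCRLF)^(.*? of) \K.+/m', $string, $out)) {
    // $out[0] holds the text after \K, $out[1] holds the capture group.
    list($location, $distance_bearing) = $out;
}

var_export($distance_bearing);
echo "\n\n";
var_export($location);
```

Normalizing the line endings before matching is an equally reasonable choice; the verb is just handy when the input cannot be rewritten first.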
mickmackusa
  • Great explanation! You can also utilize pattern name matching `preg_match_all('/^(?P<distance_bearing>.*? of) \K(?P<location>.+)/m',$string,$out)` Which will give `$out` an associative index for the pattern names. `$out['location']` and `$out['distance_bearing']` like so: https://3v4l.org/jC5SK – Will B. Nov 22 '17 at 02:59
  • Yes, I think that it is a good option for this task. I generally stay away from named capture groups because you also get the indexed subarrays too, which means output array bloat. But I will grant you, it saves a function call `list()`. See the [bloat](https://3v4l.org/uGYc3) -- 5 subarrays versus 2. – mickmackusa Nov 22 '17 at 03:02
  • Thanks @mickmackusa, I tried it with the files that I have, the arrays output the same thing. It parses well with the first line but the rest prints as a whole line as follows: 771 km S 43РE of "\n" Quezon City "\n" 051 km S 66РE of Surigao City "\n" 007 km S 51РW of Socorro (Surigao Del Norte) "\n" 049 km N 70РE of PFZ EAST MINDANAO SEGMENT Do you think it's an eof thing? – phiv Nov 22 '17 at 03:08
  • @phiv can you provide a source file? – Will B. Nov 22 '17 at 03:09
  • um... maybe give us a var_dump or something, so we can count string length and prove there are invisible characters. I have another (more convoluted method) if that doesn't display anything. – mickmackusa Nov 22 '17 at 03:10
  • @mickmackusa here it is, thanks. https://drive.google.com/open?id=1EOv_1YLlrlzRzc_sCzaon19wnU57HfSs – phiv Nov 22 '17 at 03:13
  • I was having issues when I first copy pasted your input. You've got unicode control characters hiding in there... Use this pattern: `/^(.*? of) \K.+(?=[\pZ\pC]+$)/mu` See what my unicode metacharacters are standing for: [Whitespace Characters & Control Characters](https://www.regular-expressions.info/unicode.html) – mickmackusa Nov 22 '17 at 03:16
  • @phiv yea, your file uses `LF` after `Nearest Location` and `CR` after each location. Along with the unicode metacharacters. See: [image](https://i.imgur.com/OHVuve1.png) – Will B. Nov 22 '17 at 03:17
  • thanks, @fyrye may I know what you used to display the eof? – phiv Nov 22 '17 at 03:22
  • Notepad++, `View -> Show Symbol -> Show All Characters`. Can also convert encodings and EOLs. Most other IDEs have similar functions, such as PHPStorm which I normally use. – Will B. Nov 22 '17 at 03:23
  • How about `/^(.*? of) \K.+(?=\s*$)/m` then? (make the `\s` zero or more) https://regex101.com/r/dwqaYp/2 – mickmackusa Nov 22 '17 at 03:25
  • @mickmackusa it returns the same output with only the first line working – phiv Nov 22 '17 at 03:31
  • Please show your code block. `preg_match` or `preg_match_all()`? (me thinks you are using `preg_match()`) – mickmackusa Nov 22 '17 at 03:33
  • @mickmackusa, yup, I am using preg_match_all(). It may be because only the Nearest Location line ends in LF and the rest end in CR. I'm currently trying to convert the EOL – phiv Nov 22 '17 at 03:37
  • I expect that this should work: `if(preg_match_all('/^(.*? of) \K\S+(?: \S+)*/m',$string,$out)){` – mickmackusa Nov 22 '17 at 03:38
  • I am able to reproduce using `file_get_contents` See output: [image](https://i.imgur.com/uzjhhbA.png) of the initial answer. Your latest pattern breaks into an array of 3 strings. – Will B. Nov 22 '17 at 03:42
  • @phiv Use this `$string = utf8_encode(str_replace("\r", "\n", $string));` Then perform the original answer `preg_match_all` – Will B. Nov 22 '17 at 03:49
  • Here is the anchor-free version: http://sandbox.onlinephpfunctions.com/code/a47aeb7ca299d9255637d3f77adbb97ff9b65df2 `/(\d+ km [NESW] \d+° [NESW] of) \K[^\r\n]+/` – mickmackusa Nov 22 '17 at 03:50
  • @mickmackusa I get empty values with the `[NESW]` pattern and `file_get_contents` I believe that it is an encoding issue with the source file and the `default_charset` in PHP. When I convert the file to UTF-8 without BOM or use the snippet I provided on the file contents, the issue goes away with your initial answer. – Will B. Nov 22 '17 at 04:08
  • ...this has been a hairy one. – mickmackusa Nov 22 '17 at 04:10
  • @fyrye, thank you so much, it solved the EOL issue and the code ran :D – phiv Nov 22 '17 at 04:20
  • @mickmackusa, thank you so much, by applying the EOL your code works like a charm :D – phiv Nov 22 '17 at 04:21
  • thanks fyrye and mickmackusa, you both are life savers! – phiv Nov 22 '17 at 04:21
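
For readers who arrive with the same symptom, the fix that came out of the comments (normalize the line endings and the encoding, then run the original pattern) can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the asker's actual code: the sample string, the helper name `normalizeLocationText()` and the ISO-8859-1 source encoding are all hypothetical. It also swaps in `iconv()` where the comment used `utf8_encode()`, because `utf8_encode()` is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical sample mimicking the file described in the comments:
// LF after the header, bare CR after each row, and the degree sign stored
// as the single byte 0xB0 (ISO-8859-1 is an assumption about the file).
$raw = "Nearest Location:\n"
     . "771 km S 43\xB0 E of Quezon City\r"
     . "051 km S 66\xB0 E of Surigao City \r"
     . "007 km S 51\xB0 W of Socorro (Surigao Del Norte)\r"
     . "049 km N 70\xB0 E of PFZ EAST MINDANAO SEGMENT";

/**
 * Hypothetical helper: unify line endings to LF and convert to UTF-8.
 */
function normalizeLocationText(string $raw, string $sourceEncoding = 'ISO-8859-1'): string
{
    // CRLF first, then any remaining bare CR, so no blank lines are created.
    $text = str_replace(["\r\n", "\r"], "\n", $raw);

    // iconv() returns false on failure; fall back to the unconverted text.
    $utf8 = iconv($sourceEncoding, 'UTF-8', $text);

    return $utf8 === false ? $text : $utf8;
}

$string = normalizeLocationText($raw);

// Debugging aid: print the text with line breaks shown as escape sequences.
echo addcslashes($string, "\r\n"), "\n\n";

$distance_bearing = [];
$location = [];

// The pattern from the answer, unchanged.
if (preg_match_all('/^(.*? of) \K.+/m', $string, $out)) {
    list($location, $distance_bearing) = $out;
    // Drop trailing spaces such as the one after "Surigao City".
    $location = array_map('trim', $location);
}

var_export($distance_bearing);
echo "\n\n";
var_export($location);
```

If the real files turn out to use a different encoding, only the second argument of the helper should need to change.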