I've seen questions before along the lines of finding the most frequent occurrence of a string within a file, but all of those rely on knowing what to look for.

I have what you might almost call a flat-file database that grabs a bunch of input data and wraps different parts of it in HTML span tags with referencing ids.

Each entry comes out like this:

<p>
<span class="ip">58.106.**.***</span> 
Wrote <span class='text'>some text</span>
<span class='effect1'> and caused seizures </span>
<span class='time'>23:47</span> 
</p>

How would I then go about finding the #text contents that occur most often?

For example, if I had

<p>
    <span class="ip">58.106.**.***</span> 
    Wrote <span class='text'>woof</span>
    <span class='effect1'> and caused seizures </span>
    <span class='time'>23:47</span> 
    </p>

<p>
    <span class="ip">58.106.**.***</span> 
    Wrote <span class='text'>meow</span>
    <span class='effect1'> and caused mind-splosion </span>
    <span class='time'>23:47</span> 
    </p>

<p>
    <span class="ip">58.106.**.***</span> 
    Wrote <span class='text'>meow</span>
    <span class='effect1'> and used no effect </span>
    <span class='time'>23:47</span> 
    </p>

<p>
    <span class="ip">58.106.**.***</span> 
    Wrote <span class='text'>meow</span>
    <span class='effect1'> and used no effect </span>
    <span class='time'>23:47</span> 
    </p>

The output would be 'meow'.

How would I accomplish this in PHP?

Michael Zaporozhets

2 Answers

Have a look at DOMXPath: you can use an XPath query to get all the #text contents and then find the most frequent one in PHP.
One problem is that you used the same id several times, which is not valid HTML, so the DOM parser might break.

Daniel

First off: your format is not conducive to this type of data manipulation; you might want to consider changing it.

That said, based on this structure the logical solution is to leverage DOMXPath, as Dani says. The duplicate ids could have been a problem, but in practice it works (after emitting a boatload of warnings, which is one more reason the data structure warrants revision).

Here's some code to go with the idea:

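// get_input() stands in for however you obtain the markup,
// e.g. file_get_contents($filename)
$input = '<body>'.get_input().'</body>';
$doc = new DOMDocument;
$doc->loadHTML($input); // lots of warnings, duplicate ids!
$xpath = new DOMXPath($doc);
// Select the text nodes inside every element whose id is "text"
$result = $xpath->query("//*[@id='text']/text()");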

$occurrences = array();
foreach ($result as $item) {
    if (!isset($occurrences[$item->wholeText])) {
        $occurrences[$item->wholeText] = 0;
    }
    $occurrences[$item->wholeText]++;
}

// Sort the results and produce final answer    
arsort($occurrences);
reset($occurrences);

echo "The most common text is '".key($occurrences).
     "', which occurs ".current($occurrences)." times.";

See it in action.

Update (seeing as you fixed the duplicate id issue): You would simply change the XPath query to "//*[@class='text']/text()" so that it continues to match. However, this way of doing things remains inefficient, so if one or more of these apply:

  • you are going to do this all the time
  • you have lots of data
  • you need it to be really fast

then changing the data format is a good idea.
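
For the class-based markup, the whole lookup could be sketched roughly as follows. This is only an illustration, not the code from the answer above: `mostCommonText()` is a made-up helper name, it reads each span's `textContent` instead of selecting `text()` nodes, and it assumes PHP 7.3+ for `array_key_first()`. The XPath predicate is the usual way to match `text` as one of several space-separated class names, which the simpler `@class='text'` test would miss.

```php
<?php
// Sketch only: mostCommonText() is a hypothetical helper, not part of the answer.
// Returns [text, count] for the most frequent <span class="text"> content,
// or null when the fragment contains no such span.
function mostCommonText(string $html): ?array
{
    // Collect libxml parser complaints internally instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML('<body>' . $html . '</body>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match "text" as one of possibly several space-separated class names.
    $nodes = $xpath->query(
        "//span[contains(concat(' ', normalize-space(@class), ' '), ' text ')]"
    );

    $texts = [];
    foreach ($nodes as $node) {
        $texts[] = trim($node->textContent);
    }
    if ($texts === []) {
        return null;
    }

    // Tally each distinct string, then sort by count, highest first.
    $counts = array_count_values($texts);
    arsort($counts);
    $top = array_key_first($counts);

    // Numeric-looking strings become integer keys, so cast back to string.
    return [(string) $top, $counts[$top]];
}
```

Called on the four sample entries from the question, this should report 'meow' with a count of 3. The markup is still re-parsed on every call, so the efficiency concern above applies to this version as well.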

Jon
  • Yep, I fixed the issue with the ids (need to sleep more, haha) and this is amazing, thanks a ton. I don't have to load the input into the page that I'm in though, do I? Can I not simply reference the text file with something like file_get_contents($filename)? – Michael Zaporozhets Apr 21 '12 at 14:32
  • I don't need it to be really fast, but it would certainly be a bonus, and the other two apply as well :S but I want to keep it in html/text format and be able to reference the individual elements. – Michael Zaporozhets Apr 21 '12 at 14:34
  • @MagicDev: Yes to the first comment. For the second, it all depends on what exactly your requirements are. I can't say without all the context you have. – Jon Apr 21 '12 at 14:43
  • Just need to be able to sort through client side with basic functions and be able to retrieve things specific to ip address; that's basically it. – Michael Zaporozhets Apr 21 '12 at 14:46
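
As a follow-up to the last comment, retrieving the entries for one IP address with the same DOMXPath approach might look something like the sketch below. `entriesForIp()` is a hypothetical helper name, and the code assumes every entry is a `<p>` whose direct children include one `span` with class `ip` and one with class `text`, as in the question's sample.

```php
<?php
// Sketch only: entriesForIp() is a hypothetical helper, not from the thread.
// Returns the "text" contents of every <p> entry written by the given IP.
function entriesForIp(string $html, string $ip): array
{
    // Collect libxml parser complaints internally instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML('<body>' . $html . '</body>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $texts = [];
    foreach ($xpath->query('//p') as $entry) {
        // A leading "." makes the query relative to the current <p> element.
        $ipNode   = $xpath->query("./span[@class='ip']", $entry)->item(0);
        $textNode = $xpath->query("./span[@class='text']", $entry)->item(0);
        if ($ipNode === null || $textNode === null) {
            continue; // skip entries that lack either span
        }
        if (trim($ipNode->textContent) === $ip) {
            $texts[] = trim($textNode->textContent);
        }
    }
    return $texts;
}
```

With the file-based setup discussed in the comments, the input would come from `file_get_contents($filename)`. That function returns `false` on failure, so it is worth checking the result before passing it on.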