
I have a PHP web script that accepts a user-entered regular expression and uses it to search a large text file (9.4 million lines, around 160MB). In the first iteration of my script, I had the file sitting on a regular file system, and when I needed to search it, I would access it using fopen / fgets and search it line by line. Depending on the complexity of the regular expression, the script got through the entire file in 30-45 seconds.
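The line-by-line approach described above might look roughly like this (a minimal sketch with hypothetical function and file names; the real script presumably also streams matches back as they are found):

```php
<?php
// Sketch of a line-by-line regex search over a large word list.
// searchFile() and the path are hypothetical names for illustration.
function searchFile(string $path, string $pattern): array
{
    $matches = [];
    $fh = fopen($path, 'r');
    if ($fh === false) {
        throw new RuntimeException("Cannot open $path");
    }
    // fgets() reads one line per call, so memory use stays flat
    // even on a 9.4-million-line file.
    while (($line = fgets($fh)) !== false) {
        $line = rtrim($line, "\n");
        if (preg_match($pattern, $line)) {
            $matches[] = $line;
        }
    }
    fclose($fh);
    return $matches;
}
```

With this structure, the per-line `preg_match()` call dominates the runtime once the file data is already cached, which is consistent with the timings described.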

To try and speed it up, I mounted a 1GB tmpfs partition and moved the large text file onto it. I then changed the path in the PHP script, hoping to see an immediate improvement. However, the speed at which the script parses the file hasn't changed, and across multiple runs it sometimes even appeared slower than when reading the file from a regular file system.
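For reference, the tmpfs setup described would look something like this (mount point and file name are assumed; mounting requires root):

```shell
# Assumed commands for the 1GB tmpfs setup described above.
mkdir -p /mnt/ramdisk
mount -t tmpfs -o size=1G tmpfs /mnt/ramdisk
cp /path/to/wordlist.txt /mnt/ramdisk/
```

Note that this is a configuration fragment only; the PHP script would then point at `/mnt/ramdisk/wordlist.txt`.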

Furthermore, I tried loading the entire file into RAM in PHP by pulling it into an array first, which did improve the search time by 40% or so. Unfortunately, this is not an acceptable approach for me, since the initial load-file-into-array time is quite long.
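The load-into-an-array variant mentioned above could be sketched like this (function name and path are hypothetical; `file()` and `preg_grep()` are standard PHP):

```php
<?php
// Sketch of the load-everything-first approach: read all lines into
// memory up front, then filter them with a single preg_grep() call.
function searchPreloaded(string $path, string $pattern): array
{
    // file() is the slow up-front step the question mentions:
    // it allocates one array element per line of the file.
    $lines = file($path, FILE_IGNORE_NEW_LINES);
    // preg_grep() applies the pattern to every element in one call,
    // avoiding per-line PHP-level function-call overhead.
    return array_values(preg_grep($pattern, $lines));
}
```

The roughly 40% speedup is plausible here because the per-line loop and `fgets()` calls are replaced by C-level iteration inside `preg_grep()`, at the cost of the long initial `file()` load.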

This is all happening on a virtual server with 12GB of RAM, running Debian 7, with nginx / php5-fpm.

What is happening with my tmpfs? Is there something I am missing? I'll supply whatever additional information necessary.

allenrabinovich
  • My first question: what's in the file? –  Jan 06 '15 at 21:01
  • @Dagon English words and phrases, one word or phrase per line, with a number at the end of the line that serves as a commonness metric. A combination of dictionary entries, common idioms, titles of Wikipedia articles. Here's a random sampling of a few lines from somewhere in the middle (these are one per line): belligerent 35 beneficial 35 beset 35 betel 35 bicker 35 bidder 35 bier 35 bill_of_rights 35 billboard 35 billiards 35 billing 35 billow 35 billy 35 biota 35 blackhead 35 bland 35 blistering 35 blood_brain_barrier 35 blood_vessel 35 blotter 35 blow_out 35 blue_book 35 – allenrabinovich Jan 06 '15 at 21:13
  • sounds like this should be in a db not a flat file, then it can be indexed and searched far more efficiently. –  Jan 06 '15 at 21:14
  • @Dagon I've actually tried this -- I had a simple MySQL db with the words/phrases and an index on the relevant column. It didn't speed this up, and I think in some cases it slowed things down. I think the issue is that the regular expressions used can be rather arbitrary (could be something as simple as `/^beaut....$/`, but could also be something like `/^(.{3})(men|cat|dog)\1$/`, which matches "tormentor" in English, and that's it). I also have a requirement of being able to retrieve results as they are found, which doesn't seem possible with a db (neither are captured/named groups, it seems). – allenrabinovich Jan 06 '15 at 21:26

0 Answers