13

I have many log files of webpage visits, where each visit is associated with a user ID and a timestamp. I need to identify the most popular (i.e. most often visited) three-page sequence of all. The log files are too large to be held in main memory at once.

Sample log file:

User ID  Page ID
A          1
A          2
A          3
B          2
B          3
C          1
B          4
A          4

Corresponding results:

A: 1-2-3, 2-3-4
B: 2-3-4
2-3-4 is the most popular three-page sequence

My idea is to use two hash tables. The first hashes on user ID and stores that user's page sequence; the second hashes three-page sequences and stores the number of times each one appears. This takes O(n) space and O(n) time.
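
For concreteness, here is a minimal sketch of this idea (integer IDs, a single in-memory pass and plain std::map containers are assumed purely for illustration):

#include <array>
#include <fstream>
#include <iostream>
#include <map>

int main() {
    using Seq = std::array<int, 3>;
    std::map<int, Seq> history;     // user ID -> last three pages, oldest first
    std::map<Seq, long> counts;     // 3-page sequence -> number of occurrences
    Seq best{};
    long best_count = 0;

    std::ifstream log("log.txt");   // hypothetical log file name
    int user, page;
    while (log >> user >> page) {
        Seq& h = history[user];
        h = {h[1], h[2], page};     // slide the 3-page window
        if (h[0] != 0) {            // window is full (valid page IDs start at 1)
            long c = ++counts[h];
            if (c > best_count) { best_count = c; best = h; }
        }
    }
    std::cout << best[0] << '-' << best[1] << '-' << best[2]
              << " seen " << best_count << " times\n";
}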

However, since I have to use two hash tables, memory cannot hold everything at once, and I have to use disk. It is not efficient to access disk very often.

How can I do this better?

Pops
user1002288
  • Is the number of webpages quite large? (I'm getting at: is it reasonable to hold a "3-page visit" datastructure in memory?) – Larry OBrien Dec 30 '11 at 19:26
  • Yes, it is very large. It may not be possible to hold it in memory at once. – user1002288 Dec 30 '11 at 19:30
  • The hash tables in this case would be num_users elements of last two pages (small), and one of (num_pages)*3 elements. I'd be surprised if both hashtables didn't fit in memory, and disk access can't be much less. – Mooing Duck Dec 30 '11 at 19:33
  • This is a duplicate of [8195625](http://stackoverflow.com/questions/8195625/how-to-find-the-most-frequently-visited-3-webpage-sequence-from-a-large-log-file) but that did not get any compelling answers. – wallyk Dec 30 '11 at 19:34
  • Well, the _problem_ is a duplicate; the users' ideas for solutions are different. Still, very nice find, @wallyk! – Pops Dec 30 '11 at 19:37
  • How would you know the sequences without sorting the files by user id and timestamp? That would at least require O(n log n) – user1123450 Dec 30 '11 at 19:38
  • @MooingDuck, if we have only 1000 webpages, the second table should hold up to 1000^3 3-page-sequences, which is a lot of memory. – Evgeny Kluev Dec 30 '11 at 20:36
  • @EvgenyKluev: Oh, right, I did that math wrong. You're correct. – Mooing Duck Dec 30 '11 at 20:40
  • If the "log files are too big to be held in main memory at once", doesn't that almost imply that your sample set is more than enough to infer meaningful stats? Why not just cut the log file into slices that are each big enough to hold in memory? Then run your algorithm on each (or stop after 1 set). You might have an "off by 1" issue because of the breaks between log file chunks, but given the size of the data set, I have a hunch that it won't matter. – selbie Dec 30 '11 at 22:45
  • Is a hadoop solution acceptable? –  Dec 31 '11 at 11:40
  • Possible duplicate of [Most Frequent 3 page sequence in a weblog](http://stackoverflow.com/questions/2991480/most-frequent-3-page-sequence-in-a-weblog) – craftsmannadeem Mar 13 '16 at 17:25

5 Answers

4

If you want to quickly get an approximate result, use hash tables, as you intended, but add a limited-size queue to each hash table to drop the least recently used entries.
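
A minimal sketch of such a limited-size table with least-recently-used eviction (the string key format and the capacity handling here are illustrative assumptions, not part of this answer):

#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

class LruCounter {
    std::size_t capacity_;
    std::list<std::string> order_;  // keys ordered most recently used first
    std::unordered_map<std::string,
        std::pair<long, std::list<std::string>::iterator>> counts_;
public:
    explicit LruCounter(std::size_t capacity) : capacity_(capacity) {}

    // Increment the count for key, evicting the least recently used entry
    // if the table is full. Evicted counts are simply lost (hence "approximate").
    long increment(const std::string& key) {
        auto it = counts_.find(key);
        if (it != counts_.end()) {
            order_.erase(it->second.second);   // move key to the front
            order_.push_front(key);
            it->second.second = order_.begin();
            return ++it->second.first;
        }
        if (counts_.size() >= capacity_) {     // table full: drop the LRU entry
            counts_.erase(order_.back());
            order_.pop_back();
        }
        order_.push_front(key);
        counts_.emplace(key, std::make_pair(long(1), order_.begin()));
        return 1;
    }
};

Each log line would call increment with a key such as "1-2-3" built from the user's last three pages; whatever maximum you track this way is only approximate, which is the trade-off this method accepts.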

If you want an exact result, use an external sort procedure to sort the logs by user ID, then combine every 3 consecutive entries and sort again, this time by page IDs.

Update (sort by timestamp)

Some preprocessing may be needed to properly use the log files' timestamps:

  • If the log files are already sorted by timestamp, no preprocessing is needed.
  • If there are several log files (possibly coming from independent processes), and each file is already sorted by timestamp, open all these files and use a merge sort to read them (see the k-way merge sketch after this list).
  • If the files are almost sorted by timestamp (as when several independent processes write logs to a single file), use a binary heap to get the data into the correct order.
  • If the files are not sorted by timestamp (which is unlikely in practice), use an external sort by timestamp.
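
Here is a minimal sketch of that k-way merge (illustrative only; it assumes each log file is already sorted by timestamp, that every line starts with an integer timestamp, and that the file names are placeholders):

#include <cstddef>
#include <fstream>
#include <functional>
#include <iostream>
#include <queue>
#include <sstream>
#include <string>
#include <tuple>
#include <vector>

int main() {
    std::vector<std::string> names = {"log1.log", "log2.log", "log3.log"};
    std::vector<std::ifstream> files;
    for (const auto& n : names) files.emplace_back(n);

    // (timestamp, file index, full line); the smallest timestamp sits on top
    using Entry = std::tuple<long, std::size_t, std::string>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;

    auto push_next = [&](std::size_t i) {   // read one more line from file i
        std::string line;
        if (std::getline(files[i], line)) {
            long ts = 0;
            std::istringstream(line) >> ts;
            heap.emplace(ts, i, line);
        }
    };
    for (std::size_t i = 0; i < files.size(); ++i) push_next(i);

    while (!heap.empty()) {                 // emit lines in global timestamp order
        auto [ts, i, line] = heap.top();    // copy out the smallest entry
        heap.pop();
        std::cout << line << '\n';          // feed this line to the sequence counter
        push_next(i);                       // refill from the file it came from
    }
}

Only one line per file is held in memory, and the downstream counting code sees a single stream in global time order.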

Update 2 (improving the approximate method)

The approximate method with an LRU queue should produce quite good results for randomly distributed data. But webpage visits may follow different patterns at different times of day, or differ on weekends. The original approach may give poor results for such data. To improve it, a hierarchical LRU queue may be used.

Partition the LRU queue into log(N) smaller queues with sizes N/2, N/4, ... The largest one may contain any element; the next one only elements seen at least 2 times; the next, elements seen at least 4 times; and so on. When an element is evicted from one sub-queue, it is added to the next one down, so it passes through every sub-queue lower in the hierarchy before it is completely removed. Such a priority queue still has O(1) complexity per operation, but gives a much better approximation of the most popular sequences.

Evgeny Kluev
  • +1 This is the simplest solution, and the one I'd probably go with if I didn't have a lot of time to spend on coming up with an "optimum" run-time solution. Doing two external sorts is expensive, but not nearly as expensive as my time. – Jim Mischel Dec 31 '11 at 18:06
3

There may be errors lurking here, but this should take a limited amount of RAM for a virtually unlimited-length log file.

#include <array>
#include <climits>
#include <cstddef>
#include <fstream>
#include <iostream>
#include <unordered_map>

typedef int pageid;
typedef int userid;
typedef std::array<pageid, 3> sequence; // [0] = most recent page, [2] = oldest
typedef int sequence_count;

struct sequence_hash { // lets a sequence be an unordered_map key
    std::size_t operator()(const sequence& s) const {
        return std::hash<long long>()(((long long)s[0] * 1000003 + s[1]) * 1000003 + s[2]);
    }
};

const int num_pages = 1000; //where 1-1000 inclusive are valid pageids
const int num_passes = 4;

int main() {
    std::unordered_map<userid, sequence> userhistory;
    std::unordered_map<sequence, sequence_count, sequence_hash> visits;
    sequence_count max_count = 0;
    sequence max_sequence = {};
    userid curuser;
    pageid curpage;
    for (int pass = 0; pass < num_passes; ++pass) { //have to go in four passes
        userhistory.clear(); //discard per-user state from the previous pass
        visits.clear(); //only count sequences starting in this pass's page range
        std::ifstream logfile("log.log");
        pageid minpage = num_pages/num_passes*pass; //exclusive lower bound of this pass's range
        pageid maxpage = num_pages/num_passes*(pass+1)+1; //exclusive upper bound
        if (pass == num_passes-1) //if it's the last pass, fix rounding errors
            maxpage = INT_MAX;
        while (logfile >> curuser >> curpage) { //read in line
            sequence& curhistory = userhistory[curuser]; //find that user's history
            curhistory[2] = curhistory[1];
            curhistory[1] = curhistory[0];
            curhistory[0] = curpage; //push back new page for that user
            //count only if they visited three pages in a row (curhistory[2] != 0)
            //and the sequence starts in this pass's page range
            if (curhistory[2] > minpage && curhistory[2] < maxpage) {
                sequence_count& count = visits[curhistory]; //get times sequence was hit
                ++count; //and increase it
                if (count > max_count) { //if that's a new max
                    max_count = count; //update the max
                    max_sequence = curhistory; //std::array, so this copies the whole triple
                }
            }
        }
    }
    std::cout << "The sequence visited the most is:\n";
    std::cout << max_sequence[2] << '\n';
    std::cout << max_sequence[1] << '\n';
    std::cout << max_sequence[0] << '\n';
    std::cout << "with " << max_count << " visits.\n";
    return 0;
}

Note that if your pageid or userid are strings instead of ints, you'll take a significant speed/size/caching penalty.

[EDIT2] It now works in 4 (customizable) passes, which means it uses less memory, making it realistic to run in RAM. It just goes proportionately slower.

Mooing Duck
  • @user1002288: fixed answer to be your algorithm with 4 passes, shrinking the hash-table proportionately. – Mooing Duck Dec 30 '11 at 21:20
  • Your second approach requires knowing the problem size in advance to determine the number of passes. Another disadvantage: this algorithm has O(N^2) complexity, but a simple external sort is only O(N*log(N)). – Evgeny Kluev Dec 31 '11 at 11:12
  • @Mooing, thanks for your coding. visits[] should record the sequences that have been visited so that the count can be updated. And, unordered_map search time is O(n). – user1002288 Dec 31 '11 at 16:00
  • @EvgenyKluev: it seemed reasonable to me for the OP to know the number of pages on the system, and it's safe to estimate. However, though this is O(n^2) technically, it normally runs in linear O(n) time. @user1002288: unordered_map lookup is O(n) technically in the worst case, but normally constant O(1) time. If you want better guarantees, switch it to std::map. Also note that the sort algorithms will write to the hard drive; this won't. This with std::map is still faster than any sort. – Mooing Duck Dec 31 '11 at 17:34
2

If you have 1000 web pages then you have 1 billion possible 3-page sequences. If you have a simple array of 32-bit counters then you'd use 4GB of memory. There might be ways to prune this down by discarding data as you go, but if you want to guarantee to get the correct answer then this is always going to be your worst case - there's no avoiding it, and inventing ways to save memory in the average case will make the worst case even more memory hungry.
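
To make the arithmetic concrete, a flat counter array indexed by the three page IDs might look like this (a sketch only, assuming 1000 pages numbered 0-999):

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    const std::size_t P = 1000;                       // number of distinct pages
    std::vector<std::uint32_t> counts(P * P * P, 0);  // one 32-bit counter per sequence, ~4GB

    // linear index of the 3-page sequence (a, b, c)
    auto idx = [P](std::size_t a, std::size_t b, std::size_t c) {
        return (a * P + b) * P + c;
    };

    ++counts[idx(0, 1, 2)];                           // record one visit to sequence 0-1-2
    std::cout << counts[idx(0, 1, 2)] << '\n';
}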

On top of that, you have to track the users. For each user you need to store the last two pages they visited. Assuming the users are referred to by name in the logs, you'd need to store the users' names in a hash table, plus the two page numbers, so let's say 24 bytes per user on average (probably conservative - I'm assuming short user names). With 1000 users that would be 24KB; with 1000000 users 24MB.

Clearly the sequence counters dominate the memory problem.

If you do only have 1000 pages then 4GB of memory is not unreasonable on a modern 64-bit machine, especially with a good amount of disk-backed virtual memory. If you don't have enough swap space, you could just create an mmapped file (on Linux - I presume Windows has something similar), and rely on the OS to always have the most-used entries cached in memory.
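
A minimal Linux-only sketch of that mmapped-file idea (the file name is a placeholder and error handling is reduced to early returns):

#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main() {
    const std::size_t P = 1000;
    const std::size_t bytes = P * P * P * sizeof(std::uint32_t);   // ~4GB of counters

    int fd = open("counts.bin", O_RDWR | O_CREAT, 0644);
    if (fd < 0) return 1;
    if (ftruncate(fd, bytes) != 0) return 1;          // size the backing file

    auto* counts = static_cast<std::uint32_t*>(
        mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
    if (counts == MAP_FAILED) return 1;

    // index a sequence (a, b, c) exactly as with an in-memory array;
    // the OS keeps the hot pages of the file cached and evicts the cold ones
    auto idx = [P](std::size_t a, std::size_t b, std::size_t c) {
        return (a * P + b) * P + c;
    };
    ++counts[idx(0, 1, 2)];

    munmap(counts, bytes);
    close(fd);
}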

So, basically, the maths dictates that if you have a large number of pages to track, and you want to be able to cope with the worst case, then you're going to have to accept that you'll have to use disk files.

I think that a limited-capacity hash table is probably the right answer. You could probably optimize it for a specific machine by sizing it according to the memory available. Having got that, you need to handle the case where the table reaches capacity. It may not need to be terribly efficient if it's likely you rarely get there. Here are some ideas:

  • Evict the least commonly used sequences to a file, keeping the most common in memory. This would need two passes over the table: one to determine what level is below average, and another to do the eviction. Somehow you'd need to know where you put each entry whenever you get a hash miss, which might prove tricky. (A minimal sketch of this idea follows the list.)

  • Just dump the whole table to file, and build a new one from scratch. Repeat. Finally, recombine the matching entries from all the tables. The last part might also prove tricky.

  • Use an mmapped file to extend the table. Ensure that the file is used primarily for the least-commonly used sequences, as in my first suggestion. Basically, you'd simply use it as virtual memory - the file would be meaningless later, after the addresses have been forgotten, but you wouldn't need to keep it that long. I'm assuming there isn't enough regular virtual memory here, and/or you don't want to use it. Obviously, this is for 64-bit systems only.
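
Here is a minimal sketch of the first idea above (illustrative only: the overflow-file format and the "evict everything below the mean count" rule are assumptions, and recombining the overflow entries afterwards is left out):

#include <array>
#include <fstream>
#include <map>

using Seq = std::array<int, 3>;

// When the in-memory table reaches capacity: append the below-average entries
// to an overflow file and keep only the more popular sequences in memory.
void evict_uncommon(std::map<Seq, long>& table, std::ofstream& overflow) {
    long total = 0;
    for (const auto& entry : table) total += entry.second;    // pass 1: mean count
    const double mean = double(total) / table.size();

    for (auto it = table.begin(); it != table.end(); ) {      // pass 2: evict
        if (it->second < mean) {
            overflow << it->first[0] << ' ' << it->first[1] << ' '
                     << it->first[2] << ' ' << it->second << '\n';
            it = table.erase(it);
        } else {
            ++it;
        }
    }
}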

ams
1

I think you only have to store the most recently seen triple for each user ID, right? So you have two hash tables. The first, keyed on user ID and storing the most recently seen triple as its value, has size equal to the number of user IDs.

EDIT: this assumes the file is already sorted by timestamp.

The second hash table has a key of userid:page-triple, and its value is the count of times that triple has been seen.

I know you said C++, but here's some awk which does this in a single pass (it should be pretty straightforward to convert to C++):

#  $1 is userid, $2 is pageid

{
    old = ids[$1];          # map with id, most-recently-seen triple
    split(old,oldarr,"-"); 
    oldarr[1]=oldarr[2]; 
    oldarr[2]=oldarr[3]; 
    oldarr[3] = $2; 
    ids[$1]=oldarr[1]"-"oldarr[2]"-"oldarr[3]; # save new most-recently-seen
    tripleid = $1":"ids[$1];  # build a triple-id of userid:triple
    if (oldarr[1] != "") { # don't accumulate incomplete triples
        triples[tripleid]++; }   # count this triple-id
}
END {
    maxcount = 0;
    for (tid in triples) {
        print tid" "triples[tid];
        if (triples[tid] > maxcount) {
            maxcount = triples[tid];
            maxtid = tid;
        }
    }
    print "MAX is->" maxtid " seen " maxcount " times";
}
Rob H
  • this does nothing to address the memory usage issues, and exacerbates it by keeping separate sequence counts for each user (which was not specified). Nice use of awk though. – ams Jan 04 '12 at 14:02
1

If you are using Unix, the sort command can cope with arbitrarily large files. So you could do something like this:

  1. sort -k1,1 -s logfile > sorted (note that this is a stable sort (-s) on the first column)
  2. Perform some custom processing of sorted that outputs each triplet as a new line to another file, say triplets, using either C++ or a shell script (a minimal C++ sketch follows this list). So in the example given you get a file with three lines: 1-2-3, 2-3-4, 2-3-4. This processing is quick because Step 1 means that you are only dealing with one user's visits at a time, so you can work through the sorted file a line at a time.
  3. sort triplets | uniq -c | sort -r -n | head -1 should give the most common triplet and its count (it sorts the triplets, counts the occurrences of each, sorts them in descending order of count and takes the top one).
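
For step 2, a minimal C++ sketch (illustrative only) could read the user-sorted file and write one triplet per line to standard output, ready to redirect to the triplets file or pipe straight into step 3:

#include <fstream>
#include <iostream>
#include <string>

int main() {
    std::ifstream in("sorted");              // output of step 1
    std::string user, page, prevuser;
    std::string p1, p2;                      // the two previous pages for this user
    while (in >> user >> page) {
        if (user != prevuser) {              // new user: restart the window
            prevuser = user;
            p1.clear();
            p2.clear();
        }
        if (!p1.empty())                     // we have a full 3-page window
            std::cout << p1 << '-' << p2 << '-' << page << '\n';
        p1 = p2;
        p2 = page;
    }
}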

This approach might not have optimal performance, but it shouldn't run out of memory.

Matthew Strawbridge
  • That won't work for sequences like 4-3-5 or anything not in numerical order. – ams Jan 04 '12 at 13:57
  • @ams Are you sure? I don't think anything in these steps relies on the page visits starting out in numerical order. Note that the sort in Step 1 is stable, so it sorts by user but leaves the relative order of that user's pages unchanged. – Matthew Strawbridge Jan 04 '12 at 15:22
  • ah, ok, I missed that detail. As long as it's true, this seems fine. Assuming one doesn't want to do this too often, and one doesn't care how long it takes, then this approach seems fine. The OP seems to think performance is important though. – ams Jan 05 '12 at 13:02