
I have a dataset of the locations visited by users all over the world. The dataset looks like this:

1 55 66 22 88
2 11 33
........
........
99 88 22 66 99 55 33
100 33 44 88

The first column is the user ID, and the following columns are the locations visited by that user, in order. So user 1 visited locations 55, 66, 22 and 88, in that sequence. Each location is represented by a location ID for simplicity. My dataset has 100 users who visited 570 different locations between them, so the user-location trajectory data is stored in Matlab as a 100x570 matrix.

Question: I need to select a sample of 30 users from the 100, such that when the user-location matrix of these 30 users is given to my data mining algorithm, the mining takes less time and gives better results.

Both the quality of the results and the execution time of the data mining algorithm depend on the number of common locations in the user-location trajectory matrix.

That is, if a user-location matrix has 100 unique locations, my data mining algorithm will take less time to process it than a matrix with 300 different locations. The locations visited by each user are sequential, so the order of the location IDs must never be changed.
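
To make the quantity concrete, here is a minimal sketch of counting the distinct locations covered by a subset of users. It is written in Python only for illustration (in Matlab, `unique` on the selected rows should give the same count). The helper name `distinct_locations` is hypothetical, and the toy data is just the four rows shown above.

```python
# Toy data taken from the four rows shown in the question:
# user id -> ordered list of visited location ids.
trajectories = {
    1: [55, 66, 22, 88],
    2: [11, 33],
    99: [88, 22, 66, 99, 55, 33],
    100: [33, 44, 88],
}


def distinct_locations(trajectories, user_ids):
    """Return the set of distinct location ids visited by the given users.

    Hypothetical helper: the per-user sequences are only read,
    never reordered.
    """
    seen = set()
    for uid in user_ids:
        seen.update(trajectories[uid])
    return seen


# Users 1 and 99 share four locations, so together they cover only 6.
print(len(distinct_locations(trajectories, [1, 99])))
# Users 2 and 100 share one location and cover 4.
print(len(distinct_locations(trajectories, [2, 100])))
```

A sample whose users overlap heavily keeps this count low, which is the property I am trying to optimise.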

Is there a sampling technique that addresses this problem, or do I have to do something like clustering to group the users by the locations they visited?

I am working in Matlab, so Matlab-specific suggestions would be preferred.

Pramit
  • Any smaller matrix should be faster, but also, using less information should never give you "better" results (you can't learn as much if you don't have as much data to learn from). You might prefer to say: faster processing and "good enough" results. In that case, you should also describe what it is about the original input that determines how effective your mining algorithm is. Otherwise, it is nearly impossible to give any meaningful answer. – TravisJ Feb 03 '15 at 16:10
  • I have a variant of the PrefixSpan algorithm for finding sequential patterns in user trajectories. I will give the trajectories of two users to PrefixSpan one at a time, and then compare the resulting patterns to find any they have in common. Now, say I randomly select two users who between them checked in at 300 different locations, with no locations in common. Then Time(PrefixSpan on user 1's trajectory) + Time(PrefixSpan on user 2's trajectory) will be very high. On top of that, I will not get any common patterns, because there are no common locations. So I need to choose users for whom the processing time is low and the results are good enough. – Pramit Feb 03 '15 at 16:48
  • I'm not certain how you might implement a filter in Matlab, but you might start by collecting users who visited either location_1 or location_2, and then do the mining on just those users (varying the locations, or the number of locations). – TravisJ Feb 03 '15 at 17:55
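
The filtering idea in the comments could be taken one step further with a greedy selection. The following is only a sketch under stated assumptions, not a known-good sampling method: the function name `greedy_sample`, the `seed_user` parameter, and the scoring rule (new locations added minus locations shared with the users already chosen) are all hypothetical choices. It is in Python for illustration; the set operations map onto Matlab's `union`, `intersect` and `setdiff`.

```python
def greedy_sample(trajectories, k, seed_user):
    """Greedily pick up to k users whose locations overlap heavily.

    Assumed heuristic: starting from seed_user, repeatedly add the
    user with the lowest (new locations - shared locations) score.
    The per-user sequences themselves are never reordered.
    """
    location_sets = {uid: set(locs) for uid, locs in trajectories.items()}
    chosen = [seed_user]
    covered = set(location_sets[seed_user])
    while len(chosen) < min(k, len(location_sets)):
        remaining = [u for u in location_sets if u not in chosen]

        def score(u):
            new = len(location_sets[u] - covered)
            shared = len(location_sets[u] & covered)
            # Ties are broken by the lower user id for determinism.
            return (new - shared, u)

        best = min(remaining, key=score)
        chosen.append(best)
        covered |= location_sets[best]
    return chosen, covered


# Toy data: the four rows shown in the question.
trajectories = {
    1: [55, 66, 22, 88],
    2: [11, 33],
    99: [88, 22, 66, 99, 55, 33],
    100: [33, 44, 88],
}

chosen, covered = greedy_sample(trajectories, k=3, seed_user=1)
print(chosen)        # ids of the selected users, in pick order
print(len(covered))  # number of distinct locations in the sample
```

With the real data this would be called with `k=30`. The result depends on the seed user, so trying several seeds and keeping the sample with the fewest distinct locations may be worthwhile. A greedy pick like this is not guaranteed to find the best possible sample.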
