1

I am new to data mining, so I apologize if the answer is obvious. I know there are quite a few data mining algorithms out there, such as sequential pattern mining or the Apriori algorithm. Would the following code, which I have implemented, be considered data mining (specifically sequential pattern mining) if I run it against a database of approximately 20,000 students? Or do I have to use one of the existing data mining algorithms?

// Flag a student when an earlier, non-adjacent row has the same major.
// Limiting the flag to MATH and SCIENCE is an assumed reading of the original comparisons.
String x = "SELECT STUDENTS.ROW, STUDENTS.MAJOR, STUDENTS.NAME, " +
    "CASE WHEN STUDENTS.MAJOR IN ('MATH', 'SCIENCE') " +
    "AND EXISTS (SELECT 'x' FROM STUDENTS prior_row " +
    "WHERE prior_row.NAME IS NOT NULL " +
    "AND STUDENTS.MAJOR = prior_row.MAJOR " +
    "AND STUDENTS.ROW > prior_row.ROW + 1) " +
    "THEN 1 ELSE NULL END AS Flagged_Values " +
    "FROM STUDENTS";

st.executeQuery(x);

// Same flag as above, for a different set of majors.
// How the STUDENTINFO lookup combines with the other majors is an assumption.
String y = "SELECT STUDENTS.ROW, STUDENTS.MAJOR, STUDENTS.NAME, " +
    "CASE WHEN (STUDENTS.MAJOR IN ('SCIENCE', 'Engineering') " +
    "OR STUDENTS.MAJOR IN (SELECT THE_OUTCOME FROM STUDENTINFO WHERE MAJOR = 'Math')) " +
    "AND EXISTS (SELECT 'y' FROM STUDENTS previous " +
    "WHERE previous.NAME IS NOT NULL " +
    "AND STUDENTS.MAJOR = previous.MAJOR " +
    "AND STUDENTS.ROW > previous.ROW + 1) " +
    "THEN 1 ELSE NULL END AS Flag " +
    "FROM STUDENTS";

st.executeQuery(y);
user2554121

3 Answers

2

What you have written is a pair of SQL SELECT statements: projection, selection and aggregation.

Have you read the Wikipedia article on data mining?

The actual data mining task is the automatic or semi-automatic analysis of large quantities of data to extract previously unknown interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection) and dependencies (association rule mining). This usually involves using database techniques such as spatial indices. These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results by a decision support system. Neither the data collection, data preparation, nor result interpretation and reporting are part of the data mining step, but do belong to the overall KDD process as additional steps.

The term "data mining" is often misused for any kind of data collection or selection, but one should call these tasks "data collection" and "database query" instead of pulling up random buzzwords. Data mining is the intersection of statistics, AI, machine learning, and databases. If these components are missing (and except for databases, I don't see them in your query), it should be called e.g. "databases", "machine learning" or "statistics".
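For contrast, here is a minimal sketch of what a very small mining step could look like: counting which pairs of items occur together often enough, using the pruning idea behind Apriori (a pair can only be frequent if both of its items are frequent). Everything in it is an assumption made for illustration: the class name `FrequentPairs`, the method `frequentPairs`, the `minSupport` threshold and the sample course data are hypothetical and do not come from the question's schema. It also ignores order, so it is association-style mining rather than sequential pattern mining.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration, not a full Apriori implementation.
class FrequentPairs {

    // Returns every pair of items that appears together in at least
    // minSupport transactions, mapped to its support count.
    static Map<List<String>, Integer> frequentPairs(List<Set<String>> transactions, int minSupport) {
        // Pass 1: count how many transactions contain each single item.
        Map<String, Integer> itemCounts = new HashMap<>();
        for (Set<String> transaction : transactions) {
            for (String item : transaction) {
                itemCounts.merge(item, 1, Integer::sum);
            }
        }

        // Apriori pruning: only frequent single items can form frequent pairs.
        List<String> frequentItems = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : itemCounts.entrySet()) {
            if (entry.getValue() >= minSupport) {
                frequentItems.add(entry.getKey());
            }
        }
        Collections.sort(frequentItems);

        // Pass 2: count the surviving candidate pairs and keep the frequent ones.
        Map<List<String>, Integer> result = new LinkedHashMap<>();
        for (int i = 0; i < frequentItems.size(); i++) {
            for (int j = i + 1; j < frequentItems.size(); j++) {
                String a = frequentItems.get(i);
                String b = frequentItems.get(j);
                int support = 0;
                for (Set<String> transaction : transactions) {
                    if (transaction.contains(a) && transaction.contains(b)) {
                        support++;
                    }
                }
                if (support >= minSupport) {
                    result.put(Arrays.asList(a, b), support);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical data: the set of courses taken by each of four students.
        List<Set<String>> transactions = Arrays.asList(
            new HashSet<String>(Arrays.asList("Calculus", "Physics", "Programming")),
            new HashSet<String>(Arrays.asList("Calculus", "Physics")),
            new HashSet<String>(Arrays.asList("Calculus", "Programming")),
            new HashSet<String>(Arrays.asList("Biology", "Chemistry")));

        System.out.println(frequentPairs(transactions, 2));
        // prints {[Calculus, Physics]=2, [Calculus, Programming]=2}
    }
}
```

The difference from the SELECT statements is that nobody tells the program which pairs to look for; the frequent pairs come out of the data, which is the "previously unknown patterns" part of the definition quoted above.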

Has QUIT--Anony-Mousse
  • I don't think it's possible to separate statistics, AI, and machine learning. In fact I'm pretty sure that any time you have either AI or machine learning you must have all three of those. That said, you can have statistics without AI/ML. Additionally, I don't think I've ever seen any real AI/ML go down without a database of some sort. – Slater Victoroff Jul 23 '13 at 17:48
  • I've seen AI/ML go down with R matrices all the time, and no database acceleration for the actual task. Then it's not data mining, but pure AI/ML... – Has QUIT--Anony-Mousse Jul 23 '13 at 17:54
  • Ah, it's been a long time since I dealt with data sets small enough that it made sense to do that. Either way I've been enjoying this chat. You on Kaggle? – Slater Victoroff Jul 23 '13 at 18:01
  • No, I don't do a lot of ML, and everything on Kaggle is in fact a machine learning competition (because that is easy to evaluate automatically). Either classification or numerical prediction, and all supervised, with no requirements on database support -> pure machine learning. – Has QUIT--Anony-Mousse Jul 23 '13 at 18:05
1

Keep in mind that this is inherently opinion-based, but in general, data mining refers to the process of taking data that is in a relatively unusable format and converting it into a format that is more usable.

For instance, if I have a huge .txt dump of unstructured text and I then extract relevant portions (according to some formal definition of relevant) and place it into a .bson store or something similar, that would be data mining, regardless of exactly how I do the extraction.

However, since your data is already in a SQL database, I wouldn't consider this data mining. I would consider it SQL development, though again, this is largely opinion-based. A SQL database is already a highly useful way of storing data, so accessing that data isn't introducing a level of functionality that wasn't already present.

tl;dr: I wouldn't say this counts as data mining, but it's a gray area.

Slater Victoroff
  • I have to disagree. Taking data in an unreadable format and making it usable is called **preprocessing**. Data mining is the application of advanced statistical and AI/ML methods to obtain new knowledge. I don't see advanced statistics in his question. – Has QUIT--Anony-Mousse Jul 23 '13 at 12:05
  • @Anony-Mousse Like I said, data mining is still pretty new, so basic things like this are still a matter of opinion. – Slater Victoroff Jul 23 '13 at 13:31
  • Well, the need to convert data is much older than data mining, so why should it be called data mining, then? – Has QUIT--Anony-Mousse Jul 23 '13 at 14:11
  • @Anony-Mousse well, it's not just any data conversion, right? It kind of requires the addition of functionality to that data that wasn't present before the mining took place. I feel like taking the Census is in a way data mining, but I suppose it would be more accurate to say it's a very new term, rather than a new field. To be clear, converting a JPG to a PNG is not data mining, but converting a series of landscape pictures to a series of tree pictures would be. According to my definition at least. – Slater Victoroff Jul 23 '13 at 14:32
  • I'd call that "computer vision", not "data mining"... always choose the most appropriate term. "data mining" is most appropriate when it all comes together: databases and scale and AI/ML/CV/... – Has QUIT--Anony-Mousse Jul 23 '13 at 17:41
  • @Anony-Mousse Well the distinction between AI/ML/CV isn't really a clear one either. To add, data science could easily be added to that group and it still wouldn't be clear that there's a strong difference between any of them with the distinction that CV is definitely a subset. I would agree that when all of those are used `data mining` is unquestionably a proper use of the term, but I still think it's really ambiguous. I don't think that CV and data mining are exclusionary terms. For instance, if I were storing the tree images in a db at scale would it still just be CV? – Slater Victoroff Jul 23 '13 at 17:46
  • No, they are not mutually exclusive. But it's not data mining if it doesn't involve other domains besides CV. Data mining is really bringing the stuff together. AI and ML and CV have existed long before data mining. So use these names whenever possible, instead of buzzwording. – Has QUIT--Anony-Mousse Jul 23 '13 at 17:53
  • Like I said, it's a matter of opinion. Clearly we have different opinions. At the end of the day it's all just formalisms anyway. I feel the buzzwording line was a bit over the top though. – Slater Victoroff Jul 23 '13 at 17:58
  • It would help a lot if people would call things "ML" if they are about *learning* and "databases" if it's about querying databases. Then "data mining" would be much less a "gray area", if people were more precise on what they do (and willing to use the precise term, instead of the fancy "data mining" buzzword). – Has QUIT--Anony-Mousse Jul 23 '13 at 18:01
  • Thank you both for your clarification on data mining. Two different opinions, nonetheless, both were great answers which clarified my interpretation of data mining. By that discussion, I can see why it would be considered a gray area, but I also see that it's more than the SQL code I posted. I appreciate both your answers. – user2554121 Jul 23 '13 at 20:59
0

In the field of data mining, performing SQL queries would not be considered data mining.

Phil