I have a table of around 100,000 rows. This table is in an Excel file, and here is a snapshot of it:
+------------+-----------+-----+-----+-----------------------------------------------------------+
| First Name | Last Name | Sex | Age | Address |
+------------+-----------+-----+-----+-----------------------------------------------------------+
| Parm | Jit | m | 23 | palm court scoeity, RD. golf course, delhi |
| Param | jit | m | 24 | palm cort society, road golf course, delhi |
| Pram | Jet | m | 28 | palm court socityt Road golf course, Delhi |
| Prm | jit | m | 31 | society palm court, Rod. Golf coure, delhi |
| Param | Jeet | m | 33 | palm court scoety, delhi |
| varun | nagraj | m | 36 | Thame Square, auckland-AZ-2014 |
| Janet | kumar | m | 40 | Thame Square, auckland-AZ-2014 |
| varun | kumar | m | 42 | Thame Square, auckland-AZ-2014 |
| Jatin | Kakkar | m | 45 | Noida, near shipra mall, sectr 57, Noida, U.P. |
| Jatin | Kakar | m | 56 | Noida, near shipra mall, sectr 57, Noida, Uttar pardesh |
| Jatin | Kakkr | m | 57 | Noida, Flat no- 23, near shipra mall, sectr 57, Noida, UP |
| Janet | Yellen | F | 23 | 11 CORONADO POINTELAGUNA NIGUELCA92677 |
| Janet | Yellen | F | 24 | 11 CORONADO POINTELAGUNA NIGUELCA |
| Janet | Yellen | F | 25 | 11 CORONADO POINTELAGUNA 92677-0000 |
| Jant | Yelen | F | 26 | 11 CORONADO POINTELAGUNA NIGUELCA0000 |
| Janet | Yellen | F | 26 | 11 CORONADO POINTELAGUNA NIGUELC |
| Abigail | Johnson | F | 24 | PRESERVE DRIVE NE, 11BELMONTMI4930 |
| andrew | symonds | m | 24 | Fame Stret, brisbane, hn 181 |
| Angel | Ahrendts | F | 26 | WYNGATE MANOR CTALEXANDRIAVA |
| Safra | Catz | F | 26 | 31155 ZOAR SCHOOL ROADLOCUST GROVEVA22508-0000 |
| Park | Geun-hye | F | 30 | CATHOLIC CHURCH RDBEACH LAKEPA |
| Sheryl | Sandberg | F | 24 | 80164 SULTANA AVEINDIOCA92201-0000 |
| Sheryl | Sandberg | F | 24 | SULTANA AVEINDIOC |
| Safra | Catz | F | 26 | OAR SCHOOL ROADLOCUST GROVEV |
| Park | Geun-hye | F | 30 | 308 CATHOLIC CHURCH RDBEACH LAKEPA18405-0000 |
| andrw | simnds | m | 24 | Fame Stret, 181 HOUSE NO |
| prashat | vats | m | 35 | Al thei, al nzar, dubai12 |
| prasant | vats | m | 37 | Al, al nazar, dubai23 |
| andrw | simonds | m | 34 | Fame brisbane, 181 H.N. |
| vats | prashant | m | 30 | Al thei, al nazar, dubai |
| vast | prshant | m | 30 | al nazar, dubai, street adamifullah |
| prashant | vats | m | 37 | Al thei, al nazar, dubai |
| ram | vats | m | 29 | Al thei, nazar, dubai |
| Kiss | hanes | m | 45 | Sydney, andrew str. 223 |
+------------+-----------+-----+-----+-----------------------------------------------------------+
I am trying to find similar rows in this data; for example, row 1 is quite similar to row 2. I have tried several clustering algorithms (namely BIRCH, DBSCAN, K-means, spectral clustering, and Markov clustering), but each of them runs for around half an hour on the 100,000 rows and then fails with a memory error in Python (I load all the data into memory in Python, and my machine has 16 GB of RAM).
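A minimal sketch of what "similar" could mean for two rows, assuming the first name, last name, and address are compared as lowercased strings. The helper names `normalize` and `row_similarity` are hypothetical, and character-level matching with the standard library's `difflib.SequenceMatcher` is only one possible measure, not something used in the attempts described above.

```python
from difflib import SequenceMatcher


def normalize(value):
    """Lowercase and collapse whitespace so casing and spacing differences do not count."""
    return " ".join(str(value).lower().split())


def row_similarity(row_a, row_b):
    """Mean character-level similarity (0.0 to 1.0) across the paired text fields.

    Assumption: every field gets equal weight; a real solution may weight
    the address differently from the names.
    """
    scores = [
        SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        for a, b in zip(row_a, row_b)
    ]
    return sum(scores) / len(scores)


# (first name, last name, address) taken from the snapshot above
row1 = ("Parm", "Jit", "palm court scoeity, RD. golf course, delhi")
row2 = ("Param", "jit", "palm cort society, road golf course, delhi")
last_row = ("Kiss", "hanes", "Sydney, andrew str. 223")

print(row_similarity(row1, row2))      # near-duplicate pair: high score
print(row_similarity(row1, last_row))  # unrelated pair: low score
```

Scoring every pair this way is quadratic (roughly 5 billion pairs for 100,000 rows), so the sketch only illustrates the measure; it is not a scalable approach.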
Should I use a better algorithm for this problem, or do I need to move my data to a platform like Spark and work on it there? If the former, can you suggest an algorithm that doesn't take too much time? Please do not treat this as a theoretical question; I am looking for an approach to solve a practical problem with big data.