
I have an 1830*6800 matrix like the one below:

[image of the matrix, not included]

The 1830 rows are startup company IDs, and the 6800 columns are different investors. I want to find what the companies that successfully raised enough money have in common, and likewise for those that were not so lucky.

I am thinking of using k-means clustering and spectral clustering with the number of clusters set to 2, to get 2 different groups (i.e. success & fail). But k-means gives me a label of 0 for almost every row, which means nearly all rows end up in the same cluster.

Can anyone give me some thoughts on how to choose a more suitable algorithm for this situation? It doesn't have to be clustering.

Tom Dawn
  • What are the values in the cells? – flyingmeatball May 04 '16 at 14:24
  • The values are either 1 or 0: 1 means the company successfully got money from that investor, 0 means it did not. – Tom Dawn May 04 '16 at 14:26
  • What is the sparsity of your data? If you sum your total dataframe, what do you get? – flyingmeatball May 04 '16 at 14:29
  • Actually, the original matrix was 1830 * 140000000, and I ran random projection to reduce it to 1830*6800. The original matrix was really sparse, since each row has only 10-1000 investors out of 140000000 investors in total. – Tom Dawn May 04 '16 at 14:33

2 Answers


Random projection is probably doing more harm than good here. Instead, prune the original data directly: remove, for example, all investors who invested in only a single company, then all companies with no investors left, and repeat until nothing changes.
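A minimal sketch of that pruning loop, assuming the data is held as a mapping from company to the set of investors that funded it (a natural form for such a sparse matrix). The function name `prune`, the thresholds `min_companies` and `min_investors`, and the toy data are all made up for illustration.

```python
def prune(company_investors, min_companies=2, min_investors=1):
    """Drop investors backing fewer than `min_companies` companies, then drop
    companies left with fewer than `min_investors` investors; repeat until stable."""
    data = {c: set(inv) for c, inv in company_investors.items()}
    while True:
        # Count how many companies each investor backs in the current data.
        counts = {}
        for investors in data.values():
            for inv in investors:
                counts[inv] = counts.get(inv, 0) + 1
        rare = {inv for inv, n in counts.items() if n < min_companies}
        # Remove rare investors, then companies that no longer have enough investors.
        pruned = {c: inv - rare for c, inv in data.items()}
        pruned = {c: inv for c, inv in pruned.items() if len(inv) >= min_investors}
        if pruned == data:  # fixed point reached: nothing left to remove
            return pruned
        data = pruned


# Hypothetical toy data: company -> set of investors.
toy = {"A": {"i1", "i2"}, "B": {"i2", "i3"}, "C": {"i4"}, "D": {"i2", "i3"}}
kept = prune(toy)
strict = prune(toy, min_investors=2)
```

With the default thresholds, investors `i1` and `i4` (one company each) are removed and company `C` disappears with them. Raising `min_investors` is where the "repeat" step starts to matter, because dropping a company can push another investor below the threshold.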

But all in all, I'd say you have a hopeless task here.

Clustering won't help you. There is no way you will get success or failure clusters. You are much more likely to get east coast or west coast clusters; or different fields. Clustering is the wrong tool if you have an objective such as success/failure.

Furthermore, your data is full of anomalies, and k-means cannot handle them well. That is probably why almost everything is in the same cluster.

The best you can try is frequent itemset mining, which (depending on how you apply it) will identify groups of investors that invest in the same companies, and groups of companies that tend to have the same investors.
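To illustrate what frequent itemsets look like on this kind of data, here is a small level-wise (Apriori-style) sketch in plain Python. It treats each company as a "transaction" whose items are its investors. The function name, the `min_support` and `max_size` parameters, and the toy portfolios are assumptions for the example; on data of the real size a dedicated library implementation would be the more sensible choice.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support=2, max_size=3):
    """Return {itemset: support count} for itemsets in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: n for s, n in counts.items() if n >= min_support}
    result = dict(current)
    size = 2
    while current and size <= max_size:
        # Candidates: unions of two frequent sets from the previous level.
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == size}
        # Apriori pruning: every (size - 1)-subset must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, size - 1))
        }
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {c: n for c, n in counts.items() if n >= min_support}
        result.update(current)
        size += 1
    return result


# Hypothetical toy data: one investor set per company.
portfolios = [
    {"i1", "i2", "i3"},
    {"i1", "i2"},
    {"i2", "i3"},
    {"i1", "i2", "i3"},
    {"i4"},
]
found = frequent_itemsets(portfolios, min_support=3)
```

Here the result contains groups of investors that co-invest in at least three companies. Transposing the data (one set of companies per investor) gives the other view mentioned above: groups of companies that tend to share investors.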

Has QUIT--Anony-Mousse

I think you're using the data incorrectly. If there are 140,000,000 investors, you have an extremely sparse matrix. Does every column have data? Remove any column that does not. You say your data is:

"1's or 0's. 1's is for successfully getting money from one of those investors, 0's for failure."

The vast majority of your cells should be null then, because I can't imagine a startup has attempted to get money from 6,800 investors. Make sure your data has 0's only where a company actively petitioned an investor for funding and was turned down.
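As a rough sketch of this three-state encoding, the outcomes can be stored only for (company, investor) pairs where a request actually happened, so that a missing pair means "never asked". The records below are invented for illustration, and the asker's data reportedly contains only 0's and 1's, so the "asked but refused" entries are an assumption about what richer data could look like.

```python
# Hypothetical records: (company, investor, outcome).
# outcome 1 = funded, 0 = asked but refused; absent pairs = never asked (null).
records = [("A", "i1", 1), ("A", "i2", 0), ("B", "i2", 1), ("C", "i3", 0)]
outcomes = {(c, i): o for c, i, o in records}


def lookup(company, investor):
    """Return 1, 0, or None (None meaning the company never asked this investor)."""
    return outcomes.get((company, investor))


companies = sorted({c for c, _, _ in records})
investors = sorted({i for _, i, _ in records})

# Fraction of the full company x investor grid that carries any information.
density = len(outcomes) / (len(companies) * len(investors))

# Investors (columns) that never funded anyone carry no positive signal.
funded_by = {i: sum(o for (_, inv), o in outcomes.items() if inv == i) for i in investors}
empty_columns = [i for i, n in funded_by.items() if n == 0]
```

In this toy example `lookup("A", "i3")` is `None` rather than 0, and `i3` is flagged as a column that could be dropped.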

Also, how are you defining success? Is it a number of investors? An amount? I think as structured, your data is not going to give you the answers you are looking for.

flyingmeatball
  • You mean I should have only 0's and nulls? With 0's for successful funding and nulls for failure? I don't think it will make any difference, as 1's & 0's are essentially the same as 0's & nulls. – Tom Dawn May 04 '16 at 14:54
  • And the definition of success is not something I know. Maybe it's the number of investors, I don't know. This is real-world data, and you can assume that no standard has been set. That's why it's called unsupervised learning, and it's not as easy as supervised learning. – Tom Dawn May 04 '16 at 14:56
  • No, you should have 1 for success, 0 for failure, and null if they didn't try. Having 0 as the default value says something very different than having it as null. What you're really clustering here are investor patterns. Who invested in the same companies. That, to me, is a different question than what you asked above. – flyingmeatball May 04 '16 at 15:01
  • Okay, but the original data has only 0's and 1's. Let's assume we have null's and 0's and 1's, what algorithm do you think I should use in this case? – Tom Dawn May 04 '16 at 16:48
  • I think you should be using k-means, and if k-means isn't working it tells you that either your data is bad, you're asking the wrong question, or that there is no good clustering. – flyingmeatball May 04 '16 at 17:31