
I'm calculating cosine similarity using NLTK and exporting the cosine similarity values to SQL Server, which I would like to use for other reporting purposes.

I have about 4773 columns with about 2k rows, and SQL Server does not support this number of columns. What would be a better alternative? Is there another open-source DB that supports this scale of data?

I have 2 data sets, which I'm calling the train set (2k documents) and the test set (4773 documents). During the process, every test data record becomes a column, which is about 4773 columns, and this is not supported by SQL Server.

My main goal is to find the most similar document from the test data for each and every document in my train data.

Any advice would be helpful. Thanks!

Here is the code that I use for calculating cosine similarities:

 import pandas as pd
 from sklearn.metrics.pairwise import cosine_similarity

 # similarity matrix: rows are train documents, columns are test documents
 df = pd.DataFrame(cosine_similarity(trainVectorizerArray, testVectorizerArray))

Pandas to SQL Server:

 import sqlalchemy
 import pyodbc  # the mssql+pyodbc dialect below uses pyodbc, not pypyodbc

 engine = sqlalchemy.create_engine("mssql+pyodbc://<user>:<password>@<DSN>")
 # write the DataFrame to a table in the SQL database
 df.to_sql("Cosine", engine)

Sample output (rows are train documents, columns are test documents):

          0         1    2         3         4         5
  0  0.428519  0.000000  0.0  0.541096  0.250099  0.345604
  1  0.056650  0.000000  0.0  0.000000  0.000000  0.000000
  2  0.000000  0.000000  0.0  0.000000  0.000000  0.000000
  3  0.849066  0.559117  0.0  0.374447  0.424247  0.586254
  4  0.317644  0.000000  0.0  0.271171  0.586686  0.424560
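
Given this matrix, the nearest test document for each train document can be read off directly in pandas, without ever creating the 4773-column table. Here is a minimal sketch, using the df and engine defined above; the NearestDocument table name and the column names are my own placeholders:

 # for each train document (row), pick the test document (column)
 # with the highest cosine similarity, and keep the value itself
 nearest = pd.DataFrame({
     "TestDocId": df.idxmax(axis=1),
     "Similarity": df.max(axis=1),
 })
 nearest.index.name = "TrainDocId"

 # 2k rows x 2 columns fits comfortably in SQL Server
 nearest.to_sql("NearestDocument", engine)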
  • What about a different schema: TrainSetId, DataSetId, Value? This gives 3 columns and 9.5 million rows. Assuming all of those documents are the same data type... – SMM Dec 09 '16 at 21:05 (see the sketch after these comments)
  • OK! I see what you are saying, and yes, that might be a possibility, but I'm not sure how I would be able to use that schema at a later point. I might need some assistance in the process; I have updated my question with my code. – RData Dec 09 '16 at 21:29
  • How were you going to compare before? If you were going to compare all of the column-2500 documents and get the closest by TrainSetId, now you would compare all of the DataSetId = 2500 documents and get the closest by TrainSetId. I don't have any idea what these documents look like; I am just guessing at your requirements here :-) – SMM Dec 09 '16 at 21:34
  • Can you massage the data into a different layout before export? If not, you might need to export to .CSV and then use something like SSIS to import the data into SQL. – SMM Dec 09 '16 at 21:36
  • I was able to export to CSV, but again, using SSIS I would not be able to do much of the data massaging. The way I plan to compare is, as you can see in my output, I would have cosine values which would tell me how similar each document is with respect to my train set. – RData Dec 09 '16 at 21:44
  • It seems like something SSIS should be able to do, but I have very little experience with it. If you can't get the data in the right layout in NLTK, maybe you can write a little utility in C# or something to pull the CSV, reorder the columns as rows, and then either export to CSV for SQL import or import to SQL directly. – SMM Dec 09 '16 at 21:51
  • OK, good thought, but I was hoping there would be an alternative way, either for storing the data or for actually performing the entire operation of text similarity in a different way. – RData Dec 09 '16 at 21:58
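
Following up on SMM's schema suggestion in the comments, here is a minimal sketch of reshaping the wide matrix into the long TrainSetId/DataSetId/Value layout before export; the table and column names are placeholders, not anything pandas or SQL Server prescribes:

 # melt the 2k x 4773 matrix into one row per (train, test) pair
 long_df = (
     df.stack()                                   # (row, col) -> value
       .rename_axis(["TrainSetId", "DataSetId"])  # name the two index levels
       .reset_index(name="Value")                 # back to a 3-column frame
 )

 # about 2000 * 4773 = 9.5 million rows, but only 3 columns, which
 # SQL Server handles fine; chunksize keeps each INSERT batch small
 long_df.to_sql("Cosine", engine, if_exists="replace", index=False,
                chunksize=1000)

With this layout, the nearest document per train row becomes a plain GROUP BY / MAX query in SQL, or the idxmax sketch shown earlier in pandas.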

0 Answers