I'm calculating cosine similarity with scikit-learn and exporting the similarity values to SQL Server, which I'd like to use for other reporting purposes.
I have about 4,773 columns and roughly 2,000 rows, and SQL Server does not support that many columns (the limit is 1,024 for a standard table). What would be a better alternative? Is there another open-source DB that supports this scale of data?
I have two data sets, which I call the train set (2,000 documents) and the test set (4,773 documents). During the process, every test record becomes a column, giving about 4,773 columns, which SQL Server does not support.
My main goal is to find, for each document in my train data, the most similar document in the test data.
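Since only the single nearest test document per train document is ultimately needed, one option is to store just a row-wise argmax instead of the full matrix. A minimal sketch with a toy similarity matrix (the values are illustrative, not real output):

```python
import numpy as np

# Hypothetical toy similarity matrix: rows = train docs, cols = test docs
sim = np.array([[0.1, 0.9, 0.3],
                [0.7, 0.2, 0.4]])

# Index of the most similar test document for each train document
nearest = sim.argmax(axis=1)
print(nearest)  # [1 0]
```

Stored this way, the table needs only two columns (train id, nearest test id), regardless of how many test documents there are.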
Any advice would be helpful - thanks.
Here is the code I use to calculate the cosine similarities:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# One row per train document, one column per test document
df = pd.DataFrame(cosine_similarity(trainVectorizerArray, testVectorizerArray))
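The shape of that matrix is what creates the column problem: cosine similarity of an (n_train, d) matrix against an (n_test, d) matrix yields (n_train, n_test), i.e. one column per test document. A self-contained numpy sketch with tiny toy vectors (standing in for the real vectorizer arrays):

```python
import numpy as np

# Toy stand-ins for trainVectorizerArray (2 docs) and testVectorizerArray (3 docs)
train = np.array([[1.0, 0.0], [0.5, 0.5]])
test = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

def l2_normalize(m):
    # Divide each row by its Euclidean norm
    return m / np.linalg.norm(m, axis=1, keepdims=True)

# Cosine similarity = dot product of L2-normalised rows
sim = l2_normalize(train) @ l2_normalize(test).T
print(sim.shape)  # (2, 3): one column per test document
```

With 4,773 test documents, the result is a 2,000 x 4,773 frame, hence the 4,773 columns.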
Pandas to SQL Server:
import sqlalchemy
import pyodbc  # the mssql+pyodbc dialect expects pyodbc, not pypyodbc

engine = sqlalchemy.create_engine("mssql+pyodbc://<user>:<password>@<DSN>")

# write the DataFrame to a table in the SQL database
df.to_sql("Cosine", engine)
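An alternative that sidesteps the column limit entirely is reshaping the wide matrix to long format before writing it, so each similarity value becomes a row of (train_id, test_id, similarity). A sketch with a toy 2x3 frame standing in for the real 2,000 x 4,773 one:

```python
import pandas as pd

# Toy similarity frame: rows = train docs, columns = test docs
df = pd.DataFrame([[0.1, 0.9, 0.3],
                   [0.7, 0.2, 0.4]])

# Reshape wide -> long: one row per (train_doc, test_doc, similarity)
long_df = (df.stack()
             .reset_index()
             .rename(columns={"level_0": "train_id",
                              "level_1": "test_id",
                              0: "similarity"}))
print(long_df.shape)  # (6, 3): three columns, regardless of test-set size
# long_df.to_sql("Cosine", engine)  # same call as above, now only 3 columns
```

The full matrix would produce about 9.5 million rows (2,000 x 4,773), which is well within what SQL Server handles in a three-column table.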
Sample output:
0 1 2 3 4 5
0 0.428519 0.000000 0.0 0.541096 0.250099 0.345604
1 0.056650 0.000000 0.0 0.000000 0.000000 0.000000
2 0.000000 0.000000 0.0 0.000000 0.000000 0.000000
3 0.849066 0.559117 0.0 0.374447 0.424247 0.586254
4 0.317644 0.000000 0.0 0.271171 0.586686 0.424560