I have two data frames and ran the following:
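# install.packages(c("tm", "textstem"))  # run once if not installed
library(tm)
library(textstem)

v <- Corpus(VectorSource(as.vector(bothsources[, 1])))
inspect(head(v, 3))

# Lowercase first: stopwords("english") is lowercase and removeWords is case-sensitive
v <- tm_map(v, content_transformer(tolower))
v <- tm_map(v, removeWords, stopwords("english"))
inspect(head(v, 3))

v <- tm_map(v, removePunctuation)
v <- tm_map(v, removeNumbers)
v <- tm_map(v, stripWhitespace)  # last, to collapse gaps left by the removals above
# Wrap in content_transformer so the result stays a valid corpus
v <- tm_map(v, content_transformer(lemmatize_strings))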
"bothsources" is a combined list of categorical variables that has been turned into a corpus.
The data is already cleaned; what remains is clustering it so that entries belonging to the same category appear under each other. The difficulty is the length of the entries, and the fact that different entries can imply the same meaning while being written in a different tense, structure, or wording. I need a way to cluster such entries together.
My first attempt used Euclidean distance, which did not give the desired outcome.
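One possible reason, sketched here with made-up count vectors (the term names and counts are illustrative, not taken from the real corpus): Euclidean distance on raw term counts grows with the length of an entry, so two entries with the same mix of words but different lengths look far apart, while cosine distance only compares direction.

```r
# Hypothetical term-count vectors: b has the same word mix as a, but twice as long
a <- c(farming = 1, crop = 1, services = 0)
b <- 2 * a

# Euclidean distance is affected by the difference in length
euclid_dist <- sqrt(sum((a - b)^2))

# Cosine distance compares direction only, so it is (numerically) close to 0 here
cosine_dist <- 1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

print(c(euclidean = euclid_dist, cosine = cosine_dist))
```

This only illustrates the length effect; it does not address entries that use different words for the same meaning.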
My second attempt used cosine distance, which gave better results, but only about half of the data was clustered properly.
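For reference, a minimal base-R sketch of a cosine-distance clustering on a few entries from the data sample. It is not necessarily what was run on the real corpus: the tokenization, the raw-count weighting, the average linkage, and k = 2 are all assumptions chosen for this toy example.

```r
# A few entries copied from the data sample
labels <- c(
  "Agriculture Crop Farming Cotton Farming",
  "Agriculture Crop Farming Sugar Farming",
  "Agriculture Animal Farming Poultry Farming",
  "Construction Construction and Design Civil Contractors Marine",
  "Construction Construction and Design Civil Contractors General",
  "Construction Construction and Design Specialty Contractors"
)

# Tokenize: lowercase, then split on anything that is not a letter
tokens <- strsplit(tolower(labels), "[^a-z]+")

# Document-term matrix of raw counts (rows = entries, columns = terms)
vocab <- sort(unique(unlist(tokens)))
dtm <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))
rownames(dtm) <- seq_along(labels)

# Cosine similarity between all pairs of rows, then convert to a distance
norms <- sqrt(rowSums(dtm^2))
cos_sim <- (dtm %*% t(dtm)) / (norms %o% norms)
cos_dist <- as.dist(1 - cos_sim)

# Hierarchical clustering on the cosine distances, cut into k groups
hc <- hclust(cos_dist, method = "average")
clusters <- cutree(hc, k = 2)
print(split(labels, clusters))
```

On the full data, k would have to be chosen differently (for example by inspecting plot(hc)), and frequent words such as "Services" may need down-weighting.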
I then added lemmatization with library(textstem), which reduces each word to its base form, e.g. engineering back to engineer, production back to produce.
Any help on how to make this more effective and get better results would be highly appreciated.
Note: the corpus "v" has 1130 entries. I am not sure whether that is too small for such techniques to work, as I read somewhere that they only work on huge data sets.
Data Sample:
    bothsources
1   Agriculture Agricultural Support Services Irrigation Services
2   Agriculture Agricultural Support Services Management / Consultation Services
3   Agriculture Agricultural Support Services Other Agricultural Support Services
4   Agriculture Agricultural Support Services Shipping Services Chicken, Cattle, Sheep, and Goat Shipping
5   Agriculture Agricultural Support Services Shipping Services Fish Shipping
6   Agriculture Animal Farming Cattle, Goat, and Sheep Farming
7   Agriculture Animal Farming Fishing and Fish Farming
8   Agriculture Animal Farming Poultry Farming
9   Agriculture Crop Farming Animal Feed
10  Agriculture Crop Farming Cotton Farming
11  Agriculture Crop Farming Fresh Fruit and Vegetable Farming
12  Agriculture Crop Farming Grains and Oilseeds Farming
13  Agriculture Crop Farming Other Crop Farming
14  Agriculture Crop Farming Sugar Farming
15  Agriculture Crop Farming Tobacco Farming
16  Agriculture Floriculture Production
17  Agriculture Investment Companies - Agriculture Conventional
18  Agriculture Investment Companies - Agriculture Islamic
19  Construction Construction and Design Architectural and Engineering Services Architectural Services
20  Construction Construction and Design Architectural and Engineering Services Engineering Consulting Services
21  Construction Construction and Design Civil Contractors General
22  Construction Construction and Design Civil Contractors Heavy Industries
23  Construction Construction and Design Civil Contractors Infrastructure
24  Construction Construction and Design Civil Contractors Marine
25  Construction Construction and Design Electro-Mechanical Contractors
26  Construction Construction and Design Specialty Contractors
27  Construction Construction and Design Offshore Contractors
28  Construction Investment Companies - Construction Conventional
29  Construction Investment Companies - Construction Islamic
30  Consumer Goods Baby Supplies Distributors