Questions tagged [text-mining]

Text mining is the process of deriving high-quality information from unstructured (textual) information. Possible applications of text mining include:

  • Comments in survey responses
  • Customer messages, emails, complaints, etc.
  • Investigating competitors by crawling their websites


2607 questions
0 votes, 1 answer

Calculate frequency of word pattern in documents

I am trying to calculate the frequency of a word pattern in documents, e.g. how many times the word pattern "Natural Language Processing" appears in the documents. I tried TF-IDF and bag of words; however, these give me the frequency of each word…
Abhijeet
  • 61
  • 3
  • 5
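
The phrase counting described in this question can be sketched in Python with the standard `re` module. This is a minimal sketch, not the asker's method: the helper name `count_phrase` and the sample documents are hypothetical, and matching is assumed to be case-insensitive and whitespace-tolerant.

```python
import re

def count_phrase(documents, phrase):
    """Return the number of occurrences of `phrase` in each document."""
    # Escape each word and allow any run of whitespace between words.
    words = [re.escape(w) for w in phrase.split()]
    pattern = re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)
    return [len(pattern.findall(doc)) for doc in documents]

# Hypothetical sample documents, for illustration only.
docs = [
    "Natural language processing is fun. I study natural language processing.",
    "Language processing can be natural, but this is not the phrase.",
]
counts = count_phrase(docs, "Natural Language Processing")
print(counts)       # [2, 0]
print(sum(counts))  # 2
```

Counting the phrase as a single unit avoids the per-word frequencies that a unigram bag-of-words model produces.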
0 votes, 2 answers

Should text mining preprocessing be applied to the test set or to the training set?

I'm doing some text-mining tasks and have a simple question on which I still can't reach a conclusion. I am applying preprocessing, such as tokenization and stemming, to my training set so I can train my model. Should I also apply this…
user14738548
  • 167
  • 1
  • 1
  • 7
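
As a general rule, whatever preprocessing is applied to the training set is also applied to the test set, so the model sees inputs in the same form. The sketch below illustrates this in Python under stated assumptions: `preprocess` uses a toy suffix-stripping stemmer invented for illustration (not a real stemming algorithm), and the sample texts are hypothetical.

```python
import re

def preprocess(text):
    """Tokenize and crudely stem a text (toy stemmer, illustration only)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stemmed = []
    for tok in tokens:
        # Hypothetical toy stemmer: strip one common suffix if enough remains.
        for suffix in ("ing", "ed", "s"):
            if tok.endswith(suffix) and len(tok) > len(suffix) + 2:
                tok = tok[: -len(suffix)]
                break
        stemmed.append(tok)
    return stemmed

train_texts = ["Dogs barked loudly", "Cats are sleeping"]
test_texts = ["The dog barks"]

# The same function is applied to both sets.
train_tokens = [preprocess(t) for t in train_texts]
test_tokens = [preprocess(t) for t in test_texts]

# The vocabulary is built from the training set only.
vocabulary = {tok for doc in train_tokens for tok in doc}
known = [[tok for tok in doc if tok in vocabulary] for doc in test_tokens]
print(known)  # [['dog', 'bark']]
```

Without stemming the test text, "barks" would not match the training token "bark", which is why the two sets need identical treatment.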
0 votes, 0 answers

R: Extracting Individual "Terms" from a Matrix

I am using the R programming language. Using the following 3 "articles" (Shakespeare's plays), I created a "term document matrix" (an R "object" used for text analytics). First, I create these 3 articles: #load…
stats_noob
  • 5,401
  • 4
  • 27
  • 83
0 votes, 2 answers

Turning a JSON-type column into an R data frame

Here I have a dataframe df1 that I would like to turn into a dataframe df2. Does anybody have any suggestions/ideas? df1 <- data.frame (ID = c("UniqueValue1", "UniqueValue2", "UniqueValue3", "UniqueValue4", "UniqueValue5",…
puj831
  • 109
  • 7
0 votes, 1 answer

Undo the tokenization in Python

I would like to reverse the tokenization that I have applied to my data. data = [['this', 'is', 'a', 'sentence'], ['this', 'is', 'a', 'sentence', '2']] Expected output: ['this is a sentence', 'this is a sentence 2'] I tried to do this with the…
Tazz
  • 81
  • 9
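
The input and expected output quoted in this question can be reproduced with `str.join`. This is a minimal sketch that assumes tokens are rejoined with single spaces; it does not restore the original spacing around punctuation.

```python
# Data copied from the question excerpt above.
data = [['this', 'is', 'a', 'sentence'], ['this', 'is', 'a', 'sentence', '2']]

# Join the tokens of each inner list with a single space.
sentences = [" ".join(tokens) for tokens in data]
print(sentences)  # ['this is a sentence', 'this is a sentence 2']
```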
0 votes, 0 answers

Letters missing in PDF text extraction

I am a beginner Python user (Python 3.8.8, Mac) and am facing a problem with missing letters when converting PDF to text. My issue: I tried to extract text from PDFs and tokenise the words in the text. However, some words are missing ending…
Saya
  • 1
0 votes, 1 answer

R: Converting Tibbles to a Term Document Matrix

I am using the R programming language. I learned how to take pdf files from the internet and load them into R. For example, below I load 3 different books by Shakespeare into R: library(pdftools) library(tidytext) library(textrank) library(tm) #1st…
stats_noob
  • 5,401
  • 4
  • 27
  • 83
0 votes, 2 answers

For using a dataframe in Python

I have a df structured as follows: text sentiment XXXXX yes YYYYYY no I'm trying to check the accuracy manually, according to this code ... However, I can't apply it to my DF, and I get the following error: ValueError: too many values…
Tazz
  • 81
  • 9
0 votes, 1 answer

How to calculate R1 (lexical richness index) in R?

Hi, I need to write a function to calculate R1, which is defined as follows: R1 = 1 - (F(h) - h*h/(2N)), where N is the number of tokens, h is the Hirsch point, and F(h) is the cumulative relative frequencies up to that point. Using quanteda package…
0 votes, 0 answers

how to split texts in an increasing manner?

I have a list of texts read into the software using the readtext library. files <-readtext(paste0(wd), "/r/*.pdf", ignore_missing_files = FALSE, text_field = "texts") The 100 PDF files are of unequal sizes, varying from 6000 to 40000 words.…
0 votes, 1 answer

how to calculate h-point (in R)

I am trying to write a function to calculate the h-point. The function is defined over a rank-frequency data frame. Consider the following data.frame: DATA <-data.frame(frequency=c(64,58,54,32,29,29,25,17,17,15,12,12,10), rank=c(seq(1, 13))) and the…
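
The question's definition of the h-point is cut off, so the sketch below rests on an assumption: the h-point is taken to be the rank r at which the frequency f(r) equals r, with linear interpolation between the two neighbouring ranks when no rank matches exactly. The function name `h_point` is hypothetical, and the sketch is written in Python rather than R.

```python
def h_point(frequencies):
    """Return the h-point of a frequency list, or None if f(r) > r everywhere.

    Assumed definition: the rank r where f(r) == r, interpolated linearly
    between neighbouring ranks otherwise. Frequencies are assumed to be
    positive counts (>= 1).
    """
    ranked = sorted(frequencies, reverse=True)
    for i, f_j in enumerate(ranked):
        r_j = i + 1
        if f_j == r_j:
            return float(r_j)
        if f_j < r_j:
            # The curve crossed the line f = r between ranks r_i and r_j.
            r_i, f_i = r_j - 1, ranked[i - 1]
            return (f_i * r_j - f_j * r_i) / (r_j - r_i + f_i - f_j)
    return None

# Frequencies copied from the question excerpt above.
frequency = [64, 58, 54, 32, 29, 29, 25, 17, 17, 15, 12, 12, 10]
print(h_point(frequency))  # 12.0
```

For the data in the question, rank 12 has frequency 12, so no interpolation is needed under this definition.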
0 votes, 1 answer

How to transform unstructured text to rdf turtle in practice?

I am currently working on a study project where I have to transform the vehicle complaint descriptions from the NHTSA database (https://catalog.data.gov/dataset/nhtsas-office-of-defects-investigation-odi-complaints) into RDF Turtle and later into…
0 votes, 3 answers

Generating a dummy variable using grepl()

I wrote the following and it works without errors. df2$qualifications <- as.numeric(grepl("high school|Bachelor|master|phd",df2$description,ignore.case=TRUE)) df2$qualifications This is the output, which shows 1 if any of the words above is mentioned…
maldini425
  • 307
  • 3
  • 14
0 votes, 1 answer

Error in file(con, "r") : cannot open the connection when using lapply

I have a folder with about 100 txt files. I only run this simple code: > setwd("E:/Yunlin/SMUNPO/TXTFILE/") > filenames <- list.files(getwd(),pattern="*.txt") > textfiles <- lapply(filenames, readLines) However, the result is Error in file(con, "r") :…
Pham Van
  • 1
  • 1
0 votes, 1 answer

How to create a matrix from sub-elements of a list (in R)?

To put it simply, I have a list of DFMs created by the quanteda package (LD1). Each DFM has different texts of different lengths. Now I want to calculate and compare lexical diversity for each text within DFMs and among DFMs. lex.div <-lapply(LD1,…