Questions tagged [text-mining]

Text Mining is a process of deriving high-quality information from unstructured (textual) information.

Possible applications of text mining include:

Comments in survey responses
Customer messages, emails, complaints etc.
Investigating competitors by crawling their web sites

Calculate the frequency of a word pattern in documents

I am trying to calculate the frequency of a word pattern in documents, e.g. how many times the word pattern "Natural Language Processing" appears in the documents. I tried using TF-IDF and bag of words; however, they give me the frequency of each word…

nlp text-mining

asked Apr 21 '21 at 11:03

Abhijeet
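
The question above asks for the count of a whole phrase rather than of its individual words. A minimal sketch using only the Python standard library, assuming case-insensitive matching and that any run of whitespace may separate the words (the function name `count_phrase` and the sample documents are illustrative, not from the question):

```python
import re

def count_phrase(documents, phrase):
    """Return how many times `phrase` occurs in each document."""
    # Escape each word and allow any run of whitespace between words.
    words = [re.escape(word) for word in phrase.split()]
    pattern = re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)
    return [len(pattern.findall(doc)) for doc in documents]

# Hypothetical sample documents for illustration.
docs = [
    "Natural language processing is fun. I study natural language processing.",
    "Processing natural language is hard.",
]
counts = count_phrase(docs, "Natural Language Processing")
print(counts)       # [2, 0]
print(sum(counts))  # 2
```

If a bag-of-words pipeline is preferred, scikit-learn's `CountVectorizer` accepts an `ngram_range` parameter, so counting three-word sequences instead of single words may be another option.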


2 answers

Must text-mining preprocessing be applied to the test set or to the training set?

I'm doing some text-mining tasks and I have a simple question on which I still can't reach a conclusion. I am applying pre-processing, such as tokenization and stemming, to my training set so I can train my model. Should I also apply this…

python nlp text-mining sentiment-analysis

asked Apr 17 '21 at 20:34

user14738548
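
A common convention, sketched here under stated assumptions, is to run the same preprocessing function over both splits while learning anything data-dependent (such as the vocabulary) from the training set only. The toy stemmer, the sample texts, and all names below are hypothetical and stand in for whatever tokenizer and stemmer the asker actually uses:

```python
import re
from collections import Counter

def preprocess(text):
    """Lowercase, tokenize, and crudely stem one document."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Toy stemmer (an assumption for illustration): strip a trailing "s".
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

# Hypothetical sample data.
train_texts = ["Great movies and great actors", "Boring plots"]
test_texts = ["Great actor, boring movie"]

# Apply the SAME preprocessing to both splits.
train_tokens = [preprocess(t) for t in train_texts]
test_tokens = [preprocess(t) for t in test_texts]

# Learn the vocabulary from the training set only.
vocabulary = sorted({tok for doc in train_tokens for tok in doc})

def vectorize(tokens):
    """Count vocabulary terms; tokens unseen in training are ignored."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

train_vectors = [vectorize(doc) for doc in train_tokens]
test_vectors = [vectorize(doc) for doc in test_tokens]
print(vocabulary)    # ['actor', 'and', 'boring', 'great', 'movie', 'plot']
print(test_vectors)  # [[1, 0, 1, 1, 1, 0]]
```

Without the shared preprocessing step, "actor" in the test text would not line up with "actors" seen during training.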


0 answers

R: Extracting Individual "Terms" from a Matrix

I am using the R programming language. Using the following 3 "articles" (Shakespeare's plays), I created a "term document matrix" (an R "object" used for text analytics). First, I create these 3 articles: #load…

r matrix nlp extract text-mining

asked Apr 14 '21 at 17:01

stats_noob



2 answers

Turning a JSON-type column into an R dataframe

Here I have a dataframe df1 that I would like to turn into a dataframe df2. Does anybody have any suggestions/ideas? df1 <- data.frame (ID = c("UniqueValue1", "UniqueValue2", "UniqueValue3", "UniqueValue4", "UniqueValue5",…

r json dataframe text-mining

asked Apr 14 '21 at 13:44

puj831


1 answer

Undo the tokenization in Python

I would like to reverse the tokenization that I have applied to my data. data = [['this', 'is', 'a', 'sentence'], ['this', 'is', 'a', 'sentence', '2']] Expected output: ['this is a sentence', 'this is a sentence 2'] I tried to do this with the…

python text-mining

asked Apr 13 '21 at 21:23

Tazz
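
For the input and expected output quoted in the question above, one straightforward approach is to join each token list with a single space. This is a minimal sketch; it assumes the tokens were produced by plain whitespace splitting, so punctuation spacing is not restored:

```python
data = [['this', 'is', 'a', 'sentence'], ['this', 'is', 'a', 'sentence', '2']]

# Join the tokens of each inner list back into one string.
sentences = [" ".join(tokens) for tokens in data]
print(sentences)  # ['this is a sentence', 'this is a sentence 2']
```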


0 answers

Letters Missing in PDF Text Extraction

I am a beginner Python user (Python 3.8.8, Mac), and I am facing the problem of missing letters when converting PDFs to text. My issue: I tried to extract text from PDFs and tokenise the words in the text. However, some words are missing ending…

python nlp text-mining text-extraction pdfminer

asked Apr 13 '21 at 17:01

Saya


1 answer

R: Converting Tibbles to a Term Document Matrix

I am using the R programming language. I learned how to take PDF files from the internet and load them into R. For example, below I load 3 different books by Shakespeare into R: library(pdftools) library(tidytext) library(textrank) library(tm) #1st…

r text nlp text-mining term-document-matrix

asked Apr 09 '21 at 06:21

stats_noob



2 answers

For using a dataframe in Python

I have a df structured as follows: text sentiment XXXXX yes YYYYYY no I'm trying to check the accuracy manually, according to this code ... However, I can't apply it to my DF, and I get the following error: ValueError: too many values…

python for-loop text-mining textblob

asked Apr 05 '21 at 19:55

Tazz


1 answer

How to calculate R1 (lexical richness index) in R?

Hi, I need to write a function to calculate R1, which is defined as follows: R1 = 1 - (F(h) - h*h/(2N)), where N is the number of tokens, h is the Hirsch point, and F(h) is the cumulative relative frequencies up to that point. Using quanteda package…

r list function text-mining quanteda

asked Apr 05 '21 at 09:50

Mohammad Farsadnia


0 answers

How to split texts in an increasing manner?

I have a list of texts read into the software using the readtext library. files <-readtext(paste0(wd), "/r/*.pdf", ignore_missing_files = FALSE, text_field = "texts") The 100 PDF files are of unequal sizes, ranging from 6000 to 40000 words.…

r text-mining stringr stringi text-chunking

asked Apr 02 '21 at 16:30

Mohammad Farsadnia


1 answer

How to calculate the h-point (in R)

I am trying to write a function to calculate the h-point. The function is defined over a rank-frequency data frame. Consider the following data.frame: DATA <-data.frame(frequency=c(64,58,54,32,29,29,25,17,17,15,12,12,10), rank=c(seq(1, 13))) and the…

r function if-statement text-mining

asked Mar 25 '21 at 23:01

Mohammad Farsadnia
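
The question's own definition of the h-point is cut off, so the following is only a sketch under an assumption: the h-point is taken to be the rank at which rank equals frequency and, when no such rank exists, the point where the straight line between the two neighbouring (rank, frequency) points crosses the line frequency = rank. It is written in Python rather than R, and the name `h_point` is illustrative:

```python
def h_point(frequencies):
    """Return the h-point of frequencies sorted in descending order.

    Ranks are assumed to be 1, 2, 3, ... in list order.
    """
    ranked = list(enumerate(frequencies, start=1))
    # Case 1: some rank is exactly equal to its frequency.
    for rank, freq in ranked:
        if rank == freq:
            return float(rank)
    # Case 2: interpolate linearly between the last point with
    # rank < frequency and the first point with rank > frequency.
    for (r_i, f_i), (r_j, f_j) in zip(ranked, ranked[1:]):
        if r_i < f_i and r_j > f_j:
            return (f_i * r_j - f_j * r_i) / (r_j - r_i + f_i - f_j)
    return None  # The curve never crosses frequency = rank.

# Frequencies from the data frame quoted in the question.
frequencies = [64, 58, 54, 32, 29, 29, 25, 17, 17, 15, 12, 12, 10]
print(h_point(frequencies))  # 12.0
```

For the quoted data, rank 12 has frequency 12, so the first case applies and no interpolation is needed.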


1 answer

How to transform unstructured text to RDF Turtle in practice?

I am currently working on a study project where I have to transform the vehicle complaint descriptions from the NHTSA Database (https://catalog.data.gov/dataset/nhtsas-office-of-defects-investigation-odi-complaints) into RDF Turtle and later into…

python nlp text-mining turtle-rdf knowledge-graph

asked Mar 25 '21 at 18:55

Dennis Luo


3 answers

Generating a dummy variable using grepl()

I wrote the following and it works without errors. df2$qualifications <- as.numeric(grepl("high school|Bachelor|master|phd",df2$description,ignore.case=TRUE)) df2$qualifications This is the output, which shows 1 if any of the words above is mentioned…

r tidyverse text-mining

asked Mar 23 '21 at 18:18

maldini425


1 answer

Error in file(con, "r") : cannot open the connection when using lapply

I have a folder with about 100 .txt files. I only run this simple code: > setwd("E:/Yunlin/SMUNPO/TXTFILE/") > filenames <- list.files(getwd(),pattern="*.txt") > textfiles <- lapply(filenames, readLines) However, the result is Error in file(con, "r") :…

r text-mining

asked Mar 23 '21 at 18:00

Pham Van


1 answer

How to create a matrix from sub-elements of a list (in R)?

To put it simply, I have a list of DFMs created by the quanteda package (LD1). Each DFM has different texts of different lengths. Now, I want to calculate and compare lexical diversity for each text within DFMs and among DFMs. lex.div <-lapply(LD1,…

r list matrix text-mining quanteda

asked Mar 23 '21 at 15:10

Mohammad Farsadnia
