Questions tagged [text-parsing]

Text parsing is a form of parsing: breaking a stream of text into components and capturing the relationships between those components.

When the stream of text is arbitrary, parsing is often used to mean breaking the stream into constituent atoms (words or lexemes).
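As a minimal sketch of this kind of splitting, assuming Python and a deliberately simple definition of a lexeme (a run of word characters, or a single punctuation character), a regular expression can break arbitrary text into atoms. The `tokenize` name and the pattern are illustrative choices, not a standard.

```python
import re

# Assumption: a lexeme is either a run of word characters or one
# character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Break a stream of text into a list of lexemes, dropping whitespace."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
```

Real tokenizers usually need more rules (quoted strings, numbers with decimal points, and so on), so treat the pattern as a starting point.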

When the stream of text corresponds to natural language, parsing is used to mean breaking the stream into natural language elements (words and punctuation) and discovering the structure of the text as phrases or sentences.

When the stream of text corresponds to a computer source language (or another formal language), parsing consists of applying one of a variety of parsing algorithms (ad hoc, recursive descent, LL, LR, packrat, Earley, or others) to the source text, which is often first broken into lexemes by a separate lower-level parser called a "lexer". The parser verifies that the text is valid in the source language and often constructs a parse tree representing the grammar productions that cover the text.
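To make the lexer-plus-parser split concrete, here is a small sketch in Python of a recursive descent parser for a toy arithmetic language. The grammar, the `lex` and `parse` helpers, and the tuple-based tree shape are all assumptions chosen for illustration; they are not taken from any particular library.

```python
import re

# Hypothetical toy grammar (an assumption for this example):
#   expr   := term (("+" | "-") term)*
#   term   := factor (("*" | "/") factor)*
#   factor := NUMBER | "(" expr ")"
TOKEN_RE = re.compile(r"\s*(\d+|[-+*/()])")

def lex(source):
    """Lexer: break the source text into a list of lexemes."""
    source = source.rstrip()
    tokens, pos = [], 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

class Parser:
    """Recursive descent parser: one method per grammar production."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def parse(self):
        tree = self.expr()
        if self.peek() is not None:
            raise SyntaxError(f"unexpected token {self.peek()!r}")
        return tree

    def expr(self):
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.advance()
            node = (op, node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek() in ("*", "/"):
            op = self.advance()
            node = (op, node, self.factor())
        return node

    def factor(self):
        token = self.advance()
        if token is not None and token.isdigit():
            return int(token)
        if token == "(":
            node = self.expr()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {token!r}")

def parse(source):
    """Validate the source text and return its parse tree as nested tuples."""
    return Parser(lex(source)).parse()

print(parse("1 + 2 * (3 - 4)"))  # ('+', 1, ('*', 2, ('-', 3, 4)))
```

Invalid input such as `"1 +"` raises `SyntaxError`, which is the "verify the validity" half of the job; the nested tuples are the parse tree half.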

1268 questions
6 votes, 4 answers

PDF Text Extraction Approach Using OCR

Has anybody attempted to extract text from a PDF using an OCR library and Java? What did you find to be the most reliable library for text extraction? Most of the approaches I've seen (tesseract, GOCR) are C libraries that would require some JNI…
Jonathan Holloway
6 votes, 5 answers

Retrieve definition for parenthesized abbreviation, based on letter count

I need to retrieve the definition of an acronym based on the number of letters enclosed in parentheses. For the data I'm dealing with, the number of letters in parentheses corresponds to the number of words to retrieve. I know this isn't a reliable…
6 votes, 3 answers

Powershell: Read Text file line by line and split on "|"

I am having trouble splitting a line into an array using the "|" in a text file and reassembling it in a certain order. There are multiple lines like the original line in the text file. This is the original line:…
Dennis
6 votes, 2 answers

How can I extract/parse tabular data from a text file in Perl?

I am looking for something like HTML::TableExtract, just not for HTML input, but for plain text input that contains "tables" formatted with indentation and spacing. Data could look like this: Here is some header text. Column One Column Two …
Thilo
6 votes, 4 answers

Parse string into a tree structure?

I'm trying to figure out how to parse a string in this format into a tree like data structure of arbitrary depth. "{{Hello big|Hi|Hey} {world|earth}|{Goodbye|farewell} {planet|rock|globe{.|!}}}" [[["Hello big" "Hi" "Hey"] ["world" "earth"]] …
erikcw
6 votes, 3 answers

Randomizing text between delimiters

I have this simple input I have {red;green;orange} fruit and cup of {tea;coffee;juice} I use Perl to identify patterns between two external brace delimiters { and }, and randomize the fields inside with the internal delimiter ;. I'm getting this…
kempinski
6 votes, 2 answers

List files on HTTP/FTP server in R

I'm trying to get a list of files on an HTTP/FTP server from R, so that in the next step I will be able to download them (or select some of the files which meet my criteria to download). I know that it is possible to use an external program in a web browser…
matandked
6 votes, 2 answers

Parse values from a string

How would you parse the values in a string, such as the one below? 12:40:11 8 5 87 The gap between numbers varies, and the first value is a time. The following regular expression does not separate the time…
jgg
6 votes, 2 answers

Regex pattern isn't matching certain show titles

Using C# regex to match and return data parsed from a string is returning unreliable results. The pattern I am using is as follows : Regex r=new Regex( @"(.*?)S?(\d{1,2})E?(\d{1,2})(.*)|(.*?)S?(\d{1,2})E?(\d{1,2})", …
Kraang Prime
6 votes, 7 answers

How to do a circular shift of strings in bash?

I have a homework assignment where I need to take input from a file and continuously remove the first word in a line and append it to the end of the line until all combinations have been done. I really don't know where to begin and would be thankful…
Kyle Van Koevering
6 votes, 1 answer

Can I control the way the CountVectorizer vectorizes the corpus in scikit learn?

I am working with a CountVectorizer from scikit learn, and I'm possibly attempting to do some things that the object was not made for...but I'm not sure. In terms of getting counts for occurrence: vocabulary = ['hi', 'bye', 'run away!'] corpus =…
tumultous_rooster
6 votes, 0 answers

Parsing Expression Grammar for syntax highlighting

First... Would it be possible to accomplish simple syntax highlighting using a PEG? I'm only looking for it to be able to recognize and highlight basic things that are common to C-style languages. Second... If there are any examples of this or…
Matt Zera
6 votes, 5 answers

What do people mean when they say “Perl is very good at parsing”?

What do people mean when they say "Perl is very good at parsing"? How is Perl any better or more powerful than other scripting languages such as Python or Ruby?
Quintin Par
5 votes, 3 answers

Parse string in javascript

How can I parse this string in JavaScript? var string = "http://www.facebook.com/photo.php?fbid=322916384419110&set=a.265956512115091.68575.100001022542275&type=1"; I just want to get the "265956512115091" from the string. I somehow parse this…
Robin Carlo Catacutan
5 votes, 6 answers

String parsing, extracting numbers and letters

What's the easiest way to parse a string and extract a number and a letter? I have a string that can be in the following format (number|letter or letter|number), e.g. "10A", "B5", "C10", "1G", etc. I need to extract the 2 parts, i.e. "10A" -> "10" and…
Matt Warren