Questions tagged [text-parsing]

Text parsing is a form of parsing: breaking a stream of text into components and capturing the relationships between those components.

When the stream of text is arbitrary, parsing is often used to mean breaking the stream into constituent atoms (words or lexemes).

When the stream of text corresponds to natural language, parsing is used to mean breaking the stream into natural language elements (words and punctuation) and discovering the structure of the text as phrases or sentences.

When the stream of text corresponds to a computer source language (or another formal language), parsing consists of applying one of a variety of parsing algorithms (ad hoc, recursive descent, LL, LR, Packrat, Earley, or others) to the source text, which is often first broken into lexemes by a lower-level parser called a "lexer". The parser verifies that the text is valid in the source language and often constructs a parse tree representing the grammar productions used to tile the text.
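As a rough illustration of the lexer-plus-parser arrangement described above, here is a minimal sketch of a recursive descent parser for integer arithmetic. The grammar, the nested-tuple tree format, and the names `tokenize`, `Parser`, and `parse` are all assumptions made up for this example; they are not taken from any particular library or question.

```python
import re

# Assumed token set for this sketch: integers, + - * /, and parentheses.
TOKEN_RE = re.compile(r"\d+|[-+*/()]")


def tokenize(text):
    """Lexer: break the character stream into a list of lexemes."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        tokens.append(match.group())
        pos = match.end()
    return tokens


class Parser:
    """Recursive descent parser: one method per grammar rule."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        token = self.peek()
        self.pos += 1
        return token

    def expr(self):
        # expr := term (('+' | '-') term)*
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.take()
            node = (op, node, self.term())
        return node

    def term(self):
        # term := factor (('*' | '/') factor)*
        node = self.factor()
        while self.peek() in ("*", "/"):
            op = self.take()
            node = (op, node, self.factor())
        return node

    def factor(self):
        # factor := NUMBER | '(' expr ')'
        token = self.take()
        if token is None:
            raise SyntaxError("unexpected end of input")
        if token.isdigit():
            return ("num", int(token))
        if token == "(":
            node = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {token!r}")


def parse(text):
    """Validate the text against the grammar and return its parse tree."""
    parser = Parser(tokenize(text))
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree
```

Calling `parse("1 + 2 * (3 - 4)")` returns a nested tuple in which the multiplication sits below the addition, reflecting operator precedence; invalid input such as `"1 +"` raises `SyntaxError`, which corresponds to the validity check mentioned above.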

1268 questions
25
votes
6 answers

Split alphanumeric string between leading digits and trailing letters

I have a string like: $Order_num = "0982asdlkj"; How can I split that into the 2 variables, with the number as one element and then another variable with the letter element? The number element can be any length from 1 to 4 say and the letter…
David19801
22
votes
2 answers

Create Great Parser - Extract Relevant Text From HTML/Blogs

I'm trying to create a generalized HTML parser that works well on Blog Posts. I want to point my parser at the specific entry's URL and get back clean text of the post itself. My basic approach (from python) has been to use a combination of…
nartz
21
votes
4 answers

PowerShell command to trim path if it ends with "\"

I need to trim path if it ends with \. C:\Ravi\ I need to change to C:\Ravi I have a case where path will not end with \ (Then it must skip). I tried with .EndsWith("\"), but it fails when I have \\ instead of \. Can this be done in PowerShell…
Ravichandra
21
votes
9 answers

Elegant structured text file parsing

I need to parse a transcript of a live chat conversation. My first thought on seeing the file was to throw regular expressions at the problem but I was wondering what other approaches people have used. I put elegant in the title as I've previously…
russtbarnacle
18
votes
4 answers

Saving nltk drawn parse tree to image file

Is there any way to save the draw image from tree.draw() to an image file programmatically? I tried looking through the documentation, but I couldn't find anything.
John
17
votes
3 answers

How to find the shortest dependency path between two words in Python?

I am trying to find the dependency path between two words in Python, given a dependency tree. For the sentence "Robots in popular culture are there to remind us of the awesomeness of unbound human agency." I used practnlptools…
Sean
16
votes
11 answers

Create acronym from a string containing only words

I'm looking for a way that I can extract the first letter of each word from an input field and place it into a variable. Example: if the input field is "Stack-Overflow Questions Tags Users" then the output for the variable should be something like…
dmschenk
16
votes
4 answers

Python: Read configuration file with multiple lines per key

I am writing a small DB test suite, which reads configuration files with queries and expected results, e.g.: query = "SELECT * from cities WHERE name='Unknown';" count = 0 level = 1 name = "Check for cities whose…
Adam Matan
15
votes
3 answers

How do I keep a Scanner from throwing exceptions when the wrong type is entered?

Here's some sample code: import java.util.Scanner; class In { public static void main (String[]arg) { Scanner in = new Scanner (System.in) ; System.out.println ("how many are invading?") ; int a = in.nextInt() ; …
David
15
votes
4 answers

Parse a pipe-delimited string into 2, 3, 4 or 5 variables (depending on the input string)

I have a line like this in my code: list($user_id, $name, $limit, $remaining, $reset) = explode('|', $user); The last 3 parameters may or may not be there. Is there a function similar to list that will automatically ignore those last parameters if…
MikeG
14
votes
6 answers

How to transpose the contents of lines and columns in a file in Vim?

I know I can use Awk, but I am on a Windows box, and I am making a function for others that may not have Awk. I also know I can write a C program, but I would love not to have something that requires compilation and maintenance for a little Vim…
ojblass
14
votes
2 answers

JavaScript, Text Annotations and Ideas

I am very curious to hear input from others on a problem I've been contemplating for some time now. Essentially I would like to present a user with a text document and allow him/her to make selections of text and annotate it. Specific to the…
13
votes
5 answers

Strategy for parsing natural language descriptions into structured data

I have a set of requirements and I'm looking for the best Java-based strategy / algorithm / software to use. Basically, I want to take a set of recipe ingredients entered by real people in natural English and parse out the meta-data into a…
Jizzoe
13
votes
4 answers

NLTK Chunking and walking the results tree

I'm using NLTK RegexpParser to extract noungroups and verbgroups from tagged tokens. How do I walk the resulting tree to find only the chunks that are NP or V groups? from nltk.chunk import RegexpParser grammar = ''' NP: {?**} V:…
Vincent Theeten
13
votes
5 answers

How to clean comments from raw sql file

I have a problem with cleaning comments and empty lines from an already existing sql file. The file has over 10k lines so cleaning it manually is not an option. I have a little python script, but I have no idea how to handle comments inside multi line…
Szymon Lukaszczyk