
I've got a data file that looks like this:

Things
├── Foo
│  ├── 1. Item One
│  ├── 2. Item Two
│  ├── 3. Item Three
│  ├── 4. Item Four
│  ├── 5. Item Five
│  └── 6. Item Six
├── Bar
│  ├── 1. Item Seven
│  ├── 2. Item Eight
│  ├── 3. Item Nine

What I'm trying to do is find a certain string, the number associated with it, and the subheading it is a part of ('Foo' or 'Bar').

It's pretty easy to grab the item and the number:

str = "Item One"
data.each_line do |line|
    if line =~ /#{str}/
        /(?<num>\d+)\.\s(?<item>.*)/ =~ line
    end
end

But I'm not sure how to get the subheading. What I was thinking is that once I found the line, I could count up from that point using the number. Is there a readlines or a seek command or some such that could do this?

Appreciate the help!

craigeley
  • Your method for processing the text file isn't scalable. You're assuming you can hold the entire file in memory, but everything grows over time, and eventually you'll encounter data that will not fit. Also, what you're doing is called 'slurping', which is slower and less efficient than reading a file line-by-line using `foreach`. I'd recommend rethinking how you want to do this, and consider line-by-line IO for speed and scalability. http://stackoverflow.com/questions/25189262/why-is-slurping-a-file-bad. Also become familiar with `$.` or `$INPUT_LINE_NUMBER` and the related variables. – the Tin Man Aug 10 '15 at 19:45
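
The line-by-line approach suggested in the comment above could look roughly like the following. This is a minimal sketch, not the commenter's own code: `find_item` is a hypothetical helper, and it assumes every subheading starts at the beginning of a line with "├── " or "└── ". Instead of counting back by line number, it remembers the most recent subheading while scanning forward.

```ruby
require "stringio"

# Hypothetical helper: scans any IO-like object one line at a time and
# returns [heading, number, item] for the first match, or nil if none.
def find_item(io, target)
  heading = nil
  io.each_line do |line|
    case line
    when /\A[├└]── (?<name>.+)$/        # assumed shape of a subheading line
      heading = $~[:name].strip
    when /(?<num>\d+)\.\s+#{Regexp.escape(target)}\s*$/
      return [heading, $~[:num], target]
    end
  end
  nil
end

data = StringIO.new(<<~TREE)
  Things
  ├── Foo
  │  ├── 1. Item One
  │  └── 2. Item Two
  ├── Bar
  │  ├── 1. Item Seven
TREE

p find_item(data, "Item Seven")   # => ["Bar", "1", "Item Seven"]
```

With a real file, the same method should work on an open handle, for example `File.open(path) { |f| find_item(f, "Item Seven") }`, so only one line is held in memory at a time.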

2 Answers


I came up with the code below, which seems to work:

data = <<-EOF
Things
├── Foo
│  ├── 1. Item One
│  ├── 2. Item Two
│  ├── 3. Item Three
│  ├── 4. Item Four
│  ├── 5. Item Five
│  └── 6. Item Six
├── Bar
│  ├── 1. Item Seven
│  ├── 2. Item Eight
│  ├── 3. Item Nine
EOF

str = "Item One"
data.lines.each_with_index do |line, i|
    if /(?<num>\d+)\.\s+#{str}/ =~ line
        /(?<var>\w+)/ =~ data.lines[i - (n = $~[:num]).to_i]
        p [n, str, var] # ["1", "Item One", "Foo"]
    end
end

The assignment (n = $~[:num]) saves the captured value of num from

if /(?<num>\d+)\.\s+#{str}/ =~ line

in a variable (here n). The last match data, held in the global variable $~, is overwritten by the next regex match, which takes place in the statement

/(?<var>\w+)/ =~ data.lines[i - (n = $~[:num]).to_i]

so unless the value is stored first, the captured num is lost. Because the first regex contains the interpolation #{str}, Ruby does not create a local variable num for it, which is why the capture has to be read from $~. The pattern uses \d+ rather than \d so that item numbers of 10 and above are captured in full.
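
The behaviour described above can be checked in isolation. The following is a small sketch with made-up sample strings (`line`, `str`, and the variable names `plain`, `saved`, and `word` are illustrative only). It shows that a literal regex assigns its named capture to a local variable, that an interpolated regex does not, and that `$~` is replaced by the next match.

```ruby
line = "│  ├── 10. Item Ten"
str  = "Item Ten"

# Literal regex without interpolation: the named capture becomes a local.
/(?<plain>\d+)\.\s+Item Ten/ =~ line
p plain                                      # => "10"

# Interpolated regex: no local named num is created; use $~ instead.
/(?<num>\d+)\.\s+#{str}/ =~ line
p binding.local_variable_defined?(:num)      # => false
saved = $~[:num]                             # copy before $~ is replaced

# The next match replaces $~, so only the saved copy still holds "10".
/(?<word>[A-Za-z]+)/ =~ "├── Foo"
p [saved, word, $~.names]                    # => ["10", "Foo", ["word"]]
```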

the Tin Man
Wand Maker
  • Works like a charm! Thanks. Wasn't aware of the with_index enumerator. Very nice. – craigeley Aug 10 '15 at 17:30
  • You can replace the condition `line =~ /#{str}/` with `/(?<num>\d).\s#{str}/ =~ line` and get rid of `/(?<num>\d).\s(?<item>.*)/ =~ line`. – sawa Aug 10 '15 at 17:39
  • @sawa That seems to give error `undefined local variable or method 'num' for main:Object (NameError)` inside the if block at `/(?<var>\w+)/ =~ data.lines[i - num.to_i]` – Wand Maker Aug 10 '15 at 17:42
  • It is not calling `num` inside the condition. By the way, `\d` should become `\d+`, and `.` after `(?<num>\d)` should be escaped. – sawa Aug 10 '15 at 17:43
  • @sawa I mean error gets reported at `num.to_i`, I think `num` is getting defined outside the scope of if block if we do the way you suggest. I did comment out the first line in `if` block assuming that `num` will be captured in `if` condition – Wand Maker Aug 10 '15 at 17:45
  • I see. Maybe you can then refer to it as `$~[:num]`. – sawa Aug 10 '15 at 17:48
  • Okay, that works fine. Will update the answer. Thanks - now I need to figure out what `$~[:num]` does :-) – Wand Maker Aug 10 '15 at 17:54
  • @sawa I reverted partially back to original answer as I was unable to print `num` and `item` value after modifying the `if` condition – Wand Maker Aug 10 '15 at 18:10
  • @sawa - Figured it out now, have updated the answer with explanation. Thanks for valuable inputs – Wand Maker Aug 10 '15 at 18:38
  • It isn't necessary to add "Update" sections in questions or answers. We can see what was changed if necessary. Instead, add the information in a way that makes sense and follows correct grammar. "PS" isn't necessary or particularly desirable; Answers are not conversations, they're a technical description of a solution like you'd see in a cookbook or an encyclopedia. Finally, there's no need or expectation of using headers like "Explanation". Look at the normal/usual formatting of questions and answers and follow that form. A common and consistent look and feel is the goal. – the Tin Man Aug 10 '15 at 19:40
  • @theTinMan Got it. Thanks for the tips – Wand Maker Aug 10 '15 at 19:51
  • Sorry to revisit this old thread, but I found that this has a weird exception when the list gets to item 10. In that case, `var` will return "10", instead of "Foo". This is not the case for numbers above 10, but just 10. Any guesses to why that's happening? – craigeley Dec 18 '15 at 22:30
  • Oh! It's the \d+ note that @Sawa made several comments up. Perhaps the answer can be edited to reflect that? – craigeley Dec 18 '15 at 22:38

Here's another way (using @Wand's data):

LAZY_T = "├── " 
target = "Item Four"

str = data.split(/\n#{LAZY_T}/).find { |s| s =~ /\b#{target}\b/ }
str && [str[/[a-zA-Z]+/], str[/(\d+)\.\s#{target}\b/,1]]
  #=> ["Foo", "4"]

The first line pulls out the applicable part of the string ("Foo" or "Bar"), if there is one. The second line extracts the two desired elements.
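
To make the two steps easier to follow, here is a sketch on a shortened copy of the data that prints the intermediate chunks produced by the split. The names `chunks` and `chunk`, and the use of `Regexp.escape`, are additions for illustration and are not part of the code above.

```ruby
data = <<~TREE
  Things
  ├── Foo
  │  ├── 1. Item One
  │  └── 2. Item Two
  ├── Bar
  │  ├── 1. Item Seven
TREE

# Splitting on a newline followed by "├── " yields one chunk per subheading
# (plus the leading "Things" chunk), each starting with its heading name.
chunks = data.split(/\n├── /)
p chunks.map { |c| c[/[a-zA-Z]+/] }   # => ["Things", "Foo", "Bar"]

target = "Item Seven"
pattern = Regexp.escape(target)       # guards against regex metacharacters
chunk = chunks.find { |c| c =~ /\b#{pattern}\b/ }
result = chunk && [chunk[/[a-zA-Z]+/], chunk[/(\d+)\.\s#{pattern}\b/, 1]]
p result                              # => ["Bar", "1"]
```

Nested item lines begin with "│", not "├", so they do not trigger a split.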

Note:

LAZY_T.split('').map(&:ord)
  #=> [9500, 9472, 9472, 32]
Cary Swoveland