I have the following input data, from which I'd like to remove repeated elements within each group and each subgroup, leaving all strings in their original order of appearance. A group begins with a string tagged s5: in this case, the first group is everything below "FIRST CHAPTER", and the next group begins at the first appearance of "SECOND CHAPTER". Within each group there can be subgroups, which are tagged s4, for example "FIRST PART", "INTRODUCTION", "SECOND PART", etc.

The input is the left-hand column. The second column is an explanation showing the occurrences of each string within the group and within the group/subgroup. The third column is the expected output, and the fourth column is the output I'm currently getting.

I've highlighted in yellow the first appearance of each string to show more clearly which elements should be printed in the output. The yellow lines are the first appearance in their respective group/subgroup; removing all the white lines gives the correct output. I hope that makes sense.

[Image: four-column table showing the input, the occurrence counts, the expected output, and the current output]

This is my current code, where the logic uses uniq to keep only the unique values. The output is similar but not correct, since the unique values are compared against the whole array and not against each group.

a=<<_
s5>>FIRST CHAPTER
s4>>FIRST PART
s4>>INTRODUCTION
s3>>Article 1
s5>>FIRST CHAPTER
s4>>FIRST PART
s4>>INTRODUCTION
s3>>Article 2
s5>>FIRST CHAPTER
s4>>SECOND PART
s4>>REVIEW
s3>>Article 1
s5>>FIRST CHAPTER
s4>>SECOND PART
s4>>METHODOLOGY
s3>>Article1
s5>>SECOND CHAPTER
s4>>FIRST PART
s4>>INTRODUCTION
s3>>First section
s5>>SECOND CHAPTER
s4>>FIRST PART
s4>>INTRODUCTION
s3>>Second Section
_

b = a.split("\n")
c = b.uniq

puts c

Could someone help me with how to do this? Thanks.

Input and Output below

| Input                 | Output                |
|---------------------- |--------------------   |
| s5>>FIRST   CHAPTER   | s5>>FIRST CHAPTER     |
| s4>>FIRST   PART      | s4>>FIRST PART        |
| s4>>INTRODUCTION      | s4>>INTRODUCTION      |
| s3>>Arcticle   1      | s3>>Arcticle 1        |
| s5>>FIRST   CHAPTER   | s3>>Arcticle 2        |
| s4>>FIRST   PART      | s4>>SECOND PART       |
| s4>>INTRODUCTION      | s4>>REVIEW            |
| s3>>Arcticle   2      | s3>>Arcticle 1        |
| s5>>FIRST   CHAPTER   | s4>>METHODOLOGY       |
| s4>>SECOND   PART     | s3>>Arcticle1         |
| s4>>REVIEW            | s5>>SECOND CHAPTER    |
| s3>>Arcticle   1      | s4>>FIRST PART        |
| s5>>FIRST   CHAPTER   | s4>>INTRODUCTION      |
| s4>>SECOND   PART     | s3>>First section     |
| s4>>METHODOLOGY       | s3>>Second Section    |
| s3>>Arcticle1         |                       |
| s5>>SECOND   CHAPTER  |                       |
| s4>>FIRST   PART      |                       |
| s4>>INTRODUCTION      |                       |
| s3>>First   section   |                       |
| s5>>SECOND   CHAPTER  |                       |
| s4>>FIRST   PART      |                       |
| s4>>INTRODUCTION      |                       |
| s3>>Second   Section  |                       |
Ger Cas
  • Is this supposed to be CSV? What's the actual problem or error you're experiencing? Are you trying to parse some sort of hierarchy (e.g. HTML5 sections and headings) from the data? Please consider posting actual data structures and output, so that your problem statement is executable and testable. – Todd A. Jacobs Apr 15 '20 at 15:33
  • It is neither CSV nor HTML5, just a text column, as shown in my original post, which I've just updated. I've put the input and desired output in text format in my update. Thanks for any help. – Ger Cas Apr 15 '20 at 16:27

1 Answer

I would approach this problem by looking, for each element, at all of its parent elements.

Consider the element called s3>>Arcticle 1 [sic] that is 4th from the top in your diagram. To look for duplicates, it is not enough to look at all other s3-level elements, because some of them have different parents. For example, the s3-level element on line 12 has a different s4-level parent.

But in fact, your code is currently ignoring parents. It calls b.uniq, which only looks at the text of each element, such as "s3>>Arcticle 1". That string carries no information about parent elements: is it the "s3>>Arcticle 1" on line 4 or the one on line 12? The one on line 4 has a parent called "s4>>FIRST PART", while the one on line 12 has a parent called "s4>>SECOND PART".

To see what I'm talking about, stop before you call b.uniq, and print out the full contents of b. You'll see that each element in b has no parent information. The parent information is in another element in b, but there is nothing in b currently to tie elements together with other elements that are their parents.

What needs to be done is to go through each element and check whether an earlier element has the same text and also the same parents at each level. If so, that element is truly a duplicate and can be removed.
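As a rough sketch of that check (one possible approach, not the only one), each line can be identified by itself plus every line above it in the same record. This assumes that every record starts with an s5 line and that, within a record, each line is a child of the line before it. The method name dedupe_by_parents and the variables seen and kept are made up for illustration.

```ruby
require "set"

# Hypothetical helper. Assumes every record starts with an "s5>>" line and
# that each line inside a record is a child of the line directly above it.
def dedupe_by_parents(lines)
  seen = Set.new   # every parents-plus-line path encountered so far
  kept = []
  lines.slice_before { |line| line.start_with?("s5>>") }.each do |record|
    record.each_index do |i|
      path = record[0..i]                   # the line together with its parents
      kept << record[i] if seen.add?(path)  # Set#add? returns nil for a known path
    end
  end
  kept
end

input = <<~DATA
  s5>>FIRST CHAPTER
  s4>>FIRST PART
  s4>>INTRODUCTION
  s3>>Article 1
  s5>>FIRST CHAPTER
  s4>>FIRST PART
  s4>>INTRODUCTION
  s3>>Article 2
  s5>>FIRST CHAPTER
  s4>>SECOND PART
  s4>>REVIEW
  s3>>Article 1
  s5>>FIRST CHAPTER
  s4>>SECOND PART
  s4>>METHODOLOGY
  s3>>Article1
  s5>>SECOND CHAPTER
  s4>>FIRST PART
  s4>>INTRODUCTION
  s3>>First section
  s5>>SECOND CHAPTER
  s4>>FIRST PART
  s4>>INTRODUCTION
  s3>>Second Section
DATA

puts dedupe_by_parents(input.split("\n"))
```

With the data from the question's code, this prints 15 lines in the same order as the expected output: the second "s3>>Article 1" survives because its parents are "s4>>SECOND PART" and "s4>>REVIEW" rather than "s4>>FIRST PART" and "s4>>INTRODUCTION".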

In Ruby, there are many ways to do this. I would suggest starting by considering what data structure you could write in code that fully represents each element and its parents. That way, the data structures can be compared to each other and duplicates removed.

Potential data structures I'd recommend starting with are Classes and Structs. There are of course other ways to approach this, but hopefully that will get things started.
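To make the Struct suggestion a little more concrete, here is a minimal sketch on a shortened version of the data. The Struct name Element, its members text and parents, and the helper to_elements are hypothetical, and the sketch assumes each record starts with an s5 line and that the lines above an element in its record are its parents. Because a Struct compares and hashes by its members, Array#uniq can then drop true duplicates while keeping the first occurrence.

```ruby
# Hypothetical Struct: a line's text plus the lines above it in its record.
Element = Struct.new(:text, :parents)

# Assumes every record starts with an "s5>>" line.
def to_elements(lines)
  lines.slice_before { |line| line.start_with?("s5>>") }.flat_map do |record|
    record.each_with_index.map { |text, i| Element.new(text, record[0...i]) }
  end
end

lines = [
  "s5>>FIRST CHAPTER", "s4>>FIRST PART",  "s3>>Article 1",
  "s5>>FIRST CHAPTER", "s4>>FIRST PART",  "s3>>Article 2",
  "s5>>FIRST CHAPTER", "s4>>SECOND PART", "s3>>Article 1"
]

# Two elements are equal only when both text and parents match.
unique_texts = to_elements(lines).uniq.map(&:text)
puts unique_texts
```

Here the two "s3>>Article 1" elements are both kept, since one has "s4>>FIRST PART" among its parents and the other has "s4>>SECOND PART".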

Aaron Thomas
  • Thanks for your answer and suggestions, Aaron. Getting into classes is a bit advanced for me, hehe. But thanks again. – Ger Cas Apr 18 '20 at 18:49
  • @GerCas glad to help. If you feel this answer addresses the original question appropriately, please mark as accepted. Otherwise I would recommend focusing the original question to a specific problem – Aaron Thomas Apr 20 '20 at 18:19