2

I have a large list of words:

my_list = ['[tag]', 'there', 'are', 'many', 'words', 'here', '[/tag]', '[tag]', 'some', 'more', 'here', '[/tag]', '[tag]', 'and', 'more', '[/tag]']

I would like to be able to count the number of elements in between (and including) the [tag] elements across the whole list. The goal is to be able to see the frequency distribution.

Can I use range() to start and stop on a string match?

Bhargav Rao
archienorman
  • >>> from collections import Counter >>> z = ['blue', 'red', 'blue', 'yellow', 'blue', 'red'] >>> Counter(z) Counter({'blue': 3, 'red': 2, 'yellow': 1}) – Ami Patel Oct 28 '15 at 13:12
  • @ami, that is not counting elements between two values. That is counting the number of times an element appears in the entire list. – Andy Oct 28 '15 at 13:18
  • I am looking to calculate the total number of items between [tag] and [/tag] (inclusive), not just how many times one string appears in the list. – archienorman Oct 28 '15 at 13:19
  • Your example doesn't include any entries that would not be counted. – Kenny Ostrom Oct 28 '15 at 13:19
  • range is a builtin which returns a list of numbers. If you already know the list index of all the tags, then you can use range to generate the indexes of the list items inside the tags. But then you wouldn't need them, since you would already know everything you need for this question, without looking at the list. – Kenny Ostrom Oct 28 '15 at 13:24

6 Answers

5

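First, find the indices of every [tag]; the difference between adjacent indices is the number of elements in each group. Appending the list length as a final index lets the last group be counted too (this assumes every element lies inside a tag pair).

my_list = ['[tag]', 'there', 'are', 'many', 'words', 'here', '[/tag]', '[tag]', 'some', 'more', 'here', '[/tag]', '[tag]', 'and', 'more', '[/tag]']
indices = [i for i, x in enumerate(my_list) if x == "[tag]"]
# sentinel: the end of the list closes the last group
indices.append(len(my_list))
nums = []
for i in range(1, len(indices)):
    nums.append(indices[i] - indices[i-1])
# nums is now [7, 5, 4]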

With NumPy, the indices can also be found with a vectorized comparison, which may be faster for large lists:

import numpy as np
values = np.array(my_list)
searchval = '[tag]'
ii = np.where(values == searchval)[0]
print(ii)

Another way to get the differences between adjacent indices is to zip the list with a copy of itself shifted by one (on Python 2, itertools.izip does the same thing lazily):

diffs = [y - x for x, y in zip(indices, indices[1:])]
Hooting
1

You can use .index(value, [start, [stop]]) to search through the list.

my_list = ['[tag]', 'there', 'are', 'many', 'words', 'here', '[/tag]', '[tag]', 'some', 'more', 'here', '[/tag]', '[tag]', 'and', 'more', '[/tag]']
my_list.index('[tag]')   # returns 0, as it occurs at the zeroth element
my_list.index('[/tag]')  # returns 6

That gives you the length of the first group. On each following iteration, remember the index of the last closing tag and use that index plus 1 as the start point:

my_list.index('[tag]', 7)    # returns 7
my_list.index('[/tag]', 7)   # returns 11

Do that in a loop until you have reached the last closing tag in the list. Remember that .index raises a ValueError if the value is not present, so you will need to handle that exception when it occurs.
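
The loop described above might look like the following sketch. The function name group_lengths and its parameters are hypothetical (they are not part of this answer), and it assumes every '[tag]' is eventually followed by a matching '[/tag]'.

```python
def group_lengths(words, open_tag='[tag]', close_tag='[/tag]'):
    """Return the inclusive length of each open_tag..close_tag group."""
    lengths = []
    pos = 0  # where the next search starts
    while True:
        try:
            start = words.index(open_tag, pos)
            stop = words.index(close_tag, start + 1)
        except ValueError:
            # No further opening (or closing) tag was found: stop searching.
            break
        lengths.append(stop - start + 1)  # +1 counts both tags
        pos = stop + 1  # resume just past the last closing tag
    return lengths


my_list = ['[tag]', 'there', 'are', 'many', 'words', 'here', '[/tag]',
           '[tag]', 'some', 'more', 'here', '[/tag]',
           '[tag]', 'and', 'more', '[/tag]']
print(group_lengths(my_list))  # [7, 5, 4]
```

Catching the ValueError doubles as the loop's exit condition, so no separate check for the end of the list is needed.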

Christian Witts
0

I would go with the following since the OP wants to count the actual values. (No doubt he has figured out how to do that by now.)

starts = [k for k, w in enumerate(my_list) if w == '[tag]']
stops = [k for k, w in enumerate(my_list) if w == '[/tag]']
for start, stop in zip(starts, stops):
    print(stop - start + 1)  # +1 so that both tags are included in the count
ajsp
0

This should allow you to find the number of words between and including your tags:

MY_LIST = ['[tag]', 'there', 'are', 'many', 'words', 'here', '[/tag]', '[tag]',
           'some', 'more', 'here', '[/tag]', '[tag]', 'and', 'more', '[/tag]']


def main():
    ranges = find_ranges(MY_LIST, '[tag]', '[/tag]')
    for index, pair in enumerate(ranges, 1):
        print('Range {}: Start = {}, Stop = {}'.format(index, *pair))
        start, stop = pair
        print('         Size of Range =', stop - start + 1)


def find_ranges(iterable, start, stop):
    range_start = None
    for index, value in enumerate(iterable):
        if value == start:
            if range_start is None:
                range_start = index
            else:
                raise ValueError('a start was duplicated before a stop')
        elif value == stop:
            if range_start is None:
                raise ValueError('a stop was seen before a start')
            else:
                yield range_start, index
                range_start = None

if __name__ == '__main__':
    main()

This example will print out the following text so you can see how it works:

Range 1: Start = 0, Stop = 6
         Size of Range = 7
Range 2: Start = 7, Stop = 11
         Size of Range = 5
Range 3: Start = 12, Stop = 15
         Size of Range = 4
Noctis Skytower
0

Borrowing and slightly modifying the generator code from the selected answer to this question:

my_list = ['[tag]', 'there', 'are', 'many', 'words', 'here', '[/tag]', '[tag]', 'some', 'more', 'here', '[/tag]', '[tag]', 'and', 'more', '[/tag]']

def group(seq, sep):
    g = []
    for el in seq:
        g.append(el)
        if el == sep:
            yield g
            g = []

counts = [len(x) for x in group(my_list,'[/tag]')]

I changed the generator given in that answer so that it does not return the empty list at the end and so that it includes the separator in the current list instead of putting it in the next one. Note that this assumes there will always be a matching '[tag]' / '[/tag]' pair in that order, and that every element in the list lies between a pair.

After running this, counts will be [7,5,4]
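
Because the question's goal is a frequency distribution, the group sizes can then be tallied with collections.Counter. This is only a minimal sketch: it starts from the counts list shown above, hard-coded here so that it runs on its own, and the name distribution is just illustrative.

```python
from collections import Counter

counts = [7, 5, 4]  # group sizes from the example above

# Map each group size to the number of groups that have that size.
distribution = Counter(counts)

for size, freq in sorted(distribution.items()):
    print(size, freq)
```

With only three groups of different sizes every frequency is 1; on a larger list the repeated sizes accumulate.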

Community
Tofystedeth
0

Solution using list comprehension and string manipulation.

my_list = ['[tag]', 'there', 'are', 'many', 'words', 'here', '[/tag]', '[tag]', 'some', 'more', 'here', '[/tag]', '[tag]', 'and', 'more', '[/tag]']

# string together your list
my_str = ','.join(my_list)

# split the giant string by tag, dropping the empty first piece and stray commas;
# gives you a list of comma-separated strings
my_tags = [t.strip(',') for t in my_str.split('[tag]') if t.strip(',')]

# split for each word in each tag string
my_words = [w.split(',') for w in my_tags]

# count up each list to get a list of counts for each tag, adding one since the first split removed [tag]
my_cnt = [1+len(w) for w in my_words]  # [7, 5, 4]

Or do it in one line:

# all as one list comprehension starting with just the string
[1+len(t.strip(',').split(',')) for t in my_str.split('[tag]') if t.strip(',')]
postelrich