
I have a list of lists of unicode strings.

Each string list represents a different document with the strings representing the authors' names. Some documents have only one author while other documents can have multiple co-authors.

For example, a sample of authorship of three documents looks like this:

authors = [[u'Smith, J.', u'Williams, K.', u'Daniels, W.'], [u'Smith, J.'], [u'Williams, K.', u'Daniels, W.']]

I want to convert my list into a dictionary and a list.

First, a dictionary that provides an integer key for each name:

author_name = {0: u'Smith, J.', 1: u'Williams, K.', 2: u'Daniels, W.'}

Second, a list that identifies the authors for each document by the integer key:

doc_author = [[0, 1, 2], [0], [1, 2]]

What is the most efficient way to create these?

FYI: I need my author data in this format to run a pre-built author-topic LDA algorithm written in Python.

Rhymenoceros
  • Do you already have the `author_name` dictionary, or are you also going to create it? – Remi Guan Jun 06 '16 at 14:24
  • No. I need to create it. Any suggestions? – Rhymenoceros Jun 06 '16 at 14:27
  • I'm not sure that I understand your question correctly: Do you mean convert the longest list in `authors` to a dictionary? If so, try `author_name = dict(enumerate(max(authors, key=len)))`. – Remi Guan Jun 06 '16 at 14:30
  • Not necessarily. If you assume there's an additional document with a new author, then that method breaks down. For example, assume `authors = [[u'Smith, J.', u'Williams, K.', u'Daniels, W.'], [u'Smith, J.'], [u'Williams, K.', u'Daniels, W.'], [u'Johnson, A']]`; then `author_name = dict(enumerate(max(authors, key=len)))` doesn't capture the new author, `u'Johnson, A'`. – Rhymenoceros Jun 06 '16 at 14:52
  • Huh, run `import itertools; author_name = []; for name in itertools.chain(*authors): if name not in author_name: author_name.append(name)`, then `author_name = dict(enumerate(author_name))`. – Remi Guan Jun 06 '16 at 15:06
  • Well done Kevin! It worked on my larger dataset. Thanks for your help!! – Rhymenoceros Jun 06 '16 at 15:10

3 Answers


You need to invert your author_name dictionary; after that the conversion of your list is trivial, using a nested list comprehension:

author_to_id = {name: id for id, name in author_name.items()}

doc_author = [[author_to_id[name] for name in doc] for doc in authors]

Demo:

>>> authors = [[u'Smith, J.', u'Williams, K.', u'Daniels, W.'], [u'Smith, J.'], [u'Williams, K.', u'Daniels, W.']]
>>> author_name = {0: u'Smith, J.', 1: u'Williams, K.', 2: u'Daniels, W.'}
>>> author_to_id = {name: id for id, name in author_name.items()}
>>> [[author_to_id[name] for name in doc] for doc in authors]
[[0, 1, 2], [0], [1, 2]]
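
If `author_name` does not exist yet (as the comments on the question indicate), both structures can be built in a single pass over `authors`. The following is a minimal sketch rather than part of the original answer: the helper name `build_author_index` is hypothetical, and it assumes ids should be assigned in order of first appearance.

```python
def build_author_index(authors):
    """Return (author_name, doc_author) for a list of per-document author lists."""
    author_to_id = {}  # name -> integer id, assigned on first appearance
    doc_author = []
    for doc in authors:
        # setdefault returns the existing id, or stores and returns the next free one
        doc_author.append(
            [author_to_id.setdefault(name, len(author_to_id)) for name in doc]
        )
    # invert the mapping to get id -> name
    author_name = {idx: name for name, idx in author_to_id.items()}
    return author_name, doc_author


authors = [[u'Smith, J.', u'Williams, K.', u'Daniels, W.'],
           [u'Smith, J.'],
           [u'Williams, K.', u'Daniels, W.']]
author_name, doc_author = build_author_index(authors)
print(author_name)
print(doc_author)
```

Each name is looked up in a dictionary, so the cost stays roughly linear in the total number of author entries, unlike repeated `name not in some_list` checks.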
Martijn Pieters
lst = ['person', 'bicycle', 'car', 'motorbike', 'bus', 'truck']
dct = {}

for key, val in enumerate(lst):
    dct[key] = val

print(dct)


***output***
{0: 'person', 1: 'bicycle', 2: 'car', 3: 'motorbike', 4: 'bus', 5: 'truck'}
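
The loop above pairs each element with its position and stores the pair in a dictionary. As a brief sketch of the same idea, `dict(enumerate(...))` builds the identical mapping in one expression (the name `dct_short` is only illustrative):

```python
lst = ['person', 'bicycle', 'car', 'motorbike', 'bus', 'truck']

# explicit loop, as in the answer
dct = {}
for key, val in enumerate(lst):
    dct[key] = val

# enumerate yields (index, value) pairs, which dict() accepts directly
dct_short = dict(enumerate(lst))

print(dct == dct_short)
```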
Bharath Kumar
  • This answer was reviewed in the [Low Quality Queue](https://stackoverflow.com/help/review-low-quality). Here are some guidelines for [How do I write a good answer?](https://stackoverflow.com/help/how-to-answer). Code only answers are **not considered good answers**, and are likely to be downvoted and/or deleted because they are **less useful** to a community of learners. It's only obvious to you. Explain what it does, and how it's different / **better** than existing answers. [From Review](https://stackoverflow.com/review/low-quality-posts/32337735) – Trenton McKinney Jul 26 '22 at 17:52
### list of lists
authors = [[u'Smith, J.', u'Williams, K.', u'Daniels, W.'], [u'Smith, J.'], [u'Williams, K.', u'Daniels, W.']]


### flatten into a single list
flat_list = [x for xs in authors for x in xs]
# print(flat_list)

### remove duplicates
res = [*set(flat_list)]
# print(res)

### create dict
dct = {}
for key, val in enumerate(res):
    dct[key] = val

print(dct)


**output**

{0: 'Daniels, W.', 1: 'Williams, K.', 2: 'Smith, J.'}
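
Because a set is unordered, the keys above do not follow the order in which authors first appear (here `Smith, J.` ended up with key 2), and the assignment is not guaranteed to be the same on every run. A small sketch of an order-preserving variant, assuming Python 3.7+ where dictionaries keep insertion order, uses `dict.fromkeys` for de-duplication; `unique_names` is an illustrative name:

```python
from itertools import chain

authors = [['Smith, J.', 'Williams, K.', 'Daniels, W.'],
           ['Smith, J.'],
           ['Williams, K.', 'Daniels, W.']]

# dict.fromkeys drops duplicates while keeping first-seen order
unique_names = list(dict.fromkeys(chain.from_iterable(authors)))

# number the unique names in that order
author_name = dict(enumerate(unique_names))
print(author_name)
```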
Bharath Kumar