
I have a pandas dataframe like this (simplified):

import pandas as pd

data = {'old': [['these', 'are', 'old', 'tokens'],
                ['here', 'are', 'some', 'more', 'old']],
        'new': [['and', 'these', 'are', 'new'],
                ['see', 'the', 'difference', 'between', 'them']]}

example_df = pd.DataFrame(data=data).astype(str)

So the dataframe looks like this:

                                               new                                      old
0                   ['and', 'these', 'are', 'new']        ['these', 'are', 'old', 'tokens']
1  ['see', 'the', 'difference', 'between', 'them']   ['here', 'are', 'some', 'more', 'old']

In my real df, there are 968 rows (this becomes relevant below).

I am applying a comparison function (for semantic analysis), again simplified:

def analysis(first_token_list, second_token_list):
    synonymset1 = somefunction(first_token_list)  # specifics don't matter, this works fine
    synonymset2 = somefunction(second_token_list)  # specifics don't matter, this works fine

    best_score_list = []

    for synset in synonymset1:
        similaritylist = [synset.path_similarity(ss) for ss in synonymset2 if synset.path_similarity(ss) is not None]
        if not similaritylist:
            continue
        best_score = max(similaritylist)
        best_score_list.append(best_score)
        print(best_score_list)

    return best_score_list

For added clarity, somefunction (called before the loop) returns a list of WordNet synsets for each token list, like so:

[Synset('old.v.01'), Synset('token.n.01')]
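
For reference, a minimal sketch of what a somefunction along those lines could look like (this is just an assumption on my part, using NLTK's WordNet interface; the real one works fine and its details don't matter here):

from nltk.corpus import wordnet as wn

def somefunction(token_list):
    # Hypothetical stand-in: keep the first WordNet synset of each token
    # that has one, which yields output like the example above.
    synsets = []
    for token in token_list:
        candidates = wn.synsets(token)
        if candidates:
            synsets.append(candidates[0])
    return synsets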

When I call the below,

notnull_df['maxsim_OtN'] = notnull_df.apply(
    lambda row: maxsim.word_similarity(row['old_tokens'], row['new_tokens']),
    axis=1)

I see the lists being generated (something along the lines of [0.25, 0.5, 0.07692307692307693]), but then I get an error about the shape of the passed values:

Traceback (most recent call last):
File "/Users/anon/venv_lda/lib/python3.5/site-packages/pandas/core/internals.py", line 4637, in create_block_manager_from_arrays
    blocks = form_blocks(arrays, names, axes)
File "/Users/anon/venv_lda/lib/python3.5/site-packages/pandas/core/internals.py", line 4701, in form_blocks
    float_blocks = _multi_blockify(float_items)
File "/Users/anon/venv_lda/lib/python3.5/site-packages/pandas/core/internals.py", line 4778, in _multi_blockify
    values, placement = _stack_arrays(list(tup_block), dtype)
File "/Users/anon/venv_lda/lib/python3.5/site-packages/pandas/core/internals.py", line 4823, in _stack_arrays
    stacked[i] = _asarray_compat(arr)
ValueError: could not broadcast input array from shape (6) into shape (5)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "semsim_calculation.py", line 133, in <module>
    notnull_df['maxsim_OtN'] = notnull_df.apply(lambda row: maxsim.word_similarity(row['old_tokens'], row['new_tokens']), axis=1)
File "/Users/anon/venv_lda/lib/python3.5/site-packages/pandas/core/frame.py", line 4877, in apply
    ignore_failures=ignore_failures)
File "/Users/anon/venv_lda/lib/python3.5/site-packages/pandas/core/frame.py", line 4990, in _apply_standard
    result = self._constructor(data=results, index=index)
File "/Users/anon/venv_lda/lib/python3.5/site-packages/pandas/core/frame.py", line 330, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
File "/Users/anon/venv_lda/lib/python3.5/site-packages/pandas/core/frame.py", line 461, in _init_dict
    return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
File "/Users/anon/venv_lda/lib/python3.5/site-packages/pandas/core/frame.py", line 6173, in _arrays_to_mgr
    return create_block_manager_from_arrays(arrays, arr_names, axes)
File "/Users/anon/venv_lda/lib/python3.5/site-packages/pandas/core/internals.py", line 4642, in create_block_manager_from_arrays
    construction_error(len(arrays), arrays[0].shape, axes, e)
File "/Users/anon/venv_lda/lib/python3.5/site-packages/pandas/core/internals.py", line 4608, in construction_error
    passed, implied))
ValueError: Shape of passed values is (968, 5), indices imply (968, 11)
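
For what it's worth, 968 is the row count, and the 5 and 11 presumably relate to the differing lengths of the per-row score lists; it looks as if pandas is trying to stack them into one block. One workaround sketch I could imagine (reusing the column names and maxsim.word_similarity from above, and assuming each call returns a plain Python list) is to build the column outside apply:

# Hypothetical workaround sketch: build the per-row lists with zip() so that
# pandas never tries to align their lengths, then assign the whole column at once.
notnull_df['maxsim_OtN'] = [
    maxsim.word_similarity(old, new)
    for old, new in zip(notnull_df['old_tokens'], notnull_df['new_tokens'])
]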

Can anyone explain why this is happening? The print() actually does show me that the list of values ([0.25, 0.5, 0.07692307692307693]) is being generated, but it's not returning that list. (A similar question was asked but not resolved in this question.)

SHJ9000
    You'll have plenty of problems if you try to store lists rather than scalar values. Is there a reason that you want a list stored in a single column? – roganjosh Feb 06 '18 at 18:22
  • @roganjosh I need to perform a computation that takes into account all the scores per line. I tried doing it with tuples instead but it interestingly returns the same error. Do you have alternative suggestions? – SHJ9000 Feb 06 '18 at 18:27
  • 1
@SHJ9000 `tuple` would be equally as bad as lists in this regard. In general, working with a `pandas.DataFrame` of `object` dtype will cause headaches if the objects are containers like `list`, `dict` and `tuple`. `str` type is handled much better, but generally, you want to stick to `numpy` numeric types. – juanpa.arrivillaga Feb 06 '18 at 18:28
  • @juanpa.arrivillaga I would be fine creating unique columns for each value but the problem is that all the token lists are different length, so there will be different numbers of outputs. Is there a better way for me to be able to perform calculations on column-like structures? – SHJ9000 Feb 06 '18 at 18:32
  • Use a list-of-lists... – juanpa.arrivillaga Feb 06 '18 at 18:49
  • @juanpa.arrivillaga not sure how to accomplish this wrt this data. Can you show me an example? Let's say that I want to create 2 of these lists and multiply their values. – SHJ9000 Feb 06 '18 at 18:54
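
A minimal sketch of the list-of-lists idea from the comments, with made-up values, just to illustrate multiplying two such structures element-wise:

# One inner list per row; the inner lists can have different lengths.
scores_a = [[0.25, 0.5], [0.1, 0.2, 0.3]]
scores_b = [[0.4, 0.6], [0.5, 0.5, 0.5]]

# Multiply the two structures row by row, element by element.
products = [
    [x * y for x, y in zip(row_a, row_b)]
    for row_a, row_b in zip(scores_a, scores_b)
]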

0 Answers