I have two files (lhs and rhs), each with a grouping-variable field (other variables exist too).
The goal is to iterate, as quickly as possible, over the Cartesian product of lhs and rhs records within each grouping-variable value. Grouping-variable values that occur only in lhs or only in rhs are to be ignored.
In other words, I only want the pairs whose lhs and rhs grouping-variable values are equal. For illustration purposes, the processing of each pair is a trivial comparison of word lengths.
In reality, the pair processing involves a series of function calls. The creation of the grouping-variable value has also been simplified; in reality, it is built dynamically from external file parameters.
My example is further simplified by using English dictionary words. I do not want to sort the data, because the real data is large. My current code, shown below, produces the correct result.
My current strategy for quick access is to build the indices manually with dictionaries. Is there a faster way?
This is my current (simplified) code example, which functions correctly:
lhs_data = [{'lhs_id':'001', 'word':'clean'},
            {'lhs_id':'002', 'word':'stop'},
            {'lhs_id':'003', 'word':'those'}]
rhs_data = [{'rhs_id':'001', 'word':'quack'},
            {'rhs_id':'002', 'word':'step'},
            {'rhs_id':'003', 'word':'stir'},
            {'rhs_id':'004', 'word':'state'},
            {'rhs_id':'005', 'word':'storm'},
            {'rhs_id':'006', 'word':'thin'},
            {'rhs_id':'008', 'word':'thaw'},
            {'rhs_id':'009', 'word':'thorn'},
            {'rhs_id':'010', 'word':'thumb'},
            {'rhs_id':'011', 'word':'true'},
            {'rhs_id':'012', 'word':'trade'}]

# Create grouping-variable values and indices.
# The grouping-variable values are simply
# the first two characters of the word.
lhs_allgroups = {}
rhs_allgroups = {}
for recidx in range(len(lhs_data)):
    lhs_group = lhs_data[recidx]['word'][0:2]
    if lhs_group in lhs_allgroups:
        lhs_allgroups[lhs_group].append(recidx)
    else:
        lhs_allgroups[lhs_group] = [recidx]
for recidx in range(len(rhs_data)):
    rhs_group = rhs_data[recidx]['word'][0:2]
    # Only index rhs records whose group also exists in lhs.
    if rhs_group in lhs_allgroups:
        if rhs_group in rhs_allgroups:
            rhs_allgroups[rhs_group].append(recidx)
        else:
            rhs_allgroups[rhs_group] = [recidx]

# Perform the comparison using the indices
for lhs_group in lhs_allgroups:
    if lhs_group in rhs_allgroups:
        for lhs_idx in lhs_allgroups[lhs_group]:
            for rhs_idx in rhs_allgroups[lhs_group]:
                if len(lhs_data[lhs_idx]['word']) == len(rhs_data[rhs_idx]['word']):
                    print(lhs_data[lhs_idx]['lhs_id'], lhs_data[lhs_idx]['word'],
                          rhs_data[rhs_idx]['rhs_id'], rhs_data[rhs_idx]['word'])
This is the code output, which is correct:
002 stop 002 step
002 stop 003 stir
003 those 009 thorn
003 those 010 thumb
In reality, lhs and rhs can each contain millions of items, so speed is the top priority.
Is there a faster strategy or algorithm for this?
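One variation worth trying is sketched below. It is an untested-for-speed idea, not a benchmarked answer: index only the lhs side, store the records themselves instead of list positions, and stream the rhs records past that index. The names `group_key` and `matched_pairs` are hypothetical, and `group_key` merely stands in for the real, dynamically built grouping value. The sample data is a shortened version of the data above.

```python
from collections import defaultdict


def group_key(record):
    # Hypothetical key function: first two characters of the word,
    # standing in for the real, dynamically built grouping value.
    return record['word'][:2]


def matched_pairs(lhs_data, rhs_data, key=group_key):
    """Yield (lhs_record, rhs_record) for every pair sharing a group key."""
    # Index only the lhs side, storing the records themselves.
    lhs_groups = defaultdict(list)
    for rec in lhs_data:
        lhs_groups[key(rec)].append(rec)
    # Stream the rhs side; .get() skips keys that lhs never produced
    # without inserting empty lists into the defaultdict.
    for rhs_rec in rhs_data:
        for lhs_rec in lhs_groups.get(key(rhs_rec), ()):
            yield lhs_rec, rhs_rec


lhs_data = [{'lhs_id': '002', 'word': 'stop'},
            {'lhs_id': '003', 'word': 'those'}]
rhs_data = [{'rhs_id': '001', 'word': 'quack'},
            {'rhs_id': '002', 'word': 'step'},
            {'rhs_id': '004', 'word': 'state'},
            {'rhs_id': '009', 'word': 'thorn'}]

# The trivial length comparison from the question, applied per pair.
results = [(l['lhs_id'], l['word'], r['rhs_id'], r['word'])
           for l, r in matched_pairs(lhs_data, rhs_data)
           if len(l['word']) == len(r['word'])]
for row in results:
    print(*row)
```

Note that pairs come out in rhs order rather than grouped by lhs key, so this only fits if the pair processing does not depend on the iteration order. Because rhs is never indexed, it could in principle be read lazily from a file instead of being held in a list.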
The following NumPy-based solution also produces the correct result. However, it seems to be quite a bit slower.
import numpy as np

# Note: 'S' (bytes) dtypes print as b'...' under Python 3;
# 'U3' / 'U10' would give text output instead.
lhs_data = np.array([('001', 'clean'),
                     ('002', 'stop'),
                     ('003', 'those')],
                    dtype=[('lhs_id', 'S3'), ('word', 'S10')])
rhs_data = np.array([('001', 'quack'),
                     ('002', 'step'),
                     ('003', 'stir'),
                     ('004', 'state'),
                     ('005', 'storm'),
                     ('006', 'thin'),
                     ('008', 'thaw'),
                     ('009', 'thorn'),
                     ('010', 'thumb'),
                     ('011', 'true'),
                     ('012', 'trade')],
                    dtype=[('rhs_id', 'S3'), ('word', 'S10')])

# Create grouping-variable values and indices.
# The grouping-variable values are simply the
# first two characters of the word.
lhs_grouping_var = np.array([row['word'][:2] for row in lhs_data])
rhs_grouping_var = np.array([row['word'][:2] for row in rhs_data])

# Get the unique values in both lhs and rhs grouping variables
lhs_unique = np.unique(lhs_grouping_var)
rhs_unique = np.unique(rhs_grouping_var)

# Get common values between lhs and rhs grouping variables
common = np.intersect1d(lhs_unique, rhs_unique)

# Get the indices for each common value in lhs and rhs grouping variables
for group in common:
    lhs_idx = np.where(lhs_grouping_var == group)
    rhs_idx = np.where(rhs_grouping_var == group)
    for l in lhs_idx[0]:
        for r in rhs_idx[0]:
            if len(lhs_data[l]['word']) == len(rhs_data[r]['word']):
                print(lhs_data[l]['lhs_id'], lhs_data[l]['word'],
                      rhs_data[r]['rhs_id'], rhs_data[r]['word'])
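One plausible contributor to the slowdown is that each `np.where(... == group)` call compares the whole grouping array against a single group, so the loop does work proportional to (number of common groups) × (array length), whereas the dictionary version visits each record once. The pure-Python sketch below only illustrates that difference by counting visited elements; it is not a benchmark, and the function names and the `visits` counter are hypothetical.

```python
def index_by_rescanning(keys, wanted):
    """Build group -> positions with one full scan per group,
    mimicking a np.where(keys == group) call inside a loop."""
    visits = 0
    index = {}
    for group in wanted:
        hits = []
        for pos, k in enumerate(keys):
            visits += 1  # every element is inspected for every group
            if k == group:
                hits.append(pos)
        index[group] = hits
    return index, visits


def index_in_one_pass(keys, wanted):
    """Build the same index while visiting each element exactly once."""
    visits = 0
    index = {group: [] for group in wanted}
    for pos, k in enumerate(keys):
        visits += 1
        if k in index:  # hash lookup instead of a scan per group
            index[k].append(pos)
    return index, visits


# Grouping values of the rhs sample words from the question.
rhs_keys = ['qu', 'st', 'st', 'st', 'st', 'th', 'th', 'th', 'th', 'tr', 'tr']
common = ['st', 'th']

scan_index, scan_visits = index_by_rescanning(rhs_keys, common)
pass_index, pass_visits = index_in_one_pass(rhs_keys, common)
print(scan_index == pass_index, scan_visits, pass_visits)
```

Both functions return the same index, but the rescanning variant inspects every key once per group. With two common groups the gap is small; with thousands of distinct groups over millions of records it grows accordingly.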