I have the following query (it finds the row with the most similar name and selects it plus the value and rating):
SELECT
DISTINCT FIRST_VALUE(value) over (partition by name order by distance asc) value,
rating,
name
FROM (
SELECT
value,
rating,
LEAST(
UTL_MATCH.EDIT_DISTANCE('pineapple', upper(name)),
UTL_MATCH.EDIT_DISTANCE('yellowapple', upper(name)),
UTL_MATCH.EDIT_DISTANCE('greenapple', upper(name)),
) as apple,
LEAST(
UTL_MATCH.EDIT_DISTANCE('carbage', upper(name)),
UTL_MATCH.EDIT_DISTANCE('cabbagge', upper(name)),
UTL_MATCH.EDIT_DISTANCE('bagge', upper(name)),
) as cabbage,
LEAST(
UTL_MATCH.EDIT_DISTANCE('chakn', upper(name)),
UTL_MATCH.EDIT_DISTANCE('chkn', upper(name)),
UTL_MATCH.EDIT_DISTANCE('chikee', upper(name)),
) as chicken
-- more of the above could follow...
FROM food_table
) UNPIVOT (distance FOR name in (apple as 'apple', cabbage as 'cabbage', chicken as 'chicken'));
It turns my table from this:
category | subcategory | name | value | rating |
---|---|---|---|---|
fruit | red | apple | 4g | 4/10 |
vegetable | green | cabbage | 4g | 3/10 |
meat | white | chicken | 49g | 8/10 |
chemical | black | cocacola | 12g | 10/10 |
chemical | blue | pepsi | 12g | 10/10 |
to this:
name | value | rating |
---|---|---|
apple | 4g | 4/10 |
cabbage | 4g | 3/10 |
chicken | 49g | 8/10 |
The subquery basically returns something like this (before the unpivoting):
value | rating | apple | cabbage | chicken |
---|---|---|---|---|
4g | 4/10 | 3* | 8 | 7 |
4g | 3/10 | 8 | 4* | 6 |
49g | 8/10 | 5 | 9 | 4* |
12g | 10/10 | 7* | 12 | 10 |
12g | 10/10 | 11 | 7 | 6* |
- I match the most similar strings to what I have in the table.
- If there is more than one least per variable given, pick the one with the least distance (for example chicken over pepsi and apple over cocacola).
- I unpivot and partition by order so I can find the most similar and eliminate based on that (remove pepsi and cocacola).
I managed to do the following to only get the subquery, but I need to add the outer query now:
public Stream<Projection> stream(
Stream<Pair<String, Stream<String>>> similarStringGroups,
Class<Projection> projectionClass,
Class<T> entityClass,
String termField
) {
CriteriaBuilder criteriaBuilder = entityManager.getCriteriaBuilder();
CriteriaQuery<Projection> query = criteriaBuilder.createQuery(projectionClass);
Root<T> root = query.from(entityClass);
RecordComponent[] recordComponents = projectionClass.getRecordComponents();
Stream<Expression<Integer>> leastExpressions = similarStringGroups.map(
similarStringGroup ->
criteriaBuilder.function(
"LEAST",
Integer.class,
similarStringGroup.getSecond().map(
similarString -> criteriaBuilder.function(
"UTL_MATCH.EDIT_DISTANCE",
Integer.class,
criteriaBuilder.literal(similarString.toUpperCase()),
criteriaBuilder.upper(root.get(termField))
)
).toArray(Expression[]::new)
)
);
return entityManager.createQuery(criteriaQuery.multiselect(Stream.concat(leastExpressions, Stream.of(root.get(recordComponents[4].getName()), root.get(recordComponents[5].getName()))).toArray(Selection[]::new)))
}
Just to add to the above method that I wrote:
entityClass
is just theFood.class
.termField
isname
in the SQL query.- projectionClass is temporarily set to
public record TemroraryRecord(Integer apple, Integer cabbage, Integer chicken, String value, String rating)
but it will eventually bepublic record NameValueRating(String name, String value, String rating)
for the outer query. I can't have a class for the subquery because the number of result columns is variable. similarStringGroups
is basically the following:
Stream.of(
Pair.of("apple", Stream.of("pineapple", "yellowapple", "greenapple")),
Pair.of("cabbage", Stream.of("carbage", "cabbagge", "bagge")),
Pair.of("chicken", Stream.of("chakn", "chkn", "chikee"))
)