2

I'm trying to create a simple recommendation engine using Neo4j and Reco4PHP.

The data model consists of the following nodes and relationships:

(User)-[:HAS_BOUGHT]->(Product {category_id: int} )-[:DESIGNED_BY]->(Designer)

In this system I want to recommend products and boost those that share a designer with products the user has already bought. To create the recommendations I use one discovery class and one post-processor class that boosts the products (see below). This works, but it is very slow: it takes more than 5 seconds to complete, even though the dataset holds only ~1000 products and ~100 designers.

// Discovery class
<?php
namespace App\Reco4PHP\Discovery;
use GraphAware\Common\Cypher\Statement;
use GraphAware\Common\Type\NodeInterface;
use GraphAware\Reco4PHP\Engine\SingleDiscoveryEngine;

class InCategory extends SingleDiscoveryEngine {

    protected $categoryId;

    public function __construct($categoryId) {
        $this->categoryId = $categoryId;
    }

    /**
     * @return string The name of the discovery engine
     */
    public function name() {
        return 'in_category';
    }

    /**
     * The statement to be executed for finding items to be recommended
     *
     * @param \GraphAware\Common\Type\NodeInterface $input
     * @return \GraphAware\Common\Cypher\Statement
     */
    public function discoveryQuery(NodeInterface $input) {

        $query = "
            MATCH (reco:Card)
            WHERE reco.category_id = {category_id}
            RETURN reco, 1 as score
        ";

        return Statement::create($query, ['category_id' => $this->categoryId]);
    }
}

// Boost shared designers
class RewardSharedDesigners extends RecommendationSetPostProcessor {

    public function buildQuery(NodeInterface $input, Recommendations $recommendations)
    {
        $ids = [];
        foreach ($recommendations->getItems() as $recommendation) {
            $ids[] = $recommendation->item()->identity();
        }

        $query = 'UNWIND {ids} as id
        MATCH (reco) WHERE id(reco) = id
        MATCH (user:User) WHERE id(user) = {userId}
        MATCH (user)-[:HAS_BOUGHT]->(product:Product)-[:DESIGNED_BY]->()<-[:DESIGNED_BY]-(reco)

        RETURN id, count(product) as sharedDesignedBy';

        return Statement::create($query, ['ids' => $ids, 'userId' => $input->identity()]);
    }

    public function postProcess(Node $input, Recommendation $recommendation, Record $record) {
        $recommendation->addScore($this->name(), new SingleScore((int)$record->get('sharedDesignedBy')));
    }

    public function name() {
        return 'reward_shared_designers';
    }
}

I'm happy that it works, but if it takes more than 5 seconds to compute, it is not usable in a production environment.

To improve the speed I have:

  • Created indexes on Product:id and Designer:id.
  • Added node_auto_indexing=true to neo4j.properties.
  • Added -Xmx4096m to .neo4j-community.vmoptions.

But none of this really makes a difference.

Is it normal that these Cypher queries take more than 5 seconds, or are there improvements possible? :)

user1255553
  • Hi, I'm the author of Reco4PHP. Very cool, I never expected to see a question about it on StackOverflow. You didn't post the code of the DiscoveryEngine, can you paste it please? With your dataset this should run in a couple of tens of milliseconds. – Christophe Willemsen May 06 '16 at 16:32
  • @user1255553 Check this repository and my answer: https://github.com/ikwattro/reco4php-example-so – Christophe Willemsen May 06 '16 at 19:59

2 Answers

2

The main problem is with your post-processor query. The goal is:

Boost the recommendation based on the number of products I bought from the designer who designed the recommended item.

Therefore, you can modify your query slightly to match the designer directly and aggregate on it. It is also best to find the user before the UNWIND; otherwise the user is matched again on every iteration over the product ids:

MATCH (user) WHERE id(user) = {userId}
UNWIND {ids} as productId
MATCH (product:Product)-[:DESIGNED_BY]->(designer)
WHERE id(product) = productId
WITH productId, designer, user
MATCH (user)-[:HAS_BOUGHT]->(p)-[:DESIGNED_BY]->(designer)
RETURN productId as id, count(*) as score

The complete post processor would look like this :

    public function buildQuery(NodeInterface $input, Recommendations $recommendations)
    {
        $ids = [];
        foreach ($recommendations->getItems() as $recommendation) {
            $ids[] = $recommendation->item()->identity();
        }

        $query = 'MATCH (user) WHERE id(user) = {userId}
        UNWIND {ids} as productId
        MATCH (product:Product)-[:DESIGNED_BY]->(designer)
        WHERE id(product) = productId
        WITH productId, designer, user
        MATCH (user)-[:HAS_BOUGHT]->(p)-[:DESIGNED_BY]->(designer)
        RETURN productId as id, count(*) as score';

        return Statement::create($query, ['userId' => $input->identity(), 'ids' => $ids]);
    }

    public function postProcess(Node $input, Recommendation $recommendation, Record $record)
    {
        $recommendation->addScore($this->name(), new SingleScore($record->get('score')));
    }

I have created a repository where I have a fully functional implementation following your domain :

https://github.com/ikwattro/reco4php-example-so

Update after receiving the data

You have multiple relationships of the same type between a user and a product. Each duplicate relationship multiplies the number of patterns the query finds.
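
To see how many duplicates are involved, a query along these lines could list them. This is a rough sketch that assumes the User label and HAS_BOUGHT relationship type from the question; the LIMIT is arbitrary.

```cypher
// Hypothetical check: user/product pairs linked by more than one HAS_BOUGHT relationship
MATCH (u:User)-[r:HAS_BOUGHT]->(p)
WITH u, p, count(r) AS purchases
WHERE purchases > 1
RETURN id(u) AS userId, id(p) AS productId, purchases
ORDER BY purchases DESC
LIMIT 25
```

Any pair returned here contributes several matches to a pattern that traverses HAS_BOUGHT.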

There are two solutions :

Deduplicate them by moving the end of the pattern into a WHERE clause:

MATCH (user) WHERE id(user) = {userId}
UNWIND {ids} as cardId
MATCH (reco:Card)-[:DESIGNED_BY]->(designer) WHERE id(reco) = cardId
MATCH (user)-[:HAS_BOUGHT]->(x)
WHERE (x)-[:DESIGNED_BY]->(designer)
RETURN cardId as id, count(*) as sharedDesignedBy
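
If each bought product should count only once, however many HAS_BOUGHT relationships point to it, the aggregation can be made distinct. This is a sketch under that assumption about the desired scoring, not necessarily what the example repository does:

```cypher
MATCH (user) WHERE id(user) = {userId}
UNWIND {ids} AS cardId
MATCH (reco:Card)-[:DESIGNED_BY]->(designer) WHERE id(reco) = cardId
MATCH (user)-[:HAS_BOUGHT]->(x)
WHERE (x)-[:DESIGNED_BY]->(designer)
// DISTINCT collapses repeated HAS_BOUGHT relationships to the same product
RETURN cardId AS id, count(DISTINCT x) AS sharedDesignedBy
```

With count(*) a product bought three times adds 3 to the score; with count(DISTINCT x) it adds 1.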

In Neo4j 3.0+, you can use the USING JOIN hint and keep the same query as you had:

MATCH (user) WHERE user.id = 245
UNWIND {ids} as id
MATCH (reco:Card) WHERE id(reco) = id
MATCH (user:User)-[:HAS_BOUGHT]->(card:Card)-[:DESIGNED_BY]->(designer:Designer)<-[:DESIGNED_BY]-(reco:Card)
USING JOIN ON card
RETURN id, count(card) as sharedDesignedBy

Running those queries, I brought the time for discovery + post-processing down to 190ms with your current dataset.

Christophe Willemsen
  • Thank you! I have tried your query and it is a bit faster, but not much unfortunately. Even if I execute the query manually via the Neo4j browser it takes a few seconds. Probably the query is just too heavy. – user1255553 May 07 '16 at 10:21
  • It runs in under a second on my laptop with 1000 users and 5000 products with a degree of 100-200 purchases per user, so basically it finds 5000 products. Can you maybe share your dataset and your repo with me: christophe at graphaware dot com – Christophe Willemsen May 07 '16 at 12:55
0

I can only comment on the Cypher, and even then not much, since you didn't include the getItems() function or the data (a Cypher dump). But a few things stand out:

  1. It will be faster to use a label on (reco). I assume it is Product?
  2. I also assume a Designer label can be put on the anonymous node in -[:DESIGNED_BY]->()<-[:DESIGNED_BY]-?
  3. If by any chance getItems() retrieves items one by one, that might be the problem, and also where indexes are needed. By the way, why not put that condition in the main query?
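
Points 1 and 2 applied to the post-processor query from the question might look like the following. It is only a sketch, assuming the Product and Designer labels from the question's data model:

```cypher
UNWIND {ids} AS id
// Label on reco (point 1)
MATCH (reco:Product) WHERE id(reco) = id
MATCH (user:User) WHERE id(user) = {userId}
// Label on the previously anonymous designer node (point 2)
MATCH (user)-[:HAS_BOUGHT]->(product:Product)-[:DESIGNED_BY]->(:Designer)<-[:DESIGNED_BY]-(reco)
RETURN id, count(product) AS sharedDesignedBy
```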

I also don't understand the indexes on id. If they are Neo4j ids, they are physical locations and don't need to be indexed; if they are not, why do you use the id() function?
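
If an index is meant to speed up a property lookup, such as the category_id filter in the discovery query, it would go on that property rather than on the internal id. Here is a minimal sketch, assuming the pre-4.0 index syntax that goes with the {param} style used here and the Product label from the question's model (the posted discovery query matches :Card, so the label may need adjusting):

```cypher
// Assumption: recommendations are filtered by category_id on Product nodes
CREATE INDEX ON :Product(category_id)
```

With only ~1000 products the gain from such an index is probably small, so it is unlikely to explain the 5 seconds on its own.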

In conclusion, labels might help, but don't expect miracles if your dataset is large: aggregations are not super fast in Neo4j. Counting 10M records with no filters took me 12 seconds.

Dmitriy
  • 1. Yes, reco is a Product. 2. Yes, since the data model looks like (Product)-[:DESIGNED_BY]->(Designer). 3. The code is based on the Reco4PHP template, where the result of the discovery class is passed to the post-processor(s). I don't think this should be changed. – user1255553 May 06 '16 at 14:16