
I wrote an application (Spring Boot + Data JPA + Data REST) that keeps throwing an OutOfMemoryError at me when the application loads. I can skip the code that runs on application start, but then the exception may just happen later down the road. It's probably best to show you what happens on application start, because it is actually super simple and should not cause any problems imho:

@SpringBootApplication
@EnableAsync
@EnableJpaAuditing
public class ScraperApplication {
    public static void main(String[] args) {
        SpringApplication.run(ScraperApplication.class, args);
    }
}

@Component
@RequiredArgsConstructor(onConstructor = @__(@Autowired))
public class DefaultDataLoader {
    private final @NonNull LuceneService luceneService;

    @Transactional
    @EventListener(ApplicationReadyEvent.class)
    public void load() {
        luceneService.reindexData();
    }
}


@Service
@RequiredArgsConstructor(onConstructor = @__(@Autowired))
public class LuceneService {

    private static final Log LOG = LogFactory.getLog(LuceneService.class);

    private final @NonNull TrainingRepo trainingRepo;

    private final @NonNull EntityManager entityManager;

    public void reindexData() {
        LOG.info("Reindexing triggered");

        FullTextEntityManager fullTextEntityManager = Search.getFullTextEntityManager(entityManager);
        fullTextEntityManager.purgeAll(Training.class);

        LOG.info("Index purged");

        int page = 0;
        int size = 100;
        boolean morePages = true;
        Page<Training> pageData;

        while (morePages) {
            pageData = trainingRepo.findAll(PageRequest.of(page, size));
            LOG.info("Loading page " + (page + 1) + "/" + pageData.getTotalPages());
            pageData.getContent().stream().forEach(t -> fullTextEntityManager.index(t));
            fullTextEntityManager.flushToIndexes(); // flush regularly to keep memory footprint low
            morePages = pageData.getTotalPages() > ++page;
        }

        fullTextEntityManager.flushToIndexes();
        LOG.info("Index flushed");
    }

}

As you can see, what I am doing is clearing out the index, reading all Trainings from the TrainingRepo in a paged way (100 at a time), and writing them into the index. Not much going on, actually. A few minutes after the "Index purged" message I get this - and only this:

java.lang.OutOfMemoryError: Java heap space

In the logs I see "Index purged" but never any "Loading page ..." message, so it must be stuck on the findAll() call.

I had the JVM write a heap dump and loaded it into Eclipse Memory Analyzer and got a full stack trace: https://gist.github.com/mathias-ewald/2fddb9762427374bb04d332bd0b6b499

I also looked around the report a bit, but I need help interpreting this information, which is why I attached some screenshots from Eclipse Memory Analyzer.

(Screenshots: Leak Suspects, Histogram Data, Dominator Tree)

EDIT:

I just enabled "show-sql" and saw this before everything hung:

Hibernate: select training0_.id as id1_9_, training0_.created_date as created_2_9_, training0_.description as descript3_9_, training0_.duration_days as duration4_9_, training0_.execution_id as executi14_9_, training0_.level as level5_9_, training0_.modified_date as modified6_9_, training0_.name as name7_9_, training0_.price as price8_9_, training0_.product as product9_9_, training0_.quality as quality10_9_, training0_.raw as raw11_9_, training0_.url as url12_9_, training0_.vendor as vendor13_9_ from training training0_ where  not (exists (select 1 from training training1_ where training0_.url=training1_.url and training0_.created_date<training1_.created_date)) limit ?
Hibernate: select execution0_.id as id1_1_0_, execution0_.created_date as created_2_1_0_, execution0_.duration_millis as duration3_1_0_, execution0_.message as message4_1_0_, execution0_.modified_date as modified5_1_0_, execution0_.scraper as scraper6_1_0_, execution0_.stats_id as stats_id8_1_0_, execution0_.status as status7_1_0_, properties1_.execution_id as executio1_2_1_, properties1_.properties as properti2_2_1_, properties1_.properties_key as properti3_1_, stats2_.id as id1_5_2_, stats2_.avg_quality as avg_qual2_5_2_, stats2_.max_quality as max_qual3_5_2_, stats2_.min_quality as min_qual4_5_2_, stats2_.null_products as null_pro5_5_2_, stats2_.null_vendors as null_ven6_5_2_, stats2_.products as products7_5_2_, stats2_.tags as tags8_5_2_, stats2_.trainings as training9_5_2_, stats2_.vendors as vendors10_5_2_, producthis3_.stats_id as stats_id1_6_3_, producthis3_.product_histogram as product_2_6_3_, producthis3_.product_histogram_key as product_3_3_, taghistogr4_.stats_id as stats_id1_7_4_, taghistogr4_.tag_histogram as tag_hist2_7_4_, taghistogr4_.tag_histogram_key as tag_hist3_4_, vendorhist5_.stats_id as stats_id1_8_5_, vendorhist5_.vendor_histogram as vendor_h2_8_5_, vendorhist5_.vendor_histogram_key as vendor_h3_5_ from execution execution0_ left outer join execution_properties properties1_ on execution0_.id=properties1_.execution_id left outer join stats stats2_ on execution0_.stats_id=stats2_.id left outer join stats_product_histogram producthis3_ on stats2_.id=producthis3_.stats_id left outer join stats_tag_histogram taghistogr4_ on stats2_.id=taghistogr4_.stats_id left outer join stats_vendor_histogram vendorhist5_ on stats2_.id=vendorhist5_.stats_id where execution0_.id=?

Apparently, it creates the statement to fetch all the Training entities, but the Execution statement is the last one it manages to execute.

I changed the relation from Training to Execution from @ManyToOne to @ManyToOne(fetch = FetchType.LAZY) and suddenly the code was able to load data into the index again.
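In other words, the relevant field in my Training entity now looks like this (rest of the entity omitted; @ManyToOne is EAGER by default, so before this change every Training pulled in its Execution eagerly):

@ManyToOne(fetch = FetchType.LAZY)
private Execution execution;

So I am thinking something might be wrong with my Execution entity mapping. Let me share the code with you: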

@Entity
@Data
@EntityListeners(AuditingEntityListener.class)
public class Execution {

    public enum Status { SCHEDULED, RUNNING, SUCCESS, FAILURE };

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    @ToString.Include
    private Long id;

    @Column(updatable = false)
    private String scraper;

    @CreatedDate
    private LocalDateTime createdDate;

    @LastModifiedDate
    private LocalDateTime modifiedDate;

    @Min(0)
    @JsonProperty(access = Access.READ_ONLY)
    private Long durationMillis;

    @ElementCollection(fetch = FetchType.EAGER)
    private Map<String, String> properties;

    @NotNull
    @Enumerated(EnumType.STRING)
    private Status status;

    @Column(length = 9999999)
    private String message;

    @EqualsAndHashCode.Exclude
    @OneToOne(cascade = CascadeType.ALL)
    private Stats stats;

}

And since it is a relation of Execution, here's the Stats entity, too:

@Entity
@Data
public class Stats {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    @ToString.Include
    private Long id;

    private Long trainings;

    private Long vendors;

    private Long products;

    private Long tags;

    private Long nullVendors;

    private Long nullProducts;

    private Double minQuality;

    private Double avgQuality;

    private Double maxQuality;

    @ElementCollection(fetch = FetchType.EAGER)
    private Map<String, Long> vendorHistogram;

    @ElementCollection(fetch = FetchType.EAGER)
    private Map<String, Long> productHistogram;

    @ElementCollection(fetch = FetchType.EAGER)
    private Map<String, Long> tagHistogram;

}
– user3235738

2 Answers


All of this runs in a single transaction, and I don't see a clear() anywhere, so the EntityManager that loaded all this data still references it.

To fix this, inject the EntityManager and invoke clear() after processing each page. Or alternatively, make the scope of the transaction the processing of one page.
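With the loop from your question, that could look like this (untested sketch; the EntityManager is already injected in your service):

while (morePages) {
    pageData = trainingRepo.findAll(PageRequest.of(page, size));
    pageData.getContent().forEach(fullTextEntityManager::index);
    fullTextEntityManager.flushToIndexes();
    entityManager.clear(); // detach the processed entities so they become eligible for GC
    morePages = pageData.getTotalPages() > ++page;
}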

I recommend the TransactionTemplate for this.
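Roughly like this (again an untested sketch, assuming you inject a TransactionTemplate built from your PlatformTransactionManager and remove the @Transactional from load() so the transactions don't nest):

private final @NonNull TransactionTemplate transactionTemplate;

public void reindexData() {
    FullTextEntityManager fullTextEntityManager = Search.getFullTextEntityManager(entityManager);
    // purge the index as before, then:
    int page = 0;
    int size = 100;
    boolean morePages = true;
    while (morePages) {
        int current = page++;
        // each page is loaded, indexed and cleared in its own short transaction
        morePages = transactionTemplate.execute(status -> {
            Page<Training> pageData = trainingRepo.findAll(PageRequest.of(current, size));
            pageData.getContent().forEach(fullTextEntityManager::index);
            fullTextEntityManager.flushToIndexes();
            entityManager.clear(); // release the entities loaded in this transaction
            return pageData.getTotalPages() > current + 1;
        });
    }
}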

I'm not familiar with the FullTextEntityManager but it might have similar problems.

For more background you might want to read up on the JPA entity lifecycle.

– Jens Schauder
  • Thanks for your comment. I always thought that by using the JPA repository abstractions, the handling of transactions was beyond what I have to worry about. Would it help to move the processing of a page to a separate method and annotate it with @Transactional? – user3235738 Apr 25 '20 at 09:05
  • You'd have to remove the `@Transactional` that you have right now, since you can't have stacked transactions and would still end up with the one big one. – Jens Schauder Apr 25 '20 at 09:30
  • Spring makes transaction handling easy but you still have to decide where you want/need it. Spring Data JPA offers easy ways to formulate queries. It does nothing to isolate you from the inherent complexities of JPA. – Jens Schauder Apr 25 '20 at 09:32

I believe it has to do with your FullTextEntityManager not finding enough memory. You have to configure your query plan cache. Go through this thread on Stack Overflow on how to do that, and this one too.
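In a Spring Boot application, those settings can be passed through to Hibernate as properties, for example (the values are just a starting point, tune them for your workload):

spring.jpa.properties.hibernate.query.plan_cache_max_size=2048
spring.jpa.properties.hibernate.query.plan_parameter_metadata_max_size=128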

– tksilicon
  • What makes you think it's about the FullTextEntityManager? The two object types with more than 12 million instances in the heap are from the com.mysql package, and the code hangs while loading data via JPA (trainingRepo.findAll()) before the FullTextEntityManager even comes into play. Playing with plan_cache_max_size and plan_parameter_metadata_max_size didn't change anything :( – user3235738 Apr 25 '20 at 08:19