I'm attempting to use the Clojure pantomime library to extract/OCR text from a large number of tif documents (among others).
My plan has been to use pmap to apply the mapping over a sequence of input data (from a postgres database) and then update that same postgres database with the tika/tesseract OCR output. This has been working OK; however, I notice in htop that many of the cores are idle at times.
Is there any way to reconcile this, and what steps can I take to determine why this may be blocking somewhere? Each task processes a single tif file, and the threads are entirely independent of one another.
Additional info:
- some tika/tesseract processes take 3 seconds, others take up to 90 seconds. Generally speaking, tika is heavily CPU bound. I have ample memory available according to htop.
- postgres has no locking issues in session management, so I don't think that's holding me up.
- maybe futures are waiting to deref somewhere? How can I tell where?
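One way to approach the "how to tell where?" part is to look at what the JVM threads are doing while the batch runs. The following is only a rough sketch and not part of the code in question: thread-state-counts and future-thread-states are hypothetical helper names, and the sketch assumes that future (and therefore pmap) runs its work on threads whose names start with clojure-agent-send-off-pool.

```clojure
(defn thread-state-counts
  "Returns a map from thread state name (RUNNABLE, WAITING, ...) to the
  number of live JVM threads currently in that state."
  []
  (frequencies
   (map (fn [^Thread t] (str (.getState t)))
        (keys (Thread/getAllStackTraces)))))

(defn future-thread-states
  "Returns a sorted map of thread name -> state, restricted to the pool
  threads that futures run on (assumed name prefix, see above)."
  []
  (->> (keys (Thread/getAllStackTraces))
       (filter (fn [^Thread t]
                 (.startsWith (.getName t) "clojure-agent-send-off-pool")))
       (map (fn [^Thread t] [(.getName t) (str (.getState t))]))
       (into (sorted-map))))

;; Example: print a snapshot from the REPL while the batch is running.
(println (thread-state-counts))
(println (future-thread-states))
```

Calling these repeatedly from a REPL while htop shows idle cores should indicate how many pool threads are actually RUNNABLE at that moment. The states alone do not say why a thread is parked; a full stack trace per thread can be obtained with jstack <pid> from the JDK.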
Any tips appreciated, thanks. Code added below.
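;; Aliases are presumably: pm = pantomime.mime, extract = pantomime.extract,
;; pl = pantomime.languages
(defn parse-a-path [{:keys [row_id file_path]}]
  ;; no catch clause here, so any exception propagates to the caller
  (try
    (let [start        (System/currentTimeMillis)
          mime_type    (pm/mime-type-of file_path)
          file_content (-> file_path (extract/parse) :text)
          language     (pl/detect-language file_content)]
      {:mime_type mime_type
       :file_content file_content
       :language language
       :row_id row_id
       ;; elapsed milliseconds divided by 1000 gives seconds
       :parse_time_in_seconds (float (/ (- (System/currentTimeMillis) start) 1000))
       :record_status "doc parsed"})))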
(defn fetch-all-batch []
  (t/info (str "Fetching lazy seq. all rows for batch."))
  (jdbc/query (db-connection)
              ["select
                  row_id,
                  file_path,
                  file_extension
                from the_table"]))
(defn update-a-row [{:keys [row_id file_path file_extension] :as all-keys}]
  ;; parsing happens outside the try, so parse errors reach the caller
  (let [parse-out (parse-a-path all-keys)]
    (try
      ;; write the parse results back to the row that was just processed
      (jdbc/execute!
        (db-connection)
        ["update the_table
          set
            record_last_updated = current_timestamp,
            file_content = ?,
            mime_type = ?,
            language = ?,
            parse_time_in_seconds = ?,
            record_status = ?
          where row_id = ?"
         (:file_content parse-out)
         (:mime_type parse-out)
         (:language parse-out)
         (:parse_time_in_seconds parse-out)
         (:record_status parse-out)
         row_id])
      (t/debug (str "updated row_id " (:row_id parse-out) " (" file_extension ") "
                    " in " (:parse_time_in_seconds parse-out) " seconds."))
      ;; any exception from the update or the logging call is silently discarded
      (catch Exception _))))
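(dorun
  (pmap
    #(try
       (update-a-row %)
       ;; .getNextException is only defined on java.sql.SQLException
       (catch Exception e (t/error (.getNextException e))))
    ;; call the function so pmap receives the rows, not the function itself
    (fetch-all-batch)))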
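
A minimal sketch for reproducing the symptom without tika or postgres, in case it helps narrow things down. pmap is documented as semi-lazy, and its implementation only starts roughly (+ 2 number-of-cores) futures ahead of the element currently being dereferenced, so one slow document at the head of that window may hold back everything behind it. Everything here is a stand-in: fake-ocr is a hypothetical function that just sleeps, the 300 ms / 1 ms durations are made up, and a list is used as input because the look-ahead behaves differently for chunked sequences such as vectors.

```clojure
(defn fake-ocr
  "Hypothetical stand-in for the tika/tesseract call: sleeps for `ms`
  milliseconds and returns the wall-clock time at which it started."
  [ms]
  (let [started (System/currentTimeMillis)]
    (Thread/sleep (long ms))
    started))

;; Size of pmap's look-ahead as computed in clojure.core/pmap.
(def window (+ 2 (.availableProcessors (Runtime/getRuntime))))

;; One slow "document" first, then more fast ones than fit in the window.
(def durations (apply list 300 (repeat (+ window 10) 1)))

;; Start time of every task, in input order.
(def start-times (doall (pmap fake-ocr durations)))

;; How long after the first task the last task was allowed to start.
(def last-start-delay (- (last start-times) (first start-times)))

(println "last task started" last-start-delay "ms after the first")
```

If the last task only starts after roughly the duration of the slow one, the cores were idle in between even though plenty of short tasks were waiting. A fixed-size thread pool fed from a queue does not have this ordering constraint, which could be one direction to compare against.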