I'm running the stock Apache Tika 1.24.1 Server (tika-server-1.24.1.jar). My ASP.NET MVC web app sends each document to Tika and reads back the parsed text using this VB.NET code:
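' Needs Imports System.IO and Imports System.Net; fileContents is the document as a Byte array
httpWebRequest = CType(WebRequest.Create("http://localhost:9998/tika"), HttpWebRequest)
httpWebRequest.Method = "PUT"
httpWebRequest.Accept = "text/plain"
httpWebRequest.UseDefaultCredentials = True
' Send the raw file bytes as the request body and close the stream when done
Using requestStream As Stream = httpWebRequest.GetRequestStream()
    requestStream.Write(fileContents, 0, fileContents.Length)
End Using
httpWebResponse = CType(httpWebRequest.GetResponse(), HttpWebResponse)
' Read the plain-text extraction result returned by Tika
Using contentResponseStream As New StreamReader(httpWebResponse.GetResponseStream())
    tikaTextContents = contentResponseStream.ReadToEnd()
End Using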
That part works (the parsed text is returned).
However, for certain PDF files the text returned by the Tika server contains extra spaces in some places. This Tika ticket mentions a potential solution (setEnableAutoSpace): https://issues.apache.org/jira/browse/TIKA-724
My question: Is there any way to control setEnableAutoSpace through the Tika server's web interface (or possibly to set it per request when the file is parsed)? Or is modifying the Java code the only way to change this option?
Thanks!