
I have a table `posts` which contains a LONGTEXT column. My issue is that I want to retrieve parts of a specific post (basically paging).

I use the following query:

SELECT SUBSTRING(post_content,1000,1000) FROM posts WHERE id=x

This mostly works, but the problem is the position and the length: most of the time, the first word and the last word are not complete, which makes sense.

How can I retrieve complete words from position x for length y?

Nikola K.
  • the solution is to find the position of the nth space and the zth space and get the substring between them. How can we do that? – user1690753 Sep 22 '12 at 11:21
  • by the way, my LONGTEXT is HTML formatted, so there should be `<p>` or `<br/>` tags in it – user1690753 Sep 22 '12 at 11:23
  • and how do you plan to break up that HTML? – rationalboss Sep 22 '12 at 12:20
  • it is doable in Oracle: INSTR(str, pattern, [starting position, [nth occurrence]]) – user1690753 Sep 22 '12 at 12:54
  • well, it is also doable in MySQL using INSTR. but for HTML, you need to find the closing tags. some tags don't even have closing tags, like `<br/>` - what would your query do if it encounters one? or dozens of `<br/>` inside a `<p>` near the 1000th character? – rationalboss Sep 22 '12 at 12:57
  • the pattern will be one of a handful of HTML tags. any one of these, if reached, would be the end of this page, of course after at least 1000 characters. So the query should be something like this: "give me at least 1000 characters starting from position x until you reach one of those tags". That's a hell of a query ;) before I forget, the position x is wherever the last query ended !!!! it's so complicated :( – user1690753 Sep 22 '12 at 15:19

1 Answer


Presumably you're doing this to save on network traffic between the MySQL server and the machine your application runs on. As it happens, you're not saving any other sort of workload on the MySQL server: it still has to fetch the entire LONGTEXT item from disk, then run it through SUBSTRING.

Presumably you've already decided based on solid performance analysis that you must save this network traffic. You might want to revisit this analysis now that you know it doesn't save much MySQL server workload. Your savings will be marginal, unless you have zillions of very long LONGTEXT items and lots of traffic to retrieve and display parts of them.

In other words, this is an optimization task. YAGNI? http://en.wikipedia.org/wiki/YAGNI

If you do need it, you're going to have to write software that processes the LONGTEXT item word by word. Your best bet is to do this in your client software. Start by retrieving the first page plus a kilobyte or two of the article. Then parse the text looking for complete words. Once you find the last complete word in the first page and its trailing whitespace, that character position is the starting place for the next page.

This kind of task is a huge pain in the neck in a MySQL stored procedure. Plus, when you do it in a stored procedure you're going to use processing cycles on a shared and hard-to-scale-up resource (the MySQL server machine) rather than on a cloneable client machine.

I know I didn't give you clean code to just do what you ask. But it's not obviously a good idea to do what you're suggesting.

Edit:

An observation: A gigabyte of server RAM costs roughly USD20. A caching system like memcached does a good job of exploiting USD100 worth of memory efficiently. That's plenty for the use case you have described.
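For illustration, here's a minimal sketch of that caching idea, assuming the PHP Memcached extension and a local memcached server. The key scheme, the 10-minute TTL, and the fetchSegment() helper (sketched at the end of this answer) are illustrative choices, not requirements:

```php
<?php
// Minimal sketch: cache fetched segments so repeated views of the same
// page skip MySQL entirely. $pdo, $postId, $start and fetchSegment()
// are assumed names for illustration.
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

$key  = "post:{$postId}:seg:{$start}";   // one cache entry per segment
$page = $mc->get($key);
if ($page === false) {
    // Cache miss: fetch the segment from MySQL, then keep it warm
    // for the next reader of the same page.
    $page = fetchSegment($pdo, $postId, $start, 1000);
    $mc->set($key, $page, 600);           // cache for 10 minutes
}
echo $page;
```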

Another observation: many companies who serve large-scale documents use file systems rather than DBMSs to store them. File systems can be shared or replicated very easily among content servers, and files can be random-accessed trivially without any overhead.

It's a bit innovative to store whole books in single BLOBs or CLOBs. If you can break up the books by some kind of segment -- page? chapter? thousand-word chunk? -- and create separate data rows for each segment, your DBMS will scale up MUCH MUCH better than what you have described.
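As a rough sketch of that segmented design, assuming a hypothetical post_segments table (post_id, seq, body) and a thousand-words-per-segment split, neither of which is prescribed by anything above:

```php
<?php
// Minimal sketch: split a document into ~1000-word segments and store
// one row per segment. The post_segments table and the chunk size are
// assumptions for illustration.
function storeSegments(PDO $pdo, int $postId, string $content): void
{
    // Split on whitespace so every segment starts and ends on a word.
    $words = preg_split('/\s+/', $content, -1, PREG_SPLIT_NO_EMPTY);
    $stmt  = $pdo->prepare(
        'INSERT INTO post_segments (post_id, seq, body) VALUES (?, ?, ?)'
    );
    foreach (array_chunk($words, 1000) as $seq => $chunk) {
        $stmt->execute([$postId, $seq, implode(' ', $chunk)]);
    }
}
```

Serving page n is then a cheap indexed lookup (SELECT body FROM post_segments WHERE post_id = ? AND seq = ?) instead of a SUBSTRING over the whole LONGTEXT.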

If you're going to do it anyway, here's what you do:

  1. always retrieve 100 characters more than you need in each segment. For example, when you need characters 30000 - 35000, retrieve 30000 - 35100.

  2. after you retrieve the segment, look for the first word break in the data (except on the very first segment) and display starting from that word.

  3. similarly, find the very first word break in the 100 extra bytes, and display up to that word break.

So your fetched data might be 30000 - 35100 and your displayed data might be 30013 - 35048, but it would be whole words (see the sketch below).
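Here's a minimal PHP sketch of those three steps, assuming a PDO connection and the posts table from the question; fetchSegment() and its parameters are illustrative names:

```php
<?php
// Minimal sketch of the over-fetch-and-trim approach described above.
function fetchSegment(PDO $pdo, int $postId, int $start, int $length): string
{
    // Step 1: over-fetch by 100 characters (MySQL SUBSTRING is 1-based).
    $stmt = $pdo->prepare(
        'SELECT SUBSTRING(post_content, ?, ?) FROM posts WHERE id = ?'
    );
    $stmt->execute([$start, $length + 100, $postId]);
    $chunk = (string) $stmt->fetchColumn();

    // Step 2: on every segment except the first, skip the partial word
    // at the front by starting just after the first whitespace.
    $from = 0;
    if ($start > 1 && ($break = strpos($chunk, ' ')) !== false) {
        $from = $break + 1;
    }

    // Step 3: cut at the first word break inside the 100 extra bytes;
    // if there is none (end of document), keep everything we fetched.
    $end = $length < strlen($chunk) ? strpos($chunk, ' ', $length) : false;
    if ($end === false) {
        $end = strlen($chunk);
    }

    return substr($chunk, $from, $end - $from);
}
```

So a call like fetchSegment($pdo, $id, 30000, 5000) fetches roughly characters 30000 - 35100 and displays something like 30013 - 35048, as above. Note the sketch is byte-oriented; multi-byte (UTF-8) text would need the mb_* string functions instead.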

O. Jones
  • Thanks for your response. The thing is, my LONGTEXT contains at least a 100-page document. If I retrieve all of it for every user, first, it's not efficient; second, it's not convenient for the user to scroll 50 pages to reach page 51. So the best choice is to give the user what he needs, not the whole thing. – user1690753 Sep 22 '12 at 12:50
  • I can do it via PHP: each time, divide the document into parts and give them to the user. But again it's not efficient; on every call for any page, I retrieve the whole thing from the table. – user1690753 Sep 22 '12 at 12:52
  • then why not store it somewhere like a session? – rationalboss Sep 22 '12 at 15:56
  • I thought about it, it's not efficient either. Imagine 2000 users each having 2 documents open; that's 4000 documents. If each doc is 200 KB, that's 800 MB in memory. That's huge. – user1690753 Sep 22 '12 at 16:09
  • see my edits. this whole-book LONGTEXT thing is a bizarre design, in my opinion. – O. Jones Sep 22 '12 at 22:44