
I'm trying to properly word-wrap a given string of English plain text. Take this example string:

This here is an example of what I'm talking about. Notice how I just talk nonsense on and on for no reason other than to push the 80-character line limit. And this is some more text, etc.

Please note: in the following examples, I have added underscores to visualize what I'm talking about. Naturally, the underscores are not added in reality. They are only here to make it clear what is happening.

If I blindly add a line break after every 80 characters, I get:

This here is an example of what I'm talking about. Notice how I just talk nonsen
se on and on for no reason other than to push the 80-character line limit. And t
his is some more text, etc._____________________________________________________
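For reference, the hard cut above can be reproduced with PHP's built-in str_split(). This is only a minimal sketch; the variable names $text and $lines are illustrative, and the underscores are added with str_pad() purely for display:

```php
<?php
$text = "This here is an example of what I'm talking about. Notice how I just talk nonsense on and on for no reason other than to push the 80-character line limit. And this is some more text, etc.";

// Cut the string into fixed 80-byte chunks, ignoring word boundaries.
$lines = str_split($text, 80);

foreach ($lines as $line) {
    // Pad with underscores only to visualize the unused columns.
    echo str_pad($line, 80, '_'), "\n";
}
```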

If I use the built-in wordwrap() function with 80 chars, I get:

This here is an example of what I'm talking about. Notice how I just talk_______
nonsense on and on for no reason other than to push the 80-character line limit.
And this is some more text, etc.________________________________________________
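The output above corresponds to a plain wordwrap() call with a width of 80. A minimal sketch, where $text, $wrapped and $lines are illustrative names and the underscores are again added with str_pad() for display only:

```php
<?php
$text = "This here is an example of what I'm talking about. Notice how I just talk nonsense on and on for no reason other than to push the 80-character line limit. And this is some more text, etc.";

// wordwrap() breaks only at spaces, so "nonsense" moves to the next line whole.
$wrapped = wordwrap($text, 80);
$lines = explode("\n", $wrapped);

foreach ($lines as $line) {
    // Pad with underscores only to visualize the unused columns.
    echo str_pad($line, 80, '_'), "\n";
}
```

wordwrap() never splits a word by default, which is why the first line ends seven columns early.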

Neither of those looks good or resembles a proper book or magazine, where (depending on the publication's age) either software or humans have typeset the text beautifully, like this:

This here is an example of what I'm talking about. Notice how I just talk nonse-
nse on and on for no reason other than to push the 80-character line limit. And_
this is some more text, etc.____________________________________________________

Notice how "nonsense" has been neither hard-cut nor fully dropped onto the next line. Instead, it uses the full line minus one character for the dash, and continues on the next line. (As for the "And" at the end of the second line, it does have a whitespace after it, but only because there is only one character left on that line.)
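To make the desired behavior concrete, here is a naive sketch that reproduces the three lines above. The function naiveHyphenWrap() and its $minPart parameter are hypothetical names invented for this illustration, not part of PHP or any library. It splits an overflowing word purely by position, so it is not linguistically aware: a real hyphenator would use a dictionary or hyphenation patterns and would break the word as "non-sense" rather than "nonse-nse".

```php
<?php
// Hypothetical helper: greedy wrap that splits an overflowing word with a hyphen.
// ASSUMPTION: the split point is positional only, not a valid hyphenation point.
// Uses strlen(), so it counts bytes; multibyte text would need mb_* functions.
function naiveHyphenWrap(string $text, int $width = 80, int $minPart = 3): array
{
    $lines = [];
    $line = '';
    foreach (explode(' ', $text) as $word) {
        $sep = ($line === '') ? '' : ' ';
        if (strlen($line . $sep . $word) <= $width) {
            $line .= $sep . $word;
            continue;
        }
        // Columns left for the word's head after reserving the separator and the hyphen.
        $room = $width - strlen($line) - strlen($sep) - 1;
        if ($room >= $minPart && strlen($word) - $room >= $minPart) {
            // Both fragments are long enough: split the word across the two lines.
            $lines[] = $line . $sep . substr($word, 0, $room) . '-';
            $line = substr($word, $room);
        } else {
            // Otherwise move the whole word to the next line.
            if ($line !== '') {
                $lines[] = $line;
            }
            $line = $word;
        }
    }
    if ($line !== '') {
        $lines[] = $line;
    }
    return $lines;
}

$text = "This here is an example of what I'm talking about. Notice how I just talk nonsense on and on for no reason other than to push the 80-character line limit. And this is some more text, etc.";

foreach (naiveHyphenWrap($text, 80) as $line) {
    // Pad with underscores only to visualize the unused columns.
    echo str_pad($line, 80, '_'), "\n";
}
```

Words longer than a whole line are not handled properly by this sketch; it is only meant to show where the hyphen logic would sit in a greedy wrapper.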

The rules for doing this kind of "intelligent wordwrapping" are language-specific, locale-specific (I think) and very complex. As such, it would be madness for me to attempt to code in all the rules manually.

I strongly suspect that there is some kind of mature, popular PHP library for doing precisely this, and I further suspect that it supports all kinds of languages/locales. However, I have been unable to find it myself.

It is not a requirement that it has to support "all kinds of languages/locales", but it would be nice. English with either US or UK locale would be sufficient for me to be happy at the moment.

I hope that I've been crystal-clear about what I'm asking!

  • That sounds like a nice project to start. I don't think it exists in PHP yet. You would probably be interested in the TeX line-breaking algorithm: http://defoe.sourceforge.net/folio/knuth-plass.html Someone apparently tried, or perhaps succeeded, in implementing it in JavaScript: https://github.com/bramstein/typeset. – peekolo Dec 01 '19 at 04:49
  • There is some nice word-wrap code here. It is perhaps not what you want (with hyphens at appropriate places and so on), but it might be useful to you. https://stackoverflow.com/questions/9071205/balanced-word-wrap-minimum-raggedness-in-php – peekolo Dec 01 '19 at 04:51
  • @peekolo I mean, if it doesn't use hyphens, what does it have to do with this? Then it's just blindly wordwrapping like the built-in PHP function? –  Dec 01 '19 at 06:03
  • The built-in PHP function uses a greedy word-wrap algorithm, which can look like "blindly word wrapping". The examples I gave are more complex or intelligent methods of word wrapping; one of them, in the Stack Overflow link I shared, is the minimum raggedness algorithm. As for the TeX Knuth-Plass algorithm, I believe it does add hyphens. I haven't had time to read through the whole paper yet. If it is important to you, perhaps you can read it first: http://www.eprg.org/G53DOC/pdfs/knuth-plass-breaking.pdf – peekolo Dec 01 '19 at 13:25
  • However, as you pointed out, this is a highly complex task and depends on the language. Even LaTeX, which is based on TeX and its Knuth-Plass algorithm, has been reported to hyphenate some words inappropriately. https://www.tug.org/TUGboat/tb33-1/tb103hyf.pdf – peekolo Dec 01 '19 at 13:28
  • So, to respond fully to your comment, what I wrote has everything to do with this. Firstly, I have answered that I know of no PHP library for your request, unless there is a small private library I am unaware of. Secondly, you didn't specify your preferred algorithm, so I pointed you to sources on better, more intelligent word-wrapping algorithms that could help you achieve what you need. – peekolo Dec 01 '19 at 13:32
  • Anyway, I think you have correctly pointed this out as well. I now recall a friend who interned at a magazine publisher, where her job was to check alignment and positioning. I believe magazines use a combination of algorithms (probably TeX) and final HUMAN intervention to fix wrong hyphenations etc. – peekolo Dec 01 '19 at 13:38
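For readers following the comment thread, the "minimum raggedness" idea mentioned above can be sketched with dynamic programming. This is not the code from the linked question and not Knuth-Plass; minRaggedWrap() is a hypothetical name, and the cost function (squared unused columns per line, last line free) is one common choice assumed here for illustration. It does not hyphenate.

```php
<?php
// Hypothetical sketch of minimum-raggedness wrapping (no hyphenation).
// ASSUMPTION: line cost = (unused columns)^2, and the last line costs nothing.
function minRaggedWrap(string $text, int $width = 80): array
{
    $words = preg_split('/\s+/', trim($text));
    $n = count($words);
    // $best[$i]: minimal total cost of laying out words $i..$n-1.
    // $next[$i]: index of the first word on the line after the one starting at $i.
    $best = array_fill(0, $n + 1, 0);
    $next = array_fill(0, $n + 1, $n);
    for ($i = $n - 1; $i >= 0; $i--) {
        $best[$i] = PHP_INT_MAX;
        $len = -1; // length of words $i..$j joined by single spaces
        for ($j = $i; $j < $n; $j++) {
            $len += 1 + strlen($words[$j]);
            if ($len > $width && $j > $i) {
                break; // line is full (a lone over-long word is still accepted)
            }
            $slack = max(0, $width - $len);
            $cost = ($j === $n - 1) ? 0 : $slack * $slack + $best[$j + 1];
            if ($cost < $best[$i]) {
                $best[$i] = $cost;
                $next[$i] = $j + 1;
            }
        }
    }
    // Follow the recorded break points to rebuild the lines.
    $lines = [];
    for ($i = 0; $i < $n; $i = $next[$i]) {
        $lines[] = implode(' ', array_slice($words, $i, $next[$i] - $i));
    }
    return $lines;
}

$text = "This here is an example of what I'm talking about. Notice how I just talk nonsense on and on for no reason other than to push the 80-character line limit. And this is some more text, etc.";

echo implode("\n", minRaggedWrap($text, 80)), "\n";
```

On longer paragraphs this kind of algorithm tends to even out line lengths compared with a greedy wrap, but without hyphenation it still cannot split "nonsense" the way the question asks.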

0 Answers