0

While extracting content from a PDF using the MuPDF library, I am getting only the font name, not its font face.

Do I have to guess (e.g. by looking for "bold" in the font name, though that is not the right way), or is there another way to detect whether a specific font is bold, italic, or plain?

Orbling
Tech Enthusiast
  • How are you extracting the information? – Orbling May 22 '13 at 17:34
  • Using the MuPDF open source library. – Tech Enthusiast May 22 '13 at 17:39
  • The font carries a number of flags, in addition to its name, which may or may not tell you something more about the font's attributes. These are not terribly reliable. – KenS May 23 '13 at 07:20
  • @OnceUponATimeInTheWest I think you are right. But are there any font parsers available in Java that I could use? – Tech Enthusiast May 23 '13 at 16:52
  • I think I'm right too. Don't know why I got voted down for it ;-) Anyhow, I would encourage you to use the information in the name if at all possible. If you actually need to get the font information from the font itself, I think you should be able to throw together some quick code to scan any TrueType font for it. First look for the OS/2 tag and then grab these bits of information: http://www.microsoft.com/typography/otspec/os2.htm – OnceUponATimeInTheWest Jun 20 '13 at 12:02
  • 1
    Finally I got it resolved by loading the font through its FontDescriptor and then reading its properties... Thanks to all. – Tech Enthusiast Jun 30 '13 at 10:12

2 Answers

1

The PDF spec contains entries which allow you to specify the style of a font. Unfortunately, in the real world you will often find that these are absent.
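To make that concrete: the entries in question sit in the font descriptor, namely Flags (Italic is bit 7, ForceBold is bit 19), the optional FontWeight (700 is bold) and ItalicAngle. Here is a minimal Java sketch of interpreting them, assuming you have already pulled those values out of the descriptor yourself. The class and method names are hypothetical and not part of MuPDF or any other library. Note also that ForceBold is strictly a rendering hint, so treating it as "bold" is itself a heuristic.

```java
/** Hypothetical helper: interprets style hints from PDF FontDescriptor values. */
final class FontDescriptorStyle {
    // The PDF spec numbers flag bits from 1, so bit 7 is 1 << 6 and bit 19 is 1 << 18.
    static final int FLAG_ITALIC = 1 << 6;
    static final int FLAG_FORCE_BOLD = 1 << 18;

    /** Italic if the Italic flag is set or the ItalicAngle entry is non-zero. */
    static boolean looksItalic(int flags, double italicAngle) {
        return (flags & FLAG_ITALIC) != 0 || italicAngle != 0.0;
    }

    /** Bold if FontWeight (null when absent) is at least 700 or ForceBold is set. */
    static boolean looksBold(int flags, Integer fontWeight) {
        boolean heavy = fontWeight != null && fontWeight >= 700;
        return heavy || (flags & FLAG_FORCE_BOLD) != 0;
    }

    /** Combines both checks into one of "Plain", "Bold", "Italic", "BoldItalic". */
    static String describe(int flags, Integer fontWeight, double italicAngle) {
        boolean bold = looksBold(flags, fontWeight);
        boolean italic = looksItalic(flags, italicAngle);
        if (bold && italic) return "BoldItalic";
        if (bold) return "Bold";
        if (italic) return "Italic";
        return "Plain";
    }

    public static void main(String[] args) {
        // Example values only: Nonsymbolic (32) plus Italic (64), weight 700, angle -12.
        System.out.println(describe(32 | 64, 700, -12.0));
    }
}
```

If the descriptor is missing or all of these values are at their defaults, the sketch reports "Plain", which in practice means "unknown" rather than a confirmed regular face.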

If the font is referenced rather than embedded, this generally means you are stuck with the PostScript name for the font. It requires some heuristics, but normally the name provides sufficient clues as to the style. It sounds like this is pretty much where you are.
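A rough sketch of such a name heuristic in Java is below. It assumes the usual conventions: a six-letter subset tag such as "ABCDEF+" in front of subsetted fonts, and a style suffix after the last hyphen or comma. The keyword lists and all names in the code are my own illustrative choices, so expect to tune them for the fonts you actually encounter (abbreviated suffixes like "-It" are not handled, for example).

```java
import java.util.Locale;

/** Hypothetical helper: guesses bold/italic from a PostScript font name. */
final class FontNameStyle {
    /** Drops a subset tag of six uppercase letters plus '+', if one is present. */
    static String stripSubsetPrefix(String name) {
        if (name.length() > 7 && name.charAt(6) == '+') {
            for (int i = 0; i < 6; i++) {
                char c = name.charAt(i);
                if (c < 'A' || c > 'Z') return name;
            }
            return name.substring(7);
        }
        return name;
    }

    /** Returns the lowercased text after the last '-' or ',', or the whole name. */
    static String styleSuffix(String postScriptName) {
        String n = stripSubsetPrefix(postScriptName);
        int sep = Math.max(n.lastIndexOf('-'), n.lastIndexOf(','));
        String suffix = sep >= 0 ? n.substring(sep + 1) : n;
        return suffix.toLowerCase(Locale.ROOT);
    }

    static boolean isBold(String postScriptName) {
        String s = styleSuffix(postScriptName);
        return s.contains("bold") || s.contains("black") || s.contains("heavy");
    }

    static boolean isItalic(String postScriptName) {
        String s = styleSuffix(postScriptName);
        return s.contains("italic") || s.contains("oblique");
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "ABCDEF+Arial-BoldMT";
        System.out.println(name + " bold=" + isBold(name) + " italic=" + isItalic(name));
    }
}
```

Looking only at the suffix when a separator exists avoids some false hits on family names, at the cost of missing styles spelled out in the family part.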

If the font is embedded, you can parse it and try to find style information in the embedded font program. If it is subsetted then in theory this information might be removed, but in general I don't think it will be. However, parsing TrueType/OpenType fonts is boring and you may not feel that it is worth it.

I work on the ABCpdf .NET software component so my replies may feature concepts based around ABCpdf. It's just what I know. :-)
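If you do go down that route, here is a minimal Java sketch of the TrueType/OpenType case. It assumes you already have the decoded font program as a byte array and that it is a plain sfnt file (not a TrueType collection and not a bare CFF or Type 1 program). It walks the table directory, finds the OS/2 table, and reads usWeightClass (offset 4) and fsSelection (offset 62, where bit 0 is italic and bit 5 is bold). The class name is hypothetical, and the method returns null when there is no usable OS/2 table, which can happen with subsetted fonts.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical helper: reads style bits from the OS/2 table of an sfnt font. */
final class Os2Style {
    final int weightClass;   // usWeightClass, e.g. 400 regular, 700 bold
    final boolean bold;      // fsSelection bit 5
    final boolean italic;    // fsSelection bit 0

    Os2Style(int weightClass, boolean bold, boolean italic) {
        this.weightClass = weightClass;
        this.bold = bold;
        this.italic = italic;
    }

    /** Returns the parsed style, or null if no usable OS/2 table is found. */
    static Os2Style parse(byte[] font) {
        if (font.length < 12) return null;
        ByteBuffer buf = ByteBuffer.wrap(font); // ByteBuffer is big-endian by default
        int numTables = buf.getShort(4) & 0xFFFF;
        for (int i = 0; i < numTables; i++) {
            int rec = 12 + 16 * i; // 12-byte header, then 16-byte table records
            if (rec + 16 > font.length) return null;
            String tag = new String(font, rec, 4, StandardCharsets.ISO_8859_1);
            if (!tag.equals("OS/2")) continue;
            long offset = buf.getInt(rec + 8) & 0xFFFFFFFFL;
            long length = buf.getInt(rec + 12) & 0xFFFFFFFFL;
            // Need at least 64 bytes to reach the end of fsSelection.
            if (length < 64 || offset + 64 > font.length) return null;
            int base = (int) offset;
            int weight = buf.getShort(base + 4) & 0xFFFF;
            int fsSelection = buf.getShort(base + 62) & 0xFFFF;
            return new Os2Style(weight, (fsSelection & 0x20) != 0, (fsSelection & 0x01) != 0);
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java Os2Style <font file>");
            return;
        }
        Os2Style s = parse(Files.readAllBytes(Paths.get(args[0])));
        System.out.println(s == null ? "no OS/2 table"
                : "weight=" + s.weightClass + " bold=" + s.bold + " italic=" + s.italic);
    }
}
```

When this returns null, falling back to the font name is probably still the most practical option.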

1

I have used iTextSharp to extract the font family, font color, etc.

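using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// path, text_input_File and input_pdf are fields of the enclosing class.
public void Extract_inputpdf() {

  text_input_File = string.Empty;

  StringBuilder sb_inputpdf = new StringBuilder();
  PdfReader reader_inputPdf = new PdfReader(path); // read PDF
  // iTextSharp page numbers start at 1, not 0
  for (int i = 1; i <= reader_inputPdf.NumberOfPages; i++) {

    TextWithFont_inputPdf inputpdf = new TextWithFont_inputPdf();
    text_input_File = PdfTextExtractor.GetTextFromPage(reader_inputPdf, i, inputpdf);

    sb_inputpdf.Append(text_input_File);
  }
  input_pdf = sb_inputpdf.ToString();
  reader_inputPdf.Close();
  clear(); // helper defined elsewhere in the enclosing class
}

public class TextWithFont_inputPdf : ITextExtractionStrategy {
  private readonly StringBuilder result = new StringBuilder();

  public void RenderText(TextRenderInfo renderInfo) {

    string curFont = renderInfo.GetFont().PostscriptFontName;

    // split the words from the PostScript name here if you want them separately

    // assumption: the original snippet does not show what gets appended;
    // here each text chunk is recorded with the font name it was drawn in
    result.Append('[').Append(curFont).Append("] ").Append(renderInfo.GetText());
  }

  // remaining interface members; nothing to do for this use case
  public void BeginTextBlock() { }
  public void EndTextBlock() { }
  public void RenderImage(ImageRenderInfo renderInfo) { }

  public string GetResultantText() {

    return result.ToString();
  }
}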
Gangula
pdp