9

I have been using POI to parse XLS and XLSX files successfully. However, I am unable to correctly extract special characters, such as UTF-8 encoded characters like Chinese or Japanese, from an Excel spreadsheet. I have figured out how to extract data from a UTF-8 encoded csv or tab delimited file, but no luck with the Excel file. Can anyone help?

(Edit: Code snippet from comments)

HSSFSheet sheet = workbook.getSheet(worksheet); 
HSSFEvaluationWorkbook ewb = HSSFEvaluationWorkbook.create(workbook); 
while (rowCtr <= lastRow && !rowBreakOut) 
{ 
    Row row = sheet.getRow(rowCtr);//rows.next(); 
    for (int col=firstCell; col<lastCell && !breakOut; col++) { 
      Cell cell; 
      cell = row.getCell(col,Row.RETURN_BLANK_AS_NULL); 
      if (ctype == Cell.CELL_TYPE_STRING) { 
         sValue = cell.getStringCellValue(); 
         log.warn("String value = "+sValue); 
         String encoded = URLEncoder.encode(sValue, "UTF-8"); 
         log.warn("URL-encoded with UTF-8: " + encoded); 
         ....
Makoto
  • 104,088
  • 27
  • 192
  • 230
user1198370
  • 91
  • 1
  • 1
  • 2
  • Can you point to the portions of the POI API that you are using or show us some code? That would help us recommend API changes and or diagnostics that would help debug. – Mike Samuel Feb 08 '12 at 23:29
  • Excel doesn't store characters as UTF-8, so I think you're probably confused somewhere... – Gagravarr Feb 09 '12 at 09:55
  • Here is a code snippet:`HSSFSheet sheet = workbook.getSheet(worksheet); HSSFEvaluationWorkbook ewb = HSSFEvaluationWorkbook.create(workbook); while (rowCtr <= lastRow && !rowBreakOut) { Row row = sheet.getRow(rowCtr);//rows.next(); for (int col=firstCell; col – user1198370 Feb 09 '12 at 16:55
  • all log entries return junk, ie - non-chinese characters – user1198370 Feb 09 '12 at 16:55
  • I could not figure out how to attach a screen shot or excel file of chinese characters saved into an Excel file, so you're just going to have to take my word that it is possible to store chinese characters in an excel file. – user1198370 Feb 09 '12 at 16:56
  • @user1198370 - Welcome to stackoverflow :) It is a little hard to read that much code in comments, so I moved the code snippet into the question for better readability. FYI: You can always update your own questions to include more information. I think that option is available to all user levels. – Leigh Feb 26 '12 at 01:30

4 Answers4

14

I had the same problem while extracting Persian text from an Excel file. I was using Eclipse, and simply going to Project -> Properties and changing the "text file encoding" to UTF-8 solved the problem.

Roozbehan
  • 478
  • 5
  • 15
6

in POI you can use like this:

Workbook wb = new HSSFWorkbook();
Sheet sheet = wb.createSheet("new sheet");

// Create a row and put some cells in it. Rows are 0 based.
Row row = sheet.createRow(1);

// Create a new font and alter it.
Font font = wb.createFont();
font.setCharSet(FontCharset.ARABIC.getValue());
font.setFontHeightInPoints((short)24);
font.setFontName("B Nazanin");
font.setItalic(true);
font.setStrikeout(true);

// Fonts are set into a style so create a new one to use.
CellStyle style = wb.createCellStyle();
style.setFont(font);

// Create a cell and put a value in it.
Cell cell = row.createCell(1);
cell.setCellValue("سلام");
cell.setCellStyle(style);

// Write the output to a file
FileOutputStream fileOut = new FileOutputStream("workbook.xls");
wb.write(fileOut);
fileOut.close();

and can use another charset in FontCharset

1

Get bytes using UTF as follows

cell.getStringCellValue().getBytes(Charset.forName("UTF-8"));
yottabrain
  • 2,387
  • 5
  • 23
  • 37
1

The solution is simple, to read cell string values of any encoding (non English characters); just use the following method:

sValue = cell.getRichStringCellValue().getString();

instead of:

sValue = cell.getStringCellValue();

This applies to UTF-8 encoded characters like Chinese, Arabic or Japanese.

P.S if anybody is using the Command line utility nullpunkt/excel-to-json which utilize the "Apache POI" library, modify the file converter/ExcelToJsonConverter.java by replacing the occurrences of "getStringCellValue()" to avoid reading non-english characters as "???".

Yacoub Oweis
  • 352
  • 4
  • 11