
I have an .rdf file (over 2 GB compressed) that apparently contains some duplicated IRIs in the middle, and perhaps other issues.

I get the following error in the Workbench during import:

  RDF Parse Error: ID '_D5C2483C53D3F747_up.name_uORF' has already been defined [line 6907110, column 53]

Is there a tool to pre-process these huge files prior to import with some defined behavior for such errors, e.g. "just skip it"?

mkk
Looks like you're trying to import UniProt data. I stumbled over the same problem and solved it with a Python script that removes the duplicated lines (always leaving the first instance). It's not a universal solution, as it only solves this specific UniProt case, but in case you're still interested I could post it as an answer. – gaspanic Nov 05 '21 at 15:18
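For illustration only, here is a minimal sketch of that kind of pre-processing. It is not the script mentioned in the comment. It assumes that every element carrying an `rdf:ID` attribute sits entirely on one line (so dropping the line keeps the XML well-formed), that the prefix is literally `rdf:`, and that the file is gzip-compressed. The function names and file paths are hypothetical.

```python
import gzip
import re

# Matches an rdf:ID attribute with either quote style and captures its value.
# Assumption: the RDF namespace prefix in the file is literally "rdf".
RDF_ID = re.compile(r'rdf:ID\s*=\s*(["\'])(.*?)\1')


def drop_duplicate_id_lines(lines):
    """Yield lines, skipping any line that repeats an rdf:ID seen earlier.

    The first line that uses a given ID is always kept.
    Assumption: an element with rdf:ID opens and closes on the same line.
    """
    seen = set()
    for line in lines:
        ids = [m.group(2) for m in RDF_ID.finditer(line)]
        if any(i in seen for i in ids):
            continue  # later duplicate: drop the whole line
        seen.update(ids)
        yield line


def clean_file(src_path, dst_path):
    """Stream a gzip-compressed RDF/XML file line by line into a cleaned copy."""
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
         gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        dst.writelines(drop_duplicate_id_lines(src))
```

Calling `clean_file("input.rdf.gz", "cleaned.rdf.gz")` (hypothetical paths) processes the file as a stream, so the file itself never has to fit in memory. The set of seen IDs does, however, which may matter for a file of this size. If elements with `rdf:ID` span several lines, this line-based approach would break the XML and a real streaming XML parser would be needed instead.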

1 Answer


When you import files through the GraphDB Workbench, there's an "Advanced settings" foldout menu. Expand it and you will find several validation-related options you can enable or disable, including "Should stop on error". I can't be sure that the import will continue past this particular error if you disable that option (there are some syntax errors that the parser simply can't recover from), but it's worth a shot.

Jeen Broekstra