How to convert DJVU to PDF with table of contents

This article is based on an answer by pyrocrasty on StackExchange and on kcroker's dpsprep. It describes the essential idea and the tools needed to convert a DJVU file to PDF while preserving its table of contents (TOC). A Python script at the end automates the whole procedure.

Dependencies

PDF tool pdftk: install with brew install pdftk
DJVU library DjVuLibre (which also provides the command-line tools ddjvu and djvused): install with brew install djvulibre
Python package sexpdata, used to parse the bookmark file: install with pip install sexpdata

Procedures

step 1: convert the file text

First, use any tool to convert the DJVU file to a PDF (without bookmarks).

Suppose the files are called filename.djvu and filename.pdf.
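
One possible tool for this step is ddjvu from DjVuLibre, which is what the script at the end of this article uses. The sketch below only shows how such a call could be assembled from Python. The helper names build_ddjvu_command and convert are made up for illustration, and the --format=pdf flag is copied from that script.

```python
import subprocess

def build_ddjvu_command(src, dest):
    """Return the argument list for a plain DJVU -> PDF conversion (no bookmarks)."""
    # Passing a list (instead of a shell string) avoids quoting problems
    # with file names that contain spaces or other special characters.
    return ["ddjvu", "--format=pdf", src, dest]

def convert(src, dest):
    # check=True raises CalledProcessError if ddjvu exits with a non-zero status.
    subprocess.run(build_ddjvu_command(src, dest), check=True)
```

Calling convert("filename.djvu", "filename.pdf") would then produce the bookmark-free PDF used in the following steps, assuming ddjvu is on the PATH.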

step 2: extract DJVU outline

Next, output the DJVU outline data to a file, like this:

  djvused "filename.djvu" -e 'print-outline' > bmarks.out

This file lists the DJVU document's bookmarks in a serialized tree format. It is simply an S-expression (SEXPR), so it is easy to parse. The format is as follows:

  file ::= (bookmarks
            <bookmark>*)
  bookmark ::= (name
                page
                <bookmark>*)
  name ::= "<character>*"
  page ::= "#<digit>+"

For example:

  (bookmarks
    ("bmark1"
      "#1")
    ("bmark2"
      "#5"
      ("bmark2subbmark1"
        "#6")
      ("bmark2subbmark2"
        "#7"))
    ("bmark3"
      "#9"
      ...))
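
To make the format concrete, here is a minimal sketch of a parser for it that uses only the standard library. It is not the approach of the script at the end of this article, which relies on sexpdata. The names TOKEN, unquote and parse_outline are illustrative, and the sketch assumes that titles use only the simple backslash escapes \" and \\.

```python
import re

# A token is "(", ")", a double-quoted string, or a bare symbol such as bookmarks.
TOKEN = re.compile(r'\(|\)|"(?:\\.|[^"\\])*"|[^\s()"]+')

def unquote(token):
    # Drop the surrounding quotes and undo simple backslash escapes (\" and \\).
    return re.sub(r'\\(.)', r'\1', token[1:-1])

def parse_outline(text):
    """Return the outline as a list of (title, page, children) tuples."""
    tokens = TOKEN.findall(text)
    if not tokens:
        return []          # an empty file means the document has no outline
    pos = 2                # skip the leading "(" and the bookmarks symbol

    def parse_bookmark():
        nonlocal pos
        title = unquote(tokens[pos + 1])   # tokens[pos] is the opening "("
        page = unquote(tokens[pos + 2])
        pos += 3
        children = []
        while tokens[pos] == "(":          # nested bookmarks follow the page
            children.append(parse_bookmark())
        pos += 1                           # consume the closing ")"
        return (title, page, children)

    result = []
    while tokens[pos] == "(":
        result.append(parse_bookmark())
    return result
```

Each bookmark becomes a (title, page, children) tuple, so the nesting of the tuples mirrors the nesting of the parentheses in the file.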

step 3: convert DJVU outline to PDF metadata format

Now we need to convert these bookmarks into the bookmark format that pdftk uses in its PDF metadata files. That format is:

  file ::= <entry>*
  entry ::= BookmarkBegin
            BookmarkTitle: <title>
            BookmarkLevel: <number>
            BookmarkPageNumber: <number>
  title ::= <character>*

So our example would become:

  BookmarkBegin
  BookmarkTitle: bmark1
  BookmarkLevel: 1
  BookmarkPageNumber: 1
  BookmarkBegin
  BookmarkTitle: bmark2
  BookmarkLevel: 1
  BookmarkPageNumber: 5
  BookmarkBegin
  BookmarkTitle: bmark2subbmark1
  BookmarkLevel: 2
  BookmarkPageNumber: 6
  BookmarkBegin
  BookmarkTitle: bmark2subbmark2
  BookmarkLevel: 2
  BookmarkPageNumber: 7
  BookmarkBegin
  BookmarkTitle: bmark3
  BookmarkLevel: 1
  BookmarkPageNumber: 9

Basically, you just need to write a script to walk the SEXPR tree, keeping track of the level, and output the name, page number and level of each entry it comes to, in the correct format.
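
A minimal sketch of such a walk is shown below. It assumes the outline has already been parsed into nested (title, page, children) tuples, which is a representation chosen here for illustration, and the function name outline_to_pdftk is likewise hypothetical.

```python
def outline_to_pdftk(bookmarks, level=1):
    """Render (title, page, children) tuples as pdftk bookmark records."""
    lines = []
    for title, page, children in bookmarks:
        lines.append("BookmarkBegin")
        lines.append("BookmarkTitle: %s" % title)
        lines.append("BookmarkLevel: %d" % level)
        # DJVU stores the target as "#<page>"; pdftk wants the bare number.
        lines.append("BookmarkPageNumber: %s" % page.lstrip("#"))
        # Children are emitted right after their parent, one level deeper.
        lines.extend(outline_to_pdftk(children, level + 1))
    return lines

example = [
    ("bmark1", "#1", []),
    ("bmark2", "#5", [("bmark2subbmark1", "#6", []),
                      ("bmark2subbmark2", "#7", [])]),
]
print("\n".join(outline_to_pdftk(example)))
```

The recursion depth doubles as the bookmark level, so no separate bookkeeping is needed.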

step 4: extract PDF metadata and splice in converted bookmarks

Once you've got the converted list, output the PDF metadata from your converted PDF file:

  pdftk "filename.pdf" dump_data > pdfmetadata.out

Now open the file and find the line that begins with NumberOfPages:

Insert the converted bookmarks after this line and save the result as pdfmetadata.in.
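
If you would rather not edit the file by hand, the splice can be sketched in a few lines of Python. The function name splice_bookmarks is an assumption made for this example, and both arguments are expected to be plain text: the content of pdfmetadata.out and the converted bookmark records.

```python
def splice_bookmarks(metadata, bookmarks):
    """Insert the bookmark records right after the NumberOfPages line."""
    out = []
    inserted = False
    for line in metadata.splitlines():
        out.append(line)
        if not inserted and line.startswith("NumberOfPages:"):
            out.extend(bookmarks.splitlines())
            inserted = True
    if not inserted:
        # Fail loudly instead of silently writing a file without bookmarks.
        raise ValueError("no NumberOfPages line found in the metadata")
    return "\n".join(out) + "\n"
```

Writing the returned string to pdfmetadata.in gives the file needed in the next step.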

step 5: create PDF with bookmarks

Now we can create a new PDF file incorporating this metadata:

  pdftk "filename.pdf" update_info "pdfmetadata.in" output out.pdf

The file out.pdf should be a copy of your PDF with the bookmarks imported from the DJVU file.
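
One way to check the result is to run pdftk "out.pdf" dump_data again and list the bookmark records it reports. The sketch below parses such a dump, given as text, into (level, title, page) tuples. The function name read_bookmarks is hypothetical, and the sketch assumes every record carries the three fields described in step 3.

```python
def read_bookmarks(dump_text):
    """Collect (level, title, page) tuples from pdftk dump_data output."""
    bookmarks = []
    current = {}
    for line in dump_text.splitlines():
        if line == "BookmarkBegin":
            current = {}                 # start a new record
            bookmarks.append(current)
        elif line.startswith("Bookmark") and bookmarks:
            key, _, value = line.partition(": ")
            current[key] = value
    return [(int(b["BookmarkLevel"]), b["BookmarkTitle"], int(b["BookmarkPageNumber"]))
            for b in bookmarks]
```

Comparing this list with the DJVU outline from step 2 shows quickly whether any entry was lost or ended up at the wrong level.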

Python script

To use this script, save it to a file (e.g., named djvu2pdftoc) and make it executable with chmod +x djvu2pdftoc. You can then run it as:

./djvu2pdftoc IN.djvu OUT.pdf (with default quality 80), or
./djvu2pdftoc --quality 100 IN.djvu OUT.pdf (lossless conversion)

  #!/usr/bin/env python3
  # Convert DJVU to PDF with table of contents, if available.
  # Modified from https://github.com/kcroker/dpsprep
  # License: GNU GPL v3

  import sexpdata
  import argparse
  import os
  import shlex  # shlex.quote replaces pipes.quote; the pipes module was removed in Python 3.13
  import subprocess
  import re

  # Recursively walks the sexpr tree and outputs a metadata format understandable by pdftk
  def walk_bmarks(bmarks, level):
      output = ''
      wroteTitle = False
      for j in bmarks:
          if isinstance(j, list):
              output = output + walk_bmarks(j, level + 1)
          elif isinstance(j, str):
              if not wroteTitle:
                  output = output + "BookmarkBegin\nBookmarkTitle: %s\nBookmarkLevel: %d\n" % (j, level)
                  wroteTitle = True
              else:
                  output = output + "BookmarkPageNumber: %s\n" % j.split('#')[1]
                  wroteTitle = False
          else:
              pass
      return output

  workpath = os.getcwd()

  # From Python docs, nice and slick command line arguments
  parser = argparse.ArgumentParser(description='Convert DJVU format to PDF format preserving OCRd text and metadata.  Very useful for Sony Digital Paper system')
  parser.add_argument('src', metavar='djvufile', type=str,
                      help='the source DJVU file')
  parser.add_argument('dest', metavar='pdffile', type=str,
                      help='the destination PDF file')
  # The short and long option names must be passed as two separate strings
  parser.add_argument('-q', '--quality', dest='quality', type=int, default=80,
                      help='specify JPEG lossy compression quality (50-150).  See man ddjvu for more information.')

  args = parser.parse_args()

  # Reescape the filenames because we will just be sending them to commands via system
  # and we don't otherwise work directly with the DJVU and PDF files.
  # Also, stash the temp pdf in the clean spot
  args.src = shlex.quote(args.src)
  finaldest = shlex.quote(args.dest)
  args.dest = workpath + '/dumpd.pdf'

  # Check for a file presently being processed
  if os.path.isfile(workpath + '/inprocess'):
      fname = open(workpath + '/inprocess', 'r').read()
      if not fname == args.src:
          print("ERROR: Attempting to process %s before %s is completed. Aborting." % (args.src, fname))
          exit(3)
      else:
          print("NOTE: Continuing to process %s..." % args.src)
  else:
      # Record the file we are about to process
      open(workpath + '/inprocess', 'w').write(args.src)

  # Make the PDF, compressing with JPG so they are not ridiculous in size
  # (cwd)
  if not os.path.isfile(workpath + '/dumpd.pdf'):
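      # Pass the requested JPEG quality on to ddjvu (see man ddjvu for -quality)
      retval = os.system("ddjvu -v --format=pdf -quality=%d %s %s/dumpd.pdf" % (args.quality, args.src, workpath))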
      if retval > 0:
          print("\nNOTE: There was a problem on ddjvu to convert to pdf.")
          exit(retval)
  else:
      print("PDF (without toc) already found, use it.")

  # Extract the bookmark data from the DJVU document
  retval = 0
  retval = retval | os.system("djvused %s -u -e 'print-outline' > %s/bmarks.out" % (args.src, workpath))
  print("Bookmarks extracted.")

  # Check for zero-length outline
  if os.stat("%s/bmarks.out" % workpath).st_size > 0:

      # Extract the metadata from the PDF document
      retval = retval | os.system("pdftk %s dump_data_utf8 > %s/pdfmetadata.out" % (args.dest, workpath))
      print("Original PDF metadata extracted.")

      # Parse the sexpr
      pdfbmarks = walk_bmarks(sexpdata.load(open(workpath + '/bmarks.out')), 0)

      # Integrate the parsed bookmarks into the PDF metadata
      p = re.compile('NumberOfPages: [0-9]+')
      metadata = open("%s/pdfmetadata.out" % workpath, 'r').read()

      for m in p.finditer(metadata):
          loc = int(m.end())

          newoutput = metadata[:loc] + "\n" + pdfbmarks[:-1] + metadata[loc:]

          # Update the PDF metadata
          open("%s/pdfmetadata.in" % workpath, 'w').write(newoutput)
          retval = retval | os.system("pdftk %s update_info_utf8 %s output %s" % (args.dest, workpath + '/pdfmetadata.in', finaldest))

  else:
      retval = retval | os.system("mv %s %s" % (args.dest, finaldest))
      print("No bookmarks were present!")

  # If anything failed, keep the temporary files for inspection
  if retval == 0:
      os.system("rm %s/inprocess %s" % (workpath, args.dest))
      print("SUCCESS. Temporary files cleared.")
      exit(0)
  else:
      print("There were errors in the metadata step.  Check the errors.")
      exit(retval)
