This article is based on an answer by pyrocrasty on StackExchange and on kcroker's dpsprep. It explains the basic idea and the tools needed to convert a DJVU file to PDF while preserving the table of contents (TOC). A Python script that automates the steps is given at the end.

Dependencies

  1. PDF tool pdftk: install with brew install pdftk

  2. DJVU library DjVuLibre (which also provides the command-line tools ddjvu and djvused): install with brew install djvulibre

  3. Python package sexpdata, used to parse the bookmark file: install with pip install sexpdata

Procedures

step 1: convert the file text

First, convert the DJVU file to a PDF (without bookmarks) using any tool, for example ddjvu from DjVuLibre.

Suppose the files are called filename.djvu and filename.pdf.

step 2: extract DJVU outline

Next, output the DJVU outline data to a file, like this:

  djvused "filename.djvu" -e 'print-outline' > bmarks.out

This file lists the DJVU document's bookmarks in a serialized tree format. It is just an S-expression (SEXPR), so it can be parsed easily. The format is as follows:

  file ::= (bookmarks
            <bookmark>*)
  bookmark ::= (name
                page
                <bookmark>*)
  name ::= "<character>*"
  page ::= "#<digit>+"

For example:

  (bookmarks
    ("bmark1"
      "#1")
    ("bmark2"
      "#5"
      ("bmark2subbmark1"
        "#6")
      ("bmark2subbmark2"
        "#7"))
    ("bmark3"
      "#9"
      ...))
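
The grammar above is small enough that a few lines of Python can read it. The sketch below is a hand-rolled, stdlib-only stand-in for the sexpdata package that the script at the end uses, not a replacement for it. The names TOKEN and parse_outline are made up for illustration, escape sequences inside titles are left untouched, and the input is assumed to have balanced parentheses.

```python
import re

# Tokens: "(", ")", double-quoted strings (backslash escapes allowed), bare symbols.
TOKEN = re.compile(r'\(|\)|"(?:\\.|[^"\\])*"|[^\s()"]+')

def parse_outline(text):
    """Parse `djvused print-outline` text into nested Python lists (hypothetical helper)."""
    stack = [[]]  # stack[0] is a root container for the top-level list
    for tok in TOKEN.findall(text):
        if tok == '(':
            stack.append([])          # open a new nested list
        elif tok == ')':
            done = stack.pop()        # close it and attach it to its parent
            stack[-1].append(done)
        elif tok.startswith('"'):
            stack[-1].append(tok[1:-1])  # strip the quotes; escapes kept as-is
        else:
            stack[-1].append(tok)     # bare symbol such as `bookmarks`
    return stack[0][0] if stack[0] else []

sample = '''
(bookmarks
  ("bmark1"
    "#1")
  ("bmark2"
    "#5"
    ("bmark2subbmark1"
      "#6")))
'''
tree = parse_outline(sample)
print(tree)
```

Note that this sketch stores the leading bookmarks symbol as an ordinary string, so a consumer has to skip the first element itself; sexpdata instead returns it as a distinct symbol object.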

step 3: convert DJVU outline to PDF metadata format

Now we need to convert these bookmarks into the bookmark format that pdftk uses in its PDF metadata files. That format is:

  file ::= <entry>*
  entry ::= BookmarkBegin
            BookmarkTitle: <title>
            BookmarkLevel: <number>
            BookmarkPageNumber: <number>
  title ::= <character>*

So our example would become:

  BookmarkBegin
  BookmarkTitle: bmark1
  BookmarkLevel: 1
  BookmarkPageNumber: 1
  BookmarkBegin
  BookmarkTitle: bmark2
  BookmarkLevel: 1
  BookmarkPageNumber: 5
  BookmarkBegin
  BookmarkTitle: bmark2subbmark1
  BookmarkLevel: 2
  BookmarkPageNumber: 6
  BookmarkBegin
  BookmarkTitle: bmark2subbmark2
  BookmarkLevel: 2
  BookmarkPageNumber: 7
  BookmarkBegin
  BookmarkTitle: bmark3
  BookmarkLevel: 1
  BookmarkPageNumber: 9

Basically, you just need to write a script to walk the SEXPR tree, keeping track of the level, and output the name, page number and level of each entry it comes to, in the correct format.
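
As a minimal sketch of that walk, assuming the outline has already been parsed into nested lists of the form [title, page, child, child, ...], the recursion could look like this. The function name outline_to_pdftk is hypothetical, and page targets are assumed to be plain numbers such as "#5"; other kinds of target would need extra handling.

```python
def outline_to_pdftk(bookmarks, level=1):
    """Turn a list of [title, page, *children] entries into pdftk bookmark lines."""
    lines = []
    for title, page, *children in bookmarks:
        lines += [
            "BookmarkBegin",
            "BookmarkTitle: %s" % title,
            "BookmarkLevel: %d" % level,
            "BookmarkPageNumber: %s" % page.lstrip("#"),  # "#5" -> "5"
        ]
        # Children sit one level deeper in the PDF outline.
        lines += outline_to_pdftk(children, level + 1)
    return lines

outline = [
    "bookmarks",
    ["bmark1", "#1"],
    ["bmark2", "#5",
        ["bmark2subbmark1", "#6"],
        ["bmark2subbmark2", "#7"]],
    ["bmark3", "#9"],
]
# Skip the leading "bookmarks" marker; everything after it is a top-level entry.
result = outline_to_pdftk(outline[1:])
print("\n".join(result))
```

Each parent is emitted before its children, which gives the same depth-first ordering as the example output above.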

step 4: extract PDF metadata and splice in converted bookmarks

Once you've got the converted list, output the PDF metadata from your converted PDF file:

  pdftk "filename.pdf" dump_data > pdfmetadata.out

Now open the file, find the line that begins with NumberOfPages:, and insert the converted bookmarks after this line. Save the new file as pdfmetadata.in.
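
The splice can also be done programmatically. The following is a sketch only: splice_bookmarks is a hypothetical helper, and the metadata string is a short hand-written stand-in for a real pdftk dump, which would contain more fields.

```python
import re

def splice_bookmarks(metadata, bookmarks):
    """Insert the bookmark block right after the NumberOfPages line."""
    match = re.search(r"^NumberOfPages: \d+$", metadata, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no NumberOfPages line found in the metadata")
    end = match.end()
    # Drop the bookmarks' trailing newline so no blank line is introduced.
    return metadata[:end] + "\n" + bookmarks.rstrip("\n") + metadata[end:]

# Hand-written sample input, for illustration only.
metadata = (
    "InfoBegin\n"
    "InfoKey: Title\n"
    "InfoValue: Example\n"
    "NumberOfPages: 12\n"
    "PageMediaBegin\n"
)
bookmarks = (
    "BookmarkBegin\n"
    "BookmarkTitle: bmark1\n"
    "BookmarkLevel: 1\n"
    "BookmarkPageNumber: 1\n"
)
spliced = splice_bookmarks(metadata, bookmarks)
print(spliced)
```

Raising an error when the NumberOfPages line is missing is a design choice of this sketch; it avoids silently writing a metadata file without bookmarks.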

step 5: create PDF with bookmarks

Now we can create a new PDF file incorporating this metadata:

  pdftk "filename.pdf" update_info "pdfmetadata.in" output out.pdf

The file out.pdf should be a copy of your PDF with the bookmarks imported from the DJVU file.
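
If you drive pdftk from Python, passing the arguments as a list avoids shell quoting problems with filenames that contain spaces. This is a sketch under that assumption; pdftk_update_command and run are hypothetical helper names, and the block only builds and prints the command rather than executing it.

```python
import subprocess

def pdftk_update_command(pdf_in, metadata_in, pdf_out):
    """Argument list for the pdftk update_info call shown above."""
    return ["pdftk", pdf_in, "update_info", metadata_in, "output", pdf_out]

def run(cmd):
    # check=True raises CalledProcessError if the tool exits with a non-zero status.
    subprocess.run(cmd, check=True)

cmd = pdftk_update_command("my book.pdf", "pdfmetadata.in", "out.pdf")
print(cmd)
# To actually create the file (requires pdftk on the PATH): run(cmd)
```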

Python script

To use this script, save it to a file (e.g., djvu2pdftoc) and make it executable with chmod +x djvu2pdftoc. You can then run it as:

  • ./djvu2pdftoc IN.djvu OUT.pdf (with default quality 80), or

  • ./djvu2pdftoc --quality 100 IN.djvu OUT.pdf (lossless conversion)

  #!/usr/bin/env python3
  # Convert DJVU to PDF with table of contents, if available.
  # Modified from https://github.com/kcroker/dpsprep
  # License: GNU GPL v3

  import sexpdata
  import argparse
  import os
  import shlex  # shlex.quote replaces pipes.quote; the pipes module was removed in Python 3.13
  import re

  # Recursively walk the S-expression tree and emit bookmark metadata in the format pdftk understands.
  # In each bookmark list the first string is the title and the second is the page ("#N");
  # nested lists are child bookmarks, and non-string items (the `bookmarks` symbol) are skipped.
  def walk_bmarks(bmarks, level):
      output = ''
      wroteTitle = False
      for j in bmarks:
          if isinstance(j, list):
              output = output + walk_bmarks(j, level + 1)
          elif isinstance(j, str):
              if not wroteTitle:
                  output = output + "BookmarkBegin\nBookmarkTitle: %s\nBookmarkLevel: %d\n" % (j, level)
                  wroteTitle = True
              else:
                  output = output + "BookmarkPageNumber: %s\n" % j.split('#')[1]
                  wroteTitle = False
          else:
              pass
      return output

  workpath = os.getcwd()

  # Command-line arguments, following the argparse examples in the Python docs
  parser = argparse.ArgumentParser(description='Convert a DJVU file to PDF, preserving the table of contents (bookmarks).')
  parser.add_argument('src', metavar='djvufile', type=str,
                      help='the source DJVU file')
  parser.add_argument('dest', metavar='pdffile', type=str,
                      help='the destination PDF file')
  # The short and long option names must be passed as separate strings
  parser.add_argument('-q', '--quality', dest='quality', type=int, default=80,
                      help='specify JPEG lossy compression quality (50-150).  See man ddjvu for more information.')

  args = parser.parse_args()

  # Shell-quote the filenames because we just pass them to commands via os.system
  # and don't otherwise work directly with the DJVU and PDF files.
  # Also, stash the temporary PDF in the working directory
  args.src = shlex.quote(args.src)
  finaldest = shlex.quote(args.dest)
  args.dest = workpath + '/dumpd.pdf'

  # Check for a file presently being processed
  if os.path.isfile(workpath + '/inprocess'):
      fname = open(workpath + '/inprocess', 'r').read()
      if not fname == args.src:
          print("ERROR: Attempting to process %s before %s is completed. Aborting." % (args.src, fname))
          exit(3)
      else:
          print("NOTE: Continuing to process %s..." % args.src)
  else:
      # Record the file we are about to process
      open(workpath + '/inprocess', 'w').write(args.src)

  # Make the PDF in the working directory, compressing with JPEG at the requested quality
  # so that the file is not ridiculous in size
  if not os.path.isfile(workpath + '/dumpd.pdf'):
      retval = os.system("ddjvu -v --format=pdf --quality=%d %s %s/dumpd.pdf" % (args.quality, args.src, workpath))
      if retval != 0:
          print("\nNOTE: ddjvu failed to convert the file to PDF.")
          # os.system returns a wait status, which can wrap to 0 if passed to exit()
          exit(1)
  else:
      print("PDF (without TOC) already found; reusing it.")

  # Extract the bookmark data from the DJVU document
  retval = 0
  retval = retval | os.system("djvused %s -u -e 'print-outline' > %s/bmarks.out" % (args.src, workpath))
  print("Bookmarks extracted.")

  # Check for zero-length outline
  if os.stat("%s/bmarks.out" % workpath).st_size > 0:

      # Extract the metadata from the PDF document
      retval = retval | os.system("pdftk %s dump_data_utf8 > %s/pdfmetadata.out" % (args.dest, workpath))
      print("Original PDF metadata extracted.")

      # Parse the sexpr
      pdfbmarks = walk_bmarks(sexpdata.load(open(workpath + '/bmarks.out')), 0)

      # Integrate the parsed bookmarks into the PDF metadata
      p = re.compile('NumberOfPages: [0-9]+')
      metadata = open("%s/pdfmetadata.out" % workpath, 'r').read()

      for m in p.finditer(metadata):
          loc = int(m.end())

          newoutput = metadata[:loc] + "\n" + pdfbmarks[:-1] + metadata[loc:]

          # Update the PDF metadata
          open("%s/pdfmetadata.in" % workpath, 'w').write(newoutput)
          retval = retval | os.system("pdftk %s update_info_utf8 %s output %s" % (args.dest, workpath + '/pdfmetadata.in', finaldest))

  else:
      retval = retval | os.system("mv %s %s" % (args.dest, finaldest))
      print("No bookmarks were present!")

  # If anything failed, keep the temp files for inspection
  if retval == 0:
      os.system("rm %s/inprocess %s" % (workpath, args.dest))
      print("SUCCESS. Temporary files cleared.")
      exit(0)
  else:
      print("There were errors in the metadata step.  Check the errors.")
      # os.system returns a wait status, which can wrap to 0 if passed to exit()
      exit(1)