Aggregate Contents From the Web

From version 0.2.1, EPUB Parser can parse unpacked (unzipped) EPUB files on the web and aggregate the contents of the books.

Let’s get the contents of the pretty comic Page Blanche from IDPF’s GitHub repository: https://github.com/IDPF/epub3-samples/tree/master/30/page-blanche

We can treat the URI https://raw.githubusercontent.com/IDPF/epub3-samples/master/30/page-blanche/ as the root directory of the book, because the EPUB Open Container Format container.xml file is available at https://raw.githubusercontent.com/IDPF/epub3-samples/master/30/page-blanche/META-INF/container.xml.

Note: Don’t forget the slash at the end of the URI.
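
To see why the trailing slash matters, here is a minimal sketch using only Ruby’s standard uri library. The example.com base URIs and the resolve_entry helper are made up for illustration; they are not part of EPUB Parser.

```ruby
require 'uri'

# Hypothetical helper: resolves a relative entry path against a base URI
# with URI#+, the same operator the download script on this page uses.
def resolve_entry(base, entry)
  (URI.parse(base) + entry).to_s
end

# With the trailing slash, the entry is resolved inside the directory.
puts resolve_entry('https://example.com/books/page-blanche/', 'META-INF/container.xml')
# => https://example.com/books/page-blanche/META-INF/container.xml

# Without it, the last path segment ("page-blanche") is replaced.
puts resolve_entry('https://example.com/books/page-blanche', 'META-INF/container.xml')
# => https://example.com/books/META-INF/container.xml
```

Without the slash, relative resolution drops the final segment of the base path, so container.xml would be looked up one level too high.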

EPUB Parser can treat the URI as the path to an EPUB book and parse its contents by using EPUB::OCF::PhysicalContainer::UnpackedURI:

require 'epub/parser'

uri = 'https://raw.githubusercontent.com/IDPF/epub3-samples/master/30/page-blanche/'
epub = EPUB::Parser.parse(uri, container_adapter: :UnpackedURI)

The trick is to set the container adapter to :UnpackedURI, which makes it possible to parse an EPUB book from the web. Now we can play with EPUB books as always!
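
As one small illustration of “playing with” a parsed book, the sketch below collects the entry name of every page on the spine. The spine_entry_names helper is hypothetical. It assumes only that the object passed in responds to each_page_on_spine and that each page responds to entry_name, as a book parsed by EPUB Parser is expected to.

```ruby
# Hypothetical helper: returns the entry names of the spine items in order.
# It is duck-typed, so it does not load EPUB Parser itself.
def spine_entry_names(epub)
  names = []
  epub.each_page_on_spine { |page| names << page.entry_name }
  names
end

# Possible usage with a book parsed over the network (not run here):
#   require 'epub/parser'
#   uri = 'https://raw.githubusercontent.com/IDPF/epub3-samples/master/30/page-blanche/'
#   epub = EPUB::Parser.parse(uri, container_adapter: :UnpackedURI)
#   puts spine_entry_names(epub)
```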

As an example, here is a script that downloads all the files of a specified EPUB book to a local directory (the source code is available in the repository as aggregate-contents-from-web).

require 'pathname'
require 'tmpdir'
require 'open-uri' # provides URI#read, used in download below
require 'epub/parser'

EPUB_URI = URI.parse(ARGV.shift)
DOWNLOAD_DIR = Pathname.new(ARGV.shift || Dir.mktmpdir('epub-parser'))
$stderr.puts <<EOI
Started downloading EPUB contents...
  from: #{EPUB_URI}
  to:   #{DOWNLOAD_DIR}
EOI

# Make it possible to use URI as EPUB file path
EPUB::OCF::PhysicalContainer.adapter = :UnpackedURI

def main
  make_mimetype

  container_xml = 'META-INF/container.xml'
  download container_xml

  epub = EPUB::Parser.parse(EPUB_URI, container_adapter: :UnpackedURI)
  download epub.rootfile_path

  epub.resources.each do |resource|
    download resource.entry_name
  end
  puts DOWNLOAD_DIR
end

def make_mimetype
  $stderr.puts "Making mimetype file..."
  DOWNLOAD_DIR.join('mimetype').write 'application/epub+zip'
end

def download(path)
  path = path.to_s
  src = EPUB_URI + path
  dest = DOWNLOAD_DIR + path
  $stderr.puts "Downloading #{path} ..."
  dest.dirname.mkpath
  dest.write src.read
end

main

Execution:

$ ruby examples/aggregate-contents-from-web.rb https://raw.githubusercontent.com/IDPF/epub3-samples/master/30/page-blanche/
Started downloading EPUB contents...
  from: https://raw.githubusercontent.com/IDPF/epub3-samples/master/30/page-blanche/
  to:   /tmp/epub-parser20150703-13148-ghdtfq
Making mimetype file...
Downloading META-INF/container.xml ...
Downloading EPUB/package.opf ...
Downloading EPUB/Style/style.css ...
Downloading EPUB/Navigation/nav.xhtml ...
Downloading EPUB/Navigation/toc.ncx ...
Downloading EPUB/Content/cover.xhtml ...
Downloading EPUB/Content/PageBlanche_Page_000.xhtml ...
Downloading EPUB/Content/PageBlanche_Page_001.xhtml ...
Downloading EPUB/Content/PageBlanche_Page_002.xhtml ...
Downloading EPUB/Content/PageBlanche_Page_003.xhtml ...
Downloading EPUB/Content/PageBlanche_Page_004.xhtml ...
Downloading EPUB/Content/PageBlanche_Page_005.xhtml ...
Downloading EPUB/Content/PageBlanche_Page_006.xhtml ...
Downloading EPUB/Content/PageBlanche_Page_007.xhtml ...
Downloading EPUB/Content/PageBlanche_Page_008.xhtml ...
Downloading EPUB/Image/cover.jpg ...
Downloading EPUB/Image/PageBlanche_Page_001.jpg ...
Downloading EPUB/Image/PageBlanche_Page_002.jpg ...
Downloading EPUB/Image/PageBlanche_Page_003.jpg ...
Downloading EPUB/Image/PageBlanche_Page_004.jpg ...
Downloading EPUB/Image/PageBlanche_Page_005.jpg ...
Downloading EPUB/Image/PageBlanche_Page_006.jpg ...
Downloading EPUB/Image/PageBlanche_Page_007.jpg ...
Downloading EPUB/Image/PageBlanche_Page_008.jpg ...
/tmp/epub-parser20150703-13148-ghdtfq

The last line of the output is the path to the directory that the contents were downloaded to. We can repackage it as an EPUB file. Let’s use the epzip utility to do that easily:

$ epzip /tmp/epub-parser20150703-13148-ghdtfq ./page-blanche.epub
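
Before repackaging, it may be worth checking that the download directory has the two files the script above always creates: mimetype and META-INF/container.xml. The following is a minimal sketch using only the standard library. The helper name looks_like_unpacked_epub? is hypothetical and not part of EPUB Parser or epzip.

```ruby
require 'pathname'

# Hypothetical helper: true when the directory has a mimetype file with the
# EPUB media type and a META-INF/container.xml file.
# It does not validate the package document or any other resource.
def looks_like_unpacked_epub?(dir)
  dir = Pathname.new(dir)
  mimetype  = dir.join('mimetype')
  container = dir.join('META-INF', 'container.xml')

  mimetype.file? &&
    mimetype.read.strip == 'application/epub+zip' &&
    container.file?
end

# Possible usage, passing the directory printed by the download script:
#   abort 'incomplete download' unless looks_like_unpacked_epub?(ARGV.first)
```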

Command-line tools

The command-line tools epubinfo and epub-open can also handle URIs as EPUB books.