[OS X TeX] webarchive

Wed May 14 10:09:31 EDT 2008

On 14 May 08 at 02:15, George Gratzer wrote:

> Is there a way to convert a .webarchive to PDF?

On 14 May 08 at 05:49, Axel E. Retif wrote:

> Open it in your browser, then choose ``Print...'' and in the ``PDF''  
> pull-down button (left bottom corner) choose ``Save as PDF...''.

On 14 May 08 at 12:38, Matthew Leingang wrote:

> What's the structure of a .webarchive file?  If it's some kind of  
> zipped directory with HTML and images, I think that wkpdf could  
> work.  It's a command-line utility to convert HTML to PDF.

Regarding the .webarchive format, see
<http://lists.apple.com/archives/Cocoa-dev/2005/Jul/msg02206.html>,
which says it's a private Safari/WebKit format for storing web
archives. Email signatures for Mail are also stored in this format;
see ~/Library/Mail/Signatures.
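
As a partial answer to the question about the file's structure: a
.webarchive is not a zipped directory but, as far as I know, a binary
property list. The sketch below lists the resources stored in one. The
key names (WebMainResource, WebSubresources, WebResourceURL,
WebResourceMIMEType, WebResourceData) are what Safari-written archives
are commonly observed to contain; they are an assumption here, not
something stated in the post linked above, and the function name
list_resources is made up for illustration.

```python
import plistlib


def list_resources(archive_bytes):
    """Return (url, mime_type, size_in_bytes) for each stored resource.

    Assumes the archive is a property list holding one "WebMainResource"
    dictionary and an optional "WebSubresources" list of dictionaries.
    """
    archive = plistlib.loads(archive_bytes)  # detects binary or XML plists
    resources = [archive["WebMainResource"]]
    resources += list(archive.get("WebSubresources", []))
    return [
        (
            res.get("WebResourceURL", ""),
            res.get("WebResourceMIMEType", ""),
            len(res.get("WebResourceData", b"")),
        )
        for res in resources
    ]


# Hypothetical usage on a file saved by Safari:
#   with open("page.webarchive", "rb") as f:
#       for url, mime, size in list_resources(f.read()):
#           print(size, mime, url)
```

If that layout holds, the HTML of the page sits in the main resource
and the images, style sheets and scripts in the subresources, each
still tagged with the URL it was fetched from.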

Apparently there are a number of competing web archive formats
<http://en.wikipedia.org/wiki/MHTML>:

- MIME HTML aka MHTML or MHT introduced by Microsoft

- WAR introduced by Sun and recognized by KDE

- MAF introduced by Mozilla

- an ISO WARC format and a related ARC_IA format
<http://www.digitalpreservation.gov/formats/fdd/webarch_fdd.shtml>

- a WAFF format introduced by the now defunct Internet Explorer 5.2.3  
for the Mac (creator/type MSIE/WAFF, no extension)

- Camino saves complete web sites in the form of an HTML file and  
ancillary media (.js, .gif, .jpg, .png, ...) in a separate folder.

The natural way to process a .webarchive file accordingly seems to be
to open it in Safari. Alas, printing to PDF from there saves the
visual appearance of the page but not the navigational information it
contains (i.e., hyperlinks are lost).

There is a free Web Archive Extractor
<https://sourceforge.net/projects/webarchivext/> and a commercial
version <http://robrohan.com/projects/WebArchiveExtractor/>, both of
which transform a .webarchive file into separate files
(.html, .css, .js and so forth).

In theory, after doing this you could then, if you own Adobe Acrobat  
Pro, open the .html file in it and export the whole page to PDF.

Alas, I just tried the free Web Archive Extractor and the result is
disappointing: the format of the page is more or less preserved, but
its navigational content is broken. For example, on Apple's home page,
www.apple.com, a link http://www.apple.com/imac/ becomes file:///imac/
after saving to .webarchive and then running the Extractor.
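
One plausible explanation (a guess, not verified against the Extractor
itself) is that the page uses root-relative links such as /imac/, which
resolve against file:// once the HTML is opened from disk. If so, the
extracted .html could be repaired by resolving every link against the
page's original URL. The sketch below does that with a simple regular
expression; the function name absolutize_links is hypothetical, and the
pattern only handles double-quoted href and src attributes, so it is an
illustration rather than a robust HTML rewriter.

```python
import re
from urllib.parse import urljoin

# Matches href="..." or src="..." with a double-quoted value only.
LINK_ATTR = re.compile(r'\b(href|src)="([^"]*)"')


def absolutize_links(html, base_url):
    """Rewrite relative href/src values as absolute URLs under base_url."""

    def fix(match):
        attr, target = match.group(1), match.group(2)
        # urljoin leaves already-absolute targets (http:, mailto:, ...) alone.
        return f'{attr}="{urljoin(base_url, target)}"'

    return LINK_ATTR.sub(fix, html)


# Hypothetical usage on the extracted page:
#   fixed = absolutize_links(open("index.html").read(), "http://www.apple.com/")
```

Links that are already absolute pass through unchanged, so running this
on a page that mixes both kinds should be harmless.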

Maybe the commercial version works better; I haven't tried it.

Bruno Voisin