Chapter 11. Data conversion

Tools and tips for converting data formats on the Debian system are described.

Standards-based tools are in very good shape, but support for proprietary data formats is limited.

11.1. Text data conversion tools

The following packages for text data conversion caught my eye.

Table 11.1. List of text data conversion tools

package	popcon	size	keyword	description
`libc6`	http://qa.debian.org/popcon.php?package=libc6	9500	charset	text encoding converter between locales by iconv(1) (fundamental)
`recode`	http://qa.debian.org/popcon.php?package=recode	768	charset+eol	text encoding converter between locales (versatile, more aliases and features)
`konwert`	http://qa.debian.org/popcon.php?package=konwert	192	charset	text encoding converter between locales (fancy)
`nkf`	http://qa.debian.org/popcon.php?package=nkf	205	charset	character set translator for Japanese
`tcs`	http://qa.debian.org/popcon.php?package=tcs	544	charset	character set translator
`unaccent`	http://qa.debian.org/popcon.php?package=unaccent	76	charset	replace accented letters by their unaccented equivalent
`tofrodos`	http://qa.debian.org/popcon.php?package=tofrodos	67	eol	text format converter between DOS and Unix: fromdos(1) and todos(1)
`macutils`	http://qa.debian.org/popcon.php?package=macutils	320	eol	text format converter between Macintosh and Unix: frommac(1) and tomac(1)

11.1.1. Converting a text file with iconv

	Tip
	iconv(1) is provided as a part of the `libc6` package, so it is available on practically all systems to convert the encoding of characters.

You can convert encodings of a text file with iconv(1) by the following.

$ iconv -f encoding1 -t encoding2 input.txt >output.txt

Encoding values are case insensitive and ignore "-" and "_" for matching. Supported encodings can be checked by the "iconv -l" command.
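
As a small illustration of the command above, the following sketch creates a one-line ISO-8859-1 file, converts it to UTF-8, and dumps the bytes of both files. The temporary directory and the file contents are made up for this example.

```shell
#!/bin/sh
# Work in a throwaway directory (hypothetical file names).
tmpdir=$(mktemp -d)

# "café" in ISO-8859-1: the e-acute is the single byte 0xE9 (octal 351).
printf 'caf\351\n' > "$tmpdir/input.txt"

# Convert from ISO-8859-1 to UTF-8.
iconv -f ISO-8859-1 -t UTF-8 "$tmpdir/input.txt" > "$tmpdir/output.txt"

# Show the bytes before and after: 0xE9 becomes the two bytes 0xC3 0xA9.
od -An -tx1 "$tmpdir/input.txt"
od -An -tx1 "$tmpdir/output.txt"
```

Dumping the bytes with od(1) avoids depending on how the terminal itself displays non-ASCII characters.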

Table 11.2. List of encoding values and their usage

encoding value	usage
ASCII	American Standard Code for Information Interchange, 7 bit code w/o accented characters
UTF-8	current multilingual standard for all modern OSs
ISO-8859-1	old standard for western European languages, ASCII + accented characters
ISO-8859-2	old standard for eastern European languages, ASCII + accented characters
ISO-8859-15	old standard for western European languages, ISO-8859-1 with euro sign
CP850	code page 850, Microsoft DOS characters with graphics for western European languages, ISO-8859-1 variant
CP932	code page 932, Microsoft Windows style Shift-JIS variant for Japanese
CP936	code page 936, Microsoft Windows style GB2312, GBK or GB18030 variant for Simplified Chinese
CP949	code page 949, Microsoft Windows style EUC-KR or Unified Hangul Code variant for Korean
CP950	code page 950, Microsoft Windows style Big5 variant for Traditional Chinese
CP1251	code page 1251, Microsoft Windows style encoding for the Cyrillic alphabet
CP1252	code page 1252, Microsoft Windows style ISO-8859-15 variant for western European languages
KOI8-R	old Russian UNIX standard for the Cyrillic alphabet
ISO-2022-JP	standard encoding for Japanese email which uses only 7 bit codes
eucJP	old Japanese UNIX standard 8 bit code and completely different from Shift-JIS
Shift-JIS	JIS X 0208 Appendix 1 standard for Japanese (see CP932)

	Note
	Some encodings are only supported for the data conversion and are not used as locale values (Section 8.3.1, “Basics of encoding”).

For character sets which fit in a single byte such as ASCII and ISO-8859 character sets, the character encoding means almost the same thing as the character set.

For character sets with many characters such as JIS X 0213 for Japanese or Universal Character Set (UCS, Unicode, ISO-10646-1) for practically all languages, there are many encoding schemes to fit them into sequences of bytes.

EUC and ISO/IEC 2022 (also known as JIS X 0202) for Japanese
UTF-8, UTF-16/UCS-2 and UTF-32/UCS-4 for Unicode

For these, there is a clear distinction between the character set and the character encoding.

The term "code page" is used as a synonym for character encoding tables, mostly for vendor specific ones.

	Note
	Most encoding systems share the same code with ASCII for the 7 bit characters, but there are some exceptions. If you are converting old Japanese C programs and URL data from the casually-called shift-JIS encoding format to UTF-8 format, use "CP932" as the encoding name instead of "shift-JIS" to get the expected results: 0x5C → "\" and 0x7E → "~". Otherwise, these are converted to wrong characters.
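
The difference described in this note can be checked by feeding the two bytes to iconv(1) under both encoding names. This is only a sketch: the result for "SHIFT-JIS" depends on the iconv implementation in use, so its output is dumped as bytes rather than assumed.

```shell
#!/bin/sh
# Bytes 0x5C and 0x7E (octal 134 and 176) as found in old Japanese
# C programs and URLs.

# With CP932 they are expected to come out as "\" and "~".
printf '\134\176\n' | iconv -f CP932 -t UTF-8

# With SHIFT-JIS the result depends on the iconv implementation;
# dump the bytes to see which characters were actually produced.
printf '\134\176\n' | iconv -f SHIFT-JIS -t UTF-8 | od -An -tx1
```

If the second dump shows anything other than "5c 7e 0a", the two bytes were mapped to other characters, which is the situation the note warns about.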

	Tip
	recode(1) may be used too and offers more than the combined functionality of iconv(1), fromdos(1), todos(1), frommac(1), and tomac(1). For more, see "`info recode`".

11.1.2. Checking file to be UTF-8 with iconv

You can check if a text file is encoded in UTF-8 with iconv(1) by the following.

$ iconv -f utf8 -t utf8 input.txt >/dev/null || echo "non-UTF-8 found"

	Tip
	Use "`--verbose`" option in the above example to find the first non-UTF-8 character.
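
The same check can be wrapped in a small function to test several files at once. The function name "is_utf8" and the sample files are made up for this sketch.

```shell
#!/bin/sh
# is_utf8 FILE: succeed only if iconv can read FILE as UTF-8.
is_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Demo with two throwaway files (hypothetical names).
tmpdir=$(mktemp -d)
printf 'plain ASCII is valid UTF-8\n' > "$tmpdir/good.txt"
printf 'caf\351\n' > "$tmpdir/bad.txt"   # lone 0xE9 byte: ISO-8859-1, not UTF-8

for f in "$tmpdir"/*.txt; do
  if is_utf8 "$f"; then
    echo "UTF-8:     $f"
  else
    echo "non-UTF-8: $f"
  fi
done
```

Note that a pure ASCII file always passes this check, since ASCII is a subset of UTF-8.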

11.1.3. Converting file names with iconv

Here is an example script to convert the encoding of file names in a single directory from one used under an older OS to modern UTF-8.

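#!/bin/sh
# Encoding of the existing file names (see Table 11.2).
ENCDN=iso-8859-1
for x in *; do
  # printf avoids the special handling that echo may give to "-n" or "\".
  y=$(printf '%s' "$x" | iconv -f "$ENCDN" -t utf-8) || continue
  # Rename only if the converted name differs; "--" protects names starting with "-".
  [ "$x" = "$y" ] || mv -- "$x" "$y"
done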

The "$ENCDN" variable specifies the original encoding of the file names, using an encoding value from Table 11.2, “List of encoding values and their usage”.

For more complicated cases, please mount a filesystem (e.g. a partition on a disk drive) containing such file names with the proper encoding as the mount(8) option (see Section 8.3.6, “Filename encoding”) and copy its entire contents to another filesystem mounted as UTF-8 with the "cp -a" command.

11.1.4. EOL conversion

The text file format, specifically the end-of-line (EOL) code, is dependent on the platform.

Table 11.3. List of EOL styles for different platforms

platform	EOL code	control	decimal	hexadecimal
Debian (unix)	LF	`^J`	10	0A
MSDOS and Windows	CR-LF	`^M^J`	13 10	0D 0A
Apple's Macintosh	CR	`^M`	13	0D

The EOL format conversion programs, fromdos(1), todos(1), frommac(1), and tomac(1), are quite handy. recode(1) is also useful.

	Note
	Some data on the Debian system, such as the wiki page data for the `python-moinmoin` package, use MSDOS style CR-LF as the EOL code. So the above rule is just a general rule.

	Note
	Most editors (e.g. `vim`, `emacs`, `gedit`, …) can handle files in MSDOS style EOL transparently.

	Tip
	The use of "`sed -e '/\r$/!s/$/\r/'`" instead of todos(1) is better when you want to unify the EOL style to the MSDOS style from the mixed MSDOS and Unix style. (e.g., after merging 2 MSDOS style files with diff3(1).) This is because `todos` adds CR to all lines.
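
The EOL codes in Table 11.3 and the effect of the sed(1) command in the tip above can be seen with a small test file. This sketch uses only standard tools; the file names are made up, and a literal CR is kept in a shell variable in case the sed(1) in use does not understand "\r".

```shell
#!/bin/sh
CR=$(printf '\r')   # a literal carriage return
tmpdir=$(mktemp -d)

# A file with mixed EOLs: line 1 ends with CR-LF, line 2 with LF only.
printf 'one\r\ntwo\n' > "$tmpdir/mixed.txt"

# Unify to MSDOS style: add CR only to lines that do not already end with it.
sed -e "/$CR\$/!s/\$/$CR/" "$tmpdir/mixed.txt" > "$tmpdir/dos.txt"

# Convert to Unix style by deleting every CR.
tr -d '\r' < "$tmpdir/dos.txt" > "$tmpdir/unix.txt"

# Inspect the results; od -c shows CR as \r and LF as \n.
od -c "$tmpdir/dos.txt"
od -c "$tmpdir/unix.txt"
```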

11.1.5. TAB conversion

There are a few popular specialized programs to convert the tab codes.

Table 11.4. List of TAB conversion commands from bsdmainutils and coreutils packages

function	`bsdmainutils`	`coreutils`
expand tab to spaces	"`col -x`"	`expand`
unexpand tab from spaces	"`col -h`"	`unexpand`

indent(1) from the indent package completely reformats whitespace in C programs.

Editor programs such as vim and emacs can be used for TAB conversion, too. For example with vim, you can expand TAB with ":set expandtab" and ":%retab" command sequence. You can revert this with ":set noexpandtab" and ":%retab!" command sequence.
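
As a quick illustration of the coreutils commands in Table 11.4, the following sketch expands a TAB with 4-column tab stops and converts leading spaces back to a TAB. The sample input strings are made up.

```shell
#!/bin/sh
# Expand each TAB to spaces, using tab stops every 4 columns.
printf 'a\tb\n' | expand -t 4

# Convert leading runs of spaces back to TABs (default tab stops are
# every 8 columns); od -c makes the resulting TAB visible as \t.
printf '        indented\n' | unexpand | od -c
```

By default unexpand(1) touches only leading blanks; its "-a" option converts blanks inside the line as well.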

11.1.6. Editors with auto-conversion

Intelligent modern editors such as the vim program are quite smart and cope well with any encoding system and any file format. You should use these editors under the UTF-8 locale in the UTF-8 capable console for the best compatibility.

An old western European Unix text file, "u-file.txt", stored in the latin1 (iso-8859-1) encoding can be edited simply with vim by the following.

$ vim u-file.txt

This is possible since the auto detection mechanism of the file encoding in vim assumes the UTF-8 encoding first and, if it fails, assumes it to be latin1.

An old Polish Unix text file, "pu-file.txt", stored in the latin2 (iso-8859-2) encoding can be edited with vim by the following.

$ vim '+e ++enc=latin2 pu-file.txt'

An old Japanese unix text file, "ju-file.txt", stored in the eucJP encoding can be edited with vim by the following.

$ vim '+e ++enc=eucJP ju-file.txt'

An old Japanese MS-Windows text file, "jw-file.txt", stored in the so called shift-JIS encoding (more precisely: CP932) can be edited with vim by the following.

$ vim '+e ++enc=CP932 ++ff=dos jw-file.txt'

When a file is opened with "++enc" and "++ff" options, ":w" in the Vim command line stores it in the original format and overwrites the original file. You can also specify the saving format and the file name in the Vim command line, e.g., ":w ++enc=utf8 new.txt".

Please refer to the mbyte.txt "multi-byte text support" in vim on-line help and Table 11.2, “List of encoding values and their usage” for locale values used with "++enc".

The emacs family of programs can perform the equivalent functions.

11.1.7. Plain text extraction

The following reads a web page into a text file. This is very useful when copying configurations off the Web or applying basic Unix text tools such as grep(1) on the web page.

$ w3m -dump http://www.remote-site.com/help-info.html >textfile

Similarly, you can extract plain text data from other formats using the following.

Table 11.5. List of tools to extract plain text data

package	popcon	size	keyword	function
`w3m`	http://qa.debian.org/popcon.php?package=w3m	1825	html→text	HTML to text converter with the "`w3m -dump`" command
`html2text`	http://qa.debian.org/popcon.php?package=html2text	248	html→text	advanced HTML to text converter (ISO 8859-1)
`lynx`	http://qa.debian.org/popcon.php?package=lynx	242	html→text	HTML to text converter with the "`lynx -dump`" command
`elinks`	http://qa.debian.org/popcon.php?package=elinks	1364	html→text	HTML to text converter with the "`elinks -dump`" command
`links`	http://qa.debian.org/popcon.php?package=links	1275	html→text	HTML to text converter with the "`links -dump`" command
`links2`	http://qa.debian.org/popcon.php?package=links2	3092	html→text	HTML to text converter with the "`links2 -dump`" command
`antiword`	http://qa.debian.org/popcon.php?package=antiword	560	MSWord→text,ps	convert MSWord files to plain text or ps
`catdoc`	http://qa.debian.org/popcon.php?package=catdoc	2668	MSWord→text,TeX	convert MSWord files to plain text or TeX
`pstotext`	http://qa.debian.org/popcon.php?package=pstotext	123	ps/pdf→text	extract text from PostScript and PDF files
`unhtml`	http://qa.debian.org/popcon.php?package=unhtml	76	html→text	remove the markup tags from an HTML file
`odt2txt`	http://qa.debian.org/popcon.php?package=odt2txt	73	odt→text	converter from OpenDocument Text to text

11.1.8. Highlighting and formatting plain text data

You can highlight and format plain text data by the following.

Table 11.6. List of tools to highlight plain text data

package	popcon	size	keyword	description
`vim-runtime`	http://qa.debian.org/popcon.php?package=vim-runtime	22298	highlight	Vim MACRO to convert source code to HTML with "`:source $VIMRUNTIME/syntax/html.vim`"
`cxref`	http://qa.debian.org/popcon.php?package=cxref	1115	c→html	converter for the C program to latex and HTML (C language)
`src2tex`	http://qa.debian.org/popcon.php?package=src2tex	1968	highlight	convert many source codes to TeX (C language)
`source-highlight`	http://qa.debian.org/popcon.php?package=source-highlight	1939	highlight	convert many source codes to HTML, XHTML, LaTeX, Texinfo, ANSI color escape sequences and DocBook files with highlight (C++)
`highlight`	http://qa.debian.org/popcon.php?package=highlight	726	highlight	convert many source codes to HTML, XHTML, RTF, LaTeX, TeX or XSL-FO files with highlight (C++)
`grc`	http://qa.debian.org/popcon.php?package=grc	232	text→color	generic colouriser for everything (Python)
`txt2html`	http://qa.debian.org/popcon.php?package=txt2html	296	text→html	text to HTML converter (Perl)
`markdown`	http://qa.debian.org/popcon.php?package=markdown	96	text→html	markdown text document formatter to (X)HTML (Perl)
`asciidoc`	http://qa.debian.org/popcon.php?package=asciidoc	3165	text→any	AsciiDoc text document formatter to XML/HTML (Python)
`python-docutils`	http://qa.debian.org/popcon.php?package=python-docutils	1548	text→any	ReStructured Text document formatter to XML (Python)
`txt2tags`	http://qa.debian.org/popcon.php?package=txt2tags	1152	text→any	document conversion from text to HTML, SGML, LaTeX, man page, MoinMoin, Magic Point and PageMaker (Python)
`udo`	http://qa.debian.org/popcon.php?package=udo	556	text→any	universal document - text processing utility (C language)
`stx2any`	http://qa.debian.org/popcon.php?package=stx2any	484	text→any	document converter from structured plain text to other formats (m4)
`rest2web`	http://qa.debian.org/popcon.php?package=rest2web	576	text→html	document converter from ReStructured Text to html (Python)
`aft`	http://qa.debian.org/popcon.php?package=aft	259	text→any	"free form" document preparation system (Perl)
`yodl`	http://qa.debian.org/popcon.php?package=yodl	409	text→any	pre-document language and tools to process it (C language)
`sdf`	http://qa.debian.org/popcon.php?package=sdf	1414	text→any	simple document parser (Perl)
`sisu`	http://qa.debian.org/popcon.php?package=sisu	9149	text→any	document structuring, publishing and search framework (Ruby)

11.2. XML data

The Extensible Markup Language (XML) is a markup language for documents containing structured information.

See introductory information at XML.COM.

11.2.1. Basic hints for XML

XML text looks somewhat like HTML. It enables us to manage multiple formats of output for a document. One easy XML system is the docbook-xsl package, which is used here.

Each XML file starts with the standard XML declaration as follows.

<?xml version="1.0" encoding="UTF-8"?>

The basic syntax for one XML element is marked up as follows.

<name attribute="value">content</name>

An XML element with empty content is marked up in the following short form.

<name attribute="value"/>

The "attribute="value"" in the above examples is optional.

The comment section in XML is marked up as follows.

<!-- comment -->

Other than adding markups, XML requires minor conversion to the content using predefined entities for the following characters.

Table 11.7. List of predefined entities for XML

predefined entity	character to be converted from
`&quot;`	`"` : quote
`&apos;`	`'` : apostrophe
`&lt;`	`<` : less-than
`&gt;`	`>` : greater-than
`&amp;`	`&` : ampersand

	Caution
	"`<`" or "`&`" can not be used in attributes or elements.

	Note
	When SGML style user defined entities, e.g. "`&some-tag;`", are used, the first definition wins over others. The entity definition is expressed in "`<!ENTITY some-tag "entity value">`".

	Note
	As long as the XML markup is done consistently with a certain set of tag names (either some data as content or attribute value), conversion to another XML is a trivial task using Extensible Stylesheet Language Transformations (XSLT).
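
The conversion of content with the predefined entities in Table 11.7 can be sketched as a small sed(1) filter. The function name "xml_escape" is made up for this example, and the filter makes no attempt to recognize text that is already escaped.

```shell
#!/bin/sh
# xml_escape: read text on stdin and write it with the five predefined
# entities substituted. "&" must be handled first so that the "&" of the
# other entities is not escaped again.
xml_escape() {
  sed -e 's/&/\&amp;/g' \
      -e 's/</\&lt;/g' \
      -e 's/>/\&gt;/g' \
      -e 's/"/\&quot;/g' \
      -e "s/'/\\&apos;/g"
}

printf '%s\n' 'if (a < b && c > "d")' | xml_escape
```

In the sed(1) replacement text a bare "&" stands for the matched string, which is why every "&" in the entities is written as "\&".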

11.2.2. XML processing

Many tools are available to process XML files, such as those based on the Extensible Stylesheet Language (XSL).

Basically, once you create a well formed XML file, you can convert it to any format using Extensible Stylesheet Language Transformations (XSLT).

The Extensible Stylesheet Language for Formatting Objects (XSL-FO) is supposed to be the solution for formatting. The fop package is still in the Debian contrib (not main) archive. So the LaTeX code is usually generated from XML using XSLT, and the LaTeX system is used to create printable files such as DVI, PostScript, and PDF.

Table 11.8. List of XML tools

package	popcon	size	keyword	description
`docbook-xml`	http://qa.debian.org/popcon.php?package=docbook-xml	2488	xml	XML document type definition (DTD) for DocBook
`xsltproc`	http://qa.debian.org/popcon.php?package=xsltproc	165	xslt	XSLT command line processor (XML→ XML, HTML, plain text, etc.)
`docbook-xsl`	http://qa.debian.org/popcon.php?package=docbook-xsl	11589	xml/xslt	XSL stylesheets for processing DocBook XML to various output formats with XSLT
`xmlto`	http://qa.debian.org/popcon.php?package=xmlto	134	xml/xslt	XML-to-any converter with XSLT
`dblatex`	http://qa.debian.org/popcon.php?package=dblatex	6799	xml/xslt	convert Docbook files to DVI, PostScript, PDF documents with XSLT
`fop`	http://qa.debian.org/popcon.php?package=fop	90	xml/xsl-fo	convert Docbook XML files to PDF
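
As a minimal sketch of XSLT processing with xsltproc(1) from Table 11.8, the following creates a tiny XML file and a stylesheet which extracts one attribute as plain text. The element names, attribute names, and file names are made up for this example.

```shell
#!/bin/sh
tmpdir=$(mktemp -d)

# A tiny well formed XML document (hypothetical element and attribute names).
cat > "$tmpdir/list.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<list>
  <item name="apple"/>
  <item name="banana"/>
</list>
EOF

# A stylesheet that prints the "name" attribute of every item, one per line.
cat > "$tmpdir/names.xsl" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="list/item">
      <xsl:value-of select="@name"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
EOF

# Apply the stylesheet if xsltproc is installed.
if command -v xsltproc >/dev/null 2>&1; then
  xsltproc "$tmpdir/names.xsl" "$tmpdir/list.xml"
else
  echo "xsltproc not found; install the xsltproc package" >&2
fi
```

Changing the "method" of xsl:output to "xml" or "html" and emitting elements instead of text is the usual way to produce another XML or HTML document from the same input.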

Since XML is a subset of Standard Generalized Markup Language (SGML), it can be processed by the extensive tools available for SGML, such as Document Style Semantics and Specification Language (DSSSL).

Table 11.9. List of DSSSL tools

package	popcon	size	keyword	description
`openjade`	http://qa.debian.org/popcon.php?package=openjade	1063	dsssl	ISO/IEC 10179:1996 standard DSSSL processor (latest)
`openjade1.3`	http://qa.debian.org/popcon.php?package=openjade1.3	2226	dsssl	ISO/IEC 10179:1996 standard DSSSL processor (1.3.x series)
`jade`	http://qa.debian.org/popcon.php?package=jade	872	dsssl	James Clark's original DSSSL processor (1.2.x series)
`docbook-dsssl`	http://qa.debian.org/popcon.php?package=docbook-dsssl	3100	xml/dsssl	DSSSL stylesheets for processing DocBook XML to various output formats with DSSSL
`docbook-utils`	http://qa.debian.org/popcon.php?package=docbook-utils	220	xml/dsssl	utilities for DocBook files including conversion to other formats (HTML, RTF, PS, man, PDF) with `docbook2*` commands with DSSSL
`sgml2x`	http://qa.debian.org/popcon.php?package=sgml2x	216	SGML/dsssl	converter from SGML and XML using DSSSL stylesheets

	Tip
	GNOME's `yelp` is sometimes handy to read DocBook XML files directly since it renders decently on X.

11.2.3. The XML data extraction

You can extract HTML or XML data from other formats using the following.

Table 11.10. List of XML data extraction tools

package	popcon	size	keyword	description
`wv`	http://qa.debian.org/popcon.php?package=wv	351	MSWord→any	document converter from Microsoft Word to HTML, LaTeX, etc.
`texi2html`	http://qa.debian.org/popcon.php?package=texi2html	2076	texi→html	converter from Texinfo to HTML
`man2html`	http://qa.debian.org/popcon.php?package=man2html	180	manpage→html	converter from manpage to HTML (CGI support)
`tex4ht`	http://qa.debian.org/popcon.php?package=tex4ht	515	tex↔html	converter between (La)TeX and HTML
`xlhtml`	http://qa.debian.org/popcon.php?package=xlhtml	184	MSExcel→html	converter from MSExcel .xls to HTML
`ppthtml`	http://qa.debian.org/popcon.php?package=ppthtml	120	MSPowerPoint→html	converter from MSPowerPoint to HTML
`unrtf`	http://qa.debian.org/popcon.php?package=unrtf	224	rtf→html	document converter from RTF to HTML, etc
`info2www`	http://qa.debian.org/popcon.php?package=info2www	156	info→html	converter from GNU info to HTML (CGI support)
`ooo2dbk`	http://qa.debian.org/popcon.php?package=ooo2dbk	941	sxw→xml	converter from OpenOffice.org SXW documents to DocBook XML
`wp2x`	http://qa.debian.org/popcon.php?package=wp2x	156	WordPerfect→any	WordPerfect 5.0 and 5.1 files to TeX, LaTeX, troff, GML and HTML
`doclifter`	http://qa.debian.org/popcon.php?package=doclifter	460	troff→xml	converter from troff to DocBook XML

For non-XML HTML files, you can convert them to XHTML which is an instance of well formed XML. XHTML can be processed by XML tools.

Table 11.11. List of XML pretty print tools

package	popcon	size	keyword	description
`libxml2-utils`	http://qa.debian.org/popcon.php?package=libxml2-utils	139	xml↔html↔xhtml	command line XML tool with xmllint(1) (syntax check, reformat, lint, …)
`tidy`	http://qa.debian.org/popcon.php?package=tidy	82	xml↔html↔xhtml	HTML syntax checker and reformatter

Once proper XML is generated, you can use XSLT technology to extract data based on the mark-up context etc.
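
Before applying XSLT, it may help to confirm that a file is really well formed. The following sketch does this with xmllint(1) from Table 11.11; the function name "is_wellformed" and the sample files are made up for this example.

```shell
#!/bin/sh
# is_wellformed FILE: succeed only if xmllint can parse FILE as XML.
is_wellformed() {
  xmllint --noout "$1" 2>/dev/null
}

tmpdir=$(mktemp -d)
printf '<a><b>text</b></a>\n' > "$tmpdir/good.xml"
printf '<a><b>text</a>\n'     > "$tmpdir/bad.xml"    # <b> is never closed

if command -v xmllint >/dev/null 2>&1; then
  for f in "$tmpdir"/*.xml; do
    if is_wellformed "$f"; then
      echo "well formed:     $f"
    else
      echo "not well formed: $f"
    fi
  done
  # Re-indent the good file for easier reading.
  xmllint --format "$tmpdir/good.xml"
fi
```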

11.3. Printable data

Printable data is expressed in the PostScript format on the Debian system. Common Unix Printing System (CUPS) uses Ghostscript as its rasterizer backend program for non-PostScript printers.

11.3.1. Ghostscript

The core of printable data manipulation is the Ghostscript PostScript (PS) interpreter, which generates raster images.

The upstream Ghostscript from Artifex was re-licensed from AFPL to GPL at the 8.60 release, which also merged all the latest ESP version changes, such as the CUPS related ones, into a unified release.

Table 11.12. List of Ghostscript PostScript interpreters

package	popcon	size	description
`ghostscript`	http://qa.debian.org/popcon.php?package=ghostscript	198	The GPL Ghostscript PostScript/PDF interpreter
`ghostscript-x`	http://qa.debian.org/popcon.php?package=ghostscript-x	193	GPL Ghostscript PostScript/PDF interpreter - X display support
`gs-cjk-resource`	http://qa.debian.org/popcon.php?package=gs-cjk-resource	4528	resource files for gs-cjk, Ghostscript CJK-TrueType extension
`cmap-adobe-cns1`	http://qa.debian.org/popcon.php?package=cmap-adobe-cns1	1572	CMaps for Adobe-CNS1 (for traditional Chinese support)
`cmap-adobe-gb1`	http://qa.debian.org/popcon.php?package=cmap-adobe-gb1	1552	CMaps for Adobe-GB1 (for simplified Chinese support)
`cmap-adobe-japan1`	http://qa.debian.org/popcon.php?package=cmap-adobe-japan1	2428	CMaps for Adobe-Japan1 (for Japanese standard support)
`cmap-adobe-japan2`	http://qa.debian.org/popcon.php?package=cmap-adobe-japan2	416	CMaps for Adobe-Japan2 (for Japanese extra support)
`cmap-adobe-korea1`	http://qa.debian.org/popcon.php?package=cmap-adobe-korea1	872	CMaps for Adobe-Korea1 (for Korean support)
`libpoppler13`	http://qa.debian.org/popcon.php?package=libpoppler13	2377	PDF rendering library based on xpdf PDF viewer
`libpoppler-glib6`	http://qa.debian.org/popcon.php?package=libpoppler-glib6	577	PDF rendering library (GLib-based shared library)
`poppler-data`	http://qa.debian.org/popcon.php?package=poppler-data	12240	CMaps for PDF rendering library (for CJK support: Adobe-*)

	Tip
	"`gs -h`" can display the configuration of Ghostscript.

11.3.2. Merge two PS or PDF files

You can merge two PostScript (PS) or Portable Document Format (PDF) files using gs(1) of Ghostscript.

$ gs -q -dNOPAUSE -dBATCH -sDEVICE=pswrite -sOutputFile=bla.ps -f foo1.ps foo2.ps
$ gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=bla.pdf -f foo1.pdf foo2.pdf

	Note
	PDF, which is a widely used cross-platform printable data format, is essentially the compressed PS format with a few additional features and extensions.

	Tip
	For command line, psmerge(1) and other commands from the `psutils` package are useful for manipulating PostScript documents. Commands in the `pdfjam` package work similarly for manipulating PDF documents. pdftk(1) from the `pdftk` package is useful for manipulating PDF documents, too.

11.3.3. Printable data utilities

The following packages for the printable data utilities caught my eye.

Table 11.13. List of printable data utilities

package	popcon	size	keyword	description
`poppler-utils`	http://qa.debian.org/popcon.php?package=poppler-utils	542	pdf→ps,text,…	PDF utilities: `pdftops`, `pdfinfo`, `pdfimages`, `pdftotext`, `pdffonts`
`psutils`	http://qa.debian.org/popcon.php?package=psutils	243	ps→ps	PostScript document conversion tools
`poster`	http://qa.debian.org/popcon.php?package=poster	80	ps→ps	create large posters out of PostScript pages
`enscript`	http://qa.debian.org/popcon.php?package=enscript	2147	text→ps, html, rtf	convert ASCII text to PostScript, HTML, RTF or Pretty-Print
`a2ps`	http://qa.debian.org/popcon.php?package=a2ps	4292	text→ps	'Anything to PostScript' converter and pretty-printer
`pdftk`	http://qa.debian.org/popcon.php?package=pdftk	3039	pdf→pdf	PDF document conversion tool: `pdftk`
`mpage`	http://qa.debian.org/popcon.php?package=mpage	224	text,ps→ps	print multiple pages per sheet
`html2ps`	http://qa.debian.org/popcon.php?package=html2ps	320	html→ps	converter from HTML to PostScript
`pdfjam`	http://qa.debian.org/popcon.php?package=pdfjam	122	pdf→pdf	PDF document conversion tools: `pdf90`, `pdfjoin`, and `pdfnup`
`gnuhtml2latex`	http://qa.debian.org/popcon.php?package=gnuhtml2latex	53	html→latex	converter from html to latex
`latex2rtf`	http://qa.debian.org/popcon.php?package=latex2rtf	508	latex→rtf	convert documents from LaTeX to RTF which can be read by MS Word
`ps2eps`	http://qa.debian.org/popcon.php?package=ps2eps	136	ps→eps	converter from PostScript to EPS (Encapsulated PostScript)
`e2ps`	http://qa.debian.org/popcon.php?package=e2ps	188	text→ps	Text to PostScript converter with Japanese encoding support
`impose+`	http://qa.debian.org/popcon.php?package=impose+	180	ps→ps	PostScript utilities
`trueprint`	http://qa.debian.org/popcon.php?package=trueprint	188	text→ps	pretty print many source codes (C, C++, Java, Pascal, Perl, Pike, Sh, and Verilog) to PostScript. (C language)
`pdf2svg`	http://qa.debian.org/popcon.php?package=pdf2svg	60	pdf→svg	converter from PDF to Scalable vector graphics format
`pdftoipe`	http://qa.debian.org/popcon.php?package=pdftoipe	91	pdf→ipe	converter from PDF to IPE's XML format

11.3.4. Printing with CUPS

Both lp(1) and lpr(1) commands offered by the Common Unix Printing System (CUPS) provide options for customized printing of printable data.

You can print 3 copies of a file collated using one of the following commands.

$ lp -n 3 -o Collate=True filename

$ lpr -#3 -o Collate=True filename

You can further customize printer operation by using printer options such as "-o number-up=2", "-o page-set=even", "-o page-set=odd", "-o scaling=200", "-o natural-scaling=200", etc., documented at Command-Line Printing and Options.
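
As one possible use of the "page-set" option, the following sketch prints the odd pages and then the even pages as two separate jobs, e.g. for manual double-sided printing. The function name "print_both_sides" and the DRY_RUN variable are made up for this example; by default the commands are only shown, not executed.

```shell
#!/bin/sh
# print_both_sides FILE: send the odd pages and the even pages of FILE
# as two separate print jobs.
# Set DRY_RUN=0 to really call lp(1); by default the commands are only shown.
print_both_sides() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    run=echo
  else
    run=
  fi
  $run lp -o page-set=odd "$1"
  $run lp -o page-set=even "$1"
}

print_both_sides filename
```

How the paper has to be turned over and reloaded between the two jobs depends on the printer, so a test with a short document is advisable.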

11.4. Type setting

The Unix troff program originally developed by AT&T can be used for simple typesetting. It is usually used to create manpages.

TeX created by Donald Knuth is a very powerful typesetting tool and is the de facto standard. LaTeX originally written by Leslie Lamport enables high-level access to the power of TeX.

Table 11.14. List of type setting tools

package	popcon	size	keyword	description
`texlive`	http://qa.debian.org/popcon.php?package=texlive	103	(La)TeX	TeX system for typesetting, previewing and printing
`groff`	http://qa.debian.org/popcon.php?package=groff	8095	troff	GNU troff text-formatting system

11.4.1. roff typesetting

Traditionally, roff is the main Unix text processing system. See roff(7), groff(7), groff(1), grotty(1), troff(1), groff_mdoc(7), groff_man(7), groff_ms(7), groff_me(7), groff_mm(7), and "info groff".

You can read or print a good tutorial and reference on "-me" macro in "/usr/share/doc/groff/" by installing the groff package.

	Tip
	"`groff -Tascii -me -`" produces plain text output with ANSI escape code. If you wish to get manpage like output with many "^H" and "_", use "`GROFF_NO_SGR=1 groff -Tascii -me -`" instead.

	Tip
	To remove "^H" and "_" from a text file generated by `groff`, filter it by "`col -b -x`".

11.4.2. TeX/LaTeX

The TeX Live software distribution offers a complete TeX system. The texlive metapackage provides a decent selection of the TeX Live packages which should suffice for the most common tasks.

There are many references available for TeX and LaTeX.

The teTeX HOWTO: The Linux-teTeX Local Guide
tex(1)
latex(1)
"The TeXbook", by Donald E. Knuth, (Addison-Wesley)
"LaTeX - A Document Preparation System", by Leslie Lamport, (Addison-Wesley)
"The LaTeX Companion", by Goossens, Mittelbach, Samarin, (Addison-Wesley)

This is the most powerful typesetting environment. Many SGML processors use this as their back end text processor. LyX provided by the lyx package and GNU TeXmacs provided by the texmacs package offer a nice WYSIWYG editing environment for LaTeX, while many use Emacs or Vim as the source editor.

There are many online resources available.

The TeX Live Guide - TeX Live 2007 ("/usr/share/doc/texlive-doc-base/english/texlive-en/live.html") (texlive-doc-base package)
A Simple Guide to Latex/Lyx
Word Processing Using LaTeX
Local User Guide to teTeX/LaTeX

When documents become bigger, sometimes TeX may cause errors. You must increase pool size in "/etc/texmf/texmf.cnf" (or more appropriately edit "/etc/texmf/texmf.d/95NonPath" and run update-texmf(8)) to fix this.

	Note
	The TeX source of "The TeXbook" is available at http://tug.ctan.org/tex-archive/systems/knuth/dist/tex/texbook.tex.

This file contains most of the required macros. I heard that you can process this document with tex(1) after commenting lines 7 to 10 and adding "\input manmac \proofmodefalse". It's strongly recommended to buy this book (and all other books from Donald E. Knuth) instead of using the online version but the source is a great example of TeX input!

11.4.3. Pretty print a manual page

You can print a manual page in PostScript nicely by one of the following commands.

$ man -Tps some_manpage | lpr

$ man -Tps some_manpage | mpage -2 | lpr

The second example prints 2 pages on one sheet.

11.4.4. Creating a manual page

Although writing a manual page (manpage) in the plain troff format is possible, there are a few helper packages to create it.

Table 11.15. List of packages to help creating the manpage

package	popcon	size	keyword	description
`docbook-to-man`	http://qa.debian.org/popcon.php?package=docbook-to-man	213	SGML→manpage	converter from DocBook SGML into roff man macros
`help2man`	http://qa.debian.org/popcon.php?package=help2man	485	text→manpage	automatic manpage generator from --help
`info2man`	http://qa.debian.org/popcon.php?package=info2man	161	info→manpage	converter from GNU info to POD or man pages
`txt2man`	http://qa.debian.org/popcon.php?package=txt2man	88	text→manpage	convert flat ASCII text to man page format

11.5. The mail data conversion

The following packages for the mail data conversion caught my eye.

Table 11.16. List of packages to help mail data conversion

package	popcon	size	keyword	description
`sharutils`	http://qa.debian.org/popcon.php?package=sharutils	1408	mail	shar(1), unshar(1), uuencode(1), uudecode(1)
`mpack`	http://qa.debian.org/popcon.php?package=mpack	109	MIME	encoder and decoder for MIME messages: mpack(1) and munpack(1)
`tnef`	http://qa.debian.org/popcon.php?package=tnef	132	ms-tnef	unpacking MIME attachments of type "application/ms-tnef" which is a Microsoft only format
`uudeview`	http://qa.debian.org/popcon.php?package=uudeview	117	mail	encoder and decoder for the following formats: uuencode, xxencode, BASE64, quoted printable, and BinHex
`readpst`	http://qa.debian.org/popcon.php?package=readpst	228	PST	convert Microsoft Outlook PST files to mbox format

	Tip
	The Internet Message Access Protocol version 4 (IMAP4) server (see Section 6.7, “POP3/IMAP4 server”) may be used to move mails out from proprietary mail systems if the mail client software can be configured to use IMAP4 server too.

11.5.1. Mail data basics

Mail (SMTP) data should be limited to 7 bits. So binary data and 8 bit text data are encoded into 7 bit format with the Multipurpose Internet Mail Extensions (MIME) and the selection of the charset (see Section 8.3.1, “Basics of encoding”).

The standard mail storage format is mbox, with messages formatted according to RFC2822 (updated RFC822). See mbox(5) (provided by the mutt package).

For European languages, "Content-Transfer-Encoding: quoted-printable" with the ISO-8859-1 charset is usually used for mail since there are not many 8 bit characters. If European text is encoded in UTF-8, "Content-Transfer-Encoding: quoted-printable" is likely to be used since it is mostly 7 bit data.

For Japanese, traditionally "Content-Type: text/plain; charset=ISO-2022-JP" is usually used for mail to keep text in 7 bits. But older Microsoft systems may send mail data in Shift-JIS without proper declaration. If Japanese text is encoded in UTF-8, Base64 is likely to be used since it consists mostly of 8 bit data. The situation of other Asian languages is similar.
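
The effect of Base64 on such text can be seen with a short sketch. It uses base64(1) from the coreutils package purely as an illustration (real mail is normally encoded by the mail client or tools such as mpack(1)), and the sample string is made up.

```shell
#!/bin/sh
# "日本語" encoded in UTF-8 is nine bytes, all with the 8th bit set.
text=$(printf '\346\227\245\346\234\254\350\252\236')

# Base64 turns them into plain 7 bit ASCII characters ...
encoded=$(printf '%s' "$text" | base64)
echo "$encoded"

# ... and decoding restores the original bytes.
printf '%s\n' "$encoded" | base64 -d | od -An -tx1
```

Every 3 input bytes become 4 output characters, so the nine bytes above are encoded as twelve ASCII characters.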

	Note
	If your non-Unix mail data is accessible by a non-Debian client software which can talk to the IMAP4 server, you may be able to move them out by running your own IMAP4 server (see Section 6.7, “POP3/IMAP4 server”).

	Note
	If you use other mail storage formats, moving them to mbox format is a good first step. A versatile client program such as mutt(1) may be handy for this.

You can split mailbox contents into individual messages using procmail(1) and formail(1).
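
The splitting can be sketched with formail(1) alone as follows. The function name split_mbox and the "msg.NNNN" file naming are choices made for this illustration only.

```shell
# split_mbox MBOX OUTDIR: write each message of MBOX to OUTDIR/msg.NNNN
# (split_mbox and the msg.NNNN naming are made up for this sketch.)
split_mbox() {
  mkdir -p "$2" || return 1
  # "formail -s" starts the given command once per message with that
  # message on its standard input.  formail exports the running message
  # number as FILENO; presetting it to 0000 asks for 4 digit numbers.
  OUTDIR=$2 FILENO=0000 formail -s sh -c 'cat > "$OUTDIR/msg.$FILENO"' < "$1"
}
```

With this function defined, you can split a mailbox by the following.

$ split_mbox ~/mbox ~/mbox-split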

Each mail message can be unpacked using munpack(1) from the mpack package (or other specialized tools) to obtain the MIME encoded contents.
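
Unpacking a single message file can be sketched as follows. The function name unpack_message is made up for this illustration, and the names of the extracted files are decided by munpack(1) from the message itself.

```shell
# unpack_message MSGFILE OUTDIR: decode the MIME parts of one message
# into OUTDIR.  (unpack_message is a name made up for this sketch.)
unpack_message() {
  mkdir -p "$2" || return 1
  # munpack writes into the current directory and reads standard input
  # when no file name is given.  -t also writes text parts to files;
  # -q suppresses informational messages.
  ( cd "$2" && munpack -q -t ) < "$1"
}
```

With this function defined, you can unpack a message by the following.

$ unpack_message message.eml unpacked/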

11.6. Graphic data tools

The following packages for graphic data conversion, editing, and organization caught my eye.

Table 11.17. List of graphic data tools

package	popcon	size	keyword	description
`gimp`	http://qa.debian.org/popcon.php?package=gimp	15168	image(bitmap)	GNU Image Manipulation Program
`imagemagick`	http://qa.debian.org/popcon.php?package=imagemagick	207	image(bitmap)	image manipulation programs
`graphicsmagick`	http://qa.debian.org/popcon.php?package=graphicsmagick	4335	image(bitmap)	image manipulation programs (fork of `imagemagick`)
`xsane`	http://qa.debian.org/popcon.php?package=xsane	702	image(bitmap)	GTK+-based X11 frontend for SANE (Scanner Access Now Easy)
`netpbm`	http://qa.debian.org/popcon.php?package=netpbm	3464	image(bitmap)	graphics conversion tools
`icoutils`	http://qa.debian.org/popcon.php?package=icoutils	160	png↔ico(bitmap)	convert MS Windows icons and cursors to and from PNG formats (favicon.ico)
`scribus`	http://qa.debian.org/popcon.php?package=scribus	54492	ps/pdf/SVG/…	Scribus DTP editor
`openoffice.org-draw`	http://qa.debian.org/popcon.php?package=openoffice.org-draw	164	image(vector)	OpenOffice.org office suite - drawing
`inkscape`	http://qa.debian.org/popcon.php?package=inkscape	80425	image(vector)	SVG (Scalable Vector Graphics) editor
`dia-gnome`	http://qa.debian.org/popcon.php?package=dia-gnome	617	image(vector)	diagram editor (GNOME)
`dia`	http://qa.debian.org/popcon.php?package=dia	617	image(vector)	diagram editor (Gtk)
`xfig`	http://qa.debian.org/popcon.php?package=xfig	1597	image(vector)	facility for Interactive Generation of figures under X11
`pstoedit`	http://qa.debian.org/popcon.php?package=pstoedit	683	ps/pdf→image(vector)	PostScript and PDF files to editable vector graphics converter (SVG)
`libwmf-bin`	http://qa.debian.org/popcon.php?package=libwmf-bin	118	Windows/image(vector)	Windows metafile (vector graphic data) conversion tools
`fig2sxd`	http://qa.debian.org/popcon.php?package=fig2sxd	200	fig→sxd(vector)	convert XFig files to OpenOffice.org Draw format
`unpaper`	http://qa.debian.org/popcon.php?package=unpaper	736	image→image	post-processing tool for scanned pages for OCR
`tesseract-ocr`	http://qa.debian.org/popcon.php?package=tesseract-ocr	435	image→text	free OCR software based on HP's commercial OCR engine
`tesseract-ocr-eng`	http://qa.debian.org/popcon.php?package=tesseract-ocr-eng	58870	image→text	OCR engine data: tesseract-ocr language files for English text
`gocr`	http://qa.debian.org/popcon.php?package=gocr	473	image→text	free OCR software
`ocrad`	http://qa.debian.org/popcon.php?package=ocrad	255	image→text	free OCR software
`gtkam`	http://qa.debian.org/popcon.php?package=gtkam	1255	image(Exif)	manipulate digital camera photo files (GNOME) - GUI
`gphoto2`	http://qa.debian.org/popcon.php?package=gphoto2	1036	image(Exif)	manipulate digital camera photo files (GNOME) - command line
`kamera`	http://qa.debian.org/popcon.php?package=kamera	245	image(Exif)	manipulate digital camera photo files (KDE)
`jhead`	http://qa.debian.org/popcon.php?package=jhead	126	image(Exif)	manipulate the non-image part of Exif compliant JPEG (digital camera photo) files
`exif`	http://qa.debian.org/popcon.php?package=exif	212	image(Exif)	command-line utility to show EXIF information in JPEG files
`exiftags`	http://qa.debian.org/popcon.php?package=exiftags	198	image(Exif)	utility to read Exif tags from a digital camera JPEG file
`exiftran`	http://qa.debian.org/popcon.php?package=exiftran	91	image(Exif)	transform digital camera jpeg images
`exifprobe`	http://qa.debian.org/popcon.php?package=exifprobe	484	image(Exif)	read metadata from digital pictures
`dcraw`	http://qa.debian.org/popcon.php?package=dcraw	424	image(Raw)→ppm	decode raw digital camera images
`findimagedupes`	http://qa.debian.org/popcon.php?package=findimagedupes	123	image→fingerprint	find visually similar or duplicate images
`ale`	http://qa.debian.org/popcon.php?package=ale	757	image→image	merge images to increase fidelity or create mosaics
`imageindex`	http://qa.debian.org/popcon.php?package=imageindex	171	image(Exif)→html	generate static HTML galleries from images
`f-spot`	http://qa.debian.org/popcon.php?package=f-spot	8219	image(Exif)	personal photo management application (GNOME)
`bins`	http://qa.debian.org/popcon.php?package=bins	2008	image(Exif)→html	generate static HTML photo albums using XML and EXIF tags
`gallery2`	http://qa.debian.org/popcon.php?package=gallery2	46635	image(Exif)→html	generate browsable HTML photo albums with thumbnails
`outguess`	http://qa.debian.org/popcon.php?package=outguess	252	jpeg,png	universal Steganographic tool
`qcad`	http://qa.debian.org/popcon.php?package=qcad	31	DXF	CAD data editor (KDE)
`blender`	http://qa.debian.org/popcon.php?package=blender	57306	blend, TIFF, VRML, …	3D content editor for animation etc
`mm3d`	http://qa.debian.org/popcon.php?package=mm3d	5171	ms3d, obj, dxf, …	OpenGL based 3D model editor
`open-font-design-toolkit`	http://qa.debian.org/popcon.php?package=open-font-design-toolkit	27	ttf, ps, …	metapackage for open font design
`fontforge`	http://qa.debian.org/popcon.php?package=fontforge	6696	ttf, ps, …	font editor for PS, TrueType and OpenType fonts
`xgridfit`	http://qa.debian.org/popcon.php?package=xgridfit	1060	ttf	program for gridfitting and hinting TrueType fonts

	Tip
	Search more image tools using regex "`~Gworks-with::image`" in aptitude(8) (see Section 2.2.6, “Search method options with aptitude”).

Although GUI programs such as gimp(1) are very powerful, command line tools such as imagemagick(1) are quite useful for automating image manipulation with scripts.
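
As an illustration of such automation, the following sketch creates PNG thumbnails for all JPEG files in a directory. It assumes the convert(1) command of the `imagemagick` package; the function name make_thumbnails and the 160x160 size are arbitrary choices for this example.

```shell
# make_thumbnails SRCDIR DSTDIR: write a PNG thumbnail to DSTDIR for
# every *.jpg file in SRCDIR.  (Hypothetical helper for this sketch.)
make_thumbnails() {
  mkdir -p "$2" || return 1
  for f in "$1"/*.jpg; do
    [ -e "$f" ] || continue   # the glob matched nothing
    # -resize 160x160 fits the image into a 160x160 box while
    # keeping its aspect ratio.
    convert "$f" -resize 160x160 "$2/$(basename "$f" .jpg).png"
  done
}
```

With this function defined, you can create thumbnails by the following.

$ make_thumbnails photos/ thumbnails/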

The de facto image file format of digital cameras is the Exchangeable Image File Format (EXIF), which is the JPEG image file format with additional metadata tags. It can hold information such as date, time, and camera settings.

The Lempel-Ziv-Welch (LZW) lossless data compression patent has expired. Graphics Interchange Format (GIF) utilities which use the LZW compression method are now freely available on the Debian system.

	Tip
	Any digital camera or scanner with removable recording media works with Linux through USB storage readers since it follows the Design rule for Camera File system and uses the FAT filesystem. See Section 10.1.10, “Removable storage device”.
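
Since such media keep their images under the "DCIM" directory at the top of the filesystem, copying photos from a mounted card can be sketched as follows. The function name import_photos is made up for this illustration, and the sketch assumes that file names are unique across the subdirectories of "DCIM".

```shell
# import_photos MOUNTPOINT DESTDIR: copy JPEG files from a mounted
# camera card into DESTDIR.  (Hypothetical helper for this sketch.)
import_photos() {
  src=$1/DCIM
  [ -d "$src" ] || { echo "no DCIM directory under $1" >&2; return 1; }
  mkdir -p "$2" || return 1
  # Match both *.JPG and *.jpg; cp -p keeps the timestamps.
  find "$src" -type f -iname '*.jpg' -exec cp -p {} "$2"/ \;
}
```

With this function defined and the card mounted, you can copy photos by the following. The mount point here is only an example.

$ import_photos /media/usb0 ~/photos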

11.7. Miscellaneous data conversion

There are many other programs for converting data. The following packages caught my eye while searching with the regex "~Guse::converting" in aptitude(8) (see Section 2.2.6, “Search method options with aptitude”).

Table 11.18. List of miscellaneous data conversion tools

package	popcon	size	keyword	description
`alien`	http://qa.debian.org/popcon.php?package=alien	209	rpm/tgz→deb	converter for the foreign package into the Debian package
`freepwing`	http://qa.debian.org/popcon.php?package=freepwing	568	EB→EPWING	converter from "Electric Book" (popular in Japan) to a single JIS X 4081 format (a subset of the EPWING V1)

You can also extract data from RPM format with the following.

$ rpm2cpio file.src.rpm | cpio --extract