Character sets

CHARACTER_SET

# CHARACTER_SET defines the display character set, i.e., the one assumed to be
# installed on the user's terminal.  It determines which characters or strings
# will be used to represent 8-bit character entities within HTML.  New
# character sets may be defined as explained in the README files of the
# src/chrtrans directory in the Lynx source code distribution.  For Asian (CJK)
# character sets, it also determines how Kanji code will be handled.  The
# default is defined in userdefs.h and can be changed here or via the
# 'o'ptions menu.  The 'o'ptions menu setting will be stored in the user's RC
# file whenever those settings are saved, and thereafter will be used as the
# default.  For Lynx a "character set" has two names:  a MIME name (for
# recognizing properly labeled charset parameters in HTTP headers etc.), and a
# human-readable string for the 'O'ptions Menu (which may name the language or
# group of languages in addition to the MIME name).  Not all 'human-readable'
# names correspond to exactly one valid MIME charset (an example is "Chinese");
# in that case an appropriate valid (and more specific) MIME name should be
# used where required.  Well-known synonyms are also processed in the code.
#
# Raw (CJK) mode
#
# Lynx normally translates characters from a document's charset to the display
# charset, using the ASSUME_CHARSET value (see below) if the document's charset
# is not specified explicitly.  Raw (CJK) mode is OFF for this case.
# When the document charset is specified explicitly, that charset
# overrides any assumption like ASSUME_CHARSET or raw (CJK) mode.
#
# For the Asian (CJK) display character sets, the corresponding charset is
# assumed in documents, i.e., raw (CJK) mode is ON by default.  In raw CJK
# mode, 8-bit characters are not reverse translated in relation to the entity
# conversion arrays, i.e., they are assumed to be appropriate for the display
# character set.  The mode should be toggled OFF when an Asian (CJK) display
# character set is selected but the document is not CJK and its charset is not
# specified explicitly.
#
# Raw (CJK) mode may be toggled by the user via the '@' (LYK_RAW_TOGGLE) key,
# the -raw command line switch, or the 'o'ptions menu.
#
# Raw (CJK) mode effectively changes the charset assumption about unlabeled
# documents.  You can toggle raw mode ON if you believe the document has a
# charset which does correspond to your Display Character Set.  On the other
# hand, if you set ASSUME_CHARSET the same as Display Character Set you get raw
# mode ON by default (but you get assume_charset=iso-8859-1 if you try raw mode
# OFF after it).
#
# Note that "raw" does not mean that every byte will be passed to the screen.
# HTML character entities may get expanded and translated, inappropriate
# control characters filtered out, etc.  There is a "Transparent" pseudo
# character set for more "rawness".
#
# Since Lynx now supports a wide range of platforms, it may be useful to note
# the cpXXX codepages used by IBM PC compatible computers, and the windows-xxxx
# codepages used by native MS-Windows apps.  Note also that cpXXX pages are
# rarely found on the Internet; they mostly serve local needs on DOS.
#
# Recognized character sets include:
#
#
#    string for 'O'ptions Menu          MIME name
#    ===========================        =========
#    7 bit approximations (US-ASCII)    us-ascii
#    Western (ISO-8859-1)               iso-8859-1
#    Western (ISO-8859-15)              iso-8859-15
#    Western (cp850)                    cp850
#    Western (windows-1252)             windows-1252
#    IBM PC US codepage (cp437)         cp437
#    DEC Multinational                  dec-mcs
#    Macintosh (8 bit)                  macintosh
#    NeXT character set                 next
#    HP Roman8                          hp-roman8
#    Chinese                            euc-cn
#    Japanese (EUC-JP)                  euc-jp
#    Japanese (Shift_JIS)               shift_jis
#    Korean                             euc-kr
#    Taipei (Big5)                      big5
#    Vietnamese (VISCII)                viscii
#    Eastern European (ISO-8859-2)      iso-8859-2
#    Eastern European (cp852)           cp852
#    Eastern European (windows-1250)    windows-1250
#    Latin 3 (ISO-8859-3)               iso-8859-3
#    Latin 4 (ISO-8859-4)               iso-8859-4
#    Baltic Rim (ISO-8859-13)           iso-8859-13
#    Baltic Rim (cp775)                 cp775
#    Baltic Rim (windows-1257)          windows-1257
#    Celtic (ISO-8859-14)               iso-8859-14
#    Cyrillic (ISO-8859-5)              iso-8859-5
#    Cyrillic (cp866)                   cp866
#    Cyrillic (windows-1251)            windows-1251
#    Cyrillic (KOI8-R)                  koi8-r
#    Arabic (ISO-8859-6)                iso-8859-6
#    Arabic (cp864)                     cp864
#    Arabic (windows-1256)              windows-1256
#    Greek (ISO-8859-7)                 iso-8859-7
#    Greek (cp737)                      cp737
#    Greek2 (cp869)                     cp869
#    Greek (windows-1253)               windows-1253
#    Hebrew (ISO-8859-8)                iso-8859-8
#    Hebrew (cp862)                     cp862
#    Hebrew (windows-1255)              windows-1255
#    Turkish (ISO-8859-9)               iso-8859-9
#    North European (ISO-8859-10)       iso-8859-10
#    Ukrainian Cyrillic (cp866u)        cp866u
#    Ukrainian Cyrillic (KOI8-U)        koi8-u
#    UNICODE (UTF-8)                    utf-8
#    RFC 1345 w/o Intro                 mnemonic+ascii+0
#    RFC 1345 Mnemonic                  mnemonic
#    Transparent                        x-transparent
#
#
# The value should be the MIME name of a character set recognized by
# Lynx (case insensitive).
# Find RFC 1345 at http://www.ics.uci.edu/pub/ietf/uri/rfc1345.txt .
#
#CHARACTER_SET:iso-8859-1
CHARACTER_SET:utf-8


LOCALE_CHARSET

# LOCALE_CHARSET overrides CHARACTER_SET if true, using the current locale to
# look up the corresponding MIME name and use it as the display charset.
#
# Note that while nl_langinfo(CODESET) itself is standardized, the return
# values and their relationship to the locale value is not.  GNU libiconv
# happens to give useful values, but other implementations are not guaranteed
# to do this.
#LOCALE_CHARSET:FALSE


ASSUME_CHARSET

# ASSUME_CHARSET changes the handling of documents which do not
# explicitly specify a charset.  Normally Lynx assumes that 8-bit
# characters in those documents are encoded according to iso-8859-1
# (the official default for the HTTP protocol).  When ASSUME_CHARSET
# is defined here, or an -assume_charset command line flag is in effect,
# Lynx will treat documents as if they were encoded accordingly.
# See above on how this interacts with "raw mode" and the Display
# Character Set.
# ASSUME_CHARSET can also be changed via the 'o'ptions menu, but it will not
# be saved as a permanent value in the user's .lynxrc file, to avoid more chaos.
#
#ASSUME_CHARSET:iso-8859-1


ASSUMED_DOC_CHARSET_CHOICE

DISPLAY_CHARSET_CHOICE

# It is possible to reduce the number of charset choices in the 'O'ptions menu
# for the "display charset" and "assumed document charset" fields via the
# DISPLAY_CHARSET_CHOICE and ASSUMED_DOC_CHARSET_CHOICE settings, respectively.
# Each of these settings can be used several times to define the set of possible
# choices for the corresponding field.  The syntax for the values is
#
#	string | prefix* | *
#
# where
#
#	'string' is either the MIME name of a charset or its full name (listed
#		in the right or the left column, respectively, of the table of
#		recognized charsets), case-insensitive - e.g.  'Koi8-R' or
#		'Cyrillic (KOI8-R)' (both without quotes),
#
#	'prefix' is any string; such a value selects all charsets whose name
#		begins with the given prefix (case insensitive).  For example,
#		given the charsets listed in the table of recognized charsets,
#
#		ASSUMED_DOC_CHARSET_CHOICE:cyrillic*
#
#		will be equal to specifying
#
#		ASSUMED_DOC_CHARSET_CHOICE:cp866
#		ASSUMED_DOC_CHARSET_CHOICE:windows-1251
#		ASSUMED_DOC_CHARSET_CHOICE:koi8-r
#		ASSUMED_DOC_CHARSET_CHOICE:iso-8859-5
#
#		or lines with full names of charsets.
#
#	literal string '*' (without quotes) will enable all charset choices
#		in the corresponding field.  This is useful for overriding
#		site defaults in private pieces of lynx.cfg included via the
#		INCLUDE directive.
#
# Default values for both settings are '*', but any occurrence of settings
# with values that denote any charsets will make only listed choices available
# for corresponding field.
#ASSUMED_DOC_CHARSET_CHOICE:*
#DISPLAY_CHARSET_CHOICE:*
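
# The matching rules above (string | prefix* | *) can be illustrated with a
# minimal Python sketch.  This is not Lynx's own code: the function name
# select_charsets and the shortened CHARSETS table are hypothetical, and
# matching a prefix against both the full name and the MIME name is an
# assumption made for illustration.

```python
# Subset of the recognized-charsets table: (options-menu name, MIME name).
CHARSETS = [
    ("Western (ISO-8859-1)", "iso-8859-1"),
    ("Cyrillic (ISO-8859-5)", "iso-8859-5"),
    ("Cyrillic (cp866)", "cp866"),
    ("Cyrillic (windows-1251)", "windows-1251"),
    ("Cyrillic (KOI8-R)", "koi8-r"),
    ("UNICODE (UTF-8)", "utf-8"),
]


def select_charsets(values, table=CHARSETS):
    """Return the MIME names enabled by a list of *_CHARSET_CHOICE values."""
    if not values:
        values = ["*"]  # the documented default: every charset is offered
    selected = []
    for full_name, mime in table:
        names = (full_name.lower(), mime.lower())
        for value in values:
            v = value.strip().lower()
            if v == "*":
                hit = True  # literal '*' enables all choices
            elif v.endswith("*"):
                # 'prefix*': assumed to match either name of the charset
                hit = any(n.startswith(v[:-1]) for n in names)
            else:
                hit = v in names  # exact MIME name or exact full name
            if hit:
                selected.append(mime)
                break
    return selected


print(select_charsets(["cyrillic*"]))
```

# Under these assumptions, the value cyrillic* selects the four Cyrillic
# entries of the shortened table, in table order.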


ASSUME_LOCAL_CHARSET

# ASSUME_LOCAL_CHARSET is like ASSUME_CHARSET but only applies to local
# files.  If no setting is given here or by an -assume_local_charset
# command line option, the value for ASSUME_CHARSET or -assume_charset
# is used.  It works for both text/plain and text/html files.
# This option will ignore "raw mode" toggling when local files are viewed
# (it is "stronger" than "assume_charset" or the effective change
# of the charset assumption caused by changing "raw mode"),
# so only use when necessary.
#
#ASSUME_LOCAL_CHARSET:iso-8859-1
ASSUME_LOCAL_CHARSET:utf-8
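
# The precedence among an explicit document charset, ASSUME_LOCAL_CHARSET,
# raw mode and ASSUME_CHARSET described in the sections above can be
# summarized in a short Python sketch.  It is a simplified reading of this
# documentation, not Lynx's actual logic; the function name
# effective_doc_charset and its parameters are hypothetical.

```python
def effective_doc_charset(explicit, is_local, raw_mode, display_charset,
                          assume_charset=None, assume_local_charset=None):
    """Guess which charset a document is treated as being encoded in."""
    # An explicitly specified charset overrides every assumption.
    if explicit:
        return explicit.lower()
    # For local files, ASSUME_LOCAL_CHARSET is "stronger" than raw mode.
    if is_local and assume_local_charset:
        return assume_local_charset.lower()
    # Raw mode ON: the document is assumed to match the display charset.
    if raw_mode:
        return display_charset.lower()
    # Raw mode toggled OFF while ASSUME_CHARSET equals the display charset:
    # the text above says the assumption falls back to iso-8859-1.
    if assume_charset and assume_charset.lower() == display_charset.lower():
        return "iso-8859-1"
    # Otherwise use ASSUME_CHARSET, or the iso-8859-1 default.
    return (assume_charset or "iso-8859-1").lower()


print(effective_doc_charset(None, True, True, "euc-jp",
                            assume_local_charset="utf-8"))
```

# Here a local, unlabeled file is treated as utf-8 even though raw mode is
# ON for a euc-jp display, because ASSUME_LOCAL_CHARSET takes priority.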


PREPEND_CHARSET_TO_SOURCE

# PREPEND_CHARSET_TO_SOURCE:TRUE tells Lynx to prepend a META CHARSET line
# to text/html source files when they are retrieved for 'd'ownloading
# or passed to 'p'rint functions, so that the charset information carried in
# the HTTP headers will not be lost.  This is necessary for resolving the
# charset of local html files, since assume_local_charset is just an assumption.
# For the 'd'ownload option, a META CHARSET will be added only if the HTTP
# charset is present.  The compilation default is TRUE.
# It is generally desirable to have charset information for every local
# html file, but META CHARSET string potentially could cause
# compatibility problems with other browsers, see also PREPEND_BASE_TO_SOURCE.
# Note that the prepending is not done for -source dumps.
#
#PREPEND_CHARSET_TO_SOURCE:TRUE
PREPEND_CHARSET_TO_SOURCE:TRUE


NCR_IN_BOOKMARKS

# NCR_IN_BOOKMARKS:TRUE allows you to save 8-bit characters in bookmark titles
# as Unicode numeric character references (NCR).  This may be useful if you
# need to switch display charsets frequently, as happens when you use Lynx on
# different platforms, e.g., on UNIX and from a remote PC, and want the same
# bookmarks file to remain usable.
# Another aspect is compatibility:  NCR is part of I18N and HTML4.0
# specifications supported starting with Lynx 2.7.2, Netscape 4.0 and MSIE 4.0.
# Older browser versions will fail so keep NCR_IN_BOOKMARKS:FALSE if you
# plan to use them.
#
#NCR_IN_BOOKMARKS:FALSE


FORCE_8BIT_TOUPPER

# FORCE_8BIT_TOUPPER overrides locale settings and uses internal 8-bit
# case-conversion mechanism for case-insensitive searches in non-ASCII display
# character sets.  It is FALSE by default and should not be changed unless
# you encounter problems with case-insensitive searches.
#
#FORCE_8BIT_TOUPPER:FALSE


OUTGOING_MAIL_CHARSET

# While Lynx supports different platforms and display character sets
# we need to limit the charset in outgoing mail to reduce
# trouble for remote recipients who may not recognize our charset.
# You may try US-ASCII as the safest value (7 bit), any other MIME name,
# or leave this field blank (default) to use the display character set.
# Charset translations currently are implemented for mail "subjects= " only.
#
#OUTGOING_MAIL_CHARSET:
OUTGOING_MAIL_CHARSET:us-ascii


ASSUME_UNREC_CHARSET

# If Lynx encounters a charset parameter it doesn't recognize, it will
# substitute the value given by ASSUME_UNREC_CHARSET (or a corresponding
# -assume_unrec_charset command line option) for it.  This can be used
# to deal with charsets unknown to Lynx, if they are "sufficiently
# similar" to one that Lynx does know about, by forcing the same
# treatment.  There is no default, and you probably should leave this
# undefined unless necessary.
#
#ASSUME_UNREC_CHARSET:iso-8859-1


PREFERRED_LANGUAGE

# PREFERRED_LANGUAGE is the language in MIME notation (e.g., "en",
# "fr") which will be indicated by Lynx in its Accept-Language headers
# as the preferred language.  If available, the document will be
# transmitted in that language.  Users can override this setting via
# the 'o'ptions menu and save that preference in their RC file.
# This may be a comma-separated list of languages in decreasing preference.
#
#PREFERRED_LANGUAGE:en
PREFERRED_LANGUAGE:en


PREFERRED_CHARSET

# PREFERRED_CHARSET specifies the character set in MIME notation (e.g.,
# "ISO-8859-2", "ISO-8859-5") which Lynx will indicate you prefer in
# requests to http servers using an Accept-Charset header.  Users can
# change it via the 'o'ptions menu and save that preference in their RC file.
# The value should NOT include "ISO-8859-1" or "US-ASCII",
# since those values are always assumed by default.
# If a file in that character set is available, the server will send it.
# If no Accept-Charset header is present, the default is that any
# character set is acceptable.  If an Accept-Charset header is present,
# and if the server cannot send a response which is acceptable
# according to the Accept-Charset header, then the server SHOULD send
# an error response with the 406 (not acceptable) status code, though
# the sending of an unacceptable response is also allowed.  See RFC 2068
# (http://www.ics.uci.edu/pub/ietf/uri/rfc2068.txt).
#
#PREFERRED_CHARSET:


CHARSETS_DIRECTORY

# CHARSETS_DIRECTORY specifies the directory with the fonts (glyph data)
# used by Lynx to switch the display-font to a font best suited for the
# given document.  The font should be in a format understood by the
# platform's TTY-display-font-switching API.  Currently supported on OS/2 only.
#
# Lynx expects the glyphs for the charset CHARSET with character cell
# size HHHxWWW to be stored in a file HHHxWWW/CHARSET.fnt inside the directory
# specified by CHARSETS_DIRECTORY.  E.g., the font for koi8-r sized 14x9
# should be in the file 14x9/koi8-r.fnt.
#
#CHARSETS_DIRECTORY:
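
# The file layout described above (HHHxWWW/CHARSET.fnt inside
# CHARSETS_DIRECTORY) can be sketched in Python as follows.  The function
# name font_file and the directory /fonts are hypothetical, and forward
# slashes are used only to keep the illustration platform-neutral.

```python
import posixpath


def font_file(charsets_directory, charset, height, width):
    """Build the expected font path for a charset and character cell size."""
    size_dir = f"{height}x{width}"  # the HHHxWWW subdirectory, e.g. 14x9
    return posixpath.join(charsets_directory, size_dir, f"{charset}.fnt")


print(font_file("/fonts", "koi8-r", 14, 9))  # /fonts/14x9/koi8-r.fnt
```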


CHARSET_SWITCH_RULES

# CHARSET_SWITCH_RULES gives Lynx hints on how to choose the best display font
# for a given document encoding.  This string is a sequence of chunks, each chunk
# having the following form:
#
# IN_CHARSET1 IN_CHARSET2 ... IN_CHARSET5 :OUT_CHARSET
#
# For readability, one may insert arbitrary additional punctuation (anything
# but : is ignored).  E.g., if lynx is able to switch only to display charsets
# cp866, cp850, cp852, and cp862, then the following setting may be useful
# (split for readability):
#
# CHARSET_SWITCH_RULES: koi8-r ISO-8859-5 windows-1251 cp866u KOI8-U :cp866,
#	iso-8859-1 windows-1252 ISO-8859-15 :cp850,
#	ISO-8859-2 windows-1250 :cp852,
#	ISO-8859-8 windows-1255 :cp862
#
#CHARSET_SWITCH_RULES:
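
# One possible reading of the chunk syntax above, sketched in Python: names
# accumulate as input charsets until a name prefixed with ':' is seen, which
# becomes the output charset for all of them.  This is not Lynx's parser;
# the function name parse_switch_rules and the set of characters accepted in
# a charset name are assumptions.

```python
import re


def parse_switch_rules(rules):
    """Map each input charset (lowercased) to its output display charset."""
    mapping = {}
    pending = []  # input charsets waiting for their ':OUT_CHARSET'
    # Each token is a charset name, optionally introduced by ':'.
    # Any other punctuation (commas, etc.) is skipped.
    for colon, name in re.findall(r"(:?)\s*([A-Za-z0-9][\w.+-]*)", rules):
        name = name.lower()
        if colon:
            for src in pending:
                mapping[src] = name
            pending = []
        else:
            pending.append(name)
    return mapping


RULES = """koi8-r ISO-8859-5 windows-1251 cp866u KOI8-U :cp866,
    iso-8859-1 windows-1252 ISO-8859-15 :cp850,
    ISO-8859-2 windows-1250 :cp852,
    ISO-8859-8 windows-1255 :cp862"""

table = parse_switch_rules(RULES)
print(table["koi8-u"], table["iso-8859-15"])  # cp866 cp850
```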

