add bytevector->string and string->bytevector in new (ice-9 iconv) module

* module/Makefile.am:
* module/ice-9/iconv.scm: New module implementing procedures to encode
  and decode representations of strings as bytes.

* test-suite/Makefile.am:
* test-suite/tests/iconv.test: Add tests.

* doc/ref/api-data.texi: Add docs.
Andy Wingo 2013-01-10 22:50:27 +01:00
commit f05bb8494c
5 changed files with 277 additions and 6 deletions

@ -1,6 +1,6 @@
@c -*-texinfo-*-
@c This is part of the GNU Guile Reference Manual.
@c Copyright (C) 1996, 1997, 2000, 2001, 2002, 2003, 2004, 2006, 2007, 2008, 2009, 2010, 2011, 2012
@c Copyright (C) 1996, 1997, 2000, 2001, 2002, 2003, 2004, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013
@c Free Software Foundation, Inc.
@c See the file guile.texi for copying conditions.
@ -2881,6 +2881,7 @@ Guile provides all procedures of SRFI-13 and a few more.
* Reversing and Appending Strings:: Appending strings to form a new string.
* Mapping Folding and Unfolding:: Iterating over strings.
* Miscellaneous String Operations:: Replicating, insertion, parsing, ...
* Representing Strings as Bytes:: Encoding and decoding strings.
* Conversion to/from C::
* String Internals:: The storage strategy for strings.
@end menu
@ -4163,6 +4164,70 @@ a predicate, if it is a character, it is tested for equality and if it
is a character set, it is tested for membership.
@end deffn
@node Representing Strings as Bytes
@subsubsection Representing Strings as Bytes

Out in the cold world outside of Guile, not all strings are treated in
the same way. Out there, there are only bytes, and there are many ways
of representing a string (a sequence of characters) as binary data (a
sequence of bytes).

As a user, usually you don't have to think about this very much. When
you type on your keyboard, your system encodes your keystrokes as bytes
according to the locale that you have configured on your computer.
Guile uses the locale to decode those bytes back into characters --
hopefully the same characters that you typed in.

All is not so clear when dealing with a system with multiple users, such
as a web server. Your web server might get a request from one user for
data encoded in the ISO-8859-1 character set, and then another request
from a different user for UTF-8 data.

@cindex iconv
@cindex character encoding
Guile provides an @dfn{iconv} module for converting between strings and
sequences of bytes. @xref{Bytevectors}, for more on how Guile
represents raw byte sequences. This module gets its name from the
common @sc{unix} command of the same name.

Unlike the other procedures in this section, those in the @code{iconv}
module are not available until you load the module:

@example
(use-modules (ice-9 iconv))
@end example

@deffn {Scheme Procedure} string->bytevector string encoding [#:conversion-strategy='error]
Encode @var{string} as a sequence of bytes.

The string will be encoded in the character set specified by the
@var{encoding} string. If the string has characters that cannot be
represented in the encoding, by default this procedure raises an
@code{encoding-error}, though the @code{#:conversion-strategy} keyword
can specify other behaviors.

The return value is a bytevector. @xref{Bytevectors}, for more on
bytevectors. @xref{Ports}, for more on character encodings and
conversion strategies.
@end deffn
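
As a brief sketch, the same string yields different bytes under
different encodings. The byte values below are the ones checked by the
test suite added in this commit; the @code{'substitute} strategy in the
last form is one of the port conversion strategies, and its exact output
is left unstated because it depends on the underlying conversion.

@example
(use-modules (ice-9 iconv))

(string->bytevector "été" "latin1")
@result{} #vu8(233 116 233)

(string->bytevector "été" "utf-8")
@result{} #vu8(195 169 116 195 169)

;; The letter é has no ASCII representation, so by default encoding
;; to ASCII raises an encoding-error.  Asking for substitution
;; returns a bytevector instead of raising.
(string->bytevector "été" "ascii" #:conversion-strategy 'substitute)
@end example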
@deffn {Scheme Procedure} bytevector->string bytevector encoding [#:conversion-strategy='error]
Decode @var{bytevector} into a string.

The bytes will be decoded from the character set specified by the
@var{encoding} string. If the bytes do not form a valid encoding, by
default this procedure raises a @code{decoding-error}, though that may
be overridden with the @code{#:conversion-strategy} keyword.
@xref{Ports}, for more on character encodings and conversion strategies.
@end deffn
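
For instance, decoding reverses the encoding step, provided the same
encoding is named. The following sketch reuses the byte values from the
test suite added in this commit; the fallback symbol @code{invalid} is
only an illustration, not part of the module.

@example
(use-modules (ice-9 iconv))

(bytevector->string #vu8(233 116 233) "latin1")
@result{} "été"

;; The same bytes are not well-formed UTF-8, so decoding them as
;; UTF-8 raises a decoding-error, which can be caught like any
;; other exception key.
(catch 'decoding-error
  (lambda ()
    (bytevector->string #vu8(233 116 233) "utf-8"))
  (lambda (key . args)
    'invalid))
@end example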
@deffn {Scheme Procedure} call-with-encoded-output-string encoding proc [#:conversion-strategy='error]
Like @code{call-with-output-string}, but instead of returning a string,
returns an encoding of the string according to @var{encoding}, as a
bytevector. This procedure can be more efficient than collecting a
string and then converting it via @code{string->bytevector}.
@end deffn
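
A minimal sketch of its use: @var{proc} receives an output port, and
whatever it writes to that port comes back as encoded bytes. The
expected value matches what @code{string->bytevector} returns for the
same string and encoding in this commit's tests.

@example
(use-modules (ice-9 iconv))

(call-with-encoded-output-string "latin1"
  (lambda (port)
    (display "été" port)))
@result{} #vu8(233 116 233)
@end example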
@node Conversion to/from C
@subsubsection Conversion to/from C
@ -4172,9 +4237,9 @@ important.
In C, a string is just a sequence of bytes, and the character encoding
describes the relation between these bytes and the actual characters
that make up the string. For Scheme strings, character encoding is
not an issue (most of the time), since in Scheme you never get to see
the bytes, only the characters.
that make up the string. For Scheme strings, character encoding is not
an issue (most of the time), since in Scheme you usually treat strings
as character sequences, not byte sequences.
Converting to C and converting from C each have their own challenges.
@ -4305,6 +4370,9 @@ into @var{encoding}.
If @var{lenp} is @code{NULL}, this function will return a null-terminated C
string. It will throw an error if the string contains a null
character.
The Scheme interface to this function is @code{string->bytevector}, from the
@code{(ice-9 iconv)} module. @xref{Representing Strings as Bytes}.
@end deftypefn
@deftypefn {C Function} SCM scm_from_stringn (const char *str, size_t len, const char *encoding, scm_t_string_failed_conversion_handler handler)
@ -4313,6 +4381,9 @@ length in bytes of the C string is input as @var{len}. The encoding of the C
string is passed as the ASCII, null-terminated C string @code{encoding}.
The @var{handler} parameter suggests a strategy for dealing with
unconvertible characters.
The Scheme interface to this function is @code{bytevector->string}, from
the @code{(ice-9 iconv)} module. @xref{Representing Strings as Bytes}.
@end deftypefn
The following conversion functions are provided as a convenience for the
@ -4810,6 +4881,7 @@ the host's native endianness.
Bytevector contents can also be interpreted as Unicode strings encoded
in one of the most commonly available encoding formats.
@xref{Representing Strings as Bytes}, for a more generic interface.
@lisp
(utf8->string (u8-list->bytevector '(99 97 102 101)))

@ -1,6 +1,6 @@
## Process this file with automake to produce Makefile.in.
##
## Copyright (C) 2009, 2010, 2011, 2012 Free Software Foundation, Inc.
## Copyright (C) 2009, 2010, 2011, 2012, 2013 Free Software Foundation, Inc.
##
## This file is part of GUILE.
##
@ -210,6 +210,7 @@ ICE_9_SOURCES = \
ice-9/getopt-long.scm \
ice-9/hcons.scm \
ice-9/i18n.scm \
ice-9/iconv.scm \
ice-9/lineio.scm \
ice-9/ls.scm \
ice-9/mapping.scm \

module/ice-9/iconv.scm

@ -0,0 +1,82 @@
;;; Encoding and decoding byte representations of strings
;; Copyright (C) 2013 Free Software Foundation, Inc.
;;;; This library is free software; you can redistribute it and/or
;;;; modify it under the terms of the GNU Lesser General Public
;;;; License as published by the Free Software Foundation; either
;;;; version 3 of the License, or (at your option) any later version.
;;;;
;;;; This library is distributed in the hope that it will be useful,
;;;; but WITHOUT ANY WARRANTY; without even the implied warranty of
;;;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
;;;; Lesser General Public License for more details.
;;;;
;;;; You should have received a copy of the GNU Lesser General Public
;;;; License along with this library; if not, write to the Free Software
;;;; Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
;;; Code:
(define-module (ice-9 iconv)
  #:use-module (rnrs bytevectors)
  #:use-module (ice-9 binary-ports)
  #:use-module ((ice-9 rdelim) #:select (read-delimited))
  #:export (string->bytevector
            bytevector->string
            call-with-encoded-output-string))

;; Like call-with-output-string, but actually closes the port.
(define (call-with-output-string* proc)
  (let ((port (open-output-string)))
    (proc port)
    (let ((str (get-output-string port)))
      (close-port port)
      str)))

(define (call-with-output-bytevector* proc)
  (call-with-values (lambda () (open-bytevector-output-port))
    (lambda (port get-bytevector)
      (proc port)
      (let ((bv (get-bytevector)))
        (close-port port)
        bv))))

(define* (call-with-encoded-output-string encoding proc
                                          #:key (conversion-strategy 'error))
  (if (string-ci=? encoding "utf-8")
      ;; I don't know why, but this appears to be faster; at least for
      ;; serving examples/debug-sxml.scm (1464 reqs/s versus 850
      ;; reqs/s).
      (string->utf8 (call-with-output-string* proc))
      (call-with-output-bytevector*
       (lambda (port)
         (set-port-encoding! port encoding)
         (if conversion-strategy
             (set-port-conversion-strategy! port conversion-strategy))
         (proc port)))))

;; TODO: Provide C implementations that call scm_from_stringn and
;; friends?

(define* (string->bytevector str encoding #:key (conversion-strategy 'error))
  (if (string-ci=? encoding "utf-8")
      (string->utf8 str)
      (call-with-encoded-output-string
       encoding
       (lambda (port)
         (display str port))
       #:conversion-strategy conversion-strategy)))

(define* (bytevector->string bv encoding #:key (conversion-strategy 'error))
  (if (string-ci=? encoding "utf-8")
      (utf8->string bv)
      (let ((p (open-bytevector-input-port bv)))
        (set-port-encoding! p encoding)
        (if conversion-strategy
            (set-port-conversion-strategy! p conversion-strategy))
        (let ((res (read-delimited "" p)))
          (close-port p)
          (if (eof-object? res)
              ""
              res)))))

@ -1,7 +1,7 @@
## Process this file with automake to produce Makefile.in.
##
## Copyright 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009,
## 2010, 2011, 2012 Software Foundation, Inc.
## 2010, 2011, 2012, 2013 Software Foundation, Inc.
##
## This file is part of GUILE.
##
@ -62,6 +62,7 @@ SCM_TESTS = tests/00-initial-env.test \
tests/hash.test \
tests/hooks.test \
tests/i18n.test \
tests/iconv.test \
tests/import.test \
tests/interp.test \
tests/keywords.test \

test-suite/tests/iconv.test

@ -0,0 +1,115 @@
;;;; iconv.test --- Exercise the iconv API. -*- coding: utf-8; mode: scheme; -*-
;;;;
;;;; Copyright (C) 2013 Free Software Foundation, Inc.
;;;; Andy Wingo
;;;;
;;;; This library is free software; you can redistribute it and/or
;;;; modify it under the terms of the GNU Lesser General Public
;;;; License as published by the Free Software Foundation; either
;;;; version 3 of the License, or (at your option) any later version.
;;;;
;;;; This library is distributed in the hope that it will be useful,
;;;; but WITHOUT ANY WARRANTY; without even the implied warranty of
;;;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
;;;; Lesser General Public License for more details.
;;;;
;;;; You should have received a copy of the GNU Lesser General Public
;;;; License along with this library; if not, write to the Free Software
;;;; Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
(define-module (test-suite iconv)
  #:use-module (ice-9 iconv)
  #:use-module (rnrs bytevectors)
  #:use-module (test-suite lib))

(define exception:encoding-error
  '(encoding-error . ""))

(define exception:decoding-error
  '(decoding-error . ""))

(with-test-prefix "ascii string"
  (let ((s "Hello, World!"))
    ;; For ASCII, all of these encodings should be the same.

    (pass-if "to ascii bytevector"
      (equal? (string->bytevector s "ASCII")
              #vu8(72 101 108 108 111 44 32 87 111 114 108 100 33)))

    (pass-if "to ascii bytevector (length check)"
      (equal? (string-length s)
              (bytevector-length (string->bytevector s "ascii"))))

    (pass-if "from ascii bytevector"
      (equal? s
              (bytevector->string (string->bytevector s "ascii") "ascii")))

    (pass-if "to utf-8 bytevector"
      (equal? (string->bytevector s "ASCII")
              (string->bytevector s "utf-8")))

    (pass-if "to UTF-8 bytevector (testing encoding case sensitivity)"
      (equal? (string->bytevector s "ascii")
              (string->bytevector s "UTF-8")))

    (pass-if "from utf-8 bytevector"
      (equal? s
              (bytevector->string (string->bytevector s "utf-8") "utf-8")))

    (pass-if "to latin1 bytevector"
      (equal? (string->bytevector s "ASCII")
              (string->bytevector s "latin1")))

    ;; Round-trip through latin1, as the test name says.
    (pass-if "from latin1 bytevector"
      (equal? s
              (bytevector->string (string->bytevector s "latin1") "latin1")))))

(with-test-prefix "narrow non-ascii string"
  (let ((s "été"))
    (pass-if "to latin1 bytevector"
      (equal? (string->bytevector s "latin1")
              #vu8(233 116 233)))

    (pass-if "to latin1 bytevector (length check)"
      (equal? (string-length s)
              (bytevector-length (string->bytevector s "latin1"))))

    (pass-if "from latin1 bytevector"
      (equal? s
              (bytevector->string (string->bytevector s "latin1") "latin1")))

    (pass-if "to utf-8 bytevector"
      (equal? (string->bytevector s "utf-8")
              #vu8(195 169 116 195 169)))

    (pass-if "from utf-8 bytevector"
      (equal? s
              (bytevector->string (string->bytevector s "utf-8") "utf-8")))

    (pass-if-exception "encode latin1 as ascii" exception:encoding-error
      (string->bytevector s "ascii"))

    (pass-if-exception "misparse latin1 as utf8" exception:decoding-error
      (bytevector->string (string->bytevector s "latin1") "utf-8"))

    (pass-if-exception "misparse latin1 as ascii" exception:decoding-error
      (bytevector->string (string->bytevector s "latin1") "ascii"))))

(with-test-prefix "wide non-ascii string"
  (let ((s "ΧΑΟΣ"))
    (pass-if "to utf-8 bytevector"
      (equal? (string->bytevector s "utf-8")
              #vu8(206 167 206 145 206 159 206 163)))

    (pass-if "from utf-8 bytevector"
      (equal? s
              (bytevector->string (string->bytevector s "utf-8") "utf-8")))

    (pass-if-exception "encode as ascii" exception:encoding-error
      (string->bytevector s "ascii"))

    (pass-if-exception "encode as latin1" exception:encoding-error
      (string->bytevector s "latin1"))))