add bytevector->string and string->bytevector in new (ice-9 iconv) module
* module/Makefile.am:
* module/ice-9/iconv.scm: New module implementing procedures to encode
  and decode representations of strings as bytes.

* test-suite/Makefile.am:
* test-suite/tests/iconv.test: Add tests.

* doc/ref/api-data.texi: Add docs.
parent
b194b59fa1
commit
f05bb8494c
5 changed files with 277 additions and 6 deletions
|
|
@ -1,6 +1,6 @@
|
|||
@c -*-texinfo-*-
|
||||
@c This is part of the GNU Guile Reference Manual.
|
||||
@c Copyright (C) 1996, 1997, 2000, 2001, 2002, 2003, 2004, 2006, 2007, 2008, 2009, 2010, 2011, 2012
|
||||
@c Copyright (C) 1996, 1997, 2000, 2001, 2002, 2003, 2004, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013
|
||||
@c Free Software Foundation, Inc.
|
||||
@c See the file guile.texi for copying conditions.
|
||||
|
||||
|
|
@ -2881,6 +2881,7 @@ Guile provides all procedures of SRFI-13 and a few more.
|
|||
* Reversing and Appending Strings:: Appending strings to form a new string.
|
||||
* Mapping Folding and Unfolding:: Iterating over strings.
|
||||
* Miscellaneous String Operations:: Replicating, insertion, parsing, ...
|
||||
* Representing Strings as Bytes:: Encoding and decoding strings.
|
||||
* Conversion to/from C::
|
||||
* String Internals:: The storage strategy for strings.
|
||||
@end menu
|
||||
|
|
@ -4163,6 +4164,70 @@ a predicate, if it is a character, it is tested for equality and if it
|
|||
is a character set, it is tested for membership.
|
||||
@end deffn
|
||||
|
||||
@node Representing Strings as Bytes
|
||||
@subsubsection Representing Strings as Bytes
|
||||
|
||||
Out in the cold world outside of Guile, not all strings are treated in
|
||||
the same way. Out there there are only bytes, and there are many ways
|
||||
of representing strings (sequences of characters) as binary data
|
||||
(sequences of bytes).
|
||||
|
||||
As a user, usually you don't have to think about this very much. When
|
||||
you type on your keyboard, your system encodes your keystrokes as bytes
|
||||
according to the locale that you have configured on your computer.
|
||||
Guile uses the locale to decode those bytes back into characters --
|
||||
hopefully the same characters that you typed in.
|
||||
|
||||
All is not so clear when dealing with a system with multiple users, such
|
||||
as a web server. Your web server might get a request from one user for
|
||||
data encoded in the ISO-8859-1 character set, and then another request
|
||||
from a different user for UTF-8 data.
|
||||
|
||||
@cindex iconv
|
||||
@cindex character encoding
|
||||
Guile provides an @dfn{iconv} module for converting between strings and
|
||||
sequences of bytes. @xref{Bytevectors}, for more on how Guile
|
||||
represents raw byte sequences. This module gets its name from the
|
||||
common @sc{unix} command of the same name.
|
||||
|
||||
Unlike the rest of the procedures in this section, you have to load the
|
||||
@code{iconv} module before having access to these procedures:
|
||||
|
||||
@example
|
||||
(use-modules (ice-9 iconv))
|
||||
@end example
|
||||
|
||||
@deffn {Scheme Procedure} string->bytevector string encoding [#:conversion-strategy='error]
|
||||
Encode @var{string} as a sequence of bytes.
|
||||
|
||||
The string will be encoded in the character set specified by the
|
||||
@var{encoding} string. If the string has characters that cannot be
|
||||
represented in the encoding, by default this procedure raises an
|
||||
@code{encoding-error}, though the @code{#:conversion-strategy} keyword
|
||||
can specify other behaviors.
|
||||
|
||||
The return value is a bytevector. @xref{Bytevectors}, for more on
|
||||
bytevectors. @xref{Ports}, for more on character encodings and
|
||||
conversion strategies.
|
||||
@end deffn
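As a rough illustration of the encoding direction, the sketch below encodes one three-character string in two encodings and compares the byte counts. The variable names are invented for this example, and the byte values are the ones checked in test-suite/tests/iconv.test. The string is built from hex character literals so that the example does not depend on the encoding of the source file.

```scheme
(use-modules (ice-9 iconv)
             (rnrs bytevectors))

;; The string "ete" with e-acute (U+00E9) in the first and last positions.
(define s (string #\xe9 #\t #\xe9))

;; Latin-1 uses one byte per character here: 233 116 233.
(define latin1-bytes (string->bytevector s "latin1"))

;; UTF-8 needs two bytes for each U+00E9: 195 169 116 195 169.
(define utf8-bytes (string->bytevector s "utf-8"))

;; ASCII cannot represent U+00E9, so the default strategy raises an
;; encoding-error; `catch' turns that into a marker value.
(define ascii-result
  (catch 'encoding-error
    (lambda () (string->bytevector s "ascii"))
    (lambda (key . args) 'not-representable)))

(display (bytevector-length latin1-bytes)) (newline)
(display (bytevector-length utf8-bytes)) (newline)
```

The same string therefore occupies a different number of bytes depending on the encoding, which is why the encoding has to be named explicitly.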
|
||||
|
||||
@deffn {Scheme Procedure} bytevector->string bytevector encoding [#:conversion-strategy='error]
|
||||
Decode @var{bytevector} into a string.
|
||||
|
||||
The bytes will be decoded from the character set specified by the @var{encoding}
|
||||
string. If the bytes do not form a valid encoding, by default this
|
||||
procedure raises a @code{decoding-error}, though that may be overridden
|
||||
with the @code{#:conversion-strategy} keyword. @xref{Ports}, for more
|
||||
on character encodings and conversion strategies.
|
||||
@end deffn
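A minimal sketch of the decoding direction follows, again with invented variable names. It decodes three bytes as Latin-1, then shows that reading the same bytes as UTF-8 is expected to fail, as the "misparse latin1 as utf8" test in this commit asserts.

```scheme
(use-modules (ice-9 iconv)
             (rnrs bytevectors))

;; Latin-1 bytes for e-acute, t, e-acute (values from the test suite).
(define bv (u8-list->bytevector '(233 116 233)))

;; Decoding with the matching encoding yields a three-character string.
(define decoded (bytevector->string bv "latin1"))
(display (string-length decoded)) (newline)

;; The byte 233 begins a multi-byte UTF-8 sequence that is never
;; completed, so decoding as UTF-8 should raise a decoding-error.
;; `catch' maps that error to #f here.
(define misparsed
  (catch 'decoding-error
    (lambda () (bytevector->string bv "utf-8"))
    (lambda (key . args) #f)))
```

Decoding only round-trips when the encoding passed to @code{bytevector->string} matches the one used to produce the bytes.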
|
||||
|
||||
@deffn {Scheme Procedure} call-with-encoded-output-string encoding proc [#:conversion-strategy='error]
|
||||
Like @code{call-with-output-string}, but instead of returning a string,
|
||||
returns an encoding of the string according to @var{encoding}, as a
|
||||
bytevector. This procedure can be more efficient than collecting a
|
||||
string and then converting it via @code{string->bytevector}.
|
||||
@end deffn
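The port-based procedure can be sketched as follows, using the name that module/ice-9/iconv.scm actually exports, @code{call-with-encoded-output-string}. The variable name and the text written to the port are assumptions made for this example.

```scheme
(use-modules (ice-9 iconv)
             (rnrs bytevectors))

;; Write text to the port in pieces; the procedure returns the encoded
;; bytes once the callback finishes.
(define bv
  (call-with-encoded-output-string "latin1"
    (lambda (port)
      (display "caf" port)
      (write-char #\xe9 port))))   ; e-acute, one byte in Latin-1

;; Four characters were written, and Latin-1 uses one byte for each.
(display (bytevector-length bv)) (newline)
```

This form suits code that already produces output through @code{display} or @code{write}, since no intermediate string has to be assembled by the caller.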
|
||||
|
||||
@node Conversion to/from C
|
||||
@subsubsection Conversion to/from C
|
||||
|
||||
|
|
@ -4172,9 +4237,9 @@ important.
|
|||
|
||||
In C, a string is just a sequence of bytes, and the character encoding
|
||||
describes the relation between these bytes and the actual characters
|
||||
that make up the string. For Scheme strings, character encoding is
|
||||
not an issue (most of the time), since in Scheme you never get to see
|
||||
the bytes, only the characters.
|
||||
that make up the string. For Scheme strings, character encoding is not
|
||||
an issue (most of the time), since in Scheme you usually treat strings
|
||||
as character sequences, not byte sequences.
|
||||
|
||||
Converting to C and converting from C each have their own challenges.
|
||||
|
||||
|
|
@ -4305,6 +4370,9 @@ into @var{encoding}.
|
|||
If @var{lenp} is @code{NULL}, this function will return a null-terminated C
|
||||
string. It will throw an error if the string contains a null
|
||||
character.
|
||||
|
||||
The Scheme interface to this function is @code{string->bytevector}, from the
|
||||
@code{ice-9 iconv} module. @xref{Representing Strings as Bytes}.
|
||||
@end deftypefn
|
||||
|
||||
@deftypefn {C Function} SCM scm_from_stringn (const char *str, size_t len, const char *encoding, scm_t_string_failed_conversion_handler handler)
|
||||
|
|
@ -4313,6 +4381,9 @@ length in bytes of the C string is input as @var{len}. The encoding of the C
|
|||
string is passed as the ASCII, null-terminated C string @code{encoding}.
|
||||
The @var{handler} parameter suggests a strategy for dealing with
|
||||
unconvertible characters.
|
||||
|
||||
The Scheme interface to this function is @code{bytevector->string}.
|
||||
@xref{Representing Strings as Bytes}.
|
||||
@end deftypefn
|
||||
|
||||
The following conversion functions are provided as a convenience for the
|
||||
|
|
@ -4810,6 +4881,7 @@ the host's native endianness.
|
|||
|
||||
Bytevector contents can also be interpreted as Unicode strings encoded
|
||||
in one of the most commonly available encoding formats.
|
||||
@xref{Representing Strings as Bytes}, for a more generic interface.
|
||||
|
||||
@lisp
|
||||
(utf8->string (u8-list->bytevector '(99 97 102 101)))
|
||||
|
|
|
|||
|
|
@ -1,6 +1,6 @@
|
|||
## Process this file with automake to produce Makefile.in.
|
||||
##
|
||||
## Copyright (C) 2009, 2010, 2011, 2012 Free Software Foundation, Inc.
|
||||
## Copyright (C) 2009, 2010, 2011, 2012, 2013 Free Software Foundation, Inc.
|
||||
##
|
||||
## This file is part of GUILE.
|
||||
##
|
||||
|
|
@ -210,6 +210,7 @@ ICE_9_SOURCES = \
|
|||
ice-9/getopt-long.scm \
|
||||
ice-9/hcons.scm \
|
||||
ice-9/i18n.scm \
|
||||
ice-9/iconv.scm \
|
||||
ice-9/lineio.scm \
|
||||
ice-9/ls.scm \
|
||||
ice-9/mapping.scm \
|
||||
|
|
|
|||
82
module/ice-9/iconv.scm
Normal file
|
|
@ -0,0 +1,82 @@
|
|||
;;; Encoding and decoding byte representations of strings
|
||||
|
||||
;; Copyright (C) 2013 Free Software Foundation, Inc.
|
||||
|
||||
;;;; This library is free software; you can redistribute it and/or
|
||||
;;;; modify it under the terms of the GNU Lesser General Public
|
||||
;;;; License as published by the Free Software Foundation; either
|
||||
;;;; version 3 of the License, or (at your option) any later version.
|
||||
;;;;
|
||||
;;;; This library is distributed in the hope that it will be useful,
|
||||
;;;; but WITHOUT ANY WARRANTY; without even the implied warranty of
|
||||
;;;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
|
||||
;;;; Lesser General Public License for more details.
|
||||
;;;;
|
||||
;;;; You should have received a copy of the GNU Lesser General Public
|
||||
;;;; License along with this library; if not, write to the Free Software
|
||||
;;;; Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
|
||||
|
||||
;;; Code:
|
||||
|
||||
(define-module (ice-9 iconv)
|
||||
#:use-module (rnrs bytevectors)
|
||||
#:use-module (ice-9 binary-ports)
|
||||
#:use-module ((ice-9 rdelim) #:select (read-delimited))
|
||||
#:export (string->bytevector
|
||||
bytevector->string
|
||||
call-with-encoded-output-string))
|
||||
|
||||
;; Like call-with-output-string, but actually closes the port.
|
||||
(define (call-with-output-string* proc)
|
||||
(let ((port (open-output-string)))
|
||||
(proc port)
|
||||
(let ((str (get-output-string port)))
|
||||
(close-port port)
|
||||
str)))
|
||||
|
||||
(define (call-with-output-bytevector* proc)
|
||||
(call-with-values (lambda () (open-bytevector-output-port))
|
||||
(lambda (port get-bytevector)
|
||||
(proc port)
|
||||
(let ((bv (get-bytevector)))
|
||||
(close-port port)
|
||||
bv))))
|
||||
|
||||
(define* (call-with-encoded-output-string encoding proc
|
||||
#:key (conversion-strategy 'error))
|
||||
(if (string-ci=? encoding "utf-8")
|
||||
;; I don't know why, but this appears to be faster; at least for
|
||||
;; serving examples/debug-sxml.scm (1464 reqs/s versus 850
|
||||
;; reqs/s).
|
||||
(string->utf8 (call-with-output-string* proc))
|
||||
(call-with-output-bytevector*
|
||||
(lambda (port)
|
||||
(set-port-encoding! port encoding)
|
||||
(if conversion-strategy
|
||||
(set-port-conversion-strategy! port conversion-strategy))
|
||||
(proc port)))))
|
||||
|
||||
;; TODO: Provide C implementations that call scm_from_stringn and
|
||||
;; friends?
|
||||
|
||||
(define* (string->bytevector str encoding #:key (conversion-strategy 'error))
|
||||
(if (string-ci=? encoding "utf-8")
|
||||
(string->utf8 str)
|
||||
(call-with-encoded-output-string
|
||||
encoding
|
||||
(lambda (port)
|
||||
(display str port))
|
||||
#:conversion-strategy conversion-strategy)))
|
||||
|
||||
(define* (bytevector->string bv encoding #:key (conversion-strategy 'error))
|
||||
(if (string-ci=? encoding "utf-8")
|
||||
(utf8->string bv)
|
||||
(let ((p (open-bytevector-input-port bv)))
|
||||
(set-port-encoding! p encoding)
|
||||
(if conversion-strategy
|
||||
(set-port-conversion-strategy! p conversion-strategy))
|
||||
(let ((res (read-delimited "" p)))
|
||||
(close-port p)
|
||||
(if (eof-object? res)
|
||||
""
|
||||
res)))))
|
||||
|
|
@ -1,7 +1,7 @@
|
|||
## Process this file with automake to produce Makefile.in.
|
||||
##
|
||||
## Copyright 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009,
|
||||
## 2010, 2011, 2012 Software Foundation, Inc.
|
||||
## 2010, 2011, 2012, 2013 Software Foundation, Inc.
|
||||
##
|
||||
## This file is part of GUILE.
|
||||
##
|
||||
|
|
@ -62,6 +62,7 @@ SCM_TESTS = tests/00-initial-env.test \
|
|||
tests/hash.test \
|
||||
tests/hooks.test \
|
||||
tests/i18n.test \
|
||||
tests/iconv.test \
|
||||
tests/import.test \
|
||||
tests/interp.test \
|
||||
tests/keywords.test \
|
||||
|
|
|
|||
115
test-suite/tests/iconv.test
Normal file
|
|
@ -0,0 +1,115 @@
|
|||
;;;; iconv.test --- Exercise the iconv API. -*- coding: utf-8; mode: scheme; -*-
|
||||
;;;;
|
||||
;;;; Copyright (C) 2013 Free Software Foundation, Inc.
|
||||
;;;; Andy Wingo
|
||||
;;;;
|
||||
;;;; This library is free software; you can redistribute it and/or
|
||||
;;;; modify it under the terms of the GNU Lesser General Public
|
||||
;;;; License as published by the Free Software Foundation; either
|
||||
;;;; version 3 of the License, or (at your option) any later version.
|
||||
;;;;
|
||||
;;;; This library is distributed in the hope that it will be useful,
|
||||
;;;; but WITHOUT ANY WARRANTY; without even the implied warranty of
|
||||
;;;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
|
||||
;;;; Lesser General Public License for more details.
|
||||
;;;;
|
||||
;;;; You should have received a copy of the GNU Lesser General Public
|
||||
;;;; License along with this library; if not, write to the Free Software
|
||||
;;;; Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
|
||||
|
||||
(define-module (test-suite iconv)
|
||||
#:use-module (ice-9 iconv)
|
||||
#:use-module (rnrs bytevectors)
|
||||
#:use-module (test-suite lib))
|
||||
|
||||
|
||||
(define exception:encoding-error
|
||||
'(encoding-error . ""))
|
||||
|
||||
(define exception:decoding-error
|
||||
'(decoding-error . ""))
|
||||
|
||||
|
||||
(with-test-prefix "ascii string"
|
||||
(let ((s "Hello, World!"))
|
||||
;; For ASCII, all of these encodings should be the same.
|
||||
|
||||
(pass-if "to ascii bytevector"
|
||||
(equal? (string->bytevector s "ASCII")
|
||||
#vu8(72 101 108 108 111 44 32 87 111 114 108 100 33)))
|
||||
|
||||
(pass-if "to ascii bytevector (length check)"
|
||||
(equal? (string-length s)
|
||||
(bytevector-length (string->bytevector s "ascii"))))
|
||||
|
||||
(pass-if "from ascii bytevector"
|
||||
(equal? s
|
||||
(bytevector->string (string->bytevector s "ascii") "ascii")))
|
||||
|
||||
(pass-if "to utf-8 bytevector"
|
||||
(equal? (string->bytevector s "ASCII")
|
||||
(string->bytevector s "utf-8")))
|
||||
|
||||
(pass-if "to UTF-8 bytevector (testing encoding case sensitivity)"
|
||||
(equal? (string->bytevector s "ascii")
|
||||
(string->bytevector s "UTF-8")))
|
||||
|
||||
(pass-if "from utf-8 bytevector"
|
||||
(equal? s
|
||||
(bytevector->string (string->bytevector s "utf-8") "utf-8")))
|
||||
|
||||
(pass-if "to latin1 bytevector"
|
||||
(equal? (string->bytevector s "ASCII")
|
||||
(string->bytevector s "latin1")))
|
||||
|
||||
(pass-if "from latin1 bytevector"
|
||||
(equal? s
|
||||
(bytevector->string (string->bytevector s "latin1") "latin1")))))
|
||||
|
||||
(with-test-prefix "narrow non-ascii string"
|
||||
(let ((s "été"))
|
||||
(pass-if "to latin1 bytevector"
|
||||
(equal? (string->bytevector s "latin1")
|
||||
#vu8(233 116 233)))
|
||||
|
||||
(pass-if "to latin1 bytevector (length check)"
|
||||
(equal? (string-length s)
|
||||
(bytevector-length (string->bytevector s "latin1"))))
|
||||
|
||||
(pass-if "from latin1 bytevector"
|
||||
(equal? s
|
||||
(bytevector->string (string->bytevector s "latin1") "latin1")))
|
||||
|
||||
(pass-if "to utf-8 bytevector"
|
||||
(equal? (string->bytevector s "utf-8")
|
||||
#vu8(195 169 116 195 169)))
|
||||
|
||||
(pass-if "from utf-8 bytevector"
|
||||
(equal? s
|
||||
(bytevector->string (string->bytevector s "utf-8") "utf-8")))
|
||||
|
||||
(pass-if-exception "encode latin1 as ascii" exception:encoding-error
|
||||
(string->bytevector s "ascii"))
|
||||
|
||||
(pass-if-exception "misparse latin1 as utf8" exception:decoding-error
|
||||
(bytevector->string (string->bytevector s "latin1") "utf-8"))
|
||||
|
||||
(pass-if-exception "misparse latin1 as ascii" exception:decoding-error
|
||||
(bytevector->string (string->bytevector s "latin1") "ascii"))))
|
||||
|
||||
|
||||
(with-test-prefix "wide non-ascii string"
|
||||
(let ((s "ΧΑΟΣ"))
|
||||
(pass-if "to utf-8 bytevector"
|
||||
(equal? (string->bytevector s "utf-8")
|
||||
#vu8(206 167 206 145 206 159 206 163)))
|
||||
|
||||
(pass-if "from utf-8 bytevector"
|
||||
(equal? s
|
||||
(bytevector->string (string->bytevector s "utf-8") "utf-8")))
|
||||
|
||||
(pass-if-exception "encode as ascii" exception:encoding-error
|
||||
(string->bytevector s "ascii"))
|
||||
|
||||
(pass-if-exception "encode as latin1" exception:encoding-error
|
||||
(string->bytevector s "latin1"))))
|
||||