Playing with Xerces (and NekoHTML) 2 - おおたに6号機blog

Something like this should get parsed properly,
with the character entity references left as they are.

<z id="aaa"><y id="&nbsp;y&nbsp;">&nbsp;aaa&nbsp;<x /></y></z>

The approach I have in mind right now: extend XML11Configuration,
which Xerces loads by default, to make my own Configuration class,
and at the points where the pipeline is built (configurePipeline and configureXML11Pipeline)
slip a Filter in.

In the Filter, getValue hands out the string as it was before
normalization, completely untouched.

I think this will probably work.
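
To make "before normalization" concrete, here is a minimal sketch of what an ordinary parser does to attribute values. It uses the plain SAX parser bundled with the JDK, not the custom configuration discussed here, and the class and method names (AttrNormalizationDemo, parseId) are made up for illustration. A literal tab in the attribute is turned into a space and &amp; is expanded before the handler ever sees the value.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class AttrNormalizationDemo {

    // Parses the XML and returns the "id" attribute of the first element seen.
    static String parseId(String xml) throws Exception {
        final String[] result = new String[1];
        SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                            String qName, Attributes atts) {
                        if (result[0] == null) {
                            result[0] = atts.getValue("id");
                        }
                    }
                });
        return result[0];
    }

    public static void main(String[] args) throws Exception {
        // The attribute contains a literal tab and an &amp; reference.
        String value = parseId("<y id=\"a\tb &amp; c\"/>");
        // The handler only ever sees the normalized form.
        System.out.println(value.equals("a b & c"));
    }
}
```

That normalized form is all a regular handler gets, which is why the plan goes through getNonNormalizedValue instead.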

Roughly, what I picture is:

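import org.apache.xerces.parsers.XML11Configuration;
import org.apache.xerces.xni.XMLDocumentHandler;

public class TeedaXMLConfiguration extends XML11Configuration {

    public TeedaXMLConfiguration() {
    }

    // Pipeline used for XML 1.0 documents.
    protected void configurePipeline() {
        wrapDocumentHandler();
        super.configurePipeline();
    }

    // Pipeline used for XML 1.1 documents.
    protected void configureXML11Pipeline() {
        wrapDocumentHandler();
        super.configureXML11Pipeline();
    }

    // Put my own Filter in front of the registered document handler.
    protected void wrapDocumentHandler() {
        XMLDocumentHandler orgHandler = getDocumentHandler();
        // The pipeline may be configured more than once, so don't wrap twice.
        if (orgHandler == null || orgHandler instanceof MyFilter) {
            return;
        }
        // MyFilter stands in for my own Filter (an XMLDocumentFilter, not written yet).
        MyFilter filter = new MyFilter();
        filter.setDocumentHandler(orgHandler);
        setDocumentHandler(filter);
    }
}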

In the Filter:

    public void startElement(QName element, XMLAttributes attributes,
            Augmentations augs) throws XNIException {
        // MyWrapper stands in for my own XMLAttributes wrapper.
        orgHandler.startElement(element, new MyWrapper(attributes), augs);
    }

And in each wrapped Attributes:

    public String getValue(int index) {
        String value = attributes.getNonNormalizedValue(index);
        return value;
    }

Something like that.

(Update)
Oops. Looks like this alone isn't enough.
I got yelled at: The entity "nbsp" was referenced, but not declared.
So apparently &nbsp; has to be declared before it gets referenced.
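
For the record, the error is easy to reproduce without any custom configuration. The sketch below uses the JAXP parser bundled with the JDK (an assumption on my side, it is not the Xerces setup above), and the names NbspEntityDemo, rootId and failsToParse are made up for illustration. Without a declaration the document is rejected; with an internal DTD subset declaring nbsp it parses, but the reference comes out expanded to U+00A0 rather than staying as &nbsp;.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class NbspEntityDemo {

    // Parses the XML with the JDK's default parser and returns the root element's "id".
    static String rootId(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors instead of printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getAttribute("id");
    }

    // True if parsing fails with a SAXParseException (a well-formedness error).
    static boolean failsToParse(String xml) throws Exception {
        try {
            rootId(xml);
            return false;
        } catch (SAXParseException e) {
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        String undeclared = "<y id=\"&nbsp;y&nbsp;\"/>";
        // Same document, but with nbsp declared in an internal DTD subset.
        String declared = "<!DOCTYPE y [<!ENTITY nbsp \"&#160;\">]>" + undeclared;

        System.out.println(failsToParse(undeclared));
        System.out.println(rootId(declared).equals("\u00A0y\u00A0"));
    }
}
```

So declaring the entity makes the error go away, but it does not by itself keep the reference unexpanded, which is the part I actually care about.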

Hmm...
I really don't have a solid grasp of the Xerces architecture, do I.
Judging from the page below, it's the Scanner.

http://xerces.apache.org/xerces2-j/xni-design.html

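Looking at NekoHTML, it is indeed doing something in its Scanner.
Oh, and there's a thing called PlaybackInputStream in there (tangent).
So this is what Higa-san told me about: read just the head of the stream,
and if a META tag specifies an encoding, read it all over again with that encoding.
I see.
OK, OK, end of tangent lol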
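
Coming back to that tangent for a second, the "peek at the head, then read again" idea can be sketched with nothing but java.io. This is not NekoHTML's PlaybackInputStream, just my own approximation using BufferedInputStream mark/reset; the class name MetaCharsetSniffer, the 1024-byte limit and the regex are all assumptions.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaCharsetSniffer {

    // How many bytes to peek at; an arbitrary choice for this sketch.
    private static final int SNIFF_LIMIT = 1024;

    // Looks for charset=... inside a <meta ...> tag.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9][\\w.:-]*)",
            Pattern.CASE_INSENSITIVE);

    // Peeks at the head of the stream, then rewinds so nothing is lost.
    static Charset sniff(BufferedInputStream in, Charset fallback) throws IOException {
        in.mark(SNIFF_LIMIT);
        byte[] head = new byte[SNIFF_LIMIT];
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        in.reset();
        // ISO-8859-1 maps every byte to a char, so the ASCII markup survives.
        String text = new String(head, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(text);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return Charset.forName(m.group(1));
        }
        return fallback;
    }

    // Returns a Reader that decodes the whole stream with the sniffed charset.
    static Reader open(InputStream raw, Charset fallback) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        return new InputStreamReader(in, sniff(in, fallback));
    }

    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int c; (c = reader.read()) != -1; ) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] html = ("<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=ISO-8859-1\"></head>"
                + "<body>caf\u00e9</body></html>").getBytes(StandardCharsets.ISO_8859_1);
        String text = readAll(open(new ByteArrayInputStream(html), StandardCharsets.UTF_8));
        System.out.println(text.contains("caf\u00e9"));
    }
}
```

The point of mark/reset is that the bytes consumed while sniffing get replayed to the real decoder, so the second read starts from the very beginning.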

Now I'm torn: write my own Scanner, or chicken out,
use Neko and wrap that.