Change TDMLRunner to use XMLTextInfosetInputter/Outputter as default by olabusayoT · Pull Request #1650 · apache/daffodil

olabusayoT · 2026-04-03T15:16:57Z


- It still uses Scala results for certain things, so we expose getScalaResult in the TDML Inputters/Outputters
- Update TDML Schema to add support for custom validation name/type and use in stringAsXML tests
- Drop whitespace between elements to keep expected matching actual, but keep all others like mixed whitespace, attributes, comments unchanged
- Introduced tests for `stringAsXML` validation and namespace handling.
- Added a `noNormalizations` flag to control whether comments/processing instructions are normalized.
- Updated associated XML parsing methods and test cases to support the new option.
- Revised whitespace removal to handle specific scenarios for improved XML processing.
- Verify prefixes resolve to the same namespaces when checking prefixes
- Update TDMLException with more information on why getSimpleText isn't matching
- NullInfosetInputter should receive UTF-8 bytes for its events

Deprecation/Compatibility

Instead of ScalaXMLInfosetInputter/Outputter being the default inputter/outputter for the TDML Runner, it is now XMLTextInfosetInputter/Outputter, which supports the stringAsXML feature.

DAFFODIL-2909

stevedlawrence · 2026-04-08T18:06:48Z

daffodil-core/src/main/scala/org/apache/daffodil/lib/xml/XMLUtils.scala

+            NamespaceBinding(null, null | "", _),
+            _*
+          ) =>
+        dropWhitespace(e)


It looks like the dropWhitespace function recursively drops all whitespace, which I'm not sure we want to do. I would expect that infosets that use stringAsXML would want to ensure the stringAsXml portion is exactly the same, including whitespace.

Instead, maybe we should only expect the stringAsXml element to have a single Elem child, and any other children should be empty text nodes or we should error. That gives the user an indication that the expected stringAsXml is messed up (or we have a buggy infoset outputter).

Maybe instead we want something like:

    case e @ Elem(..., stringAsXml, ..., children) => {
      val (elemChildren, otherChildren) = children.partition { _.isInstanceOf[Elem] }
      if (elemChildren.length != 1) ... // throw exception, stringAsXml must contain a single child element
      otherChildren.foreach {
        case Text(data) if data.matches("""\s*""") => // no-op, whitespace-only text siblings are fine
        case c => ... // throw exception, c is some kind of mixed content not allowed as a stringAsXml child
      }
      e.copy(child = elemChildren)
    }

This way the only things we throw away are sibling whitespace Text nodes, and it also verifies that there is only a single child element.

It's probably also worth adding a comment here explaining that this is specifically for the stringAsXml feature and that we avoid making changes to any of its children except removing any surrounding whitespace, requiring that stringAsXml in the infoset match results exactly.
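The suggestion above can be sketched as a standalone function. This is only a rough illustration, assuming scala-xml is on the classpath; the names `StringAsXmlChildren` and `keepOnlyElementChild` and the choice of `IllegalArgumentException` are hypothetical and are not part of XMLUtils.

```scala
import scala.xml.{Elem, Text}

object StringAsXmlChildren {
  // Hypothetical helper: keeps the single element child untouched and drops
  // whitespace-only Text siblings. Any other sibling is treated as an error.
  def keepOnlyElementChild(e: Elem): Elem = {
    val (elemChildren, otherChildren) = e.child.partition(_.isInstanceOf[Elem])
    if (elemChildren.length != 1)
      throw new IllegalArgumentException(
        s"stringAsXml must contain exactly one child element, found ${elemChildren.length}"
      )
    otherChildren.foreach {
      case Text(data) if data.matches("""\s*""") => // whitespace-only sibling, fine
      case other =>
        throw new IllegalArgumentException(
          s"mixed content is not allowed as a stringAsXml child: $other"
        )
    }
    // The element child is copied over as-is, so its own whitespace is preserved.
    e.copy(child = elemChildren.toSeq)
  }
}
```

Because the single child is passed through unchanged, whitespace inside the embedded XML still has to match exactly; only the surrounding whitespace is discarded.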

stevedlawrence · 2026-04-08T18:13:20Z

daffodil-core/src/main/scala/org/apache/daffodil/lib/xml/XMLUtils.scala

-    val noMixedWS = removeMixedWhitespace(combinedText)
-    noMixedWS
+    n match {
+      case x @ Elem(


This function is only called on the root element, which is rarely going to be the stringAsXml element, so I don't think this case will ever really match in practice. I think instead the functions below should all have cases to be no-ops for the stringAsXml element, similar to what you have for removeAttributes1.

stevedlawrence · 2026-04-08T18:22:24Z

daffodil-core/src/main/scala/org/apache/daffodil/lib/xml/XMLUtils.scala

+              nsbB.getURI(prefixB),
+              b.getNamespace(prefixB)
+            )
+          )


Is it possible for these new checks to fail? I thought we validated the expected infoset to make sure it was valid, which I think should check to make sure namespaces resolve?

olabusayoT · 2026-04-10

I don't think we validate the expected infoset; it looks like that is still an open ticket:

https://issues.apache.org/jira/projects/DAFFODIL/issues/DAFFODIL-288

That being said, I don't think it can actually fail, since a.getNamespace(prefixA) is usually equal to nsbA.getURI(prefixA). I cannot think of a failing elem example.
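A small sketch of why the two lookups tend to agree, assuming scala-xml is on the classpath and assuming the binding being checked is the element's own `scope`: as far as I know, `getNamespace` on a node resolves the prefix through that node's scope. The object and method names here (`PrefixResolution`, `scopeAgrees`, `sameNamespace`) are hypothetical and only for illustration.

```scala
import scala.xml.{Elem, NamespaceBinding, Null, TopScope}

object PrefixResolution {
  // True when looking the prefix up through the element's scope gives the
  // same URI as asking the element directly.
  def scopeAgrees(e: Elem, prefix: String): Boolean =
    e.scope.getURI(prefix) == e.getNamespace(prefix)

  // True when two (possibly different) prefixes resolve to the same URI.
  def sameNamespace(a: Elem, prefixA: String, b: Elem, prefixB: String): Boolean =
    a.getNamespace(prefixA) == b.getNamespace(prefixB)

  // Example elements built programmatically so the bindings are explicit.
  val a: Elem = Elem("ex", "a", Null, NamespaceBinding("ex", "http://example.com", TopScope), true)
  val b: Elem = Elem("e2", "a", Null, NamespaceBinding("e2", "http://example.com", TopScope), true)
  val c: Elem = Elem("ex", "a", Null, NamespaceBinding("ex", "http://other.example", TopScope), true)
}
```

Under these assumptions the interesting comparison is across elements (`sameNamespace`), where the same prefix can be bound to different URIs, rather than between an element and its own scope.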

stevedlawrence · 2026-04-08T18:24:58Z

daffodil-tdml-lib/src/main/scala/org/apache/daffodil/tdml/TDMLRunner.scala

    val before = testSuite.loadingExceptions.clone()

-    val elem = loader.load(infosetSrc, None) // no schema
+    val elem = loader.load(infosetSrc, None, noNormalizations = true) // no schema


My only concern with disabling normalizations is that we no longer remove things like comments. That's important for stringAsXML, but I feel like it has caused problems in the past with pattern matching in the Daffodil compiler (for example, expecting Seq(Elem("foo")) when the children are actually Seq(Comment(..), Elem("foo"))). But maybe the TDML runner doesn't do any of that kind of pattern matching to parse TDML files and just uses XML paths to access elements, which ignore things like comments?

olabusayoT · 2026-04-10

I think the TDML runner calls normalize, which would remove comments for non-stringAsXML elems. The issue was that the load was removing them wholesale, which is not what we wanted.
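The concern about retained comments can be illustrated with a minimal sketch, assuming scala-xml is on the classpath. The names (`CommentSensitivity`, `hasExactlyOneElemChild`, `childrenNamed`) are hypothetical and say nothing about how the TDML runner itself accesses elements.

```scala
import scala.xml.{Comment, Elem, Node, NodeSeq, Null, TopScope}

object CommentSensitivity {
  // Strict positional match: true only when the sole child is an element.
  def hasExactlyOneElemChild(parent: Node): Boolean =
    parent.child.toList match {
      case List(_: Elem) => true
      case _ => false
    }

  // Label-based lookup: a comment's label is not an element name,
  // so a retained comment does not affect the result.
  def childrenNamed(parent: Node, label: String): NodeSeq = parent \ label

  val foo: Elem = Elem(null, "foo", Null, TopScope, true)
  val plain: Elem = Elem(null, "root", Null, TopScope, true, foo)
  val commented: Elem = Elem(null, "root", Null, TopScope, true, Comment("note"), foo)
}
```

Here `hasExactlyOneElemChild` succeeds on `plain` but fails on `commented`, while `childrenNamed(commented, "foo")` still finds the element. Code that looks children up by label is therefore less sensitive to whether comments were stripped at load time.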

stevedlawrence · 2026-04-08T18:27:56Z

...-processor/src/main/scala/org/apache/daffodil/processor/tdml/DaffodilTDMLDFDLProcessor.scala

    val dafpr = parseResult.asInstanceOf[DaffodilTDMLParseResult]
    val inputter = dafpr.inputter
-    val resNode = dafpr.getResult
+    val resNode = dafpr.getScalaResult


What happens if DAFFODIL_TDML_API_INFOSETS is "xml"? I think there won't be a scala result then.

stevedlawrence · 2026-04-09T13:15:16Z

daffodil-test/src/test/scala/org/apache/daffodil/infoset/TestStringAsXmlValidatorFactory.scala

+import java.util.Properties
+
+import org.apache.daffodil.api.validation.ValidatorFactory
+


Suggest we combine this file with the TestStringAsXmlValidator.scala file. It's small enough that I don't think the separation adds a whole lot. And it's nice to have all the custom validation logic in a single file for quick reference.

stevedlawrence · 2026-04-09T13:21:46Z

daffodil-test/src/test/resources/org/apache/daffodil/infoset/stringAsXML.tdml

+
+  <parserTestCase name="stringAsXml_09" root="binMessage"
+                  model="/org/apache/daffodil/infoset/stringAsXml/namespaced/xsd/binMessage.dfdl.xsd"
+                  validation="on">


Should these use the TestStringAsXmlValidator or just inherit from the default if changed above?

stevedlawrence · 2026-04-09T13:22:48Z

daffodil-test/src/test/resources/org/apache/daffodil/infoset/stringAsXML.tdml

+                xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
+                xmlns:xs="http://www.w3.org/2001/XMLSchema"
+                xmlns:ex="http://example.com"
+                defaultValidation="off">


Should we set the default validation so we don't have to repeat it in the tests?

stevedlawrence · 2026-04-09T13:40:21Z

daffodil-core/src/main/scala/org/apache/daffodil/lib/xml/DaffodilConstructingLoader.scala

 *                          as not normalizing CRLFs is non-standard for XML.
- *
+ * @param noNormalizations True to not remove comments and processing instructions and to not normalize
+ *                       CRLF/CR to LF. This is used to keep the XML as close to the original as possible


I don't love the name noNormalizations. Setting noNormalizations=false is kind of a double negative and a bit tricky to make sense of, and it also kind of makes it so normalizeCRLFtoLF is ignored if true. I would maybe suggest we just add additional flags for the specific behaviors (e.g. removeComments and removeProcInstr). That makes it clear exactly what those flags will do and gives users control over exactly what they want to keep. I imagine in most cases normalizeCRLFtoLF, removeComments, and removeProcInstr will all be set to the same thing (a user either wants everything removed or everything kept), but it at least gives the option.
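One way to picture separate flags is the sketch below, assuming scala-xml is on the classpath. It models the three behaviors as a post-processing pass over an already-built tree, which is not how DaffodilConstructingLoader works internally; `NormalizeOptions` and `Normalizer` are hypothetical names, and only the flag names come from the comment above.

```scala
import scala.xml.{Comment, Elem, Node, ProcInstr, Text}

// Hypothetical options holder: one positive flag per behavior.
final case class NormalizeOptions(
  normalizeCRLFtoLF: Boolean,
  removeComments: Boolean,
  removeProcInstr: Boolean
)

object Normalizer {
  // Returns zero or one node: removed nodes become Nil, others are rebuilt.
  def apply(n: Node, o: NormalizeOptions): Seq[Node] = n match {
    case _: Comment if o.removeComments => Nil
    case _: ProcInstr if o.removeProcInstr => Nil
    case Text(data) if o.normalizeCRLFtoLF =>
      // CRLF first, then any remaining lone CR.
      Seq(Text(data.replace("\r\n", "\n").replace("\r", "\n")))
    case e: Elem => Seq(e.copy(child = e.child.flatMap(apply(_, o)).toSeq))
    case other => Seq(other)
  }
}
```

With this shape, a caller that wants the XML kept as close to the original as possible passes `false` for all three flags, and each flag can still be toggled on its own.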

stevedlawrence · 2026-04-09T13:48:09Z

...-processor/src/main/scala/org/apache/daffodil/processor/tdml/DaffodilTDMLDFDLProcessor.scala

      TDMLInfosetOutputterAll()
-    } else {
+    } else if (tdmlApiInfosetsEnv == "scala") {
      TDMLInfosetOutputterScala()


Do we need this outputter? Feels like we really just need the core XML one, and then the all one used for CI to make sure all of our infoset inputters/outputters do the same thing.

olabusayoT requested review from jadams-tresys and stevedlawrence April 3, 2026 15:16

olabusayoT force-pushed the daf-2909-tdml-runner-stringAsXML branch 4 times, most recently from ef9f46c to 2e3475b Compare April 6, 2026 17:39

olabusayoT force-pushed the daf-2909-tdml-runner-stringAsXML branch from 2e3475b to c42b8ae Compare April 6, 2026 18:06

stevedlawrence reviewed Apr 9, 2026

olabusayoT wants to merge 1 commit into apache:main from olabusayoT:daf-2909-tdml-runner-stringAsXML
