Change TDMLRunner to use XMLTextInfosetInputter/Outputter as default#1650

Open
olabusayoT wants to merge 1 commit into apache:main from olabusayoT:daf-2909-tdml-runner-stringAsXML

@olabusayoT (Contributor)

  • it still uses scala results for certain things so we expose getScalaResult in the TDML Inputters/Outputters
  • Update TDML Schema to add support for custom validation name/type and use in stringAsXML tests
  • Drop whitespace between elements so that expected matches actual, but keep everything else (mixed whitespace, attributes, comments) unchanged
  • Introduced tests for stringAsXML validation and namespace handling.
  • Added a noNormalizations flag to control whether comments, processing instructions, and line endings are normalized.
  • Updated associated XML parsing methods and test cases to support the new option.
  • Revised whitespace removal to handle specific scenarios for improved XML processing.
  • Verify prefixes resolve to the same namespaces when checking prefixes

DAFFODIL-2909

@olabusayoT olabusayoT force-pushed the daf-2909-tdml-runner-stringAsXML branch 4 times, most recently from ef9f46c to 2e3475b Compare April 6, 2026 17:39
- it still uses scala results for certain things so we expose getScalaResult in the TDML Inputters/Outputters
- Update TDML Schema to add support for custom validation name/type and use in stringAsXML tests
- Drop whitespace between elements to keep expected matching actual, but keep all others like mixed whitespace, attributes, comments unchanged
- Introduced tests for `stringAsXML` validation and namespace handling.
- Added a `noNormalizations` flag to control whether comments/processing instructions are normalized.
- Updated associated XML parsing methods and test cases to support the new option.
- Revised whitespace removal to handle specific scenarios for improved XML processing.
- Verify prefixes resolve to the same namespaces when checking prefixes
- Update TDMLException with more information on why getSimpleText isn't matching
- NullInfosetInputter should receive UTF-8 bytes for its events

Deprecation/Compatibility
The default inputter/outputter for the TDML Runner is now XMLTextInfosetInputter/Outputter, which supports the stringAsXML feature, instead of ScalaXMLInfosetInputter/Outputter.

DAFFODIL-2909
@olabusayoT olabusayoT force-pushed the daf-2909-tdml-runner-stringAsXML branch from 2e3475b to c42b8ae Compare April 6, 2026 18:06
NamespaceBinding(null, null | "", _),
_*
) =>
dropWhitespace(e)
Member

It looks like the dropWhitespace function recursively drops all whitespace, which I'm not sure we want to do. I would expect that infosets that use stringAsXML would want to ensure the stringAsXml portion is exactly the same, including whitespace.

Instead, maybe we should only expect the stringAsXml element to have a single Elem child, and any other children should be empty text nodes or we should error. That gives the user an indication that the expected stringAsXml is messed up (or that we have a buggy infoset outputter).

Maybe instead we want something like:

case e @ Elem(..., stringAsXml, ..., children) => {
  // split the children into the element child and everything else
  val (elemChildren, otherChildren) = children.partition { _.isInstanceOf[Elem] }
  if (elemChildren.length != 1) ... // throw exception, stringAsXml must contain a single child element
  otherChildren.foreach {
    case Text(data) if data.matches("""\s*""") => // no-op, whitespace-only text siblings are fine
    case c => ... // throw exception, c is some kind of mixed content not allowed as a stringAsXml child
  }
  e.copy(child = elemChildren)
}

This way the only things we throw away are sibling whitespace Text nodes, and it also verifies that there is only a single child element.

It's probably also worth adding a comment here explaining that this is specifically for the stringAsXml feature and that we avoid making changes to any of its children except removing any surrounding whitespace, requiring that stringAsXml in the infoset match results exactly.
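A minimal runnable sketch of the suggestion above, assuming the scala-xml library is on the classpath. The object and method names (`StringAsXmlWhitespace`, `stripSurroundingWhitespace`) and the choice of `IllegalArgumentException` are hypothetical and are not taken from the PR; the real code would presumably match on the stringAsXml element inside the existing normalization functions and throw a TDML-specific exception.

```scala
import scala.xml.{Elem, Text}

object StringAsXmlWhitespace {
  // Hypothetical helper: given the stringAsXml element, keep its single
  // element child exactly as-is and drop only whitespace-only Text siblings.
  def stripSurroundingWhitespace(e: Elem): Elem = {
    val (elemChildren, otherChildren) = e.child.partition(_.isInstanceOf[Elem])
    if (elemChildren.length != 1)
      throw new IllegalArgumentException(
        s"stringAsXml must contain exactly one child element, found ${elemChildren.length}"
      )
    otherChildren.foreach {
      case Text(data) if data.matches("""\s*""") => // whitespace-only sibling, fine to drop
      case other =>
        throw new IllegalArgumentException(
          s"mixed content not allowed as a stringAsXml child: $other"
        )
    }
    // The element child is not recursed into, so its own whitespace is preserved
    e.copy(child = elemChildren.toSeq)
  }
}
```

Because the helper never descends into the element child, whitespace inside the embedded XML still has to match the expected infoset exactly.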

val noMixedWS = removeMixedWhitespace(combinedText)
noMixedWS
n match {
case x @ Elem(
Member

This function is only called on the root element, which is rarely going to be the stringAsXml element, so I don't think this case will ever really match in practice. Instead, I think the functions below should each have a no-op case for the stringAsXml element, similar to what you have for removeAttributes1.

nsbB.getURI(prefixB),
b.getNamespace(prefixB)
)
)
Member

Is it possible for these new checks to fail? I thought we validated the expected infoset to make sure it was valid, which I think should check to make sure namespaces resolve?

Contributor Author

I don't think we validate the expected infoset; it looks like that's still an open ticket:

https://issues.apache.org/jira/projects/DAFFODIL/issues/DAFFODIL-288

That being said, I don't think it can actually fail, since a.getNamespace(prefixA) is usually equal to nsbA.getURI(prefixA). I cannot think of an Elem for which it would fail.
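The equality mentioned in this reply can be illustrated with plain scala-xml, under the assumption that `nsbA` is the element's own `scope` (the diff context does not show where it comes from). In scala-xml, `Node.getNamespace` resolves a prefix through the node's scope, so the two lookups go through the same binding chain. The names below are made up for illustration.

```scala
import scala.xml.{Elem, NamespaceBinding, Null, TopScope}

object PrefixResolution {
  // A single binding ex -> http://example.com on top of the empty top scope
  val scope: NamespaceBinding = NamespaceBinding("ex", "http://example.com", TopScope)
  val elem: Elem = Elem("ex", "root", Null, scope, true)

  // Resolve the prefix through the element and through its scope directly
  val viaElem: String = elem.getNamespace("ex")
  val viaScope: String = elem.scope.getURI("ex")

  // A prefix with no binding resolves to null in both cases
  val unboundViaElem: String = elem.getNamespace("nope")
  val unboundViaScope: String = elem.scope.getURI("nope")
}
```

If `nsbA` were some other binding (for example, one taken from a parent or from the other infoset), the two lookups could differ, which is the only situation in which the new check would do extra work.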

val before = testSuite.loadingExceptions.clone()

- val elem = loader.load(infosetSrc, None) // no schema
+ val elem = loader.load(infosetSrc, None, noNormalizations = true) // no schema
Member

My only concern with disabling normalizations is that we no longer remove things like comments. That's important for stringAsXML, but I feel like leftover comments have caused problems in the past with pattern matching in the Daffodil compiler (for example, expecting Seq(Elem("foo")) when the children are actually Seq(Comment(..), Elem("foo"))). But maybe the TDML runner doesn't do that kind of pattern matching to parse TDML files and just uses XML paths to access elements, which ignore things like comments?

Contributor Author

I think the TDML runner calls normalize, which would remove comments for non-stringAsXML elements. The issue was that the load was removing them wholesale, which is not what we wanted.
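The concern raised in this thread can be reproduced with a small scala-xml sketch. The element names and the `soleElementChild` helper are invented for illustration and do not come from Daffodil; the point is only that a positional pattern match on the child sequence is sensitive to a retained comment, while a path-style lookup with `\` is not.

```scala
import scala.xml.{Comment, Elem, NodeSeq, Null, TopScope}

object CommentsAndMatching {
  val foo: Elem = Elem(null, "foo", Null, TopScope, true)
  // A parent whose children are a comment followed by <foo/>
  val root: Elem = Elem(null, "root", Null, TopScope, false, Comment(" note "), foo)

  // Positional match: succeeds only if the one and only child is an element
  def soleElementChild(parent: Elem): Option[Elem] = parent.child.toList match {
    case List(e: Elem) => Some(e)
    case _ => None
  }

  // The comment is a sibling node, so the positional match does not find <foo/>
  val byPattern: Option[Elem] = soleElementChild(root)

  // Path lookup filters children by label, so the comment is skipped
  val byPath: NodeSeq = root \ "foo"
}
```

Here `byPattern` is `None` while `byPath` contains the single `foo` element, which is why code that navigates with paths tolerates unnormalized input better than code that destructures child sequences.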

val dafpr = parseResult.asInstanceOf[DaffodilTDMLParseResult]
val inputter = dafpr.inputter
- val resNode = dafpr.getResult
+ val resNode = dafpr.getScalaResult
Member

What happens if DAFFODIL_TDML_API_INFOSETS is "xml"? I think there won't be a Scala result in that case.

import java.util.Properties

import org.apache.daffodil.api.validation.ValidatorFactory

Member

Suggest we combine this file with the TestStringAsXmlValidator.scala file. It's small enough that I don't think the separation adds a whole lot, and it's nice to have all the custom validation logic in a single file for quick reference.


<parserTestCase name="stringAsXml_09" root="binMessage"
model="/org/apache/daffodil/infoset/stringAsXml/namespaced/xsd/binMessage.dfdl.xsd"
validation="on">
Member

Should these use the TestStringAsXmlValidator or just inherit from the default if changed above?

xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:ex="http://example.com"
defaultValidation="off">
Member

Should we set the default validation so we don't have to repeat it in the tests?

* as not normalizing CRLFs is non-standard for XML.
*
* @param noNormalizations True to not remove comments and processing instructions and to not normalize
* CRLF/CR to LF. This is used to keep the XML as close to the original as possible
Member

I don't love the name noNormalizations. Setting noNormalizations=false is kind of a double negative and a bit tricky to make sense of, and it also kind of means normalizeCRLFtoLF is ignored when it is true. I would maybe suggest we just add additional flags for the specific behaviors (e.g. removeComments and removeProcInstr). That makes it clear exactly what those flags do and gives users control over exactly what they want to keep. I imagine in most cases normalizeCRLF, removeComments, and removeProcInstr will all be set to the same thing (a user either wants everything removed or everything kept), but it at least gives the option.
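One way the per-behavior flags could look is sketched below. This is an assumption-laden illustration, not the Daffodil loader API: `LoadOptions` and `applyOptions` are hypothetical names, the flag names are borrowed from this comment, and the sketch post-processes an already loaded scala-xml tree, whereas the real loader would presumably apply these choices while parsing.

```scala
import scala.xml.{Comment, Elem, Node, ProcInstr, Text}

object LoaderFlags {
  // Hypothetical options: one positive flag per behavior, no double negatives
  final case class LoadOptions(
    normalizeCRLFtoLF: Boolean = true,
    removeComments: Boolean = true,
    removeProcInstr: Boolean = true
  )

  // Apply the selected normalizations recursively; returns zero or one node
  def applyOptions(n: Node, opts: LoadOptions): Seq[Node] = n match {
    case _: Comment if opts.removeComments => Nil
    case _: ProcInstr if opts.removeProcInstr => Nil
    case Text(data) if opts.normalizeCRLFtoLF =>
      // CRLF first, then any remaining bare CR
      Seq(Text(data.replace("\r\n", "\n").replace("\r", "\n")))
    case e: Elem =>
      Seq(e.copy(child = e.child.flatMap(applyOptions(_, opts)).toSeq))
    case other => Seq(other)
  }
}
```

With this shape, a caller that wants the current `noNormalizations = true` behavior would pass `LoadOptions(normalizeCRLFtoLF = false, removeComments = false, removeProcInstr = false)`, and a caller that only wants to keep comments can change a single flag.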

TDMLInfosetOutputterAll()
- } else {
+ } else if (tdmlApiInfosetsEnv == "scala") {
TDMLInfosetOutputterScala()
Member

Do we need this outputter? It feels like we really just need the core XML one, plus the "all" one used in CI to make sure all of our infoset inputters/outputters do the same thing.
