Change TDMLRunner to use XMLTextInfosetInputter/Outputter as default #1650
olabusayoT wants to merge 1 commit into apache:main
ef9f46c to 2e3475b
- It still uses Scala results for certain things, so we expose getScalaResult in the TDML Inputters/Outputters
- Update TDML Schema to add support for custom validation name/type and use in stringAsXML tests
- Drop whitespace between elements to keep expected matching actual, but keep all others like mixed whitespace, attributes, comments unchanged
- Introduced tests for `stringAsXML` validation and namespace handling.
- Added a `noNormalizations` flag to control whether comments/processing instructions are normalized.
- Updated associated XML parsing methods and test cases to support the new option.
- Revised whitespace removal to handle specific scenarios for improved XML processing.
- Verify prefixes resolve to the same namespaces when checking prefixes
- Update TDMLException with more information on why getSimpleText isn't matching
- NullInfosetInputter should receive UTF-8 bytes for its events

Deprecation/Compatibility

Instead of ScalaXMLInfosetInputter/Outputter being the default inputter/outputter for the TDML Runner, it is now XMLTextInfosetInputter/Outputter, which supports the stringAsXml feature.

DAFFODIL-2909
2e3475b to c42b8ae
    NamespaceBinding(null, null | "", _),
    _*
    ) =>
    dropWhitespace(e)
It looks like the dropWhitespace function recursively drops all whitespace, which I'm not sure we want to do. I would expect that infosets that use stringAsXML would want to ensure the stringAsXml portion is exactly the same, including whitespace.
Instead, maybe we should only expect the stringAsXml element to have a single Elem child, and any other children should be empty text nodes or we should error. That gives the user an indication that the expected stringAsXml is messed up (or that we have a buggy infoset outputter).
Maybe instead we want something like:

    case e @ Elem(..., stringAsXml, ..., children) => {
      val (elemChildren, otherChildren) = e.child.partition { _.isInstanceOf[Elem] }
      if (elemChildren.length != 1) ... // throw exception, stringAsXml must contain a single child element
      otherChildren.foreach {
        case Text(data) if data.matches("""\s*""") => // no-op, whitespace-only text siblings are fine
        case c => ... // throw exception, c is some kind of mixed content not allowed as a stringAsXml child
      }
      e.copy(child = elemChildren)
    }

This way the only things we throw away are sibling whitespace Text nodes, and it also verifies that there is only a single child element.
It's probably also worth adding a comment here explaining that this is specifically for the stringAsXml feature and that we avoid making changes to any of its children except removing any surrounding whitespace, requiring that stringAsXml in the infoset match results exactly.
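
To make the suggestion above concrete, here is a minimal runnable sketch using scala-xml. The object name `StringAsXmlSketch`, the helper `trimStringAsXml`, and the sample elements are hypothetical and not part of Daffodil; the exception type is also just a placeholder for whatever the TDML runner would actually throw.

```scala
import scala.xml.{Elem, Null, Text, TopScope}

object StringAsXmlSketch {
  // Hypothetical helper (not Daffodil code): keep the single child element of a
  // stringAsXml element exactly as-is and drop only whitespace-only Text siblings.
  def trimStringAsXml(e: Elem): Elem = {
    val (elemChildren, otherChildren) = e.child.partition(_.isInstanceOf[Elem])
    if (elemChildren.length != 1)
      throw new IllegalArgumentException(
        s"stringAsXml must contain exactly one child element, found ${elemChildren.length}"
      )
    otherChildren.foreach {
      case Text(data) if data.matches("""\s*""") => // whitespace-only sibling, dropped below
      case other =>
        throw new IllegalArgumentException(
          s"mixed content not allowed as a stringAsXml child: $other"
        )
    }
    // The child element itself is not modified, so its inner whitespace survives.
    e.copy(child = elemChildren)
  }

  // Sample data built with constructors so the whitespace nodes are explicit.
  val payload: Elem =
    Elem(null, "payload", Null, TopScope, true,
      Text(" keep "), Elem(null, "b", Null, TopScope, true, Text("this")))
  val expected: Elem =
    Elem(null, "stringAsXml", Null, TopScope, true, Text("\n  "), payload, Text("\n"))
  val mixed: Elem =
    Elem(null, "stringAsXml", Null, TopScope, true, Text("oops"), payload)
}
```

Calling `StringAsXmlSketch.trimStringAsXml(StringAsXmlSketch.expected)` leaves only `payload` as a child, while the `mixed` sample is rejected because of its non-whitespace text sibling.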
    val noMixedWS = removeMixedWhitespace(combinedText)
    noMixedWS
    n match {
    case x @ Elem(
This function is only called on the root element, which is rarely going to be the stringAsXml element, so I don't think this case will ever really match in practice. I think instead the functions below should all have cases to be no-ops for the stringAsXml element, similar to what you have for removeAttributes1.
    nsbB.getURI(prefixB),
    b.getNamespace(prefixB)
    )
    )
Is it possible for these new checks to fail? I thought we validated the expected infoset to make sure it was valid, which I think should check to make sure namespaces resolve?
I don't think we validate the expected infoset; it looks like that is still an open ticket:
https://issues.apache.org/jira/projects/DAFFODIL/issues/DAFFODIL-288
That being said, I don't think it can actually fail, since a.getNamespace(prefixA) is usually equal to nsbA.getURI(prefixA). I cannot think of an Elem for which the check would fail.
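
A small sketch of why the two lookups tend to agree, assuming the binding being queried is the element's own scope (which may not hold for every call site in this PR). As far as I can tell, scala-xml's `Node.getNamespace` resolves the prefix through the node's scope, so both calls walk the same binding chain. The object and value names here are made up for illustration.

```scala
import scala.xml.{Elem, NamespaceBinding, Null, TopScope}

object PrefixResolutionSketch {
  // One binding: prefix "ex" maps to http://example.com, with no parent bindings.
  val scope: NamespaceBinding = NamespaceBinding("ex", "http://example.com", TopScope)
  val elem: Elem = Elem("ex", "item", Null, scope, true)

  // Resolve the prefix directly through the binding chain...
  val viaBinding: String = scope.getURI("ex")
  // ...and through the element, which consults its own scope.
  val viaElem: String = elem.getNamespace("ex")

  // A prefix that is not bound anywhere in the chain resolves to null.
  val unbound: String = elem.getNamespace("nope")
}
```

The two results can only differ when the binding passed in is not the scope attached to the element, which is the case the new check would catch.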
      val before = testSuite.loadingExceptions.clone()

    - val elem = loader.load(infosetSrc, None) // no schema
    + val elem = loader.load(infosetSrc, None, noNormalizations = true) // no schema
My only concern with disabling normalizations is that we no longer remove things like comments. That's important for stringAsXml, but I feel like it has caused problems in the past with pattern matching in the Daffodil compiler (for example, expecting a Seq(Elem("foo")) when the children are actually Seq(Comment(..), Elem("foo"))). But maybe the TDML runner doesn't do any of that kind of pattern matching to parse TDML files and just uses XML paths to access elements, which ignore things like comments?
I think the TDML runner calls normalize, which would remove comments for non-stringAsXML elements. The issue was that the load was removing them wholesale, which is not what we wanted.
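
The difference between the two access styles discussed in this thread can be sketched with scala-xml as follows. The names are illustrative only, and this says nothing about which style the TDML runner actually uses.

```scala
import scala.xml.{Comment, Elem, Node, Null, TopScope}

object CommentMatchingSketch {
  val foo: Elem = Elem(null, "foo", Null, TopScope, true)
  // A root whose children are a comment followed by <foo/>.
  val root: Elem = Elem(null, "root", Null, TopScope, true, Comment(" note "), foo)

  // Positional match: succeeds only when the child list is exactly one Elem,
  // so a retained comment makes it fall through to None.
  def positional(e: Elem): Option[Elem] = e.child match {
    case Seq(c: Elem) => Some(c)
    case _ => None
  }

  // Path projection: selects children by label, so the comment is skipped.
  def byPath(e: Elem): Seq[Node] = e \ "foo"
}
```

With the comment retained, `positional(root)` finds nothing while `byPath(root)` still returns the single `foo` element.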
      val dafpr = parseResult.asInstanceOf[DaffodilTDMLParseResult]
      val inputter = dafpr.inputter
    - val resNode = dafpr.getResult
    + val resNode = dafpr.getScalaResult
What happens if DAFFODIL_TDML_API_INFOSETS is "xml"? Then I think there won't be a Scala result.
    import java.util.Properties

    import org.apache.daffodil.api.validation.ValidatorFactory
Suggest we combine this file with the TestStringAsXmlValidator.scala file. It's small enough that I don't think the separation adds a whole lot, and it's nice to have all the custom validation logic in a single file for quick reference.
    <parserTestCase name="stringAsXml_09" root="binMessage"
    model="/org/apache/daffodil/infoset/stringAsXml/namespaced/xsd/binMessage.dfdl.xsd"
    validation="on">
Should these use the TestStringAsXmlValidator or just inherit from the default if changed above?
    xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:ex="http://example.com"
    defaultValidation="off">
Should we set the default validation so we don't have to repeat it in the tests?
     * as not normalizing CRLFs is non-standard for XML.
     *
     * @param noNormalizations True to not remove comments and processing instructions and to not normalize
     *                         CRLF/CR to LF. This is used to keep the XML as close to the original as possible
I don't love the name noNormalizations. Setting noNormalizations=false is kind of a double negative and a bit tricky to make sense of, and it also kind of makes it so normalizeCRLFtoLF is ignored if true. I would maybe suggest we just add additional flags for the specific behaviors (e.g. removeComments and removeProcInstr). It makes it clear exactly what those flags will do and gives control to users about exactly what they want to keep. I imagine in most cases normalizeCRLF, removeComments, and removeProcInstr will all be set to the same thing (a user either wants everything removed or everything kept), but it at least gives the option.
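
One possible shape for the separate flags, as a rough sketch only. `LoadOptions`, its defaults, and `normalizeLineEndings` are hypothetical names chosen to mirror the flag names in the comment above; none of this is the existing loader API.

```scala
object LoaderOptionsSketch {
  // Hypothetical options type: one positively named flag per behavior,
  // so no flag silently overrides another.
  final case class LoadOptions(
    normalizeCRLFtoLF: Boolean = true,
    removeComments: Boolean = true,
    removeProcInstr: Boolean = true
  )

  // The common cases: normalize everything, or keep the XML as close to the original as possible.
  val normalizeAll: LoadOptions = LoadOptions()
  val keepEverything: LoadOptions =
    LoadOptions(normalizeCRLFtoLF = false, removeComments = false, removeProcInstr = false)

  // Illustrative use of a single flag: CRLF and lone CR become LF only when requested.
  def normalizeLineEndings(s: String, opts: LoadOptions): String =
    if (opts.normalizeCRLFtoLF) s.replace("\r\n", "\n").replace("\r", "\n") else s
}
```

A caller that wants comments preserved but line endings normalized could then write `LoaderOptionsSketch.normalizeAll.copy(removeComments = false)`, which is the kind of mixed setting a single noNormalizations flag cannot express.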
      TDMLInfosetOutputterAll()
    - } else {
    + } else if (tdmlApiInfosetsEnv == "scala") {
      TDMLInfosetOutputterScala()
Do we need this outputter? It feels like we really just need the core XML one, and then the "all" one used for CI to make sure all of our infoset inputters/outputters do the same thing.
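
For reference, the selection logic under discussion might be modeled roughly as below. This is a sketch under assumptions: the `OutputterKind` type and `select` function are invented for illustration, the value "all" and the fallback to the XML text outputter are guesses, and only the "scala" and "xml" values appear in this conversation.

```scala
object OutputterSelectionSketch {
  // Stand-ins for the real TDML outputter implementations.
  sealed trait OutputterKind
  case object AllOutputters extends OutputterKind
  case object ScalaOnly extends OutputterKind
  case object XmlTextOnly extends OutputterKind

  // Maps the value of an environment variable such as DAFFODIL_TDML_API_INFOSETS
  // to an outputter kind; anything unrecognized or unset falls back to XML text (assumed).
  def select(env: Option[String]): OutputterKind =
    env.map(_.trim.toLowerCase) match {
      case Some("all") => AllOutputters
      case Some("scala") => ScalaOnly
      case _ => XmlTextOnly
    }
}
```

Dropping the `ScalaOnly` branch, as the comment above proposes, would leave just the default and the CI-oriented "all" mode.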