MD5 function with Utf8View output is not supported as a hash input #1604

@Flyangz

Describe the bug
As noted in apache/datafusion#16903, after upgrading to DataFusion 49 the MD5 function returns Utf8View by default, a result of the changes in apache/datafusion#16290.
Auron does not fully support this data type. When the result of the MD5 function is used as the input to a hash (for example, as a join key), execution fails.

To Reproduce
The following test case can be added to AuronFunctionSuite:

test("md5 function") {
  withTable("t1") {
    sql("create table t1 using parquet as select 'spark' as c1, '3.x' as version")
    // Both sides of the join key are MD5 results, so the MD5 output
    // is what gets hashed by the join.
    val functions =
      """
        |select b.md5
        |from (
        |  select c1, version from t1
        |) a join (
        |  select md5(concat(c1, version)) as md5 from t1
        |) b on md5(concat(a.c1, a.version)) = b.md5
        |""".stripMargin
    val df = sql(functions)
    checkAnswer(df, Seq(Row("9ff36a3857e29335d03cf6bef2147119")))
  }
}
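
To cross-check the digest expected by checkAnswer independently of Spark and Auron, a minimal sketch along these lines could be used. It assumes that md5 in the query is the lowercase hex MD5 of the UTF-8 bytes of the concatenated string; Md5Reference and md5Hex are hypothetical names used only for illustration and are not part of Auron or Spark.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

// Hypothetical helper, not part of Auron or Spark.
object Md5Reference {
  // Lowercase hex MD5 of the UTF-8 bytes of `input`.
  def md5Hex(input: String): String = {
    val bytes = MessageDigest
      .getInstance("MD5")
      .digest(input.getBytes(StandardCharsets.UTF_8))
    // Mask each signed byte to 0..255 before formatting as two hex digits.
    bytes.map(b => "%02x".format(b & 0xff)).mkString
  }

  def main(args: Array[String]): Unit = {
    // Same input as concat(c1, version) for the single row in t1.
    println(md5Hex("spark" + "3.x"))
  }
}
```

Running it prints the digest for the single row in t1, which can be compared by hand with the value passed to checkAnswer.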

This will result in the following error:
Caused by: java.lang.RuntimeException: task panics: Execution error: Execution error: output_with_sender[Project] error: Execution error: output_with_sender[BroadcastJoin] error: Execution error: Unsupported data type in hasher: Utf8View

Expected behavior
The MD5 function should work correctly when its result is used as a hash input. Full support for Utf8View in Auron may not be available soon. If so, the MD5 function could revert to the old logic, which does not convert the return value to a StringViewArray.
