feat: add diskann index#369

Open
richyreachy wants to merge 20 commits into alibaba:main from richyreachy:feat/diskann_index

Conversation

@richyreachy
Collaborator

Add a DiskANN index to Zvec to lower memory usage in vector search, as described in #325.

const std::vector<float> &b) const {
if (a.size() != b.size()) return false;
for (size_t i = 0; i < a.size(); ++i)
if (std::fabs(a[i] - b[i]) >= 1e-4f) return false;
Collaborator

Why was this changed?

Collaborator Author

I loosened the tolerance while tracking down a bug in testing; it has been changed back.

-Wl,--whole-archive
$<TARGET_FILE:core_knn_flat_static>
$<TARGET_FILE:core_knn_flat_sparse_static>
$<TARGET_FILE:core_knn_hnsw_static>
Collaborator

This code is duplicated; it could be restructured.

run: |
sudo apt-get update
sudo apt-get install -y --no-install-recommends \
libaio-dev
Collaborator

What happens if libaio-dev is not installed in the user's environment?

Collaborator Author

At the moment the default build requires libaio to be installed; this could be made configurable. Qwen's suggestion is to install the libaio library through the Linux package manager:

Installation

zvec requires the libaio system library on Linux.

On Ubuntu/Debian:

sudo apt-get install libaio1 libaio-dev
pip install zvec

Collaborator

But what happens if it is not installed? The expected behavior is that a user who does not install aio can still use every feature other than diskann.

Collaborator Author

This has been adjusted in a new PR: #378
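What #378 actually changes is not shown in this thread. As an illustration of the behavior the reviewer asks for, one possible approach is to gate only the DiskANN code path on the presence of the libaio header at compile time. The macro `SKETCH_HAVE_LIBAIO` and the function `index_type_available` below are hypothetical names invented for this sketch, not part of zvec.

```cpp
#include <string>

// Assumption: availability of <libaio.h> is used as a proxy for libaio support.
// A real build would also need to link against the library.
#if defined(__has_include)
#  if __has_include(<libaio.h>)
#    define SKETCH_HAVE_LIBAIO 1
#  endif
#endif
#ifndef SKETCH_HAVE_LIBAIO
#  define SKETCH_HAVE_LIBAIO 0
#endif

// Hypothetical helper: only the "diskann" index type depends on libaio;
// every other index type stays available regardless.
inline bool index_type_available(const std::string &type) {
  if (type == "diskann") {
    return SKETCH_HAVE_LIBAIO != 0;
  }
  return true;
}
```

With a gate like this, a request for a diskann index can fail with a clear error on systems without libaio while flat and HNSW indexes keep working.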

pytest \
scikit-build-core \
setuptools_scm
shell: bash
Collaborator

Please add bash back for consistency. Also, if this later becomes a multi-line command, it could break in environments where bash is not the default shell.

Collaborator Author

done

}

auto &pool = ctx->expanded_nodes();
for (uint32_t i = 0; i < pool.size(); i++) {
Collaborator

Consider using std::remove_if + erase here; it is more efficient.
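The body of the loop under review is not shown, so the removal condition is unknown. As a minimal sketch of the erase-remove idiom the reviewer refers to, assume a hypothetical `Node` type with a `filtered` flag that marks entries to drop:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an entry in the expanded-node pool.
struct Node {
  uint32_t id;
  bool filtered;  // assumed flag: true means the node should be dropped
};

// Removes every filtered node in a single linear pass.
// std::remove_if shifts the kept elements to the front (preserving their
// relative order) and returns the new logical end; erase then trims the tail.
inline void drop_filtered(std::vector<Node> &pool) {
  pool.erase(std::remove_if(pool.begin(), pool.end(),
                            [](const Node &n) { return n.filtered; }),
             pool.end());
}
```

Erasing elements one at a time inside an index loop shifts the tail on every removal, which is quadratic in the worst case; the idiom above moves each kept element at most once.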


virtual ~DiskAnnQueryParams() = default;

int list_size() const {
Collaborator

Please add comments documenting the parameters.

}

for (size_t i = 0; i < dimension; i++) {
centroid_data_ptr[i] /= entity_.doc_cnt();
Collaborator

Can entity_.doc_cnt() be 0?

Collaborator Author

Added an up-front check:
if (ailego_unlikely(holder->count() == 0)) {
LOG_ERROR("Holder is empty");
return IndexError_Runtime;
}
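To show why the guard matters, here is a self-contained sketch of a centroid computation that refuses empty input before dividing by the document count. The function `compute_centroid` and its signature are assumptions made for illustration; they do not mirror the zvec builder API.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: averages `docs` component-wise into `centroid`.
// Returns false (leaving the division unexecuted) when there are no vectors
// or when a vector does not have `dimension` components.
inline bool compute_centroid(const std::vector<std::vector<float>> &docs,
                             size_t dimension, std::vector<float> *centroid) {
  if (docs.empty()) {
    return false;  // the early check: never divide by a zero count
  }
  centroid->assign(dimension, 0.0f);
  for (const auto &v : docs) {
    if (v.size() != dimension) {
      return false;
    }
    for (size_t i = 0; i < dimension; i++) {
      (*centroid)[i] += v[i];
    }
  }
  for (size_t i = 0; i < dimension; i++) {
    (*centroid)[i] /= static_cast<float>(docs.size());
  }
  return true;
}
```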


(*entity_.mutable_medoid()) = medoid_id;

LOG_INFO("Medroid Calculation Done. ID: %zu", (size_t)medoid_id);
Collaborator

typo: Medoid

Collaborator Author

done


sector_internal_id_++;
if (sector_internal_id_ >= sector_vec_num_) {
std::vector<uint8_t> padding_(padding_size_, 0);
Collaborator

Is it necessary to allocate a temporary std::vector here?

std::memset(data_ptr + data_size_, 0, padding_size_);
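The suggestion can be sketched in isolation as follows. The function `write_padded` and its parameters are hypothetical; the point is only that the tail of a sector buffer can be zeroed in place without building a temporary vector of zeros first.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: copies `data_size` payload bytes to the start of
// `sector` and zero-fills the rest of the sector in place.
// Assumes `sector` points to at least `sector_size` writable bytes and
// `data` points to at least `data_size` readable bytes.
inline void write_padded(uint8_t *sector, size_t sector_size,
                         const uint8_t *data, size_t data_size) {
  if (data_size > sector_size) {
    data_size = sector_size;  // clamp so the copy never overruns the sector
  }
  std::memcpy(sector, data, data_size);
  // No temporary buffer: the padding is written directly into the sector.
  std::memset(sector + data_size, 0, sector_size - data_size);
}
```

This avoids one heap allocation and one extra copy each time a sector is filled.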

float *centroid_data_{nullptr};

diskann_id_t medoid_;
std::vector<diskann_id_t> entrypints_;
Collaborator

typo: entrypoints_

Collaborator Author

done

##
## Copyright (C) The Software Authors. All rights reserved.
##
## \file CMakeLists.txt
Collaborator

Please remove this and replace it with the zvec copyright notice.

Collaborator Author

done


int list_size() const {
return list_size_;
}
Collaborator

The parameters exposed here (query_params/index_params) seem to differ from those of other similar products. What is the reasoning behind this?

Collaborator Author

This is kept consistent with diskann, which uses list size.

2 participants