Taking a closer look here, I noticed that tokens other than ACTG aren't handled:

    def transform_reflogprob_clm(
        example: dict[str, Any],
        tokenizer: PreTrainedTokenizerBase,
    ) -> dict[str, Any]:
        pos = example["pos"]
        assert example["seq"][pos] in NUCLEOTIDES
        input_ids = tokenizer(example["seq"], return_tensors="pt")["input_ids"][0]
        ref = input_ids[pos].item()
        # Create 4 copies of the input sequence
        new_input_ids = input_ids.unsqueeze(0).repeat(len(NUCLEOTIDES), 1)
        for i, nuc in enumerate(NUCLEOTIDES):
            new_input_ids[i, pos] = tokenizer.encode(nuc)[0]
        ref = NUCLEOTIDES.index(example["seq"][pos])
        return dict(input_ids=new_input_ids, ref=ref)

The line `NUCLEOTIDES.index(example["seq"][pos])` should fail on an unknown token. Is that handled some other way @gonzalobenegas?
Also, a side note: the line `ref = input_ids[pos].item()` is unnecessary (that `ref` value is never used; it is overwritten before the return).
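
If it isn't handled upstream, one option might be an explicit check with a descriptive error, since the existing `assert` only guards the base at `pos` while assertions are enabled. Below is a minimal sketch of that idea, not the repo's actual code: `ref_nucleotide_index` and `has_valid_ref` are hypothetical helper names, and I'm assuming `NUCLEOTIDES` is something like `list("ACGT")` (the real constant may be ordered differently).

```python
from typing import Any

# Assumption: stand-in for the repo's NUCLEOTIDES constant; the real order may differ.
NUCLEOTIDES = list("ACGT")


def ref_nucleotide_index(seq: str, pos: int) -> int:
    """Return the index of seq[pos] in NUCLEOTIDES, or raise a descriptive error."""
    if not 0 <= pos < len(seq):
        raise IndexError(
            f"pos={pos} is out of range for a sequence of length {len(seq)}"
        )
    base = seq[pos]
    if base not in NUCLEOTIDES:
        # Raise explicitly instead of relying on assert, which `python -O` strips.
        raise ValueError(
            f"unsupported reference base {base!r} at pos={pos}; "
            f"expected one of {NUCLEOTIDES}"
        )
    return NUCLEOTIDES.index(base)


def has_valid_ref(example: dict[str, Any]) -> bool:
    """Predicate for dropping rows whose reference base is not in NUCLEOTIDES."""
    seq, pos = example["seq"], example["pos"]
    return 0 <= pos < len(seq) and seq[pos] in NUCLEOTIDES
```

With something like this, the transform could compute `ref = ref_nucleotide_index(example["seq"], pos)` once (which would also make the earlier unused `ref` assignment easy to drop), or rows could be filtered with `has_valid_ref` before the transform runs, depending on whether failing loudly or skipping is preferred.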
The snippet is from `biofoundation/biofoundation/data.py`, lines 45 to 58 in e8ff2fe.