Handle unknown nucleotides

While taking a closer look at this code, I noticed that tokens other than ACTG aren't handled: https://github.com/Open-Athena/biofoundation/blob/e8ff2febc0f14a268b757ee35585ede5dbf8b4ae/biofoundation/data.py#L45-L58


The line `NUCLEOTIDES.index(example["seq"][pos])` should fail (raise a `ValueError`) on an unknown token. Is that handled some other way, @gonzalobenegas?


Also, a side note: the line `ref = input_ids[pos].item()` is unnecessary (that `ref` value is overwritten before it is used).

	def transform_reflogprob_clm(
	    example: dict[str, Any],
	    tokenizer: PreTrainedTokenizerBase,
	) -> dict[str, Any]:
	    pos = example["pos"]
	    assert example["seq"][pos] in NUCLEOTIDES
	    input_ids = tokenizer(example["seq"], return_tensors="pt")["input_ids"][0]
	    ref = input_ids[pos].item()
	    # Create 4 copies of the input sequence
	    new_input_ids = input_ids.unsqueeze(0).repeat(len(NUCLEOTIDES), 1)
	    for i, nuc in enumerate(NUCLEOTIDES):
	        new_input_ids[i, pos] = tokenizer.encode(nuc)[0]
	    ref = NUCLEOTIDES.index(example["seq"][pos])
	    return dict(input_ids=new_input_ids, ref=ref)
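
For illustration, here is a minimal sketch of one way the reference lookup could be guarded instead of relying on the `assert`. It is not what the repo does. The helper name `ref_index_or_none` is hypothetical, and the contents and order of `NUCLEOTIDES` are an assumption, since the constant's definition isn't shown in this issue. Callers could use the `None` result to skip or filter examples whose reference base is, say, `N`.

```python
from typing import Any, Optional

# Assumption: stand-in for the constant in biofoundation/data.py; its real
# contents and order are not shown in this issue.
NUCLEOTIDES = list("ACGT")


def ref_index_or_none(example: dict[str, Any]) -> Optional[int]:
    """Hypothetical helper: index of the reference base, or None if unknown."""
    base = example["seq"][example["pos"]]
    try:
        return NUCLEOTIDES.index(base)
    except ValueError:
        # Raised for anything outside NUCLEOTIDES, e.g. "N" or lowercase bases.
        return None


known = {"seq": "ACGTA", "pos": 2}
unknown = {"seq": "ACNTA", "pos": 2}
print(ref_index_or_none(known))    # 2
print(ref_index_or_none(unknown))  # None
```

Whether unknown bases should be skipped, filtered out upstream, or raise a clearer error is a design choice for the maintainers; the sketch only makes the failure mode explicit.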

Handle unknown nucleotides #3

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Handle unknown nucleotides #3

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions