Add comprehensive tokenization tests, update diagram, and adjust code to handle edge cases #943

Yuncong-Cao · 2025-04-06T21:42:33Z

call stack diagram for dataset

Yuncong-Cao · 2025-04-07T02:35:51Z

call stack diagram for dataset

and tokenization

not a reviewer

wheresmyhair

As discussed, we need classDiagram in favor of unit tests and helping others understand call stacks better.

…ge cases

Yuncong-Cao · 2025-04-21T00:26:49Z

Code changes based on tokenization tests

I updated ConversationTemplate.encode_conversation to drop any unpaired final message when there’s an odd count and return only the paired turns; if the first message isn’t from the user, I skip encoding and return an empty list.
I also tweaked both hf_decoder_model.py and hf_text_regression_model.py so that if a ConversationTemplate lacks a system_formatter, I set system=None before calling encode_conversation, avoiding ValueError on unformatted system prompts.

MacBook Pro +∞ and others added 2 commits April 6, 2025 16:36

add dataset.mmd

25f8bad

add tokenization.mmd

196b13a

wheresmyhair requested changes Apr 14, 2025

View reviewed changes

Yuncong Cao added 5 commits April 19, 2025 16:05

Add tokenizer test templates

572d16e

Add tokenization comparison tests and refactor hf_* tokenizers for ed…

c500840

…ge cases

modify the tokenization diagram

a1afd61

modify the tokenization diagram

db0f1d0

modify the tokenization diagram

4786f99

Yuncong-Cao changed the title ~~add dataset.mmd~~ Add comprehensive tokenization tests, update diagram, and adjust code to handle edge cases Apr 21, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add comprehensive tokenization tests, update diagram, and adjust code to handle edge cases #943

Add comprehensive tokenization tests, update diagram, and adjust code to handle edge cases #943

Yuncong-Cao commented Apr 6, 2025

Yuncong-Cao commented Apr 7, 2025

wheresmyhair left a comment

Yuncong-Cao commented Apr 21, 2025

Add comprehensive tokenization tests, update diagram, and adjust code to handle edge cases #943

Are you sure you want to change the base?

Add comprehensive tokenization tests, update diagram, and adjust code to handle edge cases #943

Conversation

Yuncong-Cao commented Apr 6, 2025

Yuncong-Cao commented Apr 7, 2025

wheresmyhair left a comment

Choose a reason for hiding this comment

Yuncong-Cao commented Apr 21, 2025