Added Haitian Creole (ht) Language Support to spaCy #13807

JephteyAdolphe · 2025-04-27T18:31:08Z

Description

This PR adds official support for Haitian Creole (ht) to spaCy's spacy/lang module.
It includes:

- Core language data files under spacy/lang/ht:
  - tokenizer_exceptions.py
  - punctuation.py
  - lex_attrs.py
  - syntax_iterators.py
  - lemmatizer.py
  - stop_words.py
  - tag_map.py
- Unit tests for the tokenizer and noun chunking (test_tokenizer.py, test_noun_chunking.py, etc.). All 58 tests I wrote under spacy/tests/lang/ht pass with pytest.
- Basic tokenizer rules adapted for Haitian Creole orthography and informal contractions.
- Custom like_num attribute supporting Haitian number formats (e.g., "3yèm").
- Support for common informal apostrophe usage (e.g., "m'ap", "n'ap", "di'm").
- Verified that other language modules are not broken by the change.
- Followed spaCy coding style (PEP8, Black).
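
To illustrate the like_num item above, here is a minimal sketch of how such a lexical attribute could be written. It is not the code from this PR: the number-word set is a short illustrative sample, the ordinal pattern only covers the "yèm" suffix seen in "3yèm", and the overall shape (sign stripping, separators, fractions) is an assumption modeled on how like_num is commonly structured in other spaCy language modules.

```python
import re

# Illustrative sample only; the PR's actual number-word list is not shown here.
_NUM_WORDS = {"youn", "de", "twa", "kat", "senk", "sis", "sèt", "uit", "nèf", "dis"}

# Digits followed by the ordinal suffix "yèm", as in "3yèm".
_ORDINAL_RE = re.compile(r"^\d+yèm$")


def like_num(text: str) -> bool:
    """Return True if the token text looks like a number."""
    # Drop a single leading sign or approximation marker.
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    # Plain digits, ignoring thousands and decimal separators.
    if text.replace(",", "").replace(".", "").isdigit():
        return True
    # Simple fractions such as "1/2".
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    lowered = text.lower()
    # Spelled-out numbers, then digit-plus-suffix ordinals.
    if lowered in _NUM_WORDS:
        return True
    return bool(_ORDINAL_RE.match(lowered))
```

In a spaCy language module, a function like this would typically be registered under the LIKE_NUM key of the LEX_ATTRS dictionary in lex_attrs.py.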

This provides a foundation for Haitian Creole NLP development using spaCy.
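
The apostrophe handling mentioned in the list above can be pictured with a small, standalone sketch. This is only an assumption about the splitting behavior, written with plain regular expressions rather than spaCy's tokenizer exception format; the helper name split_contraction, the pronoun set, and the choice of where the apostrophe attaches are all hypothetical and may differ from what the PR does.

```python
import re
from typing import List

# Assumed short pronoun forms (m, n, l, w, y); illustrative only.
_PRONOUNS = "mnlwy"

# "m'ap", "n'ap", "l'ap": short pronoun + apostrophe + marker "ap".
_PROCLITIC_RE = re.compile(rf"^([{_PRONOUNS}]')(ap)$", re.IGNORECASE)
# "di'm", "lajan'm", "fre'w": word + apostrophe + short pronoun.
_ENCLITIC_RE = re.compile(rf"^(\w+)('[{_PRONOUNS}])$", re.IGNORECASE)


def split_contraction(token: str) -> List[str]:
    """Split an informal contraction into two parts, or return it unchanged."""
    # Treat the typographic apostrophe like the ASCII one.
    token = token.replace("’", "'")
    for pattern in (_PROCLITIC_RE, _ENCLITIC_RE):
        match = pattern.match(token)
        if match:
            return list(match.groups())
    return [token]
```

Under these assumptions, "m'ap" splits into "m'" and "ap", "di'm" splits into "di" and "'m", and a word without an apostrophe is returned as a single token.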

Type of change

My PR covers the addition of a new language (new feature).

Checklist

I confirm that I have the right to submit this contribution under the project's MIT license.
I ran the tests, and all new and existing tests passed.
My changes don't require a change to the documentation, or if they do, I've added all required information.

Additional Notes

Haitian Creole does not have an official XPOS tagset, so UPOS (Universal POS) tags are used.
The tokenizer was carefully adapted for informal orthographic contractions (m'ap, l'ap, etc.).
A minimal stop_words list was compiled from common function words and expressions.
The contribution focuses on making ht available in the core library, and future models can be trained later based on this work.
A model trained on valid UD CoNLL-U data reached a final LAS of 0.52 (train set of 2,670 sentences, dev set of 333 sentences). I plan to grow the treebank over time and build on this foundational ht spaCy module, either myself or with the help of collaborators who are fluent in Haitian Creole. Training settings: hidden width 96, max steps 10,000, dropout 0.25, gradient accumulation 1, and batch size 50.
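
For readers unfamiliar with the metric, LAS (labeled attachment score) is the fraction of tokens whose predicted head and dependency label both match the gold annotation. The sketch below shows that definition in plain Python; it is not spaCy's scorer, the function name attachment_scores is hypothetical, and real evaluation tools differ in details such as whether punctuation is counted.

```python
from typing import List, Tuple

# Each token is (head_index, dependency_label); index 0 stands for ROOT here.
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) for two aligned token sequences."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences must be aligned")
    if not gold:
        return 0.0, 0.0
    # UAS counts tokens with the correct head, regardless of label.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS counts tokens with both the correct head and the correct label.
    labeled_hits = sum(g == p for g, p in zip(gold, pred))
    return head_hits / len(gold), labeled_hits / len(gold)
```

For a four-token sentence where three heads are right but only two of those also carry the right label, this returns a UAS of 0.75 and a LAS of 0.5.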

Thanks

I'm very excited to get the ball rolling for a low-resource language like Haitian Creole and contribute to an amazing library like spaCy!

Example Usage

import spacy

nlp = spacy.blank("ht")

# text = "Map manje gato a pandan map gade televizyon lem lakay mwen."
# text = "M'ap vini, eske wap la avek lajan'm? Si ou, di'l non pou fre'w."
# text = "M ap teste sa (pou kounye a)."
# text = "Si'm ka vini, m'ap pale ak li."
# text = "\"regre lanmò twò bonè\""
text = """Onè ap fèt pou ansyen lidè Pati Travayè Britanik

Moun atravè lemond ap voye onè pou ansyen lidè
Pati Travayè a, John Smith, ki mouri pi bonè jodi a apre li te fè yon gwo kriz kadyak a laj 55 an.

Nan Washington, Depatman Deta Etazini pibliye yon deklarasyon ki eksprime "regre lanmò twò bonè" avoka ak palmantè eskoze a.

"Misye Smith, pandan tout karyè li ki te make ak distenksyon"""

doc = nlp(text)

print("Tokens:")
print(len(doc))
for token in doc:
    print(f"{token.text} | {token.orth_} | {token.norm_} | {token.whitespace_}")

JephteyAdolphe · 2025-05-05T13:45:31Z

Bump @honnibal @syllog1sm @ines

Jeff Adolphe added 3 commits April 27, 2025 13:31

Added Haitian Creole (ht) language support

031a60e

small tweaks

b203171

norm_map additions

8628420
