Added Haitian Creole (ht) Language Support to spaCy #13807

Open · wants to merge 3 commits into base: master


@JephteyAdolphe JephteyAdolphe commented Apr 27, 2025

Description

This PR adds official support for Haitian Creole (ht) to spaCy's spacy/lang module.
It includes:

  • Added all core language data files for spacy/lang/ht:

    • tokenizer_exceptions.py
    • punctuation.py
    • lex_attrs.py
    • syntax_iterators.py
    • lemmatizer.py
    • stop_words.py
    • tag_map.py
  • Unit tests for the tokenizer and noun chunking (test_tokenizer.py, test_noun_chunking.py, etc.). All 58 tests I created under spacy/tests/lang/ht pass with pytest.

  • Basic tokenizer rules adapted for Haitian Creole orthography and informal contractions.

  • Custom like_num attribute supporting Haitian number formats (e.g., "3yèm").

  • Support for common informal apostrophe usage (e.g., "m'ap", "n'ap", "di'm").

  • Ensured no breakages in other language modules.

  • Followed spaCy coding style (PEP8, Black).

This provides a foundation for Haitian Creole NLP development using spaCy.
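
As a rough illustration of the like_num behavior described in the list above, here is a minimal standalone sketch. It is not the code from this PR's lex_attrs.py: the number-word list is abbreviated and illustrative, and the ordinal suffix tuple is an assumption based on the "3yèm" example.

```python
# Minimal sketch of a like_num-style check for Haitian Creole.
# Assumptions (not taken from the PR): the abbreviated word list below
# and the single ordinal suffix "yèm".
_num_words = {
    "zewo", "en", "de", "twa", "kat", "senk",
    "sis", "sèt", "uit", "nèf", "dis",
}
_ordinal_suffixes = ("yèm",)


def like_num(text: str) -> bool:
    """Return True if the text looks like a number."""
    text = text.strip().lower()
    # Drop a leading sign such as "+" or "-".
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    # Ignore thousands separators and decimal points.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Simple fractions such as "1/2".
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    if text in _num_words:
        return True
    # Ordinals written as digits plus a suffix, e.g. "3yèm".
    for suffix in _ordinal_suffixes:
        if text.endswith(suffix) and text[: -len(suffix)].isdigit():
            return True
    return False
```

In spaCy, a function like this is normally registered under the LIKE_NUM key of a language's lexical attribute getters, so that token.like_num reflects it.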

Type of change

My PR covers the addition of a new language (new feature).

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

Additional Notes

  • Haitian Creole does not have an official XPOS tagset, so UPOS (Universal POS) tags are used.
  • The tokenizer was carefully adapted for informal orthographic contractions (m'ap, l'ap, etc.).
  • A minimal stop_words list was compiled, based on common function words and expressions.
  • The contribution focuses on making ht available in the core library, and future models can be trained later based on this work.
  • A model trained on valid UD CoNLL-U data reached a final LAS of 0.52 (train set of 2670 sentences, dev set of 333 sentences). I plan to grow the treebank over time and build on this foundational ht spaCy module, either on my own or with collaborators who are fluent in Haitian Creole. Training settings: hidden width 96, max steps 10000, dropout 0.25, gradient accumulation 1, and batch size 50.
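
To make the contraction handling mentioned in the notes above more concrete, here is a small standalone sketch of how an exception table can split forms like m'ap. The table and the split points are hypothetical; this description does not show which splits tokenizer_exceptions.py actually uses. The sketch only splits on whitespace, so it ignores the punctuation handling a real tokenizer performs.

```python
# Hypothetical exception table: lowercase surface form -> token pieces.
# The split points are an assumption for illustration only.
CONTRACTIONS = {
    "m'ap": ["m'", "ap"],
    "n'ap": ["n'", "ap"],
    "l'ap": ["l'", "ap"],
    "di'm": ["di", "'m"],
}


def split_contractions(text: str) -> list[str]:
    """Split whitespace-separated chunks using the exception table."""
    tokens = []
    for chunk in text.split():
        pieces = CONTRACTIONS.get(chunk.lower())
        if pieces is None:
            tokens.append(chunk)
            continue
        # Slice the original chunk by piece length to keep its casing.
        start = 0
        for piece in pieces:
            tokens.append(chunk[start : start + len(piece)])
            start += len(piece)
    return tokens


print(split_contractions("M'ap pale ak li"))
```

The pieces of each entry concatenate back to the original string, which mirrors the constraint spaCy places on tokenizer exceptions: the ORTH values of the sub-tokens must join to the exact matched text.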

Thanks

I'm very excited to get the ball rolling for a low-resource language like Haitian Creole and contribute to an amazing library like spaCy!

Example Usage

import spacy

nlp = spacy.blank("ht")

# text = "Map manje gato a pandan map gade televizyon lem lakay mwen."
# text = "M'ap vini, eske wap la avek lajan'm? Si ou, di'l non pou fre'w."
# text = "M ap teste sa (pou kounye a)."
# text = "Si'm ka vini, m'ap pale ak li."
# text = "\"regre lanmò twò bonè\""
text = """Onè ap fèt pou ansyen lidè Pati Travayè Britanik

Moun atravè lemond ap voye onè pou ansyen lidè
Pati Travayè a, John Smith, ki mouri pi bonè jodi a apre li te fè yon gwo kriz kadyak a laj 55 an.

Nan Washington, Depatman Deta Etazini pibliye yon deklarasyon ki eksprime "regre lanmò twò bonè" avoka ak palmantè eskoze a.

"Misye Smith, pandan tout karyè li ki te make ak distenksyon"""

doc = nlp(text)

print("Tokens:")
print(len(doc))
for token in doc:
    print(f"{token.text} | {token.orth_} | {token.norm_} | {token.whitespace_}")

@JephteyAdolphe (Author)

Bump @honnibal @syllog1sm @ines