Some regex over the bas corpus #393
Comments
A few thoughts on this:
This would require a SQL migration no matter what.
Can you elaborate on that? What does the * mean in this case? Or do you mean anything that has a semicolon in it? |
Here I should have said
Have any of them been imported into Common Voice yet anyway? Basaa still isn't enabled afaik. The ideal situation would be to fix them everywhere.
I think it would be better to have them in the validation step rather than the cleanup step. |
While languages only get activated for speech contribution, exports from the Sentence Collector are done as soon as the language is enabled in https:/mozilla/common-voice/blob/main/locales/all.json. See https:/mozilla/common-voice/blob/main/server/data/bas/sentence-collector.txt for the already exported sentences.
Sure, that would work as well, but would require more work to make sure all the rules can be surfaced to the user in a nice way. Basically any special rule that is not already covered would need a nice "title" to inform the user about what exactly is wrong and what the fix would be. |
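To make the "title per rule" idea concrete, here is a minimal sketch in Python of how validation rules could carry a user-facing message. It is not the Sentence Collector's actual code: the names `Rule`, `BAS_RULES` and `validate`, as well as the wording of the titles, are assumptions made for illustration, and the patterns are simplified detection variants of the ones requested in this issue.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A hypothetical validation rule: a pattern plus a user-facing title."""
    pattern: re.Pattern
    title: str


# Hypothetical rules for bas; the titles are illustrative wording only.
BAS_RULES = [
    Rule(re.compile(r"\.\.$"), "Sentence should end with a single period"),
    Rule(re.compile(r"!\."), "Remove the period after the exclamation mark"),
    Rule(re.compile(r"^ñ"), "Sentence should start with an uppercase Ñ"),
    Rule(re.compile(r" \.$"), "Remove the space before the final period"),
]


def validate(sentence: str) -> list[str]:
    """Return the titles of all rules the sentence violates (empty if valid)."""
    return [rule.title for rule in BAS_RULES if rule.pattern.search(sentence)]
```

With this shape, a sentence that trips several rules gets one message per problem, which is what would be surfaced to the contributor in the validation step.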
Cool, then we'll want to just remove these ones:
We can add them back again. I guess for most of the punctuation ones:
For the |
# [2.1.0](v2.0.27...v2.1.0) (2020-12-23)

### Bug Fixes

* add migration to remove not yet approved sentences for Guarani (fixes [#400](#400)) ([#406](#406)) ([9b6a08d](9b6a08d))
* change source of some Kazakh sentences as per email request ([#403](#403)) ([2d15d25](2d15d25))
* delete some bas sentences (issue [#393](#393)) ([#404](#404)) ([834b780](834b780))

### Features

* add undecided/rejected text only API endpoint (fixes [#402](#402)) ([#407](#407)) ([bc54da4](bc54da4))
* add validation function for other checks and use it for bas (fixes [#393](#393)) ([#405](#405)) ([8f66671](8f66671))
🎉 This issue has been resolved in version 2.1.0 🎉 The release is available on GitHub release Your semantic-release bot 📦🚀 |
Would it be possible to run the following regex on the `bas` corpus?

* `; *`
* `*\.\.$` with `.`
* `*!\.` with `!`
* `^ñ` → `Ñ`
* `*\.$` with `.`
If sentences with these patterns have been rejected, please dereject them. If they have been accepted, please leave them accepted but with the modified content.
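As a rough illustration of the requested clean-up, the substitutions could be applied in order as in the Python sketch below. This is an assumption-laden sketch, not the migration that was actually run: the names `RULES` and `clean_sentence` are hypothetical, each pattern is assumed to have started with a space before the `*` (a bare leading `*` is not a valid regex), and the `; *` pattern is left out because its replacement is not shown above.

```python
import re

# Hypothetical (pattern, replacement) pairs, applied in the order listed.
# Assumption: each " *" originally had a leading space, i.e. "optional spaces".
RULES = [
    (re.compile(r" *\.\.$"), "."),  # two trailing periods -> one
    (re.compile(r" *!\."), "!"),    # "!." -> "!"
    (re.compile(r"^ñ"), "Ñ"),       # capitalize a leading ñ
    (re.compile(r" *\.$"), "."),    # drop spaces before the final period
]


def clean_sentence(sentence: str) -> str:
    """Apply every substitution in turn and return the cleaned sentence."""
    for pattern, replacement in RULES:
        sentence = pattern.sub(replacement, sentence)
    return sentence
```

Because only the text changes, the approval state of a sentence can be handled separately, for example by keeping accepted sentences accepted and resetting the votes on rejected ones.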