Some regex over the bas corpus #393

ftyers · 2020-12-06T02:11:54Z

Would it be possible to run the following regex on the bas corpus?

Remove sentences that end in ; *
Replace *\.\.$ with .
Replace *!\. with !
Change all ^ñ → Ñ
Replace *\.$ with .

If sentences with these patterns have been rejected, please dereject them. If they have been accepted, please leave them accepted but with the modified content.
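
For illustration, here is a minimal Python sketch of how the five changes requested above could be applied to a list of sentences. It assumes that ` *` in each pattern means "any number of spaces" (the leading space may have been lost in formatting) and that the substitutions run in the order listed. `clean_sentences` is a hypothetical helper, not part of the Sentence Collector.

```python
import re

# Substitutions applied in the order given in the request.
# Assumption: " *" stands for "zero or more spaces".
SUBSTITUTIONS = [
    (re.compile(r" *\.\.$"), "."),  # trailing ".." (optionally after spaces) -> "."
    (re.compile(r" *!\."), "!"),    # "!." (optionally after spaces) -> "!"
    (re.compile(r"^ñ"), "Ñ"),       # sentence-initial ñ -> Ñ
    (re.compile(r" *\.$"), "."),    # spaces before the final full stop -> "."
]

# Sentences ending in a semicolon (plus optional spaces) are dropped entirely.
DROP = re.compile(r"; *$")


def clean_sentences(sentences):
    """Return the cleaned sentences, skipping those that end in a semicolon."""
    cleaned = []
    for sentence in sentences:
        if DROP.search(sentence):
            continue
        for pattern, replacement in SUBSTITUTIONS:
            sentence = pattern.sub(replacement, sentence)
        cleaned.append(sentence)
    return cleaned
```

This only covers the text transformation; changing the review status of sentences would still have to happen in the database.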


MichaelKohler · 2020-12-06T12:28:40Z

A few thoughts on this:

We can replace these in the Sentence Collector, but note that this would lead to the old sentences being removed from the common-voice repo's sentences list and the new ones being added. However, the sentences already imported into the Common Voice database, as well as the corpus, would not be changed automatically.
We can also add a script that automatically applies these changes when exporting, so future issues with these would automatically be corrected as well: https:/common-voice/sentence-collector/blob/main/server/lib/cleanup/CLEANUP.md. This could even be done instead of the database migration script.

If sentences with these patterns have been rejected, please dereject them. If they have been accepted, please leave them accepted but with the modified content.

This would require a SQL migration no matter what.

Remove sentences that end in ; *

Can you elaborate on that? What does the * mean in this case? Or do you mean anything that has a semicolon in it?

ftyers · 2020-12-07T11:25:54Z

Can you elaborate on that? What does the * mean in this case? Or do you mean anything that has a semicolon in it?

Here I should have said ; *$ to be clearer. What I mean is any sentence that ends in a semicolon preceded by any number of spaces.

We can replace these in the Sentence Collector, but note that this would lead to the old sentences being removed from the common-voice repo's sentences list and the new ones being added. However, the sentences already imported into the Common Voice database, as well as the corpus, would not be changed automatically.

Have any of them been imported into Common Voice yet anyway? Basaa still isn't enabled afaik. The ideal situation would be to fix them everywhere.

We can also add a script that automatically applies these changes when exporting, so future issues with these would automatically be corrected as well: https:/common-voice/sentence-collector/blob/main/server/lib/cleanup/CLEANUP.md. This could even be done instead of the database migration script.

I think it would be better to have them in the validation step rather than the cleanup step.

MichaelKohler · 2020-12-07T12:12:03Z

Have any of them been imported into Common Voice yet anyway? Basaa still isn't enabled afaik. The ideal situation would be to fix them everywhere.

While languages only get activated for speech contribution, exports from the Sentence Collector are done as soon as the language is enabled in https:/mozilla/common-voice/blob/main/locales/all.json. See https:/mozilla/common-voice/blob/main/server/data/bas/sentence-collector.txt for the already exported sentences.

I think it would be better to have them in the validation step rather than the cleanup step.

Sure, that would work as well, but would require more work to make sure all the rules can be surfaced to the user in a nice way. Basically any special rule that is not already covered would need a nice "title" to inform the user about what exactly is wrong and what the fix would be.
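
To make the "title per rule" idea concrete, here is a small Python sketch of a validation step in which every rule carries its own user-facing message. The `Rule` class, the `RULES` table, the `validate` function and the message wording are all hypothetical and only illustrate the shape. They are not the Sentence Collector's actual implementation.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Rule:
    pattern: re.Pattern  # what to look for in the sentence
    error: str           # message shown to the contributor


# Hypothetical per-language rules; only one example rule for "bas" is given.
RULES: Dict[str, List[Rule]] = {
    "bas": [
        Rule(
            re.compile(r"!\."),
            "Sentence has a full stop after an exclamation mark; remove the full stop.",
        ),
    ],
}


def validate(language: str, sentence: str) -> List[str]:
    """Return the messages of all rules the sentence violates (empty if valid)."""
    return [
        rule.error
        for rule in RULES.get(language, [])
        if rule.pattern.search(sentence)
    ]
```

With this shape, adding a new check means adding one pattern and one message, which is exactly the extra work mentioned above.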

ftyers · 2020-12-07T13:05:52Z

While languages only get activated for speech contribution, exports from the Sentence Collector are done as soon as the language is enabled in https:/mozilla/common-voice/blob/main/locales/all.json. See https:/mozilla/common-voice/blob/main/server/data/bas/sentence-collector.txt for the already exported sentences.

Cool, then we'll want to just remove these ones:

éy! me bak le me kwo i lép.
ñemb a ññwéha i ndap mok.
ñkokon a mmal.
ñkum u ñkôs ndoñ, ñkum u nlôôs ndoñ.
ñkônôk u hibee mañga.
ñkônôk u ñgand.
ñkôô lipidô.

We can add them back again.

I guess for most of the punctuation ones:

Space before final full stop, remove the final space
Full stop after exclamation mark
Sentence ends in a semicolon or comma

For the ñ and .. ones I don't think it makes sense to have them in particular rules, e.g. I could imagine a case where .. was valid (e.g. as part of ...).
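
A rough Python sketch of the three punctuation checks listed above, plus a demonstration of why a blanket `..` rule is risky. The regexes are one possible reading of the descriptions, and `punctuation_problems` is a hypothetical name.

```python
import re

# One possible regex per check from the list above (an interpretation).
CHECKS = [
    ("space before final full stop", re.compile(r" +\.$")),
    ("full stop after exclamation mark", re.compile(r"!\.")),
    ("ends in semicolon or comma", re.compile(r"[;,] *$")),
]

# A naive ".." check, included only to show the false positive on "...".
DOUBLE_STOP = re.compile(r"\.\.$")


def punctuation_problems(sentence):
    """Return the names of all checks that the sentence fails."""
    return [name for name, pattern in CHECKS if pattern.search(sentence)]


# The naive pattern also fires on a sentence that ends in an ellipsis.
ellipsis_is_flagged = DOUBLE_STOP.search("ñkokon a mmal...") is not None
```

Here `ellipsis_is_flagged` is `True`, which is the case described above where `..` is valid as part of `...`.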


# [2.1.0](v2.0.27...v2.1.0) (2020-12-23)

### Bug Fixes

* add migration to remove not yet approved sentences for Guarani (fixes [#400](#400)) ([#406](#406)) ([9b6a08d](9b6a08d))
* change source of some Kazakh sentences as per email request ([#403](#403)) ([2d15d25](2d15d25))
* delete some bas sentences (issue [#393](#393)) ([#404](#404)) ([834b780](834b780))

### Features

* add undecided/rejected text only API endpoint (fixes [#402](#402)) ([#407](#407)) ([bc54da4](bc54da4))
* add validation function for other checks and use it for bas (fixes [#393](#393)) ([#405](#405)) ([8f66671](8f66671))

MichaelKohler · 2020-12-23T13:02:23Z

🎉 This issue has been resolved in version 2.1.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

MichaelKohler added operations P1 labels Dec 22, 2020

MichaelKohler self-assigned this Dec 22, 2020

MichaelKohler added a commit that referenced this issue Dec 22, 2020

fix: delete some bas sentences (issue #393)

580b976

MichaelKohler added a commit that referenced this issue Dec 22, 2020

fix: delete some bas sentences (issue #393) (#404)

834b780

MichaelKohler added a commit that referenced this issue Dec 23, 2020

feat: add validation function for other checks and use it for bas (fixes #393)

0e406c9

MichaelKohler mentioned this issue Dec 23, 2020

feat: add validation function for other checks and use it for bas (fi… #405

Merged

MichaelKohler closed this as completed in 8f66671 Dec 23, 2020
