This repository has been archived by the owner on May 10, 2023. It is now read-only.

Some regex over the bas corpus #393

Closed
ftyers opened this issue Dec 6, 2020 · 5 comments

@ftyers
Contributor

ftyers commented Dec 6, 2020

Would it be possible to run the following regex on the bas corpus?

  • Remove sentences that end in ; *
  • Replace *\.\.$ with .
  • Replace *!\. with !
  • Change all Ñ
  • Replace *\.$ with .

If sentences with these patterns have been rejected, please dereject them. If they have been accepted, please leave them accepted but with the modified content.
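
The replacements requested above could be sketched roughly as follows. This is an illustration only, not the Sentence Collector's code: the function name `normalize` is hypothetical, and the leading ` *` (optional spaces) in each pattern is an assumption, since a bare leading `*` would not be a valid regex. The "Change all Ñ" item is left out because the request does not say what it should become.

```python
import re

# Patterns from the request, applied in the order listed.
# Assumption: each pattern starts with " *" (zero or more spaces).
REPLACEMENTS = [
    (re.compile(r" *\.\.$"), "."),  # trailing ".." with optional spaces -> "."
    (re.compile(r" *!\."), "!"),    # "!." with optional spaces before it -> "!"
    (re.compile(r" *\.$"), "."),    # spaces before the final full stop -> "."
]


def normalize(sentence: str) -> str:
    """Apply each replacement in turn and return the cleaned sentence."""
    for pattern, replacement in REPLACEMENTS:
        sentence = pattern.sub(replacement, sentence)
    return sentence
```

Sentences that already end in a single full stop with no preceding space pass through unchanged.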

@MichaelKohler
Member

A few thoughts on this:

  • We can replace these in the Sentence Collector, but note that this would lead to the old sentences being removed from the common-voice repo's sentences list and the new ones being added. However, the already imported sentences in the Common Voice database as well as the corpus would not automatically be changed.
  • We can also add a script that automatically applies these changes when exporting, so future issues with these would automatically be corrected as well: https://github.com/common-voice/sentence-collector/blob/main/server/lib/cleanup/CLEANUP.md. This could even be done instead of the database migration script.

> If sentences with these patterns have been rejected, please dereject them. If they have been accepted, please leave them accepted but with the modified content.

This would require a SQL migration no matter what.

> Remove sentences that end in ; *

Can you elaborate on that? What does the * mean in this case? Or do you mean anything that has a semicolon in it?

@ftyers
Contributor Author

ftyers commented Dec 7, 2020

> Can you elaborate on that? What does the * mean in this case? Or do you mean anything that has a semicolon in it?

Here I should have said ; *$ to be clearer: I mean any sentence that ends in a semicolon followed by any number of spaces.
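
For illustration, a filter built on the `; *$` pattern might look like the sketch below. The helper name `drop_semicolon_endings` is made up for this example. The pattern matches a semicolon, then zero or more spaces, then the end of the string, so a semicolon in the middle of a sentence is not affected.

```python
import re

# A semicolon, optionally followed by spaces, at the very end of the sentence.
ENDS_IN_SEMICOLON = re.compile(r"; *$")


def drop_semicolon_endings(sentences):
    """Return only the sentences that do not end in a semicolon."""
    return [s for s in sentences if not ENDS_IN_SEMICOLON.search(s)]
```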

> We can replace these in the Sentence Collector, but note that this would lead to the old sentences being removed from the common-voice repo's sentences list and the new ones being added. However, the already imported sentences in the Common Voice database as well as the corpus would not automatically be changed.

Have any of them been imported into Common Voice yet anyway? Basaa still isn't enabled afaik. The ideal situation would be to fix them everywhere.

> We can also add a script that automatically applies these changes when exporting, so future issues with these would automatically be corrected as well: https://github.com/common-voice/sentence-collector/blob/main/server/lib/cleanup/CLEANUP.md. This could even be done instead of the database migration script.

I think it would be better to have them in the validation step rather than the cleanup step.

@MichaelKohler
Member

> Have any of them been imported into Common Voice yet anyway? Basaa still isn't enabled afaik. The ideal situation would be to fix them everywhere.

While languages only get activated for speech contribution, exports from the Sentence Collector are done as soon as the language is enabled in https://github.com/mozilla/common-voice/blob/main/locales/all.json. See https://github.com/mozilla/common-voice/blob/main/server/data/bas/sentence-collector.txt for the already exported sentences.

> I think it would be better to have them in the validation step rather than the cleanup step.

Sure, that would work as well, but would require more work to make sure all the rules can be surfaced to the user in a nice way. Basically any special rule that is not already covered would need a nice "title" to inform the user about what exactly is wrong and what the fix would be.

@ftyers
Contributor Author

ftyers commented Dec 7, 2020

> While languages only get activated for speech contribution, exports from the Sentence Collector are done as soon as the language is enabled in https://github.com/mozilla/common-voice/blob/main/locales/all.json. See https://github.com/mozilla/common-voice/blob/main/server/data/bas/sentence-collector.txt for the already exported sentences.

Cool, then we'll want to just remove these ones:

éy! me bak le me kwo i lép.
ñemb a ññwéha i ndap mok.
ñkokon a mmal.
ñkum u ñkôs ndoñ, ñkum u nlôôs ndoñ.
ñkônôk u hibee mañga.
ñkônôk u ñgand.
ñkôô lipidô.

We can add them back again.

I guess validation rules make sense for most of the punctuation ones:

  • Space before final full stop, remove the final space
  • Full stop after exclamation mark
  • Sentence ends in a semicolon or comma
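
One way to picture these three checks as validation rules, each paired with a message for the contributor, is sketched below. It is an assumption-laden illustration: the `RULES` structure, the `validate` function and the message wording are invented here and are not the Sentence Collector's actual validation API.

```python
import re

# Each rule pairs a pattern with a human-readable message (hypothetical wording).
RULES = [
    (re.compile(r" +\.$"),
     "Sentence should not have a space before the final full stop"),
    (re.compile(r"!\."),
     "Sentence should not have a full stop after an exclamation mark"),
    (re.compile(r"[;,] *$"),
     "Sentence should not end in a semicolon or comma"),
]


def validate(sentence):
    """Return the messages of all rules the sentence violates (empty if valid)."""
    return [message for pattern, message in RULES if pattern.search(sentence)]
```

Returning messages rather than a plain boolean keeps room for telling the user which rule failed and what to fix.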

For the ñ and .. ones I don't think it makes sense to have them in particular rules, e.g. I could imagine a case where .. was valid (e.g. as part of ...).
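
The concern about `...` can be seen in a small sketch. The names `NAIVE` and `GUARDED` are hypothetical, and the leading ` *` is again an assumption about the intended pattern. A plain "two trailing dots" replacement also fires on an ellipsis, whereas a variant with a negative lookbehind leaves three dots alone.

```python
import re

# Naive form: any two dots at the end, with optional spaces before them.
NAIVE = re.compile(r" *\.\.$")

# Guarded form: the match may not start right after another dot,
# so a trailing "..." is not touched.
GUARDED = re.compile(r"(?<!\.) *\.\.$")

print(NAIVE.sub(".", "Wait..."))     # the ellipsis loses a dot
print(GUARDED.sub(".", "Wait..."))   # the ellipsis is preserved
print(GUARDED.sub(".", "Hello .."))  # the real typo is still fixed
```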

@MichaelKohler MichaelKohler self-assigned this Dec 22, 2020
MichaelKohler pushed a commit that referenced this issue Dec 23, 2020
# [2.1.0](v2.0.27...v2.1.0) (2020-12-23)

### Bug Fixes

* add migration to remove not yet approved sentences for Guarani (fixes [#400](#400)) ([#406](#406)) ([9b6a08d](9b6a08d))
* change source of some Kazakh sentences as per email request ([#403](#403)) ([2d15d25](2d15d25))
* delete some bas sentences (issue [#393](#393)) ([#404](#404)) ([834b780](834b780))

### Features

* add undecided/rejected text only API endpoint (fixes [#402](#402))  ([#407](#407)) ([bc54da4](bc54da4))
* add validation function for other checks and use it for bas (fixes [#393](#393)) ([#405](#405)) ([8f66671](8f66671))
@MichaelKohler
Member

🎉 This issue has been resolved in version 2.1.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀
