diff sample predicting note-seg #37
Great! Two things @vgkz
Ideally, could you update the script so that it offers these three choices in future checks as well?
Done! There are 4 ambiguous cases where the sequence contains both note and seg parts. How should those cases be classified?
This is very good!
I'll look into it! I've tried some of the models from my thesis on the note-seg problem but only saw marginal improvements in accuracy. However, I haven't yet looked at the specific examples where errors occur. My thesis includes examples of merged titles and margins, which is a similar problem to merged notes and segs. I'll begin working on a dataset and a model for the splitting problem.
I believe the markdown file is created by a script in the sample-git-diffs tool by @ninpnin, so the script would need to be changed there.
data/198283/prot-198283--130.xml is a speech by the speaker of parliament which looks exactly like a note. Assuming speaker tags are accurate, this could be fixed by labeling any segment directly after a speaker tag as a seg. We could also try incorporating this information in the model as a feature. Otherwise, many of the errors are short or split segments. For example, data/1874/prot-1874--ak--0318.xml consists of a single character. I think labeling these would be more accurate if the data segmentation is improved.
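A minimal sketch of the post-processing rule suggested above, assuming the predictions are available as an ordered list of `(label, text)` pairs and that speaker tags carry the label `"speaker"`. The function names, the label strings, and the `min_chars` threshold are all hypothetical and not taken from the actual scripts:

```python
from typing import List, Tuple

Element = Tuple[str, str]  # (label, text); assumed representation, not the repo's


def relabel_after_speaker(elements: List[Element],
                          speaker_label: str = "speaker") -> List[Element]:
    """Relabel any 'note' that directly follows a speaker tag as 'seg'.

    Assumes speaker tags are accurate. Only the immediately following
    element is affected; later notes keep their predicted label.
    """
    result = []
    prev_label = None
    for label, text in elements:
        new_label = label
        if prev_label == speaker_label and label == "note":
            new_label = "seg"
        result.append((new_label, text))
        prev_label = label  # track the original label, not the relabeled one
    return result


def flag_short(elements: List[Element], min_chars: int = 2) -> List[int]:
    """Return indices of elements whose stripped text is shorter than
    min_chars, e.g. single-character fragments from bad segmentation."""
    return [i for i, (_, text) in enumerate(elements)
            if len(text.strip()) < min_chars]


example = [
    ("speaker", "Herr TALMANNEN:"),
    ("note", "Jag ber att få hälsa kammarens ledamöter välkomna."),
    ("note", "(Applåder)"),
    ("seg", "."),
]
relabeled = relabel_after_speaker(example)
short_indices = flag_short(relabeled)
```

In this toy example, the element right after the speaker tag becomes a seg, while the later note is left alone. The single-character fragment is only flagged for review rather than relabeled, since the better fix is likely upstream in the segmentation.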
Sampled changes
data/1867/prot-1867--ak--0330.xml
Diff starting from line 825
data/1867/prot-1867--fk--0507.xml
Diff starting from line 1257
data/1870/prot-1870--ak--0127.xml
Diff starting from line 605
data/1872/prot-1872--ak--0430.xml
Diff starting from line 1696
data/1873/prot-1873--ak--0507.xml
Diff starting from line 4742
data/1873/prot-1873--ak--0519.xml
Diff starting from line 1942
data/1874/prot-1874--ak--0318.xml
Diff starting from line 996
data/1874/prot-1874--ak--0429.xml
Diff starting from line 3074
data/1875/prot-1875--ak--055.xml
Diff starting from line 1224
data/1877/prot-1877--ak--046.xml
Diff starting from line 2242
data/1880/prot-1880--ak--039.xml
Diff starting from line 4015
data/1882/prot-1882--ak--028.xml
Diff starting from line 3391
data/1882/prot-1882--ak--044.xml
Diff starting from line 2429
data/1884/prot-1884--ak--012.xml
Diff starting from line 113
data/1885/prot-1885--ak--061.xml
Diff starting from line 3873
data/1886/prot-1886--ak--012.xml
Diff starting from line 2302
data/1887/prot-1887-majjul-ak--015.xml
Diff starting from line 1366
data/1888/prot-1888--fk--009.xml
Diff starting from line 2771
data/1888/prot-1888--fk--038.xml
Diff starting from line 2376
data/1895/prot-1895--ak--013.xml
Diff starting from line 186
data/1897/prot-1897--ak--044.xml
Diff starting from line 2043
data/1898/prot-1898--ak--029.xml
Diff starting from line 3204
data/1901/prot-1901--fk--009.xml
Diff starting from line 1312
data/1905/prot-1905--fk--048.xml
Diff starting from line 5437
data/1910/prot-1910--ak--015.xml
Diff starting from line 2859
data/1910/prot-1910--fk--027.xml
Diff starting from line 4310
data/1912/prot-1912--fk--003.xml
Diff starting from line 169
data/1912/prot-1912--fk--011.xml
Diff starting from line 1209
data/1917/prot-1917--ak--055.xml
Diff starting from line 3141
data/1918/prot-1918--ak--014.xml
Diff starting from line 465
data/1918/prot-1918--ak--026.xml
Diff starting from line 3802
data/1918/prot-1918--fk--036.xml
Diff starting from line 2814
data/1918/prot-1918-urtima-ak--005.xml
Diff starting from line 5739
data/1919/prot-1919--ak--024.xml
Diff starting from line 5297
data/1923/prot-1923--ak--018.xml
Diff starting from line 4260
data/1930/prot-1930--ak--013.xml
Diff starting from line 8871
data/1932/prot-1932--fk--047.xml
Diff starting from line 9566
data/1933/prot-1933--fk--019.xml
Diff starting from line 1390
data/1948/prot-1948--fk--013.xml
Diff starting from line 6101
data/1950/prot-1950--ak--016.xml
Diff starting from line 11018
data/1968/prot-1968--ak--020.xml
Diff starting from line 11046
data/1968/prot-1968--ak--036.xml
Diff starting from line 10667
data/197576/prot-197576--149.xml
Diff starting from line 11216
data/197677/prot-197677--056.xml
Diff starting from line 2265
data/197778/prot-197778--038.xml
Diff starting from line 6092
data/198283/prot-198283--130.xml
Diff starting from line 6215
data/198788/prot-198788--114.xml
Diff starting from line 1031
data/199091/prot-199091--129.xml
Diff starting from line 18366
data/199798/prot-199798--118.xml
Diff starting from line 1448
data/201213/prot-201213--078.xml
Diff starting from line 3730