Skip to content

UCREL/KASELL-2022-example

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

KASELL-2022-example

Example code for the presentation given at KASELL 2022. All of the example code assumes you want to tokenise, Part Of Speech tag, and Semantic tag with USAS tags Chinese text with the spaCy small Chinese pipeline and the Chinese PyMUSAS model. The spaCy pipeline tokenises and POS tags and the Chinese PyMUSAS model semantic tags the text. The PyMUSAS model in all examples is added to the spaCy pipeline as the PyMUSAS model is itself a spaCy extension.

Install

Can be installed on all operating systems and supports Python version >=3.7, to install the relevant python requirements, spaCy small Chinese pipeline, and the Chinese PyMUSAS model:

pip install -r requirements.txt

Tagging Text from a file and outputting tagged text to TSV file

The tag_text_from_file.py Python script takes as input two command line arguments:

  • INPUT_FILE -- File path to an input text file. Text should be in Chinese.
  • OUTPUT_TSV_FILE -- File path to a TSV file that you would like the tagged data to be stored.

It will open and tag the text data stored at INPUT_FILE and output the tagged data in TSV format to the OUTPUT_TSV_FILE. The TSV file will contain the following fields:

  • Token Index -- Index of the token.
  • MWE -- Whether the token is part of a Multi Word Expression.
  • MWE Start -- The start token index of the MWE.
  • MWE End -- The end token index of the MWE.
  • Token -- The token text.
  • POS -- Part Of Speech (POS) tag of the token.
  • USAS -- A list of USAS tags of the token, whereby the first USAS tag in the list is the most likely USAS tag.

We have an example text file within this repository, the same example file used in KASELL 2022 presentation, input_file.txt to tag this file and output the tagged data to output_file.tsv, run the following:

python tag_text_from_file.py input_file.txt output_file.tsv
The `output_file.tsv` should contain the following:
Token Index	MWE	MWE Start	MWE End	Token	POS	USAS
0	False	0	1	截至	ADP	['Z99']
1	False	1	2	2016年	NOUN	['Z99']
2	False	2	3	4月	NOUN	['Z99']
3	False	3	4		PUNCT	['PUNCT']
4	True	4	8	国际	NOUN	['Z3']
5	True	4	8	货币	NOUN	['Z3']
6	True	4	8	基金	NOUN	['Z3']
7	True	4	8	组织	NOUN	['Z3']
8	False	8	9	共有	VERB	['N5', 'S1.1.2+', 'S5+', 'N1%', 'S5+c']
9	False	9	10	190	NUM	['N1']
10	False	10	11		NUM	['S2mf']
11	False	11	12	成员国	NOUN	['Z99']
12	False	12	13		PUNCT	['PUNCT']
13	False	13	14	包括	VERB	['A1.8+', 'A1.7+', 'N2', 'A11.1+', 'Q1.1/A1.6', 'X2.5+']
14	False	14	15	科索沃	PROPN	['Z99']
15	False	15	16		PUNCT	['PUNCT']
16	False	16	17		PUNCT	['PUNCT']
17	False	17	18	4	NUM	['N1']
18	False	18	19		NUM	['S2mf']
19	False	19	20	联合国	PROPN	['Z99']
20	False	20	21	会员国	NOUN	['Z99']
21	False	21	22	迄今	ADV	['M6', 'A2.2', 'N4', 'T1.1.1', 'T1.1.2']
22	False	22	23		ADV	['C1', 'F2/O2', 'E3+/A2.1', 'E6+', 'M8', 'E3+', 'O4.5', 'T1.1.2', 'T2++', 'Z4']
23	False	23	24		ADV	['Z6']
24	False	24	25	加入	VERB	['A2.2', 'S5+', 'A1.8+', 'A1.1.1', 'H2', 'S1.1.1', 'Q4.3', 'S1.1.3+', 'T3-', 'N5+/A2.1', 'N2', 'O2', 'Q4.2/I2.2', 'A9-/I1', 'I2.2/I1', 'A2.1+', 'S7.4+', 'S7.1+']
25	False	25	26		PUNCT	['PUNCT']
26	False	26	27	古巴	PROPN	['Z2', 'Z2/S2mf']
27	False	27	28		PUNCT	['PUNCT']
28	False	28	29	朝鲜	PROPN	['Z2']
29	False	29	30		PUNCT	['PUNCT']
30	False	30	31	列支敦士登	PROPN	['Z99']
31	False	31	32		PUNCT	['PUNCT']
32	False	32	33	摩纳哥	PROPN	['Z2']
33	False	33	34		PUNCT	['PUNCT']

About

Example code for the presentation given at KASELL 2022

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages