Task #12: Turkish Lexical Sample Task
Datasets and Formats

Dictionary

The dictionary is the one published by TDK (Turkish Language
Foundation) and is open to the public on the internet (http://tdk.org.tr/tdksozluk/sozara.htm).
It lists the senses along with their definitions and, for some senses,
example sentences. A typical entry from this
dictionary for the word “şey” (thing) is given below:

The entry in the dictionary has the following information: “1. (sense number) Madde, eşya, söz, olay, iş, durum vb.nin yerine kullanılan, belirsiz anlamda bir söz (definition) "Bana sen pek çok şey kazandırdın." (example sentence) - R. H. Karay (citation).”

The dictionary is used only for sense tagging and for enumerating the senses in a standard way. No information other than the sense numbers is taken from the dictionary, so no linguistic processing of the dictionary is needed.

Training and Evaluation Data

We will provide data for 35 words: 10 nouns, 15 verbs, and 10 words from the remaining parts of speech (POS), including adjectives and adverbs. At least 100 examples are tagged per word; for a word with n senses, the number of examples can be higher depending on n. For a few words, however, fewer examples currently exist because of a lack of data. In the final version, all the ambiguous words will have at least 100 examples. Words with fewer examples in the corpus can either be eliminated or be supplemented with other examples in the same format. On average, the selected words have 10 senses each; verbs, however, have more. Approximately 66% of the examples for each word will be delivered as training data, whereas 33% will be kept as evaluation data.

Each corpus sample will comprise 1-10 sentences that include the target word, depending on the Treebank files (i.e. the corpus). Data will be given in txt files for each word under each POS. The samples for words that can belong to more than one POS will be listed under the majority class. The POS will be provided for each sample.

Corpus

Lesser-studied languages such as Turkish suffer from a lack of wide-coverage electronic resources and other language processing tools such as ontologies, dictionaries, morphological analyzers and parsers. There are some projects that provide data for NLP applications in Turkish, such as the METU Corpus Project.
The METU corpus has two parts: the main corpus and the Treebank, which consists of parsed, morphologically analyzed and disambiguated sentences selected from the main corpus. The sentences are given in XML format and provide many syntactic features that can be helpful for WSD. The Treebank can be used for academic purposes under contract.

The texts in the main corpus have been taken from different types of Turkish written texts published in 1990 and afterwards. It has about two million words. It includes 999 written texts taken from 201 books, 87 papers and news from 3 different Turkish daily newspapers. XML and TEI (Text Encoding Initiative) style annotation have been used.

The distribution of the texts in the Treebank is similar to that of the main corpus. There are 6930 sentences in the Treebank. These sentences have been parsed, morphologically analyzed and disambiguated. In Turkish, a word can have many analyses, so having disambiguated texts is very important. Word frequencies have been computed, since they are needed to select appropriate ambiguous words for WSD. There are 5356 different root words; 627 of them have 15 or more occurrences, and the rest have fewer.

The sense tags are not included in the Treebank and have to be added manually. Sense tagging has been completed for some words, and the tags are expected to be checked by experts in order to obtain a gold standard. The initial tagging has been finished by a single tagger and checked. Two other linguists in the team will also tag and check the examples, so this step will be completed by three taggers. Problematic cases will be handled by a commission of three taggers acting as referee. The members of the commission will be different from the original three taggers, and decisions will be finalized only once 90% agreement is reached, within at most two months.

The XML files contain tagging information at the word level (morphological analysis) and at the sentence level (parse tree).
At the word level, inflectional forms are provided; at the sentence level, the relations among words are given. The S tag marks a sentence and the W tag marks a word. IX is the index of the word in the sentence. LEM is left blank; the lemma is given in the MORPH tag, as part of the morphological analysis of the word. REL holds the parsing information and consists of three parts: two numbers and a relation. For example, REL="[2, 1, (MODIFIER)]" means that this word modifies the first inflectional group of the second word in the sentence.

The structure of the Treebank data was designed by METU. Initially the lemmas were to be provided in a tag of their own, but that tag has been left blank. This does not mean that lemmas are unavailable in the Treebank: they are given as part of the “IG” tag. For the time being, programs are available for extracting this information, and all participants can obtain these programs, and thereby the lemmas, easily.
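
As an illustration of the REL convention described above, the following is a minimal Python sketch that reads the IX and REL values of each W element in a sentence and splits REL into its three parts. The XML fragment is hypothetical: only the S and W tags and the IX and REL names come from the description above, while the words, the SUBJECT relation, and the function names `parse_rel` and `sentence_relations` are assumptions made for the example. Real Treebank files carry further information (LEM, MORPH, IG) that is ignored here.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fragment: only the S/W tags and the IX and REL names follow
# the description in the text; the words and the SUBJECT relation are made up.
SAMPLE = """<S>
  <W IX="1" REL="[2, 1, (MODIFIER)]">tuhaf</W>
  <W IX="2" REL="[3, 1, (SUBJECT)]">şey</W>
</S>"""

# Matches values of the form "[2, 1, (MODIFIER)]".
REL_PATTERN = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*\((.+?)\)\s*\]")


def parse_rel(rel):
    """Split a REL value into (head word index, head inflectional group, relation)."""
    match = REL_PATTERN.fullmatch(rel.strip())
    if match is None:
        raise ValueError(f"unrecognized REL value: {rel!r}")
    head_word, head_group, relation = match.groups()
    return int(head_word), int(head_group), relation


def sentence_relations(xml_text):
    """Return (word index, word form, parsed REL) for every W element of one S element."""
    sentence = ET.fromstring(xml_text)
    return [
        (int(w.get("IX")), (w.text or "").strip(), parse_rel(w.get("REL")))
        for w in sentence.iter("W")
    ]


if __name__ == "__main__":
    for index, form, (head_word, head_group, relation) in sentence_relations(SAMPLE):
        print(f"word {index} ({form}): {relation} of group {head_group} of word {head_word}")
```

With the example value from the text, `parse_rel("[2, 1, (MODIFIER)]")` yields the head word index 2, the inflectional group 1 and the relation name MODIFIER.
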
We have extracted the example sentences for the target words, together with some features, and produced text files whose format complies with the arff file structure of the WEKA system. The key files for each word are kept as plain text files that include information about the previous context (root, POS, inflected POS, case marker, possessor, relation), the target word (root, POS (corrected), ontology level1, ontology level2, ontology level3, POS, inflected POS, case marker, possessor, relation) and the subsequent context (root, POS (corrected), ontology level1, ontology level2, ontology level3, POS, inflected POS, case marker, possessor, relation). The feature files are also in txt format, and the key files and the feature files are combined as given below:
The files will be text files obtained from Excel (tab-delimited), and the third column of the above table will be transposed onto a single line, as follows:

00002213148.xml 9 0 tap verb abstraction attribute emotion verb adv ? fl modıfıer sev verb noun abl tr object sıkıl verb abstraction attribute emotion verb verb ? fl sentence 2 2 "#ne tuhaf şey ; değil mi ?iyi olmamdan ; onu taparcasına sevmemden sıkıldı .#"

The Treebank provides all necessary syntactic annotations. The sense tags are provided in the key files for each word. In the key files, sense annotations are given line by line. Each line gives the file id, the sentence number and the occurrence number, along with the fine-grained and coarse-grained sense of that specific word. One can use these key files and the Treebank XML files to get any specific information about the word, its context and its senses. These files for the training data will be open to the public.

Ontology

Evaluation

The format for answers and scoring is broadly similar to the SENSEVAL standard. A "guidelines to taggers" document, comprising detailed instructions on how instances were to be tagged and covering, e.g., multi-word units (including morphological and lexical variants of them), metaphors, missing meanings, etc., will be made available with the lexical entries download, if not before.

The evaluation will be done only for fine-grained and coarse-grained senses. For fine-grained senses no partial points will be assigned; for coarse-grained senses, however, partial points will be possible. Participants can provide a single answer for the fine-grained sense and three answers with associated probabilities for the coarse-grained senses.

The answers of the participants will be provided as a plain text file whose name is the same as the target ambiguous word and whose extension is txt. For example, if the word is "sev" then the answer file will be "sev.txt". Use the following format for each line of the answer files:

File id#Sentence number#Order#Fine-grained sense#Coarse-grained sense1-Probability1#Coarse-grained sense2-Probability2#Coarse-grained sense3-Probability3#

If fewer than 3 coarse-grained senses are applicable, use -2 for the sense and 0.0 for its probability.

Download area

This section will contain evaluation software, useful scripts, complementary materials, baseline systems, etc., but not the datasets proper. The datasets will be available at the main site for download.

System and Results

This section will be completed after the competition.

References

Orhan, Z., Altan, Z., "Impact of Feature Selection for Corpus-Based WSD in Turkish", Lecture Notes in Artificial Intelligence (LNAI), Springer-Verlag, Vol. 4293, Nov. 2006, pp. 868-878.

Orhan, Z., Altan, Z., "Effective Features for Disambiguation of Turkish Verbs", 5. International Enformatika Conference (IEC'05), August 26-28, 2005, Prague, Czech Republic, Aug. 2005, 7, pp. 182-186.
For more information, visit the SemEval-2007 home page.