Design and implementation of a CORpus di Italiano Scritto

To describe the realisation of CORIS briefly, the principal phases may be indicated as follows:

  1. Corpus design
     1. Corpus typology
     2. Corpus size
     3. Representativeness
  2. Design of source text framework
     1. Text typology
     2. Text unit size
     3. Definition of selection criteria
  3. Corpus structure
     1. Subcorpora definition
     2. Subcorpora-to-subcorpora ratio
     3. Definition of sampling criteria
  4. Source data collection and corpus building
  5. Part-of-speech tagging and lemmatisation


In order to design and construct CORIS, some preliminary choices were necessary to lay the foundations for successive stages. First of all the aim of the project was defined, and the type of corpus it was intended to create. From the very beginning, the purpose of the project was identified as being a general corpus, as defined by the Brown Corpus, one of the first electronic corpora. Just as the Brown Corpus was referred to as "a standard sample of present-day English for use with digital computers", so too the aim of CORIS, at the design stage, could be identified with the creation of a collection of texts in electronic format which represent, in the widest sense, present-day Italian. The identification of this aim provided a solution to one of the first problems which arose in the planning of the corpus, the choice between synchronic and diachronic dimensions. It was decided to select texts synchronically in order to permit a generalised description of commonly used Italian.

The choice between written or spoken language gave rise to greater problems. Having taken into account various possibilities, bearing in mind the obvious advantages of having a corpus with both written and spoken texts, it was decided to give priority to written texts at this stage of research. The decision was based both on external and internal criteria. First of all, it was influenced by the general panorama of Italian linguistics and the position that the corpus would occupy alongside works such as Lessico di frequenza dell'italiano parlato (LIP, 1993), Lessico di frequenza della lingua italiana contemporanea (LIF, 1972), Vocabolario elettronico della lingua italiana. Il vocabolario del 2000 (VELI, 1989), Corpus di italiano parlato (Cresti 2000) and LIZ, Letteratura Italiana Zanichelli in cd-rom (1993, 1995, 1997), to name just the most important. We should also mention the Italian Reference Corpus (1991) and the Italian Corpus Documentation PAROLE (1998), both developed at the ILC by the Pisa CNR. Secondly, in the light of transformations in communication by new technologies, it was preferred not to pose the problem of the relationship between the language traditionally considered as standard spoken Italian and its technological ramifications via telephone, radio, television and/or computer technology.

For these reasons, the choice fell upon a synchronic corpus of written language, whose component texts belong, roughly speaking, to the 1980s and 1990s, with a somewhat wider temporal collocation as far as narrative is concerned. They belong to an Italian language which, using the criteria determined by Nencioni (1983), can be described as written-written.

The definition of the size of CORIS required greater thought. A study of presently available corpora clearly revealed that it was not possible to make reference to any standard size. The rapid and widespread development which has characterised, especially in recent years, both the low-cost availability of hardware as well as the production of ever more efficient and user-friendly software, has radically transformed the criteria for the creation of the most recent corpora compared with those of the first or second generation.

While the criteria on which first generation corpora, such as the Brown Corpus, were based may have been mainly influenced by the potentiality of information technology, present-day technology no longer sets any limits to the choices of the researcher, who can extend a corpus to include the varieties held to be relevant to his/her analysis and, within these, make a suitable selection of the varieties of representative texts. Developments in information technology over the past years, the present speed of the processing of material and the low cost of mass storage units mean that it is possible to create corpora consisting of hundreds of millions of words, such as the British National Corpus and the Bank of English. It would seem that, especially as far as written language is concerned, the standard of one million words has given way to a standard of one hundred million. However, any generalisation appears to be debatable, as is any definition of an obligatory limit. The Brown Corpus (1967), with one million words, 500 written text samples of 2000 words each, representing in equal measure the main text types, is still considered by many scholars to be a valid model. One of the most recent English language corpora, the Longman Spoken and Written English Corpus - LSWE Corpus - created by scholars like Biber, Johnson, Leech, Conrad and Finegan, consists of about 40,000,000 words and contains 37,244 texts. It is claimed that these texts vary in length according to register.

A further aspect to be considered in the definition of a corpus relates to the introduction of monitor corpora. These provide for constant updating by means of the periodic introduction of data realised by a collection of filters, on the basis of a selection carried out both on fresh data and on those already introduced. The configuration of the monitor corpus means that the aspects of determinacy and permanence which were defining characteristics of the size of a corpus over the past decades are no longer valid. The corpus takes on a dynamic configuration, which seems more relevant and advantageous if we consider that today, with the new possibilities provided by the development of new technology and memory, it is no longer necessary to go to the trouble of selecting texts. It seems to be possible to manage a corpus whose principal components are delimited and, at the same time, a monitor corpus which is open and able to record innovations and modifications in current usage. This combination makes it possible to access a corpus which is available in a finite form - either on-line or on CD-Rom - and which can be updated by means of the monitor as well as by the introduction of supplementary subcorpora representing further varieties.

It was therefore decided to proceed with the planning of a corpus whose size, though configured as "large", was not predetermined but relative to the choice of linguistic varieties thought to be representative and, as such, set as an intermediate research goal following the compilation of a pilot corpus.

The definition of representativeness is a crucial point in the creation of a corpus, but it is one of the most controversial aspects among specialists, especially as regards the ambiguity inherent in its use due to the intermingling of quantitative and qualitative connotations. While for some scholars the extension of corpora to include hundreds of millions of words might make up for a slight differentiation in the varieties represented, for others a wide differentiation in varieties is set as an essential condition for any act of generalisation.

As far as we were concerned, even in the first phase of research the problem of representativeness did not, in our opinion, disappear with the possibility of enlarging the corpus; indeed, it was underlined even more. In spite of the size increase to hundreds of millions of words, each corpus represents a limited sample of language in use. An operation of sampling, however extensive it may be, inevitably turns out to be simplified in the light of the complexity of the phenomenon under examination. Even building random selections into the corpus construction, it seemed to us that in the transition from the sample to the generalisation, certain degrees of approximation should be provided for, thus allowing maximum flexibility and dynamics in the proposed model.

In the light of problems of what I would call an epistemological nature encountered in the planning of a corpus which could unarguably be defined as being representative of a language or of the state of a language, it was decided to proceed recognising the limits inherent in the project itself and identifying parameters which might eventually counterbalance those limits. Some criteria of identification for the parameters of reference were thus defined which permitted the creation of a collection of sub-corpora which included the chief varieties of written Italian, represented and appropriately balanced. It appeared possible, at the same time, to obtain the elaboration of a model of dynamic and adaptive creation which would satisfy the needs and working hypotheses of different scholars while still respecting the criteria of corpus construction.


In the context of corpus linguistics, one of the basic criteria accepted by all projects and studies is the fact that selected texts must be authentic and commonly used in social interaction. There is however no consensus as to whether to insert texts in their entirety or in fragments which may be defined as being representative. This is indeed a crucial point and was the object of considerable reflection during the planning phase. As we have seen, in the first corpora, such as Brown, standardised sampling was applied. Uniformity of text size was one of their basic construction principles. If there was disagreement, this focused upon the size of the samples. In the designing of the construction model it was held that, considering the present conditions created by software programs, the problem is not so much that of defining sample size but rather of the choice to be made between texts and text fragments.

The first inevitably leads to a lack of standardisation of text samples. It is rarely the case that several texts, whether they be journalistic, narrative or scientific, contain the same number of words. The second, on the other hand, may lead to a stronger influence of the researcher's subjective judgment and implies that the selected sequence is taken out of context. This could mean that the larger size invalidates the very representativeness of the corpus. It was therefore decided that, where possible, the entire text would be entered, rather than standardising sample size.
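The contrast between the two options can be made concrete with a minimal Python sketch. It assumes texts are already tokenised into lists of words; the function name, the toy texts and their lengths are hypothetical, and the 2000-word sample size is simply the Brown Corpus figure quoted earlier, not a CORIS parameter.

```python
def fixed_size_sample(tokens, size=2000, start=0):
    """Return a standardised fragment of at most `size` tokens."""
    return tokens[start:start + size]


# Hypothetical texts of very different lengths (placeholder tokens).
texts = {
    "article": ["w"] * 850,
    "essay": ["w"] * 6400,
    "novel": ["w"] * 72000,
}

# Option 1: standardised fragments, so long texts are cut to the same size.
fragment_lengths = {name: len(fixed_size_sample(t)) for name, t in texts.items()}

# Option 2: entire texts, so unit sizes stay uneven but nothing is cut out of context.
whole_lengths = {name: len(t) for name, t in texts.items()}

print(fragment_lengths)
print(whole_lengths)
```

In this sketch the fragment option also leaves open where each fragment starts, which is one place where the researcher's judgment enters.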

A later step was the definition of linguistic varieties used to create the corpus. These are considered as a collection of documents identifiable on the basis of both external and internal features, in which the peculiarity of the single variety fades away in comparison to the mass of data. This constituted one of the most important points. Although the corpus included specialist areas, such as legal, scientific and bureaucratic-administrative language, an attempt was made to bring together not so much a collection of specialist texts as a variety of types which, according to our investigations, can be placed within a continuum, overlapping and integrating one another.

When defining the selection and creation criteria reference was made to both external and internal criteria in order to reduce the researcher's interference to a minimum. Furthermore, considering the scientific context of CORIS as well as the wide availability of existing and planned corpora, a further criterion was introduced, that of "comparability", in order to offer scholars the possibility of interlinguistic comparison of corpora.


In order to define a first level of articulation of a corpus, what I would describe as criteria of external textuality and comparability were of prime importance. These led to the configuration of a first level of articulation - provided by the sub-corpora - in which it was possible to refer to some macro-varieties identified on the basis of external appearance or the material elements of the text, extremely clear in their characterisation and easily comparable. The subjective choices of the researcher would thus be reduced to a minimum.

As the distinction between "published" and "unpublished" texts was considered to be too simple, various forms of publications from the "press", "narrative" from various types of volumes and essays identified as miscellaneous were then selected, and various hand-written, printed and above all electronic texts were grouped together in a section under the heading of "ephemera" due to their transitory nature.

Having defined these macro-varieties, it was thought necessary to apply a second level of articulation - based on the sections which could be divided into subsections - which, again using external parameters as a basis, still allowed collected data to be contextualised. For example, it was clear that a sampling of the "press" population could not be undertaken except on the basis of a second articulation connected to the socio-cultural reality of the nation. This was considered to be a fundamental point in order to arrive at a definition of a population's components, albeit with some degree of approximation.

The reference to the above-mentioned parameters led to the configuration of the following structure:

Subcorpus    PRESS
Sections     newspapers, periodic, supplement
Subsections  national, local
             specialist, non specialist
             connotated, non connotated

Subcorpus    FICTION
Sections     novels, short stories
Subsections  Italian, foreign
             for adults, for children
             crime, adventure, science-fiction, women literature

Subcorpus    ACADEMIC PROSE
Sections     human sciences, natural sciences, physics, experimental sciences
Subsections  books, reviews
             scientific, popular
             history, philosophy, arts, literary criticism, law, economy, biology, etc.

Subcorpus    LEGAL AND ADMINISTRATIVE PROSE
Sections     legal, bureaucratic, administrative
Subsections  books, reviews

Subcorpus    MISCELLANEA
Sections     books on religion, travel, cookery, hobbies, etc.
Subsections  books, reviews

Subcorpus    EPHEMERA
Sections     letters, leaflets, instructions
Subsections  private, public
             printed form, electronic form
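The two levels of articulation can be represented as a small data structure. The Python sketch below is only one possible reading: it assumes that each line of subsections is an independent axis of classification, so that a text receives one value per axis. The class and attribute names are hypothetical, and only the PRESS and EPHEMERA entries from the structure above are encoded.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class Subcorpus:
    """First level of articulation, with its sections and subsection axes."""
    name: str
    sections: list
    # Assumption: each inner list is one independent axis of classification.
    subsection_axes: list = field(default_factory=list)

    def subsection_combinations(self):
        """All combinations obtained by taking one value from each axis."""
        return list(product(*self.subsection_axes))


press = Subcorpus(
    name="PRESS",
    sections=["newspapers", "periodic", "supplement"],
    subsection_axes=[
        ["national", "local"],
        ["specialist", "non specialist"],
        ["connotated", "non connotated"],
    ],
)

ephemera = Subcorpus(
    name="EPHEMERA",
    sections=["letters", "leaflets", "instructions"],
    subsection_axes=[
        ["private", "public"],
        ["printed form", "electronic form"],
    ],
)

for subcorpus in (press, ephemera):
    print(subcorpus.name, len(subcorpus.subsection_combinations()))
```

Under this reading a newspaper text would be labelled, for example, as national, non specialist and connotated at the same time.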


Having defined the selection criteria, the next step was the planning of the sub-corpora, first taking into consideration an examination of the size they should have and the ratio between the size of the various subcorpora and sections.

An initial idea was to consider the possibility of working on the basis of a randomised selection and to correlate the dimensions of each subgroup of texts to the number, albeit approximate, of the recipients of a given text. The application of quantitative parameters, such as circulation and distribution, proved to be too limiting in comparison with qualitative parameters such as time and type of text use or level of cognitive attention. So despite the difficulties involved in the introduction of qualitative (hence non-measurable) parameters, it was our opinion that merely quantitative data were not sufficiently significant and that they should be integrated, as far as the percentage ratios between sub-corpora and sections were concerned, with qualitative variables, lest any one variety should be overestimated. This choice of procedure was corroborated by an in-depth analysis for 1997:

(data derived from FIEG, La stampa in Italia 1995-1998, Milano, 1999)

  Newspapers          2 955 501 360
  Weekly magazines      730 364 544
  Monthly magazines     194 607 972
  TOTAL               3 880 473 876

(data derived from AIE, La produzione libraria italiana del 1997, Milano, 1999)

  Fiction               119 100 000
  Non-fiction           179 400 000
  TOTAL                 298 500 000

The ratio of 1:12 established, more or less, between texts from the mass media and texts from the book market could not be accepted as being reproducible in the samples. On the other hand, it appeared to be too relevant to ignore, even bearing in mind the comparability of the corpus under construction. Within the ratio allowed by the sales volumes, which, on the basis of the data, is represented as an interval, it was decided to set the ratio between the different areas of circulation as the smallest allowed value in order not to penalise certain textual varieties, such as letters.
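The totals and the resulting ratio can be checked with a few lines of Python. The figures are those reported for 1997 in the tables above; the variable names are illustrative only. On these totals the quotient comes out close to 13, which is of the same order as the "more or less" 1:12 ratio discussed here.

```python
# Figures for 1997 as reported in the FIEG and AIE data above.
press = {
    "Newspapers": 2_955_501_360,
    "Weekly magazines": 730_364_544,
    "Monthly magazines": 194_607_972,
}
books = {
    "Fiction": 119_100_000,
    "Non-fiction": 179_400_000,
}

press_total = sum(press.values())
book_total = sum(books.values())

# How many units of press circulation correspond to one unit of book production.
press_to_book_ratio = press_total / book_total

print(press_total, book_total, round(press_to_book_ratio, 2))
```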

Having selected a wide range of linguistic varieties, documents for the entry of the single sub-corpora were prepared and, in order to comply with the criterion of representativeness, the documents were randomized within each sub-corpus. Having defined this objective corpus framework, the following macro-varieties were defined:

PRESS - 38 million words
FICTION - 25 million words
ACADEMIC PROSE - 12 million words
LEGAL AND ADMINISTRATIVE PROSE - 10 million words
MISCELLANEA - 10 million words
EPHEMERA - 5 million words
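The randomisation of documents within a sub-corpus can be sketched as follows. The exact procedure used for CORIS is not described here, so this is only an illustration under stated assumptions: candidate documents are given as (identifier, word count) pairs, whole texts are entered rather than fragments, and a text that would exceed the word budget is skipped in favour of a shorter one. The function name, the toy documents and the budget are all hypothetical.

```python
import random


def fill_subcorpus(documents, budget_words, seed=0):
    """Shuffle candidate texts and add whole texts until the budget is reached.

    `documents` is an iterable of (doc_id, n_words) pairs.
    Returns the selected identifiers and their total word count.
    """
    rng = random.Random(seed)   # fixed seed so the selection is reproducible
    shuffled = list(documents)
    rng.shuffle(shuffled)

    selected, total = [], 0
    for doc_id, n_words in shuffled:
        if total + n_words > budget_words:
            continue            # assumption: skip texts that would overshoot
        selected.append(doc_id)
        total += n_words
    return selected, total


# Hypothetical candidate texts for one sub-corpus.
candidates = [("a", 1200), ("b", 800), ("c", 3000), ("d", 500), ("e", 2500)]
selected_ids, selected_words = fill_subcorpus(candidates, budget_words=5000, seed=1)
print(selected_ids, selected_words)
```

Because whole texts are entered, the total usually falls slightly short of the budget rather than matching it exactly.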


A corpus of written Italian - a defined model and a dynamic model.

Therefore, the corpus of written Italian - CORIS - appears to be defined along general lines as:

  a collection of texts which are authentic, commonly occurring, in electronic format, chosen as representative of present-day Italian

and in terms of size as:

  a general corpus consisting of 100 million words updated every two years by means of a monitor corpus

CORIS was designed and built as a general reference corpus for the analysis of written Italian and will be placed on-line by June 2001.

At the same time, considering the vital role which will be played by the comparability of a reference corpus, it seemed important to provide for the possibility of creating an alternative corpus structure which would make it adaptable to the needs of different researchers. Besides CORIS, a further corpus - CODIS - has been designed. Aimed at specialist needs which might arise in the context of interlinguistic analysis, CODIS presents a dynamic and adaptive structure that allows the selection of the subcorpora which are pertinent to a specific research project and also the size of every single sub-corpus. CODIS is designed to be dynamically adapted to different comparative needs.
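The kind of selection CODIS is meant to allow can be illustrated with a short Python sketch. It is not the CODIS implementation: the function name and the validation rules are assumptions, and the ceilings in the example simply reuse the PRESS and FICTION sizes given above, expressed in millions of words (Mw).

```python
def configure_codis(available_mw, requested_mw):
    """Validate a user's choice of sub-corpora and sizes (in Mw).

    Returns the accepted configuration and each sub-corpus's share of the total.
    """
    unknown = set(requested_mw) - set(available_mw)
    if unknown:
        raise ValueError(f"unknown sub-corpora: {sorted(unknown)}")

    for name, size in requested_mw.items():
        # Assumption: a requested size must be positive and within the ceiling.
        if not 0 < size <= available_mw[name]:
            raise ValueError(f"size for {name} must be in (0, {available_mw[name]}] Mw")

    total = sum(requested_mw.values())
    shares = {name: size / total for name, size in requested_mw.items()}
    return dict(requested_mw), shares


# Ceilings taken from the sizes listed above; the request itself is hypothetical.
available = {"PRESS": 38, "FICTION": 25}
config, shares = configure_codis(available, {"PRESS": 10, "FICTION": 10})
print(config, shares)
```

A researcher comparing Italian with a smaller foreign-language corpus could, for instance, request equal sizes for the sub-corpora the two corpora have in common.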

User-selectable sizes (Mw)