
Challenges and innovations

Our most significant challenges were:

  • Collecting Samples Reflecting Real Consumer Expectations: We needed to gather exemplary research articles (RAs) that met the expectations of actual consumers of scientific information rather than those of discourse analysts, who are not typically specialists in the fields discussed.
  • Gathering Rhetorically and Stylistically Appealing RAs for Spanish Writers: We aimed to collect RAs that Spanish writers of English for research publication purposes could use as models and with which they could identify.
  • Ensuring Comparable RAs Across English and Spanish: In many disciplines, scientific journals in Spanish had either nearly disappeared or were considered to be of much lower quality than those in English. This posed a challenge in collecting comparable RAs in both languages.

These challenges led to the following innovations:

  • Expert Recommendations: The RAs in the two corpora were recommended by expert informants (i.e. Spanish post-doctoral researchers) from the five institutions participating in the project, rather than being selected by us as discourse analysts.
  • Judgement by Informants: The assessment of the RAs' exemplarity in the two corpora was made by our informants, rather than by us.
  • Comparability Judgements: Our informants assessed the comparability of the RA pairs based on the similarity of the two recommended RAs in terms of their overall research type, study type, and/or topic.

Corpora characteristics

For the first ENEIDA cross-cultural studies, we focused on empirical research articles (RAs). These were compiled in the chosen knowledge areas for two types of collection, or corpus, which were not mutually exclusive:

1. Exemplary Research Articles in English (EXEMPRAE).

  • Empirical
  • Published in an international scientific journal in the collaborator’s field
  • In English
  • Authored by researchers who were not necessarily native speakers of English and who were not limited to Spanish researchers (as the latter would be the focus of the research in Phase 3).

2. Exemplary Research Articles in English and Spanish (EXEMPRAES).

  • Empirical
  • Comparable pairs of research articles similar in terms of topic and type of study (e.g. experimental, descriptive, survey, comparative, longitudinal, case–control, etc.)
  • One published in English in an international journal by an author or authors who were not necessarily native speakers of English and were not limited to Spanish researchers
  • The other published in Spanish in a Spanish journal by a different Spanish author or authors (NOT by the same author or group)
  • Each article must have been originally published in either language (i.e., neither should be a translation of the other).

In both cases, the empirical research articles should:

  • Have made a very favourable impression on the informants during a recent reading, primarily due to how the research is reported rather than the journal’s impact.
  • Have been published in journals in which the informants were considering publishing, or which they had considered suitable for publishing their research in the past.
  • Be as recent as possible (published in 2000 or later), although older articles might be recommended in some disciplines.
  • Be explicitly recommended to us along with the references and the file containing the research article.

Compilation process

The procedure for compiling the corpora was as follows:

  1. Invitation to Recommend Articles: All Spanish post-doctoral researchers from the preselected knowledge areas and affiliated with the five institutions participating in the project were invited to recommend research articles that met the specific characteristics detailed in an enclosed document (see also Moreno, 2013 and in process).
  2. Submission and Justification: We received 313 files containing research articles via email from a self-selected sample of 84 collaborators, who also filled in an online questionnaire to justify their recommendations and to provide contextual information about each recommended article.
  3. Acknowledgement and Follow-Up: A thank-you letter was sent to all collaborators, including a link to an additional online questionnaire for those who wished to apply for a certificate of collaboration and to provide information about their socio-demographic profiles.
  4. Certificate of Collaboration: A certificate of collaboration was sent to those collaborators who applied for it.
  5. Database Creation: Each recommended research article was associated with relevant contextual information in a resulting database.
  6. Screening: The recommended research articles were screened to ensure they complied with the specified characteristics.
  7. Validation: Only those empirical research articles that met the specified characteristics were kept as valid for the research objectives.
  8. Resulting collections: The sets of valid empirical research articles, one consisting of comparable articles in English and Spanish and the other of articles in English only, were termed the EXEMPRAES Corpora and the EXEMPRAE Corpus, respectively.
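
The screening, validation and grouping steps (6 to 8) can be pictured as a simple filter over the database records. The following is only a minimal sketch under stated assumptions: the record fields (`is_empirical`, `is_translation`, `comparable_to`) and the function names are hypothetical illustrations, not the actual structure of the ENEIDA database, and in practice the screening relied on the compilers' judgement rather than on automatic checks.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Recommendation:
    """One recommended article with illustrative (hypothetical) database fields."""
    title: str
    language: str                        # "en" or "es"
    is_empirical: bool                   # False for review articles, for example
    is_translation: bool = False         # assumption: checked for every RA, for simplicity
    comparable_to: Optional[str] = None  # title of the paired English RA, if any


def is_valid(rec: Recommendation) -> bool:
    """Steps 6-7: keep original, empirical RAs written in English or Spanish."""
    return rec.is_empirical and not rec.is_translation and rec.language in {"en", "es"}


def build_collections(recs):
    """Step 8: split the valid RAs into the two (not mutually exclusive) collections."""
    valid = [r for r in recs if is_valid(r)]
    english = {r.title: r for r in valid if r.language == "en"}
    exemprae = list(english.values())
    # A Spanish RA enters EXEMPRAES only together with its valid English counterpart.
    exempraes = [
        (english[r.comparable_to], r)
        for r in valid
        if r.language == "es" and r.comparable_to in english
    ]
    return exemprae, exempraes


# Invented sample records, for illustration only.
sample = [
    Recommendation("A", "en", True),
    Recommendation("B", "es", True, comparable_to="A"),
    Recommendation("C", "en", False),  # a review article: screened out
    Recommendation("D", "en", True),
]
exemprae, exempraes = build_collections(sample)
print(len(exemprae), len(exempraes))
```

In this sketch the English article of a comparable pair also counts towards the English-only collection, which mirrors the fact that the two collections were not mutually exclusive.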

Participants and responsibilities

Principal Investigator: Ana I. Moreno (ULE)

  • Coordinating the entire process of compilation of the Corpora.
  • Managing the project administratively.
  • Designing an innovative method for collecting the corpora, seeking the collaboration of Spanish post-doctoral researchers from the selected knowledge areas in the five participating institutions.
  • Writing the documentation necessary for data collection (RA characteristics specifications, purposes of the research, invitation letters, reminders, thank-you letters, certificates of collaboration).
  • Developing the online data collection tool to be used in all areas.
  • Managing databases.
  • Disseminating results.

Co-investigators: the ENEIDA compilers: Ana I. Moreno (ULE), Sally Burgess and Pedro Martín (ULL), Mª Lluisa Gea-Valor (UJI), Rosa Lorés, Pilar Mur and Enrique Lafuente (UZ)

  • Compiling empirical research articles in specific knowledge areas in collaboration with the Principal Investigator (see table below).
  • Sending the invitation emails and documentation proposed by the PI to the Spanish post-doctoral researchers in their institution.
  • Checking whether the recommended research articles complied with the specified characteristics.
  • Writing their comments and decisions on the resulting database.
  • Sending the files with the RAs and the databases back to the coordinator.


Ramón Rodríguez (CSIC) served as an expert consultant in the compilation process.

ENEIDA Team 1
Researchers in charge of compiling and screening the subcorpora of exemplary empirical RAs
Knowledge Area | Researcher | Institution | Collaborators | Recommended RAs | Valid RAs
Life Sciences | Ana I. Moreno | Universidad de León | 21 | 67 | 44
Healthcare Sciences | Pedro Martín-Martín | Universidad de La Laguna | 15 | 63 | 42
Socio/Psycho/Pedagogy | Mª Lluisa Gea-Valor | Universitat Jaume I | 12 | 39 | 30
Chemistry | Rosa Lorés-Sanz | Universidad de Zaragoza | 14 | 63 | 27
Agricultural Sciences | Enrique Lafuente + Sally Burgess | Universidad de Zaragoza, Universidad de La Laguna | 15 | 49 | 24
Economics | Pilar Mur Dueñas | Universidad de Zaragoza | 4 | 29 | 15
History | Sally Burgess | Universidad de La Laguna | 3 | 3 | 3
Total | 7 | | 84 | 313 | 185

(Adapted from Moreno, 2013)
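
As a quick consistency check, the per-area figures in the table can be added up and compared with its Total row. The short sketch below does only that; the variable names are illustrative, and the retention rate it computes (valid RAs divided by recommended RAs) is a derived figure that is not reported in the original table.

```python
# (collaborators, recommended RAs, valid RAs) per knowledge area, copied from the table.
tallies = {
    "Life Sciences": (21, 67, 44),
    "Healthcare Sciences": (15, 63, 42),
    "Socio/Psycho/Pedagogy": (12, 39, 30),
    "Chemistry": (14, 63, 27),
    "Agricultural Sciences": (15, 49, 24),
    "Economics": (4, 29, 15),
    "History": (3, 3, 3),
}

# Column-wise sums, to be compared with the Total row (84, 313, 185).
total_collaborators = sum(c for c, _, _ in tallies.values())
total_recommended = sum(r for _, r, _ in tallies.values())
total_valid = sum(v for _, _, v in tallies.values())
print(total_collaborators, total_recommended, total_valid)

# Share of recommended RAs kept as valid after screening, overall and per area.
retention = total_valid / total_recommended
print(f"Overall retention: {retention:.1%}")
for area, (_, recommended, valid) in tallies.items():
    print(f"{area}: {valid}/{recommended} = {valid / recommended:.0%}")
```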

Collaborators: 84 expert informants

  • Reviewing the documentation sent to them before deciding whether to participate.
  • Identifying at least one suitable research article in their research field for the collection of exemplary research articles in English and/or one pair of comparable empirical articles in English and Spanish.
  • Sending us the file(s) containing the proposed research articles by email.
  • Filling in an online questionnaire to justify their recommendation(s) and provide relevant contextual information about each recommended RA.

Names of Collaborators:

The 84 expert informants (i.e. Spanish postdoctoral researchers) from the five institutions participating in the project who collaborated in compiling the corpora are listed below by knowledge area. The list includes only the names of the 54 who did not request anonymity.

LIFE SCIENCES
Asunción de Los Ríos Murillo
Jesús Seco Calvo
Francisco Javier Ascaso Puyuelo
Clara De Vega Durán
Leonor Kremer Barón
Ángel Romo Díez
Daniel Martín Sintes
Antonio E. Encina García
Ignacio Moreno Garrido
Teresa Tejedor Hernández
Matilde Cañelles López
E.-Jorge Tizado Morales
Marta Eva García González
Raquel Alonso Redondo
María C. Romero Puertas
Jesús Aparicio Fernández
Luisa Amo De Paz
Inmaculada Aranaz Corral

Total: 18

CHEMISTRY
Agatha Bastida Codina
Patricia Acero Salazar
Urbano Manuel Díaz Morales
Mario Fernández Martín
Santiago Díaz Oltra
Vanesa Calvino Casilda
Jaime Soler Herrero
Joaquín Coronas Ceresuela
Miguel Ángel Camblor Fernández
Javier Aller Fernández
Ana María Fernández Jiménez
María Esther Escudero Baquero
Vanesa Calvino Casilda

Total: 13

AGRICULTURAL SCIENCES
José Alfonso Abecia Martínez
Nieves Del Pilar Vidal González
Gemma Camañes Querol
Mónica Hurtado Ruiz
Yolanda Gogorcena Aoiz
Nuria Prieto Benavides
Maria De Los Ángeles Bustamante Muñoz
Mª Ángeles Álava Martínez De Contrasta
Francisco B Flores Pardo
Juan José Lázaro Paniagua
Antonio De Vega García

Total: 11

MEDICINE
Pedro González Ramos
Maria Orosia Lucha López
José Luis Mauriz Gutiérrez
Elena Estébanez De Miguel
Jose Miguel Tricas Moreno
M.Ángeles Álava Martínez De Contrasta
Francisco Conget López

Total: 7

ECONOMICS
Francisco J. García Rodríguez
Maria Ripollés Melia
Mª Olga González Morales
José-Ángel Miguel Dávila

Total: 4

HISTORY
Francisco Saulo Rodríguez Lajusticia

Total: 1

Other collaborators

  • Hilo Digital:
    • Assisted the Principal Investigator in creating online Google forms.
    • Used Google Drive to archive the files received.
  • Albano Fernández:
    • Generated and archived the files containing the research articles in PDF format by knowledge area.
    • Created a database in RefWorks, including full reference details for all 185 valid RAs.
    • Generated and sent out certificates.
    • Managed the acquisition and archiving of other project bibliography.

Results and discussion: the ENEIDA Corpora

The ENEIDA Corpora currently consist of two main collections: the Exemplary Empirical Research Articles in English (EXEMPRAE) Corpora, which include exemplary empirical research articles published in English, and the Exemplary Empirical Research Articles in English and Spanish (EXEMPRAES) comparable Corpora, which contain comparable research articles published in both English and Spanish.


THE EXEMPRAE CORPORA

  • Screening Process: The ENEIDA compilers reduced the initial 313 recommended research articles to 185 valid empirical research articles (see table below).

  • Exclusions: The primary reason for excluding RAs was that they were review articles. These excluded RAs constitute a collection of exemplars of a different genre that will be dealt with in future projects.

  • Distribution: Of the 185 valid RAs, the largest group (152) was published in English in international journals. This collection is referred to as the EXEMPRAE Corpora and is distributed by knowledge area as shown in the graph on the left.

All Exemplary Empirical RAs in English and Spanish
Knowledge Area | Valid RAs | RAs in English | Comparable RAs in Spanish
Life Sciences | 44 | 36 | 8
Healthcare Sciences | 42 | 35 | 7
Socio/Psycho/Pedagogy | 30 | 24 | 6
Chemistry | 27 | 23 | 4
Agricultural Sciences | 24 | 22 | 2
Economics | 15 | 10 | 5
History | 3 | 2 | 1
Total | 185 | 152 | 33

THE EXEMPRAES COMPARABLE CORPORA

  • Comparable Pairs: Each of the remaining 33 RAs is in Spanish and has been deemed comparable by our expert informants to one of the RAs in the larger English group (see table above), forming 33 comparable pairs.

  • Exclusion of History: Since the History subcorpus contained only one pair of comparable exemplary empirical research articles in English and Spanish, this knowledge area was excluded from our cross-cultural studies, leaving us with 32 comparable pairs.

  • Comparable Corpora: The smaller collection of 32 comparable pairs comprises five large groups of knowledge areas (life sciences, healthcare sciences, social sciences, agricultural sciences, and chemistry), with the distribution shown in the graph on the right.

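[Figures: distribution by knowledge area of the EXEMPRAE Corpora (left) and of the EXEMPRAES comparable Corpora (right)]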

We anticipated challenges in compiling comparable pairs of research articles, particularly in certain areas. However, we successfully compiled a sample of 32 pairs, which are likely the most comparable corpora of academic discourse in English and Spanish to date.

Given the extensive collaboration required, involving postdoctoral researchers from the target knowledge areas, this method was considered a success. It could be applied in other writing contexts where English for Research Publication Purposes (ERPP) is used as an additional language (Moreno, 2013 and in process).