Statistical Inference in Copula Models and Markov Processes, Case Studies and Insights
Open Access
Issue
4open
Volume 2, 2019
Article Number 19
Number of page(s) 10
Section Mathematics - Applied Mathematics
DOI https://doi.org/10.1051/fopen/2019013
Published online 12 June 2019

© C. Cunha et al., Published by EDP Sciences, 2019

Licence Creative Commons
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1 Introduction

Institutions now produce information almost continuously, which makes it possible to review regularly the processes that affect them, as in health, safety and education, among other sectors. To improve their performance, institutions usually adopt internal strategies, such as measuring their processes over a period of time using indices defined by consensus. Some sectors have consolidated indices that can be used to identify changes in performance. In general, indices are constructed to reproduce reality by summarizing it in one or a few values that are simple to interpret and easy to calculate.

In the present study we investigate the relationship between two indicators of the Brazilian educational system. According to the newspaper Estado de São Paulo (Brazil) of January 18, 2017, only 7.3% of students in the third year of high school have an adequate level of mathematics, which shows the relevance of constantly inspecting the performance of the educational system. We restrict our study to the intermediate level (students aged 14–17) of public schools in the region of Guarulhos, for the years 2013, 2014 and 2015. Guarulhos is a city in the state of São Paulo. The city of São Paulo, capital of the state, is the third most populous city in the Americas, behind New York and Mexico City. It is also the city with the largest Gross Domestic Product (GDP) in Latin America, which makes it a city of reference. Because of its constant expansion, the city of São Paulo has been growing toward several municipalities of the state, for instance toward Guarulhos in the northeast. Guarulhos is the second most populous city in the state of São Paulo.

In this study we inspect two indicators, denoted by X and Y. These indicators are components of a global index used in the state of São Paulo, the Índice de Desenvolvimento da Educação de São Paulo (IDESP), created in 2007, http://idesp.edunet.sp.gov.br/. Here X = the annual proportion of students classified below the baseline, per school, and Y = the annual failure rate, per school. Educational policies in the state of São Paulo encourage the monitoring of schools through several indices, among them the IDESP. The goal is to reach an IDESP value equal to or higher than five by 2030, given that, according to the Organisation for Economic Co-operation and Development (OECD), this value would put public schools in Brazil on a level with schools of excellence in OECD member countries; see [1] for more details. Figures 1 and 2 show that schools in Guarulhos have low IDESP values in comparison with the goal, despite the constant efforts made to improve their performance.

Figure 1

IDESP values distribution in Guarulhos, São Paulo State, Brazil (2013–2014). (a) Year 2013, (b) year 2014.

Figure 2

IDESP values distribution in Guarulhos, São Paulo State, Brazil in 2015.

Since 2007, the year it was created, the IDESP has not shown a progressive evolution, which has led us to inspect its most influential components, X and Y. Students are classified into four levels: (i) below the basic level, (ii) basic level, (iii) adequate level, and (iv) advanced level, defined from an annual assessment called Sistema de Avaliação de Rendimento Escolar do Estado de São Paulo (SARESP). Students below the basic level demonstrate insufficient mastery of the contents, skills and abilities desirable for the school grade in which they are enrolled. For details see the next two sections of this paper. From an ideal and simplistic perspective, X and Y should exhibit a linear/concordant relationship. We do not observe this (as can be concluded from Tab. 1), which leads us to model the dependence between X and Y with a more general approach. We use the Asymmetric Cubic Sections (ACS) copula to describe the dependence between X and Y, and we estimate the parameters of the model year by year from a Bayesian perspective. This procedure allows us to construct annual estimates of $\mathrm{Prob}(U > u \mid V > v)$ and annual estimates of the expected value $E(U \mid V > v)$, where U are the ranks of X scaled to [0,1] and V are the ranks of Y scaled to [0,1]. In general terms, these quantities allow us to compare, year by year, the impact of high values of Y on the values of X. More precisely, given that high failure rates are observed, we see how they affect the probability of high rates of students below the baseline and the mean value of the rates of students below the baseline. The ACS family has already performed well in applications in this area, see for example [2] and [3]. It is also compatible with our data which, as we shall see, show very low correlation. Moreover, this family is analytically simple to treat, which facilitates its computational implementation.

Table 1

Spearman’s correlation coefficient ρ between the ranks of the proportion of students classified under the basic level and ranks of the proportion of fails.

Section 2 introduces the real problem and describes the data. Section 3 presents the model and the results. Our general conclusions are given in the Conclusion section, which is followed by the acknowledgments and the references.

2 Index of education development of São Paulo State

In this section we explain the construction of the IDESP and we show the reasons that lead us to study two quantities that contribute to its definition.

The SARESP system aims to evaluate the educational quality of the schools rather than the performance of each student directly. It classifies students into four levels (below the basic, basic, adequate and advanced), and those levels are used to compose the IDESP, which serves as a measure of improvement in the quality of education in the state. The levels serve to diagnose the situation of the students of a given school, so that the teachers of that school can develop projects to recover the skills not developed by that particular group of students. In the SARESP system, each student is classified into one of the four levels separately in two subjects, Portuguese and Mathematics. For each subject and each school, the proportion of students in each level is computed: $\alpha_m, \alpha_p$ for below the basic; $\beta_m, \beta_p$ for basic; $\gamma_m, \gamma_p$ for adequate; and $\delta_m, \delta_p$ for advanced. The quantities with subscript m (p) refer to Mathematics (Portuguese), with $\alpha_m + \beta_m + \gamma_m + \delta_m = 1$ and $\alpha_p + \beta_p + \gamma_p + \delta_p = 1$. Formally, the IDESP index, denoted by η, is defined as follows:

$$\eta = \Delta \, \zeta,$$

where $\Delta = (\Delta_m + \Delta_p)/2$ is the mean value between $\Delta_m$ and $\Delta_p$, with

$$\Delta_s = 10\left(1 - \frac{3\alpha_s + 2\beta_s + \gamma_s}{3}\right), \quad s \in \{m, p\},$$

and ζ is the proportion of approved students. For instance, when $\alpha_m = 1$, the other proportions are zero, $\beta_m = \gamma_m = \delta_m = 0$, and we obtain $\Delta_m = 0$ (low quality in Mathematics). When $\delta_m = 1$, the other proportions are zero, $\alpha_m = \beta_m = \gamma_m = 0$, and $\Delta_m = 10$ (high quality in Mathematics). Thus high values of η indicate that the school shows a good overall performance. Conversely, as expected, high proportions of students below the basic level and high failure rates indicate poor performance and imply low values of η. Each year, the schools receive individual goals to be achieved, defined in terms of the IDESP. These goals are set by the Education Secretary (http://www.educacao.sp.gov.br/) based on the IDESP value of the previous year. When a school reaches the growth goal totally or partially, all the school staff receive a merit-based monetary complement known as the education bonus. If the school has high failure rates and a high number of students below the basic level, it tends to have a low educational indicator and consequently does not receive the bonus. If this continues for three consecutive years, the school becomes a priority unit and may undergo pedagogic interventions and detailed monitoring by the responsible regional institution until its indicators change. In the Guarulhos region this function is exercised by two sectors: Diretoria de ensino Guarulhos Norte, see http://deguarulhosnorte.educacao.sp.gov.br/, and Diretoria de ensino Guarulhos Sul, see http://deguarulhossul.educacao.sp.gov.br/. What usually occurs is that schools have high numbers of students classified below the basic level, which should also lead to high failure rates. However, to mitigate the impact this would have on the indicator, the school regulates the failure rates, always keeping the same pattern in the indicator, regardless of the number of students below the basic level. That is, what regulates the promotion of students is not how much they learn but a structural reality arising from methods of external control. This prospect is worrying, because every year some students are promoted to the next grade without knowing the minimum required in the previous one. Consequently, it also becomes more difficult to recover these students, given the large accumulated lag caused by this automatic promotion. A school has no incentive to report high failure rates consistent with the number of students below the basic level: besides lowering the index and directly affecting the bonus, high failure rates would also mean more work, since a recovery plan would have to be carried out for these students.
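As a rough illustration of how the index responds to its inputs, the following Python sketch computes η from the level proportions and the proportion of approved students. It assumes that η is the product of Δ and ζ and that each subject score weights the levels below the basic, basic, adequate and advanced by 3, 2, 1 and 0. These weights are an assumption, chosen to agree with the two boundary cases described in this section (a score of 0 when every student is below the basic level and a score of 10 when every student is advanced). The function names are hypothetical.

```python
def performance_indicator(alpha, beta, gamma, delta):
    """Subject score on a 0-10 scale from the four level proportions.

    Assumed weights: 3 (below the basic), 2 (basic), 1 (adequate), 0 (advanced).
    """
    total = alpha + beta + gamma + delta
    if abs(total - 1.0) > 1e-9:
        raise ValueError("the four proportions must sum to 1")
    lag = 3 * alpha + 2 * beta + gamma  # weighted lag; the advanced level adds nothing
    return 10.0 * (1.0 - lag / 3.0)


def idesp(math_props, port_props, approval_rate):
    """Assumed form of the index: mean of the two subject scores times the approval proportion."""
    delta_m = performance_indicator(*math_props)
    delta_p = performance_indicator(*port_props)
    return 0.5 * (delta_m + delta_p) * approval_rate
```

Under these assumptions, a school whose students are all at the advanced level in both subjects but where only half are approved obtains `idesp((0, 0, 0, 1), (0, 0, 0, 1), 0.5) == 5.0`, which shows how the approval proportion scales the index directly.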

2.1 Performance levels and fail rates

The data set consists of two scores, X and Y, recorded for each school at the intermediate level (students from 14 to 17 years old): X = proportion of students classified below the basic level and Y = proportion of fails for that school. We have annual data from 2013 to 2015. Each school i receives a value $x_i$, which is the arithmetic mean of the proportions of students below the basic level in the two subjects. For the second variable, each school i receives the value $y_i$, which is the proportion of fails in that year. The state schools participating in this study are listed in http://www.ime.unicamp.br/~veronica/schools.htm. See also the behavior of the IDESP for those years and those schools in Figures 1 and 2. Figure 3 shows the plots of X versus Y for the three years.

Figure 3

Plot between X (horizontal axis) and Y (vertical axis). (a) Year 2013, (b) year 2014, (c) year 2015.

Since our focus is to identify the dependence between X and Y, we appeal to the concepts derived from Sklar’s theorem (see [4]). If X and Y have joint distribution H, with marginal distributions F and G respectively, that is, for values x and y, $F(x) = H(x, \infty)$ and $G(y) = H(\infty, y)$, there is a joint distribution $C: [0,1]^2 \to [0,1]$ (called a copula) such that

$$H(x, y) = C(F(x), G(y)).$$

Since the aim is to study the dependence between X and Y, the marginal distributions F and G carry no information about the relationship between X and Y. Also, if we define U = F(X) and V = G(Y), the concordance/discordance between X and Y is preserved by U and V, since F and G are non-decreasing functions. A natural representation of the values of U (and V respectively) is given by the empirical ranks of the observations of X (and Y respectively) scaled to [0,1]. For this purpose, we compute the pseudo-observations $(u_i, v_i) = (F_n(x_i), G_n(y_i))$, where i = 1, …, n, $F_n$ and $G_n$ are the empirical distributions of X and Y, respectively, and n denotes the number of observations (schools).
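The pseudo-observations described above can be computed directly from the definition of the empirical distribution. The sketch below is illustrative only: the function name is hypothetical, and it evaluates the empirical distribution as the fraction of observations less than or equal to each value.

```python
import numpy as np


def pseudo_observations(x):
    """Empirical distribution evaluated at each observation.

    Returns F_n(x_i) = #{j : x_j <= x_i} / n for every i, so the values lie in (0, 1].
    A common variant divides by n + 1 instead of n to keep the values away from 1;
    which scaling a given study uses is a modelling choice.
    """
    x = np.asarray(x, dtype=float)
    # searchsorted with side="right" counts the sorted values that are <= x_i
    return np.searchsorted(np.sort(x), x, side="right") / x.size
```

Applying the function to each variable separately, for example `u = pseudo_observations(x)` and `v = pseudo_observations(y)`, yields the pairs of scaled ranks on which the copula is studied.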

Table 1 reports the Spearman’s correlation coefficients between X and Y, year by year.

The values of the Spearman’s correlation coefficient ρ are low, although both variables are related to unsatisfactory performance and, for coherence, should be associated. The negative value in 2015 illustrates the inability of the Spearman’s correlation coefficient to capture this dependence. The results of Table 1 only mean that no linear relation is identified between the ranks of the observations, so the alternative is to use a non-linear model to represent the dependence between the rates. Thus, the focus of our study is the identification of C. To reach this identification, we restrict the possibilities for C to a sufficiently flexible family.
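A small numerical example may help to see why a value of ρ near zero does not rule out dependence. The sketch below computes Spearman’s ρ as the Pearson correlation of the ranks (with average ranks for ties) and applies it to made-up values in which y is an exact, but non-monotone, function of x. The function names and the toy data are hypothetical and unrelated to the schools in this study.

```python
import numpy as np


def average_ranks(x):
    """Ranks 1..n, with tied values receiving the average of their ranks."""
    x = np.asarray(x, dtype=float)
    s = np.sort(x)
    lo = np.searchsorted(s, x, side="left")   # number of values strictly below x_i
    hi = np.searchsorted(s, x, side="right")  # number of values <= x_i
    return 0.5 * (lo + hi + 1)                # mean of the ranks lo+1, ..., hi


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation between the two rank vectors."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])


# Toy data: y is fully determined by x, yet the relation is not monotone.
x_toy = [1, 2, 3, 4, 5]
y_toy = [1, 4, 9, 4, 1]
rho_toy = spearman_rho(x_toy, y_toy)  # numerically zero despite exact dependence
```

In this toy case the rising and falling halves cancel in the rank correlation, which is the kind of situation where a copula model can describe structure that a single coefficient hides.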

3 Model and results

Here we introduce the model explored in this paper. This model corresponds to a family of copulas that is a perturbation of the independence case, C(u, v) = uv. With this choice we also seek to cover situations of low correlation, such as those shown in Table 1.

Definition 3.1. The biparametric ACS copula family is given by

$$C_{a,b}(u, v) = uv + uv(1-u)(1-v)\left[(a-b)\,v(1-u) + b\right], \quad (u, v) \in [0,1]^2,$$

where $(a, b) \in \Theta$,

$$\Theta = \left\{(a, b) : |b| \le 1, \ \frac{b - 3 - \sqrt{9 + 6b - 3b^2}}{2} \le a \le 1 \right\}.$$

Its density function is

$$c_{a,b}(u, v) = 1 + (a-b)(1-u)(1-3u)\,v(2-3v) + b(1-2u)(1-2v), \tag{1}$$

and the Spearman’s correlation coefficient is $\rho = (a + 3b)/12$. It should be noted that if a = b, the copula corresponds to the Farlie-Gumbel-Morgenstern family, which admits weak positive as well as negative degrees of Spearman’s correlation. Precisely, since the parameter $b \in [-1, 1]$, then $\rho = b/3 \in [-1/3, 1/3]$. Given the correlation range allowed by the Farlie-Gumbel-Morgenstern family and according to the results of Table 1, our data could be described by this model. Thus, if the estimates of a and b were similar, we could argue that the dependence between U and V is well represented by the Farlie-Gumbel-Morgenstern model.
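The objects in Definition 3.1 are simple polynomials and can be coded in a few lines. The sketch below assumes the two-parameter asymmetric cubic-sections form $C_{a,b}(u,v) = uv + uv(1-u)(1-v)[(a-b)v(1-u) + b]$ from the family of copulas with cubic sections in [5], with $|b| \le 1$ and $[b - 3 - \sqrt{9 + 6b - 3b^2}]/2 \le a \le 1$. The density and the expression $\rho = (a + 3b)/12$ follow from that assumed form by differentiation and integration. All function names are hypothetical.

```python
import numpy as np


def acs_copula(u, v, a, b):
    """Assumed ACS copula: independence (u*v) plus a cubic perturbation."""
    return u * v + u * v * (1 - u) * (1 - v) * ((a - b) * v * (1 - u) + b)


def acs_density(u, v, a, b):
    """Mixed partial derivative d^2 C / (du dv) of the assumed copula."""
    return 1 + (a - b) * (1 - u) * (1 - 3 * u) * v * (2 - 3 * v) + b * (1 - 2 * u) * (1 - 2 * v)


def acs_spearman(a, b):
    """Spearman's rho, 12 * integral of (C - u*v) over the unit square."""
    return (a + 3 * b) / 12


def in_parameter_space(a, b):
    """Membership test for the assumed parameter space of (a, b)."""
    if abs(b) > 1:
        return False
    lower = (b - 3 - np.sqrt(9 + 6 * b - 3 * b**2)) / 2
    return bool(lower <= a <= 1)
```

Setting `a == b` removes the asymmetric term and leaves `u*v + b*u*v*(1-u)*(1-v)`, the Farlie-Gumbel-Morgenstern copula, for which `acs_spearman(b, b)` equals `b / 3`.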

To explore stochastic-functional relationships between U and V, [5] presents a method for constructing copulas with cubic cross-sections; one of these models is given by Definition 3.1. For instance, if we fix $v = v_0$ in Definition 3.1, we obtain:

$$C_{a,b}(u, v_0) = u v_0 + u(1-u)\left[A(v_0)(1-u) + B(v_0)\,u\right],$$

with $A(v_0) = v_0(1-v_0)\left[(a-b)v_0 + b\right]$ and $B(v_0) = b\,v_0(1-v_0)$. Then, the copula is given by a cubic expression in u. Analogously, if we set $u = u_0$, the expression in Definition 3.1 corresponds to a cubic expression in v. In terms of the modeling process, these cubic forms aim to give greater flexibility to the type of dependence between U and V, which is more general than a linear type of dependence.
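The cubic-section property can be checked numerically. The sketch below assumes the ACS form $C_{a,b}(u,v) = uv + uv(1-u)(1-v)[(a-b)v(1-u) + b]$ and writes the section at a fixed $v_0$ as $u v_0 + u(1-u)[A(1-u) + Bu]$; the coefficient names A and B and the function names are hypothetical labels introduced here for illustration.

```python
import numpy as np


def acs_copula(u, v, a, b):
    """Assumed ACS copula."""
    return u * v + u * v * (1 - u) * (1 - v) * ((a - b) * v * (1 - u) + b)


def section_coefficients(v0, a, b):
    """Coefficients of the cubic section in u at a fixed v0 (under the assumed form)."""
    coef_a = v0 * (1 - v0) * ((a - b) * v0 + b)
    coef_b = b * v0 * (1 - v0)
    return coef_a, coef_b


def cubic_section(u, v0, a, b):
    """Section u -> C(u, v0) written explicitly as a cubic polynomial in u."""
    coef_a, coef_b = section_coefficients(v0, a, b)
    return u * v0 + u * (1 - u) * (coef_a * (1 - u) + coef_b * u)
```

Evaluating `cubic_section` and `acs_copula` on the same grid of u values gives matching results, which confirms that each horizontal section of the assumed copula is a polynomial of degree at most three in u.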

For a given year, we compute the likelihood function of the sample of size n, that is

$$L(a, b) = \prod_{i=1}^{n} c_{a,b}(u_i, v_i),$$

where the function $c_{a,b}$ is given by equation (1). Assuming a non-informative prior distribution on Θ, the posterior distribution of (a, b) is proportional to the likelihood function. We use a non-informative prior on (a, b) in order to limit the impact of the prior on the posterior distribution of (a, b). We also observe that the complexity of the parametric space Θ (see Definition 3.1) could hinder the use of an informative prior distribution without a very solid basis. For literature linking copula theory and Bayesian estimation, see [6] and [3]. The Bayesian estimates of a and b under quadratic loss are shown for each year in Table 2. In 2015, five schools did not participate in the study: Profa. Alice Chuery, Conselheiro Crispiniano, Hugo de Aguiar, Profa. Ilia Zilda Innocenti Blanco and Vila Any.
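Under quadratic loss the Bayes estimate is the posterior mean, so with a flat prior it can be approximated by averaging (a, b) over a grid with weights proportional to the likelihood. The sketch below is one possible numerical approach and is not necessarily how the estimates in Table 2 were computed. It assumes the ACS density $1 + (a-b)(1-u)(1-3u)v(2-3v) + b(1-2u)(1-2v)$ and the parameter space $|b| \le 1$, $[b - 3 - \sqrt{9 + 6b - 3b^2}]/2 \le a \le 1$; the function names and the grid size are hypothetical choices.

```python
import numpy as np


def acs_density(u, v, a, b):
    """Assumed ACS copula density."""
    return 1 + (a - b) * (1 - u) * (1 - 3 * u) * v * (2 - 3 * v) + b * (1 - 2 * u) * (1 - 2 * v)


def posterior_mean(u, v, grid_size=101):
    """Grid approximation of the posterior mean of (a, b) under a flat prior."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    a_vals = np.linspace(-3.0, 1.0, grid_size)  # a spans [-3, 1] on the assumed space
    b_vals = np.linspace(-1.0, 1.0, grid_size)
    A, B = np.meshgrid(a_vals, b_vals, indexing="ij")
    lower = (B - 3 - np.sqrt(np.maximum(9 + 6 * B - 3 * B**2, 0.0))) / 2
    inside = A >= lower  # flat prior: constant inside the space, zero outside
    # Density at every observation and every grid point, shape (n, grid, grid)
    dens = acs_density(u[:, None, None], v[:, None, None], A, B)
    loglik = np.log(np.clip(dens, 1e-300, None)).sum(axis=0)
    loglik = np.where(inside, loglik, -np.inf)
    weights = np.exp(loglik - loglik.max())  # unnormalized posterior on the grid
    weights /= weights.sum()
    return float((weights * A).sum()), float((weights * B).sum())
```

Working with the log-likelihood and subtracting its maximum before exponentiating avoids numerical underflow when the product runs over many schools. A finer grid or a sampling method could replace the grid if more precision were needed.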

Table 2

Bayesian estimators of a and b – see Definition 3.1.

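In none of the three years do the estimates indicate the Farlie-Gumbel-Morgenstern copula, since the estimates of a and b are very different. A Bayesian approach is appropriate here for several reasons, among them the moderate sample size available for a frequentist estimation of two parameters and the constraints on the parameters a and b. We estimate the probability $\mathrm{Prob}(U > u \mid V > v)$ and the expected value $E(U \mid V > v)$ by means of the values reported in Table 2, as follows. If X and Y are continuous with cumulative distributions F and G respectively, given U = F(X) and V = G(Y) with 2-copula C,

$$\mathrm{Prob}(U > u \mid V > v) = \frac{1 - u - v + C(u, v)}{1 - v}.$$

Then, using Definition 3.1 and the estimates $(\hat a, \hat b)$ of Table 2, we can define the estimation of $\mathrm{Prob}(U > u \mid V > v)$ as

$$\widehat{\mathrm{Prob}}(U > u \mid V > v) = (1-u)\left\{1 + uv\left[(\hat a - \hat b)\,v(1-u) + \hat b\right]\right\}. \tag{2}$$

Since $\mathrm{Prob}(U \le u \mid V > v) = \dfrac{u - C(u, v)}{1 - v}$ and this conditional distribution has density $\dfrac{1}{1-v}\left(1 - \dfrac{\partial C(u, v)}{\partial u}\right)$, then

$$E(U \mid V > v) = \frac{1}{1-v}\int_0^1 u\left(1 - \frac{\partial C(u, v)}{\partial u}\right) du,$$

as a consequence

$$E(U \mid V > v) = \frac{1}{1-v}\left(\frac{1}{2} - \int_0^1 u\,\frac{\partial C(u, v)}{\partial u}\,du\right). \tag{3}$$

Computing the partial derivative of the copula given by Definition 3.1, we obtain from equation (3) $E(U \mid V > v) = \dfrac{1}{2} + \dfrac{v\left[(a-b)v + 2b\right]}{12}$. Then, we propose the estimation:

$$\widehat{E}(U \mid V > v) = \frac{1}{2} + \frac{v\left[(\hat a - \hat b)v + 2\hat b\right]}{12}. \tag{4}$$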
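The two estimators can be cross-checked numerically. The sketch below assumes the ACS form $C_{a,b}(u,v) = uv + uv(1-u)(1-v)[(a-b)v(1-u) + b]$, under which the conditional tail probability simplifies to $(1-u)\{1 + uv[(a-b)v(1-u) + b]\}$ and the conditional mean to $1/2 + v[(a-b)v + 2b]/12$. The parameter values used are illustrative only and are not the estimates of Table 2; the function names are hypothetical.

```python
import numpy as np


def acs_copula(u, v, a, b):
    """Assumed ACS copula."""
    return u * v + u * v * (1 - u) * (1 - v) * ((a - b) * v * (1 - u) + b)


def tail_prob(u, v, a, b):
    """Closed form of Prob(U > u | V > v) under the assumed copula."""
    return (1 - u) * (1 + u * v * ((a - b) * v * (1 - u) + b))


def cond_mean(v, a, b):
    """Closed form of E(U | V > v) under the assumed copula."""
    return 0.5 + v * ((a - b) * v + 2 * b) / 12


# Illustrative parameters (NOT the paper's estimates)
a, b = -0.5, 0.5
u0, v0 = 0.9, 0.8

# Generic copula formula for the conditional tail probability
generic = (1 - u0 - v0 + acs_copula(u0, v0, a, b)) / (1 - v0)

# E(U | V > v) is also the integral over u of Prob(U > u | V > v); midpoint rule
mid = (np.arange(2000) + 0.5) / 2000
mean_by_integration = tail_prob(mid, v0, a, b).mean()
```

Here `generic` agrees with `tail_prob(u0, v0, a, b)` and `mean_by_integration` agrees with `cond_mean(v0, a, b)` up to discretization error. At independence (a = b = 0) the two closed forms reduce to 1 − u and 1/2, as expected.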

Returning to the real problem, we expect the variables X = proportion of students classified below the basic level and Y = proportion of fails for that school to behave in a way compatible with what they measure. To investigate in detail the coherence of the dependence between X and Y observed year after year, we first focus on the conditional dependence between tail events, estimated by equation (2); we then present a more traditional study of the mean value of U (ranks of X) conditioned on thresholds in V (ranks of Y), estimated by equation (4).

3.1 Conditional tail dependence

The most reasonable behavior of (2) is an increasing tendency in the upper tail. That is, high values of U are expected to occur together with high values of V. We now show what the estimates reveal, for certain values of U (ranks of X) and across all possible values of V (ranks of Y). The behavior of (2), year by year, is illustrated in Figures 4 and 5 for the cases u = 0.5, 0.7 and 0.9. See Table 3 for other values of u.

Figure 4

$\mathrm{Prob}(U > u_0 \mid V > v)$ according to equation (2), for (a) $u_0 = 0.5$, years: 2013, 2014 and 2015. (b) $u_0 = 0.7$, years: 2013, 2014 and 2015.

Figure 5

$\mathrm{Prob}(U > u_0 \mid V > v)$ according to equation (2), for $u_0 = 0.9$, years: 2013, 2014 and 2015.

Table 3

$\mathrm{Prob}(U > u \mid V > v)$ according to equation (2) with u, v = 0.5, 0.6, 0.7, 0.8 and 0.9.

In 2013, (2) is given by a concave quadratic curve in v. As u increases, (2) reduces to the increasing part of the curve and its concavity becomes less pronounced, revealing an almost linear and increasing shape for u = 0.9. The curves (2) for 2014 and 2015 are convex quadratic curves. For 2014, as u grows the curve becomes nearly constant, as can also be verified in Table 3 (case 2014) and is best visualized in Figure 6. For instance, given any threshold v, the probability of U > 0.9 is almost constant. In practical terms, this means that large proportions of students below the basic level do not depend on the failure rate, which is an extreme contradiction. In 2015, as u grows the curve loses its convexity and exhibits an almost linear and decreasing behavior for large threshold values of U (see also Tab. 3). That is, the higher the threshold in V, the smaller the chance of U exceeding values close to 1.
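The shapes described above can be related to the parameters through a short derivation. Assuming the ACS form $C_{a,b}(u,v) = uv + uv(1-u)(1-v)[(a-b)v(1-u) + b]$, the conditional tail probability for a fixed u, written here as $p(v)$ (a label introduced only for this sketch), has the following slope and curvature in v:

$$
\begin{aligned}
p(v) &= (1-u)\left\{1 + uv\left[(a-b)\,v(1-u) + b\right]\right\},\\
p'(v) &= u(1-u)\left[2(a-b)(1-u)\,v + b\right],\\
p''(v) &= 2u(1-u)^2(a-b).
\end{aligned}
$$

Under this assumption the curve is concave when a < b and convex when a > b, and the curvature vanishes like $(1-u)^2$ as u approaches 1, so the curve looks almost linear for u = 0.9. For u close to 1 the slope is governed mainly by the sign of b: a positive b gives an increasing curve, a b close to zero gives a nearly constant curve, and a negative b gives a decreasing curve. This is offered as an aid to reading the figures and is derived from the assumed form, not from the reported estimates.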

Figure 6

$\mathrm{Prob}(U > u_0 \mid V > v)$ according to equation (2), for several values of $u_0$ and year: 2014.

Since the dependence between X and Y is the same as the dependence between U and V, we see a concrete deterioration of the relationship between X and Y from 2013 to 2015: it passes through conditional independence (in 2014) and reaches conditional discordance (in 2015), which makes no sense given the meaning of the variables.

3.2 Central tendency

To build a global view of the behavior of U (ranks of the proportion of students classified below the basic level) conditioned on values of V (ranks of the proportion of fails) that exceed a threshold v, we estimate $E(U \mid V > v)$ by equation (4). A similar behavior of (4) is expected across the 3 years, since we are inspecting consecutive years in which no changes occurred in the educational system. Figure 7 and Table 4 show the results.

Figure 7

$E(U \mid V > v)$ according to equation (4) for years: 2013, 2014 and 2015.

Table 4

$E(U \mid V > v)$ according to equation (4), years 2013, 2014 and 2015.

The relationship between U and V exhibits different behaviors over these 3 years: one curve is concave and two are convex (see also Fig. 4). This shows the lack of robustness of the dependence between X and Y. We can compare the behavior of $E(U \mid V > v)$ with the conditional probability $\mathrm{Prob}(U > u_0 \mid V > v)$, where $u_0$ is the value corresponding to the median of X, as listed in Table 5.

Table 5

Median values of X and the corresponding $u_0$.

The functional behavior of $E(U \mid V > v)$ (Fig. 7) and of $\mathrm{Prob}(U > u_0 \mid V > v)$ (Fig. 8) is similar, as already anticipated when comparing Figures 4(a) and 7. Table 4 shows the values given by equation (4) for v = 0.2, 0.3, …, 0.9. Consider the year 2013 and v = 0.8: the expected value of U (the ranks of X scaled to [0,1]), under the condition $V > 0.8$, is approximately 0.50854, and it belongs to the interval [0.47568, 0.51143] over the period 2013–2015. Table 6 shows the values given by (2) for v = 0.2, 0.3, …, 0.9. The probability that U exceeds 0.49495 is approximately 0.51689 in 2013, under the condition $V > 0.8$, and it belongs to the interval [0.46352, 0.52243] over the period 2013–2015.

Figure 8

$\mathrm{Prob}(U > u_0 \mid V > v)$ according to equation (2), years: 2013, 2014 and 2015, with $u_0$ given by Table 5.

Table 6

$\mathrm{Prob}(U > u_0 \mid V > v)$ according to equation (2), years 2013, 2014 and 2015, with $u_0$ given by Table 5.

These results lead us back to Table 1, where the Spearman’s correlation coefficient exposes its fragility. In the same way, the mean values computed here cannot be expected to reveal clearly what is happening in the tail region of $[0,1]^2$, where we are interested in tracking the concordance/discordance between U and V. This justifies the conditional study developed in Section 3.1.

4 Conclusion

In this paper we explore the dependence between two indicators: (i) the mean of the proportions (in Portuguese and Mathematics) of students below the basic level (SARESP classification) and (ii) the rate of fails, during the years 2013, 2014 and 2015. The data come from around 100 public schools of the city of Guarulhos, the second largest city of the state of São Paulo. The dependence is inspected through Bayesian estimation of the parameters of the ACS copula, a model adopted for its flexibility. We show that the dependence profile behaves in a very unstable way from year to year, although during those years there were no substantial changes that would justify such variable behavior. The Bayesian point estimates of the parameters indicate this instability (see Table 2), which is confirmed by the influence of those estimates on the conditional mean curve given by equation (4). The mean value of the ranks of (i) conditioned on a threshold in (ii) behaves very differently across the 3 years. According to the indications of Table 1, global measures, such as the conditional mean value (4), may not be appropriate to identify what is happening. It is suspected that some kind of handling may exist in (i) and/or (ii), due to the structural aspects of the educational system, which could explain the difference in dependence profiles seen, for example, in Figure 7. To understand the relation between (i) and (ii), we inspect the conditional dependence in different upper tail regions of $[0,1]^2$ for the marginal ranks of (i) and (ii) scaled to [0,1]. The behavior of tail events given by equation (2) is represented in Figure 5. In 2013 the conditional probability behaves as expected: the higher the threshold in the rate of fails, the higher the probability that the scaled rank of the proportion of students below the basic level exceeds 0.9. In 2014, the thresholds in the rate of fails do not influence the probability that this scaled rank exceeds 0.9. In 2015, the higher the threshold in the rate of fails, the lower the probability that this scaled rank exceeds 0.9. That is, the concordance between (i) and (ii) verified in 2013 is inverted into discordance in 2015, precisely at the most critical values, which are high failure rates and high proportions below the basic level.

Based on this study, we perceive the need to review the use of global indices such as the IDESP for the development of policies to control the quality of education. As illustrated in Figure 1, the IDESP appears to exhibit some stability or very slight improvement, while at the same time it can mask aspects that are relevant and decisive for the quality of education. More precisely, it allows the effects of relevant indicators, such as (i) and (ii), to be mitigated.

Acknowledgments

The authors gratefully acknowledge the support for this research provided by the Graduate Program in Professional Masters National Network – PROFMAT to C. Cunha and by CAPES with a fellowship of the Master Program in Statistics - University of Campinas to N. Romano. Also, the authors wish to thank anonymous reviewers for their many helpful comments and suggestions on an earlier draft of this paper.

References

  [1] Cunha C (2017), Estudo sobre Componentes do IDESP na cidade de Guarulhos, Unpublished master’s thesis. University of Campinas, Campinas, Brazil.
  [2] Fernández M, González-López VA (2013), A copula model to analyze minimum admission scores, in: AIP Conference Proceedings, 1558, 1479–1482.
  [3] Fernández M, González-López VA, Rifo LLR (2015), A note on conjugate distributions for copulas. Math Methods Appl Sci 38, 18, 4797–4803.
  [4] Sklar A (1959), Fonctions de répartition à n dimensions et leurs marges. Publ Inst Statist Univ Paris 8, 229–231.
  [5] Nelsen RB, Quesada Molina JJ, Rodríguez Lallena JA (1997), Bivariate copulas with cubic sections. J Nonparametr Statist 7, 205–220.
  [6] García Jesús E, González-López VA, Nelsen RB (2016), The structure of the class of maximum Tsallis-Havrda-Chavát entropy copulas. Entropy 18, 7, 264.

Cite this article as: Cunha C, Fernández M, García JE, González-López VA & Romano N, 2019. A copula-based consistency analysis of education indicators. 4open, 2, 19.
