
# 15.42. Data treatment before training a Neural Network

Applies to: NeuralTools 7.x/8.x

Main steps to follow:

1. Data Quality

The first step in building any prediction or classification model is to evaluate the quality of the data. Identify and fix the following problems in the data set before training:

• Repeated records. The same record should not appear more than once in the data set. If one copy is selected for training and the others for testing, the testing cases are no longer independent of the training cases, and the reported percentage of bad predictions no longer reflects how the network performs on new data.

• Out-of-range values. Numerical values that the variable cannot legitimately take. Example: Age = -3

• Invalid values. Categorical variables with categories that don’t make sense. Example: Marital Status = Bachelor

• Inconsistencies. This type of problem occurs when the values of two or more variables contradict each other.

Example: Age = 20 years, Employment seniority = 25 years.

• Missing data. If more than X% of a variable's values are missing, the variable is removed from the analysis; otherwise, the missing values can be estimated with a data imputation technique.
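The checks above can be sketched outside of Excel with pandas. This is only an illustrative sketch, not part of NeuralTools: the column names, the list of valid categories, the age bounds, the 50% missing-data threshold (the article leaves it as X%), and the choice of median imputation are all assumptions.

```python
import pandas as pd

# Hypothetical data set; names, bounds, and thresholds are illustrative only.
df = pd.DataFrame({
    "age": [34, 34, -3, 20, 51, None],
    "marital_status": ["Married", "Married", "Single", "Single", "Bachelor", "Divorced"],
    "seniority_years": [10, 10, 2, 25, 30, 5],
})
VALID_STATUS = {"Single", "Married", "Divorced", "Widowed"}  # assumed category list
MAX_MISSING = 0.50  # assumed value for the article's X%

# Repeated records: keep only the first copy of each identical row.
df = df.drop_duplicates().reset_index(drop=True)

# Out-of-range values: age must lie in an assumed plausible range.
out_of_range = df["age"].notna() & ~df["age"].between(0, 120)

# Invalid values: categories outside the assumed valid list.
invalid = ~df["marital_status"].isin(VALID_STATUS)

# Inconsistencies: seniority cannot exceed age.
inconsistent = df["seniority_years"] > df["age"]

print(df[out_of_range | invalid | inconsistent])  # rows that need review

# Treat out-of-range ages as missing, then handle missing data.
df.loc[out_of_range, "age"] = float("nan")
missing_share = df.isna().mean()
df = df.drop(columns=missing_share[missing_share > MAX_MISSING].index)
df["age"] = df["age"].fillna(df["age"].median())  # simple median imputation
```

Flagged rows are only printed here; whether to correct or drop them depends on what the source of the data allows.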

2. Univariate Analysis

Once the data issues described above have been fixed, the next step is to run a Univariate Analysis.

• Categorical variables. Build a bar chart or a frequency table of the variable.

- If the variable only has one value, it should be removed from the analysis.

- If there are categories with a frequency lower than 5%, the categories should be regrouped (for example, by merging rare categories) so that every category has a frequency of at least 5%.

- If there are only two categories and one of them has a frequency lower than 5%, the variable should be removed from the analysis.

• Numerical variables. Compute descriptive statistics, and build a histogram and a boxplot of the variable.

- If the variable is highly skewed, consider applying a log transformation to it before the Neural Network training.

- If there are outliers, check whether they are measurement errors before deciding to exclude them.
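The univariate checks can be sketched in Python as follows. The sample values and the names `region` and `income` are hypothetical, merging rare categories into "Other" is just one way to regroup, and the 1.5 × IQR fence is the usual boxplot convention rather than a rule stated in this article.

```python
import numpy as np
import pandas as pd

# Hypothetical samples; values are illustrative only.
region = pd.Series(["North"] * 50 + ["South"] * 43 + ["East"] * 4 + ["West"] * 3)
income = pd.Series([20, 22, 25, 27, 30, 31, 35, 40, 48, 400], dtype=float)

# Categorical variable: frequency table and regrouping of rare categories.
freq = region.value_counts(normalize=True)
rare = freq[freq < 0.05].index                       # categories below 5%
region_grouped = region.where(~region.isin(rare), "Other")
print(region_grouped.value_counts(normalize=True))

# Numerical variable: descriptive statistics and skewness.
print(income.describe())
raw_skew = income.skew()
log_income = np.log1p(income)                        # log(1 + x), safe at zero
log_skew = log_income.skew()
print(f"skewness raw={raw_skew:.2f}, after log={log_skew:.2f}")

# Outliers: values outside the 1.5 * IQR boxplot fences.
q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1
outliers = income[(income < q1 - 1.5 * iqr) | (income > q3 + 1.5 * iqr)]
print(outliers)  # review these before deciding to exclude them
```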

3. Bivariate Analysis

If there is a large number of independent variables, it is convenient to run a Bivariate Analysis, in which each independent variable is analyzed against the dependent variable, one pair at a time.

• Categorical variable vs. categorical variable: If both variables are categorical, run a Chi-square Independence test.

- If the p-value of the test is low (below the chosen significance level, for example 0.05), the independent variable is included in the neural net training; otherwise, it is omitted.
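As a rough illustration, the same screening can be reproduced with SciPy's `chi2_contingency`. The variable names, the counts, and the 0.05 significance level are assumptions made for this sketch.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: one categorical predictor and a categorical target.
df = pd.DataFrame({
    "owns_home": ["yes"] * 40 + ["no"] * 10 + ["yes"] * 15 + ["no"] * 35,
    "defaulted": ["no"] * 50 + ["yes"] * 50,
})

ALPHA = 0.05  # assumed significance level

# Contingency table of predictor vs. target, then the independence test.
table = pd.crosstab(df["owns_home"], df["defaulted"])
chi2, p_value, dof, expected = chi2_contingency(table)

keep_variable = p_value < ALPHA  # low p-value: include in the training
print(table)
print(f"chi2={chi2:.2f}, dof={dof}, p-value={p_value:.4g}, keep={keep_variable}")
```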

• Numerical variable vs. numerical variable: If both variables are numerical, compute the Spearman correlation coefficient.

- If the absolute value of the correlation coefficient is greater than 0.75, the independent variable is included in the neural net training; otherwise, it is omitted.
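A minimal sketch of this rule with SciPy's `spearmanr` follows. The helper name `passes_spearman_screen` and the sample arrays are hypothetical; the 0.75 threshold is the one given above.

```python
import numpy as np
from scipy.stats import spearmanr

def passes_spearman_screen(x, y, threshold=0.75):
    """Return (rho, keep): keep is True when |rho| exceeds the threshold."""
    rho, _p_value = spearmanr(x, y)
    return rho, abs(rho) > threshold

# Hypothetical data: the target grows monotonically (but not linearly) with x1.
x1 = np.arange(1, 21)
target = x1 ** 3
# x2 alternates in sign, so its ranks are nearly unrelated to the target's.
x2 = np.where(x1 % 2 == 0, x1, -x1)

rho1, keep1 = passes_spearman_screen(x1, target)
rho2, keep2 = passes_spearman_screen(x2, target)
print(f"x1: rho={rho1:.3f}, keep={keep1}")
print(f"x2: rho={rho2:.3f}, keep={keep2}")
```

Because Spearman's coefficient works on ranks, `x1` is kept even though its relationship with the target is strongly nonlinear.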

• Numerical variable vs. categorical variable: If one of the variables is numeric and the other one is categorical, run a t-test (available in StatTools) or a Mann-Whitney test (also available in StatTools).

- t-test. The results of this test are based on the assumption that the variables are approximately normally distributed. If this is not the case, the results might not be valid, especially if the sample size is small. You can use the Mann-Whitney test in these cases.
If the p-value of the test is low (below the chosen significance level), the independent variable is included in the neural net training; otherwise, it is omitted.
You can run this analysis through the menu Statistical Inference > Hypothesis Test > Mean/Std. Deviation… of StatTools. Be sure to select the Two-Sample Analysis type.
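For readers who want to cross-check the StatTools output, a two-sample comparison can be sketched with SciPy. The two groups below are hypothetical values of a numerical variable split by a two-category variable, the 0.05 level is an assumption, and `equal_var=False` (Welch's version of the t-test) is a choice made for this sketch that may differ from the StatTools default. The Mann-Whitney call is included as the fallback mentioned above.

```python
from scipy.stats import mannwhitneyu, ttest_ind

# Hypothetical numerical variable, split by the two categories of the other variable.
group_a = [52, 55, 58, 60, 61, 63, 65, 67, 70, 72]
group_b = [40, 42, 45, 47, 48, 50, 51, 53, 55, 57]

ALPHA = 0.05  # assumed significance level

# Two-sample t-test (Welch's variant: does not assume equal variances).
t_stat, p_t = ttest_ind(group_a, group_b, equal_var=False)

# Nonparametric alternative when normality is doubtful or samples are small.
u_stat, p_u = mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"t-test:       t={t_stat:.2f}, p-value={p_t:.4g}, keep={p_t < ALPHA}")
print(f"Mann-Whitney: U={u_stat:.1f}, p-value={p_u:.4g}, keep={p_u < ALPHA}")
```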

- Mann-Whitney test. If the p-value of the test is low (below the chosen significance level), the independent variable is included in the neural net training; otherwise, it is omitted. You can run this analysis through the menu Nonparametric tests > Mann-Whitney test … of StatTools.

Last Update: 2020-06-04