Data sets


Training and test data were extracted from Release 40 of SWISS-PROT (Bairoch et al., 2000). Well-characterized proteins tend to be represented in multiple species, so pooling species would overrepresent their families. To ensure that our model represented all protein families uniformly, we therefore restricted the training data set to proteins of a single species, either human or mouse. Non-nuclear genes and low-quality entries marked with "hypothetical", "reconstruction", "conceptual", or "init_met by similarity" were removed from all data sets.

According to the feature table annotation, we extracted 363 human and 140 mouse non-redundant signal peptides, where two peptides were considered redundant if they were identical in both sequence and length. An expanded set of signal peptides was also collected by including features annotated as "signal by similarity" or "signal potential", resulting in 892 human and 644 mouse non-redundant signal peptides. None of these data sets contained sequences with multiple suggested cleavage sites.

The negative test sets for human and mouse were collected by filtering out entries with annotated signal peptides or with other keywords associated with secreted or cell-surface proteins. In addition, we manually compiled a list of Pfam domains specific to extracellular regions and excluded sequences containing these domains from the negative test sets. This yielded 3234 human and 1958 mouse non-signal sequences.
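The quality filter and the redundancy criterion described above can be illustrated with a minimal Python sketch. It is not the pipeline actually used for this work: the `Entry` record, its fields, and the function names are hypothetical, and the sketch assumes that each entry's annotation text and annotated signal peptide have already been parsed from the SWISS-PROT flat file.

```python
from dataclasses import dataclass

# Markers of low-quality entries, as listed in the text.
LOW_QUALITY_MARKERS = (
    "hypothetical",
    "reconstruction",
    "conceptual",
    "init_met by similarity",
)


@dataclass(frozen=True)
class Entry:
    """Hypothetical, already-parsed view of one database entry."""
    accession: str
    annotation: str       # description and feature-table text
    signal_peptide: str   # annotated signal peptide, "" if none


def is_low_quality(entry: Entry) -> bool:
    """True if the annotation contains any low-quality marker."""
    text = entry.annotation.lower()
    return any(marker in text for marker in LOW_QUALITY_MARKERS)


def non_redundant_signal_peptides(entries):
    """Collect signal peptides, keeping one copy of each distinct peptide.

    Two peptides are redundant when they agree in both sequence and
    length, which is the same as being identical strings, so a set of
    already-seen peptides is enough to detect duplicates.
    """
    seen = set()
    kept = []
    for entry in entries:
        if not entry.signal_peptide or is_low_quality(entry):
            continue
        if entry.signal_peptide in seen:
            continue
        seen.add(entry.signal_peptide)
        kept.append(entry.signal_peptide)
    return kept
```

Under this criterion, a peptide that is a prefix of another (for example, the same sequence with a cleavage site one residue earlier) is kept as a separate entry, because the two differ in length.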