“Machine Learning Is Cool. I’m Cool.”

A question presents itself to the psychoanalytic mind: Who even thinks these things?

When the valence of mind deteriorates past a certain point, people start looking for exits. One of these exits is to claim that everything is a dream, everything is empty, or diluted of substance in some way. This can be a way to dissociate from the social context that breathes the fire of our self and suffering. The sequence similarity between humans/Homo sapiens and green monkeys/Chlorocebus sabaeus is 94%, so it is to be expected that we would want to cut ties with our social group after trauma; our method for clipping the cord to the social eyes is only slightly more sophisticated than that of other social mammals who diverged from a relatively recent common ancestor.

[Ahh… Yes. This is why Elon Musk dotes on the simulation; longs for the holographic principle to delete his curse.]

What is stated around here about the nature of reality should not be confused with that genre, even if it sounds weird and you therefore complete the pattern: escapist. I am committed to life. And by life, I mean life in the conventional sense from the indexical present which contains human persons dying from trauma and neurofibrillary tangles.

But talk is cheap. Let me take a short detour here to contribute to anti-aging research and prove that I believe in us:

So I contacted the SENS Research Foundation, which is lonely at the frontline in the battle to save humanity (aging is the number one cause of death and disease, remember). They gave me a link to a dataset which contains genes associated with aging. And I’m going to use my machine learning skills to see what I can do with it.

Here is the dataset.

Clean The Dataset

A human may understand that 5p13.1 represents a cytogenetic location. Let me correct that: A smart human might understand that 5p13.1 represents a cytogenetic location. But a neural network certainly can’t take the statement 5p13.1 without modification.

All must be transmuted to digits before it is presented to the neural network. It is not that a neural network is incapable of dealing with human-understandable categories, since such a limitation would surely defeat the point of using such a tool. It is merely the case that we need to repackage the categories in a representation that it can understand.

There are 16 fields on the gene data set. The eleventh field indicates the orientation of the gene. This is represented by a 1 or -1. The 1 and -1 correspond to this:

[Image: Screen Shot 2018-07-04 at 7.54.53 PM]

RNA is transcribed in the 5′ to 3′ direction. But although a gene always has the orientation 5′ to 3′, it can be on one of two opposite strands denoted by + and -. This is what I will choose as my output label.

Now I have to look for the candidate input labels: those that stand a chance of having a meaningful correlation with the output label. The first six labels (GenAge ID, symbol, aliases, name, entrez gene id, and uniprot) and four of the last five (acc promoter, acc orf, acc cds, and references) can be neglected, since they are IDs that tell us about naming conventions and nothing about the physical structure. That leaves 5 fields for consideration apart from the output label.

Of these 5, let’s inspect which columns don’t present their information in digits.

This is the first row:

[Image: Screen Shot 2018-07-05 at 6.44.57 AM]

Crowded, I know. But the 5 things we care about are on the indices 7, 8, 9, 10, and 16:

[Image: Screen Shot 2018-07-05 at 10.18.34 AM]

You will see that several labels are not digits: why, location, and orthologs all have values that are not digits. We need to transform them into digits in a meaningful way before passing them into the neural network. And they cannot be encoded into just binary digits (0’s and 1’s) because for each label, there are more than 2 possible values.
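
Before anything gets encoded, those five fields have to be pulled out of each row. Here is a minimal sketch of that step in plain Java, assuming the dataset is a comma-separated file in which multi-valued fields such as "mammal,model,cell" are wrapped in double quotes. The class name GenAgeRow and its method names are made up for illustration, and the splitter deliberately ignores escaped quotes inside a field; a proper CSV reader would be the safer choice for real work.

```java
import java.util.ArrayList;
import java.util.List;

class GenAgeRow {
    // 1-based positions of the five candidate input fields, as counted in the text.
    static final int[] FEATURE_POSITIONS = {7, 8, 9, 10, 16};

    /** Splits one CSV line into fields, treating commas inside double quotes as data. */
    static List<String> splitCsvLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;           // toggle state; the quote itself is dropped
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString()); // an unquoted comma ends the field
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());         // the last field has no trailing comma
        return fields;
    }

    /** Picks the fields at the given 1-based positions. */
    static List<String> select(List<String> fields, int[] positions) {
        List<String> picked = new ArrayList<>();
        for (int p : positions) {
            picked.add(fields.get(p - 1));
        }
        return picked;
    }
}
```

Calling GenAgeRow.select(GenAgeRow.splitCsvLine(line), GenAgeRow.FEATURE_POSITIONS) on one line of the file would hand back the why, location, start, end, and orthologs values as strings, ready for the encoding steps.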

For example, looking at the data we see that the label why can have the values “mammal” or the value “cell, functional” or the value “mammal, model, cell”, along with several others.

And the label location can have the values appropriate for a gene locus: 17p13.1, or 20q11.2, or 10q22.2, or whatever other value is appropriate for a gene locus. If we had to specify just the chromosome for the gene in a human, we would already have 24 different possibilities (chromosomes 1 to 22, X, and Y).
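
To make the point about chromosomes concrete, here is a small sketch that pulls the chromosome and the arm out of a locus string. It assumes the usual cytogenetic notation, where the chromosome comes first and is followed by p or q for the arm. The class name Locus and its methods are hypothetical helpers, not part of any library.

```java
class Locus {
    /** Returns the index of the first arm letter ('p' or 'q'), or -1 if there is none. */
    static int armIndex(String band) {
        for (int i = 0; i < band.length(); i++) {
            char c = band.charAt(i);
            if (c == 'p' || c == 'q') {
                return i;
            }
        }
        return -1;
    }

    /** Chromosome part of a band: "17p13.1" gives "17", "Xq28" gives "X". */
    static String chromosome(String band) {
        int i = armIndex(band);
        return i < 0 ? band : band.substring(0, i); // no arm letter: return input unchanged
    }

    /** Arm letter of a band, or '?' when the band has no arm letter. */
    static char arm(String band) {
        int i = armIndex(band);
        return i < 0 ? '?' : band.charAt(i);
    }
}
```

Reducing a locus to its chromosome is one way to shrink the number of categories before encoding, at the cost of throwing away the finer position.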

[Image: Screen Shot 2018-07-05 at 10.46.03 AM]

Since we have so many possible values for each label that we care about, this situation calls for one-hot encoding.

So I have set out to follow the conclusions of this procedure:

If the values are not digits → check whether they should be binary.

If they should be binary → encode in binary digits.

If they should not be binary → one-hot encode.
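
The procedure above can be sketched as a small decision function. I am assuming here that "should be binary" simply means the column has at most two distinct values, which is my reading of the text and not a general rule. The names EncodingChooser, Encoding, and choose are invented for this example.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class EncodingChooser {
    enum Encoding { KEEP_NUMERIC, BINARY, ONE_HOT }

    /** True when the string is a plain (optionally negative) integer. */
    static boolean isInteger(String s) {
        return s.matches("-?\\d+");
    }

    /** Applies the three-step procedure to all values observed in one column. */
    static Encoding choose(List<String> columnValues) {
        boolean allNumeric = true;
        for (String v : columnValues) {
            if (!isInteger(v)) {
                allNumeric = false;
                break;
            }
        }
        if (allNumeric) {
            return Encoding.KEEP_NUMERIC;   // already digits, nothing to encode
        }
        // Assumption: two or fewer distinct values means the column is binary.
        Set<String> distinct = new HashSet<>(columnValues);
        return distinct.size() <= 2 ? Encoding.BINARY : Encoding.ONE_HOT;
    }
}
```

Under that assumption a column like why, with its dozens of distinct strings, lands in the one-hot branch.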

My ultimate goal here is to predict whether a gene is on the 5′→3′ DNA strand, a.k.a. the ‘sense’, ‘plus’, or ‘coding’ strand, or on the complementary strand, known as the ‘antisense’, ‘minus’, or ‘non-coding’ strand. The + strand has a sequence identical to that of the pre-messenger RNA (except for uracil (U) in RNA instead of thymine (T) in DNA); it is the coding strand, which is not itself transcribed. The complementary strand is the one that is transcribed by the RNA polymerase.
Knowing my ultimate goal, I must take care to make all the data relevant to the final prediction. So I must inspect with my own human eyes and intuitions what the uncleaned data contains.

For the why label/column, the possible values are:

mammal

“mammal,model,cell”

“mammal,cell”

“cell,functional”

human

“human,mammal,cell”

model

“model,functional”

“cell,downstream”

downstream

functional

putative

“mammal,functional,downstream”

“model,putative”

“model,cell”

“model,downstream”

“cell,upstream”

“functional,putative”

“mammal,putative”

upstream

“functional,downstream”

“upstream,putative”

“downstream,putative”

cell

“model,human_link”

“mammal,model”

human_link

“mammal,functional”

“functional,upstream”

“cell,putative”

“mammal,upstream,downstream”

“mammal,human_link”

Each one of those represents a single value that is possible under the label why. We can choose to one-hot encode them or further engineer them into more sophisticated categories that split the column in pieces so that overlap of the variable is reduced.
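
As a sketch of the second option, the combined values can be split on commas and encoded as one slot per individual tag (sometimes called multi-hot encoding). The nine tags are read off the list above. The class name WhyMultiHot and the order of the tags are arbitrary choices made for this example.

```java
import java.util.Arrays;
import java.util.List;

class WhyMultiHot {
    // The nine individual tags that appear, alone or combined, in the values of "why".
    static final List<String> TAGS = Arrays.asList(
            "mammal", "model", "cell", "functional", "human",
            "downstream", "upstream", "putative", "human_link");

    /** Turns a value such as "mammal,model,cell" into a 0/1 vector with one slot per tag. */
    static int[] encode(String why) {
        int[] vector = new int[TAGS.size()];
        for (String tag : why.split(",")) {
            int index = TAGS.indexOf(tag.trim());
            if (index < 0) {
                throw new IllegalArgumentException("Unknown tag: " + tag);
            }
            vector[index] = 1;  // several slots may be set for a combined value
        }
        return vector;
    }
}
```

With this representation "mammal,cell" and "mammal,model,cell" share two of their set slots instead of being treated as unrelated categories, and the vector has 9 entries instead of one per combination.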

I will one-hot encode them for now. So I assign each of these categorical values an integer from 0 to 33 and then translate that integer into a vector: an array of 0’s with a single 1 at the corresponding index.
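
In plain Java, without any library, the idea looks roughly like this. It is a sketch only: the class OneHot and its two methods are hypothetical names, and the integer given to each category just follows the order in which the categories are first seen.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class OneHot {
    /** Gives each distinct category an integer index, in order of first appearance. */
    static Map<String, Integer> indexCategories(List<String> categories) {
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String category : categories) {
            index.putIfAbsent(category, index.size()); // repeated categories keep their first index
        }
        return index;
    }

    /** Builds an array of 0's with a single 1 at the index of the given value. */
    static int[] encode(String value, Map<String, Integer> index) {
        Integer position = index.get(value);
        if (position == null) {
            throw new IllegalArgumentException("Unknown category: " + value);
        }
        int[] vector = new int[index.size()];
        vector[position] = 1;
        return vector;
    }
}
```

The index map is built once from all the values of the column, and encode is then called for every row.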

You can follow along by doing the following:

Download a 64-bit version of Java from here: Java SE Development Kit 8 Downloads

Now you must set JAVA_HOME.

If you have a Mac, go to terminal and run the following commands:

export JAVA_HOME=jdk-install-dir

export PATH=$JAVA_HOME/bin:$PATH

If you have a different system click here.

You also need an IDE such as IntelliJ.

Download either the permanently free Community or the free trial for Ultimate.

You need Maven.

For a Mac, go to terminal and

brew install maven

If you have a different system click here.

You also need git.

Go here if you don’t have it already.

If you have it already, you can fetch the latest development version of its source with this

git clone git://git.kernel.org/pub/scm/git/git.git

Enter this into terminal

git clone https://github.com/deeplearning4j/dl4j-examples.git
cd dl4j-examples/
mvn clean install

Open IntelliJ and choose Import Project.

Select dl4j-examples.

Choose ‘Import project from external model’ and ensure that Maven is selected.

A simple machine learning algorithm cannot “learn” about information such as words and genes without proper translation. “All is number,” said Pythagoras. “All is number,” says the machine.

There are five fields on the dataset that we care about. Our output label, the one that can be a 0 or 1 and which we are learning to predict, is the orientation described by a 1 or -1 on index 11 in the original data.
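
The relabeling of the orientation can be sketched in a few lines. Mapping 1 to 1 and -1 to 0 is my assumption about which strand gets which label, since the text only says the label is a 0 or a 1; the class name OrientationLabel is likewise made up.

```java
class OrientationLabel {
    /** Maps the raw orientation field ("1" or "-1") to a 0/1 output label. */
    static int toLabel(String rawOrientation) {
        int orientation = Integer.parseInt(rawOrientation.trim());
        if (orientation == 1) {
            return 1;   // assumed: + strand becomes label 1
        }
        if (orientation == -1) {
            return 0;   // assumed: - strand becomes label 0
        }
        throw new IllegalArgumentException("Unexpected orientation: " + rawOrientation);
    }
}
```

Anything other than 1 or -1 is rejected instead of being silently turned into a label, so malformed rows surface early.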

When building the schema, you use string for things that aren’t composed solely of numbers in the original data.

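import java.util.Arrays;
import org.datavec.api.transform.schema.Schema;

// Describe the columns of the CSV in file order.
Schema schema = new Schema.Builder()
        .addColumnInteger("GenAge ID")
        .addColumnString("symbol")
        .addColumnString("aliases")
        .addColumnString("name")
        .addColumnInteger("entrez gene id")
        .addColumnString("uniprot")
        // Only the first category is shown; the other "why" values listed earlier belong in this list too.
        .addColumnCategorical("why", Arrays.asList("mammal"))
        .addColumnString("band")
        .addColumnInteger("location start")
        .addColumnInteger("location end")
        .addColumnInteger("orientation")
        // Assumption: the remaining five columns are not purely numeric, so they are declared as strings.
        .addColumnsString("acc promoter", "acc orf", "acc cds", "references", "orthologs")
        .build();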

Unfortunately, this was all signaling, and real progress requires that we awaken from the slumber of the misaligned need to impress those around us. The competitive spirit of mankind at large must be funneled into the establishment of rejuvenation therapies that roughly follow the outline sketched by Strategies for Engineered Negligible Senescence, so that our tissues and cells are rejuvenated, a safety net of biological youth is unlocked, and an evil is slain.

How can people be true when their bodies rot? How can they read with comfort and grace when entrance to the library requires signing a contract to burn with all the books?

How can they love when those around them will be destroyed?

My duties as a kitchen-knave are done for now. I hope Lynette and Mother see that I am fit to serve the King, fit to be a hero.

Then sprang the happier day from underground;

And revel and song, made merry over Death,

So large mirth lived and Gareth won the quest.

 

 
