For someone like me who’s always been interested in practical machine learning, it was a delight to discover Word2vec, a neural network that takes in text and outputs numerical vectors. It maps each word in a sentence to a vector of numbers that can be used to predict the probability of related words, and that supports mathematical operations over words with similar meanings. This means the method never needs to know the exact definition of a word, yet with enough data it can capture relationships between words better than the average human being.
Of course, that’s not all Word2vec can do. The applicability of Word2vec (there are 2 distinct model architectures: continuous bag-of-words and skip-gram) seems to go well beyond predicting words from their surrounding context.
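To make the idea concrete, here is a minimal sketch of how the skip-gram variant frames the task: each word is asked to predict its neighbours, so the training data is just (center, context) pairs slid over the text. This is my own illustration in Python, not the original Word2vec implementation, and the function name and window size are hypothetical:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs for the skip-gram model.

    Each word is paired with every word within `window` positions of it;
    a neural network then learns vectors that predict context from center.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["the", "cat", "sat", "down"], window=1)
```

The vectors themselves come from training a shallow network on these pairs; the pair generation is the part that shows why no dictionary definitions are needed.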
If you’re interested in learning more about Word2vec and its intricacies, check out:
- This document published by an Israeli computer science PhD (with a link to the PDF).
- A paper by the folks at Google on the composition of sentences and phrases.
What is efficiency?
That’s the question that plagued me as I stared blankly at a log telling me that a few simple queries had taken an abysmal amount of time to run.
Usually, I run a few SAS queries before I leave work so that I can look at the results as soon as I arrive in the morning. I was running a total of 5 months’ worth of financial product performance information, split into 5 different tables. There wasn’t a lot of information, so the expected run time was about 10 minutes per table. Hence I was flabbergasted when the log told me it took 1 hour 28 minutes to run each month of data.
Long story short, after playing around with it some more, I decided to condense the 5 queries into a single SQL query and ran it again. Guess how long it took this time?
5 minutes 28 seconds.
A 1236% increase in efficiency.
I think I’ll be able to go home early today.
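The consolidation itself is the classic one-pass pattern: instead of running one query per month, a single grouped query scans the data once. Here is a sketch with sqlite3 standing in for SAS’s PROC SQL; the table and column names are hypothetical stand-ins, not the actual financial tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE perf (month TEXT, product TEXT, balance REAL)")
conn.executemany("INSERT INTO perf VALUES (?, ?, ?)", [
    ("2016-01", "A", 100.0), ("2016-01", "B", 50.0),
    ("2016-02", "A", 110.0), ("2016-02", "B", 55.0),
])

# One query per month means one full scan per month. Grouping by month
# instead computes every month's summary in a single pass over the table.
rows = conn.execute(
    "SELECT month, product, SUM(balance) FROM perf "
    "GROUP BY month, product ORDER BY month, product"
).fetchall()
```

The speedup comes from reading the source table once instead of five times, which matters far more than any per-query tuning.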
Running an advanced algorithm in SAS takes a long time; I just came back from lunch and the query is still running. Life is a game of patience, and frankly, by the time it finishes running I’d be fossilized. To be fair, my lunch lasted less than half an hour, so it probably isn’t a giant query like the one I wrote a few months ago (which took 50 minutes).
Most of the issues with run time come from joining giant tables. I tried adjusting the queries to make them more efficient: reducing the size of the data sets being joined, packing as many actions into each query as I could so there’d be less computation in the next one, and writing macros so that there’d be less repeated code. However, it seems that code efficiency does not mean computation efficiency.
There may be a few things causing this problem:
- Macros are actually slowing things down: since I’m joining huge datasets, the macros run through them again and again. There’s no way to avoid this, since I need the same information for different risk levels from the same tables. A possible cure is to run the programs on a subset of the original database.
- Too many actions in one query: after a bit of Googling I found that this isn’t a problem; it’s actually beneficial to combine different actions into one query. That’s one myth busted.
- Crummy coding: this is a little unlikely, since I’m writing SQL, which looks much the same across the board when I’m merely extracting data for exploratory analyses. Perhaps I could split the work into separate queries and see how long each takes individually.
- Conflicting computation requirements: it seems that I’m not the only one skipping vacation in the days after Christmas and before New Year’s. Plus, a minor issue is currently on the hot seat with another team, so they’re requesting a lot of support from the data analysts who operate in and out of the SAS systems. Again, no way out of this one but patience.
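The subsetting cure from the first point can be sketched as: filter the lookup table down to one risk level before joining, so the join only touches the rows it needs. This uses sqlite3 in place of SAS SQL, and the table and column names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE txns (account TEXT, amount REAL);
    CREATE TABLE accounts (account TEXT, risk_level TEXT);
""")
conn.executemany("INSERT INTO txns VALUES (?, ?)",
                 [("a1", 10.0), ("a2", 20.0), ("a3", 30.0)])
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("a1", "high"), ("a2", "low"), ("a3", "high")])

# Restrict to the risk level of interest BEFORE joining, so the join
# scans a small subset instead of the full accounts table every time.
rows = conn.execute("""
    SELECT t.account, t.amount
    FROM (SELECT account FROM accounts WHERE risk_level = 'high') a
    JOIN txns t ON t.account = a.account
    ORDER BY t.account
""").fetchall()
```

Repeating this per risk level still rescans the transaction table, but each join is against a far smaller set of keys.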
Of the four issues described above, I could only potentially improve on one (#3), so the overall gain in processing speed would be small. I might spend some time reviewing the loess method I implemented last week; this regression method has yielded some interesting insights about a few of our products. So far I’ve only been interpreting the fits visually, so there could be additional information to uncover once I (re)learn how to interpret the coefficients.
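For reference, the core of loess is just a weighted linear fit repeated at every point. This is a minimal numpy sketch of that idea (tricube weights over nearest neighbours), not the implementation I actually ran in SAS:

```python
import numpy as np

def loess(x, y, frac=0.5):
    """Minimal loess sketch: at each point, fit a straight line to its
    nearest neighbours, weighted by a tricube kernel, and evaluate it there."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    k = max(3, int(np.ceil(frac * n)))      # neighbourhood size
    fitted = np.empty(n)
    for i, x0 in enumerate(x):
        d = np.abs(x - x0)
        idx = np.argsort(d)[:k]             # k nearest neighbours
        w = (1 - (d[idx] / d[idx].max()) ** 3) ** 3   # tricube weights
        sw = np.sqrt(w)
        # Weighted least squares for intercept and slope.
        X = np.column_stack([np.ones(k), x[idx]])
        beta, *_ = np.linalg.lstsq(X * sw[:, None], y[idx] * sw, rcond=None)
        fitted[i] = beta[0] + beta[1] * x0
    return fitted
```

The local slope at each point is what a coefficient-level reading of loess gives you beyond the visual fit.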
Update: I was able to cut the run time almost in half by using the dataset option compress=yes and by joining tables on unique accounts to reduce the size of every table created. It turns out that one of the tables had 251 million rows; when I tried to join it with another table, the rows multiplied and overwhelmed the system. The new program took about 50 minutes in total for all the risk levels I tested, which is a drastic improvement.
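The row multiplication is easy to reproduce: when the join key is duplicated on both sides, the join emits every combination of matching rows, and joining against a deduplicated key list keeps the output bounded. A small illustration with sqlite3 and hypothetical tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE big (account TEXT, value REAL);
    CREATE TABLE other (account TEXT, flag TEXT);
""")
# The same account appears twice on each side of the join.
conn.executemany("INSERT INTO big VALUES (?, ?)", [("a1", 1.0), ("a1", 2.0)])
conn.executemany("INSERT INTO other VALUES (?, ?)", [("a1", "x"), ("a1", "y")])

# Naive join: 2 rows x 2 rows = 4 output rows for a single account.
naive = conn.execute(
    "SELECT * FROM big b JOIN other o ON b.account = o.account"
).fetchall()

# Joining on a deduplicated account list yields one output row per input row.
dedup = conn.execute("""
    SELECT b.account, b.value
    FROM big b
    JOIN (SELECT DISTINCT account FROM other) o ON b.account = o.account
""").fetchall()
```

At 251 million rows, even a handful of duplicated keys per account is enough to overwhelm the system, which is why deduplicating the join key first made such a difference.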