proc SQL

noetsi

Fortran must die
#1
I have two samples. One includes the entire organization; call it work.t1. The other is a subset of t1 who took the test; call it work.t2. What I want is to know who has not taken the test. Such a person will be in t1, but will not be in t2.

I am thinking there must be some way to address this with a "does not exist" statement, but I have not figured out how.
 
#2
Code:
proc sql;
 create table blah as
 select * from work.t1
 where id not in (select id from work.t2);
quit;
I don't have SAS on hand, but that logic should work assuming there is a variable named 'id' in both tables.
 

bryangoodrich

Probably A Mammal
#3
There are 2 ways to do an "anti-join." Depending on the situation, one could be faster than the other, and I've seen some questionable performance with proc sql sometimes.

Code:
-- Not In
PROC SQL;
CREATE TABLE foobar AS
SELECT *
FROM work.t1
WHERE id NOT IN (
    SELECT id
    FROM work.t2
);
QUIT;
You could put a DISTINCT on the subquery's select; it's more costly to compute but could also get a speed-up in the list search, depending on how big the list is, whether duplicates exist, whether it's sorted, etc. Assuming you're matching on ids common to each table and they should already be distinct, it should be fine. Duplicates don't matter in a list search, since it returns true at the first instance it finds in the list (searching a long list could be a problem, though).
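Outside SAS, the same NOT IN anti-join can be sketched against any SQL engine. Here is a minimal, self-contained illustration using SQLite from Python; the table contents are made up to mirror the post's t1/t2 setup:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER, name TEXT);   -- whole organization
    CREATE TABLE t2 (id INTEGER);              -- those who took the test
    INSERT INTO t1 VALUES (1,'ann'),(2,'bob'),(3,'cam'),(4,'dee');
    INSERT INTO t2 VALUES (1),(3);
""")

# Anti-join: keep the t1 rows whose id is absent from t2
rows = conn.execute("""
    SELECT id, name FROM t1
    WHERE id NOT IN (SELECT id FROM t2)
    ORDER BY id
""").fetchall()
print(rows)  # [(2, 'bob'), (4, 'dee')]
```

One caveat worth knowing with this form: in standard SQL, if the subquery column contains any NULLs, `NOT IN` matches nothing at all, so the NOT EXISTS or left-join variants are safer when the key column is nullable.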

Code:
-- Exists
PROC SQL;
CREATE TABLE foobar AS
SELECT *
FROM work.t1
WHERE NOT EXISTS (
    SELECT 1
    FROM work.t2
    WHERE work.t1.id = work.t2.id
);
QUIT;
This is called a correlated subquery because the nested select statement depends on (correlates with) the outer query in its own where clause. Essentially that defines a join of sorts, but what we're doing is returning something (1 in this case, though the value shouldn't technically matter) whenever the two tables match on a correlated record. Since we're using NOT EXISTS, the condition becomes "return true whenever the record in the outer query is not correlated with any record in the inner query." In that sense, it has semantics similar to the "not in list" approach. In most cases I believe the EXISTS approach should be faster for the query optimizer, though a smart optimizer would just treat them the same. That said, as I've mentioned, I've seen SAS's query engine entirely botch the faster method because of how it does sql. I'm not even sure there is a way to look at SAS query plans to analyze how it's trying to process the query. I don't use SAS enough to care.
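The correlated NOT EXISTS form can be sketched the same way outside SAS. A minimal SQLite illustration (same made-up t1/t2 data as before): for each outer t1 row, the inner query is checked against that row's id, and the row is kept only when no match exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER, name TEXT);
    CREATE TABLE t2 (id INTEGER);
    INSERT INTO t1 VALUES (1,'ann'),(2,'bob'),(3,'cam'),(4,'dee');
    INSERT INTO t2 VALUES (1),(3);
""")

# Correlated subquery: the inner WHERE references the outer row's t1.id
rows = conn.execute("""
    SELECT id, name FROM t1
    WHERE NOT EXISTS (SELECT 1 FROM t2 WHERE t2.id = t1.id)
    ORDER BY id
""").fetchall()
print(rows)  # [(2, 'bob'), (4, 'dee')]
```

Unlike NOT IN, this form also behaves sensibly when t2.id contains NULLs, which is one reason optimizers and style guides often prefer it.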

The third reference reminded me you can approach this with a join, too:

Code:
PROC SQL;
CREATE TABLE foobar AS
SELECT t1.*
FROM work.t1 AS t1
     LEFT OUTER JOIN work.t2 AS t2
     ON t1.id = t2.id
WHERE t2.id IS NULL;
QUIT;
This kind of makes the NOT EXISTS logic explicit by doing an actual join to the primary table, then filtering for rows where the second table did not match (i.e., its key "is null"). In other words, "join t1 to t2 but only keep the t1 records that are not present in t2, as identified by the null join match."
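The left-join-then-filter variant translates directly as well. A minimal SQLite sketch (same made-up data): the left join keeps every t1 row, unmatched rows carry NULL in t2.id, and the WHERE clause keeps only those.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER, name TEXT);
    CREATE TABLE t2 (id INTEGER);
    INSERT INTO t1 VALUES (1,'ann'),(2,'bob'),(3,'cam'),(4,'dee');
    INSERT INTO t2 VALUES (1),(3);
""")

# Left join keeps all t1 rows; t2.id is NULL exactly where no match was found
rows = conn.execute("""
    SELECT t1.id, t1.name
    FROM t1 LEFT OUTER JOIN t2 ON t1.id = t2.id
    WHERE t2.id IS NULL
    ORDER BY t1.id
""").fetchall()
print(rows)  # [(2, 'bob'), (4, 'dee')]
```

All three formulations return the same rows here; which one an engine executes fastest depends on its optimizer.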

Further reading
https://blog.jooq.org/2015/10/13/semi-join-and-anti-join-should-have-its-own-syntax-in-sql/
http://dbspecialists.com/wp-content/uploads/2017/03/semijoins.html
https://stackoverflow.com/questions...-write-a-select-statement-with-a-not-in-subqu
 
#4
Awesome, thank you for such a detailed explanation and the links to references. I myself am trying to learn PROC SQL, as (1) I realize that SQL is valuable outside SAS and (2) some projects I inherited are written mostly in proc sql.

Cheers,
 

bryangoodrich

Probably A Mammal
#5
Yeah, SQL is important to know in business. It's just a nice way (semantically) to express how you want data served up, and it's a language that is used everywhere. I personally rarely use SAS coding, but when I do, it's 99.99% just running proc sql.
 
#6
Awesome. I've realized this too when looking at other jobs outside the health/hospital sector; they all require SQL proficiency. My only question is what platform you typically use SQL in outside of SAS. I've heard of Oracle, but I know it's an expensive program. Are there any other popular SQL platforms used in industry that I could experiment with?
 

bryangoodrich

Probably A Mammal
#7
I'm usually doing database programming in or against MS SQL Server (T-SQL). I like their approach 95% of the time. I'm also starting to (finally) use Hadoop and doing Hive programming, which has many of the SQL semantics implemented, plus some big data (MapReduce) and other new things.
 

noetsi

Fortran must die
#8
I use TSQL on the Microsoft server and PROC SQL (which I learned first) when running statistical analysis, because I can combine SAS stats and PROC SQL, which TSQL will not do. I am not sure which SQL is used most; MS ACCESS is used as well (although I refuse to think of ACCESS as SQL).

There are fairly minor differences between PROC SQL and TSQL, at least at the level I use them (bryan runs much more complex code than I do). Functions are where the greatest differences are, in my experience.
 

bryangoodrich

Probably A Mammal
#9
Yeah, the additional functions that are provided make the languages vastly different. SAS is a statistical application, so it developed ways to incorporate statistical functions into your proc sql calls. SQL Server is about managing data, so to keep the query optimizer, well, optimal, it can't be made generic enough to handle broad scientific calculations, and they're not going to invest the time to reinvent that wheel. You could, if you were brave, write your own C# code and build functions to run with TSQL, but that's overkill. This is partly why Microsoft bought R and added it to SQL Server (2016) and has now included Python support (2017). They simply offload those vectorized computations to these other languages, which provide libraries that optimize their code for the scale of data they might run against.

But many common descriptive statistics are readily available in TSQL. You can do a lot with windowing functions (e.g., time series manipulations), but these are still limited in TSQL. One of the things I'm going to explore once I get access to SQL Server 2016 is building a generic R function that can take a table as input and run a linear regression, so I can wrap it in a stored procedure where I simply specify a column for the response and column(s) for the independent variables. Once you can translate table metadata into R formula expressions, you're solid! (Or y ~ . could be a cheap workaround.)
 

noetsi

Fortran must die
#10
SAS combines SQL and stats because, to SAS, SQL is just another PROC like regression or time series. You can do things like averages, but not medians, in TSQL, but for those I would just use Excel anyway. I think a problem with doing any statistic in SQL is that key diagnostic plots or tests would not be available (this is why I do not use Excel for those either).

How can Microsoft "buy" R, which is open source code? If they did, I am never going to let Dason live it down, since I kidded him years ago that this would occur :p It is true that you can supposedly get access to R on the 2016 server (I have not seen this), but given how slowly we migrate, that could be many years away for us.
 

Dason

Ambassador to the humans
#11
You can easily get median or other percentiles in tsql. Also Microsoft bought Revolution R - they don't own the R project itself.
 

noetsi

Fortran must die
#12
I don't know a way to easily get the median in TSQL, that is, with a built-in function. We have created our own function to do that at work.
 

Dason

Ambassador to the humans
#13
PERCENTILE_CONT?

But I created my own method to do percentiles in T-SQL at work, because PERCENTILE_CONT doesn't match up with SAS's defaults and I wanted to replicate SAS's output exactly. PERCENTILE_CONT should match on medians, though.
 

bryangoodrich

Probably A Mammal
#14
The PERCENTILE_CONT and PERCENTILE_DISC functions are available as of SQL Server 2012. Before that you could do some SQL tricks to get the median, but that gets difficult when you need to do it by varying groups and such.

https://docs.microsoft.com/en-us/sql/t-sql/functions/percentile-disc-transact-sql

Dason, did you try PERCENTILE_DISC? How the hell is SAS computing medians? I know the quantile function in R has a bunch of types, which they say SAS uses type = 3. Testing that on a random set of data, type 3 is way different than 7 of the 9 types! How stupid.
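The point that different interpolation conventions give different quantiles (while often agreeing on the median) is easy to demonstrate. As a stand-in for R's `quantile` types or T-SQL's PERCENTILE_CONT, Python's stdlib `statistics.quantiles` exposes two of the common conventions:

```python
from statistics import quantiles, median

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Two interpolation conventions give different quartile cut points...
print(quantiles(data, n=4, method="inclusive"))  # [3.25, 5.5, 7.75]
print(quantiles(data, n=4, method="exclusive"))  # [2.75, 5.5, 8.25]

# ...but agree on the median for this data
print(median(data))  # 5.5
```

This is why a percentile function written to replicate one package's output can disagree with another package's built-in at every quantile except, typically, the median.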
 

noetsi

Fortran must die
#17
My understanding is that different software calculates medians in ways that lead to different answers (for example, Excel and SAS).

Our median function just sorts the data and picks the middle value (I forget how it handles an even number of values).
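That sort-and-pick approach can be sketched as below. Since the post doesn't say how the even-count case is handled at work, averaging the two middle values is an assumption here (it is one common convention among several):

```python
def middle_value_median(values):
    """Median by sorting and picking the middle value.
    For an even count, average the two middle values
    (an assumed convention; others exist, e.g. taking the lower one)."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

print(middle_value_median([7, 1, 5]))     # 5
print(middle_value_median([7, 1, 5, 3]))  # 4.0
```

The even-count choice is exactly where implementations start to diverge, which feeds into the cross-package differences discussed above.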

Until two months ago we had MS SQL Server 2008. I will have to try the new function now that we have moved to 2014 (I had hoped we would move to 2016 to get R, but we have a policy against adopting a product less than a year old because of possible bugs).

It's an IT policy, not mine...
 

bryangoodrich

Probably A Mammal
#18
Then ask for a sandboxed environment where you can develop against SQL Server 2016. It's stupid to move the latest software into production in a business until it has been vetted, and it's an easy (lame) policy to just wait for a later rollout as if that solves all the problems. That only solves it for the software; it doesn't mean it is any better suited to drop into your own environment. That will always require testing, which is why sandboxes work well. Spin up something not intended for production, show value in it, show it works in the current environment, maybe do that for the year to meet their minimal policy requirements, and then you already have something cooked and ready to go when it gets migrated to production (where they'll give you full backups, QA/Dev environments, etc.). That's my goal with 2016. I'll get a sandbox set up to explore how to use it for supporting enterprise-wide predictive analytics, along with other potential toolchains. If we want to adopt it fully, then I'll have something to show management and the DBAs that we should stand up a full environment for it.
 

bryangoodrich

Probably A Mammal
#20
We change servers about every 5 years or so. And IT is not interested in non-IT being involved in IT decisions.
Some things are squarely in the IT domain, but considering IT is a service that exists so the business can do what it needs to do to be successful (and pay IT's bills), the business very much has a say in IT decisions. Are they needed to decide what sort of cabling to use in network infrastructure? No, but the motivation for faster throughput is a business need. That's how you couch requirements to project managers so they take them on during change processes. Unless you just work in a dysfunctional environment, this is common practice.

FYI, I got my SQL Server 2016 database, and didn't really ask for it. I was going to do it as a sandbox elsewhere, but they spun up our database and bam, it's 2016. Already having fun trying some new features.