Proc SQL

noetsi

Fortran must die
#1
I can't believe this never came up before...

Can you compare two variables (columns) in a Where cause? This is common in SQL, but I can't find how to do it or even if you can in Proc SQL.

For example with one date variable serdate and another variable paydate you can do the following in standard sql:

Where paydate >= serdate

But I can't figure out how to do that in Proc SQL
 

hlsmith

Not a robit
#2
What is the additional part of the statement? It seems like it is part of a conditional statement, e.g., if paydate GE serdate then cheesehead = "Dason"; Sorry Dason, this just popped into my simple head.


Are you subsetting or what?
 

noetsi

Fortran must die
#3
I am creating a flag with this, but that is not critical to what I am asking. I want to generate a table that has values where the paydate is greater than the date you enter service (say enterservicedate) so I can only show payments after entering service (you can receive payments before entering service in my organization).

In normal SQL that would be:

Select sum (paidamount)
From table1
WHere paydate >=enterservicedate;
Run;

but I can't figure out how you do this, or even if you can, in PROC SQL
 

bryangoodrich

Probably A Mammal
#4
Yes, but it's going to be terribly slow. I forget the term, but whenever the RHS of your where predicate is a function (or a reference) it will have to do a per-row comparison with no way to use indexes that would otherwise speed up the computation. Does that make sense?

Code:
SELECT *
FROM foo
WHERE foo.bar = 10
Here we should have bar indexed or the data sorted by bar so it's effectively indexed (clustered index). This allows the SQL engine to instead of checking each row "are you 10?" it can scan the data up to 10, ignore everything before, and then stop once it encounters something greater than 10. Very. Fast.

Code:
SELECT *
FROM foo
INNER JOIN bob
WHERE foo.bar = funct(bob.bar)
Here the SQL engine has no way to use the structure of the data to understand the relationship between these 2 tables in the join. If the comparison is not part of the join key, it will need to essentially do the worst thing ever: cross-join. This gives every possible comparison of the values in foo and bob here and then do a full table scan or sort and look at the blocks where the predicate is TRUE. That's A LOT more work and is why it's so slow.

This is true even if we weren't doing a join. If you're comparing a value with some reference to another column in that row, it essentially will need to--at the very least--compare a copy of the table with itself. In one copy it will perform the operation or reference to the column on the RHS of the predicate. Thus, you'll have to at least double the table, maybe sort them both or do a cross-join, and then do a row-by-row comparison or take advantage of grouping the sorted tables to find where the predicate holds TRUE.
 

noetsi

Fortran must die
#5
This is not what I was looking for, but is a work around. You can then use the flag you create to filter data with (again it would be much better to filter by comparing the two variables directly as you normally do with a Where statement in SQL. If SAS supports that I have not been able to find an example of this]

PROC SQL;
CREATE TABLE WORK.flag2 AS
SELECT t1.CUSTOMERID,
t1.DOB,
t1.DB14,
t1.SERVICEDATE,
CASE
WHEN (t1.DB14 >= t1.SERVICEDATE) then 'preserv'
ELSE 'postserv'
END as Flag

FROM WORK.flag1 t1
Order By Flag
;
QUIT;
 

noetsi

Fortran must die
#6
byran I am a bit confused by your post. I was just trying to see if a filter comparing variables existed in PROC SQL. I know how to do it in T SQL for example.
 

bryangoodrich

Probably A Mammal
#7
In your example

Code:
PROC SQL;
SELECT SUM(paidamount)
FROM table1
WHERE paydate >=enterservicedate;
QUIT;
I don't see why this wouldn't work. It should do a row-by-row search (slow, but doable) that only sum the amount paid whenever the paydate in a row is greater than the enterservicedate in that row. Is that what you want? If it's some issue with the aggregate function, you could always try


Code:
PROC SQL;
SELECT SUM(paidamount)
FROM (
    SELECT paidamount
    FROM table1
    WHERE paydate >=enterservicedate
) table2;
QUIT;
In this case, table2 should be the original table filtered as desired and then you're aggregating. If it's really fickle and not wanting to work, you might have to do something crazy of the sort


Code:
PROC SQL;
SELECT SUM(A.paidamount)
FROM table1 A
JOIN table1 B
ON A.paydate >= B.enterservicedate;
QUIT;
Here you're explicitly creating 2 copies of your data, computing the total paidamount from the first copy where that row is greater than the rows in the second copy. However, this shouldn't be the same thing as the original query because for each paydate in A you'll have a TRUE predicate for each row in B that has a row less than it. This is where joins (self-joins) get complicated and I'd have to work out some small test cases to really understand this. It's also not an equi-join predicate on the join because it's looking for an inequality. Those types of joins are just terribly slow, no matter what (but powerful, especially on geography based data queries).
 

bryangoodrich

Probably A Mammal
#8
byran I am a bit confused by your post. I was just trying to see if a filter comparing variables existed in PROC SQL. I know how to do it in T SQL for example.
I would have sworn I've done it before, but I'll run some test cases later. I was simply describing how a SQL engine will approach dealing with WHERE predicates. One optimization tip, at least for SQL Server (which is definitely different from the SAS engine), is to manage the right-hand side of your predicate because if it's not a definitive *value*, the engine has to do an interpretation, and how it goes about that will change performance dramatically. However, it should be *possible*. I don't see why SAS wouldn't understand such a query. Does it give you an error or anything informative about what it doesn't like?
 

bryangoodrich

Probably A Mammal
#10
Here's a reproducible example that worked fine for me just now

Code:
DATA somedata; 
   INPUT @1  X    2.  
         @3  Y    1.  
         @4  A    $7.  
         @11 Z    1. 
         @12 D1 MMDDYY10.
         @22 D2 MMDDYY10.; 
   FORMAT D1 DATE9.;
   FORMAT D2 DATE9.; 
DATALINES; 
 12HELLO  310/21/194611/20/1946
4 5GOODBYE611/12/199710/20/1997
; 

proc sql;
select * from somedata
where D1 > D2;
quit;
There are 2 lines of data one that doesn't meet the predicate and the other that does. Thus, the query returns 1 result as expected.

What error are you getting? It doesn't seem to be the SQL that's the issue.
 

noetsi

Fortran must die
#11
Obviously its not. It may be me referencing the fields from the tables wrong. I work from code windows inside EG so it could be an EG thing rather than A proc sql thing.

This worked (although it ran very slowly compared to normally and lost some of the functionality EG has)

Proc SQL;
Select customerid
from sasuser.permout1
where servicefromdate >= DB14;
quit;
run;

I think I substituted when for where an error I fall into from time to time.
 
Last edited: