hiveql - Hive multiple count (with and without DISTINCT) generate bad output -
i tried hive query
select id,count(distinct case when unix_timestamp(m_date) between unix_timestamp(cast(date_sub(cast('2017-02-01' date),60) date)) , unix_timestamp(cast('2017-02-01' date)) m_date else 0 end) ,count(case when unix_timestamp(m_date) between unix_timestamp(cast(date_sub(cast('2017-02-01' date),60) date)) , unix_timestamp(cast('2017-02-01' date)) m_date else 0 end) db.table2 group id limit 10;
and gives me smthg like:
111007001007633 1 1 111007001029793 1 1 111007001000521 1 11 111007001000794 1 1 111007001000273 3 13 111007001001032 1 1 111007001025874 1 4 111007001001792 1 7 111007001029181 1 1 111007001000141 16 96
but when add other count:
select id,count(distinct case when unix_timestamp(m_date) between unix_timestamp(cast(date_sub(cast('2017-02-01' date),60) date)) , unix_timestamp(cast('2017-02-01' date)) m_date else 0 end) ,count(case when unix_timestamp(m_date) between unix_timestamp(cast(date_sub(cast('2017-02-01' date),60) date)) , unix_timestamp(cast('2017-02-01' date)) m_date else 0 end) ,count(distinct case when unix_timestamp(m_date) between unix_timestamp(cast(date_sub(cast('2017-02-01' date),15) date)) , unix_timestamp(cast('2017-02-01' date)) m_date else 0 end) ,count(case when unix_timestamp(m_date) between unix_timestamp(cast(date_sub(cast('2017-02-01' date),15) date)) , unix_timestamp(cast('2017-02-01' date)) m_date else 0 end) db.table2 group id limit 10;
it returns bad this:
111007001010439 0 0 1 0 111007001026963 0 0 1 0 111007001028001 0 0 1 0 111007001032987 0 0 1 0 111007001048710 0 0 1 0 111007001052415 0 0 1 0 111007002008374 0 0 1 0 111007003000644 0 0 1 0 111007003002210 0 0 1 0
i work on hadoop cluster , don't if it's caused map reduce.
thanks
[edit]
as answered @pashaz comment, first problem results 2 same queries (with , without distinct) give 1 distinct , 0 non distinct.
the second problem results between 2 distinct queries , 2 non-distinct queries. if check timestamps you'll see first queries contains seconds because 2 firsts count occurrences between "2017-02-01" , 60 days before, 2nds count occurrences between "2017-02-01" , 15 days before.
[update]
if put clause works
id="111007001007633" or id="271011604404359" or id="122213250512607" or id="111007001033217" 111007001033217 0 0 0 0 0 0 122213250512607 1 3 8 14 0 0 271011604404359 12 21 26 42 5 9 111007001007633 14 19 24 34 5 5
the limit clause seems problem.
nothing bad in provided results. in both queries present "limit 10". there no garantee same id returned.
in first query results present "111007001007633" absent in second query.
Comments
Post a Comment